linux-kernel.vger.kernel.org archive mirror
* [RFC 0/5] Two-phase seccomp and an x86_64 fast path
From: Andy Lutomirski @ 2014-06-11 20:22 UTC
  To: linux-kernel, Kees Cook, Will Drewry
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Andy Lutomirski

On my VM, getpid takes about 70ns.  Before this patchset, adding a
single-instruction always-accept seccomp filter added about 134ns of
overhead to getpid.  With this patchset, the overhead is down to about
13ns.
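
A minimal sketch of the kind of measurement behind these numbers
(the actual harness isn't included in this series; the iteration
count below is made up):

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	long i, iters = 10000000;	/* hypothetical iteration count */
	struct timespec t0, t1;
	double ns;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++)
		syscall(SYS_getpid);	/* bypass glibc's getpid() cache */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%.1f ns per getpid\n", ns / iters);
	return 0;
}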

I'd really appreciate careful review from all relevant arch
maintainers for patch 1.

This is an RFC for now.  I'll submit a non-RFC version after the merge
window ends.

Andy Lutomirski (5):
  seccomp,x86,arm,mips,s390: Remove nr parameter from secure_computing
  x86_64,entry: Treat regs->ax the same in fastpath and slowpath
    syscalls
  seccomp: Refactor the filter callback and the API
  seccomp: Allow arch code to provide seccomp_data
  x86,seccomp: Add a seccomp fastpath

 arch/arm/kernel/ptrace.c       |   7 +-
 arch/mips/kernel/ptrace.c      |   2 +-
 arch/s390/kernel/ptrace.c      |   2 +-
 arch/x86/include/asm/calling.h |   6 +-
 arch/x86/kernel/entry_64.S     |  52 ++++++++-
 arch/x86/kernel/ptrace.c       |   2 +-
 arch/x86/kernel/vsyscall_64.c  |   2 +-
 include/linux/seccomp.h        |  25 +++--
 kernel/seccomp.c               | 244 +++++++++++++++++++++++++++--------------
 9 files changed, 241 insertions(+), 101 deletions(-)

-- 
1.9.3



* [RFC 1/5] seccomp,x86,arm,mips,s390: Remove nr parameter from secure_computing
From: Andy Lutomirski @ 2014-06-11 20:22 UTC
  To: linux-kernel, Kees Cook, Will Drewry
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Andy Lutomirski, Russell King,
	Ralf Baechle, Martin Schwidefsky, Heiko Carstens, linux-s390

The secure_computing function took a syscall number parameter, but
it only paid any attention to that parameter if seccomp mode 1 was
enabled.  Rather than coming up with a kludge to get the parameter
to work in mode 2, just remove the parameter.

To avoid churn in arches that don't have seccomp filters (and may
not even support syscall_get_nr right now), this leaves the
parameter in secure_computing_strict, which is now a real function.

For ARM, this is a bit ugly because ARM supports seccomp filters
only conditionally.  Fixing that would probably take only a couple
of lines of code, but it should be coordinated with the audit
maintainers.

This will be a slight slowdown on some arches.  The right fix is to
pass in all of seccomp_data instead of trying to make just the
syscall nr part be fast.
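
For reference, seccomp_data is the fixed-layout blob that filters
consume, as defined in include/uapi/linux/seccomp.h:

struct seccomp_data {
	int nr;				/* system call number */
	__u32 arch;			/* AUDIT_ARCH_* value */
	__u64 instruction_pointer;	/* CPU instruction pointer */
	__u64 args[6];			/* system call arguments */
};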

I'm working on a big seccomp speedup for x86, but I'd like to remove
this wart first.

Cc: Russell King <linux@arm.linux.org.uk>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Cc: x86@kernel.org
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/arm/kernel/ptrace.c      |  7 ++++-
 arch/mips/kernel/ptrace.c     |  2 +-
 arch/s390/kernel/ptrace.c     |  2 +-
 arch/x86/kernel/ptrace.c      |  2 +-
 arch/x86/kernel/vsyscall_64.c |  2 +-
 include/linux/seccomp.h       | 21 ++++++++-------
 kernel/seccomp.c              | 63 ++++++++++++++++++++++++++++++-------------
 7 files changed, 65 insertions(+), 34 deletions(-)

diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
index 0dd3b79..f79c9cc 100644
--- a/arch/arm/kernel/ptrace.c
+++ b/arch/arm/kernel/ptrace.c
@@ -934,8 +934,13 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs, int scno)
 	current_thread_info()->syscall = scno;
 
 	/* Do the secure computing check first; failures should be fast. */
-	if (secure_computing(scno) == -1)
+#ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
+	if (secure_computing() == -1)
 		return -1;
+#else
+	/* XXX: remove this once OABI gets fixed */
+	secure_computing_strict(scno);
+#endif
 
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		scno = tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
diff --git a/arch/mips/kernel/ptrace.c b/arch/mips/kernel/ptrace.c
index f639ccd..808bafc 100644
--- a/arch/mips/kernel/ptrace.c
+++ b/arch/mips/kernel/ptrace.c
@@ -639,7 +639,7 @@ asmlinkage long syscall_trace_enter(struct pt_regs *regs, long syscall)
 	long ret = 0;
 	user_exit();
 
-	if (secure_computing(syscall) == -1)
+	if (secure_computing() == -1)
 		return -1;
 
 	if (test_thread_flag(TIF_SYSCALL_TRACE) &&
diff --git a/arch/s390/kernel/ptrace.c b/arch/s390/kernel/ptrace.c
index 1c82619..20f1998 100644
--- a/arch/s390/kernel/ptrace.c
+++ b/arch/s390/kernel/ptrace.c
@@ -795,7 +795,7 @@ asmlinkage long do_syscall_trace_enter(struct pt_regs *regs)
 	long ret = 0;
 
 	/* Do the secure computing check first. */
-	if (secure_computing(regs->gprs[2])) {
+	if (secure_computing()) {
 		/* seccomp failures shouldn't expose any additional code. */
 		ret = -1;
 		goto out;
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 678c0ad..93c182a 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1471,7 +1471,7 @@ long syscall_trace_enter(struct pt_regs *regs)
 		regs->flags |= X86_EFLAGS_TF;
 
 	/* do the secure computing check first */
-	if (secure_computing(regs->orig_ax)) {
+	if (secure_computing()) {
 		/* seccomp failures shouldn't expose any additional code. */
 		ret = -1L;
 		goto out;
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 8b3b3eb..47a4123 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -216,7 +216,7 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
 	 */
 	regs->orig_ax = syscall_nr;
 	regs->ax = -ENOSYS;
-	tmp = secure_computing(syscall_nr);
+	tmp = secure_computing();
 	if ((!tmp && regs->orig_ax != syscall_nr) || regs->ip != address) {
 		warn_bad_vsyscall(KERN_DEBUG, regs,
 				  "seccomp tried to change syscall nr or ip");
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 4054b09..6e655a6 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -25,19 +25,17 @@ struct seccomp {
 	struct seccomp_filter *filter;
 };
 
-extern int __secure_computing(int);
-static inline int secure_computing(int this_syscall)
+#ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
+extern int __secure_computing(void);
+static inline int secure_computing(void)
 {
 	if (unlikely(test_thread_flag(TIF_SECCOMP)))
-		return  __secure_computing(this_syscall);
+		return  __secure_computing();
 	return 0;
 }
-
-/* A wrapper for architectures supporting only SECCOMP_MODE_STRICT. */
-static inline void secure_computing_strict(int this_syscall)
-{
-	BUG_ON(secure_computing(this_syscall) != 0);
-}
+#else
+extern void secure_computing_strict(int this_syscall);
+#endif
 
 extern long prctl_get_seccomp(void);
 extern long prctl_set_seccomp(unsigned long, char __user *);
@@ -54,8 +52,11 @@ static inline int seccomp_mode(struct seccomp *s)
 struct seccomp { };
 struct seccomp_filter { };
 
-static inline int secure_computing(int this_syscall) { return 0; }
+#ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
+static inline int secure_computing(void) { return 0; }
+#else
 static inline void secure_computing_strict(int this_syscall) { return; }
+#endif
 
 static inline long prctl_get_seccomp(void)
 {
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b35c215..4af399a 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -21,8 +21,11 @@
 
 /* #define SECCOMP_DEBUG 1 */
 
-#ifdef CONFIG_SECCOMP_FILTER
+#ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
 #include <asm/syscall.h>
+#endif
+
+#ifdef CONFIG_SECCOMP_FILTER
 #include <linux/filter.h>
 #include <linux/ptrace.h>
 #include <linux/security.h>
@@ -172,7 +175,7 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
  *
  * Returns valid seccomp BPF response codes.
  */
-static u32 seccomp_run_filters(int syscall)
+static u32 seccomp_run_filters(void)
 {
 	struct seccomp_filter *f;
 	struct seccomp_data sd;
@@ -372,32 +375,53 @@ static int mode1_syscalls_32[] = {
 };
 #endif
 
-int __secure_computing(int this_syscall)
+static void __secure_computing_strict(int this_syscall)
+{
+	int *syscall_whitelist = mode1_syscalls;
+#ifdef CONFIG_COMPAT
+	if (is_compat_task())
+		syscall_whitelist = mode1_syscalls_32;
+#endif
+	do {
+		if (*syscall_whitelist == this_syscall)
+			return;
+	} while (*++syscall_whitelist);
+
+#ifdef SECCOMP_DEBUG
+	dump_stack();
+#endif
+	audit_seccomp(this_syscall, SIGKILL, SECCOMP_RET_KILL);
+	do_exit(SIGKILL);
+}
+
+#ifndef CONFIG_HAVE_ARCH_SECCOMP_FILTER
+void secure_computing_strict(int this_syscall)
+{
+	int mode = current->seccomp.mode;
+	if (mode == 0)
+		return;
+	else if (mode == SECCOMP_MODE_STRICT)
+		__secure_computing_strict(this_syscall);
+	else
+		BUG();
+}
+#else
+int __secure_computing(void)
 {
 	int mode = current->seccomp.mode;
+	struct pt_regs *regs = task_pt_regs(current);
+	int this_syscall = syscall_get_nr(current, regs);
 	int exit_sig = 0;
-	int *syscall;
 	u32 ret;
 
 	switch (mode) {
 	case SECCOMP_MODE_STRICT:
-		syscall = mode1_syscalls;
-#ifdef CONFIG_COMPAT
-		if (is_compat_task())
-			syscall = mode1_syscalls_32;
-#endif
-		do {
-			if (*syscall == this_syscall)
-				return 0;
-		} while (*++syscall);
-		exit_sig = SIGKILL;
-		ret = SECCOMP_RET_KILL;
-		break;
+		__secure_computing_strict(this_syscall);
+		return 0;
 #ifdef CONFIG_SECCOMP_FILTER
 	case SECCOMP_MODE_FILTER: {
 		int data;
-		struct pt_regs *regs = task_pt_regs(current);
-		ret = seccomp_run_filters(this_syscall);
+		ret = seccomp_run_filters();
 		data = ret & SECCOMP_RET_DATA;
 		ret &= SECCOMP_RET_ACTION;
 		switch (ret) {
@@ -455,9 +479,10 @@ int __secure_computing(int this_syscall)
 #ifdef CONFIG_SECCOMP_FILTER
 skip:
 	audit_seccomp(this_syscall, exit_sig, ret);
-#endif
 	return -1;
+#endif
 }
+#endif /* CONFIG_HAVE_ARCH_SECCOMP_FILTER */
 
 long prctl_get_seccomp(void)
 {
-- 
1.9.3



* [RFC 2/5] x86_64,entry: Treat regs->ax the same in fastpath and slowpath syscalls
From: Andy Lutomirski @ 2014-06-11 20:22 UTC
  To: linux-kernel, Kees Cook, Will Drewry
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Andy Lutomirski

For slowpath syscalls, we initialize regs->ax to -ENOSYS and stick
the syscall number into regs->orig_ax prior to any possible tracing
and syscall execution.  This is user-visible ABI used by ptrace
syscall emulation and seccomp.
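
To illustrate that ABI, here is a sketch (not part of this patch) of
a hypothetical tracer-side helper, assuming the tracee is already
stopped at a syscall-entry trap via PTRACE_SYSCALL:

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>

static void show_syscall_entry(pid_t child)
{
	struct user_regs_struct regs;

	if (ptrace(PTRACE_GETREGS, child, NULL, &regs) != 0)
		return;
	/* orig_rax holds the syscall number; rax reads as -ENOSYS. */
	printf("nr = %lld, ax = %lld\n",
	       (long long)regs.orig_rax, (long long)regs.rax);
}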

For fastpath syscalls, there's no good reason not to do the same
thing.  It's even slightly simpler than what we're currently doing.
It probably has no measureable performance impact.  It should have
no user-visible effect.

The purpose of this patch is to prepare for seccomp-based syscall
emulation in the fast path.  This change is just subtle enough that
I'm keeping it separate.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/include/asm/calling.h |  6 +++++-
 arch/x86/kernel/entry_64.S     | 11 +++--------
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
index cb4c73b..76659b6 100644
--- a/arch/x86/include/asm/calling.h
+++ b/arch/x86/include/asm/calling.h
@@ -85,7 +85,7 @@ For 32-bit we have the following conventions - kernel is built with
 #define ARGOFFSET	R11
 #define SWFRAME		ORIG_RAX
 
-	.macro SAVE_ARGS addskip=0, save_rcx=1, save_r891011=1
+	.macro SAVE_ARGS addskip=0, save_rcx=1, save_r891011=1, rax_enosys=0
 	subq  $9*8+\addskip, %rsp
 	CFI_ADJUST_CFA_OFFSET	9*8+\addskip
 	movq_cfi rdi, 8*8
@@ -96,7 +96,11 @@ For 32-bit we have the following conventions - kernel is built with
 	movq_cfi rcx, 5*8
 	.endif
 
+	.if \rax_enosys
+	movq $-ENOSYS, 4*8(%rsp)
+	.else
 	movq_cfi rax, 4*8
+	.endif
 
 	.if \save_r891011
 	movq_cfi r8,  3*8
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1e96c36..f9e713a 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -611,8 +611,8 @@ GLOBAL(system_call_after_swapgs)
 	 * and short:
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	SAVE_ARGS 8,0
-	movq  %rax,ORIG_RAX-ARGOFFSET(%rsp)
+	SAVE_ARGS 8, 0, rax_enosys=1
+	movq_cfi rax,(ORIG_RAX-ARGOFFSET)
 	movq  %rcx,RIP-ARGOFFSET(%rsp)
 	CFI_REL_OFFSET rip,RIP-ARGOFFSET
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
@@ -624,7 +624,7 @@ system_call_fastpath:
 	andl $__SYSCALL_MASK,%eax
 	cmpl $__NR_syscall_max,%eax
 #endif
-	ja badsys
+	ja ret_from_sys_call  /* and return regs->ax */
 	movq %r10,%rcx
 	call *sys_call_table(,%rax,8)  # XXX:	 rip relative
 	movq %rax,RAX-ARGOFFSET(%rsp)
@@ -683,10 +683,6 @@ sysret_signal:
 	FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
 	jmp int_check_syscall_exit_work
 
-badsys:
-	movq $-ENOSYS,RAX-ARGOFFSET(%rsp)
-	jmp ret_from_sys_call
-
 #ifdef CONFIG_AUDITSYSCALL
 	/*
 	 * Fast path for syscall audit without full syscall trace.
@@ -726,7 +722,6 @@ tracesys:
 	jz auditsys
 #endif
 	SAVE_REST
-	movq $-ENOSYS,RAX(%rsp) /* ptrace can change this for a bad syscall */
 	FIXUP_TOP_OF_STACK %rdi
 	movq %rsp,%rdi
 	call syscall_trace_enter
-- 
1.9.3



* [RFC 3/5] seccomp: Refactor the filter callback and the API
From: Andy Lutomirski @ 2014-06-11 20:23 UTC
  To: linux-kernel, Kees Cook, Will Drewry
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Andy Lutomirski

The reason I did this is to add a seccomp API that will be usable
for an x86 fast path.  The x86 entry code needs to use a rather
expensive slow path for a syscall that might be visible to things
like ptrace.  By splitting seccomp into two phases, we can check
whether we need the slow path and then use the fast path if the
filter allows the syscall or just returns some errno.

As a side effect, I think the new code is much easier to understand
than the old code.

This has one user-visible effect: the audit record written for
SECCOMP_RET_TRACE is now a simple indication that SECCOMP_RET_TRACE
happened.  It used to depend in a complicated way on what the tracer
did.  I couldn't make much sense of it.
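
For context, the two phases split each filter result into its action
and data halves using the existing encoding from
include/uapi/linux/seccomp.h:

#define SECCOMP_RET_KILL	0x00000000U /* kill the task immediately */
#define SECCOMP_RET_TRAP	0x00030000U /* disallow and force a SIGSYS */
#define SECCOMP_RET_ERRNO	0x00050000U /* return an errno */
#define SECCOMP_RET_TRACE	0x7ff00000U /* pass to a tracer or disallow */
#define SECCOMP_RET_ALLOW	0x7fff0000U /* allow */

/* Masks for the return value sections. */
#define SECCOMP_RET_ACTION	0x7fff0000U
#define SECCOMP_RET_DATA	0x0000ffffU

Phase 1 fully resolves KILL, TRAP, ERRNO, and ALLOW; only a TRACE
result is passed on to phase 2, which must run in a context where
ptrace hooks are safe.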

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/linux/seccomp.h |   6 ++
 kernel/seccomp.c        | 179 +++++++++++++++++++++++++++++++-----------------
 2 files changed, 122 insertions(+), 63 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 6e655a6..8345fdc 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -33,6 +33,12 @@ static inline int secure_computing(void)
 		return  __secure_computing();
 	return 0;
 }
+
+#define SECCOMP_PHASE1_OK	0
+#define SECCOMP_PHASE1_SKIP	1
+
+extern u32 seccomp_phase1(void);
+int seccomp_phase2(u32 phase1_result);
 #else
 extern void secure_computing_strict(int this_syscall);
 #endif
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 4af399a..dfdb38a 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -19,8 +19,6 @@
 #include <linux/sched.h>
 #include <linux/seccomp.h>
 
-/* #define SECCOMP_DEBUG 1 */
-
 #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
 #include <asm/syscall.h>
 #endif
@@ -408,79 +406,134 @@ void secure_computing_strict(int this_syscall)
 #else
 int __secure_computing(void)
 {
+	u32 phase1_result = seccomp_phase1();
+	if (likely(phase1_result == SECCOMP_PHASE1_OK))
+		return 0;
+	else if (likely(phase1_result == SECCOMP_PHASE1_SKIP))
+		return -1;
+	else
+		return seccomp_phase2(phase1_result);
+}
+
+#ifdef CONFIG_SECCOMP_FILTER
+static u32 __seccomp_phase1_filter(int this_syscall, struct pt_regs *regs)
+{
+	u32 filter_ret = seccomp_run_filters();
+	int data = filter_ret & SECCOMP_RET_DATA;
+	u32 action = filter_ret & SECCOMP_RET_ACTION;
+
+	switch (action) {
+	case SECCOMP_RET_ERRNO:
+		/* Set the low-order 16-bits as a errno. */
+		syscall_set_return_value(current, regs,
+					 -data, 0);
+		goto skip;
+
+	case SECCOMP_RET_TRAP:
+		/* Show the handler the original registers. */
+		syscall_rollback(current, regs);
+		/* Let the filter pass back 16 bits of data. */
+		seccomp_send_sigsys(this_syscall, data);
+		goto skip;
+
+	case SECCOMP_RET_TRACE:
+		return filter_ret;  /* Save the rest for phase 2. */
+
+	case SECCOMP_RET_ALLOW:
+		return SECCOMP_PHASE1_OK;
+
+	case SECCOMP_RET_KILL:
+	default:
+		audit_seccomp(this_syscall, SIGSYS, action);
+		do_exit(SIGSYS);
+	}
+
+	unreachable();
+
+skip:
+	audit_seccomp(this_syscall, 0, action);
+	return SECCOMP_PHASE1_SKIP;
+}
+#endif
+
+/**
+ * seccomp_phase1() - run fast path seccomp checks on the current syscall
+ *
+ * This only reads pt_regs via the syscall_xyz helpers.  The only change
+ * it will make to pt_regs is via syscall_set_return_value, and it will
+ * only do that if it returns SECCOMP_PHASE1_SKIP.
+ *
+ * It may also call do_exit or force a signal; these actions must be
+ * safe.
+ *
+ * If it returns SECCOMP_PHASE1_OK, the syscall passes checks and should
+ * be processed normally.
+ *
+ * If it returns SECCOMP_PHASE1_SKIP, then the syscall should not be
+ * invoked.  In this case, seccomp_phase1 will have set the return value
+ * using syscall_set_return_value.
+ *
+ * If it returns anything else, then the return value should be passed
+ * to seccomp_phase2 from a context in which ptrace hooks are safe.
+ */
+u32 seccomp_phase1(void)
+{
 	int mode = current->seccomp.mode;
 	struct pt_regs *regs = task_pt_regs(current);
 	int this_syscall = syscall_get_nr(current, regs);
-	int exit_sig = 0;
-	u32 ret;
 
 	switch (mode) {
 	case SECCOMP_MODE_STRICT:
-		__secure_computing_strict(this_syscall);
-		return 0;
+		__secure_computing_strict(this_syscall);  /* may call do_exit */
+		return SECCOMP_PHASE1_OK;
 #ifdef CONFIG_SECCOMP_FILTER
-	case SECCOMP_MODE_FILTER: {
-		int data;
-		ret = seccomp_run_filters();
-		data = ret & SECCOMP_RET_DATA;
-		ret &= SECCOMP_RET_ACTION;
-		switch (ret) {
-		case SECCOMP_RET_ERRNO:
-			/* Set the low-order 16-bits as a errno. */
-			syscall_set_return_value(current, regs,
-						 -data, 0);
-			goto skip;
-		case SECCOMP_RET_TRAP:
-			/* Show the handler the original registers. */
-			syscall_rollback(current, regs);
-			/* Let the filter pass back 16 bits of data. */
-			seccomp_send_sigsys(this_syscall, data);
-			goto skip;
-		case SECCOMP_RET_TRACE:
-			/* Skip these calls if there is no tracer. */
-			if (!ptrace_event_enabled(current, PTRACE_EVENT_SECCOMP)) {
-				syscall_set_return_value(current, regs,
-							 -ENOSYS, 0);
-				goto skip;
-			}
-			/* Allow the BPF to provide the event message */
-			ptrace_event(PTRACE_EVENT_SECCOMP, data);
-			/*
-			 * The delivery of a fatal signal during event
-			 * notification may silently skip tracer notification.
-			 * Terminating the task now avoids executing a system
-			 * call that may not be intended.
-			 */
-			if (fatal_signal_pending(current))
-				break;
-			if (syscall_get_nr(current, regs) < 0)
-				goto skip;  /* Explicit request to skip. */
-
-			return 0;
-		case SECCOMP_RET_ALLOW:
-			return 0;
-		case SECCOMP_RET_KILL:
-		default:
-			break;
-		}
-		exit_sig = SIGSYS;
-		break;
-	}
+	case SECCOMP_MODE_FILTER:
+		return __seccomp_phase1_filter(this_syscall, regs);
 #endif
 	default:
 		BUG();
 	}
+}
 
-#ifdef SECCOMP_DEBUG
-	dump_stack();
-#endif
-	audit_seccomp(this_syscall, exit_sig, ret);
-	do_exit(exit_sig);
-#ifdef CONFIG_SECCOMP_FILTER
-skip:
-	audit_seccomp(this_syscall, exit_sig, ret);
-	return -1;
-#endif
+/**
+ * seccomp_phase2() - finish slow path seccomp work for the current syscall
+ * @phase1_result: The return value from seccomp_phase1()
+ *
+ * This must be called from a context in which ptrace hooks can be used.
+ *
+ * Returns 0 if the syscall should be processed or -1 to skip the syscall.
+ */
+int seccomp_phase2(u32 phase1_result)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	u32 action = phase1_result & SECCOMP_RET_ACTION;
+	int data = phase1_result & SECCOMP_RET_DATA;
+
+	BUG_ON(action != SECCOMP_RET_TRACE);
+
+	audit_seccomp(syscall_get_nr(current, regs), 0, action);
+
+	/* Skip these calls if there is no tracer. */
+	if (!ptrace_event_enabled(current, PTRACE_EVENT_SECCOMP)) {
+		syscall_set_return_value(current, regs,
+					 -ENOSYS, 0);
+		return -1;
+	}
+
+	/* Allow the BPF to provide the event message */
+	ptrace_event(PTRACE_EVENT_SECCOMP, data);
+	/*
+	 * The delivery of a fatal signal during event
+	 * notification may silently skip tracer notification.
+	 * Terminating the task now avoids executing a system
+	 * call that may not be intended.
+	 */
+	if (fatal_signal_pending(current))
+		do_exit(SIGSYS);
+	if (syscall_get_nr(current, regs) < 0)
+		return -1;  /* Explicit request to skip. */
+
+	return 0;
 }
 #endif /* CONFIG_HAVE_ARCH_SECCOMP_FILTER */
 
-- 
1.9.3



* [RFC 4/5] seccomp: Allow arch code to provide seccomp_data
From: Andy Lutomirski @ 2014-06-11 20:23 UTC
  To: linux-kernel, Kees Cook, Will Drewry
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Andy Lutomirski

This is in preparation for the x86 fastpath, which will build
seccomp_data in assembly and pass it in directly, avoiding the
pt_regs accessors on the hot path.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/linux/seccomp.h |  2 +-
 kernel/seccomp.c        | 32 +++++++++++++++++++-------------
 2 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 8345fdc..4fc7a84 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -37,7 +37,7 @@ static inline int secure_computing(void)
 #define SECCOMP_PHASE1_OK	0
 #define SECCOMP_PHASE1_SKIP	1
 
-extern u32 seccomp_phase1(void);
+extern u32 seccomp_phase1(struct seccomp_data *sd);
 int seccomp_phase2(u32 phase1_result);
 #else
 extern void secure_computing_strict(int this_syscall);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index dfdb38a..6848912 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -173,24 +173,27 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
  *
  * Returns valid seccomp BPF response codes.
  */
-static u32 seccomp_run_filters(void)
+static u32 seccomp_run_filters(struct seccomp_data *sd)
 {
 	struct seccomp_filter *f;
-	struct seccomp_data sd;
+	struct seccomp_data sd_local;
 	u32 ret = SECCOMP_RET_ALLOW;
 
 	/* Ensure unexpected behavior doesn't result in failing open. */
 	if (WARN_ON(current->seccomp.filter == NULL))
 		return SECCOMP_RET_KILL;
 
-	populate_seccomp_data(&sd);
+	if (!sd) {
+		populate_seccomp_data(&sd_local);
+		sd = &sd_local;
+	}
 
 	/*
 	 * All filters in the list are evaluated and the lowest BPF return
 	 * value always takes priority (ignoring the DATA).
 	 */
 	for (f = current->seccomp.filter; f; f = f->prev) {
-		u32 cur_ret = sk_run_filter_int_seccomp(&sd, f->insnsi);
+		u32 cur_ret = sk_run_filter_int_seccomp(sd, f->insnsi);
 		if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
 			ret = cur_ret;
 	}
@@ -406,7 +409,7 @@ void secure_computing_strict(int this_syscall)
 #else
 int __secure_computing(void)
 {
-	u32 phase1_result = seccomp_phase1();
+	u32 phase1_result = seccomp_phase1(NULL);
 	if (likely(phase1_result == SECCOMP_PHASE1_OK))
 		return 0;
 	else if (likely(phase1_result == SECCOMP_PHASE1_SKIP))
@@ -416,22 +419,22 @@ int __secure_computing(void)
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
-static u32 __seccomp_phase1_filter(int this_syscall, struct pt_regs *regs)
+static u32 __seccomp_phase1_filter(int this_syscall, struct seccomp_data *sd)
 {
-	u32 filter_ret = seccomp_run_filters();
+	u32 filter_ret = seccomp_run_filters(sd);
 	int data = filter_ret & SECCOMP_RET_DATA;
 	u32 action = filter_ret & SECCOMP_RET_ACTION;
 
 	switch (action) {
 	case SECCOMP_RET_ERRNO:
 		/* Set the low-order 16-bits as a errno. */
-		syscall_set_return_value(current, regs,
+		syscall_set_return_value(current, task_pt_regs(current),
 					 -data, 0);
 		goto skip;
 
 	case SECCOMP_RET_TRAP:
 		/* Show the handler the original registers. */
-		syscall_rollback(current, regs);
+		syscall_rollback(current, task_pt_regs(current));
 		/* Let the filter pass back 16 bits of data. */
 		seccomp_send_sigsys(this_syscall, data);
 		goto skip;
@@ -458,11 +461,14 @@ skip:
 
 /**
  * seccomp_phase1() - run fast path seccomp checks on the current syscall
+ * @sd: The seccomp_data or NULL
  *
  * This only reads pt_regs via the syscall_xyz helpers.  The only change
  * it will make to pt_regs is via syscall_set_return_value, and it will
  * only do that if it returns SECCOMP_PHASE1_SKIP.
  *
+ * If sd is provided, it will not read pt_regs at all.
+ *
  * It may also call do_exit or force a signal; these actions must be
  * safe.
  *
@@ -476,11 +482,11 @@ skip:
  * If it returns anything else, then the return value should be passed
  * to seccomp_phase2 from a context in which ptrace hooks are safe.
  */
-u32 seccomp_phase1(void)
+u32 seccomp_phase1(struct seccomp_data *sd)
 {
 	int mode = current->seccomp.mode;
-	struct pt_regs *regs = task_pt_regs(current);
-	int this_syscall = syscall_get_nr(current, regs);
+	int this_syscall = sd ? sd->nr :
+		syscall_get_nr(current, task_pt_regs(current));
 
 	switch (mode) {
 	case SECCOMP_MODE_STRICT:
@@ -488,7 +494,7 @@ u32 seccomp_phase1(void)
 		return SECCOMP_PHASE1_OK;
 #ifdef CONFIG_SECCOMP_FILTER
 	case SECCOMP_MODE_FILTER:
-		return __seccomp_phase1_filter(this_syscall, regs);
+		return __seccomp_phase1_filter(this_syscall, sd);
 #endif
 	default:
 		BUG();
-- 
1.9.3



* [RFC 5/5] x86,seccomp: Add a seccomp fastpath
From: Andy Lutomirski @ 2014-06-11 20:23 UTC
  To: linux-kernel, Kees Cook, Will Drewry
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Andy Lutomirski

On my VM, getpid takes about 70ns.  Before this patch, adding a
single-instruction always-accept seccomp filter added about 134ns of
overhead to getpid.  With this patch, the overhead is down to about
13ns.

I'm not really thrilled by this patch.  It has two main issues:

1. Calling into code in kernel/seccomp.c from assembly feels ugly.

2. The x86 64-bit syscall entry now has four separate code paths:
fast, seccomp only, audit only, and slow.  This kind of sucks.
Would it be worth trying to rewrite the whole thing in C with a
two-phase slow path approach like I'm using here for seccomp?

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/entry_64.S | 45 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/seccomp.h    |  4 ++--
 2 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index f9e713a..feb32b2 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -683,6 +683,45 @@ sysret_signal:
 	FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
 	jmp int_check_syscall_exit_work
 
+#ifdef CONFIG_SECCOMP
+	/*
+	 * Fast path for seccomp without any other slow path triggers.
+	 */
+seccomp_fastpath:
+	/* Build seccomp_data */
+	pushq %r9				/* args[5] */
+	pushq %r8				/* args[4] */
+	pushq %r10				/* args[3] */
+	pushq %rdx				/* args[2] */
+	pushq %rsi				/* args[1] */
+	pushq %rdi				/* args[0] */
+	pushq RIP-ARGOFFSET+6*8(%rsp)		/* rip */
+	pushq %rax 				/* nr and junk */
+	movl $AUDIT_ARCH_X86_64, 4(%rsp)	/* arch */
+	movq %rsp, %rdi
+	call seccomp_phase1
+	addq $8*8, %rsp
+	cmpq $1, %rax
+	ja seccomp_invoke_phase2
+	LOAD_ARGS 0  /* restore clobbered regs */
+	jb system_call_fastpath
+	jmp ret_from_sys_call
+
+seccomp_invoke_phase2:
+	SAVE_REST
+	FIXUP_TOP_OF_STACK %rdi
+	movq %rax,%rdi
+	call seccomp_phase2
+
+	/* if seccomp says to skip, then set orig_ax to -1 and skip */
+	test %eax,%eax
+	jz 1f
+	movq $-1, ORIG_RAX(%rsp)
+1:
+	mov ORIG_RAX(%rsp), %rax		/* reload rax */
+	jmp system_call_post_trace		/* and maybe do the syscall */
+#endif
+
 #ifdef CONFIG_AUDITSYSCALL
 	/*
 	 * Fast path for syscall audit without full syscall trace.
@@ -717,6 +756,10 @@ sysret_audit:
 
 	/* Do syscall tracing */
 tracesys:
+#ifdef CONFIG_SECCOMP
+	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SECCOMP),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	jz seccomp_fastpath
+#endif
 #ifdef CONFIG_AUDITSYSCALL
 	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 	jz auditsys
@@ -725,6 +768,8 @@ tracesys:
 	FIXUP_TOP_OF_STACK %rdi
 	movq %rsp,%rdi
 	call syscall_trace_enter
+
+system_call_post_trace:
 	/*
 	 * Reload arg registers from stack in case ptrace changed them.
 	 * We don't reload %rax because syscall_trace_enter() returned
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 4fc7a84..d3d4c52 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -37,8 +37,8 @@ static inline int secure_computing(void)
 #define SECCOMP_PHASE1_OK	0
 #define SECCOMP_PHASE1_SKIP	1
 
-extern u32 seccomp_phase1(struct seccomp_data *sd);
-int seccomp_phase2(u32 phase1_result);
+asmlinkage __visible extern u32 seccomp_phase1(struct seccomp_data *sd);
+asmlinkage __visible int seccomp_phase2(u32 phase1_result);
 #else
 extern void secure_computing_strict(int this_syscall);
 #endif
-- 
1.9.3



* Re: [RFC 5/5] x86,seccomp: Add a seccomp fastpath
From: Alexei Starovoitov @ 2014-06-11 21:29 UTC
  To: Andy Lutomirski
  Cc: linux-kernel, Kees Cook, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, linux-mips, linux-arch, linux-security-module

On Wed, Jun 11, 2014 at 1:23 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On my VM, getpid takes about 70ns.  Before this patch, adding a
> single-instruction always-accept seccomp filter added about 134ns of
> overhead to getpid.  With this patch, the overhead is down to about
> 13ns.

interesting.
Is this the gain from patch 4 to patch 5, or from patch 0 to patch 5?
13ns is still with seccomp enabled, but without filters?

> I'm not really thrilled by this patch.  It has two main issues:
>
> 1. Calling into code in kernel/seccomp.c from assembly feels ugly.
>
> 2. The x86 64-bit syscall entry now has four separate code paths:
> fast, seccomp only, audit only, and slow.  This kind of sucks.
> Would it be worth trying to rewrite the whole thing in C with a
> two-phase slow path approach like I'm using here for seccomp?
>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  arch/x86/kernel/entry_64.S | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/seccomp.h    |  4 ++--
>  2 files changed, 47 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index f9e713a..feb32b2 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -683,6 +683,45 @@ sysret_signal:
>         FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
>         jmp int_check_syscall_exit_work
>
> +#ifdef CONFIG_SECCOMP
> +       /*
> +        * Fast path for seccomp without any other slow path triggers.
> +        */
> +seccomp_fastpath:
> +       /* Build seccomp_data */
> +       pushq %r9                               /* args[5] */
> +       pushq %r8                               /* args[4] */
> +       pushq %r10                              /* args[3] */
> +       pushq %rdx                              /* args[2] */
> +       pushq %rsi                              /* args[1] */
> +       pushq %rdi                              /* args[0] */
> +       pushq RIP-ARGOFFSET+6*8(%rsp)           /* rip */
> +       pushq %rax                              /* nr and junk */
> +       movl $AUDIT_ARCH_X86_64, 4(%rsp)        /* arch */
> +       movq %rsp, %rdi
> +       call seccomp_phase1

The assembler code is pretty much repeating what C does in
populate_seccomp_data(). Assuming the whole gain came from
patch 5, why is the asm version so much faster than C?
It skips SAVE/RESTORE_REST... what else?
If most of the gain is from all the patches combined (mainly from
the two-phase approach), then why bother with asm?

Somehow it feels that the gain is due to better branch prediction
in the asm version. If you change a few 'unlikely' in C to 'likely',
it may get to the same performance...

btw patches #1-3 look good to me. especially #1 is nice.

> +       addq $8*8, %rsp
> +       cmpq $1, %rax
> +       ja seccomp_invoke_phase2
> +       LOAD_ARGS 0  /* restore clobbered regs */
> +       jb system_call_fastpath
> +       jmp ret_from_sys_call
> +
> +seccomp_invoke_phase2:
> +       SAVE_REST
> +       FIXUP_TOP_OF_STACK %rdi
> +       movq %rax,%rdi
> +       call seccomp_phase2
> +
> +       /* if seccomp says to skip, then set orig_ax to -1 and skip */
> +       test %eax,%eax
> +       jz 1f
> +       movq $-1, ORIG_RAX(%rsp)
> +1:
> +       mov ORIG_RAX(%rsp), %rax                /* reload rax */
> +       jmp system_call_post_trace              /* and maybe do the syscall */
> +#endif
> +
>  #ifdef CONFIG_AUDITSYSCALL
>         /*
>          * Fast path for syscall audit without full syscall trace.
> @@ -717,6 +756,10 @@ sysret_audit:
>
>         /* Do syscall tracing */
>  tracesys:
> +#ifdef CONFIG_SECCOMP
> +       testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SECCOMP),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> +       jz seccomp_fastpath
> +#endif
>  #ifdef CONFIG_AUDITSYSCALL
>         testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
>         jz auditsys
> @@ -725,6 +768,8 @@ tracesys:
>         FIXUP_TOP_OF_STACK %rdi
>         movq %rsp,%rdi
>         call syscall_trace_enter
> +
> +system_call_post_trace:
>         /*
>          * Reload arg registers from stack in case ptrace changed them.
>          * We don't reload %rax because syscall_trace_enter() returned
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 4fc7a84..d3d4c52 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -37,8 +37,8 @@ static inline int secure_computing(void)
>  #define SECCOMP_PHASE1_OK      0
>  #define SECCOMP_PHASE1_SKIP    1
>
> -extern u32 seccomp_phase1(struct seccomp_data *sd);
> -int seccomp_phase2(u32 phase1_result);
> +asmlinkage __visible extern u32 seccomp_phase1(struct seccomp_data *sd);
> +asmlinkage __visible int seccomp_phase2(u32 phase1_result);
>  #else
>  extern void secure_computing_strict(int this_syscall);
>  #endif
> --
> 1.9.3
>


* Re: [RFC 5/5] x86,seccomp: Add a seccomp fastpath
From: Andy Lutomirski @ 2014-06-11 21:56 UTC
  To: Alexei Starovoitov
  Cc: linux-kernel, Kees Cook, Will Drewry, Oleg Nesterov, X86 ML,
	linux-arm-kernel, linux-mips, linux-arch, LSM List

On Wed, Jun 11, 2014 at 2:29 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Wed, Jun 11, 2014 at 1:23 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On my VM, getpid takes about 70ns.  Before this patch, adding a
>> single-instruction always-accept seccomp filter added about 134ns of
>> overhead to getpid.  With this patch, the overhead is down to about
>> 13ns.
>
> interesting.
> Is this the gain from patch 4 to patch 5, or from patch 0 to patch 5?
> 13ns is still with seccomp enabled, but without filters?

13ns is with the simplest nonempty filter.  I hope that empty filters
don't work.

>
>> I'm not really thrilled by this patch.  It has two main issues:
>>
>> 1. Calling into code in kernel/seccomp.c from assembly feels ugly.
>>
>> 2. The x86 64-bit syscall entry now has four separate code paths:
>> fast, seccomp only, audit only, and slow.  This kind of sucks.
>> Would it be worth trying to rewrite the whole thing in C with a
>> two-phase slow path approach like I'm using here for seccomp?
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>  arch/x86/kernel/entry_64.S | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/seccomp.h    |  4 ++--
>>  2 files changed, 47 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> index f9e713a..feb32b2 100644
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -683,6 +683,45 @@ sysret_signal:
>>         FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
>>         jmp int_check_syscall_exit_work
>>
>> +#ifdef CONFIG_SECCOMP
>> +       /*
>> +        * Fast path for seccomp without any other slow path triggers.
>> +        */
>> +seccomp_fastpath:
>> +       /* Build seccomp_data */
>> +       pushq %r9                               /* args[5] */
>> +       pushq %r8                               /* args[4] */
>> +       pushq %r10                              /* args[3] */
>> +       pushq %rdx                              /* args[2] */
>> +       pushq %rsi                              /* args[1] */
>> +       pushq %rdi                              /* args[0] */
>> +       pushq RIP-ARGOFFSET+6*8(%rsp)           /* rip */

>> +       pushq %rax                              /* nr and junk */
>> +       movl $AUDIT_ARCH_X86_64, 4(%rsp)        /* arch */

It wouldn't shock me if this pair of instructions were
microarchitecturally bad.  Maybe I can save a few more cycles by using
bitwise arithmetic or a pair of movls and explicit rsp manipulation
here.  I haven't tried.

>> +       movq %rsp, %rdi
>> +       call seccomp_phase1
>
> The assembler code is pretty much repeating what C does in
> populate_seccomp_data(). Assuming the whole gain came from
> patch 5, why is the asm version so much faster than C?
>
> It skips SAVE/RESTORE_REST... what else?
> If most of the gain is from all the patches combined (mainly from
> the two-phase approach), then why bother with asm?

The whole gain should be patch 5, but there are three things going on here.

The biggest win is skipping int_ret_from_sys_call.  IRET sucks.
There's some extra win from skipping SAVE/RESTORE_REST, but I haven't
benchmarked that.  I would guess it's on the order of 5ns.  In theory
a one-pass implementation could skip int_ret_from_sys_call, but that
will be awkward to implement correctly.

The other benefit is generating seccomp_data in assembly.  It saves
about 7ns.  This is likely due to avoiding all the indirection in
syscall_xyz and to avoiding prodding at flags to figure out the arch
token.

Fiddling with branch prediction made no difference that I could measure.

>
> Somehow it feels that the gain is due to better branch prediction
> in the asm version. If you change a few 'unlikely' in C to 'likely',
> it may get to the same performance...
>
> btw patches #1-3 look good to me. especially #1 is nice.

Thanks :)

--Andy


* Re: [RFC 5/5] x86,seccomp: Add a seccomp fastpath
From: H. Peter Anvin @ 2014-06-11 22:18 UTC
  To: Andy Lutomirski, Alexei Starovoitov
  Cc: linux-kernel, Kees Cook, Will Drewry, Oleg Nesterov, X86 ML,
	linux-arm-kernel, linux-mips, linux-arch, LSM List

On 06/11/2014 02:56 PM, Andy Lutomirski wrote:
> 
> 13ns is with the simplest nonempty filter.  I hope that empty filters
> don't work.
> 

Why wouldn't they?

	-hpa




* Re: [RFC 5/5] x86,seccomp: Add a seccomp fastpath
From: Andy Lutomirski @ 2014-06-11 22:22 UTC
  To: H. Peter Anvin
  Cc: Alexei Starovoitov, linux-kernel, Kees Cook, Will Drewry,
	Oleg Nesterov, X86 ML, linux-arm-kernel, linux-mips, linux-arch,
	LSM List

On Wed, Jun 11, 2014 at 3:18 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 06/11/2014 02:56 PM, Andy Lutomirski wrote:
>>
>> 13ns is with the simplest nonempty filter.  I hope that empty filters
>> don't work.
>>
>
> Why wouldn't they?

Is it permissible to fall off the end of a BPF program?  I'm getting
EINVAL trying to install an actual empty filter.  The filter I tested
with was:

#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>
#include <err.h>
#include <sys/prctl.h>
#include <stddef.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rc;

    struct sock_filter filter[] = {
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
    };

    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
        .filter = filter,
    };

    if (argc < 2) {
        printf("Usage: null_seccomp PATH ARGS...\n");
        return 1;
    }

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
        err(1, "PR_SET_NO_NEW_PRIVS");
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
        err(1, "PR_SET_SECCOMP");

    execv(argv[1], argv + 1);
    err(1, argv[1]);
}


--Andy


* Re: [RFC 5/5] x86,seccomp: Add a seccomp fastpath
From: H. Peter Anvin @ 2014-06-11 22:27 UTC
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, linux-kernel, Kees Cook, Will Drewry,
	Oleg Nesterov, X86 ML, linux-arm-kernel, linux-mips, linux-arch,
	LSM List

On 06/11/2014 03:22 PM, Andy Lutomirski wrote:
> On Wed, Jun 11, 2014 at 3:18 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 06/11/2014 02:56 PM, Andy Lutomirski wrote:
>>>
>>> 13ns is with the simplest nonempty filter.  I hope that empty filters
>>> don't work.
>>>
>>
>> Why wouldn't they?
> 
> Is it permissible to fall off the end of a BPF program?  I'm getting
> EINVAL trying to install an actual empty filter.  The filter I tested
> with was:
> 

What I meant was that there has to be a well-defined behavior for the
program falling off the end anyway, and that that should be preserved.

I guess it is possible to require that all code paths must provably
reach a termination point.

	-hpa




* Re: [RFC 5/5] x86,seccomp: Add a seccomp fastpath
From: Andy Lutomirski @ 2014-06-11 22:28 UTC
  To: H. Peter Anvin
  Cc: Alexei Starovoitov, linux-kernel, Kees Cook, Will Drewry,
	Oleg Nesterov, X86 ML, linux-arm-kernel, linux-mips, linux-arch,
	LSM List

On Wed, Jun 11, 2014 at 3:27 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 06/11/2014 03:22 PM, Andy Lutomirski wrote:
>> On Wed, Jun 11, 2014 at 3:18 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>> On 06/11/2014 02:56 PM, Andy Lutomirski wrote:
>>>>
>>>> 13ns is with the simplest nonempty filter.  I hope that empty filters
>>>> don't work.
>>>>
>>>
>>> Why wouldn't they?
>>
>> Is it permissible to fall off the end of a BPF program?  I'm getting
>> EINVAL trying to install an actual empty filter.  The filter I tested
>> with was:
>>
>
> What I meant was that there has to be a well-defined behavior for the
> program falling off the end anyway, and that that should be preserved.
>
> I guess it is possible to require that all code paths must provably
> reach a termination point.
>

Dunno.  I haven't ever touched any of the actual BPF code.  This whole
patchset only changes the code that invokes the BPF evaluator.

--Andy


* Re: [RFC 5/5] x86,seccomp: Add a seccomp fastpath
From: Kees Cook @ 2014-06-11 22:32 UTC
  To: Andy Lutomirski
  Cc: H. Peter Anvin, Alexei Starovoitov, linux-kernel, Will Drewry,
	Oleg Nesterov, X86 ML, linux-arm-kernel, linux-mips, linux-arch,
	LSM List

On Wed, Jun 11, 2014 at 3:28 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Jun 11, 2014 at 3:27 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 06/11/2014 03:22 PM, Andy Lutomirski wrote:
>>> On Wed, Jun 11, 2014 at 3:18 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>>> On 06/11/2014 02:56 PM, Andy Lutomirski wrote:
>>>>>
>>>>> 13ns is with the simplest nonempty filter.  I hope that empty filters
>>>>> don't work.
>>>>>
>>>>
>>>> Why wouldn't they?
>>>
>>> Is it permissible to fall off the end of a BPF program?  I'm getting
>>> EINVAL trying to install an actual empty filter.  The filter I tested
>>> with was:
>>>
>>
>> What I meant was that there has to be a well-defined behavior for the
>> program falling off the end anyway, and that that should be preserved.
>>
>> I guess it is possible to require that all code paths must provably
>> reach a termination point.
>>
>
> Dunno.  I haven't ever touched any of the actual BPF code.  This whole
> patchset only changes the code that invokes the BPF evaluator.

Yes, this is how BPF works: it runs to the end or exits early. With
seccomp BPF specifically, the return value defaults to killing the
process. If a filter was missing (NULL), or empty, or didn't
explicitly return with a new value, the default (kill) should be
taken.
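
For illustration, a minimal kill-by-default filter in which every
path ends in an explicit RET, satisfying the checks Will quotes in
the reply below (the allowed syscall is arbitrary):

#include <stddef.h>		/* offsetof */
#include <linux/audit.h>	/* AUDIT_ARCH_X86_64 */
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>	/* __NR_getpid */

struct sock_filter allow_getpid[] = {
	/* Kill outright if the arch token doesn't match. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, arch)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
	/* Allow getpid; kill everything else. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
};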

-Kees

-- 
Kees Cook
Chrome OS Security


* Re: [RFC 5/5] x86,seccomp: Add a seccomp fastpath
From: Will Drewry @ 2014-06-13 16:29 UTC
  To: Kees Cook
  Cc: Andy Lutomirski, H. Peter Anvin, Alexei Starovoitov,
	linux-kernel, Oleg Nesterov, X86 ML, linux-arm-kernel,
	linux-mips, linux-arch, LSM List

On Wed, Jun 11, 2014 at 5:32 PM, Kees Cook <keescook@chromium.org> wrote:
> On Wed, Jun 11, 2014 at 3:28 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Wed, Jun 11, 2014 at 3:27 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>> On 06/11/2014 03:22 PM, Andy Lutomirski wrote:
>>>> On Wed, Jun 11, 2014 at 3:18 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>>>> On 06/11/2014 02:56 PM, Andy Lutomirski wrote:
>>>>>>
>>>>>> 13ns is with the simplest nonempty filter.  I hope that empty filters
>>>>>> don't work.
>>>>>>
>>>>>
>>>>> Why wouldn't they?
>>>>
>>>> Is it permissible to fall off the end of a BPF program?  I'm getting
>>>> EINVAL trying to install an actual empty filter.  The filter I tested
>>>> with was:
>>>>
>>>
>>> What I meant was that there has to be a well-defined behavior for the
>>> program falling off the end anyway, and that that should be preserved.
>>>
>>> I guess it is possible to require that all code paths must provably
>>> reach a termination point.
>>>
>>
>> Dunno.  I haven't ever touched any of the actual BPF code.  This whole
>> patchset only changes the code that invokes the BPF evaluator.
>
> Yes, this is how BPF works: runs to the end or exit early. With
> seccomp BPF specifically, the return value defaults to kill the
> process. If a filter was missing (NULL), or empty, or didn't
> explicitly return with a new value, the default (kill) should be
> taken.

Yup - this is just a property of BPF (and a nice one :)

On seccomp_attach_filter this check fires:
  if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
    return -EINVAL;

As well as in sk_chk_filter:
  if (flen == 0 || flen > BPF_MAXINSNS)
    return -EINVAL;

And:
  /* last instruction must be a RET code */
  switch (filter[flen - 1].code) {
    case BPF_S_RET_K:
    case BPF_S_RET_A:
      return check_load_and_stores(filter, flen);
  }

cheers!
will

