* [PATCH 0/7] Two-phase seccomp and x86 tracing changes
From: Andy Lutomirski @ 2014-07-15 19:32 UTC
  To: linux-kernel, Kees Cook, Will Drewry, James Morris
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, Andy Lutomirski

This is both a cleanup and a speedup.  It reduces the overhead of
installing a trivial seccomp filter by 87%.  The speedup comes from
avoiding the full syscall tracing mechanism for filters that don't
return SECCOMP_RET_TRACE.
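
For reference, the "trivial seccomp filter" in that measurement is a
one-instruction always-allow program.  A minimal userspace sketch of
installing such a filter (not part of this series) looks like:

#include <sys/prctl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

int main(void)
{
	/* One instruction: return SECCOMP_RET_ALLOW for every syscall. */
	struct sock_filter allow[] = {
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = { .len = 1, .filter = allow };

	/* Needed to install a filter without CAP_SYS_ADMIN. */
	prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}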

This series works by splitting the seccomp hooks into two phases.
The first phase evaluates the filter; it can skip syscalls, allow
them, kill the calling task, or pass a u32 to the second phase.  The
second phase requires a full tracing context, and it sends ptrace
events if necessary.
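
To illustrate the intended shape, an arch's entry code consumes the
split roughly as below.  This is a sketch, not literal kernel code;
run_the_syscall() and skip_the_syscall() are placeholders, and the
real definitions live in patches 2 and 5:

u32 res = seccomp_phase1(NULL);		/* phase 1: evaluate the filter */
if (res == SECCOMP_PHASE1_OK)
	run_the_syscall();		/* filter said allow */
else if (res == SECCOMP_PHASE1_SKIP)
	skip_the_syscall();		/* errno/trap already handled */
else if (seccomp_phase2(res) == -1)	/* RET_TRACE: needs full frame */
	skip_the_syscall();
else
	run_the_syscall();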

With that in place, I implemented a similar split for the x86 syscall
entry work.  The C callback is invoked in two phases: the first has
only a partial frame, and it can request phase 2 processing with a
full frame.

Finally, I switched the 64-bit system_call code to use the new split
entry work.  This is a net deletion of assembly code: it replaces
all of the audit entry muck.

In the process, I fixed some bugs.

If this is acceptable, someone can do the same tweak for the
ia32entry and entry_32 code.

This passes all seccomp tests that I know of, except for the ones
that don't work on current kernels.

Presumably, once everyone gets a chance to poke at this, it should
go into some combination of James' and hpa's trees.

Changes from RFC version:
 - The first three patches are more or less the same
 - The rest is more or less a rewrite

Andy Lutomirski (7):
  seccomp,x86,arm,mips,s390: Remove nr parameter from secure_computing
  seccomp: Refactor the filter callback and the API
  seccomp: Allow arch code to provide seccomp_data
  x86,x32,audit: Fix x32's AUDIT_ARCH wrt audit
  x86: Split syscall_trace_enter into two phases
  x86_64,entry: Treat regs->ax the same in fastpath and slowpath
    syscalls
  x86_64,entry: Use split-phase syscall_trace_enter for 64-bit syscalls

 arch/arm/kernel/ptrace.c       |   7 +-
 arch/mips/kernel/ptrace.c      |   2 +-
 arch/s390/kernel/ptrace.c      |   2 +-
 arch/x86/include/asm/calling.h |   6 +-
 arch/x86/include/asm/ptrace.h  |   5 +
 arch/x86/kernel/entry_64.S     |  51 ++++-----
 arch/x86/kernel/ptrace.c       | 146 +++++++++++++++++++-----
 arch/x86/kernel/vsyscall_64.c  |   2 +-
 include/linux/seccomp.h        |  25 +++--
 kernel/seccomp.c               | 246 +++++++++++++++++++++++++++--------------
 10 files changed, 339 insertions(+), 153 deletions(-)

-- 
1.9.3



* [PATCH 1/7] seccomp,x86,arm,mips,s390: Remove nr parameter from secure_computing
From: Andy Lutomirski @ 2014-07-15 19:32 UTC
  To: linux-kernel, Kees Cook, Will Drewry, James Morris
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, Andy Lutomirski,
	Russell King, Ralf Baechle, Martin Schwidefsky, Heiko Carstens,
	linux-s390

The secure_computing function took a syscall number parameter, but
it only paid any attention to that parameter if seccomp mode 1 was
enabled.  Rather than coming up with a kludge to get the parameter
to work in mode 2, just remove the parameter.

To avoid churn in arches that don't have seccomp filters (and may
not even support syscall_get_nr right now), this leaves the
parameter in secure_computing_strict, which is now a real function.

For ARM, this is a bit ugly due to the fact that ARM conditionally
supports seccomp filters.  Fixing that would probably only be a
couple of lines of code, but it should be coordinated with the audit
maintainers.

This will be a slight slowdown on some arches.  The right fix is to
pass in all of seccomp_data instead of trying to make just the
syscall nr part be fast.
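
For reference, seccomp_data is the existing UAPI structure from
include/uapi/linux/seccomp.h that filters already execute over:

struct seccomp_data {
	int nr;				/* system call number */
	__u32 arch;			/* AUDIT_ARCH_* value */
	__u64 instruction_pointer;	/* IP at the time of the syscall */
	__u64 args[6];			/* up to 6 args, as 64-bit values */
};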

This is a prerequisite for making two-phase seccomp work cleanly.

Cc: Russell King <linux@arm.linux.org.uk>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Cc: x86@kernel.org
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/arm/kernel/ptrace.c      |  7 ++++-
 arch/mips/kernel/ptrace.c     |  2 +-
 arch/s390/kernel/ptrace.c     |  2 +-
 arch/x86/kernel/ptrace.c      |  2 +-
 arch/x86/kernel/vsyscall_64.c |  2 +-
 include/linux/seccomp.h       | 21 +++++++-------
 kernel/seccomp.c              | 64 ++++++++++++++++++++++++++++++-------------
 7 files changed, 66 insertions(+), 34 deletions(-)

diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
index 0c27ed6..5e772a2 100644
--- a/arch/arm/kernel/ptrace.c
+++ b/arch/arm/kernel/ptrace.c
@@ -933,8 +933,13 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs, int scno)
 	current_thread_info()->syscall = scno;
 
 	/* Do the secure computing check first; failures should be fast. */
-	if (secure_computing(scno) == -1)
+#ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
+	if (secure_computing() == -1)
 		return -1;
+#else
+	/* XXX: remove this once OABI gets fixed */
+	secure_computing_strict(scno);
+#endif
 
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
diff --git a/arch/mips/kernel/ptrace.c b/arch/mips/kernel/ptrace.c
index f639ccd..808bafc 100644
--- a/arch/mips/kernel/ptrace.c
+++ b/arch/mips/kernel/ptrace.c
@@ -639,7 +639,7 @@ asmlinkage long syscall_trace_enter(struct pt_regs *regs, long syscall)
 	long ret = 0;
 	user_exit();
 
-	if (secure_computing(syscall) == -1)
+	if (secure_computing() == -1)
 		return -1;
 
 	if (test_thread_flag(TIF_SYSCALL_TRACE) &&
diff --git a/arch/s390/kernel/ptrace.c b/arch/s390/kernel/ptrace.c
index 2d716734..7ab8b91 100644
--- a/arch/s390/kernel/ptrace.c
+++ b/arch/s390/kernel/ptrace.c
@@ -795,7 +795,7 @@ asmlinkage long do_syscall_trace_enter(struct pt_regs *regs)
 	long ret = 0;
 
 	/* Do the secure computing check first. */
-	if (secure_computing(regs->gprs[2])) {
+	if (secure_computing()) {
 		/* seccomp failures shouldn't expose any additional code. */
 		ret = -1;
 		goto out;
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 678c0ad..93c182a 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1471,7 +1471,7 @@ long syscall_trace_enter(struct pt_regs *regs)
 		regs->flags |= X86_EFLAGS_TF;
 
 	/* do the secure computing check first */
-	if (secure_computing(regs->orig_ax)) {
+	if (secure_computing()) {
 		/* seccomp failures shouldn't expose any additional code. */
 		ret = -1L;
 		goto out;
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index ea5b570..23c0c23 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -216,7 +216,7 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
 	 */
 	regs->orig_ax = syscall_nr;
 	regs->ax = -ENOSYS;
-	tmp = secure_computing(syscall_nr);
+	tmp = secure_computing();
 	if ((!tmp && regs->orig_ax != syscall_nr) || regs->ip != address) {
 		warn_bad_vsyscall(KERN_DEBUG, regs,
 				  "seccomp tried to change syscall nr or ip");
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 4054b09..6e655a6 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -25,19 +25,17 @@ struct seccomp {
 	struct seccomp_filter *filter;
 };
 
-extern int __secure_computing(int);
-static inline int secure_computing(int this_syscall)
+#ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
+extern int __secure_computing(void);
+static inline int secure_computing(void)
 {
 	if (unlikely(test_thread_flag(TIF_SECCOMP)))
-		return  __secure_computing(this_syscall);
+		return  __secure_computing();
 	return 0;
 }
-
-/* A wrapper for architectures supporting only SECCOMP_MODE_STRICT. */
-static inline void secure_computing_strict(int this_syscall)
-{
-	BUG_ON(secure_computing(this_syscall) != 0);
-}
+#else
+extern void secure_computing_strict(int this_syscall);
+#endif
 
 extern long prctl_get_seccomp(void);
 extern long prctl_set_seccomp(unsigned long, char __user *);
@@ -54,8 +52,11 @@ static inline int seccomp_mode(struct seccomp *s)
 struct seccomp { };
 struct seccomp_filter { };
 
-static inline int secure_computing(int this_syscall) { return 0; }
+#ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
+static inline int secure_computing(void) { return 0; }
+#else
 static inline void secure_computing_strict(int this_syscall) { return; }
+#endif
 
 static inline long prctl_get_seccomp(void)
 {
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 301bbc2..15ee9d6 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -21,8 +21,11 @@
 
 /* #define SECCOMP_DEBUG 1 */
 
-#ifdef CONFIG_SECCOMP_FILTER
+#ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
 #include <asm/syscall.h>
+#endif
+
+#ifdef CONFIG_SECCOMP_FILTER
 #include <linux/filter.h>
 #include <linux/ptrace.h>
 #include <linux/security.h>
@@ -170,7 +173,7 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
  *
  * Returns valid seccomp BPF response codes.
  */
-static u32 seccomp_run_filters(int syscall)
+static u32 seccomp_run_filters(void)
 {
 	struct seccomp_filter *f;
 	struct seccomp_data sd;
@@ -380,32 +383,54 @@ static int mode1_syscalls_32[] = {
 };
 #endif
 
-int __secure_computing(int this_syscall)
+static void __secure_computing_strict(int this_syscall)
+{
+	int *syscall_whitelist = mode1_syscalls;
+#ifdef CONFIG_COMPAT
+	if (is_compat_task())
+		syscall_whitelist = mode1_syscalls_32;
+#endif
+	do {
+		if (*syscall_whitelist == this_syscall)
+			return;
+	} while (*++syscall_whitelist);
+
+#ifdef SECCOMP_DEBUG
+	dump_stack();
+#endif
+	audit_seccomp(this_syscall, SIGKILL, SECCOMP_RET_KILL);
+	do_exit(SIGKILL);
+}
+
+#ifndef CONFIG_HAVE_ARCH_SECCOMP_FILTER
+void secure_computing_strict(int this_syscall)
+{
+	int mode = current->seccomp.mode;
+
+	if (mode == 0)
+		return;
+	else if (mode == SECCOMP_MODE_STRICT)
+		__secure_computing_strict(this_syscall);
+	else
+		BUG();
+}
+#else
+int __secure_computing(void)
 {
 	int mode = current->seccomp.mode;
+	struct pt_regs *regs = task_pt_regs(current);
+	int this_syscall = syscall_get_nr(current, regs);
 	int exit_sig = 0;
-	int *syscall;
 	u32 ret;
 
 	switch (mode) {
 	case SECCOMP_MODE_STRICT:
-		syscall = mode1_syscalls;
-#ifdef CONFIG_COMPAT
-		if (is_compat_task())
-			syscall = mode1_syscalls_32;
-#endif
-		do {
-			if (*syscall == this_syscall)
-				return 0;
-		} while (*++syscall);
-		exit_sig = SIGKILL;
-		ret = SECCOMP_RET_KILL;
-		break;
+		__secure_computing_strict(this_syscall);
+		return 0;
 #ifdef CONFIG_SECCOMP_FILTER
 	case SECCOMP_MODE_FILTER: {
 		int data;
-		struct pt_regs *regs = task_pt_regs(current);
-		ret = seccomp_run_filters(this_syscall);
+		ret = seccomp_run_filters();
 		data = ret & SECCOMP_RET_DATA;
 		ret &= SECCOMP_RET_ACTION;
 		switch (ret) {
@@ -463,9 +488,10 @@ int __secure_computing(int this_syscall)
 #ifdef CONFIG_SECCOMP_FILTER
 skip:
 	audit_seccomp(this_syscall, exit_sig, ret);
-#endif
 	return -1;
+#endif
 }
+#endif /* CONFIG_HAVE_ARCH_SECCOMP_FILTER */
 
 long prctl_get_seccomp(void)
 {
-- 
1.9.3



* [PATCH 2/7] seccomp: Refactor the filter callback and the API
From: Andy Lutomirski @ 2014-07-15 19:32 UTC
  To: linux-kernel, Kees Cook, Will Drewry, James Morris
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, Andy Lutomirski

The reason I did this is to add a seccomp API that will be usable
for an x86 fast path.  The x86 entry code needs to use a rather
expensive slow path for a syscall that might be visible to things
like ptrace.  By splitting seccomp into two phases, we can check
whether we need the slow path and then use the fast path if the
filter allows the syscall or just returns some errno.
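
For example, a filter like the sketch below, which fails getpid() with
EPERM and allows everything else, never returns SECCOMP_RET_TRACE, so
the fast path can handle it completely:

struct sock_filter insns[] = {
	/* Load the syscall number. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, nr)),
	/* If it is getpid, return ERRNO|EPERM... */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
	BPF_STMT(BPF_RET | BPF_K,
		 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
	/* ...otherwise allow. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};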

As a side effect, I think the new code is much easier to understand
than the old code.

This has one user-visible effect: the audit record written for
SECCOMP_RET_TRACE is now a simple indication that SECCOMP_RET_TRACE
happened.  It used to depend in a complicated way on what the tracer
did.  I couldn't make much sense of it.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/linux/seccomp.h |   6 ++
 kernel/seccomp.c        | 180 +++++++++++++++++++++++++++++++-----------------
 2 files changed, 123 insertions(+), 63 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 6e655a6..8345fdc 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -33,6 +33,12 @@ static inline int secure_computing(void)
 		return  __secure_computing();
 	return 0;
 }
+
+#define SECCOMP_PHASE1_OK	0
+#define SECCOMP_PHASE1_SKIP	1
+
+extern u32 seccomp_phase1(void);
+int seccomp_phase2(u32 phase1_result);
 #else
 extern void secure_computing_strict(int this_syscall);
 #endif
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 15ee9d6..d737445 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -19,8 +19,6 @@
 #include <linux/sched.h>
 #include <linux/seccomp.h>
 
-/* #define SECCOMP_DEBUG 1 */
-
 #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
 #include <asm/syscall.h>
 #endif
@@ -417,79 +415,135 @@ void secure_computing_strict(int this_syscall)
 #else
 int __secure_computing(void)
 {
+	u32 phase1_result = seccomp_phase1();
+
+	if (likely(phase1_result == SECCOMP_PHASE1_OK))
+		return 0;
+	else if (likely(phase1_result == SECCOMP_PHASE1_SKIP))
+		return -1;
+	else
+		return seccomp_phase2(phase1_result);
+}
+
+#ifdef CONFIG_SECCOMP_FILTER
+static u32 __seccomp_phase1_filter(int this_syscall, struct pt_regs *regs)
+{
+	u32 filter_ret = seccomp_run_filters();
+	int data = filter_ret & SECCOMP_RET_DATA;
+	u32 action = filter_ret & SECCOMP_RET_ACTION;
+
+	switch (action) {
+	case SECCOMP_RET_ERRNO:
+		/* Set the low-order 16-bits as a errno. */
+		syscall_set_return_value(current, regs,
+					 -data, 0);
+		goto skip;
+
+	case SECCOMP_RET_TRAP:
+		/* Show the handler the original registers. */
+		syscall_rollback(current, regs);
+		/* Let the filter pass back 16 bits of data. */
+		seccomp_send_sigsys(this_syscall, data);
+		goto skip;
+
+	case SECCOMP_RET_TRACE:
+		return filter_ret;  /* Save the rest for phase 2. */
+
+	case SECCOMP_RET_ALLOW:
+		return SECCOMP_PHASE1_OK;
+
+	case SECCOMP_RET_KILL:
+	default:
+		audit_seccomp(this_syscall, SIGSYS, action);
+		do_exit(SIGSYS);
+	}
+
+	unreachable();
+
+skip:
+	audit_seccomp(this_syscall, 0, action);
+	return SECCOMP_PHASE1_SKIP;
+}
+#endif
+
+/**
+ * seccomp_phase1() - run fast path seccomp checks on the current syscall
+ *
+ * This only reads pt_regs via the syscall_xyz helpers.  The only change
+ * it will make to pt_regs is via syscall_set_return_value, and it will
+ * only do that if it returns SECCOMP_PHASE1_SKIP.
+ *
+ * It may also call do_exit or force a signal; these actions must be
+ * safe.
+ *
+ * If it returns SECCOMP_PHASE1_OK, the syscall passes checks and should
+ * be processed normally.
+ *
+ * If it returns SECCOMP_PHASE1_SKIP, then the syscall should not be
+ * invoked.  In this case, seccomp_phase1 will have set the return value
+ * using syscall_set_return_value.
+ *
+ * If it returns anything else, then the return value should be passed
+ * to seccomp_phase2 from a context in which ptrace hooks are safe.
+ */
+u32 seccomp_phase1(void)
+{
 	int mode = current->seccomp.mode;
 	struct pt_regs *regs = task_pt_regs(current);
 	int this_syscall = syscall_get_nr(current, regs);
-	int exit_sig = 0;
-	u32 ret;
 
 	switch (mode) {
 	case SECCOMP_MODE_STRICT:
-		__secure_computing_strict(this_syscall);
-		return 0;
+		__secure_computing_strict(this_syscall);  /* may call do_exit */
+		return SECCOMP_PHASE1_OK;
 #ifdef CONFIG_SECCOMP_FILTER
-	case SECCOMP_MODE_FILTER: {
-		int data;
-		ret = seccomp_run_filters();
-		data = ret & SECCOMP_RET_DATA;
-		ret &= SECCOMP_RET_ACTION;
-		switch (ret) {
-		case SECCOMP_RET_ERRNO:
-			/* Set the low-order 16-bits as a errno. */
-			syscall_set_return_value(current, regs,
-						 -data, 0);
-			goto skip;
-		case SECCOMP_RET_TRAP:
-			/* Show the handler the original registers. */
-			syscall_rollback(current, regs);
-			/* Let the filter pass back 16 bits of data. */
-			seccomp_send_sigsys(this_syscall, data);
-			goto skip;
-		case SECCOMP_RET_TRACE:
-			/* Skip these calls if there is no tracer. */
-			if (!ptrace_event_enabled(current, PTRACE_EVENT_SECCOMP)) {
-				syscall_set_return_value(current, regs,
-							 -ENOSYS, 0);
-				goto skip;
-			}
-			/* Allow the BPF to provide the event message */
-			ptrace_event(PTRACE_EVENT_SECCOMP, data);
-			/*
-			 * The delivery of a fatal signal during event
-			 * notification may silently skip tracer notification.
-			 * Terminating the task now avoids executing a system
-			 * call that may not be intended.
-			 */
-			if (fatal_signal_pending(current))
-				break;
-			if (syscall_get_nr(current, regs) < 0)
-				goto skip;  /* Explicit request to skip. */
-
-			return 0;
-		case SECCOMP_RET_ALLOW:
-			return 0;
-		case SECCOMP_RET_KILL:
-		default:
-			break;
-		}
-		exit_sig = SIGSYS;
-		break;
-	}
+	case SECCOMP_MODE_FILTER:
+		return __seccomp_phase1_filter(this_syscall, regs);
 #endif
 	default:
 		BUG();
 	}
+}
 
-#ifdef SECCOMP_DEBUG
-	dump_stack();
-#endif
-	audit_seccomp(this_syscall, exit_sig, ret);
-	do_exit(exit_sig);
-#ifdef CONFIG_SECCOMP_FILTER
-skip:
-	audit_seccomp(this_syscall, exit_sig, ret);
-	return -1;
-#endif
+/**
+ * seccomp_phase2() - finish slow path seccomp work for the current syscall
+ * @phase1_result: The return value from seccomp_phase1()
+ *
+ * This must be called from a context in which ptrace hooks can be used.
+ *
+ * Returns 0 if the syscall should be processed or -1 to skip the syscall.
+ */
+int seccomp_phase2(u32 phase1_result)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	u32 action = phase1_result & SECCOMP_RET_ACTION;
+	int data = phase1_result & SECCOMP_RET_DATA;
+
+	BUG_ON(action != SECCOMP_RET_TRACE);
+
+	audit_seccomp(syscall_get_nr(current, regs), 0, action);
+
+	/* Skip these calls if there is no tracer. */
+	if (!ptrace_event_enabled(current, PTRACE_EVENT_SECCOMP)) {
+		syscall_set_return_value(current, regs,
+					 -ENOSYS, 0);
+		return -1;
+	}
+
+	/* Allow the BPF to provide the event message */
+	ptrace_event(PTRACE_EVENT_SECCOMP, data);
+	/*
+	 * The delivery of a fatal signal during event
+	 * notification may silently skip tracer notification.
+	 * Terminating the task now avoids executing a system
+	 * call that may not be intended.
+	 */
+	if (fatal_signal_pending(current))
+		do_exit(SIGSYS);
+	if (syscall_get_nr(current, regs) < 0)
+		return -1;  /* Explicit request to skip. */
+
+	return 0;
 }
 #endif /* CONFIG_HAVE_ARCH_SECCOMP_FILTER */
 
-- 
1.9.3



* [PATCH 3/7] seccomp: Allow arch code to provide seccomp_data
From: Andy Lutomirski @ 2014-07-15 19:32 UTC
  To: linux-kernel, Kees Cook, Will Drewry, James Morris
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, Andy Lutomirski

populate_seccomp_data is expensive: it works by inspecting
task_pt_regs and various other bits to piece together all the
information, and it does so in multiple partially redundant steps.

Arch-specific code in the syscall entry path can do much better.

Admittedly this adds a bit of additional room for error, but the
speedup should be worth it.
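
Patch 5 does this for x86: the entry code fills the structure straight
from registers it already has on hand and passes it down, so
populate_seccomp_data() is never called.  In outline (64-bit case,
following that patch):

struct seccomp_data sd;

sd.arch = AUDIT_ARCH_X86_64;
sd.nr = regs->orig_ax;
sd.instruction_pointer = regs->ip;
sd.args[0] = regs->di;
sd.args[1] = regs->si;
sd.args[2] = regs->dx;
sd.args[3] = regs->r10;
sd.args[4] = regs->r8;
sd.args[5] = regs->r9;
ret = seccomp_phase1(&sd);	/* no populate_seccomp_data() */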

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/linux/seccomp.h |  2 +-
 kernel/seccomp.c        | 32 +++++++++++++++++++-------------
 2 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 8345fdc..4fc7a84 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -37,7 +37,7 @@ static inline int secure_computing(void)
 #define SECCOMP_PHASE1_OK	0
 #define SECCOMP_PHASE1_SKIP	1
 
-extern u32 seccomp_phase1(void);
+extern u32 seccomp_phase1(struct seccomp_data *sd);
 int seccomp_phase2(u32 phase1_result);
 #else
 extern void secure_computing_strict(int this_syscall);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index d737445..391f6c4 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -171,24 +171,27 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
  *
  * Returns valid seccomp BPF response codes.
  */
-static u32 seccomp_run_filters(void)
+static u32 seccomp_run_filters(struct seccomp_data *sd)
 {
 	struct seccomp_filter *f;
-	struct seccomp_data sd;
+	struct seccomp_data sd_local;
 	u32 ret = SECCOMP_RET_ALLOW;
 
 	/* Ensure unexpected behavior doesn't result in failing open. */
 	if (WARN_ON(current->seccomp.filter == NULL))
 		return SECCOMP_RET_KILL;
 
-	populate_seccomp_data(&sd);
+	if (!sd) {
+		populate_seccomp_data(&sd_local);
+		sd = &sd_local;
+	}
 
 	/*
 	 * All filters in the list are evaluated and the lowest BPF return
 	 * value always takes priority (ignoring the DATA).
 	 */
 	for (f = current->seccomp.filter; f; f = f->prev) {
-		u32 cur_ret = SK_RUN_FILTER(f->prog, (void *)&sd);
+		u32 cur_ret = SK_RUN_FILTER(f->prog, (void *)sd);
 
 		if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
 			ret = cur_ret;
@@ -415,7 +418,7 @@ void secure_computing_strict(int this_syscall)
 #else
 int __secure_computing(void)
 {
-	u32 phase1_result = seccomp_phase1();
+	u32 phase1_result = seccomp_phase1(NULL);
 
 	if (likely(phase1_result == SECCOMP_PHASE1_OK))
 		return 0;
@@ -426,22 +429,22 @@ int __secure_computing(void)
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
-static u32 __seccomp_phase1_filter(int this_syscall, struct pt_regs *regs)
+static u32 __seccomp_phase1_filter(int this_syscall, struct seccomp_data *sd)
 {
-	u32 filter_ret = seccomp_run_filters();
+	u32 filter_ret = seccomp_run_filters(sd);
 	int data = filter_ret & SECCOMP_RET_DATA;
 	u32 action = filter_ret & SECCOMP_RET_ACTION;
 
 	switch (action) {
 	case SECCOMP_RET_ERRNO:
 		/* Set the low-order 16-bits as a errno. */
-		syscall_set_return_value(current, regs,
+		syscall_set_return_value(current, task_pt_regs(current),
 					 -data, 0);
 		goto skip;
 
 	case SECCOMP_RET_TRAP:
 		/* Show the handler the original registers. */
-		syscall_rollback(current, regs);
+		syscall_rollback(current, task_pt_regs(current));
 		/* Let the filter pass back 16 bits of data. */
 		seccomp_send_sigsys(this_syscall, data);
 		goto skip;
@@ -468,11 +471,14 @@ skip:
 
 /**
  * seccomp_phase1() - run fast path seccomp checks on the current syscall
+ * @arg sd: The seccomp_data or NULL
  *
  * This only reads pt_regs via the syscall_xyz helpers.  The only change
  * it will make to pt_regs is via syscall_set_return_value, and it will
  * only do that if it returns SECCOMP_PHASE1_SKIP.
  *
+ * If sd is provided, it will not read pt_regs at all.
+ *
  * It may also call do_exit or force a signal; these actions must be
  * safe.
  *
@@ -486,11 +492,11 @@ skip:
  * If it returns anything else, then the return value should be passed
  * to seccomp_phase2 from a context in which ptrace hooks are safe.
  */
-u32 seccomp_phase1(void)
+u32 seccomp_phase1(struct seccomp_data *sd)
 {
 	int mode = current->seccomp.mode;
-	struct pt_regs *regs = task_pt_regs(current);
-	int this_syscall = syscall_get_nr(current, regs);
+	int this_syscall = sd ? sd->nr :
+		syscall_get_nr(current, task_pt_regs(current));
 
 	switch (mode) {
 	case SECCOMP_MODE_STRICT:
@@ -498,7 +504,7 @@ u32 seccomp_phase1(void)
 		return SECCOMP_PHASE1_OK;
 #ifdef CONFIG_SECCOMP_FILTER
 	case SECCOMP_MODE_FILTER:
-		return __seccomp_phase1_filter(this_syscall, regs);
+		return __seccomp_phase1_filter(this_syscall, sd);
 #endif
 	default:
 		BUG();
-- 
1.9.3



* [PATCH 4/7] x86,x32,audit: Fix x32's AUDIT_ARCH wrt audit
From: Andy Lutomirski @ 2014-07-15 19:32 UTC
  To: linux-kernel, Kees Cook, Will Drewry, James Morris
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, Andy Lutomirski

is_compat_task() is the wrong check for audit arch; the check should
be is_ia32_task(): x32 syscalls should be reported as
AUDIT_ARCH_X86_64, not AUDIT_ARCH_I386.

CONFIG_AUDITSYSCALL is currently incompatible with x32, so this has
no visible effect.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/ptrace.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 93c182a..39296d2 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1441,15 +1441,6 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
 	force_sig_info(SIGTRAP, &info, tsk);
 }
 
-
-#ifdef CONFIG_X86_32
-# define IS_IA32	1
-#elif defined CONFIG_IA32_EMULATION
-# define IS_IA32	is_compat_task()
-#else
-# define IS_IA32	0
-#endif
-
 /*
  * We must return the syscall number to actually look up in the table.
  * This can be -1L to skip running any syscall at all.
@@ -1487,7 +1478,7 @@ long syscall_trace_enter(struct pt_regs *regs)
 	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
 		trace_sys_enter(regs, regs->orig_ax);
 
-	if (IS_IA32)
+	if (is_ia32_task())
 		audit_syscall_entry(AUDIT_ARCH_I386,
 				    regs->orig_ax,
 				    regs->bx, regs->cx,
-- 
1.9.3



* [PATCH 5/7] x86: Split syscall_trace_enter into two phases
From: Andy Lutomirski @ 2014-07-15 19:32 UTC
  To: linux-kernel, Kees Cook, Will Drewry, James Morris
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, Andy Lutomirski

This splits syscall_trace_enter into syscall_trace_enter_phase1 and
syscall_trace_enter_phase2.  Only phase 2 has full pt_regs, and only
phase 2 is permitted to modify any of pt_regs except for orig_ax.

The intent is that phase 1 can be called from the syscall fast path.
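
For orientation before the diff, the slow-path wrapper added at the
end of this patch shows how the two phases compose (patch 7 open-codes
the same flow in assembly):

long syscall_trace_enter(struct pt_regs *regs)
{
	u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
	unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);

	if (phase1_result == 0)		/* fast: just run the syscall */
		return regs->orig_ax;
	else				/* slow: full-frame phase 2 */
		return syscall_trace_enter_phase2(regs, arch, phase1_result);
}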

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/include/asm/ptrace.h |   5 ++
 arch/x86/kernel/ptrace.c      | 139 ++++++++++++++++++++++++++++++++++++------
 2 files changed, 125 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 14fd6fd..dcbfb49 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -75,6 +75,11 @@ convert_ip_to_linear(struct task_struct *child, struct pt_regs *regs);
 extern void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
 			 int error_code, int si_code);
 
+
+extern unsigned long syscall_trace_enter_phase1(struct pt_regs *, u32 arch);
+extern long syscall_trace_enter_phase2(struct pt_regs *, u32 arch,
+				       unsigned long phase1_result);
+
 extern long syscall_trace_enter(struct pt_regs *);
 extern void syscall_trace_leave(struct pt_regs *);
 
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 39296d2..8e05418 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1441,13 +1441,111 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
 	force_sig_info(SIGTRAP, &info, tsk);
 }
 
+static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
+{
+	if (arch == AUDIT_ARCH_X86_64) {
+		audit_syscall_entry(arch, regs->orig_ax, regs->di,
+				    regs->si, regs->dx, regs->r10);
+	} else {
+		audit_syscall_entry(arch, regs->orig_ax, regs->bx,
+				    regs->cx, regs->dx, regs->si);
+	}
+}
+
 /*
- * We must return the syscall number to actually look up in the table.
- * This can be -1L to skip running any syscall at all.
+ * We can return 0 to resume the syscall or anything else to go to phase
+ * 2.  If we resume the syscall, we need to put something appropriate in
+ * regs->orig_ax.
+ *
+ * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
+ * are fully functional.
+ *
+ * For phase 2's benefit, our return value is:
+ * 0: resume the syscall
+ * 1: go to phase 2; no seccomp phase 2 needed
+ * 2: go to phase 2; pass return value to seccomp
  */
-long syscall_trace_enter(struct pt_regs *regs)
+unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
+{
+	unsigned long ret = 0;
+	u32 work;
+
+	BUG_ON(regs != task_pt_regs(current));
+
+	work = ACCESS_ONCE(current_thread_info()->flags) &
+		_TIF_WORK_SYSCALL_ENTRY;
+
+#ifdef CONFIG_SECCOMP
+	/*
+	 * Do seccomp first -- it should minimize exposure of other
+	 * code, and keeping seccomp fast is probably more valuable
+	 * than the rest of this.
+	 */
+	if (work & _TIF_SECCOMP) {
+		struct seccomp_data sd;
+
+		sd.arch = arch;
+		sd.nr = regs->orig_ax;
+		sd.instruction_pointer = regs->ip;
+		if (arch == AUDIT_ARCH_X86_64) {
+			sd.args[0] = regs->di;
+			sd.args[1] = regs->si;
+			sd.args[2] = regs->dx;
+			sd.args[3] = regs->r10;
+			sd.args[4] = regs->r8;
+			sd.args[5] = regs->r9;
+		} else {
+			sd.args[0] = regs->bx;
+			sd.args[1] = regs->cx;
+			sd.args[2] = regs->dx;
+			sd.args[3] = regs->si;
+			sd.args[4] = regs->di;
+			sd.args[5] = regs->bp;
+		}
+
+		BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
+		BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
+
+		ret = seccomp_phase1(&sd);
+		if (ret == SECCOMP_PHASE1_SKIP) {
+			regs->orig_ax = -ENOSYS;
+			ret = 0;
+		} else if (ret != SECCOMP_PHASE1_OK) {
+			return ret;  /* Go directly to phase 2 */
+		}
+
+		work &= ~_TIF_SECCOMP;
+	}
+#endif
+
+	/* Do our best to finish without phase 2. */
+	if (work == 0)
+		return ret;  /* seccomp only (ret == 0 here) */
+
+#ifdef CONFIG_AUDITSYSCALL
+	if (work == _TIF_SYSCALL_AUDIT) {
+		/*
+		 * If there is no more work to be done except auditing,
+		 * then audit in phase 1.  Phase 2 always audits, so, if
+		 * we audit here, then we can't go on to phase 2.
+		 */
+		do_audit_syscall_entry(regs, arch);
+		return 0;
+	}
+#endif
+
+	return 1;  /* Something is enabled that we can't handle in phase 1 */
+}
+
+/* Returns the syscall nr to run (which should match regs->orig_ax). */
+long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
+				unsigned long phase1_result)
 {
 	long ret = 0;
+	u32 work = ACCESS_ONCE(current_thread_info()->flags) &
+		_TIF_WORK_SYSCALL_ENTRY;
+
+	BUG_ON(regs != task_pt_regs(current));
 
 	user_exit();
 
@@ -1458,17 +1556,20 @@ long syscall_trace_enter(struct pt_regs *regs)
 	 * do_debug() and we need to set it again to restore the user
 	 * state.  If we entered on the slow path, TF was already set.
 	 */
-	if (test_thread_flag(TIF_SINGLESTEP))
+	if (work & _TIF_SINGLESTEP)
 		regs->flags |= X86_EFLAGS_TF;
 
-	/* do the secure computing check first */
-	if (secure_computing()) {
+	/*
+	 * Call seccomp_phase2 before running the other hooks so that
+	 * they can see any changes made by a seccomp tracer.
+	 */
+	if (phase1_result > 1 && seccomp_phase2(phase1_result)) {
 		/* seccomp failures shouldn't expose any additional code. */
 		ret = -1L;
 		goto out;
 	}
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_EMU)))
+	if (unlikely(work & _TIF_SYSCALL_EMU))
 		ret = -1L;
 
 	if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
@@ -1478,23 +1579,23 @@ long syscall_trace_enter(struct pt_regs *regs)
 	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
 		trace_sys_enter(regs, regs->orig_ax);
 
-	if (is_ia32_task())
-		audit_syscall_entry(AUDIT_ARCH_I386,
-				    regs->orig_ax,
-				    regs->bx, regs->cx,
-				    regs->dx, regs->si);
-#ifdef CONFIG_X86_64
-	else
-		audit_syscall_entry(AUDIT_ARCH_X86_64,
-				    regs->orig_ax,
-				    regs->di, regs->si,
-				    regs->dx, regs->r10);
-#endif
+	do_audit_syscall_entry(regs, arch);
 
 out:
 	return ret ?: regs->orig_ax;
 }
 
+long syscall_trace_enter(struct pt_regs *regs)
+{
+	u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
+	unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);
+
+	if (phase1_result == 0)
+		return regs->orig_ax;
+	else
+		return syscall_trace_enter_phase2(regs, arch, phase1_result);
+}
+
 void syscall_trace_leave(struct pt_regs *regs)
 {
 	bool step;
-- 
1.9.3



* [PATCH 6/7] x86_64,entry: Treat regs->ax the same in fastpath and slowpath syscalls
From: Andy Lutomirski @ 2014-07-15 19:32 UTC
  To: linux-kernel, Kees Cook, Will Drewry, James Morris
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, Andy Lutomirski

For slowpath syscalls, we initialize regs->ax to -ENOSYS and stick
the syscall number into regs->orig_ax prior to any possible tracing
and syscall execution.  This is user-visible ABI used by ptrace
syscall emulation and seccomp.

For fastpath syscalls, there's no good reason not to do the same
thing.  It's even slightly simpler than what we're currently doing.
It probably has no measurable performance impact.  It should have
no user-visible effect.

The purpose of this patch is to prepare for seccomp-based syscall
emulation in the fast path.  This change is just subtle enough that
I'm keeping it separate.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/include/asm/calling.h |  6 +++++-
 arch/x86/kernel/entry_64.S     | 11 +++--------
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
index cb4c73b..76659b6 100644
--- a/arch/x86/include/asm/calling.h
+++ b/arch/x86/include/asm/calling.h
@@ -85,7 +85,7 @@ For 32-bit we have the following conventions - kernel is built with
 #define ARGOFFSET	R11
 #define SWFRAME		ORIG_RAX
 
-	.macro SAVE_ARGS addskip=0, save_rcx=1, save_r891011=1
+	.macro SAVE_ARGS addskip=0, save_rcx=1, save_r891011=1, rax_enosys=0
 	subq  $9*8+\addskip, %rsp
 	CFI_ADJUST_CFA_OFFSET	9*8+\addskip
 	movq_cfi rdi, 8*8
@@ -96,7 +96,11 @@ For 32-bit we have the following conventions - kernel is built with
 	movq_cfi rcx, 5*8
 	.endif
 
+	.if \rax_enosys
+	movq $-ENOSYS, 4*8(%rsp)
+	.else
 	movq_cfi rax, 4*8
+	.endif
 
 	.if \save_r891011
 	movq_cfi r8,  3*8
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index b25ca96..432c190 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -405,8 +405,8 @@ GLOBAL(system_call_after_swapgs)
 	 * and short:
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	SAVE_ARGS 8,0
-	movq  %rax,ORIG_RAX-ARGOFFSET(%rsp)
+	SAVE_ARGS 8, 0, rax_enosys=1
+	movq_cfi rax,(ORIG_RAX-ARGOFFSET)
 	movq  %rcx,RIP-ARGOFFSET(%rsp)
 	CFI_REL_OFFSET rip,RIP-ARGOFFSET
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
@@ -418,7 +418,7 @@ system_call_fastpath:
 	andl $__SYSCALL_MASK,%eax
 	cmpl $__NR_syscall_max,%eax
 #endif
-	ja badsys
+	ja ret_from_sys_call  /* and return regs->ax */
 	movq %r10,%rcx
 	call *sys_call_table(,%rax,8)  # XXX:	 rip relative
 	movq %rax,RAX-ARGOFFSET(%rsp)
@@ -477,10 +477,6 @@ sysret_signal:
 	FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
 	jmp int_check_syscall_exit_work
 
-badsys:
-	movq $-ENOSYS,RAX-ARGOFFSET(%rsp)
-	jmp ret_from_sys_call
-
 #ifdef CONFIG_AUDITSYSCALL
 	/*
 	 * Fast path for syscall audit without full syscall trace.
@@ -520,7 +516,6 @@ tracesys:
 	jz auditsys
 #endif
 	SAVE_REST
-	movq $-ENOSYS,RAX(%rsp) /* ptrace can change this for a bad syscall */
 	FIXUP_TOP_OF_STACK %rdi
 	movq %rsp,%rdi
 	call syscall_trace_enter
-- 
1.9.3



* [PATCH 7/7] x86_64,entry: Use split-phase syscall_trace_enter for 64-bit syscalls
From: Andy Lutomirski @ 2014-07-15 19:32 UTC
  To: linux-kernel, Kees Cook, Will Drewry, James Morris
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, Andy Lutomirski

On KVM on my box, this reduces the overhead of an always-accept
seccomp filter from ~130ns to ~17ns.  Most of that comes from
avoiding IRET on every syscall when seccomp is enabled.

In extremely approximate hacked-up benchmarking, just bypassing IRET
saves about 80ns, so there's another 43ns of savings here from
simplifying the seccomp path.
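
The numbers come from informal measurement; a rough sketch of the
methodology (assumed here, not taken from the patch) is to time a
cheap syscall in a tight loop with and without the always-accept
filter installed:

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	struct timespec a, b;
	int i, n = 1000000;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < n; i++)
		syscall(SYS_getppid);	/* force a real syscall */
	clock_gettime(CLOCK_MONOTONIC, &b);

	printf("%.1f ns/syscall\n",
	       ((b.tv_sec - a.tv_sec) * 1e9 +
		(b.tv_nsec - a.tv_nsec)) / n);
	return 0;
}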

The diffstat is also rather nice :)

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/entry_64.S | 40 ++++++++++++++++------------------------
 1 file changed, 16 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 432c190..13e0c1d 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -479,22 +479,6 @@ sysret_signal:
 
 #ifdef CONFIG_AUDITSYSCALL
 	/*
-	 * Fast path for syscall audit without full syscall trace.
-	 * We just call __audit_syscall_entry() directly, and then
-	 * jump back to the normal fast path.
-	 */
-auditsys:
-	movq %r10,%r9			/* 6th arg: 4th syscall arg */
-	movq %rdx,%r8			/* 5th arg: 3rd syscall arg */
-	movq %rsi,%rcx			/* 4th arg: 2nd syscall arg */
-	movq %rdi,%rdx			/* 3rd arg: 1st syscall arg */
-	movq %rax,%rsi			/* 2nd arg: syscall number */
-	movl $AUDIT_ARCH_X86_64,%edi	/* 1st arg: audit arch */
-	call __audit_syscall_entry
-	LOAD_ARGS 0		/* reload call-clobbered registers */
-	jmp system_call_fastpath
-
-	/*
 	 * Return fast path for syscall audit.  Call __audit_syscall_exit()
 	 * directly and then jump back to the fast path with TIF_SYSCALL_AUDIT
 	 * masked off.
@@ -511,17 +495,25 @@ sysret_audit:
 
 	/* Do syscall tracing */
 tracesys:
-#ifdef CONFIG_AUDITSYSCALL
-	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
-	jz auditsys
-#endif
+	leaq -REST_SKIP(%rsp), %rdi
+	movq $AUDIT_ARCH_X86_64, %rsi
+	call syscall_trace_enter_phase1
+	test %rax, %rax
+	jnz tracesys_phase2		/* if needed, run the slow path */
+	LOAD_ARGS 0			/* else restore clobbered regs */
+	jmp system_call_fastpath	/*      and return to the fast path */
+
+tracesys_phase2:
 	SAVE_REST
 	FIXUP_TOP_OF_STACK %rdi
-	movq %rsp,%rdi
-	call syscall_trace_enter
+	movq %rsp, %rdi
+	movq $AUDIT_ARCH_X86_64, %rsi
+	movq %rax,%rdx
+	call syscall_trace_enter_phase2
+
 	/*
 	 * Reload arg registers from stack in case ptrace changed them.
-	 * We don't reload %rax because syscall_trace_enter() returned
+	 * We don't reload %rax because syscall_trace_entry_phase2() returned
 	 * the value it wants us to use in the table lookup.
 	 */
 	LOAD_ARGS ARGOFFSET, 1
@@ -532,7 +524,7 @@ tracesys:
 	andl $__SYSCALL_MASK,%eax
 	cmpl $__NR_syscall_max,%eax
 #endif
-	ja   int_ret_from_sys_call	/* RAX(%rsp) set to -ENOSYS above */
+	ja   int_ret_from_sys_call	/* RAX(%rsp) is already set */
 	movq %r10,%rcx	/* fixup for C */
 	call *sys_call_table(,%rax,8)
 	movq %rax,RAX-ARGOFFSET(%rsp)
-- 
1.9.3



* Re: [PATCH 6/7] x86_64,entry: Treat regs->ax the same in fastpath and slowpath syscalls
From: Alexei Starovoitov @ 2014-07-16 20:08 UTC
  To: Andy Lutomirski
  Cc: LKML, Kees Cook, Will Drewry, James Morris, Oleg Nesterov,
	X86 ML, linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	LSM List

On Tue, Jul 15, 2014 at 12:32 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> For slowpath syscalls, we initialize regs->ax to -ENOSYS and stick
> the syscall number into regs->orig_ax prior to any possible tracing
> and syscall execution.  This is user-visible ABI used by ptrace
> syscall emulation and seccomp.
>
> For fastpath syscalls, there's no good reason not to do the same
> thing.  It's even slightly simpler than what we're currently doing.
> It probably has no measurable performance impact.  It should have
> no user-visible effect.
>
> The purpose of this patch is to prepare for seccomp-based syscall
> emulation in the fast path.  This change is just subtle enough that
> I'm keeping it separate.
>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  arch/x86/include/asm/calling.h |  6 +++++-
>  arch/x86/kernel/entry_64.S     | 11 +++--------
>  2 files changed, 8 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
> index cb4c73b..76659b6 100644
> --- a/arch/x86/include/asm/calling.h
> +++ b/arch/x86/include/asm/calling.h
> @@ -85,7 +85,7 @@ For 32-bit we have the following conventions - kernel is built with
>  #define ARGOFFSET      R11
>  #define SWFRAME                ORIG_RAX
>
> -       .macro SAVE_ARGS addskip=0, save_rcx=1, save_r891011=1
> +       .macro SAVE_ARGS addskip=0, save_rcx=1, save_r891011=1, rax_enosys=0
>         subq  $9*8+\addskip, %rsp
>         CFI_ADJUST_CFA_OFFSET   9*8+\addskip
>         movq_cfi rdi, 8*8
> @@ -96,7 +96,11 @@ For 32-bit we have the following conventions - kernel is built with
>         movq_cfi rcx, 5*8
>         .endif
>
> +       .if \rax_enosys
> +       movq $-ENOSYS, 4*8(%rsp)
> +       .else
>         movq_cfi rax, 4*8
> +       .endif
>
>         .if \save_r891011
>         movq_cfi r8,  3*8
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index b25ca96..432c190 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -405,8 +405,8 @@ GLOBAL(system_call_after_swapgs)
>          * and short:
>          */
>         ENABLE_INTERRUPTS(CLBR_NONE)
> -       SAVE_ARGS 8,0
> -       movq  %rax,ORIG_RAX-ARGOFFSET(%rsp)
> +       SAVE_ARGS 8, 0, rax_enosys=1
> +       movq_cfi rax,(ORIG_RAX-ARGOFFSET)

I think changing the rax store to a macro is unnecessary,
since it breaks the common asm style of the next line:
>         movq  %rcx,RIP-ARGOFFSET(%rsp)
It also makes the diff harder to grasp.

The following change from the next patch (7/7):

> -       ja   int_ret_from_sys_call      /* RAX(%rsp) set to -ENOSYS above */
> +       ja   int_ret_from_sys_call      /* RAX(%rsp) is already set */

probably belongs in this 6/7 patch as well.

The rest looks good to me.  I think it's a big improvement in
readability compared to v1.


* Re: [PATCH 2/7] seccomp: Refactor the filter callback and the API
From: Kees Cook @ 2014-07-16 20:12 UTC
  To: Andy Lutomirski
  Cc: LKML, Will Drewry, James Morris, Oleg Nesterov, x86,
	linux-arm-kernel, linux-mips, linux-arch, linux-security-module,
	Alexei Starovoitov

On Tue, Jul 15, 2014 at 12:32 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> The reason I did this is to add a seccomp API that will be usable
> for an x86 fast path.  The x86 entry code needs to use a rather
> expensive slow path for a syscall that might be visible to things
> like ptrace.  By splitting seccomp into two phases, we can check
> whether we need the slow path and then use the fast path if the
> filter allows the syscall or just returns some errno.
>
> As a side effect, I think the new code is much easier to understand
> than the old code.

I'd agree. The #ifdefs got a little weirder, but the actual code flow
was much easier to read. I wonder if "phase1" and "phase2" should be
renamed "pretrace" and "tracing" or something more meaningful? Or
"fast" and "slow"?

> This has one user-visible effect: the audit record written for
> SECCOMP_RET_TRACE is now a simple indication that SECCOMP_RET_TRACE
> happened.  It used to depend in a complicated way on what the tracer
> did.  I couldn't make much sense of it.

I think this change is okay. The only way to get the audit record to
report SIGSYS before was to have an additional signal come in and kill
it while the tracer was working on it. Which is confusing too. I like
this way better.

-Kees

-- 
Kees Cook
Chrome OS Security


* Re: [PATCH 5/7] x86: Split syscall_trace_enter into two phases
From: Kees Cook @ 2014-07-16 20:21 UTC
  To: Andy Lutomirski
  Cc: LKML, Will Drewry, James Morris, Oleg Nesterov, x86,
	linux-arm-kernel, linux-mips, linux-arch, linux-security-module,
	Alexei Starovoitov

On Tue, Jul 15, 2014 at 12:32 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> This splits syscall_trace_enter into syscall_trace_enter_phase1 and
> syscall_trace_enter_phase2.  Only phase 2 has full pt_regs, and only
> phase 2 is permitted to modify any of pt_regs except for orig_ax.
>
> The intent is that phase 1 can be called from the syscall fast path.
>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  arch/x86/include/asm/ptrace.h |   5 ++
>  arch/x86/kernel/ptrace.c      | 139 ++++++++++++++++++++++++++++++++++++------
>  2 files changed, 125 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
> index 14fd6fd..dcbfb49 100644
> --- a/arch/x86/include/asm/ptrace.h
> +++ b/arch/x86/include/asm/ptrace.h
> @@ -75,6 +75,11 @@ convert_ip_to_linear(struct task_struct *child, struct pt_regs *regs);
>  extern void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
>                          int error_code, int si_code);
>
> +
> +extern unsigned long syscall_trace_enter_phase1(struct pt_regs *, u32 arch);
> +extern long syscall_trace_enter_phase2(struct pt_regs *, u32 arch,
> +                                      unsigned long phase1_result);
> +
>  extern long syscall_trace_enter(struct pt_regs *);
>  extern void syscall_trace_leave(struct pt_regs *);
>
> diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
> index 39296d2..8e05418 100644
> --- a/arch/x86/kernel/ptrace.c
> +++ b/arch/x86/kernel/ptrace.c
> @@ -1441,13 +1441,111 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
>         force_sig_info(SIGTRAP, &info, tsk);
>  }
>
> +static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
> +{
> +       if (arch == AUDIT_ARCH_X86_64) {
> +               audit_syscall_entry(arch, regs->orig_ax, regs->di,
> +                                   regs->si, regs->dx, regs->r10);
> +       } else {
> +               audit_syscall_entry(arch, regs->orig_ax, regs->bx,
> +                                   regs->cx, regs->dx, regs->si);
> +       }
> +}
> +
>  /*
> - * We must return the syscall number to actually look up in the table.
> - * This can be -1L to skip running any syscall at all.
> + * We can return 0 to resume the syscall or anything else to go to phase
> + * 2.  If we resume the syscall, we need to put something appropriate in
> + * regs->orig_ax.
> + *
> + * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
> + * are fully functional.
> + *
> + * For phase 2's benefit, our return value is:
> + * 0: resume the syscall
> + * 1: go to phase 2; no seccomp phase 2 needed
> + * 2: go to phase 2; pass return value to seccomp
>   */
> -long syscall_trace_enter(struct pt_regs *regs)
> +unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
> +{
> +       unsigned long ret = 0;
> +       u32 work;
> +
> +       BUG_ON(regs != task_pt_regs(current));
> +
> +       work = ACCESS_ONCE(current_thread_info()->flags) &
> +               _TIF_WORK_SYSCALL_ENTRY;
> +
> +#ifdef CONFIG_SECCOMP
> +       /*
> +        * Do seccomp first -- it should minimize exposure of other
> +        * code, and keeping seccomp fast is probably more valuable
> +        * than the rest of this.
> +        */
> +       if (work & _TIF_SECCOMP) {
> +               struct seccomp_data sd;
> +
> +               sd.arch = arch;
> +               sd.nr = regs->orig_ax;
> +               sd.instruction_pointer = regs->ip;
> +               if (arch == AUDIT_ARCH_X86_64) {
> +                       sd.args[0] = regs->di;
> +                       sd.args[1] = regs->si;
> +                       sd.args[2] = regs->dx;
> +                       sd.args[3] = regs->r10;
> +                       sd.args[4] = regs->r8;
> +                       sd.args[5] = regs->r9;
> +               } else {
> +                       sd.args[0] = regs->bx;
> +                       sd.args[1] = regs->cx;
> +                       sd.args[2] = regs->dx;
> +                       sd.args[3] = regs->si;
> +                       sd.args[4] = regs->di;
> +                       sd.args[5] = regs->bp;
> +               }
> +
> +               BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
> +               BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
> +
> +               ret = seccomp_phase1(&sd);
> +               if (ret == SECCOMP_PHASE1_SKIP) {
> +                       regs->orig_ax = -ENOSYS;

Before, seccomp didn't touch orig_ax on a skip. I don't see any
problem with this, and it's probably more clear this way, but are you
sure there aren't unexpected side-effects from this?

-Kees

> +                       ret = 0;
> +               } else if (ret != SECCOMP_PHASE1_OK) {
> +                       return ret;  /* Go directly to phase 2 */
> +               }
> +
> +               work &= ~_TIF_SECCOMP;
> +       }
> +#endif
> +
> +       /* Do our best to finish without phase 2. */
> +       if (work == 0)
> +               return ret;  /* seccomp only (ret == 0 here) */
> +
> +#ifdef CONFIG_AUDITSYSCALL
> +       if (work == _TIF_SYSCALL_AUDIT) {
> +               /*
> +                * If there is no more work to be done except auditing,
> +                * then audit in phase 1.  Phase 2 always audits, so, if
> +                * we audit here, then we can't go on to phase 2.
> +                */
> +               do_audit_syscall_entry(regs, arch);
> +               return 0;
> +       }
> +#endif
> +
> +       return 1;  /* Something is enabled that we can't handle in phase 1 */
> +}
> +
> +/* Returns the syscall nr to run (which should match regs->orig_ax). */
> +long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
> +                               unsigned long phase1_result)
>  {
>         long ret = 0;
> +       u32 work = ACCESS_ONCE(current_thread_info()->flags) &
> +               _TIF_WORK_SYSCALL_ENTRY;
> +
> +       BUG_ON(regs != task_pt_regs(current));
>
>         user_exit();
>
> @@ -1458,17 +1556,20 @@ long syscall_trace_enter(struct pt_regs *regs)
>          * do_debug() and we need to set it again to restore the user
>          * state.  If we entered on the slow path, TF was already set.
>          */
> -       if (test_thread_flag(TIF_SINGLESTEP))
> +       if (work & _TIF_SINGLESTEP)
>                 regs->flags |= X86_EFLAGS_TF;
>
> -       /* do the secure computing check first */
> -       if (secure_computing()) {
> +       /*
> +        * Call seccomp_phase2 before running the other hooks so that
> +        * they can see any changes made by a seccomp tracer.
> +        */
> +       if (phase1_result > 1 && seccomp_phase2(phase1_result)) {
>                 /* seccomp failures shouldn't expose any additional code. */
>                 ret = -1L;
>                 goto out;
>         }
>
> -       if (unlikely(test_thread_flag(TIF_SYSCALL_EMU)))
> +       if (unlikely(work & _TIF_SYSCALL_EMU))
>                 ret = -1L;
>
>         if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
> @@ -1478,23 +1579,23 @@ long syscall_trace_enter(struct pt_regs *regs)
>         if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
>                 trace_sys_enter(regs, regs->orig_ax);
>
> -       if (is_ia32_task())
> -               audit_syscall_entry(AUDIT_ARCH_I386,
> -                                   regs->orig_ax,
> -                                   regs->bx, regs->cx,
> -                                   regs->dx, regs->si);
> -#ifdef CONFIG_X86_64
> -       else
> -               audit_syscall_entry(AUDIT_ARCH_X86_64,
> -                                   regs->orig_ax,
> -                                   regs->di, regs->si,
> -                                   regs->dx, regs->r10);
> -#endif
> +       do_audit_syscall_entry(regs, arch);
>
>  out:
>         return ret ?: regs->orig_ax;
>  }
>
> +long syscall_trace_enter(struct pt_regs *regs)
> +{
> +       u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
> +       unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);
> +
> +       if (phase1_result == 0)
> +               return regs->orig_ax;
> +       else
> +               return syscall_trace_enter_phase2(regs, arch, phase1_result);
> +}
> +
>  void syscall_trace_leave(struct pt_regs *regs)
>  {
>         bool step;
> --
> 1.9.3
>
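
For reference, do_audit_syscall_entry() is called in both phases above
but its body falls outside the quoted hunk.  Reconstructed from the
open-coded audit_syscall_entry() calls it replaces (a best-effort
sketch, not the patch verbatim), it presumably looks roughly like:

	static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
	{
		if (arch == AUDIT_ARCH_X86_64) {
			audit_syscall_entry(arch, regs->orig_ax, regs->di,
					    regs->si, regs->dx, regs->r10);
		} else {
			/* ia32: syscall arguments arrive in bx/cx/dx/si */
			audit_syscall_entry(arch, regs->orig_ax, regs->bx,
					    regs->cx, regs->dx, regs->si);
		}
	}

Because both phases go through the same helper, the same audit record
is emitted no matter which phase performs the audit.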



-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/7] Two-phase seccomp and x86 tracing changes
  2014-07-15 19:32 [PATCH 0/7] Two-phase seccomp and x86 tracing changes Andy Lutomirski
                   ` (6 preceding siblings ...)
  2014-07-15 19:32 ` [PATCH 7/7] x86_64,entry: Use split-phase syscall_trace_enter for 64-bit syscalls Andy Lutomirski
@ 2014-07-16 20:41 ` Kees Cook
  2014-07-16 21:17   ` Andy Lutomirski
  7 siblings, 1 reply; 19+ messages in thread
From: Kees Cook @ 2014-07-16 20:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, Will Drewry, James Morris, Oleg Nesterov, x86,
	linux-arm-kernel, linux-mips, linux-arch, linux-security-module,
	Alexei Starovoitov

On Tue, Jul 15, 2014 at 12:32 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> This is both a cleanup and a speedup.  It reduces overhead due to
> installing a trivial seccomp filter by 87%.  The speedup comes from
> avoiding the full syscall tracing mechanism for filters that don't
> return SECCOMP_RET_TRACE.
>
> This series works by splitting the seccomp hooks into two phases.
> The first phase evaluates the filter; it can skip syscalls, allow
> them, kill the calling task, or pass a u32 to the second phase.  The
> second phase requires a full tracing context, and it sends ptrace
> events if necessary.
>
> Once this is done, I implemented a similar split for the x86 syscall
> entry work.  The C callback is invoked in two phases: the first has
> only a partial frame, and it can request phase 2 processing with a
> full frame.
>
> Finally, I switch the 64-bit system_call code to use the new split
> entry work.  This is a net deletion of assembly code: it replaces
> all of the audit entry muck.
>
> In the process, I fixed some bugs.
>
> If this is acceptable, someone can do the same tweak for the
> ia32entry and entry_32 code.
>
> This passes all seccomp tests that I know of, except for the ones
> that don't work on current kernels.

After fighting a bit with merging this with the tsync series, I can
confirm this all behaves nicely on x86_64 and ARM.

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 6/7] x86_64,entry: Treat regs->ax the same in fastpath and slowpath syscalls
  2014-07-16 20:08   ` Alexei Starovoitov
@ 2014-07-16 20:53     ` Andy Lutomirski
  0 siblings, 0 replies; 19+ messages in thread
From: Andy Lutomirski @ 2014-07-16 20:53 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: LKML, Kees Cook, Will Drewry, James Morris, Oleg Nesterov,
	X86 ML, linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	LSM List

On Wed, Jul 16, 2014 at 1:08 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Tue, Jul 15, 2014 at 12:32 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> For slowpath syscalls, we initialize regs->ax to -ENOSYS and stick
>> the syscall number into regs->orig_ax prior to any possible tracing
>> and syscall execution.  This is user-visible ABI used by ptrace
>> syscall emulation and seccomp.
>>
>> For fastpath syscalls, there's no good reason not to do the same
>> thing.  It's even slightly simpler than what we're currently doing.
>> It probably has no measurable performance impact.  It should have
>> no user-visible effect.
>>
>> The purpose of this patch is to prepare for seccomp-based syscall
>> emulation in the fast path.  This change is just subtle enough that
>> I'm keeping it separate.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>  arch/x86/include/asm/calling.h |  6 +++++-
>>  arch/x86/kernel/entry_64.S     | 11 +++--------
>>  2 files changed, 8 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
>> index cb4c73b..76659b6 100644
>> --- a/arch/x86/include/asm/calling.h
>> +++ b/arch/x86/include/asm/calling.h
>> @@ -85,7 +85,7 @@ For 32-bit we have the following conventions - kernel is built with
>>  #define ARGOFFSET      R11
>>  #define SWFRAME                ORIG_RAX
>>
>> -       .macro SAVE_ARGS addskip=0, save_rcx=1, save_r891011=1
>> +       .macro SAVE_ARGS addskip=0, save_rcx=1, save_r891011=1, rax_enosys=0
>>         subq  $9*8+\addskip, %rsp
>>         CFI_ADJUST_CFA_OFFSET   9*8+\addskip
>>         movq_cfi rdi, 8*8
>> @@ -96,7 +96,11 @@ For 32-bit we have the following conventions - kernel is built with
>>         movq_cfi rcx, 5*8
>>         .endif
>>
>> +       .if \rax_enosys
>> +       movq $-ENOSYS, 4*8(%rsp)
>> +       .else
>>         movq_cfi rax, 4*8
>> +       .endif
>>
>>         .if \save_r891011
>>         movq_cfi r8,  3*8
>> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> index b25ca96..432c190 100644
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -405,8 +405,8 @@ GLOBAL(system_call_after_swapgs)
>>          * and short:
>>          */
>>         ENABLE_INTERRUPTS(CLBR_NONE)
>> -       SAVE_ARGS 8,0
>> -       movq  %rax,ORIG_RAX-ARGOFFSET(%rsp)
>> +       SAVE_ARGS 8, 0, rax_enosys=1
>> +       movq_cfi rax,(ORIG_RAX-ARGOFFSET)
>
> I think changing the rax store to a macro is unnecessary,
> since it breaks the common asm style of the next line:

I think it's necessary to maintain CFI correctness.  rax is no longer
saved in "ax", so "orig_ax" is now the correct slot.

>>         movq  %rcx,RIP-ARGOFFSET(%rsp)

This, in contrast, is the saved rip, not the saved rcx.  rcx is lost
when syscall happens -- the CPU overwrites it with the return rip.

> Also it made the diff harder to grasp.


>
> The change from the next patch 7/7:
>
>> -       ja   int_ret_from_sys_call      /* RAX(%rsp) set to -ENOSYS above */
>> +       ja   int_ret_from_sys_call      /* RAX(%rsp) is already set */
>
> probably belongs in this 6/7 patch as well.

Agreed.  Will do for v3.

--Andy

>
> The rest looks good to me. I think it's a big improvement in readability
> compared to v1.



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/7] seccomp: Refactor the filter callback and the API
  2014-07-16 20:12   ` Kees Cook
@ 2014-07-16 20:56     ` Andy Lutomirski
  2014-07-16 21:56       ` Kees Cook
  0 siblings, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2014-07-16 20:56 UTC (permalink / raw)
  To: Kees Cook
  Cc: LKML, Will Drewry, James Morris, Oleg Nesterov, x86,
	linux-arm-kernel, linux-mips, linux-arch, linux-security-module,
	Alexei Starovoitov

On Wed, Jul 16, 2014 at 1:12 PM, Kees Cook <keescook@chromium.org> wrote:
> On Tue, Jul 15, 2014 at 12:32 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> The reason I did this is to add a seccomp API that will be usable
>> for an x86 fast path.  The x86 entry code needs to use a rather
>> expensive slow path for a syscall that might be visible to things
>> like ptrace.  By splitting seccomp into two phases, we can check
>> whether we need the slow path and then use the fast path if the
>> filter allows the syscall or just returns some errno.
>>
>> As a side effect, I think the new code is much easier to understand
>> than the old code.
>
> I'd agree. The #ifdefs got a little weirder, but the actual code flow
> was much easier to read. I wonder if "phase1" and "phase2" should be
> renamed "pretrace" and "tracing" or something more meaningful? Or
> "fast" and "slow"?

Cue the bikeshedding :)

I like "phase1" and "phase2" because it makes it clear that phase1 has
to come first.  But I'd be amenable to counterarguments.
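
To make the split concrete, a phase-1 consumer looks roughly like the
following sketch (it assumes the SECCOMP_PHASE1_OK/SKIP constants and
the seccomp_phase1()/seccomp_phase2() signatures from this series; it
is not the exact kernel code):

	static long sketch_seccomp_split(struct seccomp_data *sd)
	{
		u32 action = seccomp_phase1(sd);   /* cheap: no full frame,
						    * no ptrace machinery */

		if (action == SECCOMP_PHASE1_OK)
			return 0;	/* allowed: stay on the fast path */
		if (action == SECCOMP_PHASE1_SKIP)
			return -1;	/* denied: skip the syscall */

		/* e.g. SECCOMP_RET_TRACE: take the slow path and hand the
		 * packed u32 to phase 2, which may notify a ptracer */
		return seccomp_phase2(action) ? -1 : 0;
	}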

>
>> This has one user-visible effect: the audit record written for
>> SECCOMP_RET_TRACE is now a simple indication that SECCOMP_RET_TRACE
>> happened.  It used to depend in a complicated way on what the tracer
>> did.  I couldn't make much sense of it.
>
> I think this change is okay. The only way to get the audit record to
> report SIGSYS before was to have an additional signal come in and kill
> the task while the tracer was working on it, which is confusing too. I
> like this way better.

Thanks :)

--Andy

>
> -Kees
>
> --
> Kees Cook
> Chrome OS Security



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/7] x86: Split syscall_trace_enter into two phases
  2014-07-16 20:21   ` Kees Cook
@ 2014-07-16 21:15     ` Andy Lutomirski
  2014-07-16 21:57       ` Kees Cook
  0 siblings, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2014-07-16 21:15 UTC (permalink / raw)
  To: Kees Cook
  Cc: LKML, Will Drewry, James Morris, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov

On Wed, Jul 16, 2014 at 1:21 PM, Kees Cook <keescook@chromium.org> wrote:
> On Tue, Jul 15, 2014 at 12:32 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> +
>> +               ret = seccomp_phase1(&sd);
>> +               if (ret == SECCOMP_PHASE1_SKIP) {
>> +                       regs->orig_ax = -ENOSYS;
>
> Before, seccomp didn't touch orig_ax on a skip. I don't see any
> problem with this, and it's probably clearer this way, but are you
> sure there aren't unexpected side-effects from this?

It's necessary to cause the syscall to be skipped -- see syscall_trace_enter.

That being said, setting it to -ENOSYS is nonsense and probably
confused you at least as much as it confused me.  It should be
regs->orig_ax = -1.
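
To see why any invalid number does the job: the slow-path dispatch is
effectively the following (a hypothetical C rendering; the real check
is the unsigned compare feeding the "ja int_ret_from_sys_call" quoted
elsewhere in this thread, and NR_syscalls stands in for the arch's
table size):

	if (regs->orig_ax < NR_syscalls)	/* orig_ax is unsigned, so
						 * -1 wraps huge and fails */
		regs->ax = sys_call_table[regs->orig_ax](regs->di, regs->si,
							 regs->dx, regs->r10,
							 regs->r8, regs->r9);
	/* otherwise fall through: regs->ax keeps its -ENOSYS default */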

--Andy

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/7] Two-phase seccomp and x86 tracing changes
  2014-07-16 20:41 ` [PATCH 0/7] Two-phase seccomp and x86 tracing changes Kees Cook
@ 2014-07-16 21:17   ` Andy Lutomirski
  0 siblings, 0 replies; 19+ messages in thread
From: Andy Lutomirski @ 2014-07-16 21:17 UTC (permalink / raw)
  To: Kees Cook
  Cc: LKML, Will Drewry, James Morris, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov

On Wed, Jul 16, 2014 at 1:41 PM, Kees Cook <keescook@chromium.org> wrote:
> On Tue, Jul 15, 2014 at 12:32 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> This is both a cleanup and a speedup.  It reduces overhead due to
>> installing a trivial seccomp filter by 87%.  The speedup comes from
>> avoiding the full syscall tracing mechanism for filters that don't
>> return SECCOMP_RET_TRACE.
>>
>> This series works by splitting the seccomp hooks into two phases.
>> The first phase evaluates the filter; it can skip syscalls, allow
>> them, kill the calling task, or pass a u32 to the second phase.  The
>> second phase requires a full tracing context, and it sends ptrace
>> events if necessary.
>>
>> Once this is done, I implemented a similar split for the x86 syscall
>> entry work.  The C callback is invoked in two phases: the first has
>> only a partial frame, and it can request phase 2 processing with a
>> full frame.
>>
>> Finally, I switch the 64-bit system_call code to use the new split
>> entry work.  This is a net deletion of assembly code: it replaces
>> all of the audit entry muck.
>>
>> In the process, I fixed some bugs.
>>
>> If this is acceptable, someone can do the same tweak for the
>> ia32entry and entry_32 code.
>>
>> This passes all seccomp tests that I know of, except for the ones
>> that don't work on current kernels.
>
> After fighting a bit with merging this with the tsync series, I can
> confirm this all behaves nicely on x86_64 and ARM.
>

I'll hold off on v3 until your stuff lands.

--Andy

> -Kees
>
> --
> Kees Cook
> Chrome OS Security



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/7] seccomp: Refactor the filter callback and the API
  2014-07-16 20:56     ` Andy Lutomirski
@ 2014-07-16 21:56       ` Kees Cook
  2014-07-16 22:06         ` Andy Lutomirski
  0 siblings, 1 reply; 19+ messages in thread
From: Kees Cook @ 2014-07-16 21:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, Will Drewry, James Morris, Oleg Nesterov, x86,
	linux-arm-kernel, linux-mips, linux-arch, linux-security-module,
	Alexei Starovoitov

On Wed, Jul 16, 2014 at 1:56 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Jul 16, 2014 at 1:12 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Tue, Jul 15, 2014 at 12:32 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> The reason I did this is to add a seccomp API that will be usable
>>> for an x86 fast path.  The x86 entry code needs to use a rather
>>> expensive slow path for a syscall that might be visible to things
>>> like ptrace.  By splitting seccomp into two phases, we can check
>>> whether we need the slow path and then use the fast path if the
>>> filter allows the syscall or just returns some errno.
>>>
>>> As a side effect, I think the new code is much easier to understand
>>> than the old code.
>>
>> I'd agree. The #ifdefs got a little weirder, but the actual code flow
>> was much easier to read. I wonder if "phase1" and "phase2" should be
>> renamed "pretrace" and "tracing" or something more meaningful? Or
>> "fast" and "slow"?
>
> Cue the bikeshedding :)
>
> I like "phase1" and "phase2" because it makes it clear that phase1 has
> to come first.  But I'd be amenable to counterarguments.

That works. I didn't have a strong feeling about it. I was just
wondering if there was a good way to self-document that phase1 is on
the fast path and phase2 is on the slow path for tracing. The
existing comments really should be sufficient, though.

You mentioned architectures providing "sd" directly. I wonder if that
new optional ability should be mentioned in the Kconfig help text that
defines what's needed for an arch to support SECCOMP_FILTER?
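
For context, the "sd" ability is patch 3/7: arch code may build the
seccomp_data itself and pass it in, so seccomp doesn't have to re-read
the registers.  On x86_64 that looks roughly like this sketch (field
names come from the seccomp uapi header; the exact code in the series
may differ):

	static u32 sketch_arch_phase1(struct pt_regs *regs, u32 arch)
	{
		struct seccomp_data sd = {
			.nr = regs->orig_ax,
			.arch = arch,
			.instruction_pointer = regs->ip,
			.args = { regs->di, regs->si, regs->dx,
				  regs->r10, regs->r8, regs->r9 },
		};

		return seccomp_phase1(&sd);	/* no second unpacking of
						 * task state in seccomp */
	}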

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/7] x86: Split syscall_trace_enter into two phases
  2014-07-16 21:15     ` Andy Lutomirski
@ 2014-07-16 21:57       ` Kees Cook
  0 siblings, 0 replies; 19+ messages in thread
From: Kees Cook @ 2014-07-16 21:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, Will Drewry, James Morris, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov

On Wed, Jul 16, 2014 at 2:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Jul 16, 2014 at 1:21 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Tue, Jul 15, 2014 at 12:32 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>
>>> +
>>> +               ret = seccomp_phase1(&sd);
>>> +               if (ret == SECCOMP_PHASE1_SKIP) {
>>> +                       regs->orig_ax = -ENOSYS;
>>
>> Before, seccomp didn't touch orig_ax on a skip. I don't see any
>> problem with this, and it's probably clearer this way, but are you
>> sure there aren't unexpected side-effects from this?
>
> It's necessary to cause the syscall to be skipped -- see syscall_trace_enter.
>
> That being said, setting it to -ENOSYS is nonsense and probably
> confused you at least as much as it confused me.  It should be
> regs->orig_ax = -1.

Yes, I think that would be better.

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/7] seccomp: Refactor the filter callback and the API
  2014-07-16 21:56       ` Kees Cook
@ 2014-07-16 22:06         ` Andy Lutomirski
  0 siblings, 0 replies; 19+ messages in thread
From: Andy Lutomirski @ 2014-07-16 22:06 UTC (permalink / raw)
  To: Kees Cook
  Cc: LKML, Will Drewry, James Morris, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov

On Wed, Jul 16, 2014 at 2:56 PM, Kees Cook <keescook@chromium.org> wrote:
> On Wed, Jul 16, 2014 at 1:56 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Wed, Jul 16, 2014 at 1:12 PM, Kees Cook <keescook@chromium.org> wrote:
>>> On Tue, Jul 15, 2014 at 12:32 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> The reason I did this is to add a seccomp API that will be usable
>>>> for an x86 fast path.  The x86 entry code needs to use a rather
>>>> expensive slow path for a syscall that might be visible to things
>>>> like ptrace.  By splitting seccomp into two phases, we can check
>>>> whether we need the slow path and then use the fast path if the
>>>> filter allows the syscall or just returns some errno.
>>>>
>>>> As a side effect, I think the new code is much easier to understand
>>>> than the old code.
>>>
>>> I'd agree. The #ifdefs got a little weirder, but the actual code flow
>>> was much easier to read. I wonder if "phase1" and "phase2" should be
>>> renamed "pretrace" and "tracing" or something more meaningful? Or
>>> "fast" and "slow"?
>>
>> Cue the bikeshedding :)
>>
>> I like "phase1" and "phase2" because it makes it clear that phase1 has
>> to come first.  But I'd be amenable to counterarguments.
>
> That works. I didn't have a strong feeling about it. I was just
> wondering if there was a good way to self-document that phase1 is on
> the fast path and phase2 is on the slow path for tracing. The
> existing comments really should be sufficient, though.
>
> You mentioned architectures providing "sd" directly. I wonder if that
> new optional ability should be mentioned in the Kconfig help text that
> defines what's needed for an arch to support SECCOMP_FILTER?

Good call.  Queued for v2.

>
> -Kees
>
> --
> Kees Cook
> Chrome OS Security



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2014-07-16 22:06 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-15 19:32 [PATCH 0/7] Two-phase seccomp and x86 tracing changes Andy Lutomirski
2014-07-15 19:32 ` [PATCH 1/7] seccomp,x86,arm,mips,s390: Remove nr parameter from secure_computing Andy Lutomirski
2014-07-15 19:32 ` [PATCH 2/7] seccomp: Refactor the filter callback and the API Andy Lutomirski
2014-07-16 20:12   ` Kees Cook
2014-07-16 20:56     ` Andy Lutomirski
2014-07-16 21:56       ` Kees Cook
2014-07-16 22:06         ` Andy Lutomirski
2014-07-15 19:32 ` [PATCH 3/7] seccomp: Allow arch code to provide seccomp_data Andy Lutomirski
2014-07-15 19:32 ` [PATCH 4/7] x86,x32,audit: Fix x32's AUDIT_ARCH wrt audit Andy Lutomirski
2014-07-15 19:32 ` [PATCH 5/7] x86: Split syscall_trace_enter into two phases Andy Lutomirski
2014-07-16 20:21   ` Kees Cook
2014-07-16 21:15     ` Andy Lutomirski
2014-07-16 21:57       ` Kees Cook
2014-07-15 19:32 ` [PATCH 6/7] x86_64,entry: Treat regs->ax the same in fastpath and slowpath syscalls Andy Lutomirski
2014-07-16 20:08   ` Alexei Starovoitov
2014-07-16 20:53     ` Andy Lutomirski
2014-07-15 19:32 ` [PATCH 7/7] x86_64,entry: Use split-phase syscall_trace_enter for 64-bit syscalls Andy Lutomirski
2014-07-16 20:41 ` [PATCH 0/7] Two-phase seccomp and x86 tracing changes Kees Cook
2014-07-16 21:17   ` Andy Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).