[PATCH v5 0/5] x86: two-phase syscall tracing and seccomp fastpath

* [PATCH v5 0/5] x86: two-phase syscall tracing and seccomp fastpath
From: Andy Lutomirski @ 2014-09-05 22:13 UTC
  To: linux-kernel, Kees Cook, Will Drewry, Oleg Nesterov
  Cc: x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa,
	Frederic Weisbecker, Andy Lutomirski

This applies to:
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git seccomp/fastpath

Gitweb:
https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=seccomp/fastpath

This is both a cleanup and a speedup.  It reduces overhead due to
installing a trivial seccomp filter by 87%.  The speedup comes from
avoiding the full syscall tracing mechanism for filters that don't
return SECCOMP_RET_TRACE.

This series depends on splitting the seccomp hooks into two phases.
The first phase evaluates the filter; it can skip syscalls, allow
them, kill the calling task, or pass a u32 to the second phase.  The
second phase requires a full tracing context, and it sends ptrace
events if necessary.  The seccomp core part is in Kees' seccomp/fastpath
tree.
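
As a rough illustration of the phase 1 outcomes listed above, here is a
user-space model, not the kernel code.  The SECCOMP_RET_* values are
the ones from the seccomp UAPI header; the enum and the function name
are hypothetical, and the mapping of ERRNO and TRAP onto "skip"
reflects my reading of the seccomp core and should be treated as an
assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Action values from the seccomp UAPI header (linux/seccomp.h). */
#define SECCOMP_RET_KILL   0x00000000U
#define SECCOMP_RET_TRAP   0x00030000U
#define SECCOMP_RET_ERRNO  0x00050000U
#define SECCOMP_RET_TRACE  0x7ff00000U
#define SECCOMP_RET_ALLOW  0x7fff0000U
#define SECCOMP_RET_ACTION 0x7fff0000U	/* mask selecting the action */

/* Hypothetical outcome labels, used by this model only. */
enum model_outcome {
	MODEL_ALLOW,	/* run the syscall; no tracing context needed */
	MODEL_SKIP,	/* do not run the syscall; no tracing context needed */
	MODEL_KILL,	/* kill the calling task */
	MODEL_PHASE2,	/* hand the u32 to phase 2 (full tracing context) */
};

/* Classify a filter return value the way phase 1 is described above. */
enum model_outcome model_seccomp_phase1(uint32_t filter_ret)
{
	switch (filter_ret & SECCOMP_RET_ACTION) {
	case SECCOMP_RET_ALLOW:
		return MODEL_ALLOW;
	case SECCOMP_RET_ERRNO:
	case SECCOMP_RET_TRAP:
		return MODEL_SKIP;
	case SECCOMP_RET_TRACE:
		return MODEL_PHASE2;	/* only this case needs ptrace events */
	case SECCOMP_RET_KILL:
	default:
		return MODEL_KILL;
	}
}
```

Only the SECCOMP_RET_TRACE case leaves the cheap path in this model,
which is where the speedup for filters that never return it comes from.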

These patches implement a similar split for the x86 syscall
entry work.  The C callback is invoked in two phases: the first has
only a partial frame, and it can request phase 2 processing with a
full frame.

Finally, I switch the 64-bit system_call code to use the new split
entry work.  This is a net deletion of assembly code: it replaces
all of the audit entry muck.

In the process, I fixed some bugs.

If this is acceptable, someone can do the same tweak for the
ia32entry and entry_32 code.

This passes all seccomp tests that I know of.

Changes from v4:
 - Rebased (which seems to have been a no-op)
 - Fixed embarrassing bug that broke allnoconfig
   (patch 3 was missing an ifdef).

Changes from v3:
 - Dropped the core seccomp changes from the email -- Kees has applied them.
 - Added patch 2 (the TIF_NOHZ change).
 - Fixed TIF_NOHZ in the two-phase entry code (thanks, Oleg).

Changes from v2:
 - Fixed 32-bit x86 build (and the tests pass).
 - Put the doc patch where it belongs.

Changes from v1:
 - Rebased on top of Kees' shiny new seccomp tree (no effect on the x86
   part).
 - Improved patch 6 vs patch 7 split (thanks Alexei!)
 - Fixed bogus -ENOSYS in patch 5 (thanks Kees!)
 - Improved changelog message in patch 6.

Changes from RFC version:
 - The first three patches are more or less the same
 - The rest is more or less a rewrite

Andy Lutomirski (5):
  x86,x32,audit: Fix x32's AUDIT_ARCH wrt audit
  x86,entry: Only call user_exit if TIF_NOHZ
  x86: Split syscall_trace_enter into two phases
  x86_64,entry: Treat regs->ax the same in fastpath and slowpath
    syscalls
  x86_64,entry: Use split-phase syscall_trace_enter for 64-bit syscalls

 arch/x86/include/asm/calling.h |   6 +-
 arch/x86/include/asm/ptrace.h  |   5 ++
 arch/x86/kernel/entry_64.S     |  51 +++++--------
 arch/x86/kernel/ptrace.c       | 165 +++++++++++++++++++++++++++++++++--------
 4 files changed, 164 insertions(+), 63 deletions(-)

-- 
1.9.3

* [PATCH v5 1/5] x86,x32,audit: Fix x32's AUDIT_ARCH wrt audit
From: Andy Lutomirski @ 2014-09-05 22:13 UTC
  To: linux-kernel, Kees Cook, Will Drewry, Oleg Nesterov
  Cc: x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa,
	Frederic Weisbecker, Andy Lutomirski

is_compat_task() is the wrong check for audit arch; the check should
be is_ia32_task(): x32 syscalls should be AUDIT_ARCH_X86_64, not
AUDIT_ARCH_I386.

CONFIG_AUDITSYSCALL is currently incompatible with x32, so this has
no visible effect.
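
To make the difference concrete, here is a small user-space model, not
kernel code.  The AUDIT_ARCH_* values are the ones from the audit UAPI
header; the ABI enum and the model_* helpers are hypothetical stand-ins
for is_compat_task() and is_ia32_task().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in the audit UAPI header (linux/audit.h). */
#define AUDIT_ARCH_I386   0x40000003U
#define AUDIT_ARCH_X86_64 0xC000003EU

/* Hypothetical ABI labels, used by this model only. */
enum model_abi { MODEL_ABI_64, MODEL_ABI_IA32, MODEL_ABI_X32 };

/* Stand-in for is_compat_task(): true for ia32 and for x32. */
bool model_is_compat(enum model_abi abi)
{
	return abi != MODEL_ABI_64;
}

/* Stand-in for is_ia32_task(): true for ia32 only. */
bool model_is_ia32(enum model_abi abi)
{
	return abi == MODEL_ABI_IA32;
}

/* Old selection: x32 is wrongly reported as AUDIT_ARCH_I386. */
uint32_t model_audit_arch_old(enum model_abi abi)
{
	return model_is_compat(abi) ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
}

/* New selection: only ia32 is reported as AUDIT_ARCH_I386. */
uint32_t model_audit_arch_new(enum model_abi abi)
{
	return model_is_ia32(abi) ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
}
```

The two selections agree for native 64-bit and ia32 tasks and differ
only for x32.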

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/ptrace.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 93c182a00506..39296d25708c 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1441,15 +1441,6 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
 	force_sig_info(SIGTRAP, &info, tsk);
 }
 
-
-#ifdef CONFIG_X86_32
-# define IS_IA32	1
-#elif defined CONFIG_IA32_EMULATION
-# define IS_IA32	is_compat_task()
-#else
-# define IS_IA32	0
-#endif
-
 /*
  * We must return the syscall number to actually look up in the table.
  * This can be -1L to skip running any syscall at all.
@@ -1487,7 +1478,7 @@ long syscall_trace_enter(struct pt_regs *regs)
 	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
 		trace_sys_enter(regs, regs->orig_ax);
 
-	if (IS_IA32)
+	if (is_ia32_task())
 		audit_syscall_entry(AUDIT_ARCH_I386,
 				    regs->orig_ax,
 				    regs->bx, regs->cx,
-- 
1.9.3


* [PATCH v5 2/5] x86,entry: Only call user_exit if TIF_NOHZ
From: Andy Lutomirski @ 2014-09-05 22:13 UTC
  To: linux-kernel, Kees Cook, Will Drewry, Oleg Nesterov
  Cc: x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa,
	Frederic Weisbecker, Andy Lutomirski

The RCU context tracking code requires that arch code call
user_exit() on any entry into kernel code if TIF_NOHZ is set.  This
patch adds a check for TIF_NOHZ and a comment to the syscall entry
tracing code.

The main purpose of this patch is to make the code easier to follow:
one can read the body of user_exit and of every function it calls
without finding any explanation of why it's called for traced
syscalls but not for untraced syscalls.  This makes it clear when
user_exit() is necessary.
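
The rule can be sketched in user space as follows.  This is a model
under stated assumptions, not the kernel code: the bit number, the
counter, and all model_* names are hypothetical, and the real
user_exit() does context-tracking work rather than counting calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag layout: a bit number and the mask derived from it. */
#define MODEL_TIF_NOHZ_BIT  19
#define MODEL_TIF_NOHZ_MASK (1U << MODEL_TIF_NOHZ_BIT)

/* Counts calls so the effect of the check can be observed. */
int model_user_exit_calls;

/* Stand-in for user_exit(). */
void model_user_exit(void)
{
	model_user_exit_calls++;
}

/* Stand-in for test_thread_flag(): takes a bit number, not a mask. */
bool model_test_flag(uint32_t flags, int bit)
{
	return (flags >> bit) & 1U;
}

/* Entry work: leave user context tracking only when NOHZ is set. */
void model_trace_enter(uint32_t flags)
{
	if (model_test_flag(flags, MODEL_TIF_NOHZ_BIT))
		model_user_exit();
	/* ... the remaining entry work would follow here ... */
}
```

Calling model_trace_enter() with any flag word that lacks the NOHZ bit
leaves the counter untouched, which mirrors the "only if TIF_NOHZ"
condition in the patch.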

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/ptrace.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 39296d25708c..bbf338a04a5d 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1449,7 +1449,12 @@ long syscall_trace_enter(struct pt_regs *regs)
 {
 	long ret = 0;
 
-	user_exit();
+	/*
+	 * If TIF_NOHZ is set, we are required to call user_exit() before
+	 * doing anything that could touch RCU.
+	 */
+	if (test_thread_flag(TIF_NOHZ))
+		user_exit();
 
 	/*
 	 * If we stepped into a sysenter/syscall insn, it trapped in
-- 
1.9.3


* [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
From: Andy Lutomirski @ 2014-09-05 22:13 UTC
  To: linux-kernel, Kees Cook, Will Drewry, Oleg Nesterov
  Cc: x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa,
	Frederic Weisbecker, Andy Lutomirski

This splits syscall_trace_enter into syscall_trace_enter_phase1 and
syscall_trace_enter_phase2.  Only phase 2 has full pt_regs, and only
phase 2 is permitted to modify any of pt_regs except for orig_ax.

The intent is that phase 1 can be called from the syscall fast path.

In this implementation, phase1 can handle any combination of
TIF_NOHZ (RCU context tracking), TIF_SECCOMP, and TIF_SYSCALL_AUDIT,
unless seccomp requests a ptrace event, in which case phase2 is
forced.

In principle, this could yield a big speedup for TIF_NOHZ as well as
for TIF_SECCOMP if syscall exit work were similarly split up.
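
The phase 1 decision can be modelled in user space roughly as below.
This is a simplified sketch, not the kernel function: the MODEL_W_*
bit values and the function name are hypothetical, the seccomp result
is passed in rather than computed, and the side effects (user_exit(),
auditing, setting orig_ax to -1 on skip) are reduced to comments.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical work bits, used by this model only. */
#define MODEL_W_NOHZ    (1U << 0)
#define MODEL_W_SECCOMP (1U << 1)
#define MODEL_W_AUDIT   (1U << 2)
#define MODEL_W_TRACE   (1U << 3)	/* anything phase 1 cannot handle */

/*
 * Return 0 to resume the syscall, 1 to go to phase 2 without seccomp
 * phase 2, or any other value to go to phase 2 and pass it to seccomp.
 * seccomp_result models seccomp phase 1: 0 = OK, 1 = skip, else trace.
 */
unsigned long model_trace_enter_phase1(uint32_t work,
				       unsigned long seccomp_result)
{
	/* Context tracking is fully handled here (user_exit() in the kernel). */
	work &= ~MODEL_W_NOHZ;

	if (work & MODEL_W_SECCOMP) {
		if (seccomp_result > 1)
			return seccomp_result;	/* ptrace event requested */
		/* OK or skip: seccomp is done (skip sets orig_ax = -1). */
		work &= ~MODEL_W_SECCOMP;
	}

	if (work == 0)
		return 0;	/* seccomp and/or nohz only */

	if (work == MODEL_W_AUDIT)
		return 0;	/* audit here, so phase 2 must not run */

	return 1;	/* something is enabled that needs phase 2 */
}
```

In this model, phase 2 is reached only when seccomp asks for a tracer
or when work other than nohz, seccomp, and audit is pending.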

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/include/asm/ptrace.h |   5 ++
 arch/x86/kernel/ptrace.c      | 157 +++++++++++++++++++++++++++++++++++-------
 2 files changed, 138 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 6205f0c434db..86fc2bb82287 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -75,6 +75,11 @@ convert_ip_to_linear(struct task_struct *child, struct pt_regs *regs);
 extern void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
 			 int error_code, int si_code);
 
+
+extern unsigned long syscall_trace_enter_phase1(struct pt_regs *, u32 arch);
+extern long syscall_trace_enter_phase2(struct pt_regs *, u32 arch,
+				       unsigned long phase1_result);
+
 extern long syscall_trace_enter(struct pt_regs *);
 extern void syscall_trace_leave(struct pt_regs *);
 
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index bbf338a04a5d..29576c244699 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1441,20 +1441,126 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
 	force_sig_info(SIGTRAP, &info, tsk);
 }
 
+static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
+{
+#ifdef CONFIG_X86_64
+	if (arch == AUDIT_ARCH_X86_64) {
+		audit_syscall_entry(arch, regs->orig_ax, regs->di,
+				    regs->si, regs->dx, regs->r10);
+	} else
+#endif
+	{
+		audit_syscall_entry(arch, regs->orig_ax, regs->bx,
+				    regs->cx, regs->dx, regs->si);
+	}
+}
+
 /*
- * We must return the syscall number to actually look up in the table.
- * This can be -1L to skip running any syscall at all.
+ * We can return 0 to resume the syscall or anything else to go to phase
+ * 2.  If we resume the syscall, we need to put something appropriate in
+ * regs->orig_ax.
+ *
+ * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
+ * are fully functional.
+ *
+ * For phase 2's benefit, our return value is:
+ * 0:			resume the syscall
+ * 1:			go to phase 2; no seccomp phase 2 needed
+ * anything else:	go to phase 2; pass return value to seccomp
  */
-long syscall_trace_enter(struct pt_regs *regs)
+unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 {
-	long ret = 0;
+	unsigned long ret = 0;
+	u32 work;
+
+	BUG_ON(regs != task_pt_regs(current));
+
+	work = ACCESS_ONCE(current_thread_info()->flags) &
+		_TIF_WORK_SYSCALL_ENTRY;
 
 	/*
 	 * If TIF_NOHZ is set, we are required to call user_exit() before
 	 * doing anything that could touch RCU.
 	 */
-	if (test_thread_flag(TIF_NOHZ))
+	if (work & _TIF_NOHZ) {
 		user_exit();
+		work &= ~_TIF_NOHZ;
+	}
+
+#ifdef CONFIG_SECCOMP
+	/*
+	 * Do seccomp first -- it should minimize exposure of other
+	 * code, and keeping seccomp fast is probably more valuable
+	 * than the rest of this.
+	 */
+	if (work & _TIF_SECCOMP) {
+		struct seccomp_data sd;
+
+		sd.arch = arch;
+		sd.nr = regs->orig_ax;
+		sd.instruction_pointer = regs->ip;
+#ifdef CONFIG_X86_64
+		if (arch == AUDIT_ARCH_X86_64) {
+			sd.args[0] = regs->di;
+			sd.args[1] = regs->si;
+			sd.args[2] = regs->dx;
+			sd.args[3] = regs->r10;
+			sd.args[4] = regs->r8;
+			sd.args[5] = regs->r9;
+		} else
+#endif
+		{
+			sd.args[0] = regs->bx;
+			sd.args[1] = regs->cx;
+			sd.args[2] = regs->dx;
+			sd.args[3] = regs->si;
+			sd.args[4] = regs->di;
+			sd.args[5] = regs->bp;
+		}
+
+		BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
+		BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
+
+		ret = seccomp_phase1(&sd);
+		if (ret == SECCOMP_PHASE1_SKIP) {
+			regs->orig_ax = -1;
+			ret = 0;
+		} else if (ret != SECCOMP_PHASE1_OK) {
+			return ret;  /* Go directly to phase 2 */
+		}
+
+		work &= ~_TIF_SECCOMP;
+	}
+#endif
+
+	/* Do our best to finish without phase 2. */
+	if (work == 0)
+		return ret;  /* seccomp and/or nohz only (ret == 0 here) */
+
+#ifdef CONFIG_AUDITSYSCALL
+	if (work == _TIF_SYSCALL_AUDIT) {
+		/*
+		 * If there is no more work to be done except auditing,
+		 * then audit in phase 1.  Phase 2 always audits, so, if
+		 * we audit here, then we can't go on to phase 2.
+		 */
+		do_audit_syscall_entry(regs, arch);
+		return 0;
+	}
+#endif
+
+	return 1;  /* Something is enabled that we can't handle in phase 1 */
+}
+
+/* Returns the syscall nr to run (which should match regs->orig_ax). */
+long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
+				unsigned long phase1_result)
+{
+	long ret = 0;
+	u32 work = ACCESS_ONCE(current_thread_info()->flags) &
+		_TIF_WORK_SYSCALL_ENTRY;
+
+	BUG_ON(regs != task_pt_regs(current));
 
 	/*
 	 * If we stepped into a sysenter/syscall insn, it trapped in
@@ -1463,17 +1569,21 @@ long syscall_trace_enter(struct pt_regs *regs)
 	 * do_debug() and we need to set it again to restore the user
 	 * state.  If we entered on the slow path, TF was already set.
 	 */
-	if (test_thread_flag(TIF_SINGLESTEP))
+	if (work & _TIF_SINGLESTEP)
 		regs->flags |= X86_EFLAGS_TF;
 
-	/* do the secure computing check first */
-	if (secure_computing()) {
+#ifdef CONFIG_SECCOMP
+	/*
+	 * Call seccomp_phase2 before running the other hooks so that
+	 * they can see any changes made by a seccomp tracer.
+	 */
+	if (phase1_result > 1 && seccomp_phase2(phase1_result)) {
 		/* seccomp failures shouldn't expose any additional code. */
-		ret = -1L;
-		goto out;
+		return -1;
 	}
+#endif
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_EMU)))
+	if (unlikely(work & _TIF_SYSCALL_EMU))
 		ret = -1L;
 
 	if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
@@ -1483,23 +1593,22 @@ long syscall_trace_enter(struct pt_regs *regs)
 	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
 		trace_sys_enter(regs, regs->orig_ax);
 
-	if (is_ia32_task())
-		audit_syscall_entry(AUDIT_ARCH_I386,
-				    regs->orig_ax,
-				    regs->bx, regs->cx,
-				    regs->dx, regs->si);
-#ifdef CONFIG_X86_64
-	else
-		audit_syscall_entry(AUDIT_ARCH_X86_64,
-				    regs->orig_ax,
-				    regs->di, regs->si,
-				    regs->dx, regs->r10);
-#endif
+	do_audit_syscall_entry(regs, arch);
 
-out:
 	return ret ?: regs->orig_ax;
 }
 
+long syscall_trace_enter(struct pt_regs *regs)
+{
+	u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
+	unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);
+
+	if (phase1_result == 0)
+		return regs->orig_ax;
+	else
+		return syscall_trace_enter_phase2(regs, arch, phase1_result);
+}
+
 void syscall_trace_leave(struct pt_regs *regs)
 {
 	bool step;
-- 
1.9.3


* [PATCH v5 4/5] x86_64,entry: Treat regs->ax the same in fastpath and slowpath syscalls
From: Andy Lutomirski @ 2014-09-05 22:13 UTC
  To: linux-kernel, Kees Cook, Will Drewry, Oleg Nesterov
  Cc: x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa,
	Frederic Weisbecker, Andy Lutomirski

For slowpath syscalls, we initialize regs->ax to -ENOSYS and stick
the syscall number into regs->orig_ax prior to any possible tracing
and syscall execution.  This is user-visible ABI used by ptrace
syscall emulation and seccomp.

For fastpath syscalls, there's no good reason not to do the same
thing.  It's even slightly simpler than what we're currently doing.
It probably has no measurable performance impact.  It should have
no user-visible effect.

The purpose of this patch is to prepare for two-phase syscall
tracing, in which the first phase might modify the saved RAX without
leaving the fast path.  This change is just subtle enough that I'm
keeping it separate.
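
The convention can be sketched in user space like this.  It is a model
only: the register struct, the two-entry table, and every model_* name
are hypothetical, and the real entry code is assembly operating on the
saved register frame.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, reduced register frame: only the two slots that matter. */
struct model_regs {
	long ax;	/* return value slot */
	long orig_ax;	/* syscall number */
};

typedef long (*model_syscall_fn)(void);

long model_sys_zero(void)   { return 0; }
long model_sys_answer(void) { return 42; }

/* Hypothetical two-entry syscall table. */
model_syscall_fn model_table[] = { model_sys_zero, model_sys_answer };
#define MODEL_NR_MAX 1UL

long model_do_syscall(long nr)
{
	/* Preset ax to -ENOSYS and record nr before any other work. */
	struct model_regs regs = { .ax = -ENOSYS, .orig_ax = nr };

	/*
	 * Entry work could rewrite regs.ax or regs.orig_ax here.  An
	 * out-of-range number (including -1 for "skip") falls through
	 * and simply returns whatever is in regs.ax.
	 */
	if ((unsigned long)regs.orig_ax <= MODEL_NR_MAX)
		regs.ax = model_table[regs.orig_ax]();

	return regs.ax;
}
```

With the preset in place, the out-of-range branch needs no dedicated
error path, which is the role the removed badsys label used to play.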

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/include/asm/calling.h |  6 +++++-
 arch/x86/kernel/entry_64.S     | 13 ++++---------
 2 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
index cb4c73bfeb48..76659b67fd11 100644
--- a/arch/x86/include/asm/calling.h
+++ b/arch/x86/include/asm/calling.h
@@ -85,7 +85,7 @@ For 32-bit we have the following conventions - kernel is built with
 #define ARGOFFSET	R11
 #define SWFRAME		ORIG_RAX
 
-	.macro SAVE_ARGS addskip=0, save_rcx=1, save_r891011=1
+	.macro SAVE_ARGS addskip=0, save_rcx=1, save_r891011=1, rax_enosys=0
 	subq  $9*8+\addskip, %rsp
 	CFI_ADJUST_CFA_OFFSET	9*8+\addskip
 	movq_cfi rdi, 8*8
@@ -96,7 +96,11 @@ For 32-bit we have the following conventions - kernel is built with
 	movq_cfi rcx, 5*8
 	.endif
 
+	.if \rax_enosys
+	movq $-ENOSYS, 4*8(%rsp)
+	.else
 	movq_cfi rax, 4*8
+	.endif
 
 	.if \save_r891011
 	movq_cfi r8,  3*8
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 2fac1343a90b..0bd6d3c28064 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -404,8 +404,8 @@ GLOBAL(system_call_after_swapgs)
 	 * and short:
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	SAVE_ARGS 8,0
-	movq  %rax,ORIG_RAX-ARGOFFSET(%rsp)
+	SAVE_ARGS 8, 0, rax_enosys=1
+	movq_cfi rax,(ORIG_RAX-ARGOFFSET)
 	movq  %rcx,RIP-ARGOFFSET(%rsp)
 	CFI_REL_OFFSET rip,RIP-ARGOFFSET
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
@@ -417,7 +417,7 @@ system_call_fastpath:
 	andl $__SYSCALL_MASK,%eax
 	cmpl $__NR_syscall_max,%eax
 #endif
-	ja badsys
+	ja ret_from_sys_call  /* and return regs->ax */
 	movq %r10,%rcx
 	call *sys_call_table(,%rax,8)  # XXX:	 rip relative
 	movq %rax,RAX-ARGOFFSET(%rsp)
@@ -476,10 +476,6 @@ sysret_signal:
 	FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
 	jmp int_check_syscall_exit_work
 
-badsys:
-	movq $-ENOSYS,RAX-ARGOFFSET(%rsp)
-	jmp ret_from_sys_call
-
 #ifdef CONFIG_AUDITSYSCALL
 	/*
 	 * Fast path for syscall audit without full syscall trace.
@@ -519,7 +515,6 @@ tracesys:
 	jz auditsys
 #endif
 	SAVE_REST
-	movq $-ENOSYS,RAX(%rsp) /* ptrace can change this for a bad syscall */
 	FIXUP_TOP_OF_STACK %rdi
 	movq %rsp,%rdi
 	call syscall_trace_enter
@@ -536,7 +531,7 @@ tracesys:
 	andl $__SYSCALL_MASK,%eax
 	cmpl $__NR_syscall_max,%eax
 #endif
-	ja   int_ret_from_sys_call	/* RAX(%rsp) set to -ENOSYS above */
+	ja   int_ret_from_sys_call	/* RAX(%rsp) is already set */
 	movq %r10,%rcx	/* fixup for C */
 	call *sys_call_table(,%rax,8)
 	movq %rax,RAX-ARGOFFSET(%rsp)
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v5 5/5] x86_64,entry: Use split-phase syscall_trace_enter for 64-bit syscalls
  2014-09-05 22:13 ` Andy Lutomirski
@ 2014-09-05 22:13   ` Andy Lutomirski
  -1 siblings, 0 replies; 82+ messages in thread
From: Andy Lutomirski @ 2014-09-05 22:13 UTC (permalink / raw)
  To: linux-kernel, Kees Cook, Will Drewry, Oleg Nesterov
  Cc: x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa,
	Frederic Weisbecker, Andy Lutomirski

On KVM on my box, this reduces the overhead from an always-accept
seccomp filter from ~130ns to ~17ns.  Most of that comes from
avoiding IRET on every syscall when seccomp is enabled.

In extremely approximate hacked-up benchmarking, just bypassing IRET
saves about 80ns, so there's another 43ns of savings here from
simplifying the seccomp path.

The diffstat is also rather nice :)

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/entry_64.S | 38 +++++++++++++++-----------------------
 1 file changed, 15 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 0bd6d3c28064..df088bb03fb3 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -478,22 +478,6 @@ sysret_signal:
 
 #ifdef CONFIG_AUDITSYSCALL
 	/*
-	 * Fast path for syscall audit without full syscall trace.
-	 * We just call __audit_syscall_entry() directly, and then
-	 * jump back to the normal fast path.
-	 */
-auditsys:
-	movq %r10,%r9			/* 6th arg: 4th syscall arg */
-	movq %rdx,%r8			/* 5th arg: 3rd syscall arg */
-	movq %rsi,%rcx			/* 4th arg: 2nd syscall arg */
-	movq %rdi,%rdx			/* 3rd arg: 1st syscall arg */
-	movq %rax,%rsi			/* 2nd arg: syscall number */
-	movl $AUDIT_ARCH_X86_64,%edi	/* 1st arg: audit arch */
-	call __audit_syscall_entry
-	LOAD_ARGS 0		/* reload call-clobbered registers */
-	jmp system_call_fastpath
-
-	/*
 	 * Return fast path for syscall audit.  Call __audit_syscall_exit()
 	 * directly and then jump back to the fast path with TIF_SYSCALL_AUDIT
 	 * masked off.
@@ -510,17 +494,25 @@ sysret_audit:
 
 	/* Do syscall tracing */
 tracesys:
-#ifdef CONFIG_AUDITSYSCALL
-	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
-	jz auditsys
-#endif
+	leaq -REST_SKIP(%rsp), %rdi
+	movq $AUDIT_ARCH_X86_64, %rsi
+	call syscall_trace_enter_phase1
+	test %rax, %rax
+	jnz tracesys_phase2		/* if needed, run the slow path */
+	LOAD_ARGS 0			/* else restore clobbered regs */
+	jmp system_call_fastpath	/*      and return to the fast path */
+
+tracesys_phase2:
 	SAVE_REST
 	FIXUP_TOP_OF_STACK %rdi
-	movq %rsp,%rdi
-	call syscall_trace_enter
+	movq %rsp, %rdi
+	movq $AUDIT_ARCH_X86_64, %rsi
+	movq %rax,%rdx
+	call syscall_trace_enter_phase2
+
 	/*
 	 * Reload arg registers from stack in case ptrace changed them.
-	 * We don't reload %rax because syscall_trace_enter() returned
+	 * We don't reload %rax because syscall_trace_enter_phase2() returned
 	 * the value it wants us to use in the table lookup.
 	 */
 	LOAD_ARGS ARGOFFSET, 1
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 82+ messages in thread
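
The new tracesys flow in patch 5/5 above can be read as the following C model. It is a sketch under stated assumptions, not the kernel implementation: struct sketch_regs, sketch_tracesys() and the callback types are hypothetical, and the model assumes that the fast path picks up the syscall number from orig_ax after LOAD_ARGS, which is what the phase 1 comment about putting "something appropriate in regs->orig_ax" suggests.

```c
/* Hypothetical stand-in for pt_regs; only orig_ax is modelled. */
struct sketch_regs {
	long orig_ax;
};

struct sketch_outcome {
	int used_phase2;	/* 1 if the slow path ran */
	long nr;		/* syscall number that would be dispatched */
};

typedef unsigned long (*sketch_phase1_fn)(struct sketch_regs *regs);
typedef long (*sketch_phase2_fn)(struct sketch_regs *regs,
				 unsigned long phase1_result);

/* Models tracesys: a zero result from phase 1 returns to the fast path,
 * anything else is handed to phase 2 along with the phase 1 result. */
static struct sketch_outcome sketch_tracesys(struct sketch_regs *regs,
					     sketch_phase1_fn phase1,
					     sketch_phase2_fn phase2)
{
	struct sketch_outcome out;
	unsigned long r = phase1(regs);

	if (r == 0) {
		out.used_phase2 = 0;
		out.nr = regs->orig_ax;	/* fast path re-reads the number */
		return out;
	}

	out.used_phase2 = 1;
	out.nr = phase2(regs, r);	/* phase 2 returns the number to run */
	return out;
}

/* Illustrative callbacks only. */
static unsigned long sketch_phase1_allow(struct sketch_regs *regs)
{
	(void)regs;
	return 0;
}

static unsigned long sketch_phase1_skip(struct sketch_regs *regs)
{
	regs->orig_ax = -1;	/* skip the syscall but stay on the fast path */
	return 0;
}

static unsigned long sketch_phase1_defer(struct sketch_regs *regs)
{
	(void)regs;
	return 1;		/* request the full-frame slow path */
}

static long sketch_phase2_passthrough(struct sketch_regs *regs,
				      unsigned long phase1_result)
{
	(void)phase1_result;
	return regs->orig_ax;
}
```

In this model, an always-accept filter corresponds to sketch_phase1_allow(): phase 2 never runs, which is where the commit message locates the savings.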

* Re: [PATCH v5 0/5] x86: two-phase syscall tracing and seccomp fastpath
  2014-09-05 22:13 ` Andy Lutomirski
  (?)
@ 2014-09-08 19:29   ` Kees Cook
  -1 siblings, 0 replies; 82+ messages in thread
From: Kees Cook @ 2014-09-08 19:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, Will Drewry, Oleg Nesterov, x86, linux-arm-kernel,
	Linux MIPS Mailing List, linux-arch, linux-security-module,
	Alexei Starovoitov, H. Peter Anvin, Frederic Weisbecker

On Fri, Sep 5, 2014 at 3:13 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> This applies to:
> git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git seccomp-fastpath
>
> Gitweb:
> https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=seccomp/fastpath
>
> This is both a cleanup and a speedup.  It reduces overhead due to
> installing a trivial seccomp filter by 87%.  The speedup comes from
> avoiding the full syscall tracing mechanism for filters that don't
> return SECCOMP_RET_TRACE.
>
> This series depends on splitting the seccomp hooks into two phases.
> The first phase evaluates the filter; it can skip syscalls, allow
> them, kill the calling task, or pass a u32 to the second phase.  The
> second phase requires a full tracing context, and it sends ptrace
> events if necessary.  The seccomp core part is in Kees' seccomp/fastpath
> tree.
>
> These patches implement a similar split for the x86 syscall
> entry work.  The C callback is invoked in two phases: the first has
> only a partial frame, and it can request phase 2 processing with a
> full frame.
>
> Finally, I switch the 64-bit system_call code to use the new split
> entry work.  This is a net deletion of assembly code: it replaces
> all of the audit entry muck.
>
> In the process, I fixed some bugs.
>
> If this is acceptable, someone can do the same tweak for the
> ia32entry and entry_32 code.
>
> This passes all seccomp tests that I know of.
>
> Changes from v4:
>  - Rebased (which seems to have been a no-op)
>  - Fixed embarrassing bug that broke allnoconfig
>    (patch 3 was missing an ifdef).
>
> Changes from v3:
>  - Dropped the core seccomp changes from the email -- Kees has applied them.
>  - Add patch 2 (the TIF_NOHZ change).
>  - Fix TIF_NOHZ in the two-phase entry code (thanks, Oleg).
>
> Changes from v2:
>  - Fixed 32-bit x86 build (and the tests pass).
>  - Put the doc patch where it belongs.
>
> Changes from v1:
>  - Rebased on top of Kees' shiny new seccomp tree (no effect on the x86
>    part).
>  - Improved patch 6 vs patch 7 split (thanks Alexei!)
>  - Fixed bogus -ENOSYS in patch 5 (thanks Kees!)
>  - Improved changelog message in patch 6.
>
> Changes from RFC version:
>  - The first three patches are more or less the same
>  - The rest is more or less a rewrite
>
> Andy Lutomirski (5):
>   x86,x32,audit: Fix x32's AUDIT_ARCH wrt audit
>   x86,entry: Only call user_exit if TIF_NOHZ
>   x86: Split syscall_trace_enter into two phases
>   x86_64,entry: Treat regs->ax the same in fastpath and slowpath
>     syscalls
>   x86_64,entry: Use split-phase syscall_trace_enter for 64-bit syscalls
>
>  arch/x86/include/asm/calling.h |   6 +-
>  arch/x86/include/asm/ptrace.h  |   5 ++
>  arch/x86/kernel/entry_64.S     |  51 +++++--------
>  arch/x86/kernel/ptrace.c       | 165 +++++++++++++++++++++++++++++++++--------
>  4 files changed, 164 insertions(+), 63 deletions(-)

Consider the series:

Acked-by: Kees Cook <keescook@chromium.org>

As far as doing pulls, Peter, can you take the seccomp change from my
tree as well? It makes sense to land all of this together.

Thanks!

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 0/5] x86: two-phase syscall tracing and seccomp fastpath
  2014-09-08 19:29   ` Kees Cook
  (?)
@ 2014-09-08 19:49     ` H. Peter Anvin
  -1 siblings, 0 replies; 82+ messages in thread
From: H. Peter Anvin @ 2014-09-08 19:49 UTC (permalink / raw)
  To: Kees Cook, Andy Lutomirski
  Cc: LKML, Will Drewry, Oleg Nesterov, x86, linux-arm-kernel,
	Linux MIPS Mailing List, linux-arch, linux-security-module,
	Alexei Starovoitov, Frederic Weisbecker

On 09/08/2014 12:29 PM, Kees Cook wrote:
> 
> As far as doing pulls, Peter, can you take the seccomp change from my
> tree as well? It makes sense to land all of this together.
> 

That is the plan.

	-hpa



^ permalink raw reply	[flat|nested] 82+ messages in thread

* [tip:x86/seccomp] x86, x32, audit: Fix x32's AUDIT_ARCH wrt audit
  2014-09-05 22:13   ` Andy Lutomirski
  (?)
@ 2014-09-09  2:43   ` tip-bot for Andy Lutomirski
  -1 siblings, 0 replies; 82+ messages in thread
From: tip-bot for Andy Lutomirski @ 2014-09-09  2:43 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, luto, hpa, mingo, tglx, hpa

Commit-ID:  81f49a8fd7088cfcb588d182eeede862c0e3303e
Gitweb:     http://git.kernel.org/tip/81f49a8fd7088cfcb588d182eeede862c0e3303e
Author:     Andy Lutomirski <luto@amacapital.net>
AuthorDate: Fri, 5 Sep 2014 15:13:52 -0700
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Mon, 8 Sep 2014 14:13:55 -0700

x86, x32, audit: Fix x32's AUDIT_ARCH wrt audit

is_compat_task() is the wrong check for audit arch; the check should
be is_ia32_task(): x32 syscalls should be AUDIT_ARCH_X86_64, not
AUDIT_ARCH_I386.

CONFIG_AUDITSYSCALL is currently incompatible with x32, so this has
no visible effect.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/a0138ed8c709882aec06e4acc30bfa9b623b8717.1409954077.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/kernel/ptrace.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 93c182a..39296d2 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1441,15 +1441,6 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
 	force_sig_info(SIGTRAP, &info, tsk);
 }
 
-
-#ifdef CONFIG_X86_32
-# define IS_IA32	1
-#elif defined CONFIG_IA32_EMULATION
-# define IS_IA32	is_compat_task()
-#else
-# define IS_IA32	0
-#endif
-
 /*
  * We must return the syscall number to actually look up in the table.
  * This can be -1L to skip running any syscall at all.
@@ -1487,7 +1478,7 @@ long syscall_trace_enter(struct pt_regs *regs)
 	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
 		trace_sys_enter(regs, regs->orig_ax);
 
-	if (IS_IA32)
+	if (is_ia32_task())
 		audit_syscall_entry(AUDIT_ARCH_I386,
 				    regs->orig_ax,
 				    regs->bx, regs->cx,

^ permalink raw reply related	[flat|nested] 82+ messages in thread
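
The difference between the two checks in the commit above can be illustrated with a small userspace model. This is a sketch, not kernel code: enum model_abi and the model_* helpers are hypothetical, and it assumes (as the commit message implies) that a compat-style check is true for both ia32 and x32 tasks. The AUDIT_ARCH_* constants come from the UAPI header linux/audit.h.

```c
#include <linux/audit.h>
#include <stdint.h>

/* Hypothetical ABI tags for the three kinds of task on an x86-64 kernel. */
enum model_abi {
	MODEL_ABI_NATIVE64,
	MODEL_ABI_X32,
	MODEL_ABI_IA32,
};

/* Assumed behaviour of a compat check: true for ia32 and for x32. */
static int model_is_compat(enum model_abi abi)
{
	return abi != MODEL_ABI_NATIVE64;
}

/* Assumed behaviour of an ia32 check: true only for 32-bit tasks. */
static int model_is_ia32(enum model_abi abi)
{
	return abi == MODEL_ABI_IA32;
}

/* Old selection: x32 tasks are misreported as AUDIT_ARCH_I386. */
static uint32_t model_audit_arch_old(enum model_abi abi)
{
	return model_is_compat(abi) ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
}

/* Fixed selection: x32 uses 64-bit registers, so it reports X86_64. */
static uint32_t model_audit_arch_fixed(enum model_abi abi)
{
	return model_is_ia32(abi) ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
}
```

Only the x32 row changes between the two selections, which matches the commit's note that the fix has no visible effect while CONFIG_AUDITSYSCALL is incompatible with x32.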

* [tip:x86/seccomp] x86, entry: Only call user_exit if TIF_NOHZ
  2014-09-05 22:13   ` Andy Lutomirski
  (?)
@ 2014-09-09  2:43   ` tip-bot for Andy Lutomirski
  -1 siblings, 0 replies; 82+ messages in thread
From: tip-bot for Andy Lutomirski @ 2014-09-09  2:43 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, luto, hpa, mingo, fweisbec, tglx, hpa

Commit-ID:  fd143b210e685f0c4b37895f03fb79cd0555b00d
Gitweb:     http://git.kernel.org/tip/fd143b210e685f0c4b37895f03fb79cd0555b00d
Author:     Andy Lutomirski <luto@amacapital.net>
AuthorDate: Fri, 5 Sep 2014 15:13:53 -0700
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Mon, 8 Sep 2014 14:13:59 -0700

x86, entry: Only call user_exit if TIF_NOHZ

The RCU context tracking code requires that arch code call
user_exit() on any entry into kernel code if TIF_NOHZ is set.  This
patch adds a check for TIF_NOHZ and a comment to the syscall entry
tracing code.

The main purpose of this patch is to make the code easier to follow:
one can read the body of user_exit and of every function it calls
without finding any explanation of why it's called for traced
syscalls but not for untraced syscalls.  This makes it clear when
user_exit() is necessary.

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/0b13e0e24ec0307d67ab7a23b58764f6b1270116.1409954077.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/kernel/ptrace.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 39296d2..bbf338a 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1449,7 +1449,12 @@ long syscall_trace_enter(struct pt_regs *regs)
 {
 	long ret = 0;
 
-	user_exit();
+	/*
+	 * If TIF_NOHZ is set, we are required to call user_exit() before
+	 * doing anything that could touch RCU.
+	 */
+	if (test_thread_flag(TIF_NOHZ))
+		user_exit();
 
 	/*
 	 * If we stepped into a sysenter/syscall insn, it trapped in

^ permalink raw reply related	[flat|nested] 82+ messages in thread
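
The patch above checks the flag with test_thread_flag(TIF_NOHZ), which takes a bit index. Kernel thread-info flags come in pairs: TIF_x is the bit index and _TIF_x is the corresponding mask, 1 << TIF_x. The following sketch shows why the two must not be mixed. The names are hypothetical and the bit position 19 is an assumption chosen for illustration.

```c
/* Hypothetical flag: bit index and the mask derived from it. */
#define MODEL_TIF_NOHZ_BIT	19
#define MODEL_TIF_NOHZ_MASK	(1u << MODEL_TIF_NOHZ_BIT)

/* Index-based test, in the style of test_thread_flag(TIF_x). */
static int model_test_flag(unsigned int flags, int bit)
{
	return (flags >> bit) & 1u;
}

/* Mask-based clear, in the style of "work &= ~_TIF_x". */
static unsigned int model_clear_mask(unsigned int flags, unsigned int mask)
{
	return flags & ~mask;
}

/* Pitfall: using the bit index where a mask is expected clears the
 * unrelated low bits that make up the number 19 and leaves the real
 * flag set. */
static unsigned int model_clear_with_index_by_mistake(unsigned int flags)
{
	return flags & ~(unsigned int)MODEL_TIF_NOHZ_BIT;
}
```

Code that works on a snapshot of the flags word therefore wants the mask form throughout, while helpers in the test_thread_flag() style want the index.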

* [tip:x86/seccomp] x86: Split syscall_trace_enter into two phases
  2014-09-05 22:13   ` Andy Lutomirski
  (?)
@ 2014-09-09  2:44   ` tip-bot for Andy Lutomirski
  -1 siblings, 0 replies; 82+ messages in thread
From: tip-bot for Andy Lutomirski @ 2014-09-09  2:44 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, luto, hpa, mingo, tglx, hpa

Commit-ID:  e0ffbaabc46db508b8717f023c0ce03b980eefac
Gitweb:     http://git.kernel.org/tip/e0ffbaabc46db508b8717f023c0ce03b980eefac
Author:     Andy Lutomirski <luto@amacapital.net>
AuthorDate: Fri, 5 Sep 2014 15:13:54 -0700
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Mon, 8 Sep 2014 14:14:03 -0700

x86: Split syscall_trace_enter into two phases

This splits syscall_trace_enter into syscall_trace_enter_phase1 and
syscall_trace_enter_phase2.  Only phase 2 has full pt_regs, and only
phase 2 is permitted to modify any of pt_regs except for orig_ax.

The intent is that phase 1 can be called from the syscall fast path.

In this implementation, phase1 can handle any combination of
TIF_NOHZ (RCU context tracking), TIF_SECCOMP, and TIF_SYSCALL_AUDIT,
unless seccomp requests a ptrace event, in which case phase2 is
forced.

In principle, this could yield a big speedup for TIF_NOHZ as well as
for TIF_SECCOMP if syscall exit work were similarly split up.
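
The phase 1 decision described above can be sketched as a pure function. This is a simplified model, not the code in the patch: the MODEL_* names are hypothetical, the seccomp result is passed in as a parameter instead of being computed by a filter, and all work that phase 1 cannot handle is folded into a single MODEL_WORK_OTHER bit.

```c
/* Hypothetical work bits; the kernel uses _TIF_* masks instead. */
#define MODEL_WORK_NOHZ		(1u << 0)
#define MODEL_WORK_SECCOMP	(1u << 1)
#define MODEL_WORK_AUDIT	(1u << 2)
#define MODEL_WORK_OTHER	(1u << 3)	/* e.g. ptrace syscall tracing */

/* Modelled seccomp phase 1 results: 0 is OK, 1 is skip, anything
 * else must be passed on to phase 2. */
#define MODEL_SECCOMP_OK	0ul
#define MODEL_SECCOMP_SKIP	1ul

struct model_p1_regs {
	long orig_ax;
};

/* Returns 0 to resume on the fast path, 1 to run phase 2 without
 * seccomp phase 2, or the seccomp value to forward to phase 2. */
static unsigned long model_phase1(struct model_p1_regs *regs,
				  unsigned int work,
				  unsigned long seccomp_ret)
{
	/* Context tracking is handled first and then dropped from work. */
	work &= ~MODEL_WORK_NOHZ;

	if (work & MODEL_WORK_SECCOMP) {
		if (seccomp_ret == MODEL_SECCOMP_SKIP)
			regs->orig_ax = -1;	/* skip, stay on the fast path */
		else if (seccomp_ret != MODEL_SECCOMP_OK)
			return seccomp_ret;	/* go directly to phase 2 */
		work &= ~MODEL_WORK_SECCOMP;
	}

	if (work == 0)
		return 0;			/* seccomp and/or nohz only */

	if (work == MODEL_WORK_AUDIT)
		return 0;			/* audit is done in phase 1 */

	return 1;				/* something needs phase 2 */
}
```

The ordering matters: seccomp is evaluated before the remaining work is inspected, so a filter that simply allows the syscall never forces the full-frame path on its own.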

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/2df320a600020fda055fccf2b668145729dd0c04.1409954077.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/include/asm/ptrace.h |   5 ++
 arch/x86/kernel/ptrace.c      | 157 +++++++++++++++++++++++++++++++++++-------
 2 files changed, 138 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 6205f0c..86fc2bb 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -75,6 +75,11 @@ convert_ip_to_linear(struct task_struct *child, struct pt_regs *regs);
 extern void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
 			 int error_code, int si_code);
 
+
+extern unsigned long syscall_trace_enter_phase1(struct pt_regs *, u32 arch);
+extern long syscall_trace_enter_phase2(struct pt_regs *, u32 arch,
+				       unsigned long phase1_result);
+
 extern long syscall_trace_enter(struct pt_regs *);
 extern void syscall_trace_leave(struct pt_regs *);
 
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index bbf338a..29576c2 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1441,20 +1441,126 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
 	force_sig_info(SIGTRAP, &info, tsk);
 }
 
+static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
+{
+#ifdef CONFIG_X86_64
+	if (arch == AUDIT_ARCH_X86_64) {
+		audit_syscall_entry(arch, regs->orig_ax, regs->di,
+				    regs->si, regs->dx, regs->r10);
+	} else
+#endif
+	{
+		audit_syscall_entry(arch, regs->orig_ax, regs->bx,
+				    regs->cx, regs->dx, regs->si);
+	}
+}
+
 /*
- * We must return the syscall number to actually look up in the table.
- * This can be -1L to skip running any syscall at all.
+ * We can return 0 to resume the syscall or anything else to go to phase
+ * 2.  If we resume the syscall, we need to put something appropriate in
+ * regs->orig_ax.
+ *
+ * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
+ * are fully functional.
+ *
+ * For phase 2's benefit, our return value is:
+ * 0:			resume the syscall
+ * 1:			go to phase 2; no seccomp phase 2 needed
+ * anything else:	go to phase 2; pass return value to seccomp
  */
-long syscall_trace_enter(struct pt_regs *regs)
+unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 {
-	long ret = 0;
+	unsigned long ret = 0;
+	u32 work;
+
+	BUG_ON(regs != task_pt_regs(current));
+
+	work = ACCESS_ONCE(current_thread_info()->flags) &
+		_TIF_WORK_SYSCALL_ENTRY;
 
 	/*
 	 * If TIF_NOHZ is set, we are required to call user_exit() before
 	 * doing anything that could touch RCU.
 	 */
-	if (test_thread_flag(TIF_NOHZ))
+	if (work & _TIF_NOHZ) {
 		user_exit();
+		work &= ~_TIF_NOHZ;
+	}
+
+#ifdef CONFIG_SECCOMP
+	/*
+	 * Do seccomp first -- it should minimize exposure of other
+	 * code, and keeping seccomp fast is probably more valuable
+	 * than the rest of this.
+	 */
+	if (work & _TIF_SECCOMP) {
+		struct seccomp_data sd;
+
+		sd.arch = arch;
+		sd.nr = regs->orig_ax;
+		sd.instruction_pointer = regs->ip;
+#ifdef CONFIG_X86_64
+		if (arch == AUDIT_ARCH_X86_64) {
+			sd.args[0] = regs->di;
+			sd.args[1] = regs->si;
+			sd.args[2] = regs->dx;
+			sd.args[3] = regs->r10;
+			sd.args[4] = regs->r8;
+			sd.args[5] = regs->r9;
+		} else
+#endif
+		{
+			sd.args[0] = regs->bx;
+			sd.args[1] = regs->cx;
+			sd.args[2] = regs->dx;
+			sd.args[3] = regs->si;
+			sd.args[4] = regs->di;
+			sd.args[5] = regs->bp;
+		}
+
+		BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
+		BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
+
+		ret = seccomp_phase1(&sd);
+		if (ret == SECCOMP_PHASE1_SKIP) {
+			regs->orig_ax = -1;
+			ret = 0;
+		} else if (ret != SECCOMP_PHASE1_OK) {
+			return ret;  /* Go directly to phase 2 */
+		}
+
+		work &= ~_TIF_SECCOMP;
+	}
+#endif
+
+	/* Do our best to finish without phase 2. */
+	if (work == 0)
+		return ret;  /* seccomp and/or nohz only (ret == 0 here) */
+
+#ifdef CONFIG_AUDITSYSCALL
+	if (work == _TIF_SYSCALL_AUDIT) {
+		/*
+		 * If there is no more work to be done except auditing,
+		 * then audit in phase 1.  Phase 2 always audits, so, if
+		 * we audit here, then we can't go on to phase 2.
+		 */
+		do_audit_syscall_entry(regs, arch);
+		return 0;
+	}
+#endif
+
+	return 1;  /* Something is enabled that we can't handle in phase 1 */
+}
+
+/* Returns the syscall nr to run (which should match regs->orig_ax). */
+long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
+				unsigned long phase1_result)
+{
+	long ret = 0;
+	u32 work = ACCESS_ONCE(current_thread_info()->flags) &
+		_TIF_WORK_SYSCALL_ENTRY;
+
+	BUG_ON(regs != task_pt_regs(current));
 
 	/*
 	 * If we stepped into a sysenter/syscall insn, it trapped in
@@ -1463,17 +1569,21 @@ long syscall_trace_enter(struct pt_regs *regs)
 	 * do_debug() and we need to set it again to restore the user
 	 * state.  If we entered on the slow path, TF was already set.
 	 */
-	if (test_thread_flag(TIF_SINGLESTEP))
+	if (work & _TIF_SINGLESTEP)
 		regs->flags |= X86_EFLAGS_TF;
 
-	/* do the secure computing check first */
-	if (secure_computing()) {
+#ifdef CONFIG_SECCOMP
+	/*
+	 * Call seccomp_phase2 before running the other hooks so that
+	 * they can see any changes made by a seccomp tracer.
+	 */
+	if (phase1_result > 1 && seccomp_phase2(phase1_result)) {
 		/* seccomp failures shouldn't expose any additional code. */
-		ret = -1L;
-		goto out;
+		return -1;
 	}
+#endif
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_EMU)))
+	if (unlikely(work & _TIF_SYSCALL_EMU))
 		ret = -1L;
 
 	if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
@@ -1483,23 +1593,22 @@ long syscall_trace_enter(struct pt_regs *regs)
 	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
 		trace_sys_enter(regs, regs->orig_ax);
 
-	if (is_ia32_task())
-		audit_syscall_entry(AUDIT_ARCH_I386,
-				    regs->orig_ax,
-				    regs->bx, regs->cx,
-				    regs->dx, regs->si);
-#ifdef CONFIG_X86_64
-	else
-		audit_syscall_entry(AUDIT_ARCH_X86_64,
-				    regs->orig_ax,
-				    regs->di, regs->si,
-				    regs->dx, regs->r10);
-#endif
+	do_audit_syscall_entry(regs, arch);
 
-out:
 	return ret ?: regs->orig_ax;
 }
 
+long syscall_trace_enter(struct pt_regs *regs)
+{
+	u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
+	unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);
+
+	if (phase1_result == 0)
+		return regs->orig_ax;
+	else
+		return syscall_trace_enter_phase2(regs, arch, phase1_result);
+}
+
 void syscall_trace_leave(struct pt_regs *regs)
 {
 	bool step;

* [tip:x86/seccomp] x86_64, entry: Treat regs->ax the same in fastpath and slowpath syscalls
  2014-09-05 22:13   ` [PATCH v5 4/5] x86_64, entry: " Andy Lutomirski
  (?)
@ 2014-09-09  2:44   ` tip-bot for Andy Lutomirski
  -1 siblings, 0 replies; 82+ messages in thread
From: tip-bot for Andy Lutomirski @ 2014-09-09  2:44 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, luto, hpa, mingo, tglx, hpa

Commit-ID:  54eea9957f5763dd1a2555d7e4cb53b4dd389cc6
Gitweb:     http://git.kernel.org/tip/54eea9957f5763dd1a2555d7e4cb53b4dd389cc6
Author:     Andy Lutomirski <luto@amacapital.net>
AuthorDate: Fri, 5 Sep 2014 15:13:55 -0700
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Mon, 8 Sep 2014 14:14:08 -0700

x86_64, entry: Treat regs->ax the same in fastpath and slowpath syscalls

For slowpath syscalls, we initialize regs->ax to -ENOSYS and stick
the syscall number into regs->orig_ax prior to any possible tracing
and syscall execution.  This is user-visible ABI used by ptrace
syscall emulation and seccomp.

For fastpath syscalls, there's no good reason not to do the same
thing.  It's even slightly simpler than what we're currently doing.
It probably has no measurable performance impact.  It should have
no user-visible effect.

The purpose of this patch is to prepare for two-phase syscall
tracing, in which the first phase might modify the saved RAX without
leaving the fast path.  This change is just subtle enough that I'm
keeping it separate.
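
The effect described above can be modelled in user space.  This is only a
sketch under stated assumptions: struct fake_regs, fake_table, fake_entry()
and fake_dispatch() are hypothetical stand-ins, not kernel code.  It shows
the one property the patch relies on: because the return slot is preloaded
with -ENOSYS at entry, an out-of-range syscall number can simply return
without a dedicated "badsys" path.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the saved register frame; not the kernel's pt_regs. */
struct fake_regs {
	long orig_ax;	/* syscall number as requested by user space */
	long ax;	/* return value slot */
};

typedef long (*fake_syscall_fn)(void);

static long fake_getpid(void)
{
	return 1234;
}

/* Hypothetical one-entry table standing in for sys_call_table. */
static const fake_syscall_fn fake_table[] = { fake_getpid };
#define FAKE_NR_MAX (sizeof(fake_table) / sizeof(fake_table[0]) - 1)

/* Entry: save the number in orig_ax and preload ax with -ENOSYS. */
static void fake_entry(struct fake_regs *regs, long nr)
{
	assert(regs != NULL);
	regs->orig_ax = nr;
	regs->ax = -ENOSYS;
}

/* Dispatch: an out-of-range number just returns, leaving ax untouched. */
static long fake_dispatch(struct fake_regs *regs)
{
	unsigned long nr = (unsigned long)regs->orig_ax;

	if (nr <= FAKE_NR_MAX)
		regs->ax = fake_table[nr]();
	return regs->ax;
}
```

In this model, calling fake_entry() with an unknown number and then
fake_dispatch() yields -ENOSYS, and a number of -1 (as set by a seccomp
skip) takes the same route.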

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/01218b493f12ae2f98034b78c9ae085e38e94350.1409954077.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/include/asm/calling.h |  6 +++++-
 arch/x86/kernel/entry_64.S     | 13 ++++---------
 2 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
index cb4c73b..76659b6 100644
--- a/arch/x86/include/asm/calling.h
+++ b/arch/x86/include/asm/calling.h
@@ -85,7 +85,7 @@ For 32-bit we have the following conventions - kernel is built with
 #define ARGOFFSET	R11
 #define SWFRAME		ORIG_RAX
 
-	.macro SAVE_ARGS addskip=0, save_rcx=1, save_r891011=1
+	.macro SAVE_ARGS addskip=0, save_rcx=1, save_r891011=1, rax_enosys=0
 	subq  $9*8+\addskip, %rsp
 	CFI_ADJUST_CFA_OFFSET	9*8+\addskip
 	movq_cfi rdi, 8*8
@@ -96,7 +96,11 @@ For 32-bit we have the following conventions - kernel is built with
 	movq_cfi rcx, 5*8
 	.endif
 
+	.if \rax_enosys
+	movq $-ENOSYS, 4*8(%rsp)
+	.else
 	movq_cfi rax, 4*8
+	.endif
 
 	.if \save_r891011
 	movq_cfi r8,  3*8
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 2fac134..0bd6d3c 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -404,8 +404,8 @@ GLOBAL(system_call_after_swapgs)
 	 * and short:
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	SAVE_ARGS 8,0
-	movq  %rax,ORIG_RAX-ARGOFFSET(%rsp)
+	SAVE_ARGS 8, 0, rax_enosys=1
+	movq_cfi rax,(ORIG_RAX-ARGOFFSET)
 	movq  %rcx,RIP-ARGOFFSET(%rsp)
 	CFI_REL_OFFSET rip,RIP-ARGOFFSET
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
@@ -417,7 +417,7 @@ system_call_fastpath:
 	andl $__SYSCALL_MASK,%eax
 	cmpl $__NR_syscall_max,%eax
 #endif
-	ja badsys
+	ja ret_from_sys_call  /* and return regs->ax */
 	movq %r10,%rcx
 	call *sys_call_table(,%rax,8)  # XXX:	 rip relative
 	movq %rax,RAX-ARGOFFSET(%rsp)
@@ -476,10 +476,6 @@ sysret_signal:
 	FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
 	jmp int_check_syscall_exit_work
 
-badsys:
-	movq $-ENOSYS,RAX-ARGOFFSET(%rsp)
-	jmp ret_from_sys_call
-
 #ifdef CONFIG_AUDITSYSCALL
 	/*
 	 * Fast path for syscall audit without full syscall trace.
@@ -519,7 +515,6 @@ tracesys:
 	jz auditsys
 #endif
 	SAVE_REST
-	movq $-ENOSYS,RAX(%rsp) /* ptrace can change this for a bad syscall */
 	FIXUP_TOP_OF_STACK %rdi
 	movq %rsp,%rdi
 	call syscall_trace_enter
@@ -536,7 +531,7 @@ tracesys:
 	andl $__SYSCALL_MASK,%eax
 	cmpl $__NR_syscall_max,%eax
 #endif
-	ja   int_ret_from_sys_call	/* RAX(%rsp) set to -ENOSYS above */
+	ja   int_ret_from_sys_call	/* RAX(%rsp) is already set */
 	movq %r10,%rcx	/* fixup for C */
 	call *sys_call_table(,%rax,8)
 	movq %rax,RAX-ARGOFFSET(%rsp)

* [tip:x86/seccomp] x86_64, entry: Use split-phase syscall_trace_enter for 64-bit syscalls
  2014-09-05 22:13   ` [PATCH v5 5/5] x86_64, entry: " Andy Lutomirski
  (?)
@ 2014-09-09  2:44   ` tip-bot for Andy Lutomirski
  -1 siblings, 0 replies; 82+ messages in thread
From: tip-bot for Andy Lutomirski @ 2014-09-09  2:44 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, luto, hpa, mingo, tglx, hpa

Commit-ID:  1dcf74f6edfc3a9acd84d83d8865dd9e2a3b1d1e
Gitweb:     http://git.kernel.org/tip/1dcf74f6edfc3a9acd84d83d8865dd9e2a3b1d1e
Author:     Andy Lutomirski <luto@amacapital.net>
AuthorDate: Fri, 5 Sep 2014 15:13:56 -0700
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Mon, 8 Sep 2014 14:14:12 -0700

x86_64, entry: Use split-phase syscall_trace_enter for 64-bit syscalls

On KVM on my box, this reduces the overhead from an always-accept
seccomp filter from ~130ns to ~17ns.  Most of that comes from
avoiding IRET on every syscall when seccomp is enabled.

In extremely approximate hacked-up benchmarking, just bypassing IRET
saves about 80ns, so there's another 43ns of savings here from
simplifying the seccomp path.
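
As a rough cross-check of these figures, here is a small sketch that uses
only the approximate numbers quoted in this message; the helper name is
hypothetical.  Going from ~130ns to ~17ns removes about 113ns, or roughly
87% of the overhead, which agrees with the 87% quoted in the cover letter.
With the rounded inputs, the part left after subtracting the ~80ns IRET
cost comes out nearer 33ns than 43ns, so the split should be read as
approximate, as the text itself says.

```c
#include <assert.h>

/*
 * Percentage of overhead removed, rounded to the nearest integer.
 * Assumes before_ns > 0 and after_ns <= before_ns.
 */
static unsigned int overhead_reduction_pct(unsigned int before_ns,
					   unsigned int after_ns)
{
	unsigned int saved;

	assert(before_ns > 0 && after_ns <= before_ns);
	saved = before_ns - after_ns;
	return (saved * 100 + before_ns / 2) / before_ns;
}
```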

The diffstat is also rather nice :)

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/a3dbd267ee990110478d349f78cccfdac5497a84.1409954077.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/kernel/entry_64.S | 38 +++++++++++++++-----------------------
 1 file changed, 15 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 0bd6d3c..df088bb 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -478,22 +478,6 @@ sysret_signal:
 
 #ifdef CONFIG_AUDITSYSCALL
 	/*
-	 * Fast path for syscall audit without full syscall trace.
-	 * We just call __audit_syscall_entry() directly, and then
-	 * jump back to the normal fast path.
-	 */
-auditsys:
-	movq %r10,%r9			/* 6th arg: 4th syscall arg */
-	movq %rdx,%r8			/* 5th arg: 3rd syscall arg */
-	movq %rsi,%rcx			/* 4th arg: 2nd syscall arg */
-	movq %rdi,%rdx			/* 3rd arg: 1st syscall arg */
-	movq %rax,%rsi			/* 2nd arg: syscall number */
-	movl $AUDIT_ARCH_X86_64,%edi	/* 1st arg: audit arch */
-	call __audit_syscall_entry
-	LOAD_ARGS 0		/* reload call-clobbered registers */
-	jmp system_call_fastpath
-
-	/*
 	 * Return fast path for syscall audit.  Call __audit_syscall_exit()
 	 * directly and then jump back to the fast path with TIF_SYSCALL_AUDIT
 	 * masked off.
@@ -510,17 +494,25 @@ sysret_audit:
 
 	/* Do syscall tracing */
 tracesys:
-#ifdef CONFIG_AUDITSYSCALL
-	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
-	jz auditsys
-#endif
+	leaq -REST_SKIP(%rsp), %rdi
+	movq $AUDIT_ARCH_X86_64, %rsi
+	call syscall_trace_enter_phase1
+	test %rax, %rax
+	jnz tracesys_phase2		/* if needed, run the slow path */
+	LOAD_ARGS 0			/* else restore clobbered regs */
+	jmp system_call_fastpath	/*      and return to the fast path */
+
+tracesys_phase2:
 	SAVE_REST
 	FIXUP_TOP_OF_STACK %rdi
-	movq %rsp,%rdi
-	call syscall_trace_enter
+	movq %rsp, %rdi
+	movq $AUDIT_ARCH_X86_64, %rsi
+	movq %rax,%rdx
+	call syscall_trace_enter_phase2
+
 	/*
 	 * Reload arg registers from stack in case ptrace changed them.
-	 * We don't reload %rax because syscall_trace_enter() returned
+	 * We don't reload %rax because syscall_trace_enter_phase2() returned
 	 * the value it wants us to use in the table lookup.
 	 */
 	LOAD_ARGS ARGOFFSET, 1

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
  2014-09-05 22:13   ` Andy Lutomirski
@ 2015-02-05 21:19     ` Dmitry V. Levin
  -1 siblings, 0 replies; 82+ messages in thread
From: Dmitry V. Levin @ 2015-02-05 21:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Kees Cook, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, linux-mips, linux-arch, linux-security-module,
	Alexei Starovoitov, hpa, Frederic Weisbecker

Hi,

On Fri, Sep 05, 2014 at 03:13:54PM -0700, Andy Lutomirski wrote:
> This splits syscall_trace_enter into syscall_trace_enter_phase1 and
> syscall_trace_enter_phase2.  Only phase 2 has full pt_regs, and only
> phase 2 is permitted to modify any of pt_regs except for orig_ax.

This breaks ptrace, see below.

> The intent is that phase 1 can be called from the syscall fast path.
> 
> In this implementation, phase1 can handle any combination of
> TIF_NOHZ (RCU context tracking), TIF_SECCOMP, and TIF_SYSCALL_AUDIT,
> unless seccomp requests a ptrace event, in which case phase2 is
> forced.
> 
> In principle, this could yield a big speedup for TIF_NOHZ as well as
> for TIF_SECCOMP if syscall exit work were similarly split up.
> 
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  arch/x86/include/asm/ptrace.h |   5 ++
>  arch/x86/kernel/ptrace.c      | 157 +++++++++++++++++++++++++++++++++++-------
>  2 files changed, 138 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
> index 6205f0c434db..86fc2bb82287 100644
> --- a/arch/x86/include/asm/ptrace.h
> +++ b/arch/x86/include/asm/ptrace.h
> @@ -75,6 +75,11 @@ convert_ip_to_linear(struct task_struct *child, struct pt_regs *regs);
>  extern void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
>  			 int error_code, int si_code);
>  
> +
> +extern unsigned long syscall_trace_enter_phase1(struct pt_regs *, u32 arch);
> +extern long syscall_trace_enter_phase2(struct pt_regs *, u32 arch,
> +				       unsigned long phase1_result);
> +
>  extern long syscall_trace_enter(struct pt_regs *);
>  extern void syscall_trace_leave(struct pt_regs *);
>  
> diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
> index bbf338a04a5d..29576c244699 100644
> --- a/arch/x86/kernel/ptrace.c
> +++ b/arch/x86/kernel/ptrace.c
> @@ -1441,20 +1441,126 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
>  	force_sig_info(SIGTRAP, &info, tsk);
>  }
>  
> +static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
> +{
> +#ifdef CONFIG_X86_64
> +	if (arch == AUDIT_ARCH_X86_64) {
> +		audit_syscall_entry(arch, regs->orig_ax, regs->di,
> +				    regs->si, regs->dx, regs->r10);
> +	} else
> +#endif
> +	{
> +		audit_syscall_entry(arch, regs->orig_ax, regs->bx,
> +				    regs->cx, regs->dx, regs->si);
> +	}
> +}
> +
>  /*
> - * We must return the syscall number to actually look up in the table.
> - * This can be -1L to skip running any syscall at all.
> + * We can return 0 to resume the syscall or anything else to go to phase
> + * 2.  If we resume the syscall, we need to put something appropriate in
> + * regs->orig_ax.
> + *
> + * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
> + * are fully functional.
> + *
> + * For phase 2's benefit, our return value is:
> + * 0:			resume the syscall
> + * 1:			go to phase 2; no seccomp phase 2 needed
> + * anything else:	go to phase 2; pass return value to seccomp
>   */
> -long syscall_trace_enter(struct pt_regs *regs)
> +unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
>  {
> -	long ret = 0;
> +	unsigned long ret = 0;
> +	u32 work;
> +
> +	BUG_ON(regs != task_pt_regs(current));
> +
> +	work = ACCESS_ONCE(current_thread_info()->flags) &
> +		_TIF_WORK_SYSCALL_ENTRY;
>  
>  	/*
>  	 * If TIF_NOHZ is set, we are required to call user_exit() before
>  	 * doing anything that could touch RCU.
>  	 */
> -	if (test_thread_flag(TIF_NOHZ))
> +	if (work & _TIF_NOHZ) {
>  		user_exit();
> +		work &= ~_TIF_NOHZ;
> +	}
> +
> +#ifdef CONFIG_SECCOMP
> +	/*
> +	 * Do seccomp first -- it should minimize exposure of other
> +	 * code, and keeping seccomp fast is probably more valuable
> +	 * than the rest of this.
> +	 */
> +	if (work & _TIF_SECCOMP) {
> +		struct seccomp_data sd;
> +
> +		sd.arch = arch;
> +		sd.nr = regs->orig_ax;
> +		sd.instruction_pointer = regs->ip;
> +#ifdef CONFIG_X86_64
> +		if (arch == AUDIT_ARCH_X86_64) {
> +			sd.args[0] = regs->di;
> +			sd.args[1] = regs->si;
> +			sd.args[2] = regs->dx;
> +			sd.args[3] = regs->r10;
> +			sd.args[4] = regs->r8;
> +			sd.args[5] = regs->r9;
> +		} else
> +#endif
> +		{
> +			sd.args[0] = regs->bx;
> +			sd.args[1] = regs->cx;
> +			sd.args[2] = regs->dx;
> +			sd.args[3] = regs->si;
> +			sd.args[4] = regs->di;
> +			sd.args[5] = regs->bp;
> +		}
> +
> +		BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
> +		BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
> +
> +		ret = seccomp_phase1(&sd);
> +		if (ret == SECCOMP_PHASE1_SKIP) {
> +			regs->orig_ax = -1;

How is the tracer expected to get the correct syscall number after that?


-- 
ldv

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
  2015-02-05 21:19     ` Dmitry V. Levin
  (?)
@ 2015-02-05 21:27       ` Kees Cook
  -1 siblings, 0 replies; 82+ messages in thread
From: Kees Cook @ 2015-02-05 21:27 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Andy Lutomirski, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker

On Thu, Feb 5, 2015 at 1:19 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
> Hi,
>
> On Fri, Sep 05, 2014 at 03:13:54PM -0700, Andy Lutomirski wrote:
>> This splits syscall_trace_enter into syscall_trace_enter_phase1 and
>> syscall_trace_enter_phase2.  Only phase 2 has full pt_regs, and only
>> phase 2 is permitted to modify any of pt_regs except for orig_ax.
>
> This breaks ptrace, see below.
>
>> The intent is that phase 1 can be called from the syscall fast path.
>>
>> In this implementation, phase1 can handle any combination of
>> TIF_NOHZ (RCU context tracking), TIF_SECCOMP, and TIF_SYSCALL_AUDIT,
>> unless seccomp requests a ptrace event, in which case phase2 is
>> forced.
>>
>> In principle, this could yield a big speedup for TIF_NOHZ as well as
>> for TIF_SECCOMP if syscall exit work were similarly split up.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>  arch/x86/include/asm/ptrace.h |   5 ++
>>  arch/x86/kernel/ptrace.c      | 157 +++++++++++++++++++++++++++++++++++-------
>>  2 files changed, 138 insertions(+), 24 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
>> index 6205f0c434db..86fc2bb82287 100644
>> --- a/arch/x86/include/asm/ptrace.h
>> +++ b/arch/x86/include/asm/ptrace.h
>> @@ -75,6 +75,11 @@ convert_ip_to_linear(struct task_struct *child, struct pt_regs *regs);
>>  extern void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
>>                        int error_code, int si_code);
>>
>> +
>> +extern unsigned long syscall_trace_enter_phase1(struct pt_regs *, u32 arch);
>> +extern long syscall_trace_enter_phase2(struct pt_regs *, u32 arch,
>> +                                    unsigned long phase1_result);
>> +
>>  extern long syscall_trace_enter(struct pt_regs *);
>>  extern void syscall_trace_leave(struct pt_regs *);
>>
>> diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
>> index bbf338a04a5d..29576c244699 100644
>> --- a/arch/x86/kernel/ptrace.c
>> +++ b/arch/x86/kernel/ptrace.c
>> @@ -1441,20 +1441,126 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
>>       force_sig_info(SIGTRAP, &info, tsk);
>>  }
>>
>> +static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
>> +{
>> +#ifdef CONFIG_X86_64
>> +     if (arch == AUDIT_ARCH_X86_64) {
>> +             audit_syscall_entry(arch, regs->orig_ax, regs->di,
>> +                                 regs->si, regs->dx, regs->r10);
>> +     } else
>> +#endif
>> +     {
>> +             audit_syscall_entry(arch, regs->orig_ax, regs->bx,
>> +                                 regs->cx, regs->dx, regs->si);
>> +     }
>> +}
>> +
>>  /*
>> - * We must return the syscall number to actually look up in the table.
>> - * This can be -1L to skip running any syscall at all.
>> + * We can return 0 to resume the syscall or anything else to go to phase
>> + * 2.  If we resume the syscall, we need to put something appropriate in
>> + * regs->orig_ax.
>> + *
>> + * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
>> + * are fully functional.
>> + *
>> + * For phase 2's benefit, our return value is:
>> + * 0:                        resume the syscall
>> + * 1:                        go to phase 2; no seccomp phase 2 needed
>> + * anything else:    go to phase 2; pass return value to seccomp
>>   */
>> -long syscall_trace_enter(struct pt_regs *regs)
>> +unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
>>  {
>> -     long ret = 0;
>> +     unsigned long ret = 0;
>> +     u32 work;
>> +
>> +     BUG_ON(regs != task_pt_regs(current));
>> +
>> +     work = ACCESS_ONCE(current_thread_info()->flags) &
>> +             _TIF_WORK_SYSCALL_ENTRY;
>>
>>       /*
>>        * If TIF_NOHZ is set, we are required to call user_exit() before
>>        * doing anything that could touch RCU.
>>        */
>> -     if (test_thread_flag(TIF_NOHZ))
>> +     if (work & _TIF_NOHZ) {
>>               user_exit();
>> +             work &= ~_TIF_NOHZ;
>> +     }
>> +
>> +#ifdef CONFIG_SECCOMP
>> +     /*
>> +      * Do seccomp first -- it should minimize exposure of other
>> +      * code, and keeping seccomp fast is probably more valuable
>> +      * than the rest of this.
>> +      */
>> +     if (work & _TIF_SECCOMP) {
>> +             struct seccomp_data sd;
>> +
>> +             sd.arch = arch;
>> +             sd.nr = regs->orig_ax;
>> +             sd.instruction_pointer = regs->ip;
>> +#ifdef CONFIG_X86_64
>> +             if (arch == AUDIT_ARCH_X86_64) {
>> +                     sd.args[0] = regs->di;
>> +                     sd.args[1] = regs->si;
>> +                     sd.args[2] = regs->dx;
>> +                     sd.args[3] = regs->r10;
>> +                     sd.args[4] = regs->r8;
>> +                     sd.args[5] = regs->r9;
>> +             } else
>> +#endif
>> +             {
>> +                     sd.args[0] = regs->bx;
>> +                     sd.args[1] = regs->cx;
>> +                     sd.args[2] = regs->dx;
>> +                     sd.args[3] = regs->si;
>> +                     sd.args[4] = regs->di;
>> +                     sd.args[5] = regs->bp;
>> +             }
>> +
>> +             BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
>> +             BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
>> +
>> +             ret = seccomp_phase1(&sd);
>> +             if (ret == SECCOMP_PHASE1_SKIP) {
>> +                     regs->orig_ax = -1;
>
> How is the tracer expected to get the correct syscall number after that?

There shouldn't be a tracer if a skip is encountered. (A seccomp skip
would skip ptrace.) This behavior hasn't changed, but maybe I don't
see what you mean? (I haven't encountered any problems with syscall
tracing as a result of these changes.)
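
For readers following the exchange, the phase 1 control flow in the quoted
patch can be modelled roughly as below.  This is a simplified sketch under
stated assumptions: fake_regs, fake_phase1() and the other_work flag are
hypothetical, and the nohz and audit-only cases are left out.  The two
constants mirror the values the patch pins with BUILD_BUG_ON.

```c
#include <assert.h>
#include <stddef.h>

/* Values mirror the BUILD_BUG_ON checks in the patch; names are hypothetical. */
#define FAKE_PHASE1_OK		0UL
#define FAKE_PHASE1_SKIP	1UL

/* Hypothetical stand-in for the saved register frame. */
struct fake_regs {
	long orig_ax;
};

/*
 * verdict:    what the seccomp filter decided in phase 1.
 * other_work: nonzero if entry work other than seccomp is still pending
 *             (for example, a syscall tracer is attached).
 * Returns 0 to resume on the fast path, anything else to enter phase 2.
 */
static unsigned long fake_phase1(struct fake_regs *regs,
				 unsigned long verdict, int other_work)
{
	assert(regs != NULL);

	if (verdict == FAKE_PHASE1_SKIP) {
		regs->orig_ax = -1;	/* the original number is overwritten */
		verdict = FAKE_PHASE1_OK;
	} else if (verdict != FAKE_PHASE1_OK) {
		return verdict;		/* go directly to phase 2 */
	}

	if (!other_work)
		return 0;		/* seccomp only: stay on the fast path */

	return 1;			/* phase 2, no seccomp phase 2 needed */
}
```

In this model, a skip with no other work pending returns 0 with orig_ax set
to -1.  A skip with other work pending returns 1, so phase 2 would run with
orig_ax already overwritten.  Whether that second combination matters in
practice is the question being discussed here.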

-Kees

-- 
Kees Cook
Chrome OS Security

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
@ 2015-02-05 21:27       ` Kees Cook
  0 siblings, 0 replies; 82+ messages in thread
From: Kees Cook @ 2015-02-05 21:27 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Andy Lutomirski, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker

On Thu, Feb 5, 2015 at 1:19 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
> Hi,
>
> On Fri, Sep 05, 2014 at 03:13:54PM -0700, Andy Lutomirski wrote:
>> This splits syscall_trace_enter into syscall_trace_enter_phase1 and
>> syscall_trace_enter_phase2.  Only phase 2 has full pt_regs, and only
>> phase 2 is permitted to modify any of pt_regs except for orig_ax.
>
> This breaks ptrace, see below.
>
>> The intent is that phase 1 can be called from the syscall fast path.
>>
>> In this implementation, phase1 can handle any combination of
>> TIF_NOHZ (RCU context tracking), TIF_SECCOMP, and TIF_SYSCALL_AUDIT,
>> unless seccomp requests a ptrace event, in which case phase2 is
>> forced.
>>
>> In principle, this could yield a big speedup for TIF_NOHZ as well as
>> for TIF_SECCOMP if syscall exit work were similarly split up.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>  arch/x86/include/asm/ptrace.h |   5 ++
>>  arch/x86/kernel/ptrace.c      | 157 +++++++++++++++++++++++++++++++++++-------
>>  2 files changed, 138 insertions(+), 24 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
>> index 6205f0c434db..86fc2bb82287 100644
>> --- a/arch/x86/include/asm/ptrace.h
>> +++ b/arch/x86/include/asm/ptrace.h
>> @@ -75,6 +75,11 @@ convert_ip_to_linear(struct task_struct *child, struct pt_regs *regs);
>>  extern void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
>>                        int error_code, int si_code);
>>
>> +
>> +extern unsigned long syscall_trace_enter_phase1(struct pt_regs *, u32 arch);
>> +extern long syscall_trace_enter_phase2(struct pt_regs *, u32 arch,
>> +                                    unsigned long phase1_result);
>> +
>>  extern long syscall_trace_enter(struct pt_regs *);
>>  extern void syscall_trace_leave(struct pt_regs *);
>>
>> diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
>> index bbf338a04a5d..29576c244699 100644
>> --- a/arch/x86/kernel/ptrace.c
>> +++ b/arch/x86/kernel/ptrace.c
>> @@ -1441,20 +1441,126 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
>>       force_sig_info(SIGTRAP, &info, tsk);
>>  }
>>
>> +static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
>> +{
>> +#ifdef CONFIG_X86_64
>> +     if (arch == AUDIT_ARCH_X86_64) {
>> +             audit_syscall_entry(arch, regs->orig_ax, regs->di,
>> +                                 regs->si, regs->dx, regs->r10);
>> +     } else
>> +#endif
>> +     {
>> +             audit_syscall_entry(arch, regs->orig_ax, regs->bx,
>> +                                 regs->cx, regs->dx, regs->si);
>> +     }
>> +}
>> +
>>  /*
>> - * We must return the syscall number to actually look up in the table.
>> - * This can be -1L to skip running any syscall at all.
>> + * We can return 0 to resume the syscall or anything else to go to phase
>> + * 2.  If we resume the syscall, we need to put something appropriate in
>> + * regs->orig_ax.
>> + *
>> + * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
>> + * are fully functional.
>> + *
>> + * For phase 2's benefit, our return value is:
>> + * 0:                        resume the syscall
>> + * 1:                        go to phase 2; no seccomp phase 2 needed
>> + * anything else:    go to phase 2; pass return value to seccomp
>>   */
>> -long syscall_trace_enter(struct pt_regs *regs)
>> +unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
>>  {
>> -     long ret = 0;
>> +     unsigned long ret = 0;
>> +     u32 work;
>> +
>> +     BUG_ON(regs != task_pt_regs(current));
>> +
>> +     work = ACCESS_ONCE(current_thread_info()->flags) &
>> +             _TIF_WORK_SYSCALL_ENTRY;
>>
>>       /*
>>        * If TIF_NOHZ is set, we are required to call user_exit() before
>>        * doing anything that could touch RCU.
>>        */
>> -     if (test_thread_flag(TIF_NOHZ))
>> +     if (work & _TIF_NOHZ) {
>>               user_exit();
>> +             work &= ~TIF_NOHZ;
>> +     }
>> +
>> +#ifdef CONFIG_SECCOMP
>> +     /*
>> +      * Do seccomp first -- it should minimize exposure of other
>> +      * code, and keeping seccomp fast is probably more valuable
>> +      * than the rest of this.
>> +      */
>> +     if (work & _TIF_SECCOMP) {
>> +             struct seccomp_data sd;
>> +
>> +             sd.arch = arch;
>> +             sd.nr = regs->orig_ax;
>> +             sd.instruction_pointer = regs->ip;
>> +#ifdef CONFIG_X86_64
>> +             if (arch == AUDIT_ARCH_X86_64) {
>> +                     sd.args[0] = regs->di;
>> +                     sd.args[1] = regs->si;
>> +                     sd.args[2] = regs->dx;
>> +                     sd.args[3] = regs->r10;
>> +                     sd.args[4] = regs->r8;
>> +                     sd.args[5] = regs->r9;
>> +             } else
>> +#endif
>> +             {
>> +                     sd.args[0] = regs->bx;
>> +                     sd.args[1] = regs->cx;
>> +                     sd.args[2] = regs->dx;
>> +                     sd.args[3] = regs->si;
>> +                     sd.args[4] = regs->di;
>> +                     sd.args[5] = regs->bp;
>> +             }
>> +
>> +             BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
>> +             BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
>> +
>> +             ret = seccomp_phase1(&sd);
>> +             if (ret == SECCOMP_PHASE1_SKIP) {
>> +                     regs->orig_ax = -1;
>
> How the tracer is expected to get the correct syscall number after that?

There shouldn't be a tracer if a skip is encountered. (A seccomp skip
would skip ptrace.) This behavior hasn't changed, but maybe I don't
see what you mean? (I haven't encountered any problems with syscall
tracing as a result of these changes.)

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 82+ messages in thread
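
For readers following the quoted comment above syscall_trace_enter_phase1(), here is a minimal user-space sketch of the three-way return contract it documents (0, 1, anything else). The enum and the function classify_phase1_result() are hypothetical names made up for illustration; they are not in the patch, and this is not kernel code.

```c
/*
 * User-space model of the phase-1 return contract described in the
 * quoted comment.  All names here are hypothetical, for illustration.
 */
enum phase1_action {
	PHASE1_RESUME_SYSCALL,      /* 0: resume the syscall */
	PHASE1_PHASE2_NO_SECCOMP,   /* 1: go to phase 2; no seccomp phase 2 needed */
	PHASE1_PHASE2_WITH_SECCOMP  /* anything else: phase 2; value goes to seccomp */
};

/* Map a raw phase-1 return value onto the action the caller must take. */
static enum phase1_action classify_phase1_result(unsigned long ret)
{
	if (ret == 0)
		return PHASE1_RESUME_SYSCALL;
	if (ret == 1)
		return PHASE1_PHASE2_NO_SECCOMP;
	return PHASE1_PHASE2_WITH_SECCOMP;
}
```

Only a return value of 0 avoids phase 2 entirely, which is why the fast path can test the result against zero and branch to the slow path otherwise.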

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
  2015-02-05 21:27       ` Kees Cook
  (?)
@ 2015-02-05 21:40         ` Dmitry V. Levin
  -1 siblings, 0 replies; 82+ messages in thread
From: Dmitry V. Levin @ 2015-02-05 21:40 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker

On Thu, Feb 05, 2015 at 01:27:16PM -0800, Kees Cook wrote:
> On Thu, Feb 5, 2015 at 1:19 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
> > Hi,
> >
> > On Fri, Sep 05, 2014 at 03:13:54PM -0700, Andy Lutomirski wrote:
> >> This splits syscall_trace_enter into syscall_trace_enter_phase1 and
> >> syscall_trace_enter_phase2.  Only phase 2 has full pt_regs, and only
> >> phase 2 is permitted to modify any of pt_regs except for orig_ax.
> >
> > This breaks ptrace, see below.
> >
> >> The intent is that phase 1 can be called from the syscall fast path.
> >>
> >> In this implementation, phase1 can handle any combination of
> >> TIF_NOHZ (RCU context tracking), TIF_SECCOMP, and TIF_SYSCALL_AUDIT,
> >> unless seccomp requests a ptrace event, in which case phase2 is
> >> forced.
> >>
> >> In principle, this could yield a big speedup for TIF_NOHZ as well as
> >> for TIF_SECCOMP if syscall exit work were similarly split up.
> >>
> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> >> ---
> >>  arch/x86/include/asm/ptrace.h |   5 ++
> >>  arch/x86/kernel/ptrace.c      | 157 +++++++++++++++++++++++++++++++++++-------
> >>  2 files changed, 138 insertions(+), 24 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
> >> index 6205f0c434db..86fc2bb82287 100644
> >> --- a/arch/x86/include/asm/ptrace.h
> >> +++ b/arch/x86/include/asm/ptrace.h
> >> @@ -75,6 +75,11 @@ convert_ip_to_linear(struct task_struct *child, struct pt_regs *regs);
> >>  extern void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
> >>                        int error_code, int si_code);
> >>
> >> +
> >> +extern unsigned long syscall_trace_enter_phase1(struct pt_regs *, u32 arch);
> >> +extern long syscall_trace_enter_phase2(struct pt_regs *, u32 arch,
> >> +                                    unsigned long phase1_result);
> >> +
> >>  extern long syscall_trace_enter(struct pt_regs *);
> >>  extern void syscall_trace_leave(struct pt_regs *);
> >>
> >> diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
> >> index bbf338a04a5d..29576c244699 100644
> >> --- a/arch/x86/kernel/ptrace.c
> >> +++ b/arch/x86/kernel/ptrace.c
> >> @@ -1441,20 +1441,126 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
> >>       force_sig_info(SIGTRAP, &info, tsk);
> >>  }
> >>
> >> +static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
> >> +{
> >> +#ifdef CONFIG_X86_64
> >> +     if (arch == AUDIT_ARCH_X86_64) {
> >> +             audit_syscall_entry(arch, regs->orig_ax, regs->di,
> >> +                                 regs->si, regs->dx, regs->r10);
> >> +     } else
> >> +#endif
> >> +     {
> >> +             audit_syscall_entry(arch, regs->orig_ax, regs->bx,
> >> +                                 regs->cx, regs->dx, regs->si);
> >> +     }
> >> +}
> >> +
> >>  /*
> >> - * We must return the syscall number to actually look up in the table.
> >> - * This can be -1L to skip running any syscall at all.
> >> + * We can return 0 to resume the syscall or anything else to go to phase
> >> + * 2.  If we resume the syscall, we need to put something appropriate in
> >> + * regs->orig_ax.
> >> + *
> >> + * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
> >> + * are fully functional.
> >> + *
> >> + * For phase 2's benefit, our return value is:
> >> + * 0:                        resume the syscall
> >> + * 1:                        go to phase 2; no seccomp phase 2 needed
> >> + * anything else:    go to phase 2; pass return value to seccomp
> >>   */
> >> -long syscall_trace_enter(struct pt_regs *regs)
> >> +unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
> >>  {
> >> -     long ret = 0;
> >> +     unsigned long ret = 0;
> >> +     u32 work;
> >> +
> >> +     BUG_ON(regs != task_pt_regs(current));
> >> +
> >> +     work = ACCESS_ONCE(current_thread_info()->flags) &
> >> +             _TIF_WORK_SYSCALL_ENTRY;
> >>
> >>       /*
> >>        * If TIF_NOHZ is set, we are required to call user_exit() before
> >>        * doing anything that could touch RCU.
> >>        */
> >> -     if (test_thread_flag(TIF_NOHZ))
> >> +     if (work & _TIF_NOHZ) {
> >>               user_exit();
> >> +             work &= ~TIF_NOHZ;
> >> +     }
> >> +
> >> +#ifdef CONFIG_SECCOMP
> >> +     /*
> >> +      * Do seccomp first -- it should minimize exposure of other
> >> +      * code, and keeping seccomp fast is probably more valuable
> >> +      * than the rest of this.
> >> +      */
> >> +     if (work & _TIF_SECCOMP) {
> >> +             struct seccomp_data sd;
> >> +
> >> +             sd.arch = arch;
> >> +             sd.nr = regs->orig_ax;
> >> +             sd.instruction_pointer = regs->ip;
> >> +#ifdef CONFIG_X86_64
> >> +             if (arch == AUDIT_ARCH_X86_64) {
> >> +                     sd.args[0] = regs->di;
> >> +                     sd.args[1] = regs->si;
> >> +                     sd.args[2] = regs->dx;
> >> +                     sd.args[3] = regs->r10;
> >> +                     sd.args[4] = regs->r8;
> >> +                     sd.args[5] = regs->r9;
> >> +             } else
> >> +#endif
> >> +             {
> >> +                     sd.args[0] = regs->bx;
> >> +                     sd.args[1] = regs->cx;
> >> +                     sd.args[2] = regs->dx;
> >> +                     sd.args[3] = regs->si;
> >> +                     sd.args[4] = regs->di;
> >> +                     sd.args[5] = regs->bp;
> >> +             }
> >> +
> >> +             BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
> >> +             BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
> >> +
> >> +             ret = seccomp_phase1(&sd);
> >> +             if (ret == SECCOMP_PHASE1_SKIP) {
> >> +                     regs->orig_ax = -1;
> >
> > How the tracer is expected to get the correct syscall number after that?
> 
> There shouldn't be a tracer if a skip is encountered. (A seccomp skip
> would skip ptrace.) This behavior hasn't changed, but maybe I don't
> see what you mean? (I haven't encountered any problems with syscall
> tracing as a result of these changes.)

SECCOMP_RET_ERRNO leads to SECCOMP_PHASE1_SKIP, and if there is a tracer,
it will see -1 as the syscall number.

I found this while testing a strace parser for
SECCOMP_MODE_FILTER/SECCOMP_SET_MODE_FILTER, so the problem is quite real.


-- 
ldv

^ permalink raw reply	[flat|nested] 82+ messages in thread
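
The effect described in the reply above can be modelled in user space. The sketch below reproduces only the quoted skip branch (regs->orig_ax = -1 when seccomp_phase1() returns SECCOMP_PHASE1_SKIP) and shows what a later reader of orig_ax would observe. struct model_regs, model_phase1_seccomp() and model_tracer_syscall_nr() are hypothetical stand-ins, not kernel interfaces, and the syscall number 39 is an arbitrary example value.

```c
/* Values pinned by the BUILD_BUG_ON() checks in the quoted patch. */
enum { SECCOMP_PHASE1_OK = 0, SECCOMP_PHASE1_SKIP = 1 };
_Static_assert(SECCOMP_PHASE1_OK == 0, "phase 1 OK must be 0");
_Static_assert(SECCOMP_PHASE1_SKIP == 1, "phase 1 SKIP must be 1");

/* Hypothetical stand-in for the one pt_regs field that matters here. */
struct model_regs {
	long orig_ax;
};

/* Models only the quoted skip branch of syscall_trace_enter_phase1(). */
static void model_phase1_seccomp(struct model_regs *regs,
				 unsigned long seccomp_ret)
{
	if (seccomp_ret == SECCOMP_PHASE1_SKIP)
		regs->orig_ax = -1;	/* the original syscall number is lost */
}

/* What a tracer inspecting the registers afterwards would read back. */
static long model_tracer_syscall_nr(const struct model_regs *regs)
{
	return regs->orig_ax;
}
```

In this model the syscall number survives a SECCOMP_PHASE1_OK result but is replaced by -1 after a skip, so any observer that runs after phase 1 can no longer recover which syscall was attempted.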

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
  2015-02-05 21:40         ` Dmitry V. Levin
  (?)
@ 2015-02-05 21:52           ` Andy Lutomirski
  -1 siblings, 0 replies; 82+ messages in thread
From: Andy Lutomirski @ 2015-02-05 21:52 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Kees Cook, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker

On Thu, Feb 5, 2015 at 1:40 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
> On Thu, Feb 05, 2015 at 01:27:16PM -0800, Kees Cook wrote:
>> On Thu, Feb 5, 2015 at 1:19 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>> > Hi,
>> >
>> > On Fri, Sep 05, 2014 at 03:13:54PM -0700, Andy Lutomirski wrote:
>> >> This splits syscall_trace_enter into syscall_trace_enter_phase1 and
>> >> syscall_trace_enter_phase2.  Only phase 2 has full pt_regs, and only
>> >> phase 2 is permitted to modify any of pt_regs except for orig_ax.
>> >
>> > This breaks ptrace, see below.
>> >
>> >> The intent is that phase 1 can be called from the syscall fast path.
>> >>
>> >> In this implementation, phase1 can handle any combination of
>> >> TIF_NOHZ (RCU context tracking), TIF_SECCOMP, and TIF_SYSCALL_AUDIT,
>> >> unless seccomp requests a ptrace event, in which case phase2 is
>> >> forced.
>> >>
>> >> In principle, this could yield a big speedup for TIF_NOHZ as well as
>> >> for TIF_SECCOMP if syscall exit work were similarly split up.
>> >>
>> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> >> ---
>> >>  arch/x86/include/asm/ptrace.h |   5 ++
>> >>  arch/x86/kernel/ptrace.c      | 157 +++++++++++++++++++++++++++++++++++-------
>> >>  2 files changed, 138 insertions(+), 24 deletions(-)
>> >>
>> >> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
>> >> index 6205f0c434db..86fc2bb82287 100644
>> >> --- a/arch/x86/include/asm/ptrace.h
>> >> +++ b/arch/x86/include/asm/ptrace.h
>> >> @@ -75,6 +75,11 @@ convert_ip_to_linear(struct task_struct *child, struct pt_regs *regs);
>> >>  extern void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
>> >>                        int error_code, int si_code);
>> >>
>> >> +
>> >> +extern unsigned long syscall_trace_enter_phase1(struct pt_regs *, u32 arch);
>> >> +extern long syscall_trace_enter_phase2(struct pt_regs *, u32 arch,
>> >> +                                    unsigned long phase1_result);
>> >> +
>> >>  extern long syscall_trace_enter(struct pt_regs *);
>> >>  extern void syscall_trace_leave(struct pt_regs *);
>> >>
>> >> diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
>> >> index bbf338a04a5d..29576c244699 100644
>> >> --- a/arch/x86/kernel/ptrace.c
>> >> +++ b/arch/x86/kernel/ptrace.c
>> >> @@ -1441,20 +1441,126 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
>> >>       force_sig_info(SIGTRAP, &info, tsk);
>> >>  }
>> >>
>> >> +static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
>> >> +{
>> >> +#ifdef CONFIG_X86_64
>> >> +     if (arch == AUDIT_ARCH_X86_64) {
>> >> +             audit_syscall_entry(arch, regs->orig_ax, regs->di,
>> >> +                                 regs->si, regs->dx, regs->r10);
>> >> +     } else
>> >> +#endif
>> >> +     {
>> >> +             audit_syscall_entry(arch, regs->orig_ax, regs->bx,
>> >> +                                 regs->cx, regs->dx, regs->si);
>> >> +     }
>> >> +}
>> >> +
>> >>  /*
>> >> - * We must return the syscall number to actually look up in the table.
>> >> - * This can be -1L to skip running any syscall at all.
>> >> + * We can return 0 to resume the syscall or anything else to go to phase
>> >> + * 2.  If we resume the syscall, we need to put something appropriate in
>> >> + * regs->orig_ax.
>> >> + *
>> >> + * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
>> >> + * are fully functional.
>> >> + *
>> >> + * For phase 2's benefit, our return value is:
>> >> + * 0:                        resume the syscall
>> >> + * 1:                        go to phase 2; no seccomp phase 2 needed
>> >> + * anything else:    go to phase 2; pass return value to seccomp
>> >>   */
>> >> -long syscall_trace_enter(struct pt_regs *regs)
>> >> +unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
>> >>  {
>> >> -     long ret = 0;
>> >> +     unsigned long ret = 0;
>> >> +     u32 work;
>> >> +
>> >> +     BUG_ON(regs != task_pt_regs(current));
>> >> +
>> >> +     work = ACCESS_ONCE(current_thread_info()->flags) &
>> >> +             _TIF_WORK_SYSCALL_ENTRY;
>> >>
>> >>       /*
>> >>        * If TIF_NOHZ is set, we are required to call user_exit() before
>> >>        * doing anything that could touch RCU.
>> >>        */
>> >> -     if (test_thread_flag(TIF_NOHZ))
>> >> +     if (work & _TIF_NOHZ) {
>> >>               user_exit();
>> >> +             work &= ~_TIF_NOHZ;
>> >> +     }
>> >> +
>> >> +#ifdef CONFIG_SECCOMP
>> >> +     /*
>> >> +      * Do seccomp first -- it should minimize exposure of other
>> >> +      * code, and keeping seccomp fast is probably more valuable
>> >> +      * than the rest of this.
>> >> +      */
>> >> +     if (work & _TIF_SECCOMP) {
>> >> +             struct seccomp_data sd;
>> >> +
>> >> +             sd.arch = arch;
>> >> +             sd.nr = regs->orig_ax;
>> >> +             sd.instruction_pointer = regs->ip;
>> >> +#ifdef CONFIG_X86_64
>> >> +             if (arch == AUDIT_ARCH_X86_64) {
>> >> +                     sd.args[0] = regs->di;
>> >> +                     sd.args[1] = regs->si;
>> >> +                     sd.args[2] = regs->dx;
>> >> +                     sd.args[3] = regs->r10;
>> >> +                     sd.args[4] = regs->r8;
>> >> +                     sd.args[5] = regs->r9;
>> >> +             } else
>> >> +#endif
>> >> +             {
>> >> +                     sd.args[0] = regs->bx;
>> >> +                     sd.args[1] = regs->cx;
>> >> +                     sd.args[2] = regs->dx;
>> >> +                     sd.args[3] = regs->si;
>> >> +                     sd.args[4] = regs->di;
>> >> +                     sd.args[5] = regs->bp;
>> >> +             }
>> >> +
>> >> +             BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
>> >> +             BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
>> >> +
>> >> +             ret = seccomp_phase1(&sd);
>> >> +             if (ret == SECCOMP_PHASE1_SKIP) {
>> >> +                     regs->orig_ax = -1;
>> >
>> > How is the tracer expected to get the correct syscall number after that?
>>
>> There shouldn't be a tracer if a skip is encountered. (A seccomp skip
>> would skip ptrace.) This behavior hasn't changed, but maybe I don't
>> see what you mean? (I haven't encountered any problems with syscall
>> tracing as a result of these changes.)
>
> SECCOMP_RET_ERRNO leads to SECCOMP_PHASE1_SKIP, and if there is a tracer,
> it will get -1 as a syscall number.
>
> I've found this while testing a strace parser for
> SECCOMP_MODE_FILTER/SECCOMP_SET_MODE_FILTER, so the problem is quite real.
>
>

Hasn't it always been this way?

I admit that I kind of wish this worked the other way -- that is, I
think it would be nice to have a mode in which ptrace runs before
seccomp, which would close the ptrace hole (where ptrace can do things
that seccomp would disallow) and maybe have more comprehensible
results.
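
To make the ordering question concrete, here is a toy userspace model.
It is not kernel code: struct toy_regs, the toy_* functions and the
TOY_* verdicts are hypothetical names made up for illustration. It only
models one thing, namely that a seccomp skip overwrites orig_ax with -1,
so what a tracer reads depends on whether it looks before or after
seccomp has run.

```c
#include <stdbool.h>

/* Toy stand-in for the single pt_regs field discussed here (hypothetical). */
struct toy_regs {
	long orig_ax;	/* syscall number as a tracer would read it */
};

/* Hypothetical verdicts, loosely mirroring SECCOMP_PHASE1_OK / _SKIP. */
enum toy_verdict { TOY_OK = 0, TOY_SKIP = 1 };

/* Model of the seccomp step: a skip overwrites orig_ax with -1. */
static void toy_seccomp(struct toy_regs *regs, enum toy_verdict v)
{
	if (v == TOY_SKIP)
		regs->orig_ax = -1;
}

/*
 * Return the syscall number a tracer would observe.
 *
 * ptrace_first == false: the tracer looks after seccomp has run
 *                        (the ordering described in this thread).
 * ptrace_first == true:  the tracer looks before seccomp runs
 *                        (the mode wished for above).
 */
static long toy_tracer_sees(long nr, enum toy_verdict v, bool ptrace_first)
{
	struct toy_regs regs = { .orig_ax = nr };
	long seen;

	if (ptrace_first) {
		seen = regs.orig_ax;	/* tracer reads the real number */
		toy_seccomp(&regs, v);
	} else {
		toy_seccomp(&regs, v);
		seen = regs.orig_ax;	/* tracer reads whatever seccomp left */
	}
	return seen;
}
```

In this model, toy_tracer_sees(nr, TOY_SKIP, false) yields -1 for any
nr, while toy_tracer_sees(nr, TOY_SKIP, true) yields nr itself.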

--Andy

> --
> ldv



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
  2015-02-05 21:52           ` Andy Lutomirski
  (?)
@ 2015-02-05 23:12             ` Kees Cook
  -1 siblings, 0 replies; 82+ messages in thread
From: Kees Cook @ 2015-02-05 23:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dmitry V. Levin, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker

On Thu, Feb 5, 2015 at 1:52 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Feb 5, 2015 at 1:40 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>> On Thu, Feb 05, 2015 at 01:27:16PM -0800, Kees Cook wrote:
>>> On Thu, Feb 5, 2015 at 1:19 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>>> > Hi,
>>> >
>>> > On Fri, Sep 05, 2014 at 03:13:54PM -0700, Andy Lutomirski wrote:
>>> >> This splits syscall_trace_enter into syscall_trace_enter_phase1 and
>>> >> syscall_trace_enter_phase2.  Only phase 2 has full pt_regs, and only
>>> >> phase 2 is permitted to modify any of pt_regs except for orig_ax.
>>> >
>>> > This breaks ptrace, see below.
>>> >
>>> >> The intent is that phase 1 can be called from the syscall fast path.
>>> >>
>>> >> In this implementation, phase1 can handle any combination of
>>> >> TIF_NOHZ (RCU context tracking), TIF_SECCOMP, and TIF_SYSCALL_AUDIT,
>>> >> unless seccomp requests a ptrace event, in which case phase2 is
>>> >> forced.
>>> >>
>>> >> In principle, this could yield a big speedup for TIF_NOHZ as well as
>>> >> for TIF_SECCOMP if syscall exit work were similarly split up.
>>> >>
>>> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>>> >> ---
>>> >>  arch/x86/include/asm/ptrace.h |   5 ++
>>> >>  arch/x86/kernel/ptrace.c      | 157 +++++++++++++++++++++++++++++++++++-------
>>> >>  2 files changed, 138 insertions(+), 24 deletions(-)
>>> >>
>>> >> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
>>> >> index 6205f0c434db..86fc2bb82287 100644
>>> >> --- a/arch/x86/include/asm/ptrace.h
>>> >> +++ b/arch/x86/include/asm/ptrace.h
>>> >> @@ -75,6 +75,11 @@ convert_ip_to_linear(struct task_struct *child, struct pt_regs *regs);
>>> >>  extern void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
>>> >>                        int error_code, int si_code);
>>> >>
>>> >> +
>>> >> +extern unsigned long syscall_trace_enter_phase1(struct pt_regs *, u32 arch);
>>> >> +extern long syscall_trace_enter_phase2(struct pt_regs *, u32 arch,
>>> >> +                                    unsigned long phase1_result);
>>> >> +
>>> >>  extern long syscall_trace_enter(struct pt_regs *);
>>> >>  extern void syscall_trace_leave(struct pt_regs *);
>>> >>
>>> >> diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
>>> >> index bbf338a04a5d..29576c244699 100644
>>> >> --- a/arch/x86/kernel/ptrace.c
>>> >> +++ b/arch/x86/kernel/ptrace.c
>>> >> @@ -1441,20 +1441,126 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
>>> >>       force_sig_info(SIGTRAP, &info, tsk);
>>> >>  }
>>> >>
>>> >> +static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
>>> >> +{
>>> >> +#ifdef CONFIG_X86_64
>>> >> +     if (arch == AUDIT_ARCH_X86_64) {
>>> >> +             audit_syscall_entry(arch, regs->orig_ax, regs->di,
>>> >> +                                 regs->si, regs->dx, regs->r10);
>>> >> +     } else
>>> >> +#endif
>>> >> +     {
>>> >> +             audit_syscall_entry(arch, regs->orig_ax, regs->bx,
>>> >> +                                 regs->cx, regs->dx, regs->si);
>>> >> +     }
>>> >> +}
>>> >> +
>>> >>  /*
>>> >> - * We must return the syscall number to actually look up in the table.
>>> >> - * This can be -1L to skip running any syscall at all.
>>> >> + * We can return 0 to resume the syscall or anything else to go to phase
>>> >> + * 2.  If we resume the syscall, we need to put something appropriate in
>>> >> + * regs->orig_ax.
>>> >> + *
>>> >> + * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
>>> >> + * are fully functional.
>>> >> + *
>>> >> + * For phase 2's benefit, our return value is:
>>> >> + * 0:                        resume the syscall
>>> >> + * 1:                        go to phase 2; no seccomp phase 2 needed
>>> >> + * anything else:    go to phase 2; pass return value to seccomp
>>> >>   */
>>> >> -long syscall_trace_enter(struct pt_regs *regs)
>>> >> +unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
>>> >>  {
>>> >> -     long ret = 0;
>>> >> +     unsigned long ret = 0;
>>> >> +     u32 work;
>>> >> +
>>> >> +     BUG_ON(regs != task_pt_regs(current));
>>> >> +
>>> >> +     work = ACCESS_ONCE(current_thread_info()->flags) &
>>> >> +             _TIF_WORK_SYSCALL_ENTRY;
>>> >>
>>> >>       /*
>>> >>        * If TIF_NOHZ is set, we are required to call user_exit() before
>>> >>        * doing anything that could touch RCU.
>>> >>        */
>>> >> -     if (test_thread_flag(TIF_NOHZ))
>>> >> +     if (work & _TIF_NOHZ) {
>>> >>               user_exit();
>>> >> +             work &= ~_TIF_NOHZ;
>>> >> +     }
>>> >> +
>>> >> +#ifdef CONFIG_SECCOMP
>>> >> +     /*
>>> >> +      * Do seccomp first -- it should minimize exposure of other
>>> >> +      * code, and keeping seccomp fast is probably more valuable
>>> >> +      * than the rest of this.
>>> >> +      */
>>> >> +     if (work & _TIF_SECCOMP) {
>>> >> +             struct seccomp_data sd;
>>> >> +
>>> >> +             sd.arch = arch;
>>> >> +             sd.nr = regs->orig_ax;
>>> >> +             sd.instruction_pointer = regs->ip;
>>> >> +#ifdef CONFIG_X86_64
>>> >> +             if (arch == AUDIT_ARCH_X86_64) {
>>> >> +                     sd.args[0] = regs->di;
>>> >> +                     sd.args[1] = regs->si;
>>> >> +                     sd.args[2] = regs->dx;
>>> >> +                     sd.args[3] = regs->r10;
>>> >> +                     sd.args[4] = regs->r8;
>>> >> +                     sd.args[5] = regs->r9;
>>> >> +             } else
>>> >> +#endif
>>> >> +             {
>>> >> +                     sd.args[0] = regs->bx;
>>> >> +                     sd.args[1] = regs->cx;
>>> >> +                     sd.args[2] = regs->dx;
>>> >> +                     sd.args[3] = regs->si;
>>> >> +                     sd.args[4] = regs->di;
>>> >> +                     sd.args[5] = regs->bp;
>>> >> +             }
>>> >> +
>>> >> +             BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
>>> >> +             BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
>>> >> +
>>> >> +             ret = seccomp_phase1(&sd);
>>> >> +             if (ret == SECCOMP_PHASE1_SKIP) {
>>> >> +                     regs->orig_ax = -1;
>>> >
>>> > How is the tracer expected to get the correct syscall number after that?
>>>
>>> There shouldn't be a tracer if a skip is encountered. (A seccomp skip
>>> would skip ptrace.) This behavior hasn't changed, but maybe I don't
>>> see what you mean? (I haven't encountered any problems with syscall
>>> tracing as a result of these changes.)
>>
>> SECCOMP_RET_ERRNO leads to SECCOMP_PHASE1_SKIP, and if there is a tracer,
>> it will get -1 as a syscall number.
>>
>> I've found this while testing a strace parser for
>> SECCOMP_MODE_FILTER/SECCOMP_SET_MODE_FILTER, so the problem is quite real.
>>
>>
>
> Hasn't it always been this way?

As far as I know, yes, it's always been this way. The point is to
skip the syscall, which is what the -1 signals. Userspace then reads
back the errno.
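
For the skip-then-errno behavior as seen from userspace, here is a
minimal sketch, assuming a Linux system with seccomp filter support and
the usual uapi headers. The helper name install_errno_filter() and the
choice of getppid() and EPERM are illustrative only and are not part of
this series. A real filter would also check seccomp_data.arch before
trusting the syscall number.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Install a filter that makes getppid() fail with EPERM and allows
 * every other syscall.  Returns 0 on success, -1 on failure.
 * (Hypothetical helper, for illustration.)
 */
static int install_errno_filter(void)
{
	struct sock_filter filter[] = {
		/* Load the syscall number from struct seccomp_data. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* If it is getppid, fall through; otherwise skip one insn. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getppid, 0, 1),
		/* Skip the syscall and report EPERM to the caller. */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
		/* Everything else runs normally. */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	/* Needed so an unprivileged task may install a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Once install_errno_filter() has returned 0, a call such as
syscall(__NR_getppid) should return -1 with errno set to EPERM: the
syscall body is never run and the caller only sees the error.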

> I admit that I kind of wish this worked the other way -- that is, I
> think it would be nice to have a mode in which ptrace runs before
> seccomp, which would close the ptrace hole (where ptrace can do things
> that seccomp would disallow) and maybe have more comprehensible
> results.

I prefer keeping the seccomp attack surface as tiny as possible. I
would not like ptrace to happen before seccomp.

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
@ 2015-02-05 23:12             ` Kees Cook
  0 siblings, 0 replies; 82+ messages in thread
From: Kees Cook @ 2015-02-05 23:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dmitry V. Levin, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker

On Thu, Feb 5, 2015 at 1:52 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Feb 5, 2015 at 1:40 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>> On Thu, Feb 05, 2015 at 01:27:16PM -0800, Kees Cook wrote:
>>> On Thu, Feb 5, 2015 at 1:19 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>>> > Hi,
>>> >
>>> > On Fri, Sep 05, 2014 at 03:13:54PM -0700, Andy Lutomirski wrote:
>>> >> This splits syscall_trace_enter into syscall_trace_enter_phase1 and
>>> >> syscall_trace_enter_phase2.  Only phase 2 has full pt_regs, and only
>>> >> phase 2 is permitted to modify any of pt_regs except for orig_ax.
>>> >
>>> > This breaks ptrace, see below.
>>> >
>>> >> The intent is that phase 1 can be called from the syscall fast path.
>>> >>
>>> >> In this implementation, phase1 can handle any combination of
>>> >> TIF_NOHZ (RCU context tracking), TIF_SECCOMP, and TIF_SYSCALL_AUDIT,
>>> >> unless seccomp requests a ptrace event, in which case phase2 is
>>> >> forced.
>>> >>
>>> >> In principle, this could yield a big speedup for TIF_NOHZ as well as
>>> >> for TIF_SECCOMP if syscall exit work were similarly split up.
>>> >>
>>> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>>> >> ---
>>> >>  arch/x86/include/asm/ptrace.h |   5 ++
>>> >>  arch/x86/kernel/ptrace.c      | 157 +++++++++++++++++++++++++++++++++++-------
>>> >>  2 files changed, 138 insertions(+), 24 deletions(-)
>>> >>
>>> >> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
>>> >> index 6205f0c434db..86fc2bb82287 100644
>>> >> --- a/arch/x86/include/asm/ptrace.h
>>> >> +++ b/arch/x86/include/asm/ptrace.h
>>> >> @@ -75,6 +75,11 @@ convert_ip_to_linear(struct task_struct *child, struct pt_regs *regs);
>>> >>  extern void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
>>> >>                        int error_code, int si_code);
>>> >>
>>> >> +
>>> >> +extern unsigned long syscall_trace_enter_phase1(struct pt_regs *, u32 arch);
>>> >> +extern long syscall_trace_enter_phase2(struct pt_regs *, u32 arch,
>>> >> +                                    unsigned long phase1_result);
>>> >> +
>>> >>  extern long syscall_trace_enter(struct pt_regs *);
>>> >>  extern void syscall_trace_leave(struct pt_regs *);
>>> >>
>>> >> diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
>>> >> index bbf338a04a5d..29576c244699 100644
>>> >> --- a/arch/x86/kernel/ptrace.c
>>> >> +++ b/arch/x86/kernel/ptrace.c
>>> >> @@ -1441,20 +1441,126 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
>>> >>       force_sig_info(SIGTRAP, &info, tsk);
>>> >>  }
>>> >>
>>> >> +static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
>>> >> +{
>>> >> +#ifdef CONFIG_X86_64
>>> >> +     if (arch == AUDIT_ARCH_X86_64) {
>>> >> +             audit_syscall_entry(arch, regs->orig_ax, regs->di,
>>> >> +                                 regs->si, regs->dx, regs->r10);
>>> >> +     } else
>>> >> +#endif
>>> >> +     {
>>> >> +             audit_syscall_entry(arch, regs->orig_ax, regs->bx,
>>> >> +                                 regs->cx, regs->dx, regs->si);
>>> >> +     }
>>> >> +}
>>> >> +
>>> >>  /*
>>> >> - * We must return the syscall number to actually look up in the table.
>>> >> - * This can be -1L to skip running any syscall at all.
>>> >> + * We can return 0 to resume the syscall or anything else to go to phase
>>> >> + * 2.  If we resume the syscall, we need to put something appropriate in
>>> >> + * regs->orig_ax.
>>> >> + *
>>> >> + * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
>>> >> + * are fully functional.
>>> >> + *
>>> >> + * For phase 2's benefit, our return value is:
>>> >> + * 0:                        resume the syscall
>>> >> + * 1:                        go to phase 2; no seccomp phase 2 needed
>>> >> + * anything else:    go to phase 2; pass return value to seccomp
>>> >>   */
>>> >> -long syscall_trace_enter(struct pt_regs *regs)
>>> >> +unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
>>> >>  {
>>> >> -     long ret = 0;
>>> >> +     unsigned long ret = 0;
>>> >> +     u32 work;
>>> >> +
>>> >> +     BUG_ON(regs != task_pt_regs(current));
>>> >> +
>>> >> +     work = ACCESS_ONCE(current_thread_info()->flags) &
>>> >> +             _TIF_WORK_SYSCALL_ENTRY;
>>> >>
>>> >>       /*
>>> >>        * If TIF_NOHZ is set, we are required to call user_exit() before
>>> >>        * doing anything that could touch RCU.
>>> >>        */
>>> >> -     if (test_thread_flag(TIF_NOHZ))
>>> >> +     if (work & _TIF_NOHZ) {
>>> >>               user_exit();
>>> >> +             work &= ~TIF_NOHZ;
>>> >> +     }
>>> >> +
>>> >> +#ifdef CONFIG_SECCOMP
>>> >> +     /*
>>> >> +      * Do seccomp first -- it should minimize exposure of other
>>> >> +      * code, and keeping seccomp fast is probably more valuable
>>> >> +      * than the rest of this.
>>> >> +      */
>>> >> +     if (work & _TIF_SECCOMP) {
>>> >> +             struct seccomp_data sd;
>>> >> +
>>> >> +             sd.arch = arch;
>>> >> +             sd.nr = regs->orig_ax;
>>> >> +             sd.instruction_pointer = regs->ip;
>>> >> +#ifdef CONFIG_X86_64
>>> >> +             if (arch == AUDIT_ARCH_X86_64) {
>>> >> +                     sd.args[0] = regs->di;
>>> >> +                     sd.args[1] = regs->si;
>>> >> +                     sd.args[2] = regs->dx;
>>> >> +                     sd.args[3] = regs->r10;
>>> >> +                     sd.args[4] = regs->r8;
>>> >> +                     sd.args[5] = regs->r9;
>>> >> +             } else
>>> >> +#endif
>>> >> +             {
>>> >> +                     sd.args[0] = regs->bx;
>>> >> +                     sd.args[1] = regs->cx;
>>> >> +                     sd.args[2] = regs->dx;
>>> >> +                     sd.args[3] = regs->si;
>>> >> +                     sd.args[4] = regs->di;
>>> >> +                     sd.args[5] = regs->bp;
>>> >> +             }
>>> >> +
>>> >> +             BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
>>> >> +             BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
>>> >> +
>>> >> +             ret = seccomp_phase1(&sd);
>>> >> +             if (ret == SECCOMP_PHASE1_SKIP) {
>>> >> +                     regs->orig_ax = -1;
>>> >
>>> > How the tracer is expected to get the correct syscall number after that?
>>>
>>> There shouldn't be a tracer if a skip is encountered. (A seccomp skip
>>> would skip ptrace.) This behavior hasn't changed, but maybe I don't
>>> see what you mean? (I haven't encountered any problems with syscall
>>> tracing as a result of these changes.)
>>
>> SECCOMP_RET_ERRNO leads to SECCOMP_PHASE1_SKIP, and if there is a tracer,
>> it will get -1 as a syscall number.
>>
>> I've found this while testing a strace parser for
>> SECCOMP_MODE_FILTER/SECCOMP_SET_MODE_FILTER, so the problem is quite real.
>>
>>
>
> Hasn't it always been this way?

As far as I know, yes, it's always been this way. The point is to
skip the syscall, which is what the -1 signals. Userspace then reads
back the errno.
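
For reference, a filter that triggers this path can be quite small. The
following is a minimal sketch, assuming a classic-BPF seccomp filter that
fails one syscall with an errno and allows everything else.
build_errno_filter() is a hypothetical helper name, not something from
this series, and the architecture check a real filter needs is omitted:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>	/* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <linux/seccomp.h>	/* struct seccomp_data, SECCOMP_RET_* */

/*
 * Hypothetical helper: fill `out` (room for at least 4 instructions)
 * with a program that fails syscall `nr` with `err` and allows every
 * other syscall.  Returns the number of instructions written.
 *
 * A production filter should test seccomp_data.arch first; that step
 * is left out to keep the sketch short.
 */
size_t build_errno_filter(struct sock_filter *out, unsigned int nr,
			  unsigned int err)
{
	size_t n = 0;

	assert(out != NULL);

	/* A = seccomp_data.nr */
	out[n++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
				offsetof(struct seccomp_data, nr));
	/* if (A == nr) run the next instruction, else skip over it */
	out[n++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
				nr, 0, 1);
	/* matched: do not run the syscall, report `err` to userspace */
	out[n++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
				SECCOMP_RET_ERRNO | (err & SECCOMP_RET_DATA));
	/* not matched: let the syscall through */
	out[n++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
				SECCOMP_RET_ALLOW);
	return n;
}
```

Installing it would take the usual prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
followed by prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog), with a
struct sock_fprog pointing at the array; that part is left out here.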

> I admit that I kind of wish this worked the other way -- that is, I
> think it would be nice to have a mode in which ptrace runs before
> seccomp, which would close the ptrace hole (where ptrace can do things
> that seccomp would disallow) and maybe have more comprehensible
> results.

I prefer keeping the seccomp attack surface as tiny as possible. I
would not like ptrace to happen before seccomp.

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
  2015-02-05 23:12             ` Kees Cook
  (?)
@ 2015-02-05 23:39               ` Dmitry V. Levin
  -1 siblings, 0 replies; 82+ messages in thread
From: Dmitry V. Levin @ 2015-02-05 23:39 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker

On Thu, Feb 05, 2015 at 03:12:39PM -0800, Kees Cook wrote:
> On Thu, Feb 5, 2015 at 1:52 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Thu, Feb 5, 2015 at 1:40 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
> >> On Thu, Feb 05, 2015 at 01:27:16PM -0800, Kees Cook wrote:
> >>> On Thu, Feb 5, 2015 at 1:19 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
> >>> > Hi,
> >>> >
> >>> > On Fri, Sep 05, 2014 at 03:13:54PM -0700, Andy Lutomirski wrote:
> >>> >> This splits syscall_trace_enter into syscall_trace_enter_phase1 and
> >>> >> syscall_trace_enter_phase2.  Only phase 2 has full pt_regs, and only
> >>> >> phase 2 is permitted to modify any of pt_regs except for orig_ax.
> >>> >
> >>> > This breaks ptrace, see below.
> >>> >
[...]
> >>> >> +             ret = seccomp_phase1(&sd);
> >>> >> +             if (ret == SECCOMP_PHASE1_SKIP) {
> >>> >> +                     regs->orig_ax = -1;
> >>> >
> >>> > How the tracer is expected to get the correct syscall number after that?
> >>>
> >>> There shouldn't be a tracer if a skip is encountered. (A seccomp skip
> >>> would skip ptrace.) This behavior hasn't changed, but maybe I don't
> >>> see what you mean? (I haven't encountered any problems with syscall
> >>> tracing as a result of these changes.)
> >>
> >> SECCOMP_RET_ERRNO leads to SECCOMP_PHASE1_SKIP, and if there is a tracer,
> >> it will get -1 as a syscall number.
> >>
> >> I've found this while testing a strace parser for
> >> SECCOMP_MODE_FILTER/SECCOMP_SET_MODE_FILTER, so the problem is quite real.
> >
> > Hasn't it always been this way?
> 
> As far as I know, yes, it's always been this way. The point is to the
> skip the syscall, which is what the -1 signals. Userspace then reads
> back the errno.

There is a clear difference: before these changes, SECCOMP_RET_ERRNO used
to keep the syscall number unchanged and suppress the syscall-exit-stop
event, which was awful because userspace cannot distinguish
syscall-enter-stop from syscall-exit-stop and therefore relies on the
kernel to guarantee that syscall-enter-stop is followed by
syscall-exit-stop (or tracee's death, etc.).

After these changes, SECCOMP_RET_ERRNO no longer causes syscall-exit-stop
events to be suppressed, but now the syscall number is lost.
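
To make the tracer-side bookkeeping concrete, here is a minimal sketch,
assuming a tracer that tells the two kinds of stop apart only by
alternation. The names (tracee_state, classify_syscall_stop) are
hypothetical and not taken from strace or the kernel:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum stop_kind { STOP_SYSCALL_ENTER, STOP_SYSCALL_EXIT };

/* Per-tracee state: are we between an enter-stop and its exit-stop? */
struct tracee_state {
	bool in_syscall;
};

/*
 * Called by the tracer each time the tracee reports a syscall-stop.
 * The stop itself carries no enter/exit marker, so the tracer simply
 * flips a flag and trusts that stops arrive strictly in pairs.
 */
enum stop_kind classify_syscall_stop(struct tracee_state *t)
{
	assert(t != NULL);

	t->in_syscall = !t->in_syscall;
	return t->in_syscall ? STOP_SYSCALL_ENTER : STOP_SYSCALL_EXIT;
}
```

With this scheme a single suppressed exit-stop shifts the tracer out of
phase: the next real enter-stop is classified as an exit, and every
following stop is misread as well.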


-- 
ldv

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
  2015-02-05 23:39               ` Dmitry V. Levin
  (?)
@ 2015-02-05 23:49                 ` Kees Cook
  -1 siblings, 0 replies; 82+ messages in thread
From: Kees Cook @ 2015-02-05 23:49 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Andy Lutomirski, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker

On Thu, Feb 5, 2015 at 3:39 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
> On Thu, Feb 05, 2015 at 03:12:39PM -0800, Kees Cook wrote:
>> On Thu, Feb 5, 2015 at 1:52 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> > On Thu, Feb 5, 2015 at 1:40 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>> >> On Thu, Feb 05, 2015 at 01:27:16PM -0800, Kees Cook wrote:
>> >>> On Thu, Feb 5, 2015 at 1:19 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>> >>> > Hi,
>> >>> >
>> >>> > On Fri, Sep 05, 2014 at 03:13:54PM -0700, Andy Lutomirski wrote:
>> >>> >> This splits syscall_trace_enter into syscall_trace_enter_phase1 and
>> >>> >> syscall_trace_enter_phase2.  Only phase 2 has full pt_regs, and only
>> >>> >> phase 2 is permitted to modify any of pt_regs except for orig_ax.
>> >>> >
>> >>> > This breaks ptrace, see below.
>> >>> >
> [...]
>> >>> >> +             ret = seccomp_phase1(&sd);
>> >>> >> +             if (ret == SECCOMP_PHASE1_SKIP) {
>> >>> >> +                     regs->orig_ax = -1;
>> >>> >
>> >>> > How the tracer is expected to get the correct syscall number after that?
>> >>>
>> >>> There shouldn't be a tracer if a skip is encountered. (A seccomp skip
>> >>> would skip ptrace.) This behavior hasn't changed, but maybe I don't
>> >>> see what you mean? (I haven't encountered any problems with syscall
>> >>> tracing as a result of these changes.)
>> >>
>> >> SECCOMP_RET_ERRNO leads to SECCOMP_PHASE1_SKIP, and if there is a tracer,
>> >> it will get -1 as a syscall number.
>> >>
>> >> I've found this while testing a strace parser for
>> >> SECCOMP_MODE_FILTER/SECCOMP_SET_MODE_FILTER, so the problem is quite real.
>> >
>> > Hasn't it always been this way?
>>
>> As far as I know, yes, it's always been this way. The point is to the
>> skip the syscall, which is what the -1 signals. Userspace then reads
>> back the errno.
>
> There is a clear difference: before these changes, SECCOMP_RET_ERRNO used
> to keep the syscall number unchanged and suppress syscall-exit-stop event,
> which was awful because userspace cannot distinguish syscall-enter-stop
> from syscall-exit-stop and therefore relies on the kernel that
> syscall-enter-stop is followed by syscall-exit-stop (or tracee's death, etc.).
>
> After these changes, SECCOMP_RET_ERRNO no longer causes syscall-exit-stop
> events to be suppressed, but now the syscall number is lost.

Ah-ha! Okay, thanks, I understand now. I think this means seccomp
phase1 should not treat RET_ERRNO as a "skip" event. Andy, what do you
think here?

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
  2015-02-05 23:49                 ` Kees Cook
  (?)
@ 2015-02-06  0:09                   ` Andy Lutomirski
  -1 siblings, 0 replies; 82+ messages in thread
From: Andy Lutomirski @ 2015-02-06  0:09 UTC (permalink / raw)
  To: Kees Cook
  Cc: Dmitry V. Levin, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker

On Thu, Feb 5, 2015 at 3:49 PM, Kees Cook <keescook@chromium.org> wrote:
> On Thu, Feb 5, 2015 at 3:39 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>> On Thu, Feb 05, 2015 at 03:12:39PM -0800, Kees Cook wrote:
>>> On Thu, Feb 5, 2015 at 1:52 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> > On Thu, Feb 5, 2015 at 1:40 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>>> >> On Thu, Feb 05, 2015 at 01:27:16PM -0800, Kees Cook wrote:
>>> >>> On Thu, Feb 5, 2015 at 1:19 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>>> >>> > Hi,
>>> >>> >
>>> >>> > On Fri, Sep 05, 2014 at 03:13:54PM -0700, Andy Lutomirski wrote:
>>> >>> >> This splits syscall_trace_enter into syscall_trace_enter_phase1 and
>>> >>> >> syscall_trace_enter_phase2.  Only phase 2 has full pt_regs, and only
>>> >>> >> phase 2 is permitted to modify any of pt_regs except for orig_ax.
>>> >>> >
>>> >>> > This breaks ptrace, see below.
>>> >>> >
>> [...]
>>> >>> >> +             ret = seccomp_phase1(&sd);
>>> >>> >> +             if (ret == SECCOMP_PHASE1_SKIP) {
>>> >>> >> +                     regs->orig_ax = -1;
>>> >>> >
>>> >>> > How the tracer is expected to get the correct syscall number after that?
>>> >>>
>>> >>> There shouldn't be a tracer if a skip is encountered. (A seccomp skip
>>> >>> would skip ptrace.) This behavior hasn't changed, but maybe I don't
>>> >>> see what you mean? (I haven't encountered any problems with syscall
>>> >>> tracing as a result of these changes.)
>>> >>
>>> >> SECCOMP_RET_ERRNO leads to SECCOMP_PHASE1_SKIP, and if there is a tracer,
>>> >> it will get -1 as a syscall number.
>>> >>
>>> >> I've found this while testing a strace parser for
>>> >> SECCOMP_MODE_FILTER/SECCOMP_SET_MODE_FILTER, so the problem is quite real.
>>> >
>>> > Hasn't it always been this way?
>>>
>>> As far as I know, yes, it's always been this way. The point is to the
>>> skip the syscall, which is what the -1 signals. Userspace then reads
>>> back the errno.
>>
>> There is a clear difference: before these changes, SECCOMP_RET_ERRNO used
>> to keep the syscall number unchanged and suppress syscall-exit-stop event,
>> which was awful because userspace cannot distinguish syscall-enter-stop
>> from syscall-exit-stop and therefore relies on the kernel that
>> syscall-enter-stop is followed by syscall-exit-stop (or tracee's death, etc.).
>>
>> After these changes, SECCOMP_RET_ERRNO no longer causes syscall-exit-stop
>> events to be suppressed, but now the syscall number is lost.
>
> Ah-ha! Okay, thanks, I understand now. I think this means seccomp
> phase1 should not treat RET_ERRNO as a "skip" event. Andy, what do you
> think here?
>

I still don't quite see how this change caused this.  I can play with
it a bit more.  But RET_ERRNO *has* to be some kind of skip event,
because it needs to skip the syscall.

We could change this by treating RET_ERRNO as an instruction to enter
phase 2 and then asking for a skip in phase 2 without changing
orig_ax, but IMO this is pretty ugly.
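
As a rough illustration of the two options, here is a user-space model,
not kernel code. The struct and both function names are hypothetical,
and storing -err in ax is an assumption about how the errno reaches
userspace:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Just the two pt_regs fields that matter for this discussion. */
struct skip_regs {
	long orig_ax;	/* syscall number as a tracer would read it */
	long ax;	/* value the task sees as the syscall result */
};

/* Model of the quoted patch: a phase-1 skip overwrites orig_ax. */
void skip_in_phase1(struct skip_regs *regs, int err)
{
	assert(regs != NULL);

	regs->ax = -err;
	regs->orig_ax = -1;	/* the original number is gone from here on */
}

/*
 * Model of the alternative: defer to phase 2, which returns the
 * number to dispatch.  Returning -1 runs nothing, and orig_ax is
 * left untouched so a tracer can still read it.
 */
long skip_in_phase2(struct skip_regs *regs, int err)
{
	assert(regs != NULL);

	regs->ax = -err;
	return -1;
}
```

Both variants deliver the same result to the task; they differ only in
what is left in orig_ax for a tracer to read afterwards.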

I think this all kind of sucks.  We're trying to run ptrace after
seccomp, so ptrace is seeing the syscalls as transformed by seccomp.
That means that if we use RET_TRAP, then ptrace will see the
possibly-modified syscall; if we use RET_ERRNO, then ptrace is (IMO
correctly, given the current design) showing syscall -1; and if we use
RET_KILL, then ptrace just sees the process mysteriously die.

I think it would be more useful and easier to understand if ptrace saw
syscalls as the traced process saw them, i.e. before seccomp
modification.  How would this meaningfully increase the attack
surface?  As far as ptrace is concerned, a syscall is just seven
numbers, and as long as a process can issue *any* syscall that seccomp
allows, then it can invoke the ptrace hooks.  I don't think the
entry/exit hooks care *at all* about the syscall nr or args.
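
The "seven numbers" are the syscall nr plus six arguments. A minimal
sketch of that view, using the same register order that phase 1 uses to
fill seccomp_data in the patch quoted earlier in this thread. The
x86_regs_model and syscall_view types are hypothetical stand-ins for
illustration, not kernel structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the pt_regs fields that carry syscall inputs. */
struct x86_regs_model {
	unsigned long orig_ax;
	unsigned long di, si, dx, r10, r8, r9;	/* 64-bit ABI arguments */
	unsigned long bx, cx, bp;		/* also used by the 32-bit ABI */
};

/* A syscall as seven numbers: nr plus six arguments. */
struct syscall_view {
	unsigned long nr;
	unsigned long args[6];
};

void view_syscall(const struct x86_regs_model *regs, bool is_64bit,
		  struct syscall_view *out)
{
	assert(regs != NULL && out != NULL);

	out->nr = regs->orig_ax;
	if (is_64bit) {
		out->args[0] = regs->di;
		out->args[1] = regs->si;
		out->args[2] = regs->dx;
		out->args[3] = regs->r10;
		out->args[4] = regs->r8;
		out->args[5] = regs->r9;
	} else {
		out->args[0] = regs->bx;
		out->args[1] = regs->cx;
		out->args[2] = regs->dx;
		out->args[3] = regs->si;
		out->args[4] = regs->di;
		out->args[5] = regs->bp;
	}
}
```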

Audit is a different story.  I think we should absolutely continue to
audit syscalls that actually happen, not syscalls that were requested.

Given this bug, I doubt we'd break anything if we changed it, since it
appears that it's already rather broken.  Also, changing it would make
me happy, because I want to add a SECCOMP_RET_MONITOR that freezes the
process, sends an event to a seccompfd, and then executes syscalls as
requested by the holder of the seccompfd.  Those syscalls would, in
turn, be filtered again through inner layers of seccomp.  One sticking
point would be that the current ptrace behavior is very hard to
reconcile with this type of feature.

--Andy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
  2015-02-06  0:09                   ` Andy Lutomirski
  (?)
@ 2015-02-06  2:32                     ` Dmitry V. Levin
  -1 siblings, 0 replies; 82+ messages in thread
From: Dmitry V. Levin @ 2015-02-06  2:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker

On Thu, Feb 05, 2015 at 04:09:06PM -0800, Andy Lutomirski wrote:
> On Thu, Feb 5, 2015 at 3:49 PM, Kees Cook <keescook@chromium.org> wrote:
> > On Thu, Feb 5, 2015 at 3:39 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
[...]
> >> There is a clear difference: before these changes, SECCOMP_RET_ERRNO used
> >> to keep the syscall number unchanged and suppress syscall-exit-stop event,
> >> which was awful because userspace cannot distinguish syscall-enter-stop
> >> from syscall-exit-stop and therefore relies on the kernel that
> >> syscall-enter-stop is followed by syscall-exit-stop (or tracee's death, etc.).
> >>
> >> After these changes, SECCOMP_RET_ERRNO no longer causes syscall-exit-stop
> >> events to be suppressed, but now the syscall number is lost.
> >
> > Ah-ha! Okay, thanks, I understand now. I think this means seccomp
> > phase1 should not treat RET_ERRNO as a "skip" event. Andy, what do you
> > think here?
> 
> I still don't quite see how this change caused this.

I have a test for this at
http://sourceforge.net/p/strace/code/ci/HEAD/~/tree/test/seccomp.c

> I can play with
> it a bit more.  But RET_ERRNO *has* to be some kind of skip event,
> because it needs to skip the syscall.
> 
> We could change this by treating RET_ERRNO as an instruction to enter
> phase 2 and then asking for a skip in phase 2 without changing
> orig_ax, but IMO this is pretty ugly.
> 
> I think this all kind of sucks.  We're trying to run ptrace after
> seccomp, so ptrace is seeing the syscalls as transformed by seccomp.
> That means that if we use RET_TRAP, then ptrace will see the
> possibly-modified syscall, if we use RET_ERRNO, then ptrace is (IMO
> correctly given the current design) showing syscall -1, and if we use
> RET_KILL, then ptrace just sees the process mysteriously die.

Userspace is usually not prepared to see syscall -1.
For example, strace had to be patched, otherwise it just skipped such
syscalls as "not a syscall" events or did other improper things:
http://sourceforge.net/p/strace/code/ci/c3948327717c29b10b5e00a436dc138b4ab1a486
http://sourceforge.net/p/strace/code/ci/8e398b6c4020fb2d33a5b3e40271ebf63199b891

A slightly different but related story: userspace is also not prepared
to handle large errno values produced by seccomp filters like this:
BPF_STMT(BPF_RET, SECCOMP_RET_ERRNO | SECCOMP_RET_DATA)

For example, glibc assumes that syscalls do not return errno values greater than 0xfff:
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/sysdep.h#l55
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/syscall.S#l20

If it isn't too late, I'd recommend changing the SECCOMP_RET_DATA mask
applied in the SECCOMP_RET_ERRNO case from the current 0xffff to 0xfff.
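
The arithmetic behind this can be sketched as follows.  This is only an
illustration under stated assumptions: seccomp_errno_result() and
looks_like_error() are hypothetical helpers, the latter mirroring the
usual "return value in [-4095, -1] means error" convention rather than
quoting glibc; the two SECCOMP_* values are the ones from
linux/seccomp.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values from the seccomp UAPI header (linux/seccomp.h). */
#define SECCOMP_RET_ERRNO 0x00050000U
#define SECCOMP_RET_DATA  0x0000ffffU

/* Largest errno the usual raw-syscall error check recognizes (0xfff). */
#define MAX_ERRNO 4095

/*
 * Hypothetical helper: the value a syscall returns to userspace when a
 * filter returns filter_ret with the SECCOMP_RET_ERRNO action, assuming
 * the full 16-bit data field is used as the errno.
 */
static long seccomp_errno_result(uint32_t filter_ret)
{
	return -(long)(filter_ret & SECCOMP_RET_DATA);
}

/*
 * Hypothetical helper for the libc-side convention: a raw return value
 * is treated as an error only if it lies in [-MAX_ERRNO, -1].
 */
static bool looks_like_error(long ret)
{
	return (unsigned long)ret >= (unsigned long)-MAX_ERRNO;
}

/* Self-check: 0xfff is still an error, 0xffff is not recognized. */
static void errno_window_check(void)
{
	long ok  = seccomp_errno_result(SECCOMP_RET_ERRNO | 0xfff);
	long big = seccomp_errno_result(SECCOMP_RET_ERRNO | SECCOMP_RET_DATA);

	assert(looks_like_error(ok));
	assert(!looks_like_error(big));
}
```

With the full 0xffff mask, the filter from the example above yields
-65535, which falls outside the window, so a caller using this
convention would take it for a successful (and very large) return
value.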


-- 
ldv

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
  2015-02-06  2:32                     ` Dmitry V. Levin
  (?)
@ 2015-02-06  2:38                       ` Andy Lutomirski
  -1 siblings, 0 replies; 82+ messages in thread
From: Andy Lutomirski @ 2015-02-06  2:38 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Kees Cook, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker

On Thu, Feb 5, 2015 at 6:32 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
> On Thu, Feb 05, 2015 at 04:09:06PM -0800, Andy Lutomirski wrote:
>> On Thu, Feb 5, 2015 at 3:49 PM, Kees Cook <keescook@chromium.org> wrote:
>> > On Thu, Feb 5, 2015 at 3:39 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
> [...]
>> >> There is a clear difference: before these changes, SECCOMP_RET_ERRNO used
>> >> to keep the syscall number unchanged and suppress syscall-exit-stop event,
>> >> which was awful because userspace cannot distinguish syscall-enter-stop
>> >> from syscall-exit-stop and therefore relies on the kernel that
>> >> syscall-enter-stop is followed by syscall-exit-stop (or tracee's death, etc.).
>> >>
>> >> After these changes, SECCOMP_RET_ERRNO no longer causes syscall-exit-stop
>> >> events to be suppressed, but now the syscall number is lost.
>> >
>> > Ah-ha! Okay, thanks, I understand now. I think this means seccomp
>> > phase1 should not treat RET_ERRNO as a "skip" event. Andy, what do you
>> > think here?
>>
>> I still don't quite see how this change caused this.
>
> I have a test for this at
> http://sourceforge.net/p/strace/code/ci/HEAD/~/tree/test/seccomp.c
>
>> I can play with
>> it a bit more.  But RET_ERRNO *has* to be some kind of skip event,
>> because it needs to skip the syscall.
>>
>> We could change this by treating RET_ERRNO as an instruction to enter
>> phase 2 and then asking for a skip in phase 2 without changing
>> orig_ax, but IMO this is pretty ugly.
>>
>> I think this all kind of sucks.  We're trying to run ptrace after
>> seccomp, so ptrace is seeing the syscalls as transformed by seccomp.
>> That means that if we use RET_TRAP, then ptrace will see the
>> possibly-modified syscall, if we use RET_ERRNO, then ptrace is (IMO
>> correctly given the current design) showing syscall -1, and if we use
>> RET_KILL, then ptrace just sees the process mysteriously die.
>
> Userspace is usually not prepared to see syscall -1.
> For example, strace had to be patched, otherwise it just skipped such
> syscalls as "not a syscall" events or did other improper things:
> http://sourceforge.net/p/strace/code/ci/c3948327717c29b10b5e00a436dc138b4ab1a486
> http://sourceforge.net/p/strace/code/ci/8e398b6c4020fb2d33a5b3e40271ebf63199b891
>

The x32 thing is a legit ABI bug, I'd argue.  I'd be happy to submit a
patch to fix that (clear the x32 bit if we're not x32).
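
One way to read the x32 remark, which is an assumption here and not
something spelled out above: a tracer that classifies syscall numbers
by the x32 marker bit will misclassify the -1 skip value, because -1
has every bit set.  nr_looks_x32() is a hypothetical tracer-side
helper; the bit value is that of __X32_SYSCALL_BIT in the x86
asm/unistd.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Value of __X32_SYSCALL_BIT in the x86 UAPI header asm/unistd.h. */
#define X32_SYSCALL_BIT 0x40000000

/*
 * Hypothetical tracer-side check: classify a syscall number as x32
 * purely by looking at the x32 marker bit.
 */
static bool nr_looks_x32(long nr)
{
	return (nr & X32_SYSCALL_BIT) != 0;
}

/* The skip marker -1 has every bit set, so it trips the x32 check. */
static void x32_check(void)
{
	assert(!nr_looks_x32(39));			/* plain 64-bit number */
	assert(nr_looks_x32(X32_SYSCALL_BIT | 39));	/* x32 number */
	assert(nr_looks_x32(-1));			/* skipped syscall */
}
```

This only shows the misclassification; it does not try to model what
the proposed kernel fix would look like.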

> A slightly different but related story: userspace is also not prepared
> to handle large errno values produced by seccomp filters like this:
> BPF_STMT(BPF_RET, SECCOMP_RET_ERRNO | SECCOMP_RET_DATA)
>
> For example, glibc assumes that syscalls do not return errno values greater than 0xfff:
> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/sysdep.h#l55
> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/syscall.S#l20
>
> If it isn't too late, I'd recommend changing SECCOMP_RET_DATA mask
> applied in SECCOMP_RET_ERRNO case from current 0xffff to 0xfff.

I think this is solidly in the "don't do that" category.  Seccomp lets
you tamper with syscalls.  If you tamper wrong, then you lose.

Kees, what do you think about reversing the whole thing so that
ptracers always see the original syscall?

--Andy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
  2015-02-06  2:38                       ` Andy Lutomirski
  (?)
@ 2015-02-06 19:23                         ` Kees Cook
  -1 siblings, 0 replies; 82+ messages in thread
From: Kees Cook @ 2015-02-06 19:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dmitry V. Levin, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker, Michael Kerrisk-manpages

On Thu, Feb 5, 2015 at 6:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Feb 5, 2015 at 6:32 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>> On Thu, Feb 05, 2015 at 04:09:06PM -0800, Andy Lutomirski wrote:
>>> On Thu, Feb 5, 2015 at 3:49 PM, Kees Cook <keescook@chromium.org> wrote:
>>> > On Thu, Feb 5, 2015 at 3:39 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>> [...]
>>> >> There is a clear difference: before these changes, SECCOMP_RET_ERRNO used
>>> >> to keep the syscall number unchanged and suppress syscall-exit-stop event,
>>> >> which was awful because userspace cannot distinguish syscall-enter-stop
>>> >> from syscall-exit-stop and therefore relies on the kernel that
>>> >> syscall-enter-stop is followed by syscall-exit-stop (or tracee's death, etc.).
>>> >>
>>> >> After these changes, SECCOMP_RET_ERRNO no longer causes syscall-exit-stop
>>> >> events to be suppressed, but now the syscall number is lost.
>>> >
>>> > Ah-ha! Okay, thanks, I understand now. I think this means seccomp
>>> > phase1 should not treat RET_ERRNO as a "skip" event. Andy, what do you
>>> > think here?
>>>
>>> I still don't quite see how this change caused this.
>>
>> I have a test for this at
>> http://sourceforge.net/p/strace/code/ci/HEAD/~/tree/test/seccomp.c
>>
>>> I can play with
>>> it a bit more.  But RET_ERRNO *has* to be some kind of skip event,
>>> because it needs to skip the syscall.
>>>
>>> We could change this by treating RET_ERRNO as an instruction to enter
>>> phase 2 and then asking for a skip in phase 2 without changing
>>> orig_ax, but IMO this is pretty ugly.
>>>
>>> I think this all kind of sucks.  We're trying to run ptrace after
>>> seccomp, so ptrace is seeing the syscalls as transformed by seccomp.
>>> That means that if we use RET_TRAP, then ptrace will see the
>>> possibly-modified syscall, if we use RET_ERRNO, then ptrace is (IMO
>>> correctly given the current design) showing syscall -1, and if we use
>>> RET_KILL, then ptrace just sees the process mysteriously die.
>>
>> Userspace is usually not prepared to see syscall -1.
>> For example, strace had to be patched, otherwise it just skipped such
>> syscalls as "not a syscall" events or did other improper things:
>> http://sourceforge.net/p/strace/code/ci/c3948327717c29b10b5e00a436dc138b4ab1a486
>> http://sourceforge.net/p/strace/code/ci/8e398b6c4020fb2d33a5b3e40271ebf63199b891
>>
>
> The x32 thing is a legit ABI bug, I'd argue.  I'd be happy to submit a
> patch to fix that (clear the x32 bit if we're not x32).
>
>> A slightly different but related story: userspace is also not prepared
>> to handle large errno values produced by seccomp filters like this:
>> BPF_STMT(BPF_RET, SECCOMP_RET_ERRNO | SECCOMP_RET_DATA)
>>
>> For example, glibc assumes that syscalls do not return errno values greater than 0xfff:
>> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/sysdep.h#l55
>> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/syscall.S#l20

To save others the link reading: "Linus said he will make sure the no
syscall returns a value in -1 .. -4095 as a valid result so we can
savely test with -4095."

Strictly speaking (ISO C, "man 3 errno"), errno is supposed to be a
full int, though digging around I find this in include/linux/err.h:

/*
 * Kernel pointers have redundant information, so we can use a
 * scheme where we can return either an error code or a normal
 * pointer with the same return value.
 *
 * This should be a per-architecture thing, to allow different
 * error and pointer decisions.
 */
#define MAX_ERRNO       4095

#ifndef __ASSEMBLY__

#define IS_ERR_VALUE(x) unlikely((x) >= (unsigned long)-MAX_ERRNO)

But no architecture overrides this.
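
To make the mismatch concrete, here is a minimal user-space sketch of that range check (not kernel or glibc code). MAX_ERRNO is copied from the excerpt above; is_err_value() and raw_return_for_errno() are names made up for illustration, and the sketch assumes the syscall return value is register-width, i.e. an unsigned long.

```c
#include <stdbool.h>

/* Same value as in include/linux/err.h (quoted above). */
#define MAX_ERRNO 4095

/* User-space analogue of IS_ERR_VALUE(): true only for -4095..-1. */
static bool is_err_value(unsigned long x)
{
	return x >= (unsigned long)-MAX_ERRNO;
}

/* Hypothetical helper: the raw register value returned for a given errno. */
static unsigned long raw_return_for_errno(unsigned int err)
{
	return (unsigned long)-(long)err;
}
```

Under these assumptions an errno of 4095 is still recognized as an error, while 4096 or 0xffff (the largest value SECCOMP_RET_DATA allows) would look like a successful return to a caller that uses this check.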

>> If it isn't too late, I'd recommend changing SECCOMP_RET_DATA mask
>> applied in SECCOMP_RET_ERRNO case from current 0xffff to 0xfff.

I'm not opposed to this. I would want to more explicitly document the
4095 max value in man pages, though.
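
A rough sketch of what a 4095 limit could look like from the filter author's side. The two SECCOMP_RET_* values are the ones from <linux/seccomp.h>, redefined locally so the snippet builds anywhere; SECCOMP_ERRNO_CAP and seccomp_errno_action() are hypothetical names, not an existing interface. Note that it clamps rather than masks, which is only one possible reading of the proposal.

```c
#include <stdint.h>

/* Values as defined in <linux/seccomp.h>. */
#define SECCOMP_RET_ERRNO 0x00050000U
#define SECCOMP_RET_DATA  0x0000ffffU

/* Hypothetical cap matching MAX_ERRNO (4095 == 0xfff). */
#define SECCOMP_ERRNO_CAP 0x0fffU

/* Build a filter return value whose errno data never exceeds 4095. */
static uint32_t seccomp_errno_action(uint32_t err)
{
	uint32_t data = err & SECCOMP_RET_DATA;

	if (data > SECCOMP_ERRNO_CAP)
		data = SECCOMP_ERRNO_CAP;
	return SECCOMP_RET_ERRNO | data;
}
```

Clamping keeps an oversized value in the error range; masking with 0xfff would instead wrap it to some unrelated small errno.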

> I think this is solidly in the "don't do that" category.  Seccomp lets
> you tamper with syscalls.  If you tamper wrong, then you lose.
>
> Kees, what do you think about reversing the whole thing so that
> ptracers always see the original syscall?

What do you mean by "reversing"? The interactions I see here are:

PTRACE_SYSCALL
SECCOMP_RET_ERRNO
SECCOMP_RET_TRACE
SECCOMP_RET_TRAP

Both ptrace and seccomp will trigger via _TIF_WORK_SYSCALL_ENTRY. Only
ptrace will trigger via _TIF_WORK_SYSCALL_EXIT.

For SECCOMP_RET_ERRNO to work, we must skip the syscall, as mentioned earlier:

arch/x86/kernel/entry_32.S ...
syscall_trace_entry:
        movl $-ENOSYS,PT_EAX(%esp)
        movl %esp, %eax
        call syscall_trace_enter
        /* What it returned is what we'll actually use.  */
        cmpl $(NR_syscalls), %eax
        jnae syscall_call
        jmp syscall_exit
END(syscall_trace_entry)

Both before and after the 2-phase change, syscall_trace_enter would
return -1 if it hit SECCOMP_RET_ERRNO, before calling
tracehook_report_syscall_entry. On exit, if PTRACE_SYSCALL, we'd hit
tracehook_report_syscall_exit during syscall_trace_leave, which means
a ptracer will see a syscall-exit-stop without a matching
syscall-enter-stop.
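
The unsigned compare in that assembly is what turns a -1 return into a skip. A rough C rendering follows, with NR_SYSCALLS_DEMO as a placeholder (the real NR_syscalls depends on architecture and kernel version) and should_dispatch() as an illustrative name only:

```c
#include <stdbool.h>

/* Placeholder table size; the real NR_syscalls varies. */
#define NR_SYSCALLS_DEMO 400U

/*
 * Mirrors "cmpl $(NR_syscalls), %eax; jnae syscall_call":
 * dispatch only if the value is below the table size when
 * compared as unsigned, so -1 (0xffffffff) goes straight to
 * the exit path without running a handler.
 */
static bool should_dispatch(int nr_from_trace_enter)
{
	return (unsigned int)nr_from_trace_enter < NR_SYSCALLS_DEMO;
}
```

With this reading, any negative value returned by syscall_trace_enter skips the syscall, and the exit-side work still runs.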

Using SECCOMP_RET_TRACE with PTRACE_SYSCALL in place seems totally
crazy, as the ptracer would need to be the same program, and if it
chose to skip a syscall, it would be in the same place: it would see
PTRACE_EVENT_SECCOMP, then no syscall-enter-stop, then a
syscall-exit-stop. I think we can ignore this pathological case.

Using SECCOMP_RET_TRAP with PTRACE_SYSCALL also results in a skip,
which produces the same "only syscall-exit-stop seen" problem.

In the SECCOMP_RET_ERRNO case, the syscall nr doesn't change (and
isn't executed). In the SECCOMP_RET_TRAP case, the syscall nr doesn't
change (and isn't executed). In the SECCOMP_RET_TRACE case, the syscall nr
_could_ change, but the ptracer would be doing it, so the crazy
situation around PTRACE_SYSCALL is probably safe to ignore (as long as
we document what is expected to happen).

So, the question is: should PTRACE_SYSCALL see a syscall that is _not_
being executed (due to seccomp)? Audit doesn't see it currently, and
neither does ptrace. I would argue that it should continue to not see
the syscall. That said, if it shouldn't be shown, we also shouldn't
trigger syscall-exit-stop. If you can convince me it should see
syscall-enter-stop, then I have two questions:

1) Do we accept that a ptracer can interfere with SECCOMP_RET_ERRNO? I
think we probably must, since it can already interfere via
syscall-exit-stop and change the errno. And especially since a ptracer
can change syscalls during syscall-enter-stop to any syscall it wants,
bypassing seccomp. This condition is already documented.

2) What do we do with audit? Suddenly we have ptrace seeing a syscall
that audit doesn't?

And an unrelated thought:

3) Can't we find some way to fix the inability of a ptracer to
distinguish between syscall-enter-stop and syscall-exit-stop?

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
  2015-02-06 19:23                         ` Kees Cook
  (?)
@ 2015-02-06 19:32                           ` Andy Lutomirski
  -1 siblings, 0 replies; 82+ messages in thread
From: Andy Lutomirski @ 2015-02-06 19:32 UTC (permalink / raw)
  To: Kees Cook
  Cc: Dmitry V. Levin, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker, Michael Kerrisk-manpages

On Fri, Feb 6, 2015 at 11:23 AM, Kees Cook <keescook@chromium.org> wrote:
> On Thu, Feb 5, 2015 at 6:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Thu, Feb 5, 2015 at 6:32 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>>> On Thu, Feb 05, 2015 at 04:09:06PM -0800, Andy Lutomirski wrote:
>>>> On Thu, Feb 5, 2015 at 3:49 PM, Kees Cook <keescook@chromium.org> wrote:
>>>> > On Thu, Feb 5, 2015 at 3:39 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>>> [...]
>>>> >> There is a clear difference: before these changes, SECCOMP_RET_ERRNO used
>>>> >> to keep the syscall number unchanged and suppress syscall-exit-stop event,
>>>> >> which was awful because userspace cannot distinguish syscall-enter-stop
>>>> >> from syscall-exit-stop and therefore relies on the kernel that
>>>> >> syscall-enter-stop is followed by syscall-exit-stop (or tracee's death, etc.).
>>>> >>
>>>> >> After these changes, SECCOMP_RET_ERRNO no longer causes syscall-exit-stop
>>>> >> events to be suppressed, but now the syscall number is lost.
>>>> >
>>>> > Ah-ha! Okay, thanks, I understand now. I think this means seccomp
>>>> > phase1 should not treat RET_ERRNO as a "skip" event. Andy, what do you
>>>> > think here?
>>>>
>>>> I still don't quite see how this change caused this.
>>>
>>> I have a test for this at
>>> http://sourceforge.net/p/strace/code/ci/HEAD/~/tree/test/seccomp.c
>>>
>>>> I can play with
>>>> it a bit more.  But RET_ERRNO *has* to be some kind of skip event,
>>>> because it needs to skip the syscall.
>>>>
>>>> We could change this by treating RET_ERRNO as an instruction to enter
>>>> phase 2 and then asking for a skip in phase 2 without changing
>>>> orig_ax, but IMO this is pretty ugly.
>>>>
>>>> I think this all kind of sucks.  We're trying to run ptrace after
>>>> seccomp, so ptrace is seeing the syscalls as transformed by seccomp.
>>>> That means that if we use RET_TRAP, then ptrace will see the
>>>> possibly-modified syscall, if we use RET_ERRNO, then ptrace is (IMO
>>>> correctly given the current design) showing syscall -1, and if we use
>>>> RET_KILL, then ptrace just sees the process mysteriously die.
>>>
>>> Userspace is usually not prepared to see syscall -1.
>>> For example, strace had to be patched, otherwise it just skipped such
>>> syscalls as "not a syscall" events or did other improper things:
>>> http://sourceforge.net/p/strace/code/ci/c3948327717c29b10b5e00a436dc138b4ab1a486
>>> http://sourceforge.net/p/strace/code/ci/8e398b6c4020fb2d33a5b3e40271ebf63199b891
>>>
>>
>> The x32 thing is a legit ABI bug, I'd argue.  I'd be happy to submit a
>> patch to fix that (clear the x32 bit if we're not x32).
>>
>>> A slightly different but related story: userspace is also not prepared
>>> to handle large errno values produced by seccomp filters like this:
>>> BPF_STMT(BPF_RET, SECCOMP_RET_ERRNO | SECCOMP_RET_DATA)
>>>
>>> For example, glibc assumes that syscalls do not return errno values greater than 0xfff:
>>> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/sysdep.h#l55
>>> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/syscall.S#l20
>
> To save others the link reading: "Linus said he will make sure the no
> syscall returns a value in -1 .. -4095 as a valid result so we can
> savely test with -4095."
>
> Strictly speaking (ISO C, "man 3 errno"), errno is supposed to be a
> full int, though digging around I find this in include/linux/err.h:
>
> /*
>  * Kernel pointers have redundant information, so we can use a
>  * scheme where we can return either an error code or a normal
>  * pointer with the same return value.
>  *
>  * This should be a per-architecture thing, to allow different
>  * error and pointer decisions.
>  */
> #define MAX_ERRNO       4095
>
> #ifndef __ASSEMBLY__
>
> #define IS_ERR_VALUE(x) unlikely((x) >= (unsigned long)-MAX_ERRNO)
>
> But no architecture overrides this.
>
>>> If it isn't too late, I'd recommend changing SECCOMP_RET_DATA mask
>>> applied in SECCOMP_RET_ERRNO case from current 0xffff to 0xfff.
>
> I'm not opposed to this. I would want to more explicitly document the
> 4095 max value in man pages, though.
>
>> I think this is solidly in the "don't do that" category.  Seccomp lets
>> you tamper with syscalls.  If you tamper wrong, then you lose.
>>
>> Kees, what do you think about reversing the whole thing so that
>> ptracers always see the original syscall?
>
> What do you mean by "reversing"? The interactions I see here are:
>
> PTRACE_SYSCALL
> SECCOMP_RET_ERRNO
> SECCOMP_RET_TRACE
> SECCOMP_RET_TRAP
>
> Both ptrace and seccomp will trigger via _TIF_WORK_SYSCALL_ENTRY. Only
> ptrace will trigger via _TIF_WORK_SYSCALL_EXIT.
>
> For SECCOMP_RET_ERRNO to work, we must skip the syscall, as mentioned earlier:
>
> arch/x86/kernel/entry_32.S ...
> syscall_trace_entry:
>         movl $-ENOSYS,PT_EAX(%esp)
>         movl %esp, %eax
>         call syscall_trace_enter
>         /* What it returned is what we'll actually use.  */
>         cmpl $(NR_syscalls), %eax
>         jnae syscall_call
>         jmp syscall_exit
> END(syscall_trace_entry)
>
> Both before and after the 2-phase change, syscall_trace_enter would
> return -1 if it hit SECCOMP_RET_ERRNO, before calling
> tracehook_report_syscall_entry. On exit, if PTRACE_SYSCALL, we'd hit
> tracehook_report_syscall_exit during syscall_trace_leave, which means
> a ptracer will see a syscall-exit-stop without a matching
> syscall-enter-stop.
>
> Using SECCOMP_RET_TRACE with PTRACE_SYSCALL in place seems totally
> crazy, as the ptracer would need to be the same program, and if it
> chose to skip a syscall, it would be in the same place: it would see
> PTRACE_EVENT_SECCOMP, then no syscall-enter-stop, then a
> syscall-exit-stop. I think we can ignore this pathological case.
>
> Using SECCOMP_RET_TRAP with PTRACE_SYSCALL also results in a skip,
> which produces the same "only syscall-exit-stop seen" problem.
>
> In the SECCOMP_RET_ERRNO case, the syscall nr doesn't change (and
> isn't executed). In the SECCOMP_RET_TRAP case, the syscall nr doesn't
> change (and isn't executed). In the SECCOMP_RET_TRACE, the syscall nr
> _could_ change, but the ptracer would be doing it, so the crazy
> situation around PTRACE_SYSCALL is probably safe to ignore (as long as
> we document what is expected to happen).
>
> So, the question is: should PTRACE_SYSCALL see a syscall that is _not_
> being executed (due to seccomp)? Audit doesn't see it currently, and
> neither does ptrace. I would argue that it should continue to not see
> the syscall. That said, if it shouldn't be shown, we also shouldn't
> trigger syscall-exit-stop. If you can convince me it should see
> syscall-enter-stop, then I have two questions:

I think PTRACE_SYSCALL should see syscalls that are skipped due to
seccomp.  I think that the exit event should see the modified errno,
if any, so that strace will show whatever the traced process thinks is
happening.

>
> 1) Do we accept that a ptracer can interfere with SECCOMP_RET_ERRNO? I
> think we probably must, since it can already interfere via
> syscall-exit-stop and change the errno.

I think this is fine.

> And especially since a ptracer
> can change syscalls during syscall-enter-stop to any syscall it wants,
> bypassing seccomp. This condition is already documented.

If a ptracer (using PTRACE_SYSCALL) were to get the entry callback
before seccomp, then this oddity would go away, which might be a good
thing.  A ptracer could change the syscall, but seccomp would then act
based on what the ptracer changed the syscall to.

>
> 2) What do we do with audit? Suddenly we have ptrace seeing a syscall
> that audit doesn't?

Is this a problem?  I'd be amazed if any program uses both ptrace and
audit -- after all, audit is a global thing, and it only has one
implementation (AFAIK): auditd.  auditd doesn't ptrace the world.

>
> And an unrelated thought:
>
> 3) Can't we find some way to fix the inability of a ptracer to
> distinguish between syscall-enter-stop and syscall-exit-stop?
>

Couldn't we add PTRACE_O_TRACESYSENTRY and PTRACE_O_TRACESYSEXIT along
the lines of PTRACE_O_TRACESYSGOOD?
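
For reference, a sketch of what a tracer has to do today with only PTRACE_O_TRACESYSGOOD: that option marks a stop as a syscall stop (SIGTRAP | 0x80) but does not say which side it is, so the tracer flips its own flag. The enum and classify_stop() are illustrative names, not taken from strace or any other tracer.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/wait.h>

enum stop_kind { STOP_OTHER, STOP_SYSCALL_ENTER, STOP_SYSCALL_EXIT };

/*
 * With PTRACE_O_TRACESYSGOOD set, syscall stops are reported as
 * SIGTRAP | 0x80.  The tracer has to guess enter vs. exit by
 * flipping its own flag on every such stop.
 */
static enum stop_kind classify_stop(int wait_status, bool *in_syscall)
{
	if (!WIFSTOPPED(wait_status) ||
	    WSTOPSIG(wait_status) != (SIGTRAP | 0x80))
		return STOP_OTHER;

	*in_syscall = !*in_syscall;
	return *in_syscall ? STOP_SYSCALL_ENTER : STOP_SYSCALL_EXIT;
}
```

If an exit stop ever arrives without a preceding entry stop, this flag stays inverted for the rest of the trace, which is the failure mode under discussion.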

--Andy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
@ 2015-02-06 19:32                           ` Andy Lutomirski
  0 siblings, 0 replies; 82+ messages in thread
From: Andy Lutomirski @ 2015-02-06 19:32 UTC (permalink / raw)
  To: Kees Cook
  Cc: Dmitry V. Levin, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker, Michael Kerrisk-manpages

On Fri, Feb 6, 2015 at 11:23 AM, Kees Cook <keescook@chromium.org> wrote:
> On Thu, Feb 5, 2015 at 6:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Thu, Feb 5, 2015 at 6:32 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>>> On Thu, Feb 05, 2015 at 04:09:06PM -0800, Andy Lutomirski wrote:
>>>> On Thu, Feb 5, 2015 at 3:49 PM, Kees Cook <keescook@chromium.org> wrote:
>>>> > On Thu, Feb 5, 2015 at 3:39 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>>> [...]
>>>> >> There is a clear difference: before these changes, SECCOMP_RET_ERRNO used
>>>> >> to keep the syscall number unchanged and suppress syscall-exit-stop event,
>>>> >> which was awful because userspace cannot distinguish syscall-enter-stop
>>>> >> from syscall-exit-stop and therefore relies on the kernel that
>>>> >> syscall-enter-stop is followed by syscall-exit-stop (or tracee's death, etc.).
>>>> >>
>>>> >> After these changes, SECCOMP_RET_ERRNO no longer causes syscall-exit-stop
>>>> >> events to be suppressed, but now the syscall number is lost.
>>>> >
>>>> > Ah-ha! Okay, thanks, I understand now. I think this means seccomp
>>>> > phase1 should not treat RET_ERRNO as a "skip" event. Andy, what do you
>>>> > think here?
>>>>
>>>> I still don't quite see how this change caused this.
>>>
>>> I have a test for this at
>>> http://sourceforge.net/p/strace/code/ci/HEAD/~/tree/test/seccomp.c
>>>
>>>> I can play with
>>>> it a bit more.  But RET_ERRNO *has* to be some kind of skip event,
>>>> because it needs to skip the syscall.
>>>>
>>>> We could change this by treating RET_ERRNO as an instruction to enter
>>>> phase 2 and then asking for a skip in phase 2 without changing
>>>> orig_ax, but IMO this is pretty ugly.
>>>>
>>>> I think this all kind of sucks.  We're trying to run ptrace after
>>>> seccomp, so ptrace is seeing the syscalls as transformed by seccomp.
>>>> That means that if we use RET_TRAP, then ptrace will see the
>>>> possibly-modified syscall, if we use RET_ERRNO, then ptrace is (IMO
>>>> correctly given the current design) showing syscall -1, and if we use
>>>> RET_KILL, then ptrace just sees the process mysteriously die.
>>>
>>> Userspace is usually not prepared to see syscall -1.
>>> For example, strace had to be patched, otherwise it just skipped such
>>> syscalls as "not a syscall" events or did other improper things:
>>> http://sourceforge.net/p/strace/code/ci/c3948327717c29b10b5e00a436dc138b4ab1a486
>>> http://sourceforge.net/p/strace/code/ci/8e398b6c4020fb2d33a5b3e40271ebf63199b891
>>>
>>
>> The x32 thing is a legit ABI bug, I'd argue.  I'd be happy to submit a
>> patch to fix that (clear the x32 bit if we're not x32).
>>
>>> A slightly different but related story: userspace is also not prepared
>>> to handle large errno values produced by seccomp filters like this:
>>> BPF_STMT(BPF_RET, SECCOMP_RET_ERRNO | SECCOMP_RET_DATA)
>>>
>>> For example, glibc assumes that syscalls do not return errno values greater than 0xfff:
>>> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/sysdep.h#l55
>>> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/syscall.S#l20
>
> To save others the link reading: "Linus said he will make sure the no
> syscall returns a value in -1 .. -4095 as a valid result so we can
> savely test with -4095."
>
> Strictly speaking (ISO C, "man 3 errno"), errno is supposed to be a
> full int, though digging around I find this in include/linux/err.h:
>
> /*
>  * Kernel pointers have redundant information, so we can use a
>  * scheme where we can return either an error code or a normal
>  * pointer with the same return value.
>  *
>  * This should be a per-architecture thing, to allow different
>  * error and pointer decisions.
>  */
> #define MAX_ERRNO       4095
>
> #ifndef __ASSEMBLY__
>
> #define IS_ERR_VALUE(x) unlikely((x) >= (unsigned long)-MAX_ERRNO)
>
> But no architecture overrides this.
>
>>> If it isn't too late, I'd recommend changing SECCOMP_RET_DATA mask
>>> applied in SECCOMP_RET_ERRNO case from current 0xffff to 0xfff.
>
> I'm not opposed to this. I would want to more explicitly document the
> 4095 max value in man pages, though.
>
>> I think this is solidly in the "don't do that" category.  Seccomp lets
>> you tamper with syscalls.  If you tamper wrong, then you lose.
>>
>> Kees, what do you think about reversing the whole thing so that
>> ptracers always see the original syscall?
>
> What do you mean by "reversing"? The interactions I see here are:
>
> PTRACE_SYSCALL
> SECCOMP_RET_ERRNO
> SECCOMP_RET_TRACE
> SECCOMP_RET_TRAP
>
> Both ptrace and seccomp will trigger via _TIF_WORK_SYSCALL_ENTRY. Only
> ptrace will trigger via _TIF_WORK_SYSCALL_EXIT.
>
> For SECCOMP_RET_ERRNO to work, we must skip the syscall, as mentioned earlier:
>
> arch/x86/kernel/entry_32.S ...
> syscall_trace_entry:
>         movl $-ENOSYS,PT_EAX(%esp)
>         movl %esp, %eax
>         call syscall_trace_enter
>         /* What it returned is what we'll actually use.  */
>         cmpl $(NR_syscalls), %eax
>         jnae syscall_call
>         jmp syscall_exit
> END(syscall_trace_entry)
>
> Both before and after the 2-phase change, syscall_trace_enter would
> return -1 if it hit SECCOMP_RET_ERRNO, before calling
> tracehook_report_syscall_entry. On exit, if PTRACE_SYSCALL, we'd hit
> tracehook_report_syscall_exit during syscall_trace_leave, which means
> a ptracer will see a syscall-exit-stop without a matching
> syscall-enter-stop.
>
> Using SECCOMP_RET_TRACE with PTRACE_SYSCALL in place seems totally
> crazy, as the ptracer would need to be the same program, and if it
> chose to skip a syscall, it would be in the same place: it would see
> PTRACE_EVENT_SECCOMP, then no syscall-enter-stop, then a
> syscall-exit-stop. I think we can ignore this pathological case.
>
> Using SECCOMP_RET_TRAP with PTRACE_SYSCALL also results in a skip,
> which produces the same "only syscall-exit-stop seen" problem.
>
> In the SECCOMP_RET_ERRNO case, the syscall nr doesn't change (and
> isn't executed). In the SECCOMP_RET_TRAP case, the syscall nr doesn't
> change (and isn't executed). In the SECCOMP_RET_TRACE case, the syscall nr
> _could_ change, but the ptracer would be doing it, so the crazy
> situation around PTRACE_SYSCALL is probably safe to ignore (as long as
> we document what is expected to happen).
>
> So, the question is: should PTRACE_SYSCALL see a syscall that is _not_
> being executed (due to seccomp)? Audit doesn't see it currently, and
> neither does ptrace. I would argue that it should continue to not see
> the syscall. That said, if it shouldn't be shown, we also shouldn't
> trigger syscall-exit-stop. If you can convince me it should see
> syscall-enter-stop, then I have two questions:

I think PTRACE_SYSCALL should see syscalls that are skipped due to
seccomp.  I think that the exit event should see the modified errno,
if any, so that strace will show whatever the traced process thinks is
happening.
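
For reference, here is a minimal sketch of what "the modified errno"
looks like from the tracer's side at syscall-exit-stop.  It assumes the
tracer has already read the raw return register (rax on x86-64) and
reuses the MAX_ERRNO bound quoted above; ret_is_error() is a made-up
helper name, not something from strace or the kernel.

```c
#include <stdbool.h>

/* Same bound as MAX_ERRNO in include/linux/err.h. */
#define MAX_ERRNO 4095UL

/*
 * Hypothetical helper: decide whether the raw return-register value
 * seen at syscall-exit-stop encodes an error.  Only values in the
 * range -4095..-1 are treated as -errno; anything else is a result.
 */
bool ret_is_error(unsigned long raw, int *err)
{
	if (raw >= (unsigned long)-MAX_ERRNO) {
		*err = (int)-(long)raw;
		return true;
	}
	*err = 0;
	return false;
}
```

With this check, a filter returning SECCOMP_RET_ERRNO with data 1 shows
up as errno 1, while data 0xffff produces -65535, which falls outside
the range and would be read as a successful result.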

>
> 1) Do we accept that a ptracer can interfere with SECCOMP_RET_ERRNO? I
> think we probably must, since it can already interfere via
> syscall-exit-stop and change the errno.

I think this is fine.

> And especially since a ptracer
> can change syscalls during syscall-enter-stop to any syscall it wants,
> bypassing seccomp. This condition is already documented.

If a ptracer (using PTRACE_SYSCALL) were to get the entry callback
before seccomp, then this oddity would go away, which might be a good
thing.  A ptracer could change the syscall, but seccomp would act based on
what the ptracer changed the syscall to.
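
A toy model of the two orderings, just to make the difference concrete.
Everything here is hypothetical: the filter denies a single syscall
number, and the tracer is reduced to "rewrite nr to rewritten_nr" at
syscall-enter-stop.  None of these names exist in the kernel.

```c
#include <stdbool.h>

/* Hypothetical filter: deny exactly one syscall number. */
bool filter_allows(long nr, long denied_nr)
{
	return nr != denied_nr;
}

/* Current order: seccomp checks nr, then the tracer may rewrite it. */
bool runs_seccomp_first(long nr, long rewritten_nr, long denied_nr)
{
	if (!filter_allows(nr, denied_nr))
		return false;
	/* The tracer rewrites nr here and nothing re-checks the result. */
	(void)rewritten_nr;
	return true;
}

/* Reversed order: the tracer rewrites first, then seccomp checks. */
bool runs_ptrace_first(long nr, long rewritten_nr, long denied_nr)
{
	(void)nr;
	return filter_allows(rewritten_nr, denied_nr);
}
```

If the tracee issues an allowed syscall and the tracer rewrites it to
the denied one, runs_seccomp_first() lets it through and
runs_ptrace_first() does not.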

>
> 2) What do we do with audit? Suddenly we have ptrace seeing a syscall
> that audit doesn't?

Is this a problem?  I'd be amazed if any program uses both ptrace and
audit -- after all, audit is a global thing, and it only has one
implementation (AFAIK): auditd.  auditd doesn't ptrace the world.

>
> And an unrelated thought:
>
> 3) Can't we find some way to fix the inability of a ptracer to
> distinguish between syscall-enter-stop and syscall-exit-stop?
>

Couldn't we add PTRACE_O_TRACESYSENTRY and PTRACE_O_TRACESYSEXIT along
the lines of PTRACE_O_TRACESYSGOOD?
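
To spell out why separate options would help: with only
PTRACE_O_TRACESYSGOOD, both kinds of syscall-stop are reported as
SIGTRAP | 0x80, so a tracer has to alternate.  A rough sketch of that
bookkeeping follows; struct tracee_state and classify_stop() are
invented for illustration, and stopsig is assumed to come from
WSTOPSIG() on the wait status.

```c
#include <signal.h>
#include <stdbool.h>

/* With PTRACE_O_TRACESYSGOOD set, syscall-stops report SIGTRAP | 0x80. */
#define SYSCALL_STOP_SIG (SIGTRAP | 0x80)

/* Hypothetical per-tracee state kept by the tracer. */
struct tracee_state {
	bool in_syscall;	/* true between enter-stop and exit-stop */
};

enum stop_kind { STOP_OTHER, STOP_SYSCALL_ENTER, STOP_SYSCALL_EXIT };

/*
 * Entry and exit stops look identical, so the only thing the tracer
 * can do is flip a flag on every syscall-stop it observes.
 */
enum stop_kind classify_stop(struct tracee_state *t, int stopsig)
{
	if (stopsig != SYSCALL_STOP_SIG)
		return STOP_OTHER;
	t->in_syscall = !t->in_syscall;
	return t->in_syscall ? STOP_SYSCALL_ENTER : STOP_SYSCALL_EXIT;
}
```

If the kernel ever delivers a lone exit-stop (as in the skipped-syscall
case discussed above), this tracer labels it as an entry and stays out
of sync for every later stop.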

--Andy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
  2015-02-06 19:32                           ` Andy Lutomirski
  (?)
@ 2015-02-06 20:07                             ` Kees Cook
  -1 siblings, 0 replies; 82+ messages in thread
From: Kees Cook @ 2015-02-06 20:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dmitry V. Levin, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker, Michael Kerrisk-manpages

On Fri, Feb 6, 2015 at 11:32 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Feb 6, 2015 at 11:23 AM, Kees Cook <keescook@chromium.org> wrote:
>> On Thu, Feb 5, 2015 at 6:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Thu, Feb 5, 2015 at 6:32 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>>>> On Thu, Feb 05, 2015 at 04:09:06PM -0800, Andy Lutomirski wrote:
>>>>> On Thu, Feb 5, 2015 at 3:49 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>> > On Thu, Feb 5, 2015 at 3:39 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>>>> [...]
>>>>> >> There is a clear difference: before these changes, SECCOMP_RET_ERRNO used
>>>>> >> to keep the syscall number unchanged and suppress syscall-exit-stop event,
>>>>> >> which was awful because userspace cannot distinguish syscall-enter-stop
>>>>> >> from syscall-exit-stop and therefore relies on the kernel that
>>>>> >> syscall-enter-stop is followed by syscall-exit-stop (or tracee's death, etc.).
>>>>> >>
>>>>> >> After these changes, SECCOMP_RET_ERRNO no longer causes syscall-exit-stop
>>>>> >> events to be suppressed, but now the syscall number is lost.
>>>>> >
>>>>> > Ah-ha! Okay, thanks, I understand now. I think this means seccomp
>>>>> > phase1 should not treat RET_ERRNO as a "skip" event. Andy, what do you
>>>>> > think here?
>>>>>
>>>>> I still don't quite see how this change caused this.
>>>>
>>>> I have a test for this at
>>>> http://sourceforge.net/p/strace/code/ci/HEAD/~/tree/test/seccomp.c
>>>>
>>>>> I can play with
>>>>> it a bit more.  But RET_ERRNO *has* to be some kind of skip event,
>>>>> because it needs to skip the syscall.
>>>>>
>>>>> We could change this by treating RET_ERRNO as an instruction to enter
>>>>> phase 2 and then asking for a skip in phase 2 without changing
>>>>> orig_ax, but IMO this is pretty ugly.
>>>>>
>>>>> I think this all kind of sucks.  We're trying to run ptrace after
>>>>> seccomp, so ptrace is seeing the syscalls as transformed by seccomp.
>>>>> That means that if we use RET_TRAP, then ptrace will see the
>>>>> possibly-modified syscall, if we use RET_ERRNO, then ptrace is (IMO
>>>>> correctly given the current design) showing syscall -1, and if we use
>>>>> RET_KILL, then ptrace just sees the process mysteriously die.
>>>>
>>>> Userspace is usually not prepared to see syscall -1.
>>>> For example, strace had to be patched, otherwise it just skipped such
>>>> syscalls as "not a syscall" events or did other improper things:
>>>> http://sourceforge.net/p/strace/code/ci/c3948327717c29b10b5e00a436dc138b4ab1a486
>>>> http://sourceforge.net/p/strace/code/ci/8e398b6c4020fb2d33a5b3e40271ebf63199b891
>>>>
>>>
>>> The x32 thing is a legit ABI bug, I'd argue.  I'd be happy to submit a
>>> patch to fix that (clear the x32 bit if we're not x32).
>>>
>>>> A slightly different but related story: userspace is also not prepared
>>>> to handle large errno values produced by seccomp filters like this:
>>>> BPF_STMT(BPF_RET, SECCOMP_RET_ERRNO | SECCOMP_RET_DATA)
>>>>
>>>> For example, glibc assumes that syscalls do not return errno values greater than 0xfff:
>>>> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/sysdep.h#l55
>>>> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/syscall.S#l20
>>
>> To save others the link reading: "Linus said he will make sure the no
>> syscall returns a value in -1 .. -4095 as a valid result so we can
>> savely test with -4095."
>>
>> Strictly speaking (ISO C, "man 3 errno"), errno is supposed to be a
>> full int, though digging around I find this in include/linux/err.h:
>>
>> /*
>>  * Kernel pointers have redundant information, so we can use a
>>  * scheme where we can return either an error code or a normal
>>  * pointer with the same return value.
>>  *
>>  * This should be a per-architecture thing, to allow different
>>  * error and pointer decisions.
>>  */
>> #define MAX_ERRNO       4095
>>
>> #ifndef __ASSEMBLY__
>>
>> #define IS_ERR_VALUE(x) unlikely((x) >= (unsigned long)-MAX_ERRNO)
>>
>> But no architecture overrides this.
>>
>>>> If it isn't too late, I'd recommend changing SECCOMP_RET_DATA mask
>>>> applied in SECCOMP_RET_ERRNO case from current 0xffff to 0xfff.
>>
>> I'm not opposed to this. I would want to more explicitly document the
>> 4095 max value in man pages, though.
>>
>>> I think this is solidly in the "don't do that" category.  Seccomp lets
>>> you tamper with syscalls.  If you tamper wrong, then you lose.
>>>
>>> Kees, what do you think about reversing the whole thing so that
>>> ptracers always see the original syscall?
>>
>> What do you mean by "reversing"? The interactions I see here are:
>>
>> PTRACE_SYSCALL
>> SECCOMP_RET_ERRNO
>> SECCOMP_RET_TRACE
>> SECCOMP_RET_TRAP
>>
>> Both ptrace and seccomp will trigger via _TIF_WORK_SYSCALL_ENTRY. Only
>> ptrace will trigger via _TIF_WORK_SYSCALL_EXIT.
>>
>> For SECCOMP_RET_ERRNO to work, we must skip the syscall, as mentioned earlier:
>>
>> arch/x86/kernel/entry_32.S ...
>> syscall_trace_entry:
>>         movl $-ENOSYS,PT_EAX(%esp)
>>         movl %esp, %eax
>>         call syscall_trace_enter
>>         /* What it returned is what we'll actually use.  */
>>         cmpl $(NR_syscalls), %eax
>>         jnae syscall_call
>>         jmp syscall_exit
>> END(syscall_trace_entry)
>>
>> Both before and after the 2-phase change, syscall_trace_enter would
>> return -1 if it hit SECCOMP_RET_ERRNO, before calling
>> tracehook_report_syscall_entry. On exit, if PTRACE_SYSCALL, we'd hit
>> tracehook_report_syscall_exit during syscall_trace_leave, which means
>> a ptracer will see a syscall-exit-stop without a matching
>> syscall-enter-stop.
>>
>> Using SECCOMP_RET_TRACE with PTRACE_SYSCALL in place seems totally
>> crazy, as the ptracer would need to be the same program, and if it
>> chose to skip a syscall, it would be in the same place: it would see
>> PTRACE_EVENT_SECCOMP, then no syscall-enter-stop, then a
>> syscall-exit-stop. I think we can ignore this pathological case.
>>
>> Using SECCOMP_RET_TRAP with PTRACE_SYSCALL also results in a skip,
>> which produces the same "only syscall-exit-stop seen" problem.
>>
>> In the SECCOMP_RET_ERRNO case, the syscall nr doesn't change (and
>> isn't executed). In the SECCOMP_RET_TRAP case, the syscall nr doesn't
>> change (and isn't executed). In the SECCOMP_RET_TRACE case, the syscall nr
>> _could_ change, but the ptracer would be doing it, so the crazy
>> situation around PTRACE_SYSCALL is probably safe to ignore (as long as
>> we document what is expected to happen).
>>
>> So, the question is: should PTRACE_SYSCALL see a syscall that is _not_
>> being executed (due to seccomp)? Audit doesn't see it currently, and
>> neither does ptrace. I would argue that it should continue to not see
>> the syscall. That said, if it shouldn't be shown, we also shouldn't
>> trigger syscall-exit-stop. If you can convince me it should see
>> syscall-enter-stop, then I have two questions:
>
> I think PTRACE_SYSCALL should see syscalls that are skipped due to
> seccomp.  I think that the exit event should see the modified errno,
> if any, so that strace will show whatever the traced process thinks is
> happening.
>
>>
>> 1) Do we accept that a ptracer can interfere with SECCOMP_RET_ERRNO? I
>> think we probably must, since it can already interfere via
>> syscall-exit-stop and change the errno.
>
> I think this is fine.
>
>> And especially since a ptracer
>> can change syscalls during syscall-enter-stop to any syscall it wants,
>> bypassing seccomp. This condition is already documented.
>
> If a ptracer (using PTRACE_SYSCALL) were to get the entry callback
> before seccomp, then this oddity would go away, which might be a good
> thing.  A ptracer could change the syscall, but seccomp would act based on
> what the ptracer changed the syscall to.

I want kill events to trigger immediately. I don't want to leave the
ptrace surface available on a SECCOMP_RET_KILL. So maybe it can be
seccomp phase 1, then ptrace, then seccomp phase 2? And pass more
information between phases to determine how things should behave
beyond just "skip"?
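
Something like the following toy model is what I have in mind; it is
only one possible reading of the idea, not working kernel code.  The
verdict enum, struct model_result and model_entry() are all made up
for illustration: phase 1 has already evaluated the filter, a kill
acts at once, the tracer then sees the untouched syscall nr, and
phase 2 applies whatever phase 1 decided.

```c
#include <stdbool.h>

/* Hypothetical verdicts for this model; not the kernel's types. */
enum verdict { V_ALLOW, V_ERRNO, V_KILL };

struct model_result {
	bool killed;		/* task was killed in phase 1 */
	bool tracer_ran;	/* tracer got a syscall-enter-stop */
	long nr_seen;		/* syscall nr the tracer saw, -1 if none */
	bool executed;		/* syscall body actually ran */
	long ret;		/* value returned to the tracee */
};

struct model_result model_entry(long nr, enum verdict v, int err, bool traced)
{
	struct model_result r = { .nr_seen = -1 };

	/* Phase 1: a kill takes effect before any ptrace stop. */
	if (v == V_KILL) {
		r.killed = true;
		return r;
	}

	/* ptrace: the tracer sees the original syscall nr. */
	if (traced) {
		r.tracer_ran = true;
		r.nr_seen = nr;
	}

	/* Phase 2: apply the verdict carried over from phase 1. */
	if (v == V_ERRNO) {
		r.ret = -err;
		return r;
	}
	r.executed = true;
	return r;
}
```

In this model a RET_ERRNO syscall still produces an entry stop with the
real syscall nr, and a RET_KILL never exposes a stop to the tracer.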

>> 2) What do we do with audit? Suddenly we have ptrace seeing a syscall
>> that audit doesn't?
>
> Is this a problem?  I'd be amazed if any program uses both ptrace and
> audit -- after all, audit is a global thing, and it only has one
> implementation (AFAIK): auditd.  auditd doesn't ptrace the world.
>
>>
>> And an unrelated thought:
>>
>> 3) Can't we find some way to fix the inability of a ptracer to
>> distinguish between syscall-enter-stop and syscall-exit-stop?
>>
>
> Couldn't we add PTRACE_O_TRACESYSENTRY and PTRACE_O_TRACESYSEXIT along
> the lines of PTRACE_O_TRACESYSGOOD?

That might be a nice idea. I haven't written a test to see, but what
does PTRACE_GETEVENTMSG return on syscall-enter/exit-stop? If we can't
add something there, then yeah, adding PTRACE_O_TRACESYSENTRY and
PTRACE_O_TRACESYSEXIT with their own event msgs would be nice. Could
even add the syscall nr to the event msg so ptracers don't have to dig
around in per-arch registers, too.

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
@ 2015-02-06 20:07                             ` Kees Cook
  0 siblings, 0 replies; 82+ messages in thread
From: Kees Cook @ 2015-02-06 20:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dmitry V. Levin, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker, Michael Kerrisk-manpages

On Fri, Feb 6, 2015 at 11:32 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Feb 6, 2015 at 11:23 AM, Kees Cook <keescook@chromium.org> wrote:
>> On Thu, Feb 5, 2015 at 6:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Thu, Feb 5, 2015 at 6:32 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>>>> On Thu, Feb 05, 2015 at 04:09:06PM -0800, Andy Lutomirski wrote:
>>>>> On Thu, Feb 5, 2015 at 3:49 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>> > On Thu, Feb 5, 2015 at 3:39 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
>>>> [...]
>>>>> >> There is a clear difference: before these changes, SECCOMP_RET_ERRNO used
>>>>> >> to keep the syscall number unchanged and suppress syscall-exit-stop event,
>>>>> >> which was awful because userspace cannot distinguish syscall-enter-stop
>>>>> >> from syscall-exit-stop and therefore relies on the kernel that
>>>>> >> syscall-enter-stop is followed by syscall-exit-stop (or tracee's death, etc.).
>>>>> >>
>>>>> >> After these changes, SECCOMP_RET_ERRNO no longer causes syscall-exit-stop
>>>>> >> events to be suppressed, but now the syscall number is lost.
>>>>> >
>>>>> > Ah-ha! Okay, thanks, I understand now. I think this means seccomp
>>>>> > phase1 should not treat RET_ERRNO as a "skip" event. Andy, what do you
>>>>> > think here?
>>>>>
>>>>> I still don't quite see how this change caused this.
>>>>
>>>> I have a test for this at
>>>> http://sourceforge.net/p/strace/code/ci/HEAD/~/tree/test/seccomp.c
>>>>
>>>>> I can play with
>>>>> it a bit more.  But RET_ERRNO *has* to be some kind of skip event,
>>>>> because it needs to skip the syscall.
>>>>>
>>>>> We could change this by treating RET_ERRNO as an instruction to enter
>>>>> phase 2 and then asking for a skip in phase 2 without changing
>>>>> orig_ax, but IMO this is pretty ugly.
>>>>>
>>>>> I think this all kind of sucks.  We're trying to run ptrace after
>>>>> seccomp, so ptrace is seeing the syscalls as transformed by seccomp.
>>>>> That means that if we use RET_TRAP, then ptrace will see the
>>>>> possibly-modified syscall, if we use RET_ERRNO, then ptrace is (IMO
>>>>> correctly given the current design) showing syscall -1, and if we use
>>>>> RET_KILL, then ptrace just sees the process mysteriously die.
>>>>
>>>> Userspace is usually not prepared to see syscall -1.
>>>> For example, strace had to be patched, otherwise it just skipped such
>>>> syscalls as "not a syscall" events or did other improper things:
>>>> http://sourceforge.net/p/strace/code/ci/c3948327717c29b10b5e00a436dc138b4ab1a486
>>>> http://sourceforge.net/p/strace/code/ci/8e398b6c4020fb2d33a5b3e40271ebf63199b891
>>>>
>>>
>>> The x32 thing is a legit ABI bug, I'd argue.  I'd be happy to submit a
>>> patch to fix that (clear the x32 bit if we're not x32).
>>>
>>>> A slightly different but related story: userspace is also not prepared
>>>> to handle large errno values produced by seccomp filters like this:
>>>> BPF_STMT(BPF_RET, SECCOMP_RET_ERRNO | SECCOMP_RET_DATA)
>>>>
>>>> For example, glibc assumes that syscalls do not return errno values greater than 0xfff:
>>>> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/sysdep.h#l55
>>>> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/syscall.S#l20
>>
>> To save others the link reading: "Linus said he will make sure the no
>> syscall returns a value in -1 .. -4095 as a valid result so we can
>> savely test with -4095."
>>
>> Strictly speaking (ISO C, "man 3 errno"), errno is supposed to be a
>> full int, though digging around I find this in include/linux/err.h:
>>
>> /*
>>  * Kernel pointers have redundant information, so we can use a
>>  * scheme where we can return either an error code or a normal
>>  * pointer with the same return value.
>>  *
>>  * This should be a per-architecture thing, to allow different
>>  * error and pointer decisions.
>>  */
>> #define MAX_ERRNO       4095
>>
>> #ifndef __ASSEMBLY__
>>
>> #define IS_ERR_VALUE(x) unlikely((x) >= (unsigned long)-MAX_ERRNO)
>>
>> But no architecture overrides this.
>>
>>>> If it isn't too late, I'd recommend changing SECCOMP_RET_DATA mask
>>>> applied in SECCOMP_RET_ERRNO case from current 0xffff to 0xfff.
>>
>> I'm not opposed to this. I would want to more explicitly document the
>> 4095 max value in man pages, though.
>>
>>> I think this is solidly in the "don't do that" category.  Seccomp lets
>>> you tamper with syscalls.  If you tamper wrong, then you lose.
>>>
>>> Kees, what do you think about reversing the whole thing so that
>>> ptracers always see the original syscall?
>>
>> What do you mean by "reversing"? The interactions I see here are:
>>
>> PTRACE_SYSCALL
>> SECCOMP_RET_ERRNO
>> SECCOMP_RET_TRACE
>> SECCOMP_RET_TRAP
>>
>> Both ptrace and seccomp will trigger via _TIF_WORK_SYSCALL_ENTRY. Only
>> ptrace will trigger via _TIF_WORK_SYSCALL_EXIT.
>>
>> For SECCOMP_RET_ERRNO to work, we must skip the syscall, as mentioned earlier:
>>
>> arch/x86/kernel/entry_32.S ...
>> syscall_trace_entry:
>>         movl $-ENOSYS,PT_EAX(%esp)
>>         movl %esp, %eax
>>         call syscall_trace_enter
>>         /* What it returned is what we'll actually use.  */
>>         cmpl $(NR_syscalls), %eax
>>         jnae syscall_call
>>         jmp syscall_exit
>> END(syscall_trace_entry)
>>
>> Both before and after the 2-phase change, syscall_trace_enter would
>> return -1 if it hit SECCOMP_RET_ERRNO, before calling
>> tracehook_report_syscall_entry. On exit, if PTRACE_SYSCALL, we'd hit
>> tracehook_report_syscall_exit during syscall_trace_leave, which means
>> a ptracer will see a syscall-exit-stop without a matching
>> syscall-enter-stop.
>>
>> Using SECCOMP_RET_TRACE with PTRACE_SYSCALL in place seems totally
>> crazy, as the ptracer would need to be the same program, and if it
>> chose to skip a syscall, it would be in the same place: it would see
>> PTRACE_EVENT_SECCOMP, then no syscall-enter-stop, then a
>> syscall-exit-stop. I think we can ignore this pathological case.
>>
>> Using SECCOMP_RET_TRAP with PTRACE_SYSCALL also results in a skip,
>> which produces the same "only syscall-exit-stop seen" problem.
>>
>> In the SECCOMP_RET_ERRNO case, the syscall nr doesn't change (and
>> isn't executed). In the SECCOMP_RET_TRAP case, the syscall nr doesn't
>> change (and isn't executed). In the SECCOMP_RET_TRACE case, the syscall nr
>> _could_ change, but the ptracer would be doing it, so the crazy
>> situation around PTRACE_SYSCALL is probably safe to ignore (as long as
>> we document what is expected to happen).
>>
>> So, the question is: should PTRACE_SYSCALL see a syscall that is _not_
>> being executed (due to seccomp)? Audit doesn't see it currently, and
>> neither does ptrace. I would argue that it should continue to not see
>> the syscall. That said, if it shouldn't be shown, we also shouldn't
>> trigger syscall-exit-stop. If you can convince me it should see
>> syscall-enter-stop, then I have two questions:
>
> I think PTRACE_SYSCALL should see syscalls that are skipped due to
> seccomp.  I think that the exit event should see the modified errno,
> if any, so that strace will show whatever the traced process thinks is
> happening.
>
>>
>> 1) Do we accept that a ptracer can interfere with SECCOMP_RET_ERRNO? I
>> think we probably must, since it can already interfere via
>> syscall-exit-stop and change the errno.
>
> I think this is fine.
>
>> And especially since a ptracer
>> can change syscalls during syscall-enter-stop to any syscall it wants,
>> bypassing seccomp. This condition is already documented.
>
> If a ptracer (using PTRACE_SYSCALL) were to get the entry callback
> before seccomp, then this oddity would go away, which might be a good
> thing.  A ptracer could change the syscall, but seccomp would be based on
> what the ptracer changed the syscall to.

I want kill events to trigger immediately. I don't want to leave the
ptrace surface available on a SECCOMP_RET_KILL. So maybe it can be
seccomp phase 1, then ptrace, then seccomp phase 2? And pass more
information between phases to determine how things should behave
beyond just "skip"?
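
To make that ordering concrete, here is a rough userspace model of it.
This is only a sketch, not kernel code: the enum, the struct and
syscall_entry() are names invented for illustration, and it ignores the
case where the tracer rewrites the syscall number during the stop.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical verdicts; these only loosely mirror SECCOMP_RET_*. */
enum verdict { V_ALLOW, V_ERRNO, V_KILL };

struct entry_result {
	bool killed;       /* task died before any tracer stop */
	bool tracer_stop;  /* tracer got a syscall-enter-stop */
	bool run_syscall;  /* syscall body actually executes */
};

/*
 * Model of "seccomp phase 1, then ptrace, then seccomp phase 2".
 * Phase 1 acts on a kill verdict immediately; every other verdict is
 * carried past the tracer stop and only applied in phase 2.
 */
static struct entry_result syscall_entry(enum verdict v, bool traced)
{
	struct entry_result r = { false, false, false };

	/* Phase 1: kill right away, with no ptrace stop in between. */
	if (v == V_KILL) {
		r.killed = true;
		return r;
	}

	/* ptrace: the tracer sees the syscall even if it will be skipped. */
	r.tracer_stop = traced;

	/* Phase 2: apply the carried verdict (more than just "skip"). */
	r.run_syscall = (v == V_ALLOW);
	return r;
}
```

In this model a traced task hitting an errno-style verdict still
produces an enter stop, so the later exit stop has something to pair
with, while a kill verdict never reaches the tracer.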

>> 2) What do we do with audit? Suddenly we have ptrace seeing a syscall
>> that audit doesn't?
>
> Is this a problem?  I'd be amazed if any program uses both ptrace and
> audit -- after all, audit is a global thing, and it only has one
> implementation (AFAIK): auditd.  auditd doesn't ptrace the world.
>
>>
>> And an unrelated thought:
>>
>> 3) Can't we find some way to fix the inability of a ptracer to
>> distinguish between syscall-enter-stop and syscall-exit-stop?
>>
>
> Couldn't we add PTRACE_O_TRACESYSENTRY and PTRACE_O_TRACESYSEXIT along
> the lines of PTRACE_O_TRACESYSGOOD?

That might be a nice idea. I haven't written a test to see, but what
does PTRACE_GETEVENTMSG return on syscall-enter/exit-stop? If we can't
add something there, then yeah, adding PTRACE_O_TRACESYSENTRY and
PTRACE_O_TRACESYSEXIT with their own event msgs would be nice. Could
even add the syscall nr to the event msg so ptracers don't have to dig
around in per-arch registers, too.
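
For context, this is roughly what a tracer has to do today. The sketch
below assumes PTRACE_O_TRACESYSGOOD is already set on the tracee; the
struct and both helper names are made up for illustration.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/wait.h>

/*
 * With PTRACE_O_TRACESYSGOOD set, syscall stops are reported as
 * SIGTRAP | 0x80, which separates them from ordinary SIGTRAP stops.
 */
static bool is_syscall_stop(int status)
{
	return WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80);
}

/* Tracer-side bookkeeping: the stop itself does not say enter vs. exit. */
struct stop_tracker {
	bool in_syscall;
};

/*
 * Returns true if this stop is assumed to be a syscall-enter-stop.
 * The guess is only valid if stops strictly alternate, which is
 * exactly the assumption a suppressed enter or exit stop would break.
 */
static bool note_syscall_stop(struct stop_tracker *t)
{
	t->in_syscall = !t->in_syscall;
	return t->in_syscall;
}
```

The toggle is the fragile part: one unpaired stop and every later
guess is inverted, which is why a kernel-provided enter/exit indication
would help.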

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
  2015-02-06 19:23                         ` Kees Cook
  (?)
@ 2015-02-06 20:11                           ` H. Peter Anvin
  -1 siblings, 0 replies; 82+ messages in thread
From: H. Peter Anvin @ 2015-02-06 20:11 UTC (permalink / raw)
  To: Kees Cook, Andy Lutomirski
  Cc: Dmitry V. Levin, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, Frederic Weisbecker,
	Michael Kerrisk-manpages

On 02/06/2015 11:23 AM, Kees Cook wrote:
> 
> Strictly speaking (ISO C, "man 3 errno"), errno is supposed to be a
> full int, though digging around I find this in include/linux/err.h:
> 

That doesn't mean the kernel has to support them.

> /*
>  * Kernel pointers have redundant information, so we can use a
>  * scheme where we can return either an error code or a normal
>  * pointer with the same return value.
>  *
>  * This should be a per-architecture thing, to allow different
>  * error and pointer decisions.
>  */
> #define MAX_ERRNO       4095
> 
> #ifndef __ASSEMBLY__
> 
> #define IS_ERR_VALUE(x) unlikely((x) >= (unsigned long)-MAX_ERRNO)
> 
> But no architecture overrides this.
> 

We used to have a much lower value, that was per-architecture, in order
to optimize the resulting assembly (e.g. 8-bit immediates on x86).  This
didn't work as the number of errnos increased.  The other motivation was
probably binary compatibility with other Unices, which was an idea for a
while.
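
For anyone following along, the range check in the quoted macro can be
tried in userspace. This is a sketch only: is_err_value() is a local
stand-in for IS_ERR_VALUE() with the unlikely() hint dropped.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ERRNO 4095

/*
 * A return value counts as an error if, viewed as unsigned, it falls
 * in the top MAX_ERRNO values, i.e. it is in -4095..-1 as signed.
 */
static bool is_err_value(unsigned long x)
{
	return x >= (unsigned long)-MAX_ERRNO;
}
```

Anything below -4095, such as -65535, fails this test and so would be
read as an ordinary (successful) return value by code that relies on
the same range.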

	-hpa


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
  2015-02-06 20:07                             ` Kees Cook
  (?)
@ 2015-02-06 20:12                               ` Andy Lutomirski
  -1 siblings, 0 replies; 82+ messages in thread
From: Andy Lutomirski @ 2015-02-06 20:12 UTC (permalink / raw)
  To: Kees Cook
  Cc: Dmitry V. Levin, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker, Michael Kerrisk-manpages

On Fri, Feb 6, 2015 at 12:07 PM, Kees Cook <keescook@chromium.org> wrote:
> On Fri, Feb 6, 2015 at 11:32 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Fri, Feb 6, 2015 at 11:23 AM, Kees Cook <keescook@chromium.org> wrote:
>>> And especially since a ptracer
>>> can change syscalls during syscall-enter-stop to any syscall it wants,
>>> bypassing seccomp. This condition is already documented.
>>
>> If a ptracer (using PTRACE_SYSCALL) were to get the entry callback
>> before seccomp, then this oddity would go away, which might be a good
>> thing.  A ptracer could change the syscall, but seccomp would be based on
>> what the ptracer changed the syscall to.
>
> I want kill events to trigger immediately. I don't want to leave the
> ptrace surface available on a SECCOMP_RET_KILL. So maybe it can be
> seccomp phase 1, then ptrace, then seccomp phase 2? And pass more
> information between phases to determine how things should behave
> beyond just "skip"?

I thought so too, originally, but I'm far less convinced now, for two reasons:

1. I think that a lot of filters these days use RET_ERRNO heavily, so
this won't benefit them.

2. I'm not convinced it really reduces the attack surface for anyone.
Unless your filter is literally "return SECCOMP_RET_KILL", then the
seccomp-filtered task can always cause the ptracer to get a pair of
syscall notifications.  Also, the task can send itself signals (using
page faults, breakpoints, etc) and cause ptrace events via other
paths.

--Andy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
  2015-02-06 20:12                               ` Andy Lutomirski
  (?)
@ 2015-02-06 20:16                                 ` Kees Cook
  -1 siblings, 0 replies; 82+ messages in thread
From: Kees Cook @ 2015-02-06 20:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dmitry V. Levin, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker, Michael Kerrisk-manpages

On Fri, Feb 6, 2015 at 12:12 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Feb 6, 2015 at 12:07 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Fri, Feb 6, 2015 at 11:32 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Fri, Feb 6, 2015 at 11:23 AM, Kees Cook <keescook@chromium.org> wrote:
>>>> And especially since a ptracer
>>>> can change syscalls during syscall-enter-stop to any syscall it wants,
>>>> bypassing seccomp. This condition is already documented.
>>>
>>> If a ptracer (using PTRACE_SYSCALL) were to get the entry callback
>>> before seccomp, then this oddity would go away, which might be a good
>>> thing.  A ptracer could change the syscall, but seccomp would be based on
>>> what the ptracer changed the syscall to.
>>
>> I want kill events to trigger immediately. I don't want to leave the
>> ptrace surface available on a SECCOMP_RET_KILL. So maybe it can be
>> seccomp phase 1, then ptrace, then seccomp phase 2? And pass more
>> information between phases to determine how things should behave
>> beyond just "skip"?
>
> I thought so too, originally, but I'm far less convinced now, for two reasons:
>
> 1. I think that a lot of filters these days use RET_ERRNO heavily, so
> this won't benefit them.
>
> 2. I'm not convinced it really reduces the attack surface for anyone.
> Unless your filter is literally "return SECCOMP_RET_KILL", then the
> seccomp-filtered task can always cause the ptracer to get a pair of
> syscall notifications.  Also, the task can send itself signals (using
> page faults, breakpoints, etc) and cause ptrace events via other
> paths.

What are you thinking for a solution?

As for capping SECCOMP_RET_ERRNO to MAX_ERRNO, how about this (sorry
if gmail butchers the paste):

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 4ef9687ac115..c88148d20bd5 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -629,7 +629,9 @@ static u32 __seccomp_phase1_filter(int this_syscall, struct

        switch (action) {
        case SECCOMP_RET_ERRNO:
-               /* Set the low-order 16-bits as a errno. */
+               /* Set the low-order bits as a errno. */
+               if (data > MAX_ERRNO)
+                       data = MAX_ERRNO;
                syscall_set_return_value(current, task_pt_regs(current),
                                         -data, 0);
                goto skip;
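
The effect of the clamp can be checked outside the kernel. The sketch
below is a userspace approximation, not the seccomp code itself:
seccomp_errno_return() is a hypothetical helper, and the two constants
are copied in by hand rather than taken from kernel headers.

```c
#include <assert.h>
#include <stdint.h>

#define SECCOMP_RET_DATA 0x0000ffffU  /* data mask, as in <linux/seccomp.h> */
#define MAX_ERRNO        4095         /* as in include/linux/err.h */

/*
 * Given a raw filter return value whose action is SECCOMP_RET_ERRNO,
 * compute the value the skipped syscall would return with the clamp
 * applied.
 */
static long seccomp_errno_return(uint32_t filter_ret)
{
	uint32_t data = filter_ret & SECCOMP_RET_DATA;

	/* Cap at MAX_ERRNO so the result stays in the -4095..-1 range. */
	if (data > MAX_ERRNO)
		data = MAX_ERRNO;

	return -(long)data;
}
```

With this, a filter returning SECCOMP_RET_ERRNO | SECCOMP_RET_DATA ends
up as -4095 rather than -65535.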


-- 
Kees Cook
Chrome OS Security

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases
  2015-02-06 20:16                                 ` Kees Cook
  (?)
@ 2015-02-06 20:20                                   ` Andy Lutomirski
  -1 siblings, 0 replies; 82+ messages in thread
From: Andy Lutomirski @ 2015-02-06 20:20 UTC (permalink / raw)
  To: Kees Cook
  Cc: Dmitry V. Levin, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker, Michael Kerrisk-manpages

On Fri, Feb 6, 2015 at 12:16 PM, Kees Cook <keescook@chromium.org> wrote:
> On Fri, Feb 6, 2015 at 12:12 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Fri, Feb 6, 2015 at 12:07 PM, Kees Cook <keescook@chromium.org> wrote:
>>> On Fri, Feb 6, 2015 at 11:32 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Fri, Feb 6, 2015 at 11:23 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>> And especially since a ptracer
>>>>> can change syscalls during syscall-enter-stop to any syscall it wants,
>>>>> bypassing seccomp. This condition is already documented.
>>>>
>>>> If a ptracer (using PTRACE_SYSCALL) were to get the entry callback
>>>> before seccomp, then this oddity would go away, which might be a good
>>>> thing.  A ptracer could change the syscall, but seccomp would be based on
>>>> what the ptracer changed the syscall to.
>>>
>>> I want kill events to trigger immediately. I don't want to leave the
>>> ptrace surface available on a SECCOMP_RET_KILL. So maybe it can be
>>> seccomp phase 1, then ptrace, then seccomp phase 2? And pass more
>>> information between phases to determine how things should behave
>>> beyond just "skip"?
>>
>> I thought so too, originally, but I'm far less convinced now, for two reasons:
>>
>> 1. I think that a lot of filters these days use RET_ERRNO heavily, so
>> this won't benefit them.
>>
>> 2. I'm not convinced it really reduces the attack surface for anyone.
>> Unless your filter is literally "return SECCOMP_RET_KILL", then the
>> seccomp-filtered task can always cause the ptracer to get a pair of
>> syscall notifications.  Also, the task can send itself signals (using
>> page faults, breakpoints, etc) and cause ptrace events via other
>> paths.
>
> What are you thinking for a solution?
>

I'm writing a patch now.  It's an ABI break, but this thread seems to
show that the ABI was somewhat useless before the split-phase changes,
and it's differently broken now, so I would be surprised if the change
broke anything that was currently working.  I'll send it later today,
hopefully.

> As for capping SECCOMP_RET_ERRNO to MAX_ERRNO, how about this (sorry
> if gmail butchers the paste):
>
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 4ef9687ac115..c88148d20bd5 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -629,7 +629,9 @@ static u32 __seccomp_phase1_filter(int this_syscall, struct
>
>         switch (action) {
>         case SECCOMP_RET_ERRNO:
> -               /* Set the low-order 16-bits as a errno. */
> +               /* Set the low-order bits as a errno. */
> +               if (data > MAX_ERRNO)
> +                       data = MAX_ERRNO;
>                 syscall_set_return_value(current, task_pt_regs(current),
>                                          -data, 0);
>                 goto skip;
>

I'm fine with this, but I'm not entirely convinced it solves a
problem.  SECCOMP_RET_ERRNO | 5000 didn't work before, and it doesn't
work now.  Admittedly, the new failure mode is possibly better.
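For illustration, the effect of the proposed cap can be modelled in user
space.  This is only a sketch, not the kernel code: the helper name
seccomp_errno_return() and the SKETCH_* constants are made up for this
example, with values copied from SECCOMP_RET_DATA and MAX_ERRNO in the
kernel headers.

```c
#include <stdint.h>

/* Illustrative copies of kernel constants (hypothetical names). */
#define SKETCH_SECCOMP_RET_DATA 0x0000ffffU /* value of SECCOMP_RET_DATA */
#define SKETCH_MAX_ERRNO        4095U       /* value of MAX_ERRNO */

/*
 * Model of the value a SECCOMP_RET_ERRNO filter result would place in the
 * syscall return register once the cap is applied: take the low 16 data
 * bits, clamp them to MAX_ERRNO, and negate.
 */
static long seccomp_errno_return(uint32_t filter_ret)
{
	uint32_t data = filter_ret & SKETCH_SECCOMP_RET_DATA;

	if (data > SKETCH_MAX_ERRNO)
		data = SKETCH_MAX_ERRNO;
	return -(long)data;
}
```

With the cap, SECCOMP_RET_ERRNO | 5000 yields -4095.  Without it the result
would be -5000, which lies outside the [-MAX_ERRNO, -1] range that
user-space syscall wrappers conventionally treat as an error.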

--Andy

* Re: a method to distinguish between syscall-enter/exit-stop
  2015-02-06 20:07                             ` Kees Cook
  (?)
@ 2015-02-06 23:17                               ` Dmitry V. Levin
  -1 siblings, 0 replies; 82+ messages in thread
From: Dmitry V. Levin @ 2015-02-06 23:17 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker, Michael Kerrisk-manpages

On Fri, Feb 06, 2015 at 12:07:03PM -0800, Kees Cook wrote:
> On Fri, Feb 6, 2015 at 11:32 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Fri, Feb 6, 2015 at 11:23 AM, Kees Cook <keescook@chromium.org> wrote:
[...]
> >> And an unrelated thought:
> >>
> >> 3) Can't we find some way to fix the inability of a ptracer to
> >> distinguish between syscall-enter-stop and syscall-exit-stop?
> >
> > Couldn't we add PTRACE_O_TRACESYSENTRY and PTRACE_O_TRACESYSEXIT along
> > the lines of PTRACE_O_TRACESYSGOOD?
> 
> That might be a nice idea. I haven't written a test to see, but what
> does PTRACE_GETEVENTMSG return on syscall-enter/exit-stop?

The value returned by PTRACE_GETEVENTMSG is the value set along with the
latest PTRACE_EVENT_*.
In the case of a syscall-enter/exit-stop (which is not a PTRACE_EVENT_*),
there is no particular value set for PTRACE_GETEVENTMSG.
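To make the distinction concrete, here is a minimal sketch of how a tracer
can tell from the waitpid() status whether a stop is a PTRACE_EVENT_* stop
at all, and hence whether PTRACE_GETEVENTMSG has a meaningful value to
return.  The helper name ptrace_event_of() is invented for this example; the
status layout (event code in bits 16 and up, SIGTRAP as the stop signal) is
the one described in ptrace(2).

```c
#include <signal.h>
#include <sys/wait.h>

/*
 * Return the PTRACE_EVENT_* code carried by a waitpid() status, or 0 if
 * the stop is not a PTRACE_EVENT stop.  Syscall-stops end up in the 0
 * case: with PTRACE_O_TRACESYSGOOD the stop signal is SIGTRAP | 0x80, and
 * without it the upper bits of the status are simply zero.
 */
static int ptrace_event_of(int status)
{
	if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
		return 0;
	return (status >> 16) & 0xffff;
}
```

A tracer would call PTRACE_GETEVENTMSG only when this returns a non-zero
event such as PTRACE_EVENT_SECCOMP; for a syscall-stop there is nothing new
to fetch.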


-- 
ldv

* Re: a method to distinguish between syscall-enter/exit-stop
  2015-02-06 23:17                               ` Dmitry V. Levin
  (?)
@ 2015-02-07  1:07                                 ` Kees Cook
  -1 siblings, 0 replies; 82+ messages in thread
From: Kees Cook @ 2015-02-07  1:07 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Andy Lutomirski, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker, Michael Kerrisk-manpages

On Fri, Feb 6, 2015 at 3:17 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
> On Fri, Feb 06, 2015 at 12:07:03PM -0800, Kees Cook wrote:
>> On Fri, Feb 6, 2015 at 11:32 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> > On Fri, Feb 6, 2015 at 11:23 AM, Kees Cook <keescook@chromium.org> wrote:
> [...]
>> >> And an unrelated thought:
>> >>
>> >> 3) Can't we find some way to fix the inability of a ptracer to
>> >> distinguish between syscall-enter-stop and syscall-exit-stop?
>> >
>> > Couldn't we add PTRACE_O_TRACESYSENTRY and PTRACE_O_TRACESYSEXIT along
>> > the lines of PTRACE_O_TRACESYSGOOD?
>>
>> That might be a nice idea. I haven't written a test to see, but what
>> does PTRACE_GETEVENTMSG return on syscall-enter/exit-stop?
>
> The value returned by PTRACE_GETEVENTMSG is the value set along with the
> latest PTRACE_EVENT_*.
> In case of syscall-enter/exit-stop (which is not a PTRACE_EVENT_*),
> there is no particular value set for PTRACE_GETEVENTMSG.

Could we define one to help distinguish?

-Kees

-- 
Kees Cook
Chrome OS Security

* Re: a method to distinguish between syscall-enter/exit-stop
  2015-02-07  1:07                                 ` Kees Cook
  (?)
@ 2015-02-07  3:04                                   ` Dmitry V. Levin
  -1 siblings, 0 replies; 82+ messages in thread
From: Dmitry V. Levin @ 2015-02-07  3:04 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov, H. Peter Anvin,
	Frederic Weisbecker, Michael Kerrisk-manpages

On Fri, Feb 06, 2015 at 05:07:41PM -0800, Kees Cook wrote:
> On Fri, Feb 6, 2015 at 3:17 PM, Dmitry V. Levin <ldv@altlinux.org> wrote:
> > On Fri, Feb 06, 2015 at 12:07:03PM -0800, Kees Cook wrote:
> >> On Fri, Feb 6, 2015 at 11:32 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> > On Fri, Feb 6, 2015 at 11:23 AM, Kees Cook <keescook@chromium.org> wrote:
> > [...]
> >> >> And an unrelated thought:
> >> >>
> >> >> 3) Can't we find some way to fix the inability of a ptracer to
> >> >> distinguish between syscall-enter-stop and syscall-exit-stop?
> >> >
> >> > Couldn't we add PTRACE_O_TRACESYSENTRY and PTRACE_O_TRACESYSEXIT along
> >> > the lines of PTRACE_O_TRACESYSGOOD?
> >>
> >> That might be a nice idea. I haven't written a test to see, but what
> >> does PTRACE_GETEVENTMSG return on syscall-enter/exit-stop?
> >
> > The value returned by PTRACE_GETEVENTMSG is the value set along with the
> > latest PTRACE_EVENT_*.
> > In case of syscall-enter/exit-stop (which is not a PTRACE_EVENT_*),
> > there is no particular value set for PTRACE_GETEVENTMSG.
> 
> Could we define one to help distinguish?

I suppose we could define one, but performing an extra PTRACE_GETEVENTMSG
for every syscall-stop may be too expensive.

For example, strace makes about 4.5 syscalls per syscall-stop.
The minimum is 4 syscalls: wait4, PTRACE_GETREGSET, write, and PTRACE_SYSCALL;
processing some syscall-stops may require additional process_vm_readv calls.

That is, forcing strace to make an extra PTRACE_GETEVENTMSG call per
syscall-stop would result in about 20% more syscalls per syscall-stop,
which is a noticeable cost.
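The arithmetic behind that estimate, as a small sketch (the function name is
made up for this example): one extra call on top of an average of 4.5 is
1/4.5, or roughly 22%, and on top of the 4-call minimum it is 25%.

```c
/*
 * Relative increase in tracer-side syscalls when `extra` calls are added
 * to a baseline of `base` calls per syscall-stop.
 */
static double extra_syscall_overhead(double base, double extra)
{
	return extra / base;
}
```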

A better alternative is to define an event that wouldn't require this
extra PTRACE_GETEVENTMSG per syscall-stop.  For example, it could be
PTRACE_EVENT_SYSCALL_ENTRY and/or PTRACE_EVENT_SYSCALL_EXIT.  In practice,
adding just one of these two events would be enough to distinguish the two
kinds of syscall-stops.  Adding two events would look less surprising,
though.

If the decision is to add both events, I'd recommend adding just one
new option to cover both events, since there is room for only 32
different PTRACE_O_* options.
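For comparison, without such an event a tracer has to keep the enter/exit
state itself.  A minimal sketch of that bookkeeping follows, assuming
PTRACE_O_TRACESYSGOOD is set (so syscall-stops report SIGTRAP | 0x80), that
the tracer started tracing while the tracee was outside a syscall, and that
it observes every stop.  struct tracee_state and classify_stop() are names
invented for this illustration.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/wait.h>

enum stop_kind { STOP_OTHER, STOP_SYSCALL_ENTER, STOP_SYSCALL_EXIT };

/* Per-tracee bookkeeping kept by the tracer (hypothetical type). */
struct tracee_state {
	bool in_syscall;	/* true between enter-stop and exit-stop */
};

/*
 * Classify a waitpid() status.  Syscall-stops are recognised by the
 * SIGTRAP | 0x80 stop signal; whether one is an enter or an exit is
 * inferred purely by alternation, which is the fragile part.
 */
static enum stop_kind classify_stop(struct tracee_state *t, int status)
{
	if (!WIFSTOPPED(status) || WSTOPSIG(status) != (SIGTRAP | 0x80))
		return STOP_OTHER;

	t->in_syscall = !t->in_syscall;
	return t->in_syscall ? STOP_SYSCALL_ENTER : STOP_SYSCALL_EXIT;
}
```

The alternation goes wrong as soon as the tracer's idea of the state gets
out of step with the tracee, which is the gap a dedicated event would close.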

-- 
ldv

end of thread, other threads:[~2015-02-07  3:04 UTC | newest]

Thread overview: 82+ messages
2014-09-05 22:13 [PATCH v5 0/5] x86: two-phase syscall tracing and seccomp fastpath Andy Lutomirski
2014-09-05 22:13 ` Andy Lutomirski
2014-09-05 22:13 ` [PATCH v5 1/5] x86,x32,audit: Fix x32's AUDIT_ARCH wrt audit Andy Lutomirski
2014-09-05 22:13   ` Andy Lutomirski
2014-09-09  2:43   ` [tip:x86/seccomp] x86, x32, audit: " tip-bot for Andy Lutomirski
2014-09-05 22:13 ` [PATCH v5 2/5] x86,entry: Only call user_exit if TIF_NOHZ Andy Lutomirski
2014-09-05 22:13   ` Andy Lutomirski
2014-09-09  2:43   ` [tip:x86/seccomp] x86, entry: " tip-bot for Andy Lutomirski
2014-09-05 22:13 ` [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases Andy Lutomirski
2014-09-05 22:13   ` Andy Lutomirski
2014-09-09  2:44   ` [tip:x86/seccomp] " tip-bot for Andy Lutomirski
2015-02-05 21:19   ` [PATCH v5 3/5] " Dmitry V. Levin
2015-02-05 21:19     ` Dmitry V. Levin
2015-02-05 21:27     ` Kees Cook
2015-02-05 21:27       ` Kees Cook
2015-02-05 21:27       ` Kees Cook
2015-02-05 21:40       ` Dmitry V. Levin
2015-02-05 21:40         ` Dmitry V. Levin
2015-02-05 21:40         ` Dmitry V. Levin
2015-02-05 21:52         ` Andy Lutomirski
2015-02-05 21:52           ` Andy Lutomirski
2015-02-05 21:52           ` Andy Lutomirski
2015-02-05 23:12           ` Kees Cook
2015-02-05 23:12             ` Kees Cook
2015-02-05 23:12             ` Kees Cook
2015-02-05 23:39             ` Dmitry V. Levin
2015-02-05 23:39               ` Dmitry V. Levin
2015-02-05 23:39               ` Dmitry V. Levin
2015-02-05 23:49               ` Kees Cook
2015-02-05 23:49                 ` Kees Cook
2015-02-05 23:49                 ` Kees Cook
2015-02-06  0:09                 ` Andy Lutomirski
2015-02-06  0:09                   ` Andy Lutomirski
2015-02-06  0:09                   ` Andy Lutomirski
2015-02-06  2:32                   ` Dmitry V. Levin
2015-02-06  2:32                     ` Dmitry V. Levin
2015-02-06  2:32                     ` Dmitry V. Levin
2015-02-06  2:38                     ` Andy Lutomirski
2015-02-06  2:38                       ` Andy Lutomirski
2015-02-06  2:38                       ` Andy Lutomirski
2015-02-06 19:23                       ` Kees Cook
2015-02-06 19:23                         ` Kees Cook
2015-02-06 19:23                         ` Kees Cook
2015-02-06 19:32                         ` Andy Lutomirski
2015-02-06 19:32                           ` Andy Lutomirski
2015-02-06 19:32                           ` Andy Lutomirski
2015-02-06 20:07                           ` Kees Cook
2015-02-06 20:07                             ` Kees Cook
2015-02-06 20:07                             ` Kees Cook
2015-02-06 20:12                             ` Andy Lutomirski
2015-02-06 20:12                               ` Andy Lutomirski
2015-02-06 20:12                               ` Andy Lutomirski
2015-02-06 20:16                               ` Kees Cook
2015-02-06 20:16                                 ` Kees Cook
2015-02-06 20:16                                 ` Kees Cook
2015-02-06 20:20                                 ` Andy Lutomirski
2015-02-06 20:20                                   ` Andy Lutomirski
2015-02-06 20:20                                   ` Andy Lutomirski
2015-02-06 23:17                             ` a method to distinguish between syscall-enter/exit-stop Dmitry V. Levin
2015-02-06 23:17                               ` Dmitry V. Levin
2015-02-06 23:17                               ` Dmitry V. Levin
2015-02-07  1:07                               ` Kees Cook
2015-02-07  1:07                                 ` Kees Cook
2015-02-07  1:07                                 ` Kees Cook
2015-02-07  3:04                                 ` Dmitry V. Levin
2015-02-07  3:04                                   ` Dmitry V. Levin
2015-02-07  3:04                                   ` Dmitry V. Levin
2015-02-06 20:11                         ` [PATCH v5 3/5] x86: Split syscall_trace_enter into two phases H. Peter Anvin
2015-02-06 20:11                           ` H. Peter Anvin
2015-02-06 20:11                           ` H. Peter Anvin
2014-09-05 22:13 ` [PATCH v5 4/5] x86_64,entry: Treat regs->ax the same in fastpath and slowpath syscalls Andy Lutomirski
2014-09-05 22:13   ` [PATCH v5 4/5] x86_64, entry: " Andy Lutomirski
2014-09-09  2:44   ` [tip:x86/seccomp] x86_64, entry: Treat regs-> ax " tip-bot for Andy Lutomirski
2014-09-05 22:13 ` [PATCH v5 5/5] x86_64,entry: Use split-phase syscall_trace_enter for 64-bit syscalls Andy Lutomirski
2014-09-05 22:13   ` [PATCH v5 5/5] x86_64, entry: " Andy Lutomirski
2014-09-09  2:44   ` [tip:x86/seccomp] " tip-bot for Andy Lutomirski
2014-09-08 19:29 ` [PATCH v5 0/5] x86: two-phase syscall tracing and seccomp fastpath Kees Cook
2014-09-08 19:29   ` Kees Cook
2014-09-08 19:29   ` Kees Cook
2014-09-08 19:49   ` H. Peter Anvin
2014-09-08 19:49     ` H. Peter Anvin
2014-09-08 19:49     ` H. Peter Anvin