All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 3.5 0/2] seccomp and vsyscall fixes
@ 2012-07-17 23:19 Andy Lutomirski
  2012-07-17 23:19 ` [PATCH 3.5 1/2] seccomp: Make syscall skipping and nr changes more consistent Andy Lutomirski
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Andy Lutomirski @ 2012-07-17 23:19 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel; +Cc: Will Drewry, Kees Cook, Andy Lutomirski

Apologies for the lateness of this stuff.  I was at a conference last
week when the Chrome issue was discovered and I couldn't do this
properly until I got back.

Will, can you confirm that this version is okay and passes your tests?
It passes mine.

While there are no known seccomp users that will have trouble,
SECCOMP_RET_TRAP and SECCOMP_RET_TRACE currently interact oddly with
emulated vsyscalls.  This might lead to ABI issues down the road (if
something starts to rely on current behavior) or unexpected malfunctions
(if something tries to change, say, sys_gettimeofday, into a different
syscall and gets completely bogus results on a vsyscall-using distro.

It's unlikely that fixing this later will cause issues, but it would be
nice to nail down and document the vsyscall quirks for the first
released kernel with seccomp mode 2 support.

(Patch 2/2 is very much optional.  It fixes a strange corner case.  It
 ought to be fine for 3.6, since I very much doubt that any real code
 will hit that corner case and cause ABI problems.)

Andy Lutomirski (2):
  seccomp: Make syscall skipping and nr changes more consistent
  seccomp: Future-proof against silly tracers

 Documentation/prctl/seccomp_filter.txt |   74 ++++++++++++++++++++--
 arch/x86/include/asm/syscall.h         |   11 +++
 arch/x86/kernel/vsyscall_64.c          |  110 +++++++++++++++++---------------
 kernel/seccomp.c                       |   28 +++++++-
 4 files changed, 163 insertions(+), 60 deletions(-)

-- 
1.7.7.6


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 3.5 1/2] seccomp: Make syscall skipping and nr changes more consistent
  2012-07-17 23:19 [PATCH 3.5 0/2] seccomp and vsyscall fixes Andy Lutomirski
@ 2012-07-17 23:19 ` Andy Lutomirski
  2012-07-18 18:31   ` Will Drewry
  2012-07-26 15:43   ` Andy Lutomirski
  2012-07-17 23:19 ` [PATCH 3.5 2/2] seccomp: Future-proof against silly tracers Andy Lutomirski
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 16+ messages in thread
From: Andy Lutomirski @ 2012-07-17 23:19 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel; +Cc: Will Drewry, Kees Cook, Andy Lutomirski

This fixes two issues that could cause incompatibility between
kernel versions:

 - If a tracer uses SECCOMP_RET_TRACE to select a syscall number
   higher than the largest known syscall, emulate the unknown
   vsyscall by returning -ENOSYS.  (This is unlikely to make a
   noticeable difference on x86-64 due to the way the system call
   entry works.)

 - On x86-64 with vsyscall=emulate, skipped vsyscalls were buggy.

This updates the documentation accordingly.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
---
 Documentation/prctl/seccomp_filter.txt |   74 ++++++++++++++++++++--
 arch/x86/kernel/vsyscall_64.c          |  110 +++++++++++++++++---------------
 kernel/seccomp.c                       |   13 +++-
 3 files changed, 137 insertions(+), 60 deletions(-)

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
index 597c3c5..1e469ef 100644
--- a/Documentation/prctl/seccomp_filter.txt
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -95,12 +95,15 @@ SECCOMP_RET_KILL:
 
 SECCOMP_RET_TRAP:
 	Results in the kernel sending a SIGSYS signal to the triggering
-	task without executing the system call.  The kernel will
-	rollback the register state to just before the system call
-	entry such that a signal handler in the task will be able to
-	inspect the ucontext_t->uc_mcontext registers and emulate
-	system call success or failure upon return from the signal
-	handler.
+	task without executing the system call.  siginfo->si_call_addr
+	will show the address of the system call instruction, and
+	siginfo->si_syscall and siginfo->si_arch will indicate which
+	syscall was attempted.  The program counter will be as though
+	the syscall happened (i.e. it will not point to the syscall
+	instruction).  The return value register will contain an arch-
+	dependent value -- if resuming execution, set it to something
+	sensible.  (The architecture dependency is because replacing
+	it with -ENOSYS could overwrite some useful information.)
 
 	The SECCOMP_RET_DATA portion of the return value will be passed
 	as si_errno.
@@ -123,6 +126,18 @@ SECCOMP_RET_TRACE:
 	the BPF program return value will be available to the tracer
 	via PTRACE_GETEVENTMSG.
 
+	The tracer can skip the system call by changing the syscall number
+	to -1.  Alternatively, the tracer can change the system call
+	requested by changing the system call to a valid syscall number.  If
+	the tracer asks to skip the system call, then the system call will
+	appear to return the value that the tracer puts in the return value
+	register.
+
+	The seccomp check will not be run again after the tracer is
+	notified.  (This means that seccomp-based sandboxes MUST NOT
+	allow use of ptrace, even of other sandboxed processes, without
+	extreme care; ptracers can use this mechanism to escape.)
+
 SECCOMP_RET_ALLOW:
 	Results in the system call being executed.
 
@@ -161,3 +176,50 @@ architecture supports both ptrace_event and seccomp, it will be able to
 support seccomp filter with minor fixup: SIGSYS support and seccomp return
 value checking.  Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER
 to its arch-specific Kconfig.
+
+
+
+Caveats
+-------
+
+The vDSO can cause some system calls to run entirely in userspace,
+leading to surprises when you run programs on different machines that
+fall back to real syscalls.  To minimize these surprises on x86, make
+sure you test with
+/sys/devices/system/clocksource/clocksource0/current_clocksource set to
+something like acpi_pm.
+
+On x86-64, vsyscall emulation is enabled by default.  (vsyscalls are
+legacy variants on vDSO calls.)  Currently, emulated vsyscalls will honor seccomp, with a few oddities:
+
+- A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to
+  the vsyscall entry for the given call and not the address after the
+  'syscall' instruction.  Any code which wants to restart the call
+  should be aware that (a) a ret instruction has been emulated and (b)
+  trying to resume the syscall will again trigger the standard vsyscall
+  emulation security checks, making resuming the syscall mostly
+  pointless.
+
+- A return value of SECCOMP_RET_TRACE will signal the tracer as usual,
+  but the syscall may not be changed to another system call using the
+  orig_rax register. It may only be changed to -1 order to skip the
+  currently emulated call. Any other change MAY terminate the process.
+  The rip value seen by the tracer will be the syscall entry address;
+  this is different from normal behavior.  The tracer MUST NOT modify
+  rip or rsp.  (Do not rely on other changes terminating the process.
+  They might work.  For example, on some kernels, choosing a syscall
+  that only exists in future kernels will be correctly emulated (by
+  returning -ENOSYS).
+
+To detect this quirky behavior, check for addr & ~0x0C00 ==
+0xFFFFFFFFFF600000.  (For SECCOMP_RET_TRACE, use rip.  For
+SECCOMP_RET_TRAP, use siginfo->si_call_addr.)  Do not check any other
+condition: future kernels may improve vsyscall emulation and current
+kernels in vsyscall=native mode will behave differently, but the
+instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these
+cases.
+
+Note that modern systems are unlikely to use vsyscalls at all -- they
+are a legacy feature and they are considerably slower than standard
+syscalls.  New code will use the vDSO, and vDSO-issued system calls
+are indistinguishable from normal system calls.
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 5db36ca..44a3a2e 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -139,19 +139,6 @@ static int addr_to_vsyscall_nr(unsigned long addr)
 	return nr;
 }
 
-#ifdef CONFIG_SECCOMP
-static int vsyscall_seccomp(struct task_struct *tsk, int syscall_nr)
-{
-	if (!seccomp_mode(&tsk->seccomp))
-		return 0;
-	task_pt_regs(tsk)->orig_ax = syscall_nr;
-	task_pt_regs(tsk)->ax = syscall_nr;
-	return __secure_computing(syscall_nr);
-}
-#else
-#define vsyscall_seccomp(_tsk, _nr) 0
-#endif
-
 static bool write_ok_or_segv(unsigned long ptr, size_t size)
 {
 	/*
@@ -184,10 +171,9 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
 {
 	struct task_struct *tsk;
 	unsigned long caller;
-	int vsyscall_nr;
+	int vsyscall_nr, syscall_nr, tmp;
 	int prev_sig_on_uaccess_error;
 	long ret;
-	int skip;
 
 	/*
 	 * No point in checking CS -- the only way to get here is a user mode
@@ -219,56 +205,84 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
 	}
 
 	tsk = current;
-	/*
-	 * With a real vsyscall, page faults cause SIGSEGV.  We want to
-	 * preserve that behavior to make writing exploits harder.
-	 */
-	prev_sig_on_uaccess_error = current_thread_info()->sig_on_uaccess_error;
-	current_thread_info()->sig_on_uaccess_error = 1;
 
 	/*
+	 * Check for access_ok violations and find the syscall nr.
+	 *
 	 * NULL is a valid user pointer (in the access_ok sense) on 32-bit and
 	 * 64-bit, so we don't need to special-case it here.  For all the
 	 * vsyscalls, NULL means "don't write anything" not "write it at
 	 * address 0".
 	 */
-	ret = -EFAULT;
-	skip = 0;
 	switch (vsyscall_nr) {
 	case 0:
-		skip = vsyscall_seccomp(tsk, __NR_gettimeofday);
-		if (skip)
-			break;
-
 		if (!write_ok_or_segv(regs->di, sizeof(struct timeval)) ||
-		    !write_ok_or_segv(regs->si, sizeof(struct timezone)))
-			break;
+		    !write_ok_or_segv(regs->si, sizeof(struct timezone))) {
+			ret = -EFAULT;
+			goto check_fault;
+		}
+
+		syscall_nr = __NR_gettimeofday;
+		break;
+
+	case 1:
+		if (!write_ok_or_segv(regs->di, sizeof(time_t))) {
+			ret = -EFAULT;
+			goto check_fault;
+		}
+
+		syscall_nr = __NR_time;
+		break;
+
+	case 2:
+		if (!write_ok_or_segv(regs->di, sizeof(unsigned)) ||
+		    !write_ok_or_segv(regs->si, sizeof(unsigned))) {
+			ret = -EFAULT;
+			goto check_fault;
+		}
+
+		syscall_nr = __NR_getcpu;
+		break;
+	}
+
+	/*
+	 * Handle seccomp.  regs->ip must be the original value.
+	 * See seccomp_send_sigsys and Documentation/prctl/seccomp_filter.txt.
+	 *
+	 * We could optimize the seccomp disabled case, but performance
+	 * here doesn't matter.
+	 */
+	regs->orig_ax = syscall_nr;
+	regs->ax = -ENOSYS;
+	tmp = secure_computing(syscall_nr);
+	if ((regs->orig_ax != syscall_nr && !tmp) || regs->ip != address) {
+		warn_bad_vsyscall(KERN_DEBUG, regs,
+				  "seccomp tried to change syscall nr or ip");
+		do_exit(SIGSYS);
+	}
+	if (tmp)
+		goto do_ret;  /* skip requested */
 
+	/*
+	 * With a real vsyscall, page faults cause SIGSEGV.  We want to
+	 * preserve that behavior to make writing exploits harder.
+	 */
+	prev_sig_on_uaccess_error = current_thread_info()->sig_on_uaccess_error;
+	current_thread_info()->sig_on_uaccess_error = 1;
+
+	ret = -EFAULT;
+	switch (vsyscall_nr) {
+	case 0:
 		ret = sys_gettimeofday(
 			(struct timeval __user *)regs->di,
 			(struct timezone __user *)regs->si);
 		break;
 
 	case 1:
-		skip = vsyscall_seccomp(tsk, __NR_time);
-		if (skip)
-			break;
-
-		if (!write_ok_or_segv(regs->di, sizeof(time_t)))
-			break;
-
 		ret = sys_time((time_t __user *)regs->di);
 		break;
 
 	case 2:
-		skip = vsyscall_seccomp(tsk, __NR_getcpu);
-		if (skip)
-			break;
-
-		if (!write_ok_or_segv(regs->di, sizeof(unsigned)) ||
-		    !write_ok_or_segv(regs->si, sizeof(unsigned)))
-			break;
-
 		ret = sys_getcpu((unsigned __user *)regs->di,
 				 (unsigned __user *)regs->si,
 				 NULL);
@@ -277,12 +291,7 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
 
 	current_thread_info()->sig_on_uaccess_error = prev_sig_on_uaccess_error;
 
-	if (skip) {
-		if ((long)regs->ax <= 0L) /* seccomp errno emulation */
-			goto do_ret;
-		goto done; /* seccomp trace/trap */
-	}
-
+check_fault:
 	if (ret == -EFAULT) {
 		/* Bad news -- userspace fed a bad pointer to a vsyscall. */
 		warn_bad_vsyscall(KERN_INFO, regs,
@@ -305,7 +314,6 @@ do_ret:
 	/* Emulate a ret instruction. */
 	regs->ip = caller;
 	regs->sp += 8;
-done:
 	return true;
 
 sigsegv:
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index ee376be..5af44b5 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -396,25 +396,29 @@ int __secure_computing(int this_syscall)
 #ifdef CONFIG_SECCOMP_FILTER
 	case SECCOMP_MODE_FILTER: {
 		int data;
+		struct pt_regs *regs = task_pt_regs(current);
 		ret = seccomp_run_filters(this_syscall);
 		data = ret & SECCOMP_RET_DATA;
 		ret &= SECCOMP_RET_ACTION;
 		switch (ret) {
 		case SECCOMP_RET_ERRNO:
 			/* Set the low-order 16-bits as a errno. */
-			syscall_set_return_value(current, task_pt_regs(current),
+			syscall_set_return_value(current, regs,
 						 -data, 0);
 			goto skip;
 		case SECCOMP_RET_TRAP:
 			/* Show the handler the original registers. */
-			syscall_rollback(current, task_pt_regs(current));
+			syscall_rollback(current, regs);
 			/* Let the filter pass back 16 bits of data. */
 			seccomp_send_sigsys(this_syscall, data);
 			goto skip;
 		case SECCOMP_RET_TRACE:
 			/* Skip these calls if there is no tracer. */
-			if (!ptrace_event_enabled(current, PTRACE_EVENT_SECCOMP))
+			if (!ptrace_event_enabled(current, PTRACE_EVENT_SECCOMP)) {
+				syscall_set_return_value(current, regs,
+							 -ENOSYS, 0);
 				goto skip;
+			}
 			/* Allow the BPF to provide the event message */
 			ptrace_event(PTRACE_EVENT_SECCOMP, data);
 			/*
@@ -425,6 +429,9 @@ int __secure_computing(int this_syscall)
 			 */
 			if (fatal_signal_pending(current))
 				break;
+			if (syscall_get_nr(current, regs) < 0)
+				goto skip;  /* Explicit request to skip. */
+
 			return 0;
 		case SECCOMP_RET_ALLOW:
 			return 0;
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3.5 2/2] seccomp: Future-proof against silly tracers
  2012-07-17 23:19 [PATCH 3.5 0/2] seccomp and vsyscall fixes Andy Lutomirski
  2012-07-17 23:19 ` [PATCH 3.5 1/2] seccomp: Make syscall skipping and nr changes more consistent Andy Lutomirski
@ 2012-07-17 23:19 ` Andy Lutomirski
  2012-07-18  2:13   ` Will Drewry
  2012-07-26 15:41   ` Andy Lutomirski
  2012-07-18  2:10 ` [PATCH 3.5 0/2] seccomp and vsyscall fixes Will Drewry
  2012-09-27 17:36 ` Greg KH
  3 siblings, 2 replies; 16+ messages in thread
From: Andy Lutomirski @ 2012-07-17 23:19 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel; +Cc: Will Drewry, Kees Cook, Andy Lutomirski

Currently, if a tracer changes a syscall nr to __NR_future_enosys,
behavior will differ between kernels that know about
__NR_future_enosys (and return -ENOSYS) and older kernels (which
return the value from pt_regs).  This is silly; we should just
return -ENOSYS.

This is unlikely to ever happen on x86 because the return value in
pt_regs starts out as -ENOSYS, but a silly tracer can change that.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
---
 arch/x86/include/asm/syscall.h |   11 +++++++++++
 kernel/seccomp.c               |   15 +++++++++++++++
 2 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 1ace47b..8191e057 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -70,6 +70,17 @@ static inline void syscall_set_return_value(struct task_struct *task,
 	regs->ax = (long) error ?: val;
 }
 
+static inline bool syscall_is_nr_future(struct task_struct *task,
+					struct pt_regs *regs)
+{
+#ifdef CONFIG_IA32_EMULATION
+	if (task_thread_info(task)->status & TS_COMPAT)
+		return syscall_get_nr(task, regs) > __NR_ia32_syscall_max;
+#endif
+
+	return syscall_get_nr(task, regs) > __NR_syscall_max;
+}
+
 #ifdef CONFIG_X86_32
 
 static inline void syscall_get_arguments(struct task_struct *task,
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 5af44b5..bd7527d 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -429,6 +429,21 @@ int __secure_computing(int this_syscall)
 			 */
 			if (fatal_signal_pending(current))
 				break;
+
+			if (syscall_is_nr_future(current, regs)) {
+				/*
+				 * If the tracer selects a system call that
+				 * only future kernels know about, we still need
+				 * to execute that syscall by returning
+				 * -ENOSYS.  (On x86, this only matters if the
+				 * tracer changed the return value, which would
+				 * be silly, but user code can be silly.)
+				 */
+				syscall_set_return_value(current, regs,
+							 -ENOSYS, 0);
+				goto skip;
+			}
+
 			if (syscall_get_nr(current, regs) < 0)
 				goto skip;  /* Explicit request to skip. */
 
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 3.5 0/2] seccomp and vsyscall fixes
  2012-07-17 23:19 [PATCH 3.5 0/2] seccomp and vsyscall fixes Andy Lutomirski
  2012-07-17 23:19 ` [PATCH 3.5 1/2] seccomp: Make syscall skipping and nr changes more consistent Andy Lutomirski
  2012-07-17 23:19 ` [PATCH 3.5 2/2] seccomp: Future-proof against silly tracers Andy Lutomirski
@ 2012-07-18  2:10 ` Will Drewry
  2012-09-27 17:36 ` Greg KH
  3 siblings, 0 replies; 16+ messages in thread
From: Will Drewry @ 2012-07-18  2:10 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Linus Torvalds, linux-kernel, Kees Cook

On Tue, Jul 17, 2012 at 6:19 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> Apologies for the lateness of this stuff.  I was at a conference last
> week when the Chrome issue was discovered and I couldn't do this
> properly until I got back.
>
> Will, can you confirm that this version is okay and passes your tests?
> It passes mine.

I'll pull it and test it!

> While there are no known seccomp users that will have trouble,
> SECCOMP_RET_TRAP and SECCOMP_RET_TRACE currently interact oddly with
> emulated vsyscalls.  This might lead to ABI issues down the road (if
> something starts to rely on current behavior) or unexpected malfunctions
> (if something tries to change, say, sys_gettimeofday, into a different
> syscall and gets completely bogus results on a vsyscall-using distro.
>
> It's unlikely that fixing this later will cause issues, but it would be
> nice to nail down and document the vsyscall quirks for the first
> released kernel with seccomp mode 2 support.
>
> (Patch 2/2 is very much optional.  It fixes a strange corner case.  It
>  ought to be fine for 3.6, since I very much doubt that any real code
>  will hit that corner case and cause ABI problems.)
>
> Andy Lutomirski (2):
>   seccomp: Make syscall skipping and nr changes more consistent
>   seccomp: Future-proof against silly tracers
>
>  Documentation/prctl/seccomp_filter.txt |   74 ++++++++++++++++++++--
>  arch/x86/include/asm/syscall.h         |   11 +++
>  arch/x86/kernel/vsyscall_64.c          |  110 +++++++++++++++++---------------
>  kernel/seccomp.c                       |   28 +++++++-
>  4 files changed, 163 insertions(+), 60 deletions(-)
>
> --
> 1.7.7.6
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3.5 2/2] seccomp: Future-proof against silly tracers
  2012-07-17 23:19 ` [PATCH 3.5 2/2] seccomp: Future-proof against silly tracers Andy Lutomirski
@ 2012-07-18  2:13   ` Will Drewry
  2012-07-18 18:35     ` Will Drewry
  2012-07-26 15:41   ` Andy Lutomirski
  1 sibling, 1 reply; 16+ messages in thread
From: Will Drewry @ 2012-07-18  2:13 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Linus Torvalds, linux-kernel, Kees Cook

On Tue, Jul 17, 2012 at 6:19 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> Currently, if a tracer changes a syscall nr to __NR_future_enosys,
> behavior will differ between kernels that know about
> __NR_future_enosys (and return -ENOSYS) and older kernels (which
> return the value from pt_regs).  This is silly; we should just
> return -ENOSYS.
>
> This is unlikely to ever happen on x86 because the return value in
> pt_regs starts out as -ENOSYS, but a silly tracer can change that.
>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> Cc: Will Drewry <wad@chromium.org>
> ---
>  arch/x86/include/asm/syscall.h |   11 +++++++++++
>  kernel/seccomp.c               |   15 +++++++++++++++
>  2 files changed, 26 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
> index 1ace47b..8191e057 100644
> --- a/arch/x86/include/asm/syscall.h
> +++ b/arch/x86/include/asm/syscall.h

Since this is used in an arch-wide location, the prototype and comment
should be in asm-generic/syscall.h too.

> @@ -70,6 +70,17 @@ static inline void syscall_set_return_value(struct task_struct *task,
>         regs->ax = (long) error ?: val;
>  }
>
> +static inline bool syscall_is_nr_future(struct task_struct *task,
> +                                       struct pt_regs *regs)
> +{
> +#ifdef CONFIG_IA32_EMULATION
> +       if (task_thread_info(task)->status & TS_COMPAT)
> +               return syscall_get_nr(task, regs) > __NR_ia32_syscall_max;
> +#endif
> +
> +       return syscall_get_nr(task, regs) > __NR_syscall_max;

I'm not sure how easy this will be to implement on some of the arches
where this data isn't bubbled up.  It'd be good if some non-x86 arch
maintainers chimed in (since x86 is easy and already works as expected
:).

> +}
> +
>  #ifdef CONFIG_X86_32
>
>  static inline void syscall_get_arguments(struct task_struct *task,
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 5af44b5..bd7527d 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -429,6 +429,21 @@ int __secure_computing(int this_syscall)
>                          */
>                         if (fatal_signal_pending(current))
>                                 break;
> +
> +                       if (syscall_is_nr_future(current, regs)) {
> +                               /*
> +                                * If the tracer selects a system call that
> +                                * only future kernels know about, we still need
> +                                * to execute that syscall by returning
> +                                * -ENOSYS.  (On x86, this only matters if the
> +                                * tracer changed the return value, which would
> +                                * be silly, but user code can be silly.)
> +                                */
> +                               syscall_set_return_value(current, regs,
> +                                                        -ENOSYS, 0);
> +                               goto skip;
> +                       }
> +
>                         if (syscall_get_nr(current, regs) < 0)
>                                 goto skip;  /* Explicit request to skip. */
>
> --
> 1.7.7.6
>

thanks!
will

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3.5 1/2] seccomp: Make syscall skipping and nr changes more consistent
  2012-07-17 23:19 ` [PATCH 3.5 1/2] seccomp: Make syscall skipping and nr changes more consistent Andy Lutomirski
@ 2012-07-18 18:31   ` Will Drewry
  2012-07-18 20:00     ` Andy Lutomirski
  2012-07-26 15:43   ` Andy Lutomirski
  1 sibling, 1 reply; 16+ messages in thread
From: Will Drewry @ 2012-07-18 18:31 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Linus Torvalds, linux-kernel, Kees Cook

On Tue, Jul 17, 2012 at 6:19 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> This fixes two issues that could cause incompatibility between
> kernel versions:
>
>  - If a tracer uses SECCOMP_RET_TRACE to select a syscall number
>    higher than the largest known syscall, emulate the unknown
>    vsyscall by returning -ENOSYS.  (This is unlikely to make a
>    noticeable difference on x86-64 due to the way the system call
>    entry works.)
>
>  - On x86-64 with vsyscall=emulate, skipped vsyscalls were buggy.
>
> This updates the documentation accordingly.
>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> Cc: Will Drewry <wad@chromium.org>
> ---
>  Documentation/prctl/seccomp_filter.txt |   74 ++++++++++++++++++++--
>  arch/x86/kernel/vsyscall_64.c          |  110 +++++++++++++++++---------------
>  kernel/seccomp.c                       |   13 +++-
>  3 files changed, 137 insertions(+), 60 deletions(-)
>
> diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
> index 597c3c5..1e469ef 100644
> --- a/Documentation/prctl/seccomp_filter.txt
> +++ b/Documentation/prctl/seccomp_filter.txt
> @@ -95,12 +95,15 @@ SECCOMP_RET_KILL:
>
>  SECCOMP_RET_TRAP:
>         Results in the kernel sending a SIGSYS signal to the triggering
> -       task without executing the system call.  The kernel will
> -       rollback the register state to just before the system call
> -       entry such that a signal handler in the task will be able to
> -       inspect the ucontext_t->uc_mcontext registers and emulate
> -       system call success or failure upon return from the signal
> -       handler.
> +       task without executing the system call.  siginfo->si_call_addr
> +       will show the address of the system call instruction, and
> +       siginfo->si_syscall and siginfo->si_arch will indicate which
> +       syscall was attempted.  The program counter will be as though
> +       the syscall happened (i.e. it will not point to the syscall
> +       instruction).  The return value register will contain an arch-
> +       dependent value -- if resuming execution, set it to something
> +       sensible.  (The architecture dependency is because replacing
> +       it with -ENOSYS could overwrite some useful information.)
>
>         The SECCOMP_RET_DATA portion of the return value will be passed
>         as si_errno.
> @@ -123,6 +126,18 @@ SECCOMP_RET_TRACE:
>         the BPF program return value will be available to the tracer
>         via PTRACE_GETEVENTMSG.
>
> +       The tracer can skip the system call by changing the syscall number
> +       to -1.  Alternatively, the tracer can change the system call
> +       requested by changing the system call to a valid syscall number.  If
> +       the tracer asks to skip the system call, then the system call will
> +       appear to return the value that the tracer puts in the return value
> +       register.
> +
> +       The seccomp check will not be run again after the tracer is
> +       notified.  (This means that seccomp-based sandboxes MUST NOT
> +       allow use of ptrace, even of other sandboxed processes, without
> +       extreme care; ptracers can use this mechanism to escape.)
> +
>  SECCOMP_RET_ALLOW:
>         Results in the system call being executed.
>
> @@ -161,3 +176,50 @@ architecture supports both ptrace_event and seccomp, it will be able to
>  support seccomp filter with minor fixup: SIGSYS support and seccomp return
>  value checking.  Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER
>  to its arch-specific Kconfig.
> +
> +
> +
> +Caveats
> +-------
> +
> +The vDSO can cause some system calls to run entirely in userspace,
> +leading to surprises when you run programs on different machines that
> +fall back to real syscalls.  To minimize these surprises on x86, make
> +sure you test with
> +/sys/devices/system/clocksource/clocksource0/current_clocksource set to
> +something like acpi_pm.
> +
> +On x86-64, vsyscall emulation is enabled by default.  (vsyscalls are
> +legacy variants on vDSO calls.)  Currently, emulated vsyscalls will honor seccomp, with a few oddities:
> +
> +- A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to
> +  the vsyscall entry for the given call and not the address after the
> +  'syscall' instruction.  Any code which wants to restart the call
> +  should be aware that (a) a ret instruction has been emulated and (b)
> +  trying to resume the syscall will again trigger the standard vsyscall
> +  emulation security checks, making resuming the syscall mostly
> +  pointless.
> +
> +- A return value of SECCOMP_RET_TRACE will signal the tracer as usual,
> +  but the syscall may not be changed to another system call using the
> +  orig_rax register. It may only be changed to -1 order to skip the
> +  currently emulated call. Any other change MAY terminate the process.
> +  The rip value seen by the tracer will be the syscall entry address;
> +  this is different from normal behavior.  The tracer MUST NOT modify
> +  rip or rsp.  (Do not rely on other changes terminating the process.
> +  They might work.  For example, on some kernels, choosing a syscall
> +  that only exists in future kernels will be correctly emulated (by
> +  returning -ENOSYS).
> +
> +To detect this quirky behavior, check for addr & ~0x0C00 ==
> +0xFFFFFFFFFF600000.  (For SECCOMP_RET_TRACE, use rip.  For
> +SECCOMP_RET_TRAP, use siginfo->si_call_addr.)  Do not check any other
> +condition: future kernels may improve vsyscall emulation and current
> +kernels in vsyscall=native mode will behave differently, but the
> +instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these
> +cases.
> +
> +Note that modern systems are unlikely to use vsyscalls at all -- they
> +are a legacy feature and they are considerably slower than standard
> +syscalls.  New code will use the vDSO, and vDSO-issued system calls
> +are indistinguishable from normal system calls.
> diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
> index 5db36ca..44a3a2e 100644
> --- a/arch/x86/kernel/vsyscall_64.c
> +++ b/arch/x86/kernel/vsyscall_64.c
> @@ -139,19 +139,6 @@ static int addr_to_vsyscall_nr(unsigned long addr)
>         return nr;
>  }
>
> -#ifdef CONFIG_SECCOMP
> -static int vsyscall_seccomp(struct task_struct *tsk, int syscall_nr)
> -{
> -       if (!seccomp_mode(&tsk->seccomp))
> -               return 0;
> -       task_pt_regs(tsk)->orig_ax = syscall_nr;
> -       task_pt_regs(tsk)->ax = syscall_nr;
> -       return __secure_computing(syscall_nr);
> -}
> -#else
> -#define vsyscall_seccomp(_tsk, _nr) 0
> -#endif
> -
>  static bool write_ok_or_segv(unsigned long ptr, size_t size)
>  {
>         /*
> @@ -184,10 +171,9 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
>  {
>         struct task_struct *tsk;
>         unsigned long caller;
> -       int vsyscall_nr;
> +       int vsyscall_nr, syscall_nr, tmp;
>         int prev_sig_on_uaccess_error;
>         long ret;
> -       int skip;
>
>         /*
>          * No point in checking CS -- the only way to get here is a user mode
> @@ -219,56 +205,84 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
>         }
>
>         tsk = current;
> -       /*
> -        * With a real vsyscall, page faults cause SIGSEGV.  We want to
> -        * preserve that behavior to make writing exploits harder.
> -        */
> -       prev_sig_on_uaccess_error = current_thread_info()->sig_on_uaccess_error;
> -       current_thread_info()->sig_on_uaccess_error = 1;
>
>         /*
> +        * Check for access_ok violations and find the syscall nr.
> +        *
>          * NULL is a valid user pointer (in the access_ok sense) on 32-bit and
>          * 64-bit, so we don't need to special-case it here.  For all the
>          * vsyscalls, NULL means "don't write anything" not "write it at
>          * address 0".
>          */
> -       ret = -EFAULT;
> -       skip = 0;
>         switch (vsyscall_nr) {
>         case 0:
> -               skip = vsyscall_seccomp(tsk, __NR_gettimeofday);
> -               if (skip)
> -                       break;
> -
>                 if (!write_ok_or_segv(regs->di, sizeof(struct timeval)) ||
> -                   !write_ok_or_segv(regs->si, sizeof(struct timezone)))
> -                       break;
> +                   !write_ok_or_segv(regs->si, sizeof(struct timezone))) {
> +                       ret = -EFAULT;
> +                       goto check_fault;
> +               }
> +
> +               syscall_nr = __NR_gettimeofday;
> +               break;
> +
> +       case 1:
> +               if (!write_ok_or_segv(regs->di, sizeof(time_t))) {
> +                       ret = -EFAULT;
> +                       goto check_fault;
> +               }
> +
> +               syscall_nr = __NR_time;
> +               break;
> +
> +       case 2:
> +               if (!write_ok_or_segv(regs->di, sizeof(unsigned)) ||
> +                   !write_ok_or_segv(regs->si, sizeof(unsigned))) {
> +                       ret = -EFAULT;
> +                       goto check_fault;
> +               }
> +
> +               syscall_nr = __NR_getcpu;
> +               break;
> +       }
> +
> +       /*
> +        * Handle seccomp.  regs->ip must be the original value.
> +        * See seccomp_send_sigsys and Documentation/prctl/seccomp_filter.txt.
> +        *
> +        * We could optimize the seccomp disabled case, but performance
> +        * here doesn't matter.
> +        */
> +       regs->orig_ax = syscall_nr;
> +       regs->ax = -ENOSYS;
> +       tmp = secure_computing(syscall_nr);
> +       if ((regs->orig_ax != syscall_nr && !tmp) || regs->ip != address) {

Would it make sense to check tmp first since it is the most common case?

If it is skipping, is there any reason to block ip changes?  Of
course, I don't have a test for normal ptrace with IP change at
syscall exit versus this, so I'm fine with it starting more
restrictive.


> +               warn_bad_vsyscall(KERN_DEBUG, regs,
> +                                 "seccomp tried to change syscall nr or ip");
> +               do_exit(SIGSYS);
> +       }
> +       if (tmp)
> +               goto do_ret;  /* skip requested */
>
> +       /*
> +        * With a real vsyscall, page faults cause SIGSEGV.  We want to
> +        * preserve that behavior to make writing exploits harder.
> +        */
> +       prev_sig_on_uaccess_error = current_thread_info()->sig_on_uaccess_error;
> +       current_thread_info()->sig_on_uaccess_error = 1;
> +
> +       ret = -EFAULT;
> +       switch (vsyscall_nr) {
> +       case 0:
>                 ret = sys_gettimeofday(
>                         (struct timeval __user *)regs->di,
>                         (struct timezone __user *)regs->si);
>                 break;
>
>         case 1:
> -               skip = vsyscall_seccomp(tsk, __NR_time);
> -               if (skip)
> -                       break;
> -
> -               if (!write_ok_or_segv(regs->di, sizeof(time_t)))
> -                       break;
> -
>                 ret = sys_time((time_t __user *)regs->di);
>                 break;
>
>         case 2:
> -               skip = vsyscall_seccomp(tsk, __NR_getcpu);
> -               if (skip)
> -                       break;
> -
> -               if (!write_ok_or_segv(regs->di, sizeof(unsigned)) ||
> -                   !write_ok_or_segv(regs->si, sizeof(unsigned)))
> -                       break;
> -
>                 ret = sys_getcpu((unsigned __user *)regs->di,
>                                  (unsigned __user *)regs->si,
>                                  NULL);
> @@ -277,12 +291,7 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
>
>         current_thread_info()->sig_on_uaccess_error = prev_sig_on_uaccess_error;
>
> -       if (skip) {
> -               if ((long)regs->ax <= 0L) /* seccomp errno emulation */
> -                       goto do_ret;
> -               goto done; /* seccomp trace/trap */
> -       }
> -
> +check_fault:
>         if (ret == -EFAULT) {
>                 /* Bad news -- userspace fed a bad pointer to a vsyscall. */
>                 warn_bad_vsyscall(KERN_INFO, regs,
> @@ -305,7 +314,6 @@ do_ret:
>         /* Emulate a ret instruction. */
>         regs->ip = caller;
>         regs->sp += 8;
> -done:
>         return true;
>
>  sigsegv:
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index ee376be..5af44b5 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -396,25 +396,29 @@ int __secure_computing(int this_syscall)
>  #ifdef CONFIG_SECCOMP_FILTER
>         case SECCOMP_MODE_FILTER: {
>                 int data;
> +               struct pt_regs *regs = task_pt_regs(current);
>                 ret = seccomp_run_filters(this_syscall);
>                 data = ret & SECCOMP_RET_DATA;
>                 ret &= SECCOMP_RET_ACTION;
>                 switch (ret) {
>                 case SECCOMP_RET_ERRNO:
>                         /* Set the low-order 16-bits as a errno. */
> -                       syscall_set_return_value(current, task_pt_regs(current),
> +                       syscall_set_return_value(current, regs,
>                                                  -data, 0);
>                         goto skip;
>                 case SECCOMP_RET_TRAP:
>                         /* Show the handler the original registers. */
> -                       syscall_rollback(current, task_pt_regs(current));
> +                       syscall_rollback(current, regs);
>                         /* Let the filter pass back 16 bits of data. */
>                         seccomp_send_sigsys(this_syscall, data);
>                         goto skip;
>                 case SECCOMP_RET_TRACE:
>                         /* Skip these calls if there is no tracer. */
> -                       if (!ptrace_event_enabled(current, PTRACE_EVENT_SECCOMP))
> +                       if (!ptrace_event_enabled(current, PTRACE_EVENT_SECCOMP)) {
> +                               syscall_set_return_value(current, regs,
> +                                                        -ENOSYS, 0);

Thanks! I've been meaning to post this, but was waiting until I added
a non-x86 arch :)

>                                 goto skip;
> +                       }
>                         /* Allow the BPF to provide the event message */
>                         ptrace_event(PTRACE_EVENT_SECCOMP, data);
>                         /*
> @@ -425,6 +429,9 @@ int __secure_computing(int this_syscall)
>                          */
>                         if (fatal_signal_pending(current))
>                                 break;
> +                       if (syscall_get_nr(current, regs) < 0)
> +                               goto skip;  /* Explicit request to skip. */
> +
>                         return 0;
>                 case SECCOMP_RET_ALLOW:
>                         return 0;
> --
> 1.7.7.6
>

Acked-by: Will Drewry <wad@chromium.org>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3.5 2/2] seccomp: Future-proof against silly tracers
  2012-07-18  2:13   ` Will Drewry
@ 2012-07-18 18:35     ` Will Drewry
  2012-07-18 20:08       ` Andy Lutomirski
  0 siblings, 1 reply; 16+ messages in thread
From: Will Drewry @ 2012-07-18 18:35 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Linus Torvalds, linux-kernel, Kees Cook

On Tue, Jul 17, 2012 at 9:13 PM, Will Drewry <wad@chromium.org> wrote:
> On Tue, Jul 17, 2012 at 6:19 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> Currently, if a tracer changes a syscall nr to __NR_future_enosys,
>> behavior will differ between kernels that know about
>> __NR_future_enosys (and return -ENOSYS) and older kernels (which
>> return the value from pt_regs).  This is silly; we should just
>> return -ENOSYS.
>>
>> This is unlikely to ever happen on x86 because the return value in
>> pt_regs starts out as -ENOSYS, but a silly tracer can change that.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> Cc: Will Drewry <wad@chromium.org>
>> ---
>>  arch/x86/include/asm/syscall.h |   11 +++++++++++
>>  kernel/seccomp.c               |   15 +++++++++++++++
>>  2 files changed, 26 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
>> index 1ace47b..8191e057 100644
>> --- a/arch/x86/include/asm/syscall.h
>> +++ b/arch/x86/include/asm/syscall.h
>
> Since this is used in an arch-wide location, the prototype and comment
> should be in asm-generic/syscall.h too.
>
>> @@ -70,6 +70,17 @@ static inline void syscall_set_return_value(struct task_struct *task,
>>         regs->ax = (long) error ?: val;
>>  }
>>
>> +static inline bool syscall_is_nr_future(struct task_struct *task,
>> +                                       struct pt_regs *regs)
>> +{
>> +#ifdef CONFIG_IA32_EMULATION
>> +       if (task_thread_info(task)->status & TS_COMPAT)
>> +               return syscall_get_nr(task, regs) > __NR_ia32_syscall_max;
>> +#endif
>> +
>> +       return syscall_get_nr(task, regs) > __NR_syscall_max;
>
> I'm not sure how easy this will be to implement on some of the arches
> where this data isn't bubbled up.  It'd be good if some non-x86 arch
> maintainers chimed in (since x86 is easy and already works as expected
> :).
>

Since x86 always returns -ENOSYS with an invalid syscall and only x86
supports seccomp filter in 3.5, I'd propose pushing this off for 3.6+
to get more feedback from the relevant parties.  Not doing it now
doesn't expose any users of 3.5 to any sort of changing ABI.

(Otherwise, it seems fine but may make adding new arches slightly more
onerous, but I suspect ftrace needs this sort of info too as it
spreads to other arches!)

>> +}
>> +
>>  #ifdef CONFIG_X86_32
>>
>>  static inline void syscall_get_arguments(struct task_struct *task,
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index 5af44b5..bd7527d 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -429,6 +429,21 @@ int __secure_computing(int this_syscall)
>>                          */
>>                         if (fatal_signal_pending(current))
>>                                 break;
>> +
>> +                       if (syscall_is_nr_future(current, regs)) {
>> +                               /*
>> +                                * If the tracer selects a system call that
>> +                                * only future kernels know about, we still need
>> +                                * to execute that syscall by returning
>> +                                * -ENOSYS.  (On x86, this only matters if the
>> +                                * tracer changed the return value, which would
>> +                                * be silly, but user code can be silly.)
>> +                                */
>> +                               syscall_set_return_value(current, regs,
>> +                                                        -ENOSYS, 0);
>> +                               goto skip;
>> +                       }
>> +
>>                         if (syscall_get_nr(current, regs) < 0)
>>                                 goto skip;  /* Explicit request to skip. */
>>
>> --
>> 1.7.7.6
>>
>
> thanks!
> will

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3.5 1/2] seccomp: Make syscall skipping and nr changes more consistent
  2012-07-18 18:31   ` Will Drewry
@ 2012-07-18 20:00     ` Andy Lutomirski
  0 siblings, 0 replies; 16+ messages in thread
From: Andy Lutomirski @ 2012-07-18 20:00 UTC (permalink / raw)
  To: Will Drewry; +Cc: Linus Torvalds, linux-kernel, Kees Cook

On Wed, Jul 18, 2012 at 11:31 AM, Will Drewry <wad@chromium.org> wrote:
> On Tue, Jul 17, 2012 at 6:19 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> This fixes two issues that could cause incompatibility between
>> kernel versions:
>>
>>  - If a tracer uses SECCOMP_RET_TRACE to select a syscall number
>>    higher than the largest known syscall, emulate the unknown
>>    vsyscall by returning -ENOSYS.  (This is unlikely to make a
>>    noticeable difference on x86-64 due to the way the system call
>>    entry works.)
>>
>>  - On x86-64 with vsyscall=emulate, skipped vsyscalls were buggy.
>>
>> This updates the documentation accordingly.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> Cc: Will Drewry <wad@chromium.org>
>> ---
>>  Documentation/prctl/seccomp_filter.txt |   74 ++++++++++++++++++++--
>>  arch/x86/kernel/vsyscall_64.c          |  110 +++++++++++++++++---------------
>>  kernel/seccomp.c                       |   13 +++-
>>  3 files changed, 137 insertions(+), 60 deletions(-)
>>
>> diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
>> index 597c3c5..1e469ef 100644
>> --- a/Documentation/prctl/seccomp_filter.txt
>> +++ b/Documentation/prctl/seccomp_filter.txt
>> @@ -95,12 +95,15 @@ SECCOMP_RET_KILL:
>>
>>  SECCOMP_RET_TRAP:
>>         Results in the kernel sending a SIGSYS signal to the triggering
>> -       task without executing the system call.  The kernel will
>> -       rollback the register state to just before the system call
>> -       entry such that a signal handler in the task will be able to
>> -       inspect the ucontext_t->uc_mcontext registers and emulate
>> -       system call success or failure upon return from the signal
>> -       handler.
>> +       task without executing the system call.  siginfo->si_call_addr
>> +       will show the address of the system call instruction, and
>> +       siginfo->si_syscall and siginfo->si_arch will indicate which
>> +       syscall was attempted.  The program counter will be as though
>> +       the syscall happened (i.e. it will not point to the syscall
>> +       instruction).  The return value register will contain an arch-
>> +       dependent value -- if resuming execution, set it to something
>> +       sensible.  (The architecture dependency is because replacing
>> +       it with -ENOSYS could overwrite some useful information.)
>>
>>         The SECCOMP_RET_DATA portion of the return value will be passed
>>         as si_errno.
>> @@ -123,6 +126,18 @@ SECCOMP_RET_TRACE:
>>         the BPF program return value will be available to the tracer
>>         via PTRACE_GETEVENTMSG.
>>
>> +       The tracer can skip the system call by changing the syscall number
>> +       to -1.  Alternatively, the tracer can change the system call
>> +       requested by changing the system call to a valid syscall number.  If
>> +       the tracer asks to skip the system call, then the system call will
>> +       appear to return the value that the tracer puts in the return value
>> +       register.
>> +
>> +       The seccomp check will not be run again after the tracer is
>> +       notified.  (This means that seccomp-based sandboxes MUST NOT
>> +       allow use of ptrace, even of other sandboxed processes, without
>> +       extreme care; ptracers can use this mechanism to escape.)
>> +
>>  SECCOMP_RET_ALLOW:
>>         Results in the system call being executed.
>>
>> @@ -161,3 +176,50 @@ architecture supports both ptrace_event and seccomp, it will be able to
>>  support seccomp filter with minor fixup: SIGSYS support and seccomp return
>>  value checking.  Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER
>>  to its arch-specific Kconfig.
>> +
>> +
>> +
>> +Caveats
>> +-------
>> +
>> +The vDSO can cause some system calls to run entirely in userspace,
>> +leading to surprises when you run programs on different machines that
>> +fall back to real syscalls.  To minimize these surprises on x86, make
>> +sure you test with
>> +/sys/devices/system/clocksource/clocksource0/current_clocksource set to
>> +something like acpi_pm.
>> +
>> +On x86-64, vsyscall emulation is enabled by default.  (vsyscalls are
>> +legacy variants on vDSO calls.)  Currently, emulated vsyscalls will honor seccomp, with a few oddities:
>> +
>> +- A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to
>> +  the vsyscall entry for the given call and not the address after the
>> +  'syscall' instruction.  Any code which wants to restart the call
>> +  should be aware that (a) a ret instruction has been emulated and (b)
>> +  trying to resume the syscall will again trigger the standard vsyscall
>> +  emulation security checks, making resuming the syscall mostly
>> +  pointless.
>> +
>> +- A return value of SECCOMP_RET_TRACE will signal the tracer as usual,
>> +  but the syscall may not be changed to another system call using the
>> +  orig_rax register. It may only be changed to -1 order to skip the
>> +  currently emulated call. Any other change MAY terminate the process.
>> +  The rip value seen by the tracer will be the syscall entry address;
>> +  this is different from normal behavior.  The tracer MUST NOT modify
>> +  rip or rsp.  (Do not rely on other changes terminating the process.
>> +  They might work.  For example, on some kernels, choosing a syscall
>> +  that only exists in future kernels will be correctly emulated (by
>> +  returning -ENOSYS).
>> +
>> +To detect this quirky behavior, check for addr & ~0x0C00 ==
>> +0xFFFFFFFFFF600000.  (For SECCOMP_RET_TRACE, use rip.  For
>> +SECCOMP_RET_TRAP, use siginfo->si_call_addr.)  Do not check any other
>> +condition: future kernels may improve vsyscall emulation and current
>> +kernels in vsyscall=native mode will behave differently, but the
>> +instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these
>> +cases.
>> +
>> +Note that modern systems are unlikely to use vsyscalls at all -- they
>> +are a legacy feature and they are considerably slower than standard
>> +syscalls.  New code will use the vDSO, and vDSO-issued system calls
>> +are indistinguishable from normal system calls.
>> diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
>> index 5db36ca..44a3a2e 100644
>> --- a/arch/x86/kernel/vsyscall_64.c
>> +++ b/arch/x86/kernel/vsyscall_64.c
>> @@ -139,19 +139,6 @@ static int addr_to_vsyscall_nr(unsigned long addr)
>>         return nr;
>>  }
>>
>> -#ifdef CONFIG_SECCOMP
>> -static int vsyscall_seccomp(struct task_struct *tsk, int syscall_nr)
>> -{
>> -       if (!seccomp_mode(&tsk->seccomp))
>> -               return 0;
>> -       task_pt_regs(tsk)->orig_ax = syscall_nr;
>> -       task_pt_regs(tsk)->ax = syscall_nr;
>> -       return __secure_computing(syscall_nr);
>> -}
>> -#else
>> -#define vsyscall_seccomp(_tsk, _nr) 0
>> -#endif
>> -
>>  static bool write_ok_or_segv(unsigned long ptr, size_t size)
>>  {
>>         /*
>> @@ -184,10 +171,9 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
>>  {
>>         struct task_struct *tsk;
>>         unsigned long caller;
>> -       int vsyscall_nr;
>> +       int vsyscall_nr, syscall_nr, tmp;
>>         int prev_sig_on_uaccess_error;
>>         long ret;
>> -       int skip;
>>
>>         /*
>>          * No point in checking CS -- the only way to get here is a user mode
>> @@ -219,56 +205,84 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
>>         }
>>
>>         tsk = current;
>> -       /*
>> -        * With a real vsyscall, page faults cause SIGSEGV.  We want to
>> -        * preserve that behavior to make writing exploits harder.
>> -        */
>> -       prev_sig_on_uaccess_error = current_thread_info()->sig_on_uaccess_error;
>> -       current_thread_info()->sig_on_uaccess_error = 1;
>>
>>         /*
>> +        * Check for access_ok violations and find the syscall nr.
>> +        *
>>          * NULL is a valid user pointer (in the access_ok sense) on 32-bit and
>>          * 64-bit, so we don't need to special-case it here.  For all the
>>          * vsyscalls, NULL means "don't write anything" not "write it at
>>          * address 0".
>>          */
>> -       ret = -EFAULT;
>> -       skip = 0;
>>         switch (vsyscall_nr) {
>>         case 0:
>> -               skip = vsyscall_seccomp(tsk, __NR_gettimeofday);
>> -               if (skip)
>> -                       break;
>> -
>>                 if (!write_ok_or_segv(regs->di, sizeof(struct timeval)) ||
>> -                   !write_ok_or_segv(regs->si, sizeof(struct timezone)))
>> -                       break;
>> +                   !write_ok_or_segv(regs->si, sizeof(struct timezone))) {
>> +                       ret = -EFAULT;
>> +                       goto check_fault;
>> +               }
>> +
>> +               syscall_nr = __NR_gettimeofday;
>> +               break;
>> +
>> +       case 1:
>> +               if (!write_ok_or_segv(regs->di, sizeof(time_t))) {
>> +                       ret = -EFAULT;
>> +                       goto check_fault;
>> +               }
>> +
>> +               syscall_nr = __NR_time;
>> +               break;
>> +
>> +       case 2:
>> +               if (!write_ok_or_segv(regs->di, sizeof(unsigned)) ||
>> +                   !write_ok_or_segv(regs->si, sizeof(unsigned))) {
>> +                       ret = -EFAULT;
>> +                       goto check_fault;
>> +               }
>> +
>> +               syscall_nr = __NR_getcpu;
>> +               break;
>> +       }
>> +
>> +       /*
>> +        * Handle seccomp.  regs->ip must be the original value.
>> +        * See seccomp_send_sigsys and Documentation/prctl/seccomp_filter.txt.
>> +        *
>> +        * We could optimize the seccomp disabled case, but performance
>> +        * here doesn't matter.
>> +        */
>> +       regs->orig_ax = syscall_nr;
>> +       regs->ax = -ENOSYS;
>> +       tmp = secure_computing(syscall_nr);
>> +       if ((regs->orig_ax != syscall_nr && !tmp) || regs->ip != address) {
>
> Would it make sense to check tmp first since it is the most common case?

This code is basically irrelevant from a performance perspective.  If
I do another version, I'll change it.

>
> If it is skipping, is there any reason to block ip changes?  Of
> course, I don't have a test for normal ptrace with IP change at
> syscall exit versus this, so I'm fine with it starting more
> restrictive.
>

The emulation of the ret instruction will blow away the ip change.
Giving the tracer the option to skip the ret emulation seems rather
silly.

We could get fancy by setting ip to point to the ret instruction,
which would be a little less odd.  Then, if the tracer changed ip, we
could accept that change and skip the ret emulation.  (This would
involve changing the docs, though.)

Or we could say "if the rip is in the vsyscall page, quirks may happen
or may not depending on kernel version.  Test if you care."  Thoughts?
 (The only downside of this patch that I can see is that, if we make
that change later, we risk breaking something.  But tracers really
have little business changing ip anyway, especially since it's
nontrivial (or even impossible) to find the original syscall entry
point on x86 given just the return address.

>>                 case SECCOMP_RET_TRACE:
>>                         /* Skip these calls if there is no tracer. */
>> -                       if (!ptrace_event_enabled(current, PTRACE_EVENT_SECCOMP))
>> +                       if (!ptrace_event_enabled(current, PTRACE_EVENT_SECCOMP)) {
>> +                               syscall_set_return_value(current, regs,
>> +                                                        -ENOSYS, 0);
>
> Thanks! I've been meaning to post this, but was waiting until I added
> a non-x86 arch :)

This hunk could be dropped for now, I guess.  It's pretty clearly safe, though.

>
> Acked-by: Will Drewry <wad@chromium.org>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3.5 2/2] seccomp: Future-proof against silly tracers
  2012-07-18 18:35     ` Will Drewry
@ 2012-07-18 20:08       ` Andy Lutomirski
  0 siblings, 0 replies; 16+ messages in thread
From: Andy Lutomirski @ 2012-07-18 20:08 UTC (permalink / raw)
  To: Will Drewry; +Cc: Linus Torvalds, linux-kernel, Kees Cook

On Wed, Jul 18, 2012 at 11:35 AM, Will Drewry <wad@chromium.org> wrote:
> On Tue, Jul 17, 2012 at 9:13 PM, Will Drewry <wad@chromium.org> wrote:
>> On Tue, Jul 17, 2012 at 6:19 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> Currently, if a tracer changes a syscall nr to __NR_future_enosys,
>>> behavior will differ between kernels that know about
>>> __NR_future_enosys (and return -ENOSYS) and older kernels (which
>>> return the value from pt_regs).  This is silly; we should just
>>> return -ENOSYS.
>>>
>>> This is unlikely to ever happen on x86 because the return value in
>>> pt_regs starts out as -ENOSYS, but a silly tracer can change that.
>>>
>>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>>> Cc: Will Drewry <wad@chromium.org>
>>> ---
>>>  arch/x86/include/asm/syscall.h |   11 +++++++++++
>>>  kernel/seccomp.c               |   15 +++++++++++++++
>>>  2 files changed, 26 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
>>> index 1ace47b..8191e057 100644
>>> --- a/arch/x86/include/asm/syscall.h
>>> +++ b/arch/x86/include/asm/syscall.h
>>
>> Since this is used in an arch-wide location, the prototype and comment
>> should be in asm-generic/syscall.h too.
>>
>>> @@ -70,6 +70,17 @@ static inline void syscall_set_return_value(struct task_struct *task,
>>>         regs->ax = (long) error ?: val;
>>>  }
>>>
>>> +static inline bool syscall_is_nr_future(struct task_struct *task,
>>> +                                       struct pt_regs *regs)
>>> +{
>>> +#ifdef CONFIG_IA32_EMULATION
>>> +       if (task_thread_info(task)->status & TS_COMPAT)
>>> +               return syscall_get_nr(task, regs) > __NR_ia32_syscall_max;
>>> +#endif
>>> +
>>> +       return syscall_get_nr(task, regs) > __NR_syscall_max;
>>
>> I'm not sure how easy this will be to implement on some of the arches
>> where this data isn't bubbled up.  It'd be good if some non-x86 arch
>> maintainers chimed in (since x86 is easy and already works as expected
>> :).
>>
>
> Since x86 always returns -ENOSYS with an invalid syscall and only x86
> supports seccomp filter in 3.5, I'd propose pushing this off for 3.6+
> to get more feedback from the relevant parties.  Not doing it now
> doesn't expose any users of 3.5 to any sort of changing ABI.

It's (barely) visible.  See VSYS.trace_changenr_high here:

https://github.com/amluto/seccomp

>
> (Otherwise, it seems fine but may make adding new arches slightly more
> onerous, but I suspect ftrace needs this sort of info too as it
> spreads to other arches!)

Are there any arches for which the return value isn't its own
register?  If so, this is necessary.

Agreed, though: let's wait for 3.6.

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3.5 2/2] seccomp: Future-proof against silly tracers
  2012-07-17 23:19 ` [PATCH 3.5 2/2] seccomp: Future-proof against silly tracers Andy Lutomirski
  2012-07-18  2:13   ` Will Drewry
@ 2012-07-26 15:41   ` Andy Lutomirski
  2012-08-02 14:32     ` Will Drewry
  1 sibling, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2012-07-26 15:41 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel; +Cc: Will Drewry, Kees Cook, Andy Lutomirski

On Tue, Jul 17, 2012 at 4:19 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> Currently, if a tracer changes a syscall nr to __NR_future_enosys,
> behavior will differ between kernels that know about
> __NR_future_enosys (and return -ENOSYS) and older kernels (which
> return the value from pt_regs).  This is silly; we should just
> return -ENOSYS.
>
> This is unlikely to ever happen on x86 because the return value in
> pt_regs starts out as -ENOSYS, but a silly tracer can change that.
>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> Cc: Will Drewry <wad@chromium.org>
> ---
>  arch/x86/include/asm/syscall.h |   11 +++++++++++
>  kernel/seccomp.c               |   15 +++++++++++++++
>  2 files changed, 26 insertions(+), 0 deletions(-)

Will, can you pick this, or some version of it, up in your
seccomp-for-ARM tree or wherever your development is?

--Andy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3.5 1/2] seccomp: Make syscall skipping and nr changes more consistent
  2012-07-17 23:19 ` [PATCH 3.5 1/2] seccomp: Make syscall skipping and nr changes more consistent Andy Lutomirski
  2012-07-18 18:31   ` Will Drewry
@ 2012-07-26 15:43   ` Andy Lutomirski
  1 sibling, 0 replies; 16+ messages in thread
From: Andy Lutomirski @ 2012-07-26 15:43 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel
  Cc: Will Drewry, Kees Cook, Andy Lutomirski, James Morris

On Tue, Jul 17, 2012 at 4:19 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> This fixes two issues that could cause incompatibility between
> kernel versions:
>
>  - If a tracer uses SECCOMP_RET_TRACE to select a syscall number
>    higher than the largest known syscall, emulate the unknown
>    vsyscall by returning -ENOSYS.  (This is unlikely to make a
>    noticeable difference on x86-64 due to the way the system call
>    entry works.)
>
>  - On x86-64 with vsyscall=emulate, skipped vsyscalls were buggy.
>
> This updates the documentation accordingly.
>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> Cc: Will Drewry <wad@chromium.org>
> ---
>  Documentation/prctl/seccomp_filter.txt |   74 ++++++++++++++++++++--
>  arch/x86/kernel/vsyscall_64.c          |  110 +++++++++++++++++---------------
>  kernel/seccomp.c                       |   13 +++-
>  3 files changed, 137 insertions(+), 60 deletions(-)

This is still necessary for vsyscall emulation to play nicely with
fancy seccomp tricks.  Can any of you (James?) send it toward Linus?

We might want to tag this for -stable as well if it survives in the
3.6 tree for a while.

--Andy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3.5 2/2] seccomp: Future-proof against silly tracers
  2012-07-26 15:41   ` Andy Lutomirski
@ 2012-08-02 14:32     ` Will Drewry
  2012-08-02 16:54       ` Andy Lutomirski
  0 siblings, 1 reply; 16+ messages in thread
From: Will Drewry @ 2012-08-02 14:32 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Linus Torvalds, linux-kernel, Kees Cook

On Thu, Jul 26, 2012 at 10:41 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Jul 17, 2012 at 4:19 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> Currently, if a tracer changes a syscall nr to __NR_future_enosys,
>> behavior will differ between kernels that know about
>> __NR_future_enosys (and return -ENOSYS) and older kernels (which
>> return the value from pt_regs).  This is silly; we should just
>> return -ENOSYS.
>>
>> This is unlikely to ever happen on x86 because the return value in
>> pt_regs starts out as -ENOSYS, but a silly tracer can change that.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> Cc: Will Drewry <wad@chromium.org>
>> ---
>>  arch/x86/include/asm/syscall.h |   11 +++++++++++
>>  kernel/seccomp.c               |   15 +++++++++++++++
>>  2 files changed, 26 insertions(+), 0 deletions(-)
>
> Will, can you pick this, or some version of it, up in your
> seccomp-for-ARM tree or wherever your development is?

I'm still not sure about this change though the end result is nice.
Regardless, I'll explore it when I can -- my family has just increased
in size, so I'm going to be a bit delayed!

cheers!
will

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3.5 2/2] seccomp: Future-proof against silly tracers
  2012-08-02 14:32     ` Will Drewry
@ 2012-08-02 16:54       ` Andy Lutomirski
  0 siblings, 0 replies; 16+ messages in thread
From: Andy Lutomirski @ 2012-08-02 16:54 UTC (permalink / raw)
  To: Will Drewry; +Cc: Linus Torvalds, linux-kernel, Kees Cook

On Thu, Aug 2, 2012 at 7:32 AM, Will Drewry <wad@chromium.org> wrote:
> On Thu, Jul 26, 2012 at 10:41 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Tue, Jul 17, 2012 at 4:19 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> Currently, if a tracer changes a syscall nr to __NR_future_enosys,
>>> behavior will differ between kernels that know about
>>> __NR_future_enosys (and return -ENOSYS) and older kernels (which
>>> return the value from pt_regs).  This is silly; we should just
>>> return -ENOSYS.
>>>
>>> This is unlikely to ever happen on x86 because the return value in
>>> pt_regs starts out as -ENOSYS, but a silly tracer can change that.
>>>
>>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>>> Cc: Will Drewry <wad@chromium.org>
>>> ---
>>>  arch/x86/include/asm/syscall.h |   11 +++++++++++
>>>  kernel/seccomp.c               |   15 +++++++++++++++
>>>  2 files changed, 26 insertions(+), 0 deletions(-)
>>
>> Will, can you pick this, or some version of it, up in your
>> seccomp-for-ARM tree or wherever your development is?
>
> I'm still not sure about this change though the end result is nice.
> Regardless, I'll explore it when I can -- my family has just increased
> in size, so I'm going to be a bit delayed!

I don't think there's any particular rush here.  It's a very minor ABI
change, but I had to write code to explicitly look for it to detect
any difference at all.

--Andy


>
> cheers!
> will



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3.5 0/2] seccomp and vsyscall fixes
  2012-07-17 23:19 [PATCH 3.5 0/2] seccomp and vsyscall fixes Andy Lutomirski
                   ` (2 preceding siblings ...)
  2012-07-18  2:10 ` [PATCH 3.5 0/2] seccomp and vsyscall fixes Will Drewry
@ 2012-09-27 17:36 ` Greg KH
  2012-09-27 19:40   ` Andy Lutomirski
  3 siblings, 1 reply; 16+ messages in thread
From: Greg KH @ 2012-09-27 17:36 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Linus Torvalds, linux-kernel, Will Drewry, Kees Cook

On Tue, Jul 17, 2012 at 04:19:18PM -0700, Andy Lutomirski wrote:
> Apologies for the lateness of this stuff.  I was at a conference last
> week when the Chrome issue was discovered and I couldn't do this
> properly until I got back.
> 
> Will, can you confirm that this version is okay and passes your tests?
> It passes mine.
> 
> While there are no known seccomp users that will have trouble,
> SECCOMP_RET_TRAP and SECCOMP_RET_TRACE currently interact oddly with
> emulated vsyscalls.  This might lead to ABI issues down the road (if
> something starts to rely on current behavior) or unexpected malfunctions
> (if something tries to change, say, sys_gettimeofday, into a different
> syscall and gets completely bogus results on a vsyscall-using distro.
> 
> It's unlikely that fixing this later will cause issues, but it would be
> nice to nail down and document the vsyscall quirks for the first
> released kernel with seccomp mode 2 support.
> 
> (Patch 2/2 is very much optional.  It fixes a strange corner case.  It
>  ought to be fine for 3.6, since I very much doubt that any real code
>  will hit that corner case and cause ABI problems.)
> 
> Andy Lutomirski (2):
>   seccomp: Make syscall skipping and nr changes more consistent
>   seccomp: Future-proof against silly tracers
> 
>  Documentation/prctl/seccomp_filter.txt |   74 ++++++++++++++++++++--
>  arch/x86/include/asm/syscall.h         |   11 +++
>  arch/x86/kernel/vsyscall_64.c          |  110 +++++++++++++++++---------------
>  kernel/seccomp.c                       |   28 +++++++-
>  4 files changed, 163 insertions(+), 60 deletions(-)

What ever happened to these patches?  I don't see them in 3.6-rc7, are
they pending for 3.7?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3.5 0/2] seccomp and vsyscall fixes
  2012-09-27 17:36 ` Greg KH
@ 2012-09-27 19:40   ` Andy Lutomirski
  2012-09-27 20:09     ` Greg KH
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2012-09-27 19:40 UTC (permalink / raw)
  To: Greg KH
  Cc: Linus Torvalds, linux-kernel, Will Drewry, Kees Cook, James Morris

[cc James Morris]

This has been pending since the 3.6 merge window.  Patch 2/2 barely
matters because it's almost impossible to detect its effect -- it's
more about future proofing against new architectures.  Patch 1/2 has
been slightly tweaked here:

http://lkml.kernel.org/r/%3C744e07394a02be3d3ef52c22ccedb24d9a478fe1.1343869850.git.luto@amacapital.net%3E

and will soon appear here (once the cache refreshes)

https://git.kernel.org/?p=linux/kernel/git/luto/linux.git;a=shortlog;h=refs/heads/seccomp-vsyscall/patch_v2

I can wait for someone to pick it up or I can send a pull request from
my tree.  FWIW, the same patch applies cleanly to -next.

--Andy

On Thu, Sep 27, 2012 at 10:36 AM, Greg KH <gregkh@linuxfoundation.org> wrote:
> On Tue, Jul 17, 2012 at 04:19:18PM -0700, Andy Lutomirski wrote:
>> Apologies for the lateness of this stuff.  I was at a conference last
>> week when the Chrome issue was discovered and I couldn't do this
>> properly until I got back.
>>
>> Will, can you confirm that this version is okay and passes your tests?
>> It passes mine.
>>
>> While there are no known seccomp users that will have trouble,
>> SECCOMP_RET_TRAP and SECCOMP_RET_TRACE currently interact oddly with
>> emulated vsyscalls.  This might lead to ABI issues down the road (if
>> something starts to rely on current behavior) or unexpected malfunctions
>> (if something tries to change, say, sys_gettimeofday, into a different
>> syscall and gets completely bogus results on a vsyscall-using distro.
>>
>> It's unlikely that fixing this later will cause issues, but it would be
>> nice to nail down and document the vsyscall quirks for the first
>> released kernel with seccomp mode 2 support.
>>
>> (Patch 2/2 is very much optional.  It fixes a strange corner case.  It
>>  ought to be fine for 3.6, since I very much doubt that any real code
>>  will hit that corner case and cause ABI problems.)
>>
>> Andy Lutomirski (2):
>>   seccomp: Make syscall skipping and nr changes more consistent
>>   seccomp: Future-proof against silly tracers
>>
>>  Documentation/prctl/seccomp_filter.txt |   74 ++++++++++++++++++++--
>>  arch/x86/include/asm/syscall.h         |   11 +++
>>  arch/x86/kernel/vsyscall_64.c          |  110 +++++++++++++++++---------------
>>  kernel/seccomp.c                       |   28 +++++++-
>>  4 files changed, 163 insertions(+), 60 deletions(-)
>
> What ever happened to these patches?  I don't see them in 3.6-rc7, are
> they pending for 3.7?
>
> thanks,
>
> greg k-h



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3.5 0/2] seccomp and vsyscall fixes
  2012-09-27 19:40   ` Andy Lutomirski
@ 2012-09-27 20:09     ` Greg KH
  0 siblings, 0 replies; 16+ messages in thread
From: Greg KH @ 2012-09-27 20:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, linux-kernel, Will Drewry, Kees Cook, James Morris

On Thu, Sep 27, 2012 at 12:40:10PM -0700, Andy Lutomirski wrote:
> [cc James Morris]
> 
> This has been pending since the 3.6 merge window.  Patch 2/2 barely
> matters because it's almost impossible to detect its effect -- it's
> more about future proofing against new architectures.  Patch 1/2 has
> been slightly tweaked here:
> 
> http://lkml.kernel.org/r/%3C744e07394a02be3d3ef52c22ccedb24d9a478fe1.1343869850.git.luto@amacapital.net%3E
> 
> and will soon appear here (once the cache refreshes)
> 
> https://git.kernel.org/?p=linux/kernel/git/luto/linux.git;a=shortlog;h=refs/heads/seccomp-vsyscall/patch_v2
> 
> I can wait for someone to pick it up or I can send a pull request from
> my tree.  FWIW, the same patch applies cleanly to -next.

You might want to resend them, in patch form, to James, if he's missed
them already.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2012-09-27 20:09 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-17 23:19 [PATCH 3.5 0/2] seccomp and vsyscall fixes Andy Lutomirski
2012-07-17 23:19 ` [PATCH 3.5 1/2] seccomp: Make syscall skipping and nr changes more consistent Andy Lutomirski
2012-07-18 18:31   ` Will Drewry
2012-07-18 20:00     ` Andy Lutomirski
2012-07-26 15:43   ` Andy Lutomirski
2012-07-17 23:19 ` [PATCH 3.5 2/2] seccomp: Future-proof against silly tracers Andy Lutomirski
2012-07-18  2:13   ` Will Drewry
2012-07-18 18:35     ` Will Drewry
2012-07-18 20:08       ` Andy Lutomirski
2012-07-26 15:41   ` Andy Lutomirski
2012-08-02 14:32     ` Will Drewry
2012-08-02 16:54       ` Andy Lutomirski
2012-07-18  2:10 ` [PATCH 3.5 0/2] seccomp and vsyscall fixes Will Drewry
2012-09-27 17:36 ` Greg KH
2012-09-27 19:40   ` Andy Lutomirski
2012-09-27 20:09     ` Greg KH

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.