[PATCH RFC] seccomp: Implement syscall isolation based on memory areas

* [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
@ 2020-05-30  5:59 Gabriel Krisman Bertazi
  2020-05-30 17:30 ` Kees Cook
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-05-30  5:59 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Gabriel Krisman Bertazi, kernel, Thomas Gleixner,
	Kees Cook, Andy Lutomirski, Will Drewry, H . Peter Anvin,
	Paul Gofman

Modern Windows applications are executing system call instructions
directly from the application's code without going through the WinAPI.
This breaks Wine emulation, because it doesn't have a chance to
intercept and emulate these syscalls before they are submitted to Linux.

In addition, we cannot simply trap every system call of the application
to userspace using PTRACE_SYSEMU, because performance would suffer,
since our main use case is to run Windows games over Linux.  Therefore,
we need some in-kernel filtering to decide whether the syscall was
issued by the wine code or by the windows application.

The filtering cannot really be done based solely on the syscall number,
because those could collide with existing Linux syscalls.  Instead, our
proposed solution is to trap syscalls based on the userspace memory
region that triggered the syscall, as wine is responsible for the
Windows code allocations and it can apply correct memory protections to
those areas.

Therefore, this patch reuses the seccomp infrastructure to trap
system calls, but introduces a new mode to trap based on a vma attribute
that describes whether the userspace memory region is allowed to execute
syscalls or not.  The protection is defined at mmap/mprotect time with a
new protection flag PROT_NOSYSCALL.  This setting only takes effect if
the new SECCOMP_MODE_MEMMAP is enabled through seccomp().

It goes without saying that this is in no way a security mechanism
despite being built on top of seccomp, since an evil application can
always jump to a whitelisted memory region and run the syscall.  This
is not a concern for Wine games.  Nevertheless, we reuse seccomp as a
way to avoid adding a new mechanism to essentially do the same job of
filtering system calls.

* Why not SECCOMP_MODE_FILTER?

We experimented with dynamically generating BPF filters for whitelisted
memory regions and using SECCOMP_MODE_FILTER, but there are a few
reasons why it isn't enough nor a good idea for our use case:

1. We cannot set the filters at program initialization time and forget
about it, since there is no way of knowing which modules will be loaded,
whether native and windows.  Filter would need a way to be updated
frequently during game execution.

2. We cannot predict which Linux libraries will issue syscalls directly.
Most of the time, whitelisting libc and a few other libraries is enough,
but there are no guarantees other Linux libraries won't issue syscalls
directly and break the execution.  Adding every linux library that is
loaded also has a large performance cost due to the large resulting
filter.

3. As I mentioned before, performance is critical.  In our testing with
just a single memory segment blacklisted/whitelisted, the minimum size
of a bpf filter would be 4 instructions.  In that scenario,
SECCOMP_MODE_FILTER added an average overhead of 10% to the execution
time of sysinfo(2) in comparison to seccomp disabled, while the impact
of SECCOMP_MODE_MEMMAP was averaged around 1.5%.

Indeed, points 1 and 2 could be worked around with some userspace work
and improved SECCOMP_MODE_FILTER support, but at a high performance and
some stability cost, to obtain the semantics we want.  Still, the
performance would suffer, and SECCOMP_MODE_MEMMAP is non intrusive
enough that I believe it should be considered as an upstream solution.

Sending as an RFC for now to get the discussion started.  In particular:

1) Would this design be acceptable?

2) I'm not comfortable with using the high part of vm_flags, though I'm
not sure where else it would fit.  Suggestions?

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Will Drewry <wad@chromium.org>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Paul Gofman <gofmanp@gmail.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
---
 arch/Kconfig                           | 21 +++++++++
 arch/x86/Kconfig                       |  1 +
 include/linux/mm.h                     |  8 ++++
 include/linux/mman.h                   |  4 +-
 include/uapi/asm-generic/mman-common.h |  2 +
 include/uapi/linux/seccomp.h           |  2 +
 kernel/seccomp.c                       | 59 ++++++++++++++++++++++++++
 mm/mprotect.c                          |  2 +-
 8 files changed, 97 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 786a85d4ad40..e5c10231c9c2 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -471,6 +471,27 @@ config SECCOMP_FILTER
 
 	  See Documentation/userspace-api/seccomp_filter.rst for details.
 
+config HAVE_ARCH_SECCOMP_MEMMAP
+	bool
+	help
+	  An arch should select this symbol if it provides all of these things:
+	  - syscall_get_arch()
+	  - syscall_get_arguments()
+	  - syscall_rollback()
+	  - syscall_set_return_value()
+	  - SIGSYS siginfo_t support
+	  - secure_computing is called from a ptrace_event()-safe context
+	  - secure_computing return value is checked and a return value of -1
+	    results in the system call being skipped immediately.
+	  - seccomp syscall wired up
+
+config SECCOMP_MEMMAP
+	def_bool y
+	depends on HAVE_ARCH_SECCOMP_MEMMAP && SECCOMP
+	help
+	  Enable tasks to trap syscalls based on the page that triggered
+	  the syscall.
+
 config HAVE_ARCH_STACKLEAK
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2d3f963fd6f1..70998b01c533 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -145,6 +145,7 @@ config X86
 	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
 	select HAVE_ARCH_PREL32_RELOCATIONS
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_SECCOMP_MEMMAP
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_STACKLEAK
 	select HAVE_ARCH_TRACEHOOK
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a323422d783..6271fb7fd5bc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -294,11 +294,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -340,6 +342,12 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_GROWSUP	VM_NONE
 #endif
 
+#if defined(CONFIG_X86)
+# define VM_NOSYSCALL	VM_HIGH_ARCH_5
+#else
+# define VM_NOSYSCALL	VM_NONE
+#endif
+
 /* Bits set in the VMA until the stack is in its final location */
 #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
 
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 4b08e9c9c538..a5ca42eb685a 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -94,7 +94,8 @@ static inline void vm_unacct_memory(long pages)
  */
 static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
 {
-	return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM)) == 0;
+	return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM
+			 | PROT_NOSYSCALL)) == 0;
 }
 #define arch_validate_prot arch_validate_prot
 #endif
@@ -119,6 +120,7 @@ calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
 	return _calc_vm_trans(prot, PROT_READ,  VM_READ ) |
 	       _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
 	       _calc_vm_trans(prot, PROT_EXEC,  VM_EXEC) |
+	       _calc_vm_trans(prot, PROT_NOSYSCALL, VM_NOSYSCALL) |
 	       arch_calc_vm_prot_bits(prot, pkey);
 }
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f94f65d429be..4eafbaa3f4bc 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -13,6 +13,8 @@
 #define PROT_SEM	0x8		/* page may be used for atomic ops */
 /*			0x10		   reserved for arch-specific use */
 /*			0x20		   reserved for arch-specific use */
+#define PROT_NOSYSCALL	0x40		/* page cannot trigger syscalls */
+
 #define PROT_NONE	0x0		/* page can not be accessed */
 #define PROT_GROWSDOWN	0x01000000	/* mprotect flag: extend change to start of growsdown vma */
 #define PROT_GROWSUP	0x02000000	/* mprotect flag: extend change to end of growsup vma */
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index c1735455bc53..c7d042a409e7 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -10,12 +10,14 @@
 #define SECCOMP_MODE_DISABLED	0 /* seccomp is not in use. */
 #define SECCOMP_MODE_STRICT	1 /* uses hard-coded filter. */
 #define SECCOMP_MODE_FILTER	2 /* uses user-supplied filter. */
+#define SECCOMP_MODE_MEMMAP	3 /* Lock syscalls per memory region. */
 
 /* Valid operations for seccomp syscall. */
 #define SECCOMP_SET_MODE_STRICT		0
 #define SECCOMP_SET_MODE_FILTER		1
 #define SECCOMP_GET_ACTION_AVAIL	2
 #define SECCOMP_GET_NOTIF_SIZES		3
+#define SECCOMP_SET_MODE_MEMMAP		4
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
 #define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 55a6184f5990..ebf09b02db8d 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -930,6 +930,55 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 }
 #endif
 
+#ifdef CONFIG_SECCOMP_MEMMAP
+static int __seccomp_memmap(int this_syscall, const struct seccomp_data *sd)
+{
+	struct vm_area_struct *vma = find_vma(current->mm,
+					      sd->instruction_pointer);
+	if (!vma)
+		BUG();
+
+	if (!(vma->vm_flags & VM_NOSYSCALL))
+		return 0;
+
+	syscall_rollback(current, task_pt_regs(current));
+	seccomp_send_sigsys(this_syscall, 0);
+
+	seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_TRAP, true);
+
+	return -1;
+}
+
+static long seccomp_set_mode_memmap(unsigned int flags)
+{
+	const unsigned long seccomp_mode = SECCOMP_MODE_MEMMAP;
+	long ret = 0;
+
+	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
+		return -EINVAL;
+
+	spin_lock_irq(&current->sighand->siglock);
+
+	if (seccomp_may_assign_mode(seccomp_mode))
+		seccomp_assign_mode(current, seccomp_mode, flags);
+	else
+		ret = -EINVAL;
+
+	spin_unlock_irq(&current->sighand->siglock);
+
+	return ret;
+}
+#else
+static int __seccomp_memmap(int this_syscall, const struct seccomp_data *sd)
+{
+	BUG();
+}
+static long seccomp_set_mode_memmap(unsigned int flags)
+{
+	return -EINVAL;
+}
+#endif /* CONFIG_SECCOMP_MEMMAP */
+
 int __secure_computing(const struct seccomp_data *sd)
 {
 	int mode = current->seccomp.mode;
@@ -948,6 +997,8 @@ int __secure_computing(const struct seccomp_data *sd)
 		return 0;
 	case SECCOMP_MODE_FILTER:
 		return __seccomp_filter(this_syscall, sd, false);
+	case SECCOMP_MODE_MEMMAP:
+		return __seccomp_memmap(this_syscall, sd);
 	default:
 		BUG();
 	}
@@ -1425,6 +1476,10 @@ static long do_seccomp(unsigned int op, unsigned int flags,
 			return -EINVAL;
 
 		return seccomp_get_notif_sizes(uargs);
+	case SECCOMP_SET_MODE_MEMMAP:
+		if (uargs != NULL)
+			return -EINVAL;
+		return seccomp_set_mode_memmap(flags);
 	default:
 		return -EINVAL;
 	}
@@ -1462,6 +1517,10 @@ long prctl_set_seccomp(unsigned long seccomp_mode, void __user *filter)
 		op = SECCOMP_SET_MODE_FILTER;
 		uargs = filter;
 		break;
+	case SECCOMP_MODE_MEMMAP:
+		op = SECCOMP_SET_MODE_MEMMAP;
+		uargs = NULL;
+		break;
 	default:
 		return -EINVAL;
 	}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 494192ca954b..6b5c00e8bbdc 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -591,7 +591,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 		 * cleared from the VMA.
 		 */
 		mask_off_old_flags = VM_READ | VM_WRITE | VM_EXEC |
-					VM_FLAGS_CLEAR;
+			VM_NOSYSCALL | VM_FLAGS_CLEAR;
 
 		new_vma_pkey = arch_override_mprotect_pkey(vma, prot, pkey);
 		newflags = calc_vm_prot_bits(prot, new_vma_pkey);
-- 
2.27.0.rc2


^ permalink raw reply related	[flat|nested] 32+ messages in thread