linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
@ 2020-05-30  5:59 Gabriel Krisman Bertazi
  2020-05-30 17:30 ` Kees Cook
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-05-30  5:59 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Gabriel Krisman Bertazi, kernel, Thomas Gleixner,
	Kees Cook, Andy Lutomirski, Will Drewry, H . Peter Anvin,
	Paul Gofman

Modern Windows applications execute system call instructions directly
from the application's code, without going through the WinAPI.  This
breaks Wine emulation, because Wine has no chance to intercept and
emulate these syscalls before they are submitted to Linux.

In addition, we cannot simply trap every system call of the application
to userspace using PTRACE_SYSEMU, because performance would suffer;
our main use case is running Windows games over Linux.  Therefore, we
need some in-kernel filtering to decide whether the syscall was issued
by the Wine code or by the Windows application.

The filtering cannot be done based solely on the syscall number,
because Windows syscall numbers can collide with existing Linux
syscalls.  Instead, our proposed solution is to trap syscalls based on
the userspace memory region that triggered the syscall, since Wine is
responsible for the Windows code allocations and can apply the correct
memory protections to those areas.

Therefore, this patch reuses the seccomp infrastructure to trap system
calls, but introduces a new mode that traps based on a vma attribute
describing whether the userspace memory region is allowed to execute
syscalls.  The protection is defined at mmap/mprotect time with a new
protection flag, PROT_NOSYSCALL.  This setting only takes effect if the
new SECCOMP_MODE_MEMMAP is enabled through seccomp().
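
A minimal userspace sketch of the intended usage (not part of the
patch; PROT_NOSYSCALL and SECCOMP_SET_MODE_MEMMAP are the constants
introduced by this patch, and error handling is kept to a minimum):

#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
	size_t len = 2 * 1024 * 1024;

	/* Region where the Windows application code will be mapped. */
	void *win_code = mmap(NULL, len,
			      PROT_READ | PROT_WRITE | PROT_EXEC |
			      PROT_NOSYSCALL,
			      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (win_code == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* From now on, syscalls issued from win_code raise SIGSYS. */
	if (syscall(SYS_seccomp, SECCOMP_SET_MODE_MEMMAP, 0, NULL)) {
		perror("seccomp");
		return 1;
	}

	/* ... load the Windows code into win_code and jump to it ... */
	return 0;
}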

It goes without saying that this is in no way a security mechanism
despite being built on top of seccomp, since an evil application can
always jump to a whitelisted memory region and run the syscall.  This
is not a concern for Wine games.  Nevertheless, we reuse seccomp as a
way to avoid adding a new mechanism to essentially do the same job of
filtering system calls.

* Why not SECCOMP_MODE_FILTER?

We experimented with dynamically generating BPF filters for whitelisted
memory regions and using SECCOMP_MODE_FILTER, but there are a few
reasons why it is neither sufficient nor a good idea for our use case:

1. We cannot set the filters at program initialization time and forget
about them, since there is no way of knowing which modules, whether
native or Windows, will be loaded.  The filter would need to be updated
frequently during game execution.

2. We cannot predict which Linux libraries will issue syscalls
directly.  Most of the time, whitelisting libc and a few other
libraries is enough, but there is no guarantee that other Linux
libraries won't issue syscalls directly and break the execution.
Whitelisting every Linux library that gets loaded also has a large
performance cost, due to the large resulting filter.

3. As mentioned before, performance is critical.  In our testing, with
just a single memory segment blacklisted/whitelisted, the minimum size
of a BPF filter is 4 instructions (see the sketch below).  In that
scenario, SECCOMP_MODE_FILTER added an average overhead of 10% to the
execution time of sysinfo(2) compared to seccomp disabled, while the
impact of SECCOMP_MODE_MEMMAP averaged around 1.5%.
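
For reference, an illustrative 4-instruction filter of the kind used in
that measurement (not taken from the patch): it allows syscalls whose
instruction pointer falls at or above a single hypothetical boundary
and traps everything below it.  For brevity it only checks the low 32
bits of seccomp_data->instruction_pointer; a real filter would have to
compare both 32-bit halves.

#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#define ALLOWED_CODE_START 0x40000000U	/* hypothetical boundary */

struct sock_filter memmap_filter[] = {
	/* A <- low 32 bits of the instruction pointer at syscall time */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, instruction_pointer)),
	/* Allow if it is at or above the whitelisted boundary */
	BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, ALLOWED_CODE_START, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
};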

Indeed, points 1 and 2 could be worked around with some userspace work
and improved SECCOMP_MODE_FILTER support, but only at a high
performance and stability cost for the semantics we want.  Even then,
performance would suffer, and SECCOMP_MODE_MEMMAP is non-intrusive
enough that I believe it should be considered as an upstream solution.

Sending as an RFC for now to get the discussion started.  In particular:

1) Would this design be acceptable?

2) I'm not comfortable with using the high part of vm_flags, though I'm
not sure where else it would fit.  Suggestions?

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Will Drewry <wad@chromium.org>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Paul Gofman <gofmanp@gmail.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
---
 arch/Kconfig                           | 21 +++++++++
 arch/x86/Kconfig                       |  1 +
 include/linux/mm.h                     |  8 ++++
 include/linux/mman.h                   |  4 +-
 include/uapi/asm-generic/mman-common.h |  2 +
 include/uapi/linux/seccomp.h           |  2 +
 kernel/seccomp.c                       | 59 ++++++++++++++++++++++++++
 mm/mprotect.c                          |  2 +-
 8 files changed, 97 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 786a85d4ad40..e5c10231c9c2 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -471,6 +471,27 @@ config SECCOMP_FILTER
 
 	  See Documentation/userspace-api/seccomp_filter.rst for details.
 
+config HAVE_ARCH_SECCOMP_MEMMAP
+	bool
+	help
+	  An arch should select this symbol if it provides all of these things:
+	  - syscall_get_arch()
+	  - syscall_get_arguments()
+	  - syscall_rollback()
+	  - syscall_set_return_value()
+	  - SIGSYS siginfo_t support
+	  - secure_computing is called from a ptrace_event()-safe context
+	  - secure_computing return value is checked and a return value of -1
+	    results in the system call being skipped immediately.
+	  - seccomp syscall wired up
+
+config SECCOMP_MEMMAP
+	def_bool y
+	depends on HAVE_ARCH_SECCOMP_MEMMAP && SECCOMP
+	help
+	  Enable tasks to trap syscalls based on the page that triggered
+	  the syscall.
+
 config HAVE_ARCH_STACKLEAK
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2d3f963fd6f1..70998b01c533 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -145,6 +145,7 @@ config X86
 	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
 	select HAVE_ARCH_PREL32_RELOCATIONS
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_SECCOMP_MEMMAP
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_STACKLEAK
 	select HAVE_ARCH_TRACEHOOK
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a323422d783..6271fb7fd5bc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -294,11 +294,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -340,6 +342,12 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_GROWSUP	VM_NONE
 #endif
 
+#if defined(CONFIG_X86)
+# define VM_NOSYSCALL	VM_HIGH_ARCH_5
+#else
+# define VM_NOSYSCALL	VM_NONE
+#endif
+
 /* Bits set in the VMA until the stack is in its final location */
 #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
 
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 4b08e9c9c538..a5ca42eb685a 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -94,7 +94,8 @@ static inline void vm_unacct_memory(long pages)
  */
 static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
 {
-	return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM)) == 0;
+	return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM
+			 | PROT_NOSYSCALL)) == 0;
 }
 #define arch_validate_prot arch_validate_prot
 #endif
@@ -119,6 +120,7 @@ calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
 	return _calc_vm_trans(prot, PROT_READ,  VM_READ ) |
 	       _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
 	       _calc_vm_trans(prot, PROT_EXEC,  VM_EXEC) |
+	       _calc_vm_trans(prot, PROT_NOSYSCALL, VM_NOSYSCALL) |
 	       arch_calc_vm_prot_bits(prot, pkey);
 }
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f94f65d429be..4eafbaa3f4bc 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -13,6 +13,8 @@
 #define PROT_SEM	0x8		/* page may be used for atomic ops */
 /*			0x10		   reserved for arch-specific use */
 /*			0x20		   reserved for arch-specific use */
+#define PROT_NOSYSCALL	0x40		/* page cannot trigger syscalls */
+
 #define PROT_NONE	0x0		/* page can not be accessed */
 #define PROT_GROWSDOWN	0x01000000	/* mprotect flag: extend change to start of growsdown vma */
 #define PROT_GROWSUP	0x02000000	/* mprotect flag: extend change to end of growsup vma */
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index c1735455bc53..c7d042a409e7 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -10,12 +10,14 @@
 #define SECCOMP_MODE_DISABLED	0 /* seccomp is not in use. */
 #define SECCOMP_MODE_STRICT	1 /* uses hard-coded filter. */
 #define SECCOMP_MODE_FILTER	2 /* uses user-supplied filter. */
+#define SECCOMP_MODE_MEMMAP	3 /* Lock syscalls per memory region. */
 
 /* Valid operations for seccomp syscall. */
 #define SECCOMP_SET_MODE_STRICT		0
 #define SECCOMP_SET_MODE_FILTER		1
 #define SECCOMP_GET_ACTION_AVAIL	2
 #define SECCOMP_GET_NOTIF_SIZES		3
+#define SECCOMP_SET_MODE_MEMMAP		4
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
 #define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 55a6184f5990..ebf09b02db8d 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -930,6 +930,55 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 }
 #endif
 
+#ifdef CONFIG_SECCOMP_MEMMAP
+static int __seccomp_memmap(int this_syscall, const struct seccomp_data *sd)
+{
+	struct vm_area_struct *vma = find_vma(current->mm,
+					      sd->instruction_pointer);
+	if (!vma)
+		BUG();
+
+	if (!(vma->vm_flags & VM_NOSYSCALL))
+		return 0;
+
+	syscall_rollback(current, task_pt_regs(current));
+	seccomp_send_sigsys(this_syscall, 0);
+
+	seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_TRAP, true);
+
+	return -1;
+}
+
+static long seccomp_set_mode_memmap(unsigned int flags)
+{
+	const unsigned long seccomp_mode = SECCOMP_MODE_MEMMAP;
+	long ret = 0;
+
+	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
+		return -EINVAL;
+
+	spin_lock_irq(&current->sighand->siglock);
+
+	if (seccomp_may_assign_mode(seccomp_mode))
+		seccomp_assign_mode(current, seccomp_mode, flags);
+	else
+		ret = -EINVAL;
+
+	spin_unlock_irq(&current->sighand->siglock);
+
+	return ret;
+}
+#else
+static int __seccomp_memmap(int this_syscall, const struct seccomp_data *sd)
+{
+	BUG();
+}
+static long seccomp_set_mode_memmap(unsigned int flags)
+{
+	return -EINVAL;
+}
+#endif /* CONFIG_SECCOMP_MEMMAP */
+
 int __secure_computing(const struct seccomp_data *sd)
 {
 	int mode = current->seccomp.mode;
@@ -948,6 +997,8 @@ int __secure_computing(const struct seccomp_data *sd)
 		return 0;
 	case SECCOMP_MODE_FILTER:
 		return __seccomp_filter(this_syscall, sd, false);
+	case SECCOMP_MODE_MEMMAP:
+		return __seccomp_memmap(this_syscall, sd);
 	default:
 		BUG();
 	}
@@ -1425,6 +1476,10 @@ static long do_seccomp(unsigned int op, unsigned int flags,
 			return -EINVAL;
 
 		return seccomp_get_notif_sizes(uargs);
+	case SECCOMP_SET_MODE_MEMMAP:
+		if (uargs != NULL)
+			return -EINVAL;
+		return seccomp_set_mode_memmap(flags);
 	default:
 		return -EINVAL;
 	}
@@ -1462,6 +1517,10 @@ long prctl_set_seccomp(unsigned long seccomp_mode, void __user *filter)
 		op = SECCOMP_SET_MODE_FILTER;
 		uargs = filter;
 		break;
+	case SECCOMP_MODE_MEMMAP:
+		op = SECCOMP_SET_MODE_MEMMAP;
+		uargs = NULL;
+		break;
 	default:
 		return -EINVAL;
 	}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 494192ca954b..6b5c00e8bbdc 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -591,7 +591,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 		 * cleared from the VMA.
 		 */
 		mask_off_old_flags = VM_READ | VM_WRITE | VM_EXEC |
-					VM_FLAGS_CLEAR;
+			VM_NOSYSCALL | VM_FLAGS_CLEAR;
 
 		new_vma_pkey = arch_override_mprotect_pkey(vma, prot, pkey);
 		newflags = calc_vm_prot_bits(prot, new_vma_pkey);
-- 
2.27.0.rc2


* Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
@ 2020-06-01  9:23 Billy Laws
  2020-06-01 13:59 ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: Billy Laws @ 2020-06-01  9:23 UTC (permalink / raw)
  To: krisman
  Cc: gofmanp, hpa, keescook, kernel, linux-kernel, linux-mm, luto, tglx, wad

> On May 30, 2020, at 5:26 PM, Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
>
> Andy Lutomirski <luto@amacapital.net> writes:
>
>>>> On May 29, 2020, at 11:00 PM, Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
>>>
>>> Modern Windows applications are executing system call instructions
>>> directly from the application's code without going through the WinAPI.
>>> This breaks Wine emulation, because it doesn't have a chance to
>>> intercept and emulate these syscalls before they are submitted to Linux.
>>>
>>> In addition, we cannot simply trap every system call of the application
>>> to userspace using PTRACE_SYSEMU, because performance would suffer,
>>> since our main use case is to run Windows games over Linux.  Therefore,
>>> we need some in-kernel filtering to decide whether the syscall was
>>> issued by the wine code or by the windows application.
>>
>> Do you really need in-kernel filtering?  What if you could have
>> efficient userspace filtering instead?  That is, set something up so
>> that all syscalls, except those from a special address, are translated
>> to CALL thunk where the thunk is configured per task.  Then the thunk
>> can do whatever emulation is needed.
>
> Hi,
>
> I suggested something similar to my customer, by using
> libsyscall-intercept.  The idea would be overwritting the syscall
> instruction with a call to the entry point.  I'm not a specialist on the
> specifics of Windows games, (cc'ed Paul Gofman, who can provide more
> details on that side), but as far as I understand, the reason why that
> is not feasible is that the anti-cheat protection in games will abort
> execution if the binary region was modified either on-disk or in-memory.
>
> Is there some mechanism to do that without modifying the application?

Hi,

I work on an emulator for the Nintendo Switch that uses a similar
technique; in our testing it works very well and is much more
performant than even PTRACE_SYSEMU.

To work around DRM reading the memory contents, I think mprotect could
be used: after patching the syscall, a copy of the original code could
be kept somewhere in memory and the patched region mapped --X.
With this, any time the DRM attempts to read the patched region to
perform its integrity checks, it will cause a segfault and a branch to
the signal handler.  This handler can then return the contents of the
original, unpatched region to satisfy those checks.

Are memory contents checked by DRM solutions too often for this to be
performant?
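
Roughly, the setup could look like the sketch below (illustrative
names only; the actual emulation of the faulting read inside the
handler is elided, and it assumes the platform can map pages
execute-only):

#include <signal.h>
#include <string.h>
#include <stdlib.h>
#include <sys/mman.h>

static unsigned char *patched_base;	/* start of the patched code */
static size_t patched_len;
static unsigned char *shadow;		/* pristine copy of the original */

static void segv_handler(int sig, siginfo_t *si, void *uctx)
{
	unsigned char *addr = si->si_addr;

	if (addr < patched_base || addr >= patched_base + patched_len)
		abort();	/* unrelated fault */

	/*
	 * A real handler would decode the faulting load from uctx and
	 * satisfy it from shadow[addr - patched_base], then advance
	 * the instruction pointer past the load.
	 */
}

static int hide_patches(unsigned char *code, size_t len)
{
	struct sigaction sa = {
		.sa_sigaction = segv_handler,
		.sa_flags = SA_SIGINFO,
	};

	shadow = malloc(len);
	if (!shadow)
		return -1;
	memcpy(shadow, code, len);	/* keep the unpatched bytes */

	patched_base = code;
	patched_len = len;

	if (sigaction(SIGSEGV, &sa, NULL))
		return -1;

	/* ... rewrite the syscall instructions in 'code' here ... */

	/* Map the patched region execute-only so reads fault. */
	return mprotect(code, len, PROT_EXEC);
}
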
--
Billy Laws
>
>> Getting the details and especially the interaction with any seccomp
>> filters that may be installed right could be tricky, but the performance
>> should be decent, at least on non-PTI systems.
>>
>> (If we go this route, I suspect that the correct interaction with
>> seccomp is that this type of redirection takes precedence over seccomp
>> and seccomp filters are not invoked for redirected syscalls. After all,
>> a redirected syscall is, functionally, not a syscall at all.)
>>
>
>
> --
> Gabriel Krisman Bertazi


Thread overview: 32+ messages
2020-05-30  5:59 [PATCH RFC] seccomp: Implement syscall isolation based on memory areas Gabriel Krisman Bertazi
2020-05-30 17:30 ` Kees Cook
2020-05-31  5:56   ` Gabriel Krisman Bertazi
2020-05-31 12:39     ` Paul Gofman
2020-05-31 16:49       ` Matthew Wilcox
2020-05-31 17:10         ` Paul Gofman
2020-05-31 17:31           ` Matthew Wilcox
2020-05-31 18:01             ` Paul Gofman
2020-06-01 17:54               ` Gabriel Krisman Bertazi
2020-06-01 17:53         ` Gabriel Krisman Bertazi
2020-05-30 22:09 ` Andy Lutomirski
2020-05-31  0:26   ` Gabriel Krisman Bertazi
2020-05-31  0:59     ` Andy Lutomirski
2020-05-31 12:56       ` Paul Gofman
2020-05-31 18:10         ` Andy Lutomirski
2020-05-31 18:36           ` Paul Gofman
2020-05-31 18:57             ` Andy Lutomirski
2020-05-31 19:37               ` Paul Gofman
2020-05-31 21:03               ` Andy Lutomirski
2020-06-01 18:06                 ` Gabriel Krisman Bertazi
2020-06-01 20:08                 ` Kees Cook
2020-06-01 23:18                   ` Andy Lutomirski
2020-06-11 19:38                 ` Gabriel Krisman Bertazi
2020-05-31 23:33               ` Brendan Shanks
2020-06-01  1:51                 ` Andy Lutomirski
2020-06-25 23:14     ` Robert O'Callahan
2020-06-25 23:48       ` Gabriel Krisman Bertazi
2020-06-26  1:03         ` Robert O'Callahan
2020-06-05  6:06 ` Sargun Dhillon
2020-06-01  9:23 Billy Laws
2020-06-01 13:59 ` Andy Lutomirski
2020-06-01 17:48   ` hpa
