[PATCH v4 0/2] Control over userfaultfd kernel-fault handling

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4 0/2] Control over userfaultfd kernel-fault handling
@ 2020-09-24  6:56 Lokesh Gidra
  2020-09-24  6:56 ` [PATCH v4 1/2] Add UFFD_USER_MODE_ONLY Lokesh Gidra
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Lokesh Gidra @ 2020-09-24  6:56 UTC (permalink / raw)
  To: Kees Cook, Jonathan Corbet, Peter Xu, Andrea Arcangeli,
	Sebastian Andrzej Siewior, Andrew Morton
  Cc: Alexander Viro, Stephen Smalley, Eric Biggers, Lokesh Gidra,
	Daniel Colascione, Joel Fernandes (Google),
	linux-fsdevel, linux-kernel, linux-doc, kaleshsingh, calin,
	surenb, nnk, jeffv, kernel-team, Mike Rapoport, Shaohua Li,
	Jerome Glisse, Mauro Carvalho Chehab, Johannes Weiner,
	Mel Gorman, Nitin Gupta, Vlastimil Babka, Iurii Zaikin,
	Luis Chamberlain

This patch series is split from [1]. The other series enables SELinux
support for userfaultfd file descriptors so that its creation and
movement can be controlled.

It has been demonstrated on various occasions that suspending kernel
code execution for an arbitrary amount of time at any access to
userspace memory (copy_from_user()/copy_to_user()/...) can be exploited
to change the intended behavior of the kernel. For instance, handling
page faults in kernel-mode using userfaultfd has been exploited in [2, 3].
Likewise, FUSE, which is similar to userfaultfd in this respect, has been
exploited in [4, 5] for similar outcome.

This small patch series adds a new flag to userfaultfd(2) that allows
callers to give up the ability to handle kernel-mode faults with the
resulting UFFD file object. It then adds a 'user-mode only' option to
the unprivileged_userfaultfd sysctl knob to require unprivileged
callers to use this new flag.

The purpose of this new interface is to decrease the chance of an
unprivileged userfaultfd user taking advantage of userfaultfd to
enhance security vulnerabilities by lengthening the race window in
kernel code.

[1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
[2] https://duasynt.com/blog/linux-kernel-heap-spray
[3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
[4] https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
[5] https://bugs.chromium.org/p/project-zero/issues/detail?id=808

Changes since v3:

  - Modified the meaning of value '0' of unprivileged_userfaultfd
    sysctl knob. Setting this knob to '0' now allows unprivileged users
    to use userfaultfd, but can handle page faults in user-mode only.
  - The default value of unprivileged_userfaultfd sysctl knob is changed
    to '0'.

Changes since v2:

  - Removed 'uffd_flags' and directly used 'UFFD_USER_MODE_ONLY' in
    userfaultfd().

Changes since v1:

  - Added external references to the threats from allowing unprivileged
    users to handle page faults from kernel-mode.
  - Removed the new sysctl knob restricting handling of page
    faults from kernel-mode, and added an option for the same
    in the existing 'unprivileged_userfaultfd' knob.

Lokesh Gidra (2):
  Add UFFD_USER_MODE_ONLY
  Add user-mode only option to unprivileged_userfaultfd sysctl knob

 Documentation/admin-guide/sysctl/vm.rst | 15 ++++++++++-----
 fs/userfaultfd.c                        | 12 +++++++++---
 include/uapi/linux/userfaultfd.h        |  9 +++++++++
 3 files changed, 28 insertions(+), 8 deletions(-)

-- 
2.28.0.681.g6f77f65b4e-goog

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v4 1/2] Add UFFD_USER_MODE_ONLY
  2020-09-24  6:56 [PATCH v4 0/2] Control over userfaultfd kernel-fault handling Lokesh Gidra
@ 2020-09-24  6:56 ` Lokesh Gidra
  2020-09-24  6:56 ` [PATCH v4 2/2] Add user-mode only option to unprivileged_userfaultfd sysctl knob Lokesh Gidra
  2020-10-07 20:26 ` [PATCH v4 0/2] Control over userfaultfd kernel-fault handling Lokesh Gidra
  2 siblings, 0 replies; 8+ messages in thread
From: Lokesh Gidra @ 2020-09-24  6:56 UTC (permalink / raw)
  To: Kees Cook, Jonathan Corbet, Peter Xu, Andrea Arcangeli,
	Sebastian Andrzej Siewior, Andrew Morton
  Cc: Alexander Viro, Stephen Smalley, Eric Biggers, Lokesh Gidra,
	Daniel Colascione, Joel Fernandes (Google),
	linux-fsdevel, linux-kernel, linux-doc, kaleshsingh, calin,
	surenb, nnk, jeffv, kernel-team, Mike Rapoport, Shaohua Li,
	Jerome Glisse, Mauro Carvalho Chehab, Johannes Weiner,
	Mel Gorman, Nitin Gupta, Vlastimil Babka, Iurii Zaikin,
	Luis Chamberlain, Daniel Colascione

userfaultfd handles page faults from both user and kernel code.
Add a new UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes
the resulting userfaultfd object refuse to handle faults from kernel
mode, treating these faults as if SIGBUS were always raised, causing
the kernel code to fail with EFAULT.

A future patch adds a knob allowing administrators to give some
processes the ability to create userfaultfd file objects only if they
pass UFFD_USER_MODE_ONLY, reducing the likelihood that these processes
will exploit userfaultfd's ability to delay kernel page faults to open
timing windows for future exploits.

Signed-off-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
---
 fs/userfaultfd.c                 | 6 +++++-
 include/uapi/linux/userfaultfd.h | 9 +++++++++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 0e4a3837da52..3191434057f3 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -405,6 +405,9 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 
 	if (ctx->features & UFFD_FEATURE_SIGBUS)
 		goto out;
+	if ((vmf->flags & FAULT_FLAG_USER) == 0 &&
+	    ctx->flags & UFFD_USER_MODE_ONLY)
+		goto out;
 
 	/*
 	 * If it's already released don't get it. This avoids to loop
@@ -1975,10 +1978,11 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
 	BUG_ON(!current->mm);
 
 	/* Check the UFFD_* constants for consistency.  */
+	BUILD_BUG_ON(UFFD_USER_MODE_ONLY & UFFD_SHARED_FCNTL_FLAGS);
 	BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
 	BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
 
-	if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
+	if (flags & ~(UFFD_SHARED_FCNTL_FLAGS | UFFD_USER_MODE_ONLY))
 		return -EINVAL;
 
 	ctx = kmem_cache_alloc(userfaultfd_ctx_cachep, GFP_KERNEL);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index e7e98bde221f..5f2d88212f7c 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -257,4 +257,13 @@ struct uffdio_writeprotect {
 	__u64 mode;
 };
 
+/*
+ * Flags for the userfaultfd(2) system call itself.
+ */
+
+/*
+ * Create a userfaultfd that can handle page faults only in user mode.
+ */
+#define UFFD_USER_MODE_ONLY 1
+
 #endif /* _LINUX_USERFAULTFD_H */
-- 
2.28.0.681.g6f77f65b4e-goog


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v4 2/2] Add user-mode only option to unprivileged_userfaultfd sysctl knob
  2020-09-24  6:56 [PATCH v4 0/2] Control over userfaultfd kernel-fault handling Lokesh Gidra
  2020-09-24  6:56 ` [PATCH v4 1/2] Add UFFD_USER_MODE_ONLY Lokesh Gidra
@ 2020-09-24  6:56 ` Lokesh Gidra
  2020-10-07 20:26 ` [PATCH v4 0/2] Control over userfaultfd kernel-fault handling Lokesh Gidra
  2 siblings, 0 replies; 8+ messages in thread
From: Lokesh Gidra @ 2020-09-24  6:56 UTC (permalink / raw)
  To: Kees Cook, Jonathan Corbet, Peter Xu, Andrea Arcangeli,
	Sebastian Andrzej Siewior, Andrew Morton
  Cc: Alexander Viro, Stephen Smalley, Eric Biggers, Lokesh Gidra,
	Daniel Colascione, Joel Fernandes (Google),
	linux-fsdevel, linux-kernel, linux-doc, kaleshsingh, calin,
	surenb, nnk, jeffv, kernel-team, Mike Rapoport, Shaohua Li,
	Jerome Glisse, Mauro Carvalho Chehab, Johannes Weiner,
	Mel Gorman, Nitin Gupta, Vlastimil Babka, Iurii Zaikin,
	Luis Chamberlain

With this change, when the knob is set to 0, it allows unprivileged
users to call userfaultfd, like when it is set to 1, but with the
restriction that page faults from only user-mode can be handled.
In this mode, an unprivileged user (without SYS_CAP_PTRACE capability)
must pass UFFD_USER_MODE_ONLY to userfaultd or the API will fail with
EPERM.

This enables administrators to reduce the likelihood that
an attacker with access to userfaultfd can delay faulting kernel
code to widen timing windows for other exploits.

The default value of this knob is changed to 0. This is required for
correct functioning of pipe mutex. However, this will fail postcopy
live migration, which will be unnoticeable to the VM guests. To avoid
this, set 'vm.userfault = 1' in /sys/sysctl.conf. For more details,
refer to Andrea's reply [1].

[1] https://lore.kernel.org/lkml/20200904033438.GI9411@redhat.com/

Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
---
 Documentation/admin-guide/sysctl/vm.rst | 15 ++++++++++-----
 fs/userfaultfd.c                        |  6 ++++--
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 4b9d2e8e9142..4263d38c3c21 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -871,12 +871,17 @@ file-backed pages is less than the high watermark in a zone.
 unprivileged_userfaultfd
 ========================
 
-This flag controls whether unprivileged users can use the userfaultfd
-system calls.  Set this to 1 to allow unprivileged users to use the
-userfaultfd system calls, or set this to 0 to restrict userfaultfd to only
-privileged users (with SYS_CAP_PTRACE capability).
+This flag controls the mode in which unprivileged users can use the
+userfaultfd system calls. Set this to 0 to restrict unprivileged users
+to handle page faults in user mode only. In this case, users without
+SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd to
+succeed. Prohibiting use of userfaultfd for handling faults from kernel
+mode may make certain vulnerabilities more difficult to exploit.
 
-The default value is 1.
+Set this to 1 to allow unprivileged users to use the userfaultfd system
+calls without any restrictions.
+
+The default value is 0.
 
 
 user_reserve_kbytes
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 3191434057f3..3816c11a986a 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -28,7 +28,7 @@
 #include <linux/security.h>
 #include <linux/hugetlb.h>
 
-int sysctl_unprivileged_userfaultfd __read_mostly = 1;
+int sysctl_unprivileged_userfaultfd __read_mostly;
 
 static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
 
@@ -1972,7 +1972,9 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
 	struct userfaultfd_ctx *ctx;
 	int fd;
 
-	if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
+	if (!sysctl_unprivileged_userfaultfd &&
+	    (flags & UFFD_USER_MODE_ONLY) == 0 &&
+	    !capable(CAP_SYS_PTRACE))
 		return -EPERM;
 
 	BUG_ON(!current->mm);
-- 
2.28.0.681.g6f77f65b4e-goog


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v4 0/2] Control over userfaultfd kernel-fault handling
  2020-09-24  6:56 [PATCH v4 0/2] Control over userfaultfd kernel-fault handling Lokesh Gidra
  2020-09-24  6:56 ` [PATCH v4 1/2] Add UFFD_USER_MODE_ONLY Lokesh Gidra
  2020-09-24  6:56 ` [PATCH v4 2/2] Add user-mode only option to unprivileged_userfaultfd sysctl knob Lokesh Gidra
@ 2020-10-07 20:26 ` Lokesh Gidra
  2020-10-08  4:01   ` Andrea Arcangeli
  2 siblings, 1 reply; 8+ messages in thread
From: Lokesh Gidra @ 2020-10-07 20:26 UTC (permalink / raw)
  To: Kees Cook, Jonathan Corbet, Peter Xu, Andrea Arcangeli,
	Sebastian Andrzej Siewior, Andrew Morton
  Cc: Alexander Viro, Stephen Smalley, Eric Biggers, Daniel Colascione,
	Joel Fernandes (Google),
	Linux FS Devel, linux-kernel, linux-doc, Kalesh Singh,
	Calin Juravle, Suren Baghdasaryan, Nick Kralevich,
	Jeffrey Vander Stoep, Cc: Android Kernel, Mike Rapoport,
	Shaohua Li, Jerome Glisse, Mauro Carvalho Chehab,
	Johannes Weiner, Mel Gorman, Nitin Gupta, Vlastimil Babka,
	Iurii Zaikin, Luis Chamberlain

On Wed, Sep 23, 2020 at 11:56 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> This patch series is split from [1]. The other series enables SELinux
> support for userfaultfd file descriptors so that its creation and
> movement can be controlled.
>
> It has been demonstrated on various occasions that suspending kernel
> code execution for an arbitrary amount of time at any access to
> userspace memory (copy_from_user()/copy_to_user()/...) can be exploited
> to change the intended behavior of the kernel. For instance, handling
> page faults in kernel-mode using userfaultfd has been exploited in [2, 3].
> Likewise, FUSE, which is similar to userfaultfd in this respect, has been
> exploited in [4, 5] for similar outcome.
>
> This small patch series adds a new flag to userfaultfd(2) that allows
> callers to give up the ability to handle kernel-mode faults with the
> resulting UFFD file object. It then adds a 'user-mode only' option to
> the unprivileged_userfaultfd sysctl knob to require unprivileged
> callers to use this new flag.
>
> The purpose of this new interface is to decrease the chance of an
> unprivileged userfaultfd user taking advantage of userfaultfd to
> enhance security vulnerabilities by lengthening the race window in
> kernel code.
>
> [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
> [2] https://duasynt.com/blog/linux-kernel-heap-spray
> [3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
> [4] https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
> [5] https://bugs.chromium.org/p/project-zero/issues/detail?id=808
>
> Changes since v3:
>
>   - Modified the meaning of value '0' of unprivileged_userfaultfd
>     sysctl knob. Setting this knob to '0' now allows unprivileged users
>     to use userfaultfd, but can handle page faults in user-mode only.
>   - The default value of unprivileged_userfaultfd sysctl knob is changed
>     to '0'.
>
Request reviewers and maintainers to please take a look.

> Changes since v2:
>
>   - Removed 'uffd_flags' and directly used 'UFFD_USER_MODE_ONLY' in
>     userfaultfd().
>
> Changes since v1:
>
>   - Added external references to the threats from allowing unprivileged
>     users to handle page faults from kernel-mode.
>   - Removed the new sysctl knob restricting handling of page
>     faults from kernel-mode, and added an option for the same
>     in the existing 'unprivileged_userfaultfd' knob.
>
> Lokesh Gidra (2):
>   Add UFFD_USER_MODE_ONLY
>   Add user-mode only option to unprivileged_userfaultfd sysctl knob
>
>  Documentation/admin-guide/sysctl/vm.rst | 15 ++++++++++-----
>  fs/userfaultfd.c                        | 12 +++++++++---
>  include/uapi/linux/userfaultfd.h        |  9 +++++++++
>  3 files changed, 28 insertions(+), 8 deletions(-)
>
> --
> 2.28.0.681.g6f77f65b4e-goog
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v4 0/2] Control over userfaultfd kernel-fault handling
  2020-10-07 20:26 ` [PATCH v4 0/2] Control over userfaultfd kernel-fault handling Lokesh Gidra
@ 2020-10-08  4:01   ` Andrea Arcangeli
  2020-10-08 23:22     ` Nick Kralevich
  0 siblings, 1 reply; 8+ messages in thread
From: Andrea Arcangeli @ 2020-10-08  4:01 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: Kees Cook, Jonathan Corbet, Peter Xu, Sebastian Andrzej Siewior,
	Andrew Morton, Alexander Viro, Stephen Smalley, Eric Biggers,
	Daniel Colascione, Joel Fernandes (Google),
	Linux FS Devel, linux-kernel, linux-doc, Kalesh Singh,
	Calin Juravle, Suren Baghdasaryan, Nick Kralevich,
	Jeffrey Vander Stoep, Cc: Android Kernel, Mike Rapoport,
	Shaohua Li, Jerome Glisse, Mauro Carvalho Chehab,
	Johannes Weiner, Mel Gorman, Nitin Gupta, Vlastimil Babka,
	Iurii Zaikin, Luis Chamberlain

Hello Lokesh,

On Wed, Oct 07, 2020 at 01:26:55PM -0700, Lokesh Gidra wrote:
> On Wed, Sep 23, 2020 at 11:56 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> >
> > This patch series is split from [1]. The other series enables SELinux
> > support for userfaultfd file descriptors so that its creation and
> > movement can be controlled.
> >
> > It has been demonstrated on various occasions that suspending kernel
> > code execution for an arbitrary amount of time at any access to
> > userspace memory (copy_from_user()/copy_to_user()/...) can be exploited
> > to change the intended behavior of the kernel. For instance, handling
> > page faults in kernel-mode using userfaultfd has been exploited in [2, 3].
> > Likewise, FUSE, which is similar to userfaultfd in this respect, has been
> > exploited in [4, 5] for similar outcome.
> >
> > This small patch series adds a new flag to userfaultfd(2) that allows
> > callers to give up the ability to handle kernel-mode faults with the
> > resulting UFFD file object. It then adds a 'user-mode only' option to
> > the unprivileged_userfaultfd sysctl knob to require unprivileged
> > callers to use this new flag.
> >
> > The purpose of this new interface is to decrease the chance of an
> > unprivileged userfaultfd user taking advantage of userfaultfd to
> > enhance security vulnerabilities by lengthening the race window in
> > kernel code.
> >
> > [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
> > [2] https://duasynt.com/blog/linux-kernel-heap-spray
> > [3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit

I've looking at those links and I've been trying to verify the link
[3] is relevant.

Specifically I've been trying to verify if 1) current state of the art
modern SLUB randomization techniques already enabled in production and
rightfully wasting some CPU in all enterprise kernels to prevent
things like above to become an issue in practice 2) combined with the
fact different memcg need to share the same kmemcaches (which was
incidentally fixed a few months ago upstream) and 3) further
robustness enhancements against exploits in the slub metadata, may
already render the exploit [3] from 2016 irrelevant in practice.

So I started by trying to reproduce [3] by building 4.5.1 with a
.config with no robustness features and I booted it on fedora-32 or
gentoo userland and I cannot even invoke call_usermodehelper. Calling
socket(22, AF_INET, 0) won't invoke such function. Can you reproduce
on 4.5.1? Which kernel .config should I use to build 4.5.1 in order
for call_usermodehelper to be invoked by the exploit? Could you help
to verify it?

It even has uninitialized variable spawning random perrors so it
doesn't give a warm fuzzy feeling:

====
int main(int argc, char **argv) {
	void *region, *map;
	              ^^^^^
	pthread_t uffd_thread;
	int uffd, msqid, i;

	region = (void *)mmap((void *)0x40000000, 0x2000, PROT_READ|PROT_WRITE,
                               MAP_FIXED|MAP_PRIVATE|MAP_ANON, -1, 0);

	if (!region) {
		perror("mmap");
		exit(2);
	}

	setup_pagefault(region + 0x1000, 0x1000, 1);

	printf("my pid = %d\n", getpid());

	if (!map) {
	^^^^^^^^
		perror("mmap");
====

The whole point of being able to reproduce on 4.5.1 is then to
simulate if the same exploit would also reproduce on current kernels
with all enterprise default robustness features enabled. Or did
anybody already verify it?

Anyway the links I was playing with are all in the cover letter, the
cover letter is not as important as the actual patches. The actual
patches looks fine to me.

The only improvement I can think of is, what about to add a
printk_once to suggest to toggle the sysctl if userfaultfd bails out
because the process lacks the CAP_SYS_PTRACE capability? That would
facilitate the /etc/sysctl.conf or tuned tweaking in case the apps
aren't verbose enough.

It's not relevant anymore with this latest patchset, but about the
previous argument that seccomp couldn't be used in all Android
processes because of performance concern, I'm slightly confused.

https://android-developers.googleblog.com/2017/07/seccomp-filter-in-android-o.html

"Android O includes a single seccomp filter installed into zygote, the
process from which all the Android applications are derived. Because
the filter is installed into zygote—and therefore all apps—the Android
security team took extra caution to not break existing apps"

Example:

$ uname -mo
aarch64 Android
$ cat swapoff.c
#include <sys/swap.h>

int main()
{
        swapoff("");
}
$ gcc swapoff.c -o swapoff -O2
$ ./swapoff
Bad system call
$

It's hard to imagine what is more performance critical than the zygote
process and the actual apps as above?

It's also hard to imagine what kind of performance concern can arise
by adding seccomp filters also to background system apps that
generally should consume ~0% of CPU.

If performance is really a concern, the BPF JIT representation with
the bitmap to be able to run the filter in O(1) sounds a better
solution than not adding ad-hoc filters and it's being worked on for
x86-64 and can be ported to aarch64 too. Many of the standalone
background processes likely wouldn't even use uffd at all so you could
block the user initiated faults too that way.

Ultimately because of issues as [3] (be them still relevant or not, to
be double checked), no matter if through selinux, seccomp or a
different sysctl value, without this patchset applied the default
behavior of the userfaultfd syscall for all Linux binaries running on
Android kernels, would deviate from the upstream kernel. So even if we
would make the pipe mutex logic more complex the deviation would
remain. Your patchset adds much less risk of breakage than adding a
timeout to kernel initiated userfaults and it resolves all concerns as
well as a timeout. We'll also make better use of the "0" value this
way. So while I'm not certain this is the best for the long term, this
looks the sweet spot for the short term to resolve many issues at
once.

Thanks!
Andrea

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v4 0/2] Control over userfaultfd kernel-fault handling
  2020-10-08  4:01   ` Andrea Arcangeli
@ 2020-10-08 23:22     ` Nick Kralevich
  2020-10-22 20:38       ` Lokesh Gidra
  2020-10-24  5:28       ` Andrea Arcangeli
  0 siblings, 2 replies; 8+ messages in thread
From: Nick Kralevich @ 2020-10-08 23:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Lokesh Gidra, Kees Cook, Jonathan Corbet, Peter Xu,
	Sebastian Andrzej Siewior, Andrew Morton, Alexander Viro,
	Stephen Smalley, Eric Biggers, Daniel Colascione,
	Joel Fernandes (Google),
	Linux FS Devel, linux-kernel, linux-doc, Kalesh Singh,
	Calin Juravle, Suren Baghdasaryan, Jeffrey Vander Stoep,
	Cc: Android Kernel, Mike Rapoport, Shaohua Li, Jerome Glisse,
	Mauro Carvalho Chehab, Johannes Weiner, Mel Gorman, Nitin Gupta,
	Vlastimil Babka, Iurii Zaikin, Luis Chamberlain

On Wed, Oct 7, 2020 at 9:01 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Hello Lokesh,
>
> On Wed, Oct 07, 2020 at 01:26:55PM -0700, Lokesh Gidra wrote:
> > On Wed, Sep 23, 2020 at 11:56 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > >
> > > This patch series is split from [1]. The other series enables SELinux
> > > support for userfaultfd file descriptors so that its creation and
> > > movement can be controlled.
> > >
> > > It has been demonstrated on various occasions that suspending kernel
> > > code execution for an arbitrary amount of time at any access to
> > > userspace memory (copy_from_user()/copy_to_user()/...) can be exploited
> > > to change the intended behavior of the kernel. For instance, handling
> > > page faults in kernel-mode using userfaultfd has been exploited in [2, 3].
> > > Likewise, FUSE, which is similar to userfaultfd in this respect, has been
> > > exploited in [4, 5] for similar outcome.
> > >
> > > This small patch series adds a new flag to userfaultfd(2) that allows
> > > callers to give up the ability to handle kernel-mode faults with the
> > > resulting UFFD file object. It then adds a 'user-mode only' option to
> > > the unprivileged_userfaultfd sysctl knob to require unprivileged
> > > callers to use this new flag.
> > >
> > > The purpose of this new interface is to decrease the chance of an
> > > unprivileged userfaultfd user taking advantage of userfaultfd to
> > > enhance security vulnerabilities by lengthening the race window in
> > > kernel code.
> > >
> > > [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
> > > [2] https://duasynt.com/blog/linux-kernel-heap-spray
> > > [3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
>
> I've looking at those links and I've been trying to verify the link
> [3] is relevant.
>
> Specifically I've been trying to verify if 1) current state of the art
> modern SLUB randomization techniques already enabled in production and
> rightfully wasting some CPU in all enterprise kernels to prevent
> things like above to become an issue in practice 2) combined with the
> fact different memcg need to share the same kmemcaches (which was
> incidentally fixed a few months ago upstream) and 3) further
> robustness enhancements against exploits in the slub metadata, may
> already render the exploit [3] from 2016 irrelevant in practice.

It's quite possible that some other mitigation was helpful against the
technique used by this particular exploit. It's the nature of exploits
that they are fragile and will change as new soft mitigations are
introduced. The effectiveness of a particular exploit mitigation
change is orthogonal to the change presented here.

The purpose of this change is to prevent an attacker from suspending
kernel code execution and having kernel data structures in a
predictable state. This makes it harder for an attacker to "win" race
conditions against various kernel data structures. This change
compliments other kernel hardening changes such as the changes you've
referenced above. Focusing on one particular exploit somewhat misses
the point of this change.

>
> So I started by trying to reproduce [3] by building 4.5.1 with a
> .config with no robustness features and I booted it on fedora-32 or
> gentoo userland and I cannot even invoke call_usermodehelper. Calling
> socket(22, AF_INET, 0) won't invoke such function. Can you reproduce
> on 4.5.1? Which kernel .config should I use to build 4.5.1 in order
> for call_usermodehelper to be invoked by the exploit? Could you help
> to verify it?

I haven't tried to verify this myself. I wonder if the usermode
hardening changes also impacted this exploit? See
https://lkml.org/lkml/2017/1/16/468

But again, focusing on an exploit, which is inherently fragile in
nature and dependent on the state of the kernel tree at a particular
time, is unlikely to be useful to analyze this patch.

>
> It even has uninitialized variable spawning random perrors so it
> doesn't give a warm fuzzy feeling:
>
> ====
> int main(int argc, char **argv) {
>         void *region, *map;
>                       ^^^^^
>         pthread_t uffd_thread;
>         int uffd, msqid, i;
>
>         region = (void *)mmap((void *)0x40000000, 0x2000, PROT_READ|PROT_WRITE,
>                                MAP_FIXED|MAP_PRIVATE|MAP_ANON, -1, 0);
>
>         if (!region) {
>                 perror("mmap");
>                 exit(2);
>         }
>
>         setup_pagefault(region + 0x1000, 0x1000, 1);
>
>         printf("my pid = %d\n", getpid());
>
>         if (!map) {
>         ^^^^^^^^
>                 perror("mmap");
> ====
>
> The whole point of being able to reproduce on 4.5.1 is then to
> simulate if the same exploit would also reproduce on current kernels
> with all enterprise default robustness features enabled. Or did
> anybody already verify it?
>
> Anyway the links I was playing with are all in the cover letter, the
> cover letter is not as important as the actual patches. The actual
> patches looks fine to me.

That's great to hear.

>
> The only improvement I can think of is, what about to add a
> printk_once to suggest to toggle the sysctl if userfaultfd bails out
> because the process lacks the CAP_SYS_PTRACE capability? That would
> facilitate the /etc/sysctl.conf or tuned tweaking in case the apps
> aren't verbose enough.
>
> It's not relevant anymore with this latest patchset, but about the
> previous argument that seccomp couldn't be used in all Android
> processes because of performance concern, I'm slightly confused.

Seccomp causes more problems than just performance. Seccomp is not
designed for whole-of-system protections. Please see my other writeup
at https://lore.kernel.org/lkml/CAFJ0LnEo-7YUvgOhb4pHteuiUW+wPfzqbwXUCGAA35ZMx11A-w@mail.gmail.com/

>
> https://android-developers.googleblog.com/2017/07/seccomp-filter-in-android-o.html
>
> "Android O includes a single seccomp filter installed into zygote, the
> process from which all the Android applications are derived. Because
> the filter is installed into zygote—and therefore all apps—the Android
> security team took extra caution to not break existing apps"
>
> Example:
>
> $ uname -mo
> aarch64 Android
> $ cat swapoff.c
> #include <sys/swap.h>
>
> int main()
> {
>         swapoff("");
> }
> $ gcc swapoff.c -o swapoff -O2
> $ ./swapoff
> Bad system call
> $
>
> It's hard to imagine what is more performance critical than the zygote
> process and the actual apps as above?
>
> It's also hard to imagine what kind of performance concern can arise
> by adding seccomp filters also to background system apps that
> generally should consume ~0% of CPU.
>
> If performance is really a concern, the BPF JIT representation with
> the bitmap to be able to run the filter in O(1) sounds a better
> solution than not adding ad-hoc filters and it's being worked on for
> x86-64 and can be ported to aarch64 too. Many of the standalone
> background processes likely wouldn't even use uffd at all so you could
> block the user initiated faults too that way.
>
> Ultimately because of issues as [3] (be them still relevant or not, to
> be double checked), no matter if through selinux, seccomp or a
> different sysctl value, without this patchset applied the default
> behavior of the userfaultfd syscall for all Linux binaries running on
> Android kernels, would deviate from the upstream kernel. So even if we
> would make the pipe mutex logic more complex the deviation would
> remain. Your patchset adds much less risk of breakage than adding a
> timeout to kernel initiated userfaults and it resolves all concerns as
> well as a timeout. We'll also make better use of the "0" value this
> way. So while I'm not certain this is the best for the long term, this
> looks the sweet spot for the short term to resolve many issues at
> once.
>
> Thanks!
> Andrea
>


-- 
Nick Kralevich | nnk@google.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v4 0/2] Control over userfaultfd kernel-fault handling
  2020-10-08 23:22     ` Nick Kralevich
@ 2020-10-22 20:38       ` Lokesh Gidra
  2020-10-24  5:28       ` Andrea Arcangeli
  1 sibling, 0 replies; 8+ messages in thread
From: Lokesh Gidra @ 2020-10-22 20:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Kralevich, Kees Cook, Jonathan Corbet, Peter Xu,
	Sebastian Andrzej Siewior, Andrew Morton, Alexander Viro,
	Stephen Smalley, Eric Biggers, Daniel Colascione,
	Joel Fernandes (Google),
	Linux FS Devel, linux-kernel, linux-doc, Kalesh Singh,
	Calin Juravle, Suren Baghdasaryan, Jeffrey Vander Stoep,
	Cc: Android Kernel, Mike Rapoport, Shaohua Li, Jerome Glisse,
	Mauro Carvalho Chehab, Johannes Weiner, Mel Gorman, Nitin Gupta,
	Vlastimil Babka, Iurii Zaikin, Luis Chamberlain

On Thu, Oct 8, 2020 at 4:22 PM Nick Kralevich <nnk@google.com> wrote:
>
> On Wed, Oct 7, 2020 at 9:01 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > Hello Lokesh,
> >
> > On Wed, Oct 07, 2020 at 01:26:55PM -0700, Lokesh Gidra wrote:
> > > On Wed, Sep 23, 2020 at 11:56 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > > >
> > > > This patch series is split from [1]. The other series enables SELinux
> > > > support for userfaultfd file descriptors so that its creation and
> > > > movement can be controlled.
> > > >
> > > > It has been demonstrated on various occasions that suspending kernel
> > > > code execution for an arbitrary amount of time at any access to
> > > > userspace memory (copy_from_user()/copy_to_user()/...) can be exploited
> > > > to change the intended behavior of the kernel. For instance, handling
> > > > page faults in kernel-mode using userfaultfd has been exploited in [2, 3].
> > > > Likewise, FUSE, which is similar to userfaultfd in this respect, has been
> > > > exploited in [4, 5] for similar outcome.
> > > >
> > > > This small patch series adds a new flag to userfaultfd(2) that allows
> > > > callers to give up the ability to handle kernel-mode faults with the
> > > > resulting UFFD file object. It then adds a 'user-mode only' option to
> > > > the unprivileged_userfaultfd sysctl knob to require unprivileged
> > > > callers to use this new flag.
> > > >
> > > > The purpose of this new interface is to decrease the chance of an
> > > > unprivileged userfaultfd user taking advantage of userfaultfd to
> > > > enhance security vulnerabilities by lengthening the race window in
> > > > kernel code.
> > > >
> > > > [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
> > > > [2] https://duasynt.com/blog/linux-kernel-heap-spray
> > > > [3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
> >
> > I've looking at those links and I've been trying to verify the link
> > [3] is relevant.
> >
> > Specifically I've been trying to verify if 1) current state of the art
> > modern SLUB randomization techniques already enabled in production and
> > rightfully wasting some CPU in all enterprise kernels to prevent
> > things like above to become an issue in practice 2) combined with the
> > fact different memcg need to share the same kmemcaches (which was
> > incidentally fixed a few months ago upstream) and 3) further
> > robustness enhancements against exploits in the slub metadata, may
> > already render the exploit [3] from 2016 irrelevant in practice.
>
> It's quite possible that some other mitigation was helpful against the
> technique used by this particular exploit. It's the nature of exploits
> that they are fragile and will change as new soft mitigations are
> introduced. The effectiveness of a particular exploit mitigation
> change is orthogonal to the change presented here.
>
> The purpose of this change is to prevent an attacker from suspending
> kernel code execution and having kernel data structures in a
> predictable state. This makes it harder for an attacker to "win" race
> conditions against various kernel data structures. This change
> compliments other kernel hardening changes such as the changes you've
> referenced above. Focusing on one particular exploit somewhat misses
> the point of this change.
>
> >
> > So I started by trying to reproduce [3] by building 4.5.1 with a
> > .config with no robustness features and I booted it on fedora-32 or
> > gentoo userland and I cannot even invoke call_usermodehelper. Calling
> > socket(22, AF_INET, 0) won't invoke such function. Can you reproduce
> > on 4.5.1? Which kernel .config should I use to build 4.5.1 in order
> > for call_usermodehelper to be invoked by the exploit? Could you help
> > to verify it?
>
> I haven't tried to verify this myself. I wonder if the usermode
> hardening changes also impacted this exploit? See
> https://lkml.org/lkml/2017/1/16/468
>
> But again, focusing on an exploit, which is inherently fragile in
> nature and dependent on the state of the kernel tree at a particular
> time, is unlikely to be useful to analyze this patch.
>
> >
> > It even has uninitialized variable spawning random perrors so it
> > doesn't give a warm fuzzy feeling:
> >
> > ====
> > int main(int argc, char **argv) {
> >         void *region, *map;
> >                       ^^^^^
> >         pthread_t uffd_thread;
> >         int uffd, msqid, i;
> >
> >         region = (void *)mmap((void *)0x40000000, 0x2000, PROT_READ|PROT_WRITE,
> >                                MAP_FIXED|MAP_PRIVATE|MAP_ANON, -1, 0);
> >
> >         if (!region) {
> >                 perror("mmap");
> >                 exit(2);
> >         }
> >
> >         setup_pagefault(region + 0x1000, 0x1000, 1);
> >
> >         printf("my pid = %d\n", getpid());
> >
> >         if (!map) {
> >         ^^^^^^^^
> >                 perror("mmap");
> > ====
> >
> > The whole point of being able to reproduce on 4.5.1 is then to
> > simulate if the same exploit would also reproduce on current kernels
> > with all enterprise default robustness features enabled. Or did
> > anybody already verify it?
> >
> > Anyway the links I was playing with are all in the cover letter, the
> > cover letter is not as important as the actual patches. The actual
> > patches looks fine to me.
>
> That's great to hear.
>
> >
> > The only improvement I can think of is, what about to add a
> > printk_once to suggest to toggle the sysctl if userfaultfd bails out
> > because the process lacks the CAP_SYS_PTRACE capability? That would
> > facilitate the /etc/sysctl.conf or tuned tweaking in case the apps
> > aren't verbose enough.
> >
> > It's not relevant anymore with this latest patchset, but about the
> > previous argument that seccomp couldn't be used in all Android
> > processes because of performance concern, I'm slightly confused.
>
> Seccomp causes more problems than just performance. Seccomp is not
> designed for whole-of-system protections. Please see my other writeup
> at https://lore.kernel.org/lkml/CAFJ0LnEo-7YUvgOhb4pHteuiUW+wPfzqbwXUCGAA35ZMx11A-w@mail.gmail.com/
>
> >
> > https://android-developers.googleblog.com/2017/07/seccomp-filter-in-android-o.html
> >
> > "Android O includes a single seccomp filter installed into zygote, the
> > process from which all the Android applications are derived. Because
> > the filter is installed into zygote—and therefore all apps—the Android
> > security team took extra caution to not break existing apps"
> >
> > Example:
> >
> > $ uname -mo
> > aarch64 Android
> > $ cat swapoff.c
> > #include <sys/swap.h>
> >
> > int main()
> > {
> >         swapoff("");
> > }
> > $ gcc swapoff.c -o swapoff -O2
> > $ ./swapoff
> > Bad system call
> > $
> >
> > It's hard to imagine what is more performance critical than the zygote
> > process and the actual apps as above?
> >
> > It's also hard to imagine what kind of performance concern can arise
> > by adding seccomp filters also to background system apps that
> > generally should consume ~0% of CPU.
> >
> > If performance is really a concern, the BPF JIT representation with
> > the bitmap to be able to run the filter in O(1) sounds a better
> > solution than not adding ad-hoc filters and it's being worked on for
> > x86-64 and can be ported to aarch64 too. Many of the standalone
> > background processes likely wouldn't even use uffd at all so you could
> > block the user initiated faults too that way.
> >
> > Ultimately because of issues as [3] (be them still relevant or not, to
> > be double checked), no matter if through selinux, seccomp or a
> > different sysctl value, without this patchset applied the default
> > behavior of the userfaultfd syscall for all Linux binaries running on
> > Android kernels, would deviate from the upstream kernel. So even if we
> > would make the pipe mutex logic more complex the deviation would
> > remain. Your patchset adds much less risk of breakage than adding a
> > timeout to kernel initiated userfaults and it resolves all concerns as
> > well as a timeout. We'll also make better use of the "0" value this
> > way. So while I'm not certain this is the best for the long term, this
> > looks the sweet spot for the short term to resolve many issues at
> > once.
> >
> > Thanks!
> > Andrea
> >
>
>
> --
> Nick Kralevich | nnk@google.com

Hi Andrea,

Did you get a chance to go through Nick's reply to your questions?
Also, I sent another revision of this patch series which takes care of
the printk that you suggested. Please take a look.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v4 0/2] Control over userfaultfd kernel-fault handling
  2020-10-08 23:22     ` Nick Kralevich
  2020-10-22 20:38       ` Lokesh Gidra
@ 2020-10-24  5:28       ` Andrea Arcangeli
  1 sibling, 0 replies; 8+ messages in thread
From: Andrea Arcangeli @ 2020-10-24  5:28 UTC (permalink / raw)
  To: Nick Kralevich
  Cc: Lokesh Gidra, Kees Cook, Jonathan Corbet, Peter Xu,
	Sebastian Andrzej Siewior, Andrew Morton, Alexander Viro,
	Stephen Smalley, Eric Biggers, Daniel Colascione,
	Joel Fernandes (Google),
	Linux FS Devel, linux-kernel, linux-doc, Kalesh Singh,
	Calin Juravle, Suren Baghdasaryan, Jeffrey Vander Stoep,
	Cc: Android Kernel, Mike Rapoport, Shaohua Li, Jerome Glisse,
	Mauro Carvalho Chehab, Johannes Weiner, Mel Gorman, Nitin Gupta,
	Vlastimil Babka, Iurii Zaikin, Luis Chamberlain

Hello,

On Thu, Oct 08, 2020 at 04:22:36PM -0700, Nick Kralevich wrote:
> I haven't tried to verify this myself. I wonder if the usermode
> hardening changes also impacted this exploit? See
> https://lkml.org/lkml/2017/1/16/468

My plan was to:

1) reproduce with the old buggy kernel

2) forward port the bug to the very first version that had both the
   slub and page freelist randomization available and keep them
   disabled

3) enable the freelist randomization features (which are already
   enabled by default in the current enterprise kernels) and see if
   that makes the attack not workable

The hardening of the usermode helper you mentioned is spot on, but it
would have been something to worry about and possibly roll back at
point 2), but I couldn't get past point 1)..

Plenty other hardening techniques (just like the usermode helper) are
very specific to a single attack, but the randomization looks generic
enough to cover the entire class.

> But again, focusing on an exploit, which is inherently fragile in
> nature and dependent on the state of the kernel tree at a particular
> time, is unlikely to be useful to analyze this patch.

Agreed. A single exploit using userfaultfd to enlarge the race window
of the use-after-free, not being workable anymore with randomized slub
and page freelist enabled, wouldn't have meant a thing by itself.

As opposed if that single exploit was still fairly reproducible, it
would have been enough to consider the sysctl default to zero as
something providing a more tangible long term benefit. That would have
been good information to have too, if that's actually the case.

I was merely attempting to get a first data point.. overall it would
be nice to initiate some research to verify the exact statistical
effects that slub/page randomization has on those use-after-free race
conditions that can be enlarged by blocking kernel faults, given we're
already paying the price for it. I don't think anybody has a sure
answer at this point, if we can entirely rely on those features or not.

> Seccomp causes more problems than just performance. Seccomp is not
> designed for whole-of-system protections. Please see my other writeup
> at https://lore.kernel.org/lkml/CAFJ0LnEo-7YUvgOhb4pHteuiUW+wPfzqbwXUCGAA35ZMx11A-w@mail.gmail.com/

Whole-of-system protection I guess helps primarily because it requires
no change to userland I guess.

An example of a task not running as root (and without ptrace
capability) that could use more seccomp blocking:

# cat /proc/1517/cmdline ; echo ; grep CapEff /proc/1517/status; grep Seccomp /proc/1517/status
/vendor/bin/hw/qcrild
CapEff: 0000001000003000
Seccomp:        0

My view is that if the various binaries started by init.rc are run
without a strict seccomp filter there would be more things to worry
about, than kernel initiated userfaults for those.

Still the solution in the v5 patchset looks the safest for all until
we'll be able to tell if the slub/page randomizaton (or any other
generic enough robustness feature) is already effective against an
enlarged race window of kernel initiated userfaults and at the same
time it provides the main benefit of avoiding divergence in the
behavior of the userfaultfd syscall if invoked within the Android
userland.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-10-24  5:28 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-24  6:56 [PATCH v4 0/2] Control over userfaultfd kernel-fault handling Lokesh Gidra
2020-09-24  6:56 ` [PATCH v4 1/2] Add UFFD_USER_MODE_ONLY Lokesh Gidra
2020-09-24  6:56 ` [PATCH v4 2/2] Add user-mode only option to unprivileged_userfaultfd sysctl knob Lokesh Gidra
2020-10-07 20:26 ` [PATCH v4 0/2] Control over userfaultfd kernel-fault handling Lokesh Gidra
2020-10-08  4:01   ` Andrea Arcangeli
2020-10-08 23:22     ` Nick Kralevich
2020-10-22 20:38       ` Lokesh Gidra
2020-10-24  5:28       ` Andrea Arcangeli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).