Linux-Doc Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 0/2] Control over userfaultfd kernel-fault handling
@ 2020-04-23  0:26 Daniel Colascione
  2020-04-23  0:26 ` [PATCH 1/2] Add UFFD_USER_MODE_ONLY Daniel Colascione
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Daniel Colascione @ 2020-04-23  0:26 UTC (permalink / raw)
  To: Jonathan Corbet, Alexander Viro, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Mauro Carvalho Chehab, Andrew Morton,
	Andy Shevchenko, Vlastimil Babka, Mel Gorman,
	Sebastian Andrzej Siewior, Peter Xu, Daniel Colascione,
	Andrea Arcangeli, Mike Rapoport, Jerome Glisse, Shaohua Li,
	linux-doc, linux-kernel, linux-fsdevel, timmurray, minchan,
	sspatil, lokeshgidra

This small patch series adds a new flag to userfaultfd(2) that allows
callers to give up the ability to handle user-mode faults with the
resulting UFFD file object. In then add a new sysctl to require
unprivileged callers to use this new flag.

The purpose of this new interface is to decrease the change of an
unprivileged userfaultfd user taking advantage of userfaultfd to
enhance security vulnerabilities by lengthening the race window in
kernel code.

This patch series is split from [1].

[1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/

Daniel Colascione (2):
  Add UFFD_USER_MODE_ONLY
  Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only

 Documentation/admin-guide/sysctl/vm.rst | 13 +++++++++++++
 fs/userfaultfd.c                        | 18 ++++++++++++++++--
 include/linux/userfaultfd_k.h           |  1 +
 include/uapi/linux/userfaultfd.h        |  9 +++++++++
 kernel/sysctl.c                         |  9 +++++++++
 5 files changed, 48 insertions(+), 2 deletions(-)

-- 
2.26.2.303.gf8c07b1a785-goog


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/2] Add UFFD_USER_MODE_ONLY
  2020-04-23  0:26 [PATCH 0/2] Control over userfaultfd kernel-fault handling Daniel Colascione
@ 2020-04-23  0:26 ` Daniel Colascione
  2020-07-24 14:28   ` Michael S. Tsirkin
  2020-04-23  0:26 ` [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only Daniel Colascione
  2020-07-24 14:01 ` [PATCH 0/2] Control over userfaultfd kernel-fault handling Michael S. Tsirkin
  2 siblings, 1 reply; 23+ messages in thread
From: Daniel Colascione @ 2020-04-23  0:26 UTC (permalink / raw)
  To: Jonathan Corbet, Alexander Viro, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Mauro Carvalho Chehab, Andrew Morton,
	Andy Shevchenko, Vlastimil Babka, Mel Gorman,
	Sebastian Andrzej Siewior, Peter Xu, Daniel Colascione,
	Andrea Arcangeli, Mike Rapoport, Jerome Glisse, Shaohua Li,
	linux-doc, linux-kernel, linux-fsdevel, timmurray, minchan,
	sspatil, lokeshgidra

userfaultfd handles page faults from both user and kernel code.  Add a
new UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the
resulting userfaultfd object refuse to handle faults from kernel mode,
treating these faults as if SIGBUS were always raised, causing the
kernel code to fail with EFAULT.

A future patch adds a knob allowing administrators to give some
processes the ability to create userfaultfd file objects only if they
pass UFFD_USER_MODE_ONLY, reducing the likelihood that these processes
will exploit userfaultfd's ability to delay kernel page faults to open
timing windows for future exploits.

Signed-off-by: Daniel Colascione <dancol@google.com>
---
 fs/userfaultfd.c                 | 7 ++++++-
 include/uapi/linux/userfaultfd.h | 9 +++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index e39fdec8a0b0..21378abe8f7b 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -418,6 +418,9 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 
 	if (ctx->features & UFFD_FEATURE_SIGBUS)
 		goto out;
+	if ((vmf->flags & FAULT_FLAG_USER) == 0 &&
+	    ctx->flags & UFFD_USER_MODE_ONLY)
+		goto out;
 
 	/*
 	 * If it's already released don't get it. This avoids to loop
@@ -2003,6 +2006,7 @@ static void init_once_userfaultfd_ctx(void *mem)
 
 SYSCALL_DEFINE1(userfaultfd, int, flags)
 {
+	static const int uffd_flags = UFFD_USER_MODE_ONLY;
 	struct userfaultfd_ctx *ctx;
 	int fd;
 
@@ -2012,10 +2016,11 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
 	BUG_ON(!current->mm);
 
 	/* Check the UFFD_* constants for consistency.  */
+	BUILD_BUG_ON(uffd_flags & UFFD_SHARED_FCNTL_FLAGS);
 	BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
 	BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
 
-	if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
+	if (flags & ~(UFFD_SHARED_FCNTL_FLAGS | uffd_flags))
 		return -EINVAL;
 
 	ctx = kmem_cache_alloc(userfaultfd_ctx_cachep, GFP_KERNEL);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index e7e98bde221f..5f2d88212f7c 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -257,4 +257,13 @@ struct uffdio_writeprotect {
 	__u64 mode;
 };
 
+/*
+ * Flags for the userfaultfd(2) system call itself.
+ */
+
+/*
+ * Create a userfaultfd that can handle page faults only in user mode.
+ */
+#define UFFD_USER_MODE_ONLY 1
+
 #endif /* _LINUX_USERFAULTFD_H */
-- 
2.26.2.303.gf8c07b1a785-goog


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
  2020-04-23  0:26 [PATCH 0/2] Control over userfaultfd kernel-fault handling Daniel Colascione
  2020-04-23  0:26 ` [PATCH 1/2] Add UFFD_USER_MODE_ONLY Daniel Colascione
@ 2020-04-23  0:26 ` Daniel Colascione
  2020-05-06 19:38   ` Peter Xu
  2020-05-08 16:52   ` Michael S. Tsirkin
  2020-07-24 14:01 ` [PATCH 0/2] Control over userfaultfd kernel-fault handling Michael S. Tsirkin
  2 siblings, 2 replies; 23+ messages in thread
From: Daniel Colascione @ 2020-04-23  0:26 UTC (permalink / raw)
  To: Jonathan Corbet, Alexander Viro, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Mauro Carvalho Chehab, Andrew Morton,
	Andy Shevchenko, Vlastimil Babka, Mel Gorman,
	Sebastian Andrzej Siewior, Peter Xu, Daniel Colascione,
	Andrea Arcangeli, Mike Rapoport, Jerome Glisse, Shaohua Li,
	linux-doc, linux-kernel, linux-fsdevel, timmurray, minchan,
	sspatil, lokeshgidra

This sysctl can be set to either zero or one. When zero (the default)
the system lets all users call userfaultfd with or without
UFFD_USER_MODE_ONLY, modulo other access controls. When
unprivileged_userfaultfd_user_mode_only is set to one, users without
CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY to userfaultfd or the API
will fail with EPERM. This facility allows administrators to reduce
the likelihood that an attacker with access to userfaultfd can delay
faulting kernel code to widen timing windows for other exploits.

Signed-off-by: Daniel Colascione <dancol@google.com>
---
 Documentation/admin-guide/sysctl/vm.rst | 13 +++++++++++++
 fs/userfaultfd.c                        | 11 ++++++++++-
 include/linux/userfaultfd_k.h           |  1 +
 kernel/sysctl.c                         |  9 +++++++++
 4 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 0329a4d3fa9e..4296b508ab74 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -850,6 +850,19 @@ privileged users (with SYS_CAP_PTRACE capability).
 
 The default value is 1.
 
+unprivileged_userfaultfd_user_mode_only
+========================================
+
+This flag controls whether unprivileged users can use the userfaultfd
+system calls to handle page faults in kernel mode.  If set to zero,
+userfaultfd works with or without UFFD_USER_MODE_ONLY, modulo
+unprivileged_userfaultfd above.  If set to one, users without
+SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd
+to succeed.  Prohibiting use of userfaultfd for handling faults from
+kernel mode may make certain vulnerabilities more difficult
+to exploit.
+
+The default value is 0.
 
 user_reserve_kbytes
 ===================
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 21378abe8f7b..85cc1ab74361 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -29,6 +29,7 @@
 #include <linux/hugetlb.h>
 
 int sysctl_unprivileged_userfaultfd __read_mostly = 1;
+int sysctl_unprivileged_userfaultfd_user_mode_only __read_mostly = 0;
 
 static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
 
@@ -2009,8 +2010,16 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
 	static const int uffd_flags = UFFD_USER_MODE_ONLY;
 	struct userfaultfd_ctx *ctx;
 	int fd;
+	bool need_cap_check = false;
 
-	if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
+	if (!sysctl_unprivileged_userfaultfd)
+		need_cap_check = true;
+
+	if (sysctl_unprivileged_userfaultfd_user_mode_only &&
+	    (flags & UFFD_USER_MODE_ONLY) == 0)
+		need_cap_check = true;
+
+	if (need_cap_check && !capable(CAP_SYS_PTRACE))
 		return -EPERM;
 
 	BUG_ON(!current->mm);
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index a8e5f3ea9bb2..d81e30074bf5 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -31,6 +31,7 @@
 #define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
 
 extern int sysctl_unprivileged_userfaultfd;
+extern int sysctl_unprivileged_userfaultfd_user_mode_only;
 
 extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a176d8727a3..9cbdf4483961 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1719,6 +1719,15 @@ static struct ctl_table vm_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "unprivileged_userfaultfd_user_mode_only",
+		.data		= &sysctl_unprivileged_userfaultfd_user_mode_only,
+		.maxlen		= sizeof(sysctl_unprivileged_userfaultfd_user_mode_only),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 #endif
 	{ }
 };
-- 
2.26.2.303.gf8c07b1a785-goog


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
  2020-04-23  0:26 ` [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only Daniel Colascione
@ 2020-05-06 19:38   ` Peter Xu
  2020-05-07 19:15     ` Jonathan Corbet
  2020-05-08 16:52   ` Michael S. Tsirkin
  1 sibling, 1 reply; 23+ messages in thread
From: Peter Xu @ 2020-05-06 19:38 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Jonathan Corbet, Alexander Viro, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Mauro Carvalho Chehab, Andrew Morton,
	Andy Shevchenko, Vlastimil Babka, Mel Gorman,
	Sebastian Andrzej Siewior, Andrea Arcangeli, Mike Rapoport,
	Jerome Glisse, Shaohua Li, linux-doc, linux-kernel,
	linux-fsdevel, timmurray, minchan, sspatil, lokeshgidra

On Wed, Apr 22, 2020 at 05:26:32PM -0700, Daniel Colascione wrote:
> +unprivileged_userfaultfd_user_mode_only
> +========================================
> +
> +This flag controls whether unprivileged users can use the userfaultfd
> +system calls to handle page faults in kernel mode.  If set to zero,
> +userfaultfd works with or without UFFD_USER_MODE_ONLY, modulo
> +unprivileged_userfaultfd above.  If set to one, users without
> +SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd
> +to succeed.  Prohibiting use of userfaultfd for handling faults from
> +kernel mode may make certain vulnerabilities more difficult
> +to exploit.
> +
> +The default value is 0.

If this is going to be added... I am thinking whether it should be easier to
add another value for unprivileged_userfaultfd, rather than a new sysctl. E.g.:

  "0": unprivileged userfaultfd forbidden
  "1": unprivileged userfaultfd allowed (both user/kernel faults)
  "2": unprivileged userfaultfd allowed (only user faults)

Because after all unprivileged_userfaultfd_user_mode_only will be meaningless
(iiuc) if unprivileged_userfaultfd=0.  The default value will also be the same
as before ("1") then.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
  2020-05-06 19:38   ` Peter Xu
@ 2020-05-07 19:15     ` Jonathan Corbet
  2020-05-20  4:06       ` Andrea Arcangeli
  0 siblings, 1 reply; 23+ messages in thread
From: Jonathan Corbet @ 2020-05-07 19:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel Colascione, Alexander Viro, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Mauro Carvalho Chehab, Andrew Morton,
	Andy Shevchenko, Vlastimil Babka, Mel Gorman,
	Sebastian Andrzej Siewior, Andrea Arcangeli, Mike Rapoport,
	Jerome Glisse, Shaohua Li, linux-doc, linux-kernel,
	linux-fsdevel, timmurray, minchan, sspatil, lokeshgidra

On Wed, 6 May 2020 15:38:16 -0400
Peter Xu <peterx@redhat.com> wrote:

> If this is going to be added... I am thinking whether it should be easier to
> add another value for unprivileged_userfaultfd, rather than a new sysctl. E.g.:
> 
>   "0": unprivileged userfaultfd forbidden
>   "1": unprivileged userfaultfd allowed (both user/kernel faults)
>   "2": unprivileged userfaultfd allowed (only user faults)
> 
> Because after all unprivileged_userfaultfd_user_mode_only will be meaningless
> (iiuc) if unprivileged_userfaultfd=0.  The default value will also be the same
> as before ("1") then.

It occurs to me to wonder whether this interface should also let an admin
block *privileged* user from handling kernel-space faults?  In a
secure-boot/lockdown setting, this could be a hardening measure that keeps
a (somewhat) restricted root user from expanding their privilege...?

jon

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
  2020-04-23  0:26 ` [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only Daniel Colascione
  2020-05-06 19:38   ` Peter Xu
@ 2020-05-08 16:52   ` Michael S. Tsirkin
  2020-05-08 16:54     ` Michael S. Tsirkin
  1 sibling, 1 reply; 23+ messages in thread
From: Michael S. Tsirkin @ 2020-05-08 16:52 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Jonathan Corbet, Alexander Viro, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Mauro Carvalho Chehab, Andrew Morton,
	Andy Shevchenko, Vlastimil Babka, Mel Gorman,
	Sebastian Andrzej Siewior, Peter Xu, Andrea Arcangeli,
	Mike Rapoport, Jerome Glisse, Shaohua Li, linux-doc,
	linux-kernel, linux-fsdevel, timmurray, minchan, sspatil,
	lokeshgidra

On Wed, Apr 22, 2020 at 05:26:32PM -0700, Daniel Colascione wrote:
> This sysctl can be set to either zero or one. When zero (the default)
> the system lets all users call userfaultfd with or without
> UFFD_USER_MODE_ONLY, modulo other access controls. When
> unprivileged_userfaultfd_user_mode_only is set to one, users without
> CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY to userfaultfd or the API
> will fail with EPERM. This facility allows administrators to reduce
> the likelihood that an attacker with access to userfaultfd can delay
> faulting kernel code to widen timing windows for other exploits.
> 
> Signed-off-by: Daniel Colascione <dancol@google.com>

The approach taken looks like a hard-coded security policy.
For example, it won't be possible to set the sysctl knob
in question on any sytem running kvm. So this is
no good for any general purpose system.

What's wrong with using a security policy for this instead?



> ---
>  Documentation/admin-guide/sysctl/vm.rst | 13 +++++++++++++
>  fs/userfaultfd.c                        | 11 ++++++++++-
>  include/linux/userfaultfd_k.h           |  1 +
>  kernel/sysctl.c                         |  9 +++++++++
>  4 files changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index 0329a4d3fa9e..4296b508ab74 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -850,6 +850,19 @@ privileged users (with SYS_CAP_PTRACE capability).
>  
>  The default value is 1.
>  
> +unprivileged_userfaultfd_user_mode_only
> +========================================
> +
> +This flag controls whether unprivileged users can use the userfaultfd
> +system calls to handle page faults in kernel mode.  If set to zero,
> +userfaultfd works with or without UFFD_USER_MODE_ONLY, modulo
> +unprivileged_userfaultfd above.  If set to one, users without
> +SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd
> +to succeed.  Prohibiting use of userfaultfd for handling faults from
> +kernel mode may make certain vulnerabilities more difficult
> +to exploit.
> +
> +The default value is 0.
>  
>  user_reserve_kbytes
>  ===================
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 21378abe8f7b..85cc1ab74361 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -29,6 +29,7 @@
>  #include <linux/hugetlb.h>
>  
>  int sysctl_unprivileged_userfaultfd __read_mostly = 1;
> +int sysctl_unprivileged_userfaultfd_user_mode_only __read_mostly = 0;
>  
>  static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
>  
> @@ -2009,8 +2010,16 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
>  	static const int uffd_flags = UFFD_USER_MODE_ONLY;
>  	struct userfaultfd_ctx *ctx;
>  	int fd;
> +	bool need_cap_check = false;
>  
> -	if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
> +	if (!sysctl_unprivileged_userfaultfd)
> +		need_cap_check = true;
> +
> +	if (sysctl_unprivileged_userfaultfd_user_mode_only &&
> +	    (flags & UFFD_USER_MODE_ONLY) == 0)
> +		need_cap_check = true;
> +
> +	if (need_cap_check && !capable(CAP_SYS_PTRACE))
>  		return -EPERM;
>  
>  	BUG_ON(!current->mm);
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index a8e5f3ea9bb2..d81e30074bf5 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -31,6 +31,7 @@
>  #define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
>  
>  extern int sysctl_unprivileged_userfaultfd;
> +extern int sysctl_unprivileged_userfaultfd_user_mode_only;
>  
>  extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
>  
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 8a176d8727a3..9cbdf4483961 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1719,6 +1719,15 @@ static struct ctl_table vm_table[] = {
>  		.extra1		= SYSCTL_ZERO,
>  		.extra2		= SYSCTL_ONE,
>  	},
> +	{
> +		.procname	= "unprivileged_userfaultfd_user_mode_only",
> +		.data		= &sysctl_unprivileged_userfaultfd_user_mode_only,
> +		.maxlen		= sizeof(sysctl_unprivileged_userfaultfd_user_mode_only),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= SYSCTL_ZERO,
> +		.extra2		= SYSCTL_ONE,
> +	},
>  #endif
>  	{ }
>  };
> -- 
> 2.26.2.303.gf8c07b1a785-goog
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
  2020-05-08 16:52   ` Michael S. Tsirkin
@ 2020-05-08 16:54     ` Michael S. Tsirkin
  2020-05-20  4:59       ` Andrea Arcangeli
  0 siblings, 1 reply; 23+ messages in thread
From: Michael S. Tsirkin @ 2020-05-08 16:54 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Jonathan Corbet, Alexander Viro, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Mauro Carvalho Chehab, Andrew Morton,
	Andy Shevchenko, Vlastimil Babka, Mel Gorman,
	Sebastian Andrzej Siewior, Peter Xu, Andrea Arcangeli,
	Mike Rapoport, Jerome Glisse, Shaohua Li, linux-doc,
	linux-kernel, linux-fsdevel, timmurray, minchan, sspatil,
	lokeshgidra

On Fri, May 08, 2020 at 12:52:34PM -0400, Michael S. Tsirkin wrote:
> On Wed, Apr 22, 2020 at 05:26:32PM -0700, Daniel Colascione wrote:
> > This sysctl can be set to either zero or one. When zero (the default)
> > the system lets all users call userfaultfd with or without
> > UFFD_USER_MODE_ONLY, modulo other access controls. When
> > unprivileged_userfaultfd_user_mode_only is set to one, users without
> > CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY to userfaultfd or the API
> > will fail with EPERM. This facility allows administrators to reduce
> > the likelihood that an attacker with access to userfaultfd can delay
> > faulting kernel code to widen timing windows for other exploits.
> > 
> > Signed-off-by: Daniel Colascione <dancol@google.com>
> 
> The approach taken looks like a hard-coded security policy.
> For example, it won't be possible to set the sysctl knob
> in question on any sytem running kvm. So this is
> no good for any general purpose system.
> 
> What's wrong with using a security policy for this instead?

In fact I see the original thread already mentions selinux,
so it's just a question of making this controllable by
selinux.

> 
> 
> > ---
> >  Documentation/admin-guide/sysctl/vm.rst | 13 +++++++++++++
> >  fs/userfaultfd.c                        | 11 ++++++++++-
> >  include/linux/userfaultfd_k.h           |  1 +
> >  kernel/sysctl.c                         |  9 +++++++++
> >  4 files changed, 33 insertions(+), 1 deletion(-)
> > 
> > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> > index 0329a4d3fa9e..4296b508ab74 100644
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -850,6 +850,19 @@ privileged users (with SYS_CAP_PTRACE capability).
> >  
> >  The default value is 1.
> >  
> > +unprivileged_userfaultfd_user_mode_only
> > +========================================
> > +
> > +This flag controls whether unprivileged users can use the userfaultfd
> > +system calls to handle page faults in kernel mode.  If set to zero,
> > +userfaultfd works with or without UFFD_USER_MODE_ONLY, modulo
> > +unprivileged_userfaultfd above.  If set to one, users without
> > +SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd
> > +to succeed.  Prohibiting use of userfaultfd for handling faults from
> > +kernel mode may make certain vulnerabilities more difficult
> > +to exploit.
> > +
> > +The default value is 0.
> >  
> >  user_reserve_kbytes
> >  ===================
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 21378abe8f7b..85cc1ab74361 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -29,6 +29,7 @@
> >  #include <linux/hugetlb.h>
> >  
> >  int sysctl_unprivileged_userfaultfd __read_mostly = 1;
> > +int sysctl_unprivileged_userfaultfd_user_mode_only __read_mostly = 0;
> >  
> >  static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
> >  
> > @@ -2009,8 +2010,16 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
> >  	static const int uffd_flags = UFFD_USER_MODE_ONLY;
> >  	struct userfaultfd_ctx *ctx;
> >  	int fd;
> > +	bool need_cap_check = false;
> >  
> > -	if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
> > +	if (!sysctl_unprivileged_userfaultfd)
> > +		need_cap_check = true;
> > +
> > +	if (sysctl_unprivileged_userfaultfd_user_mode_only &&
> > +	    (flags & UFFD_USER_MODE_ONLY) == 0)
> > +		need_cap_check = true;
> > +
> > +	if (need_cap_check && !capable(CAP_SYS_PTRACE))
> >  		return -EPERM;
> >  
> >  	BUG_ON(!current->mm);
> > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > index a8e5f3ea9bb2..d81e30074bf5 100644
> > --- a/include/linux/userfaultfd_k.h
> > +++ b/include/linux/userfaultfd_k.h
> > @@ -31,6 +31,7 @@
> >  #define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
> >  
> >  extern int sysctl_unprivileged_userfaultfd;
> > +extern int sysctl_unprivileged_userfaultfd_user_mode_only;
> >  
> >  extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
> >  
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 8a176d8727a3..9cbdf4483961 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -1719,6 +1719,15 @@ static struct ctl_table vm_table[] = {
> >  		.extra1		= SYSCTL_ZERO,
> >  		.extra2		= SYSCTL_ONE,
> >  	},
> > +	{
> > +		.procname	= "unprivileged_userfaultfd_user_mode_only",
> > +		.data		= &sysctl_unprivileged_userfaultfd_user_mode_only,
> > +		.maxlen		= sizeof(sysctl_unprivileged_userfaultfd_user_mode_only),
> > +		.mode		= 0644,
> > +		.proc_handler	= proc_dointvec_minmax,
> > +		.extra1		= SYSCTL_ZERO,
> > +		.extra2		= SYSCTL_ONE,
> > +	},
> >  #endif
> >  	{ }
> >  };
> > -- 
> > 2.26.2.303.gf8c07b1a785-goog
> > 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
  2020-05-07 19:15     ` Jonathan Corbet
@ 2020-05-20  4:06       ` Andrea Arcangeli
  0 siblings, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2020-05-20  4:06 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Peter Xu, Daniel Colascione, Alexander Viro, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Mauro Carvalho Chehab, Andrew Morton,
	Andy Shevchenko, Vlastimil Babka, Mel Gorman,
	Sebastian Andrzej Siewior, Mike Rapoport, Jerome Glisse,
	Shaohua Li, linux-doc, linux-kernel, linux-fsdevel, timmurray,
	minchan, sspatil, lokeshgidra

Hello Jonathan and everyone,

On Thu, May 07, 2020 at 01:15:03PM -0600, Jonathan Corbet wrote:
> On Wed, 6 May 2020 15:38:16 -0400
> Peter Xu <peterx@redhat.com> wrote:
> 
> > If this is going to be added... I am thinking whether it should be easier to
> > add another value for unprivileged_userfaultfd, rather than a new sysctl. E.g.:
> > 
> >   "0": unprivileged userfaultfd forbidden
> >   "1": unprivileged userfaultfd allowed (both user/kernel faults)
> >   "2": unprivileged userfaultfd allowed (only user faults)
> > 
> > Because after all unprivileged_userfaultfd_user_mode_only will be meaningless
> > (iiuc) if unprivileged_userfaultfd=0.  The default value will also be the same
> > as before ("1") 
> It occurs to me to wonder whether this interface should also let an admin
> block *privileged* user from handling kernel-space faults?  In a
> secure-boot/lockdown setting, this could be a hardening measure that keeps
> a (somewhat) restricted root user from expanding their privilege...?

That's a good question. In my view if as root in lockdown mode you can
still run the swapon syscall and setup nfs or other network devices
and load userland fuse filesystems or cuse chardev in userland, even
if you prevent userfaultfd from blocking kernel faults, kernel faults
can still be blocked by other means.

That in fact tends to be true also as non root (so regardless of
lockdown settings) since luser can generally load fuse filesystems.
There is no fundamental integrity breakage or privilege escalation
originating in userfaultfd.

The only concern here is about this: "after a new use-after-free is
discovered in some other part of the kernel (not related to
userfaultfd), how easy it is to turn the use-after-free from a mere
DoS to a more concerning privilege escalation?". userfaultfd might
facilitate the exploitation, but even if you remove userfaultfd from
the equation, there's still no guarantee an user-after-free won't
materialize as a privilege escalation by other means.

So to express it in another way: unless lockdown (no matter in which
mode) is a weak probabilistic based feature and in turn it cannot
provide any guarantee to begin with, userfaultfd sysctl set to 0|1|2
can't possibly make any difference to it.

The best mitigation for those kind of exploits remains to randomize
all kernel memory allocations, so even if the attacker can block the
fault, when it's unblocked it'll pick another page, not the one that
the attacker can predict it will use, so the attacker needs to repeat
the race many more times and hopefully it'll DoS and destabilize the
kernel before it can reproduce a privilege escalation. We got many of
those randomization features in the current kernel and it's probably
more important to enable those than to worry about this sysctl value.

One way to have a peace of mind against all use-after-free regardless
of this sysctl value, is to run each pod in a KVM instance, that's
safer than disabling syscalls or kernel features.

The default seccomp profiles of podman already block userfaultfd too,
so there's no need of virt to get extra safety if you use containers:
containers need to explicitly opt-in to enable userfaultfd through the
OCI schema seccomp object. If userfaultfd is being explicitly
whitelisted in the OCI schema of the container, well then you know
there is a good reason for it. As a matter of fact some things are
only possible to achieve with userfaultfd fully enabled.

The big value uffd brings compared to trapping sigsegv is precisely to
be able to handle kernel faults transparently. sigsegv can't do that
because every syscall would return 1) an inconsistent retval and 2) no
fault address along with the retval.

The possible future uffd userland users could be: dropping JVM dirty
bit, redis snapshot using pthread_create() instead of fork(),
distributed shared memory on pmem, new malloc() implementation never
taking mmap_sem for writing in the kernel and never modifying any vma
to allocate and free anon memory, etc.. I don't think any of them
would work with the sysctl set to "2".

The next kernel feature in uffd land that I was discussing with Peter,
is an async uffd event model to further optimize the replacement of
soft-dirty (which uffd already provides in O(1) instead of O(N)), so
the wrprotect fault won't have to block anymore until the uffd async
queue overflows. That also is unlikely to work with the sysctl set to
"2" without adding extra constraints that soft-dirty doesn't currently
have.

It would also be possible to implement the value "2" to work like
/proc/sys/kernel/unprivileged_bpf_disabled, so when you set it to "1"
as root, you can't set it to "2" or "0" and when you set it to "2" you
can't set it to "0", but personally I think it's unnecessary.

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
  2020-05-08 16:54     ` Michael S. Tsirkin
@ 2020-05-20  4:59       ` Andrea Arcangeli
  2020-05-20 18:03         ` Kees Cook
  0 siblings, 1 reply; 23+ messages in thread
From: Andrea Arcangeli @ 2020-05-20  4:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Daniel Colascione, Jonathan Corbet, Alexander Viro,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Mauro Carvalho Chehab,
	Andrew Morton, Andy Shevchenko, Vlastimil Babka, Mel Gorman,
	Sebastian Andrzej Siewior, Peter Xu, Mike Rapoport,
	Jerome Glisse, Shaohua Li, linux-doc, linux-kernel,
	linux-fsdevel, timmurray, minchan, sspatil, lokeshgidra

Hello everyone,

On Fri, May 08, 2020 at 12:54:03PM -0400, Michael S. Tsirkin wrote:
> On Fri, May 08, 2020 at 12:52:34PM -0400, Michael S. Tsirkin wrote:
> > On Wed, Apr 22, 2020 at 05:26:32PM -0700, Daniel Colascione wrote:
> > > This sysctl can be set to either zero or one. When zero (the default)
> > > the system lets all users call userfaultfd with or without
> > > UFFD_USER_MODE_ONLY, modulo other access controls. When
> > > unprivileged_userfaultfd_user_mode_only is set to one, users without
> > > CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY to userfaultfd or the API
> > > will fail with EPERM. This facility allows administrators to reduce
> > > the likelihood that an attacker with access to userfaultfd can delay
> > > faulting kernel code to widen timing windows for other exploits.
> > > 
> > > Signed-off-by: Daniel Colascione <dancol@google.com>
> > 
> > The approach taken looks like a hard-coded security policy.
> > For example, it won't be possible to set the sysctl knob
> > in question on any sytem running kvm. So this is
> > no good for any general purpose system.
> > 
> > What's wrong with using a security policy for this instead?
> 
> In fact I see the original thread already mentions selinux,
> so it's just a question of making this controllable by
> selinux.

I agree it'd be preferable if it was not hardcoded, but then this
patchset is also much simpler than the previous controlling it through
selinux..

I was thinking, an alternative policy that could control it without
hard-coding it, is a seccomp-bpf filter, then you can drop 2/2 as
well, not just 1/6-4/6.

If you keep only 1/2, can't seccomp-bpf enforce userfaultfd to be
always called with flags==0x1 without requiring extra modifications in
the kernel?

Can't you get the feature party with the CAP_SYS_PTRACE capability
too, if you don't wrap those tasks with the ptrace capability under
that seccomp filter?

As far as I can tell, it's unprecedented to create a flag for a
syscall API, with the only purpose of implementing a seccomp-bpf
filter verifying such flag is set, but then if you want to control it
with LSM it's even more complex than doing it with seccomp-bpf, and it
requires more kernel code too. We could always add 2/2 later, such
possibility won't disappear, in fact we could also add 1/6-4/6 later
too if that is not enough.

If we could begin by merging only 1/2 from this new series and be done
with the kernel changes, because we offload the rest of the work to
the kernel eBPF JIT, I think it'd be ideal.

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
  2020-05-20  4:59       ` Andrea Arcangeli
@ 2020-05-20 18:03         ` Kees Cook
  2020-05-20 19:48           ` Andrea Arcangeli
  0 siblings, 1 reply; 23+ messages in thread
From: Kees Cook @ 2020-05-20 18:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Michael S. Tsirkin, Daniel Colascione, Jonathan Corbet,
	Alexander Viro, Luis Chamberlain, Iurii Zaikin,
	Mauro Carvalho Chehab, Andrew Morton, Andy Shevchenko,
	Vlastimil Babka, Mel Gorman, Sebastian Andrzej Siewior, Peter Xu,
	Mike Rapoport, Jerome Glisse, Shaohua Li, linux-doc,
	linux-kernel, linux-fsdevel, timmurray, minchan, sspatil,
	lokeshgidra

On Wed, May 20, 2020 at 12:59:38AM -0400, Andrea Arcangeli wrote:
> Hello everyone,
> 
> On Fri, May 08, 2020 at 12:54:03PM -0400, Michael S. Tsirkin wrote:
> > On Fri, May 08, 2020 at 12:52:34PM -0400, Michael S. Tsirkin wrote:
> > > On Wed, Apr 22, 2020 at 05:26:32PM -0700, Daniel Colascione wrote:
> > > > This sysctl can be set to either zero or one. When zero (the default)
> > > > the system lets all users call userfaultfd with or without
> > > > UFFD_USER_MODE_ONLY, modulo other access controls. When
> > > > unprivileged_userfaultfd_user_mode_only is set to one, users without
> > > > CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY to userfaultfd or the API
> > > > will fail with EPERM. This facility allows administrators to reduce
> > > > the likelihood that an attacker with access to userfaultfd can delay
> > > > faulting kernel code to widen timing windows for other exploits.
> > > > 
> > > > Signed-off-by: Daniel Colascione <dancol@google.com>
> > > 
> > > The approach taken looks like a hard-coded security policy.
> > > For example, it won't be possible to set the sysctl knob
> > > in question on any sytem running kvm. So this is
> > > no good for any general purpose system.

Not all systems run unprivileged KVM. :)

> > > What's wrong with using a security policy for this instead?
> > 
> > In fact I see the original thread already mentions selinux,
> > so it's just a question of making this controllable by
> > selinux.
> 
> I agree it'd be preferable if it was not hardcoded, but then this
> patchset is also much simpler than the previous controlling it through
> selinux..
> 
> I was thinking, an alternative policy that could control it without
> hard-coding it, is a seccomp-bpf filter, then you can drop 2/2 as
> well, not just 1/6-4/6.

Err, did I miss a separate 6-patch series? I can't find anything on lore.

> 
> If you keep only 1/2, can't seccomp-bpf enforce userfaultfd to be
> always called with flags==0x1 without requiring extra modifications in
> the kernel?

Please no. This is way too much overhead for something that a system
owner wants to enforce globally. A sysctl is the correct option here,
IMO. If it needs to be a per-userns sysctl, that would be fine too.

> Can't you get the feature party with the CAP_SYS_PTRACE capability
> too, if you don't wrap those tasks with the ptrace capability under
> that seccomp filter?
> 
> As far as I can tell, it's unprecedented to create a flag for a
> syscall API, with the only purpose of implementing a seccomp-bpf
> filter verifying such flag is set, but then if you want to control it
> with LSM it's even more complex than doing it with seccomp-bpf, and it
> requires more kernel code too. We could always add 2/2 later, such
> possibility won't disappear, in fact we could also add 1/6-4/6 later
> too if that is not enough.
> 
> If we could begin by merging only 1/2 from this new series and be done
> with the kernel changes, because we offload the rest of the work to
> the kernel eBPF JIT, I think it'd be ideal.

I'd agree that patch 1 should land, as it appears to be required for any
further policy considerations. I'm still a big fan of a sysctl since
this is the kind of thing I would absolutely turn on globally for all my
systems.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
  2020-05-20 18:03         ` Kees Cook
@ 2020-05-20 19:48           ` Andrea Arcangeli
  2020-05-20 19:51             ` Andrea Arcangeli
  0 siblings, 1 reply; 23+ messages in thread
From: Andrea Arcangeli @ 2020-05-20 19:48 UTC (permalink / raw)
  To: Kees Cook
  Cc: Michael S. Tsirkin, Daniel Colascione, Jonathan Corbet,
	Alexander Viro, Luis Chamberlain, Iurii Zaikin,
	Mauro Carvalho Chehab, Andrew Morton, Andy Shevchenko,
	Vlastimil Babka, Mel Gorman, Sebastian Andrzej Siewior, Peter Xu,
	Mike Rapoport, Jerome Glisse, Shaohua Li, linux-doc,
	linux-kernel, linux-fsdevel, timmurray, minchan, sspatil,
	lokeshgidra

Hello Kees,

On Wed, May 20, 2020 at 11:03:39AM -0700, Kees Cook wrote:
> Err, did I miss a separate 6-patch series? I can't find anything on lore.

Daniel included the link of the previous series I referred to is the
cover letter 0/2:

https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/

> > If you keep only 1/2, can't seccomp-bpf enforce userfaultfd to be
> > always called with flags==0x1 without requiring extra modifications in
> > the kernel?
> 
> Please no. This is way too much overhead for something that a system
> owner wants to enforce globally. A sysctl is the correct option here,
> IMO. If it needs to be a per-userns sysctl, that would be fine too.

The question is who could be this system owner who prefers "2" to "0"?

per-ns I don't see the point either when all containers already run
with default policies enforcing the same behavior as if the sysctl is
set to "0".

Why exactly is it preferable to enlarge the surface of attack of the
kernel and take the risk there is a real bug in userfaultfd code (not
just a facilitation of exploiting some other kernel bug) that leads to
a privilege escalation, when you still break 99% of userfaultfd users,
if you set with option "2"?

Is the system owner really going to purely run on his systems CRIU
postcopy live migration (which already runs with CAP_SYS_PTRACE) and
nothing else that could break?

Option "2" to me looks with a single possible user, and incidentally
this single user can already enforce model "2" by only tweaking its
seccomp-bpf filters without applying 2/2. It'd be a bug if android
apps runs unprotected by seccomp regardless of 2/2.

System owners I think would be better of to stick to "0" or "1" a far
as I can tell, "2" looks a bad tradeoff as system value, with nearly
all cons of "0", but less secure than "0".

> I'd agree that patch 1 should land, as it appears to be required for any

Agreed about merging 1/2.

> further policy considerations. I'm still a big fan of a sysctl since
> this is the kind of thing I would absolutely turn on globally for all my
> systems.

The sysctl /proc/sys/kernel/unprivileged_bpf_disabled is already there
upstream and you should have already set it to "0" in all your systems
if you can cope with some app features not working.

2/2 as modified with Peter's suggestion, would add a new value "2" (in
addition of "1" and "0"), that still breaks the majority of all
possible users, just like value "0", but that gives less security than
value "0".

It all boils down of how peculiar it is to be able to leverage only
the acceleration (reduction in context switches and enter/exit kernel)
and the vma fragmentation avoidance (to avoid running of vmas in
/proc/sys/vm/max_map_count) provided by userfaultfd (vs sigsegv
trapping) but not the transparency in handling faults in kernel (which
sigsegv can't possibly achieve).

Right now there's a single user that can cope with that limitation,
and it's not your running on your servers but on your phone. Even CRIU
cannot cope with such limitation, the only reason it would cope with
value "2" is that it already runs with CAP_SYS_PTRACE for other
reasons.

If there will be more users that can cope with handling only user
initiated page faults, then yes, I think it'd be fine to add a value
"2" later.

The other benefit of enforcing the policy with seccomp-bpf if that if
you'd run Android userland on a container on a ARM server on top of an
enterprise kernel, it'd already run as safe as if the sysctl value was
tweaked to "2", but without having to also add extra kernel code for
per-ns sysctl. It just looks simpler. The rest of the containers on
the same host are already running today as if the sysctl is set to 0
regardless of this patchset.

If you want to enforce maximum security and override any possible
opt-out of the default podman seccomp profile that blocks userfaultfd,
you already can upstream with the global sysctl by setting it to "0".

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
  2020-05-20 19:48           ` Andrea Arcangeli
@ 2020-05-20 19:51             ` Andrea Arcangeli
  2020-05-20 20:17               ` Lokesh Gidra
  0 siblings, 1 reply; 23+ messages in thread
From: Andrea Arcangeli @ 2020-05-20 19:51 UTC (permalink / raw)
  To: Kees Cook
  Cc: Michael S. Tsirkin, Daniel Colascione, Jonathan Corbet,
	Alexander Viro, Luis Chamberlain, Iurii Zaikin,
	Mauro Carvalho Chehab, Andrew Morton, Andy Shevchenko,
	Vlastimil Babka, Mel Gorman, Sebastian Andrzej Siewior, Peter Xu,
	Mike Rapoport, Jerome Glisse, Shaohua Li, linux-doc,
	linux-kernel, linux-fsdevel, timmurray, minchan, sspatil,
	lokeshgidra

On Wed, May 20, 2020 at 03:48:04PM -0400, Andrea Arcangeli wrote:
> The sysctl /proc/sys/kernel/unprivileged_bpf_disabled is already there

Oops I picked the wrong unprivileged_* :) of course I meant:
/proc/sys/vm/unprivileged_userfaultfd


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
  2020-05-20 19:51             ` Andrea Arcangeli
@ 2020-05-20 20:17               ` Lokesh Gidra
  2020-05-20 21:16                 ` Andrea Arcangeli
  0 siblings, 1 reply; 23+ messages in thread
From: Lokesh Gidra @ 2020-05-20 20:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Suren Baghdasaryan
  Cc: Kees Cook, Michael S. Tsirkin, Daniel Colascione,
	Jonathan Corbet, Alexander Viro, Luis Chamberlain, Iurii Zaikin,
	Mauro Carvalho Chehab, Andrew Morton, Andy Shevchenko,
	Vlastimil Babka, Mel Gorman, Sebastian Andrzej Siewior, Peter Xu,
	Mike Rapoport, Jerome Glisse, Shaohua Li, linux-doc,
	linux-kernel, linux-fsdevel, Tim Murray, Minchan Kim,
	Sandeep Patil, kernel

Adding the Android kernel team in the discussion.

On Wed, May 20, 2020 at 12:51 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> On Wed, May 20, 2020 at 03:48:04PM -0400, Andrea Arcangeli wrote:
> > The sysctl /proc/sys/kernel/unprivileged_bpf_disabled is already there
>
> Oops I picked the wrong unprivileged_* :) of course I meant:
> /proc/sys/vm/unprivileged_userfaultfd
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
  2020-05-20 20:17               ` Lokesh Gidra
@ 2020-05-20 21:16                 ` Andrea Arcangeli
  2020-07-17 12:57                   ` Jeffrey Vander Stoep
  0 siblings, 1 reply; 23+ messages in thread
From: Andrea Arcangeli @ 2020-05-20 21:16 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: Suren Baghdasaryan, Kees Cook, Michael S. Tsirkin,
	Daniel Colascione, Jonathan Corbet, Alexander Viro,
	Luis Chamberlain, Iurii Zaikin, Mauro Carvalho Chehab,
	Andrew Morton, Andy Shevchenko, Vlastimil Babka, Mel Gorman,
	Sebastian Andrzej Siewior, Peter Xu, Mike Rapoport,
	Jerome Glisse, Shaohua Li, linux-doc, linux-kernel,
	linux-fsdevel, Tim Murray, Minchan Kim, Sandeep Patil, kernel

On Wed, May 20, 2020 at 01:17:20PM -0700, Lokesh Gidra wrote:
> Adding the Android kernel team in the discussion.

Unless I'm mistaken that you can already enforce bit 1 of the second
parameter of the userfaultfd syscall to be set with seccomp-bpf, this
would be more a question to the Android userland team.

The question would be: does it ever happen that a seccomp filter isn't
already applied to unprivileged software running without
SYS_CAP_PTRACE capability?

If answer is "no" the behavior of the new sysctl in patch 2/2 (in
subject) should be enforceable with minor changes to the BPF
assembly. Otherwise it'd require more changes.

Thanks!
Andrea


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
  2020-05-20 21:16                 ` Andrea Arcangeli
@ 2020-07-17 12:57                   ` Jeffrey Vander Stoep
  2020-07-23 17:30                     ` Lokesh Gidra
  0 siblings, 1 reply; 23+ messages in thread
From: Jeffrey Vander Stoep @ 2020-07-17 12:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Lokesh Gidra, Suren Baghdasaryan, Kees Cook, Michael S. Tsirkin,
	Daniel Colascione, Jonathan Corbet, Alexander Viro,
	Luis Chamberlain, Iurii Zaikin, Mauro Carvalho Chehab,
	Andrew Morton, Andy Shevchenko, Vlastimil Babka, Mel Gorman,
	Sebastian Andrzej Siewior, Peter Xu, Mike Rapoport,
	Jerome Glisse, Shaohua Li, linux-doc, LKML, linux-fsdevel,
	Tim Murray, Minchan Kim, Sandeep Patil, kernel

On Wed, May 20, 2020 at 11:17 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> On Wed, May 20, 2020 at 01:17:20PM -0700, Lokesh Gidra wrote:
> > Adding the Android kernel team in the discussion.
>
> Unless I'm mistaken that you can already enforce bit 1 of the second
> parameter of the userfaultfd syscall to be set with seccomp-bpf, this
> would be more a question to the Android userland team.
>
> The question would be: does it ever happen that a seccomp filter isn't
> already applied to unprivileged software running without
> SYS_CAP_PTRACE capability?

Yes.

Android uses selinux as our primary sandboxing mechanism. We do use
seccomp on a few processes, but we have found that it has a
surprisingly high performance cost [1] on arm64 devices so turning it
on system wide is not a good option.

[1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad
>
>
> If answer is "no" the behavior of the new sysctl in patch 2/2 (in
> subject) should be enforceable with minor changes to the BPF
> assembly. Otherwise it'd require more changes.
>
> Thanks!
> Andrea
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
  2020-07-17 12:57                   ` Jeffrey Vander Stoep
@ 2020-07-23 17:30                     ` Lokesh Gidra
  2020-07-24  0:13                       ` Nick Kralevich
  0 siblings, 1 reply; 23+ messages in thread
From: Lokesh Gidra @ 2020-07-23 17:30 UTC (permalink / raw)
  To: Jeffrey Vander Stoep
  Cc: Andrea Arcangeli, Suren Baghdasaryan, Kees Cook,
	Michael S. Tsirkin, Daniel Colascione, Jonathan Corbet,
	Alexander Viro, Luis Chamberlain, Iurii Zaikin,
	Mauro Carvalho Chehab, Andrew Morton, Andy Shevchenko,
	Vlastimil Babka, Mel Gorman, Sebastian Andrzej Siewior, Peter Xu,
	Mike Rapoport, Jerome Glisse, Shaohua Li, linux-doc, LKML,
	Linux FS Devel, Tim Murray, Minchan Kim, Sandeep Patil, kernel,
	Daniel Colascione, Kalesh Singh, Nick Kralevich

Daniel, the original contributor of this patchset, has moved to
another company. Adding his personal email, in case he still wants to
be involved.

From the discussion so far it seems that there is a consensus that
patch 1/2 in this series should be upstreamed in any case. Is there
anything that is pending on that patch?

On Fri, Jul 17, 2020 at 5:57 AM Jeffrey Vander Stoep <jeffv@google.com> wrote:
>
> On Wed, May 20, 2020 at 11:17 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > On Wed, May 20, 2020 at 01:17:20PM -0700, Lokesh Gidra wrote:
> > > Adding the Android kernel team in the discussion.
> >
> > Unless I'm mistaken that you can already enforce bit 1 of the second
> > parameter of the userfaultfd syscall to be set with seccomp-bpf, this
> > would be more a question to the Android userland team.
> >
> > The question would be: does it ever happen that a seccomp filter isn't
> > already applied to unprivileged software running without
> > SYS_CAP_PTRACE capability?
>
> Yes.
>
> Android uses selinux as our primary sandboxing mechanism. We do use
> seccomp on a few processes, but we have found that it has a
> surprisingly high performance cost [1] on arm64 devices so turning it
> on system wide is not a good option.
>
> [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad
> >
> >
> > If answer is "no" the behavior of the new sysctl in patch 2/2 (in
> > subject) should be enforceable with minor changes to the BPF
> > assembly. Otherwise it'd require more changes.
> >
Adding Nick (Jeff is already here) to respond to Andrea's concerns
about adding option '2' to sysctl knob.

> > Thanks!
> > Andrea
> >

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
  2020-07-23 17:30                     ` Lokesh Gidra
@ 2020-07-24  0:13                       ` Nick Kralevich
  2020-07-24 13:40                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 23+ messages in thread
From: Nick Kralevich @ 2020-07-24  0:13 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: Jeffrey Vander Stoep, Andrea Arcangeli, Suren Baghdasaryan,
	Kees Cook, Michael S. Tsirkin, Daniel Colascione,
	Jonathan Corbet, Alexander Viro, Luis Chamberlain, Iurii Zaikin,
	Mauro Carvalho Chehab, Andrew Morton, Andy Shevchenko,
	Vlastimil Babka, Mel Gorman, Sebastian Andrzej Siewior, Peter Xu,
	Mike Rapoport, Jerome Glisse, Shaohua Li, linux-doc, LKML,
	Linux FS Devel, Tim Murray, Minchan Kim, Sandeep Patil, kernel,
	Daniel Colascione, Kalesh Singh

On Thu, Jul 23, 2020 at 10:30 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> From the discussion so far it seems that there is a consensus that
> patch 1/2 in this series should be upstreamed in any case. Is there
> anything that is pending on that patch?

That's my reading of this thread too.

> > > Unless I'm mistaken that you can already enforce bit 1 of the second
> > > parameter of the userfaultfd syscall to be set with seccomp-bpf, this
> > > would be more a question to the Android userland team.
> > >
> > > The question would be: does it ever happen that a seccomp filter isn't
> > > already applied to unprivileged software running without
> > > SYS_CAP_PTRACE capability?
> >
> > Yes.
> >
> > Android uses selinux as our primary sandboxing mechanism. We do use
> > seccomp on a few processes, but we have found that it has a
> > surprisingly high performance cost [1] on arm64 devices so turning it
> > on system wide is not a good option.
> >
> > [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad

As Jeff mentioned, seccomp is used strategically on Android, but is
not applied to all processes. It's too expensive and impractical when
simpler implementations (such as this sysctl) can exist. It's also
significantly simpler to test a sysctl value for correctness as
opposed to a seccomp filter.

> > >
> > >
> > > If answer is "no" the behavior of the new sysctl in patch 2/2 (in
> > > subject) should be enforceable with minor changes to the BPF
> > > assembly. Otherwise it'd require more changes.

It would be good to understand what these changes are.

> > > Why exactly is it preferable to enlarge the surface of attack of the
> > > kernel and take the risk there is a real bug in userfaultfd code (not
> > > just a facilitation of exploiting some other kernel bug) that leads to
> > > a privilege escalation, when you still break 99% of userfaultfd users,
> > > if you set with option "2"?

I can see your point if you think about the feature as a whole.
However, distributions (such as Android) have specialized knowledge of
their security environments, and may not want to support the typical
usages of userfaultfd. For such distributions, providing a mechanism
to prevent userfaultfd from being useful as an exploit primitive,
while still allowing the very limited use of userfaultfd for userspace
faults only, is desirable. Distributions shouldn't be forced into
supporting 100% of the use cases envisioned by userfaultfd when their
needs may be more specialized, and this sysctl knob empowers
distributions to make this choice for themselves.

> > > Is the system owner really going to purely run on his systems CRIU
> > > postcopy live migration (which already runs with CAP_SYS_PTRACE) and
> > > nothing else that could break?

This is a great example of a capability which a distribution may not
want to support, due to distribution specific security policies.

> > >
> > > Option "2" to me looks with a single possible user, and incidentally
> > > this single user can already enforce model "2" by only tweaking its
> > > seccomp-bpf filters without applying 2/2. It'd be a bug if android
> > > apps runs unprotected by seccomp regardless of 2/2.

Can you elaborate on what bug is present by processes being
unprotected by seccomp?

Seccomp cannot be universally applied on Android due to previously
mentioned performance concerns. Seccomp is used in Android primarily
as a tool to enforce the list of allowed syscalls, so that such
syscalls can be audited before being included as part of the Android
API.

-- Nick

-- 
Nick Kralevich | nnk@google.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
  2020-07-24  0:13                       ` Nick Kralevich
@ 2020-07-24 13:40                         ` Michael S. Tsirkin
  0 siblings, 0 replies; 23+ messages in thread
From: Michael S. Tsirkin @ 2020-07-24 13:40 UTC (permalink / raw)
  To: Nick Kralevich
  Cc: Lokesh Gidra, Jeffrey Vander Stoep, Andrea Arcangeli,
	Suren Baghdasaryan, Kees Cook, Daniel Colascione,
	Jonathan Corbet, Alexander Viro, Luis Chamberlain, Iurii Zaikin,
	Mauro Carvalho Chehab, Andrew Morton, Andy Shevchenko,
	Vlastimil Babka, Mel Gorman, Sebastian Andrzej Siewior, Peter Xu,
	Mike Rapoport, Jerome Glisse, Shaohua Li, linux-doc, LKML,
	Linux FS Devel, Tim Murray, Minchan Kim, Sandeep Patil, kernel,
	Daniel Colascione, Kalesh Singh

On Thu, Jul 23, 2020 at 05:13:28PM -0700, Nick Kralevich wrote:
> On Thu, Jul 23, 2020 at 10:30 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > From the discussion so far it seems that there is a consensus that
> > patch 1/2 in this series should be upstreamed in any case. Is there
> > anything that is pending on that patch?
> 
> That's my reading of this thread too.
> 
> > > > Unless I'm mistaken that you can already enforce bit 1 of the second
> > > > parameter of the userfaultfd syscall to be set with seccomp-bpf, this
> > > > would be more a question to the Android userland team.
> > > >
> > > > The question would be: does it ever happen that a seccomp filter isn't
> > > > already applied to unprivileged software running without
> > > > SYS_CAP_PTRACE capability?
> > >
> > > Yes.
> > >
> > > Android uses selinux as our primary sandboxing mechanism. We do use
> > > seccomp on a few processes, but we have found that it has a
> > > surprisingly high performance cost [1] on arm64 devices so turning it
> > > on system wide is not a good option.
> > >
> > > [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad
> 
> As Jeff mentioned, seccomp is used strategically on Android, but is
> not applied to all processes. It's too expensive and impractical when
> simpler implementations (such as this sysctl) can exist. It's also
> significantly simpler to test a sysctl value for correctness as
> opposed to a seccomp filter.

Given that selinux is already used system-wide on Android, what is wrong
with using selinux to control userfaultfd as opposed to seccomp?


> > > >
> > > >
> > > > If answer is "no" the behavior of the new sysctl in patch 2/2 (in
> > > > subject) should be enforceable with minor changes to the BPF
> > > > assembly. Otherwise it'd require more changes.
> 
> It would be good to understand what these changes are.
> 
> > > > Why exactly is it preferable to enlarge the surface of attack of the
> > > > kernel and take the risk there is a real bug in userfaultfd code (not
> > > > just a facilitation of exploiting some other kernel bug) that leads to
> > > > a privilege escalation, when you still break 99% of userfaultfd users,
> > > > if you set with option "2"?
> 
> I can see your point if you think about the feature as a whole.
> However, distributions (such as Android) have specialized knowledge of
> their security environments, and may not want to support the typical
> usages of userfaultfd. For such distributions, providing a mechanism
> to prevent userfaultfd from being useful as an exploit primitive,
> while still allowing the very limited use of userfaultfd for userspace
> faults only, is desirable. Distributions shouldn't be forced into
> supporting 100% of the use cases envisioned by userfaultfd when their
> needs may be more specialized, and this sysctl knob empowers
> distributions to make this choice for themselves.
> 
> > > > Is the system owner really going to purely run on his systems CRIU
> > > > postcopy live migration (which already runs with CAP_SYS_PTRACE) and
> > > > nothing else that could break?
> 
> This is a great example of a capability which a distribution may not
> want to support, due to distribution specific security policies.
> 
> > > >
> > > > Option "2" to me looks with a single possible user, and incidentally
> > > > this single user can already enforce model "2" by only tweaking its
> > > > seccomp-bpf filters without applying 2/2. It'd be a bug if android
> > > > apps runs unprotected by seccomp regardless of 2/2.
> 
> Can you elaborate on what bug is present by processes being
> unprotected by seccomp?
> 
> Seccomp cannot be universally applied on Android due to previously
> mentioned performance concerns. Seccomp is used in Android primarily
> as a tool to enforce the list of allowed syscalls, so that such
> syscalls can be audited before being included as part of the Android
> API.
> 
> -- Nick
> 
> -- 
> Nick Kralevich | nnk@google.com


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/2] Control over userfaultfd kernel-fault handling
  2020-04-23  0:26 [PATCH 0/2] Control over userfaultfd kernel-fault handling Daniel Colascione
  2020-04-23  0:26 ` [PATCH 1/2] Add UFFD_USER_MODE_ONLY Daniel Colascione
  2020-04-23  0:26 ` [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only Daniel Colascione
@ 2020-07-24 14:01 ` Michael S. Tsirkin
  2020-07-24 14:41   ` Lokesh Gidra
  2 siblings, 1 reply; 23+ messages in thread
From: Michael S. Tsirkin @ 2020-07-24 14:01 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Jonathan Corbet, Alexander Viro, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Mauro Carvalho Chehab, Andrew Morton,
	Andy Shevchenko, Vlastimil Babka, Mel Gorman,
	Sebastian Andrzej Siewior, Peter Xu, Andrea Arcangeli,
	Mike Rapoport, Jerome Glisse, Shaohua Li, linux-doc,
	linux-kernel, linux-fsdevel, timmurray, minchan, sspatil,
	lokeshgidra

On Wed, Apr 22, 2020 at 05:26:30PM -0700, Daniel Colascione wrote:
> This small patch series adds a new flag to userfaultfd(2) that allows
> callers to give up the ability to handle user-mode faults with the
> resulting UFFD file object. In then add a new sysctl to require
> unprivileged callers to use this new flag.
> 
> The purpose of this new interface is to decrease the change of an
> unprivileged userfaultfd user taking advantage of userfaultfd to
> enhance security vulnerabilities by lengthening the race window in
> kernel code.

There are other ways to lengthen the race window, such as madvise
MADV_DONTNEED, mmap of fuse files ...
Could the patchset commit log include some discussion about
why these are not the concern please?

Multiple subsystems including vhost have come to rely on
copy from/to user behaving identically to userspace access.

Could the patchset please include discussion on what effect blocking
these will have? E.g. I guess Android doesn't use vhost right now.
Will it want to do it to run VMs in 2021?

Thanks!

> This patch series is split from [1].
> 
> [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/

So in that series, Kees said:
https://lore.kernel.org/lkml/202002112332.BE71455@keescook/#t

What is the threat being solved? (I understand the threat, but detailing
  it in the commit log is important for people who don't know it.)

Could you pls do that?

> Daniel Colascione (2):
>   Add UFFD_USER_MODE_ONLY
>   Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
> 
>  Documentation/admin-guide/sysctl/vm.rst | 13 +++++++++++++
>  fs/userfaultfd.c                        | 18 ++++++++++++++++--
>  include/linux/userfaultfd_k.h           |  1 +
>  include/uapi/linux/userfaultfd.h        |  9 +++++++++
>  kernel/sysctl.c                         |  9 +++++++++
>  5 files changed, 48 insertions(+), 2 deletions(-)
> 
> -- 
> 2.26.2.303.gf8c07b1a785-goog
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] Add UFFD_USER_MODE_ONLY
  2020-04-23  0:26 ` [PATCH 1/2] Add UFFD_USER_MODE_ONLY Daniel Colascione
@ 2020-07-24 14:28   ` Michael S. Tsirkin
  2020-07-24 14:46     ` Lokesh Gidra
  0 siblings, 1 reply; 23+ messages in thread
From: Michael S. Tsirkin @ 2020-07-24 14:28 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Jonathan Corbet, Alexander Viro, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Mauro Carvalho Chehab, Andrew Morton,
	Andy Shevchenko, Vlastimil Babka, Mel Gorman,
	Sebastian Andrzej Siewior, Peter Xu, Andrea Arcangeli,
	Mike Rapoport, Jerome Glisse, Shaohua Li, linux-doc,
	linux-kernel, linux-fsdevel, timmurray, minchan, sspatil,
	lokeshgidra

On Wed, Apr 22, 2020 at 05:26:31PM -0700, Daniel Colascione wrote:
> userfaultfd handles page faults from both user and kernel code.  Add a
> new UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the
> resulting userfaultfd object refuse to handle faults from kernel mode,
> treating these faults as if SIGBUS were always raised, causing the
> kernel code to fail with EFAULT.
> 
> A future patch adds a knob allowing administrators to give some
> processes the ability to create userfaultfd file objects only if they
> pass UFFD_USER_MODE_ONLY, reducing the likelihood that these processes
> will exploit userfaultfd's ability to delay kernel page faults to open
> timing windows for future exploits.
> 
> Signed-off-by: Daniel Colascione <dancol@google.com>

Something to add here is that there is separate work on selinux to
support limiting specific userspace programs to only this type of
userfaultfd.

I also think Kees' comment about documenting what is the threat being solved
including some links to external sources still applies.

Finally, a question:

Is there any way at all to increase security without breaking
the assumption that copy_from_user is the same as userspace read?


As an example of a drastical approach that might solve some issues, how
about allocating some special memory and setting some VMA flag, then
limiting copy from/to user to just this subset of virtual addresses?
We can then do things like pin these pages in RAM, forbid
madvise/userfaultfd for these addresses, etc.

Affected userspace then needs to use a kind of a bounce buffer for any
calls into kernel.  This needs much more support from userspace and adds
much more overhead, but on the flip side, affects more ways userspace
can slow down the kernel.

Was this discussed in the past? Links would be appreciated.


> ---
>  fs/userfaultfd.c                 | 7 ++++++-
>  include/uapi/linux/userfaultfd.h | 9 +++++++++
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index e39fdec8a0b0..21378abe8f7b 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -418,6 +418,9 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>  
>  	if (ctx->features & UFFD_FEATURE_SIGBUS)
>  		goto out;
> +	if ((vmf->flags & FAULT_FLAG_USER) == 0 &&
> +	    ctx->flags & UFFD_USER_MODE_ONLY)
> +		goto out;
>  
>  	/*
>  	 * If it's already released don't get it. This avoids to loop
> @@ -2003,6 +2006,7 @@ static void init_once_userfaultfd_ctx(void *mem)
>  
>  SYSCALL_DEFINE1(userfaultfd, int, flags)
>  {
> +	static const int uffd_flags = UFFD_USER_MODE_ONLY;
>  	struct userfaultfd_ctx *ctx;
>  	int fd;
>  
> @@ -2012,10 +2016,11 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
>  	BUG_ON(!current->mm);
>  
>  	/* Check the UFFD_* constants for consistency.  */
> +	BUILD_BUG_ON(uffd_flags & UFFD_SHARED_FCNTL_FLAGS);
>  	BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
>  	BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
>  
> -	if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
> +	if (flags & ~(UFFD_SHARED_FCNTL_FLAGS | uffd_flags))
>  		return -EINVAL;
>  
>  	ctx = kmem_cache_alloc(userfaultfd_ctx_cachep, GFP_KERNEL);
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index e7e98bde221f..5f2d88212f7c 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -257,4 +257,13 @@ struct uffdio_writeprotect {
>  	__u64 mode;
>  };
>  
> +/*
> + * Flags for the userfaultfd(2) system call itself.
> + */
> +
> +/*
> + * Create a userfaultfd that can handle page faults only in user mode.
> + */
> +#define UFFD_USER_MODE_ONLY 1
> +
>  #endif /* _LINUX_USERFAULTFD_H */
> -- 
> 2.26.2.303.gf8c07b1a785-goog
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/2] Control over userfaultfd kernel-fault handling
  2020-07-24 14:01 ` [PATCH 0/2] Control over userfaultfd kernel-fault handling Michael S. Tsirkin
@ 2020-07-24 14:41   ` Lokesh Gidra
  0 siblings, 0 replies; 23+ messages in thread
From: Lokesh Gidra @ 2020-07-24 14:41 UTC (permalink / raw)
  To: Michael S. Tsirkin, kernel
  Cc: Jonathan Corbet, Alexander Viro, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Mauro Carvalho Chehab, Andrew Morton,
	Andy Shevchenko, Vlastimil Babka, Mel Gorman,
	Sebastian Andrzej Siewior, Peter Xu, Andrea Arcangeli,
	Mike Rapoport, Jerome Glisse, Shaohua Li, linux-doc,
	linux-kernel, Linux FS Devel, Tim Murray, Minchan Kim,
	Sandeep Patil, Nick Kralevich, Jeffrey Vander Stoep,
	Daniel Colascione

On Fri, Jul 24, 2020 at 7:01 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Apr 22, 2020 at 05:26:30PM -0700, Daniel Colascione wrote:
> > This small patch series adds a new flag to userfaultfd(2) that allows
> > callers to give up the ability to handle user-mode faults with the
> > resulting UFFD file object. In then add a new sysctl to require
> > unprivileged callers to use this new flag.
> >
> > The purpose of this new interface is to decrease the change of an
> > unprivileged userfaultfd user taking advantage of userfaultfd to
> > enhance security vulnerabilities by lengthening the race window in
> > kernel code.
>
> There are other ways to lengthen the race window, such as madvise
> MADV_DONTNEED, mmap of fuse files ...
> Could the patchset commit log include some discussion about
> why these are not the concern please?
>
> Multiple subsystems including vhost have come to rely on
> copy from/to user behaving identically to userspace access.
>
> Could the patchset please include discussion on what effect blocking
> these will have? E.g. I guess Android doesn't use vhost right now.
> Will it want to do it to run VMs in 2021?
>
> Thanks!
>
> > This patch series is split from [1].
> >
> > [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
>
> So in that series, Kees said:
> https://lore.kernel.org/lkml/202002112332.BE71455@keescook/#t
>
> What is the threat being solved? (I understand the threat, but detailing
>   it in the commit log is important for people who don't know it.)
>

Adding Android security folks, Nick and Jeff, to answer.

> Could you pls do that?
>
> > Daniel Colascione (2):
> >   Add UFFD_USER_MODE_ONLY
> >   Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
> >
> >  Documentation/admin-guide/sysctl/vm.rst | 13 +++++++++++++
> >  fs/userfaultfd.c                        | 18 ++++++++++++++++--
> >  include/linux/userfaultfd_k.h           |  1 +
> >  include/uapi/linux/userfaultfd.h        |  9 +++++++++
> >  kernel/sysctl.c                         |  9 +++++++++
> >  5 files changed, 48 insertions(+), 2 deletions(-)
> >
> > --
> > 2.26.2.303.gf8c07b1a785-goog
> >
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] Add UFFD_USER_MODE_ONLY
  2020-07-24 14:28   ` Michael S. Tsirkin
@ 2020-07-24 14:46     ` Lokesh Gidra
  2020-07-26 10:09       ` Michael S. Tsirkin
  0 siblings, 1 reply; 23+ messages in thread
From: Lokesh Gidra @ 2020-07-24 14:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jonathan Corbet, Alexander Viro, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Mauro Carvalho Chehab, Andrew Morton,
	Andy Shevchenko, Vlastimil Babka, Mel Gorman,
	Sebastian Andrzej Siewior, Peter Xu, Andrea Arcangeli,
	Mike Rapoport, Jerome Glisse, Shaohua Li, linux-doc,
	linux-kernel, Linux FS Devel, Tim Murray, Minchan Kim,
	Sandeep Patil, Daniel Colascione, Jeffrey Vander Stoep,
	Nick Kralevich, kernel, Kalesh Singh

On Fri, Jul 24, 2020 at 7:28 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Apr 22, 2020 at 05:26:31PM -0700, Daniel Colascione wrote:
> > userfaultfd handles page faults from both user and kernel code.  Add a
> > new UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the
> > resulting userfaultfd object refuse to handle faults from kernel mode,
> > treating these faults as if SIGBUS were always raised, causing the
> > kernel code to fail with EFAULT.
> >
> > A future patch adds a knob allowing administrators to give some
> > processes the ability to create userfaultfd file objects only if they
> > pass UFFD_USER_MODE_ONLY, reducing the likelihood that these processes
> > will exploit userfaultfd's ability to delay kernel page faults to open
> > timing windows for future exploits.
> >
> > Signed-off-by: Daniel Colascione <dancol@google.com>
>
> Something to add here is that there is separate work on selinux to
> support limiting specific userspace programs to only this type of
> userfaultfd.
>
> I also think Kees' comment about documenting what is the threat being solved
> including some links to external sources still applies.
>
> Finally, a question:
>
> Is there any way at all to increase security without breaking
> the assumption that copy_from_user is the same as userspace read?
>
>
> As an example of a drastical approach that might solve some issues, how
> about allocating some special memory and setting some VMA flag, then
> limiting copy from/to user to just this subset of virtual addresses?
> We can then do things like pin these pages in RAM, forbid
> madvise/userfaultfd for these addresses, etc.
>
> Affected userspace then needs to use a kind of a bounce buffer for any
> calls into kernel.  This needs much more support from userspace and adds
> much more overhead, but on the flip side, affects more ways userspace
> can slow down the kernel.
>
> Was this discussed in the past? Links would be appreciated.
>
Adding Nick and Jeff to the discussion.
>
> > ---
> >  fs/userfaultfd.c                 | 7 ++++++-
> >  include/uapi/linux/userfaultfd.h | 9 +++++++++
> >  2 files changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index e39fdec8a0b0..21378abe8f7b 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -418,6 +418,9 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >
> >       if (ctx->features & UFFD_FEATURE_SIGBUS)
> >               goto out;
> > +     if ((vmf->flags & FAULT_FLAG_USER) == 0 &&
> > +         ctx->flags & UFFD_USER_MODE_ONLY)
> > +             goto out;
> >
> >       /*
> >        * If it's already released don't get it. This avoids to loop
> > @@ -2003,6 +2006,7 @@ static void init_once_userfaultfd_ctx(void *mem)
> >
> >  SYSCALL_DEFINE1(userfaultfd, int, flags)
> >  {
> > +     static const int uffd_flags = UFFD_USER_MODE_ONLY;
> >       struct userfaultfd_ctx *ctx;
> >       int fd;
> >
> > @@ -2012,10 +2016,11 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
> >       BUG_ON(!current->mm);
> >
> >       /* Check the UFFD_* constants for consistency.  */
> > +     BUILD_BUG_ON(uffd_flags & UFFD_SHARED_FCNTL_FLAGS);
> >       BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
> >       BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
> >
> > -     if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
> > +     if (flags & ~(UFFD_SHARED_FCNTL_FLAGS | uffd_flags))
> >               return -EINVAL;
> >
> >       ctx = kmem_cache_alloc(userfaultfd_ctx_cachep, GFP_KERNEL);
> > diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> > index e7e98bde221f..5f2d88212f7c 100644
> > --- a/include/uapi/linux/userfaultfd.h
> > +++ b/include/uapi/linux/userfaultfd.h
> > @@ -257,4 +257,13 @@ struct uffdio_writeprotect {
> >       __u64 mode;
> >  };
> >
> > +/*
> > + * Flags for the userfaultfd(2) system call itself.
> > + */
> > +
> > +/*
> > + * Create a userfaultfd that can handle page faults only in user mode.
> > + */
> > +#define UFFD_USER_MODE_ONLY 1
> > +
> >  #endif /* _LINUX_USERFAULTFD_H */
> > --
> > 2.26.2.303.gf8c07b1a785-goog
> >
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] Add UFFD_USER_MODE_ONLY
  2020-07-24 14:46     ` Lokesh Gidra
@ 2020-07-26 10:09       ` Michael S. Tsirkin
  0 siblings, 0 replies; 23+ messages in thread
From: Michael S. Tsirkin @ 2020-07-26 10:09 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: Jonathan Corbet, Alexander Viro, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Mauro Carvalho Chehab, Andrew Morton,
	Andy Shevchenko, Vlastimil Babka, Mel Gorman,
	Sebastian Andrzej Siewior, Peter Xu, Andrea Arcangeli,
	Mike Rapoport, Jerome Glisse, Shaohua Li, linux-doc,
	linux-kernel, Linux FS Devel, Tim Murray, Minchan Kim,
	Sandeep Patil, Daniel Colascione, Jeffrey Vander Stoep,
	Nick Kralevich, kernel, Kalesh Singh

On Fri, Jul 24, 2020 at 07:46:02AM -0700, Lokesh Gidra wrote:
> On Fri, Jul 24, 2020 at 7:28 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, Apr 22, 2020 at 05:26:31PM -0700, Daniel Colascione wrote:
> > > userfaultfd handles page faults from both user and kernel code.  Add a
> > > new UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the
> > > resulting userfaultfd object refuse to handle faults from kernel mode,
> > > treating these faults as if SIGBUS were always raised, causing the
> > > kernel code to fail with EFAULT.
> > >
> > > A future patch adds a knob allowing administrators to give some
> > > processes the ability to create userfaultfd file objects only if they
> > > pass UFFD_USER_MODE_ONLY, reducing the likelihood that these processes
> > > will exploit userfaultfd's ability to delay kernel page faults to open
> > > timing windows for future exploits.
> > >
> > > Signed-off-by: Daniel Colascione <dancol@google.com>
> >
> > Something to add here is that there is separate work on selinux to
> > support limiting specific userspace programs to only this type of
> > userfaultfd.
> >
> > I also think Kees' comment about documenting what is the threat being solved
> > including some links to external sources still applies.
> >
> > Finally, a question:
> >
> > Is there any way at all to increase security without breaking
> > the assumption that copy_from_user is the same as userspace read?
> >
> >
> > As an example of a drastical approach that might solve some issues, how
> > about allocating some special memory and setting some VMA flag, then
> > limiting copy from/to user to just this subset of virtual addresses?
> > We can then do things like pin these pages in RAM, forbid
> > madvise/userfaultfd for these addresses, etc.
> >
> > Affected userspace then needs to use a kind of a bounce buffer for any
> > calls into kernel.  This needs much more support from userspace and adds
> > much more overhead, but on the flip side, affects more ways userspace
> > can slow down the kernel.
> >
> > Was this discussed in the past? Links would be appreciated.
> >
> Adding Nick and Jeff to the discussion.

I guess a valid alternative is to block major faults in copy
to/from user for a given process/group of syscalls. Userspace can mlock
an area it uses for these system calls.

For example, allow BPF/security linux policy block all major faults
until the next syscall.  Yes that would then include userfaultfd.


> >
> > > ---
> > >  fs/userfaultfd.c                 | 7 ++++++-
> > >  include/uapi/linux/userfaultfd.h | 9 +++++++++
> > >  2 files changed, 15 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > index e39fdec8a0b0..21378abe8f7b 100644
> > > --- a/fs/userfaultfd.c
> > > +++ b/fs/userfaultfd.c
> > > @@ -418,6 +418,9 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> > >
> > >       if (ctx->features & UFFD_FEATURE_SIGBUS)
> > >               goto out;
> > > +     if ((vmf->flags & FAULT_FLAG_USER) == 0 &&
> > > +         ctx->flags & UFFD_USER_MODE_ONLY)
> > > +             goto out;
> > >
> > >       /*
> > >        * If it's already released don't get it. This avoids to loop
> > > @@ -2003,6 +2006,7 @@ static void init_once_userfaultfd_ctx(void *mem)
> > >
> > >  SYSCALL_DEFINE1(userfaultfd, int, flags)
> > >  {
> > > +     static const int uffd_flags = UFFD_USER_MODE_ONLY;
> > >       struct userfaultfd_ctx *ctx;
> > >       int fd;
> > >
> > > @@ -2012,10 +2016,11 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
> > >       BUG_ON(!current->mm);
> > >
> > >       /* Check the UFFD_* constants for consistency.  */
> > > +     BUILD_BUG_ON(uffd_flags & UFFD_SHARED_FCNTL_FLAGS);
> > >       BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
> > >       BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
> > >
> > > -     if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
> > > +     if (flags & ~(UFFD_SHARED_FCNTL_FLAGS | uffd_flags))
> > >               return -EINVAL;
> > >
> > >       ctx = kmem_cache_alloc(userfaultfd_ctx_cachep, GFP_KERNEL);
> > > diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> > > index e7e98bde221f..5f2d88212f7c 100644
> > > --- a/include/uapi/linux/userfaultfd.h
> > > +++ b/include/uapi/linux/userfaultfd.h
> > > @@ -257,4 +257,13 @@ struct uffdio_writeprotect {
> > >       __u64 mode;
> > >  };
> > >
> > > +/*
> > > + * Flags for the userfaultfd(2) system call itself.
> > > + */
> > > +
> > > +/*
> > > + * Create a userfaultfd that can handle page faults only in user mode.
> > > + */
> > > +#define UFFD_USER_MODE_ONLY 1
> > > +
> > >  #endif /* _LINUX_USERFAULTFD_H */
> > > --
> > > 2.26.2.303.gf8c07b1a785-goog
> > >
> >


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, back to index

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-23  0:26 [PATCH 0/2] Control over userfaultfd kernel-fault handling Daniel Colascione
2020-04-23  0:26 ` [PATCH 1/2] Add UFFD_USER_MODE_ONLY Daniel Colascione
2020-07-24 14:28   ` Michael S. Tsirkin
2020-07-24 14:46     ` Lokesh Gidra
2020-07-26 10:09       ` Michael S. Tsirkin
2020-04-23  0:26 ` [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only Daniel Colascione
2020-05-06 19:38   ` Peter Xu
2020-05-07 19:15     ` Jonathan Corbet
2020-05-20  4:06       ` Andrea Arcangeli
2020-05-08 16:52   ` Michael S. Tsirkin
2020-05-08 16:54     ` Michael S. Tsirkin
2020-05-20  4:59       ` Andrea Arcangeli
2020-05-20 18:03         ` Kees Cook
2020-05-20 19:48           ` Andrea Arcangeli
2020-05-20 19:51             ` Andrea Arcangeli
2020-05-20 20:17               ` Lokesh Gidra
2020-05-20 21:16                 ` Andrea Arcangeli
2020-07-17 12:57                   ` Jeffrey Vander Stoep
2020-07-23 17:30                     ` Lokesh Gidra
2020-07-24  0:13                       ` Nick Kralevich
2020-07-24 13:40                         ` Michael S. Tsirkin
2020-07-24 14:01 ` [PATCH 0/2] Control over userfaultfd kernel-fault handling Michael S. Tsirkin
2020-07-24 14:41   ` Lokesh Gidra

Linux-Doc Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-doc/0 linux-doc/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-doc linux-doc/ https://lore.kernel.org/linux-doc \
		linux-doc@vger.kernel.org
	public-inbox-index linux-doc

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-doc


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git