Linux-Doc Archive on
From: Andrea Arcangeli <>
To: Jonathan Corbet <>
Cc: Peter Xu <>,
	Daniel Colascione <>,
	Alexander Viro <>,
	Luis Chamberlain <>,
	Kees Cook <>,
	Iurii Zaikin <>,
	Mauro Carvalho Chehab <>,
	Andrew Morton <>,
	Andy Shevchenko <>,
	Vlastimil Babka <>,
	Mel Gorman <>,
	Sebastian Andrzej Siewior <>,
	Mike Rapoport <>,
	Jerome Glisse <>, Shaohua Li <>
Subject: Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
Date: Wed, 20 May 2020 00:06:08 -0400
Message-ID: <> (raw)
In-Reply-To: <>

Hello Jonathan and everyone,

On Thu, May 07, 2020 at 01:15:03PM -0600, Jonathan Corbet wrote:
> On Wed, 6 May 2020 15:38:16 -0400
> Peter Xu <> wrote:
> > If this is going to be added... I am thinking whether it should be easier to
> > add another value for unprivileged_userfaultfd, rather than a new sysctl. E.g.:
> > 
> >   "0": unprivileged userfaultfd forbidden
> >   "1": unprivileged userfaultfd allowed (both user/kernel faults)
> >   "2": unprivileged userfaultfd allowed (only user faults)
> > 
> > Because after all unprivileged_userfaultfd_user_mode_only will be meaningless
> > (iiuc) if unprivileged_userfaultfd=0.  The default value will also be the same
> > as before ("1") 
> It occurs to me to wonder whether this interface should also let an admin
> block *privileged* user from handling kernel-space faults?  In a
> secure-boot/lockdown setting, this could be a hardening measure that keeps
> a (somewhat) restricted root user from expanding their privilege...?

That's a good question. In my view, if root in lockdown mode can
still run the swapon syscall, set up NFS or other network devices,
and load userland FUSE filesystems or CUSE chardevs, then even if
you prevent userfaultfd from blocking kernel faults, kernel faults
can still be blocked by other means.

That in fact tends to be true also as non root (so regardless of
lockdown settings) since luser can generally load fuse filesystems.
There is no fundamental integrity breakage or privilege escalation
originating in userfaultfd.

The only concern here is about this: "after a new use-after-free is
discovered in some other part of the kernel (not related to
userfaultfd), how easy is it to turn the use-after-free from a mere
DoS into a more concerning privilege escalation?". userfaultfd might
facilitate the exploitation, but even if you remove userfaultfd from
the equation, there's still no guarantee a use-after-free won't
materialize as a privilege escalation by other means.

So to express it another way: unless lockdown (no matter in which
mode) is a weak, probabilistic feature that cannot provide any
guarantee to begin with, the userfaultfd sysctl set to 0|1|2
can't possibly make any difference to it.

The best mitigation for this kind of exploit remains to randomize
all kernel memory allocations, so even if the attacker can block the
fault, when it's unblocked the kernel will pick another page, not the
one the attacker predicted. The attacker then needs to repeat the
race many more times, and hopefully that will DoS and destabilize the
kernel before it can reproduce a privilege escalation. We have many of
those randomization features in the current kernel and it's probably
more important to enable those than to worry about this sysctl value.

One way to have peace of mind against all use-after-free bugs,
regardless of this sysctl value, is to run each pod in a KVM instance;
that's safer than disabling syscalls or kernel features.

The default seccomp profiles of podman already block userfaultfd too,
so there's no need for virt to get extra safety if you use containers:
containers need to explicitly opt-in to enable userfaultfd through the
OCI schema seccomp object. If userfaultfd is being explicitly
whitelisted in the OCI schema of the container, well then you know
there is a good reason for it. As a matter of fact some things are
only possible to achieve with userfaultfd fully enabled.

The big value uffd brings compared to trapping sigsegv is precisely to
be able to handle kernel faults transparently. sigsegv can't do that
because every syscall would return 1) an inconsistent retval and 2) no
fault address along with the retval.
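
To illustrate the SIGSEGV limitation, here is a minimal Python sketch
that calls libc through ctypes. The function name is made up for
illustration, and the off_t width is an assumption noted in the code.
It asks the kernel to read() from a pipe into a PROT_NONE page. No
SIGSEGV reaches userland: the syscall simply fails with EFAULT and
carries no fault address, so a SIGSEGV-based handler never gets the
chance to resolve the fault.

```python
import ctypes
import errno
import mmap
import os

# Call libc directly so the kernel itself touches the protected page.
libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
# Assumption: off_t fits in a C long on this platform.
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.munmap.restype = ctypes.c_int
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
libc.read.restype = ctypes.c_ssize_t
libc.read.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_size_t]

PROT_NONE = 0
MAP_FAILED = ctypes.c_void_p(-1).value


def read_into_inaccessible_page():
    """Hypothetical helper: return (retval, errno) of read() into a PROT_NONE page."""
    addr = libc.mmap(None, mmap.PAGESIZE, PROT_NONE,
                     mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS, -1, 0)
    if addr is None or addr == MAP_FAILED:
        raise OSError(ctypes.get_errno(), "mmap failed")
    rfd, wfd = os.pipe()
    try:
        os.write(wfd, b"x")
        ctypes.set_errno(0)
        # The kernel faults while copying into the page; userland gets
        # no signal, only a failed syscall.
        ret = libc.read(rfd, addr, 1)
        err = ctypes.get_errno()
    finally:
        os.close(rfd)
        os.close(wfd)
        libc.munmap(addr, mmap.PAGESIZE)
    return ret, err


if __name__ == "__main__":
    ret, err = read_into_inaccessible_page()
    print(ret, errno.errorcode.get(err, err))
```

With userfaultfd handling kernel faults, the same read() would instead
block until the monitor resolves the fault and then complete normally.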

The possible future uffd userland users could be: dropping JVM dirty
bit, redis snapshot using pthread_create() instead of fork(),
distributed shared memory on pmem, new malloc() implementation never
taking mmap_sem for writing in the kernel and never modifying any vma
to allocate and free anon memory, etc. I don't think any of them
would work with the sysctl set to "2".
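
For reference, the three-value semantics Peter proposed (quoted at the
top) can be modeled roughly as below. This is only a sketch of the
policy, not kernel code: the function and enum names are made up, and
"privileged" is simplified to a boolean standing in for whatever
capability check the kernel performs. user_mode_only stands for a
caller that passes the UFFD_USER_MODE_ONLY flag from patch 1/2.

```python
from enum import IntEnum


class UffdSysctl(IntEnum):
    """Proposed values of unprivileged_userfaultfd (names are illustrative)."""
    FORBIDDEN = 0         # unprivileged userfaultfd forbidden
    ALLOWED = 1           # allowed, both user and kernel faults
    USER_FAULTS_ONLY = 2  # allowed, only user faults


def may_create_uffd(sysctl: int, privileged: bool, user_mode_only: bool) -> bool:
    """Model of the access decision for creating a userfaultfd."""
    if privileged:
        # The sysctl only restricts unprivileged callers.
        return True
    mode = UffdSysctl(sysctl)
    if mode is UffdSysctl.ALLOWED:
        return True
    if mode is UffdSysctl.USER_FAULTS_ONLY:
        # Only a uffd that cannot handle kernel faults is permitted.
        return user_mode_only
    return False


if __name__ == "__main__":
    for value in UffdSysctl:
        print(value.name,
              may_create_uffd(value, privileged=False, user_mode_only=False),
              may_create_uffd(value, privileged=False, user_mode_only=True))
```

Under this model, an unprivileged user of the use cases listed above
would be refused with the sysctl at "2" unless it can live without
kernel-fault handling.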

The next kernel feature in uffd land that I was discussing with Peter,
is an async uffd event model to further optimize the replacement of
soft-dirty (which uffd already provides in O(1) instead of O(N)), so
the wrprotect fault won't have to block anymore until the uffd async
queue overflows. That also is unlikely to work with the sysctl set to
"2" without adding extra constraints that soft-dirty doesn't currently
have.

It would also be possible to implement the value "2" to work like
/proc/sys/kernel/unprivileged_bpf_disabled, so when you set it to "1"
as root, you can't set it to "2" or "0" and when you set it to "2" you
can't set it to "0", but personally I think it's unnecessary.
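
The one-way behaviour described in the previous paragraph could be
modeled like this. Again a hypothetical sketch and not kernel code: the
function name is made up, and treating a rewrite of the current value
as allowed is my assumption.

```python
VALID_VALUES = (0, 1, 2)


def transition_allowed(current: int, new: int) -> bool:
    """Model of a lock-style sysctl: may root change it from current to new?"""
    if current not in VALID_VALUES or new not in VALID_VALUES:
        raise ValueError("sysctl value must be 0, 1 or 2")
    if new == current:
        # Assumption: rewriting the same value is a harmless no-op.
        return True
    if current == 1:
        # Once set to 1, it can be changed to neither 2 nor 0.
        return False
    if current == 2:
        # Once set to 2, it cannot go back to 0.
        return new != 0
    # From 0, any value can still be chosen.
    return True


if __name__ == "__main__":
    for cur in VALID_VALUES:
        print(cur, [new for new in VALID_VALUES if transition_allowed(cur, new)])
```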


Thread overview: 29+ messages
2020-04-23  0:26 [PATCH 0/2] Control over userfaultfd kernel-fault handling Daniel Colascione
2020-04-23  0:26 ` [PATCH 1/2] Add UFFD_USER_MODE_ONLY Daniel Colascione
2020-07-24 14:28   ` Michael S. Tsirkin
2020-07-24 14:46     ` Lokesh Gidra
2020-07-26 10:09       ` Michael S. Tsirkin
2020-04-23  0:26 ` [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only Daniel Colascione
2020-05-06 19:38   ` Peter Xu
2020-05-07 19:15     ` Jonathan Corbet
2020-05-20  4:06       ` Andrea Arcangeli [this message]
2020-05-08 16:52   ` Michael S. Tsirkin
2020-05-08 16:54     ` Michael S. Tsirkin
2020-05-20  4:59       ` Andrea Arcangeli
2020-05-20 18:03         ` Kees Cook
2020-05-20 19:48           ` Andrea Arcangeli
2020-05-20 19:51             ` Andrea Arcangeli
2020-05-20 20:17               ` Lokesh Gidra
2020-05-20 21:16                 ` Andrea Arcangeli
2020-07-17 12:57                   ` Jeffrey Vander Stoep
2020-07-23 17:30                     ` Lokesh Gidra
2020-07-24  0:13                       ` Nick Kralevich
2020-07-24 13:40                         ` Michael S. Tsirkin
2020-08-06  0:43                           ` Nick Kralevich
2020-08-06  5:44                             ` Michael S. Tsirkin
2020-08-17 22:11                               ` Lokesh Gidra
2020-09-04  3:34                                 ` Andrea Arcangeli
2020-09-05  0:36                                   ` Lokesh Gidra
2020-09-19 18:14                                     ` Nick Kralevich
2020-07-24 14:01 ` [PATCH 0/2] Control over userfaultfd kernel-fault handling Michael S. Tsirkin
2020-07-24 14:41   ` Lokesh Gidra
