Linux-Doc Archive on
 help / color / Atom feed
From: Andrea Arcangeli <>
To: Kees Cook <>
Cc: "Michael S. Tsirkin" <>,
	Daniel Colascione <>,
	Jonathan Corbet <>,
	Alexander Viro <>,
	Luis Chamberlain <>,
	Iurii Zaikin <>,
	Mauro Carvalho Chehab <>,
	Andrew Morton <>,
	Andy Shevchenko <>,
	Vlastimil Babka <>,
	Mel Gorman <>,
	Sebastian Andrzej Siewior <>,
	Peter Xu <>, Mike Rapoport <>,
	Jerome Glisse <>, Shaohua Li <>,,,,,,,
Subject: Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
Date: Wed, 20 May 2020 15:48:04 -0400
Message-ID: <> (raw)
In-Reply-To: <202005200921.2BD5A0ADD@keescook>

Hello Kees,

On Wed, May 20, 2020 at 11:03:39AM -0700, Kees Cook wrote:
> Err, did I miss a separate 6-patch series? I can't find anything on lore.

Daniel included the link of the previous series I referred to is the
cover letter 0/2:

> > If you keep only 1/2, can't seccomp-bpf enforce userfaultfd to be
> > always called with flags==0x1 without requiring extra modifications in
> > the kernel?
> Please no. This is way too much overhead for something that a system
> owner wants to enforce globally. A sysctl is the correct option here,
> IMO. If it needs to be a per-userns sysctl, that would be fine too.

The question is who could be this system owner who prefers "2" to "0"?

per-ns I don't see the point either when all containers already run
with default policies enforcing the same behavior as if the sysctl is
set to "0".

Why exactly is it preferable to enlarge the surface of attack of the
kernel and take the risk there is a real bug in userfaultfd code (not
just a facilitation of exploiting some other kernel bug) that leads to
a privilege escalation, when you still break 99% of userfaultfd users,
if you set with option "2"?

Is the system owner really going to purely run on his systems CRIU
postcopy live migration (which already runs with CAP_SYS_PTRACE) and
nothing else that could break?

Option "2" to me looks with a single possible user, and incidentally
this single user can already enforce model "2" by only tweaking its
seccomp-bpf filters without applying 2/2. It'd be a bug if android
apps runs unprotected by seccomp regardless of 2/2.

System owners I think would be better of to stick to "0" or "1" a far
as I can tell, "2" looks a bad tradeoff as system value, with nearly
all cons of "0", but less secure than "0".

> I'd agree that patch 1 should land, as it appears to be required for any

Agreed about merging 1/2.

> further policy considerations. I'm still a big fan of a sysctl since
> this is the kind of thing I would absolutely turn on globally for all my
> systems.

The sysctl /proc/sys/kernel/unprivileged_bpf_disabled is already there
upstream and you should have already set it to "0" in all your systems
if you can cope with some app features not working.

2/2 as modified with Peter's suggestion, would add a new value "2" (in
addition of "1" and "0"), that still breaks the majority of all
possible users, just like value "0", but that gives less security than
value "0".

It all boils down of how peculiar it is to be able to leverage only
the acceleration (reduction in context switches and enter/exit kernel)
and the vma fragmentation avoidance (to avoid running of vmas in
/proc/sys/vm/max_map_count) provided by userfaultfd (vs sigsegv
trapping) but not the transparency in handling faults in kernel (which
sigsegv can't possibly achieve).

Right now there's a single user that can cope with that limitation,
and it's not your running on your servers but on your phone. Even CRIU
cannot cope with such limitation, the only reason it would cope with
value "2" is that it already runs with CAP_SYS_PTRACE for other

If there will be more users that can cope with handling only user
initiated page faults, then yes, I think it'd be fine to add a value
"2" later.

The other benefit of enforcing the policy with seccomp-bpf if that if
you'd run Android userland on a container on a ARM server on top of an
enterprise kernel, it'd already run as safe as if the sysctl value was
tweaked to "2", but without having to also add extra kernel code for
per-ns sysctl. It just looks simpler. The rest of the containers on
the same host are already running today as if the sysctl is set to 0
regardless of this patchset.

If you want to enforce maximum security and override any possible
opt-out of the default podman seccomp profile that blocks userfaultfd,
you already can upstream with the global sysctl by setting it to "0".


  reply index

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-23  0:26 [PATCH 0/2] Control over userfaultfd kernel-fault handling Daniel Colascione
2020-04-23  0:26 ` [PATCH 1/2] Add UFFD_USER_MODE_ONLY Daniel Colascione
2020-07-24 14:28   ` Michael S. Tsirkin
2020-07-24 14:46     ` Lokesh Gidra
2020-07-26 10:09       ` Michael S. Tsirkin
2020-04-23  0:26 ` [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only Daniel Colascione
2020-05-06 19:38   ` Peter Xu
2020-05-07 19:15     ` Jonathan Corbet
2020-05-20  4:06       ` Andrea Arcangeli
2020-05-08 16:52   ` Michael S. Tsirkin
2020-05-08 16:54     ` Michael S. Tsirkin
2020-05-20  4:59       ` Andrea Arcangeli
2020-05-20 18:03         ` Kees Cook
2020-05-20 19:48           ` Andrea Arcangeli [this message]
2020-05-20 19:51             ` Andrea Arcangeli
2020-05-20 20:17               ` Lokesh Gidra
2020-05-20 21:16                 ` Andrea Arcangeli
2020-07-17 12:57                   ` Jeffrey Vander Stoep
2020-07-23 17:30                     ` Lokesh Gidra
2020-07-24  0:13                       ` Nick Kralevich
2020-07-24 13:40                         ` Michael S. Tsirkin
2020-08-06  0:43                           ` Nick Kralevich
2020-08-06  5:44                             ` Michael S. Tsirkin
2020-08-17 22:11                               ` Lokesh Gidra
2020-09-04  3:34                                 ` Andrea Arcangeli
2020-09-05  0:36                                   ` Lokesh Gidra
2020-09-19 18:14                                     ` Nick Kralevich
2020-07-24 14:01 ` [PATCH 0/2] Control over userfaultfd kernel-fault handling Michael S. Tsirkin
2020-07-24 14:41   ` Lokesh Gidra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Doc Archive on

Archives are clonable:
	git clone --mirror linux-doc/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-doc linux-doc/ \
	public-inbox-index linux-doc

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone