LKML Archive on lore.kernel.org
 help / color / Atom feed
From: Mike Rapoport <rppt@linux.ibm.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>, Jann Horn <jannh@google.com>,
	Daniel Colascione <dancol@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Pavel Emelyanov <xemul@virtuozzo.com>,
	Lokesh Gidra <lokeshgidra@google.com>,
	Nick Kralevich <nnk@google.com>, Nosh Minwalla <nosh@google.com>,
	Tim Murray <timmurray@google.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Linux API <linux-api@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.
Date: Thu, 24 Oct 2019 12:02:59 +0300
Message-ID: <20191024090258.GA9802@linux.ibm.com> (raw)
In-Reply-To: <20191023190959.GA9902@redhat.com>

On Wed, Oct 23, 2019 at 03:09:59PM -0400, Andrea Arcangeli wrote:
> Hello,
> 
> On Sat, Oct 12, 2019 at 06:14:23PM -0700, Andy Lutomirski wrote:
> > [adding more people because this is going to be an ABI break, sigh]
> 
> That wouldn't break the ABI, no more than when if you boot a kernel
> built with CONFIG_USERFAULTFD=n.
> 
> All non-cooperative features can be removed any time in a backwards
> compatible way, the only precaution is to mark their feature bits as
> reserved so they can't be reused for something else later.
> 
> > least severely restricted.  A .read implementation MUST NOT ACT ON THE
> > CALLING TASK.  Ever.  Just imagine the effect of passing a userfaultfd
> > as stdin to a setuid program.
> 
> With UFFD_EVENT_FORK, the newly created uffd that controls the child,
> is not passed to the parent nor to the child. Instead it's passed to
> the CRIU monitor only, which has to be already running as root and is
> fully trusted and acts a hypervisor (despite there is no hypervisor).
> 
> By the time execve runs and any suid bit in the execve'd inode becomes
> relevant, well before the new userland executable code can run, the
> kernel throws away the "old_mm" controlled by any uffd and all
> attached uffds are released as well.
> 
> All I found is your "A .read implementation MUST NOT ACT ON THE
> CALLING TASK" as an explanation that something is broken but I need
> further clarification.
> 
> Of course I can see you can always open a uffd and pass it to any task
> you are going to execve on, but that simply means the suid program
> will be able to control you, not the other way around. If you don't
> want to be controlled by the next task, no matter if suid or not, just
> don't that. What I don't see is how you're going to control the suid
> binary from the outside, the suid binary at most will block in the
> poll, read and write syscalls and get garbage or write some garbage
> and get an error, it won't get signals and it cannot block in any page
> fault either, it's not immediately clear what's out of ordinary.
> 
> On Mon, Oct 14, 2019 at 06:04:22PM +0200, Jann Horn wrote:
> > FWIW, <https://codesearch.debian.net/search?q=UFFD_FEATURE_EVENT_FORK&literal=1>
> > just shows the kernel, kernel selftests, and strace code for decoding
> > syscall arguments. CRIU uses it though (probably for postcopy live
> > migration / lazy migration?), I guess that code isn't in debian for
> > some reason.
> 
> https://criu.org/Userfaultfd#Limitations

That's no the reason that UFFD_FEATURE_EVENT_FORK does not show up in
Debian code search, CRIU simply is not there. Debian packages CRIU only in
experimental and I believe that's not indexed by the code search.

As for the limitations, the races were fixed, I just forgot to update the
wiki. As for the supported memory types and COW pages, these only affect
efficiency of post-copy, but not the correctness.
 
> The CRIU developers did a truly amazing job by making container post
> copy live migration work great for a subset of apps, that alone was an
> amazing achievement. Is that achievement enough to use post copy live
> migration of bare metal containers in production? Unfortunately
> probably not and not just in debian.
 
I don't know if anybody is using post-copy migration of containers in
production, but I don't think that the reason for that would be technical.
IMHO it's more about prevailing perception that there is no need to migrate
containers at all, not only with post-copy, and, as the result, slow rate
of adoption of container migration in general.

> If you're wrong and UFFDIO_EVENT_FORK isn't currently buggy and in
> turn it isn't causing further maintenance burden, there is no hurry of
> removing them, but in the long term, if none of the non-cooperative
> features find its way in production (like it was reasonable to expect
> initially), they must be removed from the kernel anyway, not just
> UFFD_EVEN_FORK but all non-cooperative features associated with it.

... 
 
> On my side, instead of trying to fix whatever issue in
> UFFD_EVENT_FORK, I'd prefer to spend my time reviewing the uffd-wp
> feature from Peter and the page fault enhancement patchset that Peter
> and Linus were discussing. uffd-wp has the potential to drop fork()
> from all apps calling fork() only to do an atomic snapshot of their
> memory. Replacing fork() also means the uffd manager thread can decide
> how much memory to reserve to the snapshot and it can start throttling
> waiting for I/O completion if the threshold is exceeded, while fork
> COWs cannot throttle and all apps using fork() risk to hit on x2
> memory usage which can become oom-killer material if the memory size
> of the process is huge. The side benefit is also that the way
> userfaultfd works the fault granularity is entirely in control of
> userland (because it's always userland that resolves the fault), it
> could decide to use 8k or 16k even if that doesn't match the hardware
> page size. That will allow to keep THP on without risking to hit on 2M
> cows during the snapshot. Being able to keep THP enabled in nosql db
> without hitting on slow 2M COW copies during snapshot, should allow a
> further overall performance improvement when the snapshot is not
> running than what it is possible today. In a completely different use
> case, uffd-wp will also avoid JITs to set a dirty bit every time they
> modify any data in memory. It should also be possible to provide the
> same soft-dirty information in O(1) instead of O(N).

If I remember correctly, there was an intention to deprecate soft-dirty in
favor of uffd-wp, which brings us back to the necessity to have
non-cooperative uffd because otherwise even pre-copy in CRIU will be broken
and that *is* used in production.
 
> Thanks,
> Andrea
> 

-- 
Sincerely yours,
Mike.


  parent reply index

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-12 19:15 [PATCH 0/7] Harden userfaultfd Daniel Colascione
2019-10-12 19:15 ` [PATCH 1/7] Add a new flags-accepting interface for anonymous inodes Daniel Colascione
2019-10-14  4:26   ` kbuild test robot
2019-10-14 15:38   ` Jann Horn
2019-10-14 18:15     ` Daniel Colascione
2019-10-14 18:30       ` Jann Horn
2019-10-15  8:08   ` Christoph Hellwig
2019-10-12 19:15 ` [PATCH 2/7] Add a concept of a "secure" anonymous file Daniel Colascione
2019-10-14  3:01   ` kbuild test robot
2019-10-15  8:08   ` Christoph Hellwig
2019-10-12 19:15 ` [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API Daniel Colascione
2019-10-12 23:10   ` Andy Lutomirski
2019-10-13  0:51     ` Daniel Colascione
2019-10-13  1:14       ` Andy Lutomirski
2019-10-13  1:38         ` Daniel Colascione
2019-10-14 16:04         ` Jann Horn
2019-10-23 19:09           ` Andrea Arcangeli
2019-10-23 19:21             ` Andy Lutomirski
2019-10-23 21:16               ` Andrea Arcangeli
2019-10-23 21:25                 ` Andy Lutomirski
2019-10-23 22:41                   ` Andrea Arcangeli
2019-10-23 23:01                     ` Andy Lutomirski
2019-10-23 23:27                       ` Andrea Arcangeli
2019-10-23 20:05             ` Daniel Colascione
2019-10-24  0:23               ` Andrea Arcangeli
2019-10-23 20:15             ` Linus Torvalds
2019-10-24  9:02             ` Mike Rapoport [this message]
2019-10-24 15:10               ` Andrea Arcangeli
2019-10-25 20:12                 ` Mike Rapoport
2019-10-22 21:27         ` Daniel Colascione
2019-10-23  4:11         ` Andy Lutomirski
2019-10-23  7:29           ` Cyrill Gorcunov
2019-10-23 12:43             ` Mike Rapoport
2019-10-23 17:13               ` Andy Lutomirski
2019-10-12 19:15 ` [PATCH 4/7] Teach SELinux about a new userfaultfd class Daniel Colascione
2019-10-12 23:08   ` Andy Lutomirski
2019-10-13  0:11     ` Daniel Colascione
2019-10-13  0:46       ` Andy Lutomirski
2019-10-12 19:16 ` [PATCH 5/7] Let userfaultfd opt out of handling kernel-mode faults Daniel Colascione
2019-10-12 19:16 ` [PATCH 6/7] Allow users to require UFFD_SECURE Daniel Colascione
2019-10-12 23:12   ` Andy Lutomirski
2019-10-12 19:16 ` [PATCH 7/7] Add a new sysctl for limiting userfaultfd to user mode faults Daniel Colascione
2019-10-16  0:02 ` [PATCH 0/7] Harden userfaultfd James Morris
2019-11-15 15:09 ` Stephen Smalley

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191024090258.GA9802@linux.ibm.com \
    --to=rppt@linux.ibm.com \
    --cc=aarcange@redhat.com \
    --cc=dancol@google.com \
    --cc=jannh@google.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lokeshgidra@google.com \
    --cc=luto@kernel.org \
    --cc=nnk@google.com \
    --cc=nosh@google.com \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=timmurray@google.com \
    --cc=torvalds@linux-foundation.org \
    --cc=xemul@virtuozzo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git