From: Mike Rapoport <email@example.com> To: Andrea Arcangeli <firstname.lastname@example.org> Cc: Andy Lutomirski <email@example.com>, Jann Horn <firstname.lastname@example.org>, Daniel Colascione <email@example.com>, Linus Torvalds <firstname.lastname@example.org>, Pavel Emelyanov <email@example.com>, Lokesh Gidra <firstname.lastname@example.org>, Nick Kralevich <email@example.com>, Nosh Minwalla <firstname.lastname@example.org>, Tim Murray <email@example.com>, Mike Rapoport <firstname.lastname@example.org>, Linux API <email@example.com>, LKML <firstname.lastname@example.org> Subject: Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API. Date: Thu, 24 Oct 2019 12:02:59 +0300 Message-ID: <20191024090258.GA9802@linux.ibm.com> (raw) In-Reply-To: <20191023190959.GA9902@redhat.com> On Wed, Oct 23, 2019 at 03:09:59PM -0400, Andrea Arcangeli wrote: > Hello, > > On Sat, Oct 12, 2019 at 06:14:23PM -0700, Andy Lutomirski wrote: > > [adding more people because this is going to be an ABI break, sigh] > > That wouldn't break the ABI, no more than when if you boot a kernel > built with CONFIG_USERFAULTFD=n. > > All non-cooperative features can be removed any time in a backwards > compatible way, the only precaution is to mark their feature bits as > reserved so they can't be reused for something else later. > > > least severely restricted. A .read implementation MUST NOT ACT ON THE > > CALLING TASK. Ever. Just imagine the effect of passing a userfaultfd > > as stdin to a setuid program. > > With UFFD_EVENT_FORK, the newly created uffd that controls the child, > is not passed to the parent nor to the child. Instead it's passed to > the CRIU monitor only, which has to be already running as root and is > fully trusted and acts a hypervisor (despite there is no hypervisor). > > By the time execve runs and any suid bit in the execve'd inode becomes > relevant, well before the new userland executable code can run, the > kernel throws away the "old_mm" controlled by any uffd and all > attached uffds are released as well. > > All I found is your "A .read implementation MUST NOT ACT ON THE > CALLING TASK" as an explanation that something is broken but I need > further clarification. > > Of course I can see you can always open a uffd and pass it to any task > you are going to execve on, but that simply means the suid program > will be able to control you, not the other way around. If you don't > want to be controlled by the next task, no matter if suid or not, just > don't that. What I don't see is how you're going to control the suid > binary from the outside, the suid binary at most will block in the > poll, read and write syscalls and get garbage or write some garbage > and get an error, it won't get signals and it cannot block in any page > fault either, it's not immediately clear what's out of ordinary. > > On Mon, Oct 14, 2019 at 06:04:22PM +0200, Jann Horn wrote: > > FWIW, <https://codesearch.debian.net/search?q=UFFD_FEATURE_EVENT_FORK&literal=1> > > just shows the kernel, kernel selftests, and strace code for decoding > > syscall arguments. CRIU uses it though (probably for postcopy live > > migration / lazy migration?), I guess that code isn't in debian for > > some reason. > > https://criu.org/Userfaultfd#Limitations That's no the reason that UFFD_FEATURE_EVENT_FORK does not show up in Debian code search, CRIU simply is not there. Debian packages CRIU only in experimental and I believe that's not indexed by the code search. As for the limitations, the races were fixed, I just forgot to update the wiki. As for the supported memory types and COW pages, these only affect efficiency of post-copy, but not the correctness. > The CRIU developers did a truly amazing job by making container post > copy live migration work great for a subset of apps, that alone was an > amazing achievement. Is that achievement enough to use post copy live > migration of bare metal containers in production? Unfortunately > probably not and not just in debian. I don't know if anybody is using post-copy migration of containers in production, but I don't think that the reason for that would be technical. IMHO it's more about prevailing perception that there is no need to migrate containers at all, not only with post-copy, and, as the result, slow rate of adoption of container migration in general. > If you're wrong and UFFDIO_EVENT_FORK isn't currently buggy and in > turn it isn't causing further maintenance burden, there is no hurry of > removing them, but in the long term, if none of the non-cooperative > features find its way in production (like it was reasonable to expect > initially), they must be removed from the kernel anyway, not just > UFFD_EVEN_FORK but all non-cooperative features associated with it. ... > On my side, instead of trying to fix whatever issue in > UFFD_EVENT_FORK, I'd prefer to spend my time reviewing the uffd-wp > feature from Peter and the page fault enhancement patchset that Peter > and Linus were discussing. uffd-wp has the potential to drop fork() > from all apps calling fork() only to do an atomic snapshot of their > memory. Replacing fork() also means the uffd manager thread can decide > how much memory to reserve to the snapshot and it can start throttling > waiting for I/O completion if the threshold is exceeded, while fork > COWs cannot throttle and all apps using fork() risk to hit on x2 > memory usage which can become oom-killer material if the memory size > of the process is huge. The side benefit is also that the way > userfaultfd works the fault granularity is entirely in control of > userland (because it's always userland that resolves the fault), it > could decide to use 8k or 16k even if that doesn't match the hardware > page size. That will allow to keep THP on without risking to hit on 2M > cows during the snapshot. Being able to keep THP enabled in nosql db > without hitting on slow 2M COW copies during snapshot, should allow a > further overall performance improvement when the snapshot is not > running than what it is possible today. In a completely different use > case, uffd-wp will also avoid JITs to set a dirty bit every time they > modify any data in memory. It should also be possible to provide the > same soft-dirty information in O(1) instead of O(N). If I remember correctly, there was an intention to deprecate soft-dirty in favor of uffd-wp, which brings us back to the necessity to have non-cooperative uffd because otherwise even pre-copy in CRIU will be broken and that *is* used in production. > Thanks, > Andrea > -- Sincerely yours, Mike.
next prev parent reply index Thread overview: 44+ messages / expand[flat|nested] mbox.gz Atom feed top 2019-10-12 19:15 [PATCH 0/7] Harden userfaultfd Daniel Colascione 2019-10-12 19:15 ` [PATCH 1/7] Add a new flags-accepting interface for anonymous inodes Daniel Colascione 2019-10-14 4:26 ` kbuild test robot 2019-10-14 15:38 ` Jann Horn 2019-10-14 18:15 ` Daniel Colascione 2019-10-14 18:30 ` Jann Horn 2019-10-15 8:08 ` Christoph Hellwig 2019-10-12 19:15 ` [PATCH 2/7] Add a concept of a "secure" anonymous file Daniel Colascione 2019-10-14 3:01 ` kbuild test robot 2019-10-15 8:08 ` Christoph Hellwig 2019-10-12 19:15 ` [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API Daniel Colascione 2019-10-12 23:10 ` Andy Lutomirski 2019-10-13 0:51 ` Daniel Colascione 2019-10-13 1:14 ` Andy Lutomirski 2019-10-13 1:38 ` Daniel Colascione 2019-10-14 16:04 ` Jann Horn 2019-10-23 19:09 ` Andrea Arcangeli 2019-10-23 19:21 ` Andy Lutomirski 2019-10-23 21:16 ` Andrea Arcangeli 2019-10-23 21:25 ` Andy Lutomirski 2019-10-23 22:41 ` Andrea Arcangeli 2019-10-23 23:01 ` Andy Lutomirski 2019-10-23 23:27 ` Andrea Arcangeli 2019-10-23 20:05 ` Daniel Colascione 2019-10-24 0:23 ` Andrea Arcangeli 2019-10-23 20:15 ` Linus Torvalds 2019-10-24 9:02 ` Mike Rapoport [this message] 2019-10-24 15:10 ` Andrea Arcangeli 2019-10-25 20:12 ` Mike Rapoport 2019-10-22 21:27 ` Daniel Colascione 2019-10-23 4:11 ` Andy Lutomirski 2019-10-23 7:29 ` Cyrill Gorcunov 2019-10-23 12:43 ` Mike Rapoport 2019-10-23 17:13 ` Andy Lutomirski 2019-10-12 19:15 ` [PATCH 4/7] Teach SELinux about a new userfaultfd class Daniel Colascione 2019-10-12 23:08 ` Andy Lutomirski 2019-10-13 0:11 ` Daniel Colascione 2019-10-13 0:46 ` Andy Lutomirski 2019-10-12 19:16 ` [PATCH 5/7] Let userfaultfd opt out of handling kernel-mode faults Daniel Colascione 2019-10-12 19:16 ` [PATCH 6/7] Allow users to require UFFD_SECURE Daniel Colascione 2019-10-12 23:12 ` Andy Lutomirski 2019-10-12 19:16 ` [PATCH 7/7] Add a new sysctl for limiting userfaultfd to user mode faults Daniel Colascione 2019-10-16 0:02 ` [PATCH 0/7] Harden userfaultfd James Morris 2019-11-15 15:09 ` Stephen Smalley
Reply instructions: You may reply publically to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20191024090258.GA9802@linux.ibm.com \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
LKML Archive on lore.kernel.org Archives are clonable: git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git # If you have public-inbox 1.1+ installed, you may # initialize and index your mirror using the following commands: public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \ email@example.com public-inbox-index lkml Example config snippet for mirrors Newsgroup available over NNTP: nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel AGPL code for this site: git clone https://public-inbox.org/public-inbox.git