linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: David Hildenbrand <david@redhat.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Andrea Arcangeli <aarcange@redhat.com>,
	Martin Cracauer <cracauer@cons.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	Johannes Weiner <hannes@cmpxchg.org>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
	Bobby Powers <bobbypowers@gmail.com>,
	Maya Gokhale <gokhale2@llnl.gov>,
	Jerome Glisse <jglisse@redhat.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Matthew Wilcox <willy@infradead.org>,
	Marty McFadden <mcfadden8@llnl.gov>, Mel Gorman <mgorman@suse.de>,
	Hugh Dickins <hughd@google.com>,
	Brian Geffon <bgeffon@google.com>,
	Denis Plotnikov <dplotnikov@virtuozzo.com>,
	Pavel Emelyanov <xemul@virtuozzo.com>
Subject: Re: [PATCH RESEND v6 00/16] mm: Page fault enhancements
Date: Sat, 7 Mar 2020 16:47:43 -0500	[thread overview]
Message-ID: <20200307214743.GA4206@xz-x1> (raw)
In-Reply-To: <1eb7bdd4-348f-da87-47a1-0b022b70e918@redhat.com>

On Sat, Mar 07, 2020 at 09:33:08PM +0100, David Hildenbrand wrote:
> On 20.02.20 16:53, Peter Xu wrote:
> > [Resend v6]
> > 
> > This is v6 of the series.  It is majorly a rebase to 5.6-rc2, nothing
> > else to be expected (plus some tests after the rebase).  Instead of
> > rewrite the cover letter I decided to use what we have for v5.
> > 
> > Adding extra CCs for both Bobby Powers <bobbypowers@gmail.com> and
> > Brian Geffon <bgeffon@google.com>.
> > 
> > Online repo: https://github.com/xzpeter/linux/tree/mm-pf-signal-retry
> > 
> > Any review comment is appreciated.  Thanks,
> 
> If I am not completely missing something (and all my testing today was
> wrong) there is a very simple reason why I *LOVE* this series and it made
> my weekend. It makes userfaultfd with concurrent discarding (e.g.,
> MADV_DONTNEED) of pages actually usable.

Hi, David,

Thanks for doing further test against this branch!

> 
> The issue in current code is that between placing a page and waking
> up a waiter, somebody can zap the new placed page and trigger
> re-fault, triggering a SIGBUS and crashing an application where all
> memory is supposed to be accessible. And there is no real way to protect
> from that, because when the fault handler will be woken up and retry
> is not deterministic (e.g., making madvise(MADV_DONTNEED) and
> UFFDIO_ZEROPAGE mutually exclusive does not help).
> 
> Find a simple reproducer at the end of this mail.
> 
> Before this series:
> [root@localhost ~]# ./a.out 
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> [   34.849604] FAULT_FLAG_ALLOW_RETRY missing 70
> [   34.850466] CPU: 1 PID: 651 Comm: a.out Not tainted 5.6.0-rc2+ #92
> [   34.851525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
> [   34.852818] Call Trace:
> [   34.853045]  dump_stack+0x8f/0xd0
> [   34.853338]  handle_userfault.cold+0x1a/0x2e
> [   34.853704]  ? find_held_lock+0x2b/0x80
> [   34.854031]  ? __handle_mm_fault+0x18c5/0x1900
> [   34.854409]  __handle_mm_fault+0x18d4/0x1900
> [   34.854784]  handle_mm_fault+0x169/0x360
> [   34.855120]  do_user_addr_fault+0x20d/0x490
> [   34.855478]  async_page_fault+0x43/0x50
> [   34.855809] RIP: 0033:0x401659
> [   34.856069] Code: ba 1f 00 00 00 be 01 00 00 00 bf 10 21 40 00 e8 ad fa ff ff bf ff ff ff ff e8 93 fa ff ff 48 8b8
> [   34.857629] RSP: 002b:00007ffcfd536ec0 EFLAGS: 00010246
> [   34.858076] RAX: 00007fcba86a4000 RBX: 0000000000000000 RCX: 00007fcba85784ef
> [   34.858675] RDX: 00007fcba86a4007 RSI: 00000000016524e0 RDI: 00007fcba864b320
> [   34.859272] RBP: 00007ffcfd536f20 R08: 000000000000000a R09: 0000000000000070
> [   34.859876] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000401120
> [   34.860472] R13: 00007ffcfd537000 R14: 0000000000000000 R15: 0000000000000000
> 
> After this series:
> Well, "Progress!" all day long.

Yes, IIUC the race can happen like this in your below test:

     main thread          uffd thread             disgard thread
     ===========          ===========             ==============
     access page
       uffd page fault
         wait for page
                          UFFDIO_ZEROCOPY
                            put a page P there
                                                  MADV_DONTNEED on P
                            wakeup main thread
         return from fault
       page still missing
       uffd page fault again
       (without ALLOW_RETRY)
       --> SIGBUS.

And I agree this should be one if the major issues that this series
wants to address.

> 
> 
> Can we please have a way to identify that this "feature" is available?
> I'd appreciate a new read-only UFFD_FEAT_ , so we can detect this from
> user space easily and use concurrent discards without crashing our applications.

I'm not sure how others think about it, but to me this still fells
into the bucket of "solving an existing problem" rather than a
feature.  Also note that this should change the behavior for the page
fault logic in general, rather than an uffd-only change. So I'm also
not sure whether UFFD_FEAT_* suites here even if we want it.

> 
> 
> Questions:
> 1. I assume KVM will do multiple retries as well, and have the same behavior, right?

Yes I believe so, or we'll need to fix it.

> 
> 2. What will happen if I don't place a page on a pagefault, but only do a UFFDIO_WAKE?
>    For now we were able to trigger a signal this way.

If I'm not mistaken the UFFDIO_WAKE will directly trigger the sigbus
even without the help of the MADV_DONTNEED race.

> If the behavior is changed, can
>    we make this configurable via a UFFD_FEAT?

I'll still think that could be an overkill, but I'll leave the
discussion to the experts.

Thanks,

> 
> --- snip ---
> #include <string.h>
> #include <stdbool.h>
> #include <stdint.h>
> #include <sys/types.h>
> #include <stdio.h>
> #include <pthread.h>
> #include <errno.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <fcntl.h>
> #include <poll.h>
> #include <linux/userfaultfd.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <sys/ioctl.h>
> 
> static int page_size;
> 
> static void *fault_handler_thread(void *arg)
> {
>     const long uffd = (long) arg;
>     struct pollfd pollfd = {
>         .fd = uffd,
>         .events = POLLIN,
>     };
>     int ret;
> 
>     while (true) {
>         struct uffdio_zeropage zeropage = {};
>         struct uffd_msg msg;
>         ssize_t nread;
> 
>         if (poll(&pollfd, 1, -1) == -1) {
>             fprintf(stderr, "POLL failed: %s\n", strerror(errno));
>             exit(-1);
>         }
>         if (read(uffd, &msg, sizeof(msg)) != sizeof(msg)) {
>             fprintf(stderr, "READ failed\n");
>             exit(-1);
>         }
>         if (msg.event != UFFD_EVENT_PAGEFAULT) {
>             fprintf(stderr, "Not UFFD_EVENT_PAGEFAULT\n");
>             exit(-1);
>         }
> 
>         zeropage.range.start = msg.arg.pagefault.address;
>         zeropage.range.len = page_size;
>         do {
>             ret = ioctl(uffd, UFFDIO_ZEROPAGE, &zeropage);
>             if (ret && errno != EAGAIN) {
>                 fprintf(stderr, "UFFDIO_ZEROPAGE failed:%s\n", strerror(errno));
>                 exit(-1);
>             }
>         } while (ret);
>     }
> }
> static void *discard_thread(void *arg)
> {
>     while (true) {
>         if (madvise(arg, page_size, MADV_DONTNEED)) {
>             fprintf(stderr, "MADV_DONTNEED failed:%s\n", strerror(errno));
>             exit(-1);
>         }
>         usleep(1000);
>     }
> }
> 
> int main(void)
> {
>     struct uffdio_register reg;
>     struct uffdio_api api = {
>         .api = UFFD_API,
>     };
>     pthread_t fault, discard;
>     long uffd;
>     char *area;
> 
>     page_size = sysconf(_SC_PAGE_SIZE);
> 
>     uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
>     if (uffd == -1) {
>         fprintf(stderr, "Could not create uffd: %s\n", strerror(errno));
>         exit(-1);
>     }
>     if (ioctl(uffd, UFFDIO_API, &api) == -1) {
>         fprintf(stderr, "UFFDIO_API failed: %s\n", strerror(errno));
>         exit(-1);
>     }
> 
>     area = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
>                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>     if (area == MAP_FAILED) {
>         fprintf(stderr, "Could not allocate memory");
>         exit(-1);
>     }
> 
>     reg.range.start = (uint64_t) area;
>     reg.range.len = page_size,
>     reg.mode = UFFDIO_REGISTER_MODE_MISSING;
>     if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1) {
>         fprintf(stderr, "UFFDIO_REGISTER failed: %s\n", strerror(errno));
>         exit(-1);
>     }
> 
>     /* thread to provide zeropages */
>     if (pthread_create(&fault, NULL, fault_handler_thread,
>                        (void *) uffd)) {
>         fprintf(stderr, "Could not create fault handling thread");
>         exit(-1);
>     }
> 
>     /* thread to discard the page */
>     if (pthread_create(&discard, NULL, discard_thread,
>                        (void *) area)) {
>         fprintf(stderr, "Could not create discard thread");
>         exit(-1);
>     }
> 
>     /* keep reading/writing the page */
>     while (true) {
>         area[7] = area[1];
>         usleep(10000);
>         printf("Progress!\n");
>     }
>     return 0;
> }
> 
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 

-- 
Peter Xu


  reply	other threads:[~2020-03-07 21:47 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
2020-02-20 15:53 ` [PATCH RESEND v6 01/16] mm/gup: Rename "nonblocking" to "locked" where proper Peter Xu
2020-02-20 15:53 ` [PATCH RESEND v6 02/16] mm/gup: Fix __get_user_pages() on fault retry of hugetlb Peter Xu
2020-03-02 19:02   ` David Hildenbrand
2020-03-02 20:07     ` Peter Xu
2020-03-02 20:22       ` David Hildenbrand
2020-02-20 15:53 ` [PATCH RESEND v6 03/16] mm: Introduce fault_signal_pending() Peter Xu
2020-03-02 19:04   ` David Hildenbrand
2020-02-20 15:53 ` [PATCH RESEND v6 04/16] x86/mm: Use helper fault_signal_pending() Peter Xu
2020-02-20 15:58 ` [PATCH RESEND v6 05/16] arc/mm: " Peter Xu
2020-02-20 15:59 ` [PATCH RESEND v6 06/16] arm64/mm: " Peter Xu
2020-02-20 16:02 ` [PATCH RESEND v6 07/16] powerpc/mm: " Peter Xu
2020-02-20 16:02 ` [PATCH RESEND v6 08/16] sh/mm: " Peter Xu
2020-02-20 16:02 ` [PATCH RESEND v6 09/16] mm: Return faster for non-fatal signals in user mode faults Peter Xu
2020-02-20 16:02 ` [PATCH RESEND v6 10/16] userfaultfd: Don't retake mmap_sem to emulate NOPAGE Peter Xu
2020-02-20 16:02 ` [PATCH RESEND v6 11/16] mm: Introduce FAULT_FLAG_DEFAULT Peter Xu
2020-02-20 16:02 ` [PATCH RESEND v6 13/16] mm: Allow VM_FAULT_RETRY for multiple times Peter Xu
2020-02-20 16:02 ` [PATCH RESEND v6 15/16] mm/gup: Allow to react to fatal signals Peter Xu
2020-02-20 16:03 ` [PATCH RESEND v6 16/16] mm/userfaultfd: Honor FAULT_FLAG_KILLABLE in fault path Peter Xu
2020-02-20 19:53 ` [PATCH RESEND v6 12/16] mm: Introduce FAULT_FLAG_INTERRUPTIBLE Peter Xu
2020-02-20 19:53 ` [PATCH RESEND v6 14/16] mm/gup: Allow VM_FAULT_RETRY for multiple times Peter Xu
2020-02-21 19:26 ` [PATCH RESEND v6 00/16] mm: Page fault enhancements Brian Geffon
2020-03-02 17:31   ` Peter Xu
2020-02-21 19:32 ` Linus Torvalds
2020-02-21 20:11   ` Peter Xu
2020-03-07 20:33 ` David Hildenbrand
2020-03-07 21:47   ` Peter Xu [this message]
2020-03-08 12:12     ` David Hildenbrand
2020-03-09 19:51       ` Peter Xu
2020-03-09 20:06         ` David Hildenbrand
2020-03-08 12:49   ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200307214743.GA4206@xz-x1 \
    --to=peterx@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=bgeffon@google.com \
    --cc=bobbypowers@gmail.com \
    --cc=cracauer@cons.org \
    --cc=david@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=dplotnikov@virtuozzo.com \
    --cc=gokhale2@llnl.gov \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=jglisse@redhat.com \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mcfadden8@llnl.gov \
    --cc=mgorman@suse.de \
    --cc=mike.kravetz@oracle.com \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=torvalds@linux-foundation.org \
    --cc=willy@infradead.org \
    --cc=xemul@virtuozzo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).