From: David Hildenbrand <david@redhat.com>
To: Peter Xu <peterx@redhat.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Martin Cracauer <cracauer@cons.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	Johannes Weiner <hannes@cmpxchg.org>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
	Bobby Powers <bobbypowers@gmail.com>,
	Maya Gokhale <gokhale2@llnl.gov>,
	Jerome Glisse <jglisse@redhat.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Matthew Wilcox <willy@infradead.org>,
	Marty McFadden <mcfadden8@llnl.gov>, Mel Gorman <mgorman@suse.de>,
	Hugh Dickins <hughd@google.com>,
	Brian Geffon <bgeffon@google.com>,
	Denis Plotnikov <dplotnikov@virtuozzo.com>,
	Pavel Emelyanov <xemul@virtuozzo.com>,
	"dgilbert@redhat.com" <dgilbert@redhat.com>
Subject: Re: [PATCH RESEND v6 00/16] mm: Page fault enhancements
Date: Sat, 7 Mar 2020 21:33:08 +0100
Message-ID: <1eb7bdd4-348f-da87-47a1-0b022b70e918@redhat.com>
In-Reply-To: <20200220155353.8676-1-peterx@redhat.com>

On 20.02.20 16:53, Peter Xu wrote:
> [Resend v6]
> 
> This is v6 of the series.  It is majorly a rebase to 5.6-rc2, nothing
> else to be expected (plus some tests after the rebase).  Instead of
> rewrite the cover letter I decided to use what we have for v5.
> 
> Adding extra CCs for both Bobby Powers <bobbypowers@gmail.com> and
> Brian Geffon <bgeffon@google.com>.
> 
> Online repo: https://github.com/xzpeter/linux/tree/mm-pf-signal-retry
> 
> Any review comment is appreciated.  Thanks,

Unless I am completely missing something (and all my testing today was
wrong), there is a very simple reason why I *LOVE* this series, and it
made my weekend: it makes userfaultfd with concurrent discarding of pages
(e.g., MADV_DONTNEED) actually usable.

The issue in the current code is that, between placing a page and waking
up a waiter, somebody can zap the newly placed page and trigger a
re-fault. That re-fault triggers a SIGBUS and crashes an application in
which all memory is supposed to be accessible. And there is no real way
to protect against that, because it is not deterministic when the fault
handler will be woken up and retry (e.g., making madvise(MADV_DONTNEED)
and UFFDIO_ZEROPAGE mutually exclusive does not help).
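
To make the last point concrete, here is a minimal sketch of the kind of
serialization that does not help. The helper names and the single global
mutex are assumptions made up for illustration; they are not part of the
reproducer. Both operations take the same lock, but the lock only covers
the ioctl()/madvise() calls themselves. The faulting thread retries at
some later point, after UFFDIO_ZEROPAGE has returned and the lock has
been dropped.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/userfaultfd.h>

/* Hypothetical lock that serializes page placement and discards. */
static pthread_mutex_t place_discard_lock = PTHREAD_MUTEX_INITIALIZER;

/* Place a zero page while holding the lock. Returns 0 or -1 (errno set). */
int place_zeropage_locked(int uffd, uint64_t addr, uint64_t len)
{
    struct uffdio_zeropage zeropage = {
        .range = { .start = addr, .len = len },
    };
    int ret, saved_errno;

    pthread_mutex_lock(&place_discard_lock);
    ret = ioctl(uffd, UFFDIO_ZEROPAGE, &zeropage);
    saved_errno = errno;
    pthread_mutex_unlock(&place_discard_lock);
    /*
     * On success the waiter has been woken up, but it has not
     * necessarily retried the fault yet. From here on, nothing stops
     * discard_locked() from zapping the page before the retry happens.
     */
    errno = saved_errno;
    return ret;
}

/* Discard a range while holding the same lock. Returns 0 or -1 (errno set). */
int discard_locked(void *addr, size_t len)
{
    int ret, saved_errno;

    assert(addr != NULL);

    pthread_mutex_lock(&place_discard_lock);
    ret = madvise(addr, len, MADV_DONTNEED);
    saved_errno = errno;
    pthread_mutex_unlock(&place_discard_lock);
    errno = saved_errno;
    return ret;
}
```

The window sits after the unlock in place_zeropage_locked(), so holding
the lock longer in user space cannot close it: user space has no way to
learn when the woken thread has actually completed its retry.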

You can find a simple reproducer at the end of this mail.

Before this series:
[root@localhost ~]# ./a.out 
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
[   34.849604] FAULT_FLAG_ALLOW_RETRY missing 70
[   34.850466] CPU: 1 PID: 651 Comm: a.out Not tainted 5.6.0-rc2+ #92
[   34.851525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
[   34.852818] Call Trace:
[   34.853045]  dump_stack+0x8f/0xd0
[   34.853338]  handle_userfault.cold+0x1a/0x2e
[   34.853704]  ? find_held_lock+0x2b/0x80
[   34.854031]  ? __handle_mm_fault+0x18c5/0x1900
[   34.854409]  __handle_mm_fault+0x18d4/0x1900
[   34.854784]  handle_mm_fault+0x169/0x360
[   34.855120]  do_user_addr_fault+0x20d/0x490
[   34.855478]  async_page_fault+0x43/0x50
[   34.855809] RIP: 0033:0x401659
[   34.856069] Code: ba 1f 00 00 00 be 01 00 00 00 bf 10 21 40 00 e8 ad fa ff ff bf ff ff ff ff e8 93 fa ff ff 48 8b8
[   34.857629] RSP: 002b:00007ffcfd536ec0 EFLAGS: 00010246
[   34.858076] RAX: 00007fcba86a4000 RBX: 0000000000000000 RCX: 00007fcba85784ef
[   34.858675] RDX: 00007fcba86a4007 RSI: 00000000016524e0 RDI: 00007fcba864b320
[   34.859272] RBP: 00007ffcfd536f20 R08: 000000000000000a R09: 0000000000000070
[   34.859876] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000401120
[   34.860472] R13: 00007ffcfd537000 R14: 0000000000000000 R15: 0000000000000000

After this series:
Well, "Progress!" all day long.


Can we please have a way to identify that this "feature" is available?
I'd appreciate a new read-only UFFD_FEATURE_* flag, so we can easily
detect this from user space and use concurrent discards without crashing
our applications.
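
As a sketch of what such detection could look like from user space: the
helper names below are made up, and the flag for this series does not
exist yet, so the caller passes whatever bit ends up being defined. Only
the UFFDIO_API handshake and the "features" field are existing uAPI.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* True if every bit in "wanted" is set in the reported feature mask. */
bool uffd_has_features(uint64_t reported, uint64_t wanted)
{
    assert(wanted != 0);
    return (reported & wanted) == wanted;
}

/*
 * Ask the kernel which userfaultfd features it supports. A throwaway
 * descriptor is used because the UFFDIO_API handshake can only be done
 * once per descriptor. Returns 0 and fills *features, or -1 (errno set).
 */
int uffd_query_features(uint64_t *features)
{
    struct uffdio_api api = { .api = UFFD_API, .features = 0 };
    int fd = (int) syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

    if (fd == -1)
        return -1;
    if (ioctl(fd, UFFDIO_API, &api) == -1) {
        int saved_errno = errno;

        close(fd);
        errno = saved_errno;
        return -1;
    }
    close(fd);
    /* With .features = 0 the kernel reports everything it supports. */
    *features = api.features;
    return 0;
}
```

An application would call uffd_query_features() once at startup and only
enable concurrent discards if uffd_has_features() reports the new bit.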


Questions:
1. I assume KVM will do multiple retries as well and show the same
   behavior, right?

2. What will happen if I don't place a page on a page fault, but only do
   a UFFDIO_WAKE? Until now, we were able to trigger a signal this way.
   If the behavior changes, can we make this configurable via a
   UFFD_FEATURE_* flag?
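
For reference, the "wake only" case from question 2 looks roughly like
this in user space. The helper names are made up for this sketch;
UFFDIO_WAKE and struct uffdio_range are existing uAPI. What the faulting
thread observes afterwards (a signal or another fault message) is exactly
the open question, so the sketch makes no claim about it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Round a faulting address down to the start of its page. */
uint64_t page_align_down(uint64_t addr, uint64_t page_size)
{
    /* page_size must be a power of two */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & ~(page_size - 1);
}

/*
 * Wake up waiters on the page containing "addr" without resolving the
 * fault: no UFFDIO_COPY or UFFDIO_ZEROPAGE is issued beforehand, so the
 * page is still missing when the woken thread retries.
 * Returns 0 on success, -1 on failure (errno set).
 */
int wake_without_placing(int uffd, uint64_t addr, uint64_t page_size)
{
    struct uffdio_range range = {
        .start = page_align_down(addr, page_size),
        .len = page_size,
    };

    return ioctl(uffd, UFFDIO_WAKE, &range);
}
```

In the reproducer's fault handler thread, this would replace the
UFFDIO_ZEROPAGE loop, with msg.arg.pagefault.address passed as "addr".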

--- snip ---
#include <string.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <stdio.h>
#include <pthread.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <poll.h>
#include <linux/userfaultfd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>

static int page_size;

/* Resolve every missing-page fault on the area by placing a zero page. */
static void *fault_handler_thread(void *arg)
{
    const long uffd = (long) arg;
    struct pollfd pollfd = {
        .fd = uffd,
        .events = POLLIN,
    };
    int ret;

    while (true) {
        struct uffdio_zeropage zeropage = { 0 };
        struct uffd_msg msg;
        ssize_t nread;

        if (poll(&pollfd, 1, -1) == -1) {
            fprintf(stderr, "POLL failed: %s\n", strerror(errno));
            exit(-1);
        }
        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg)) {
            fprintf(stderr, "READ failed\n");
            exit(-1);
        }
        if (msg.event != UFFD_EVENT_PAGEFAULT) {
            fprintf(stderr, "Not UFFD_EVENT_PAGEFAULT\n");
            exit(-1);
        }

        zeropage.range.start = msg.arg.pagefault.address;
        zeropage.range.len = page_size;
        do {
            ret = ioctl(uffd, UFFDIO_ZEROPAGE, &zeropage);
            if (ret && errno != EAGAIN) {
                fprintf(stderr, "UFFDIO_ZEROPAGE failed:%s\n", strerror(errno));
                exit(-1);
            }
        } while (ret);
    }
}

/* Keep discarding the page, racing with the fault handler thread. */
static void *discard_thread(void *arg)
{
    while (true) {
        if (madvise(arg, page_size, MADV_DONTNEED)) {
            fprintf(stderr, "MADV_DONTNEED failed:%s\n", strerror(errno));
            exit(-1);
        }
        usleep(1000);
    }
}

int main(void)
{
    struct uffdio_register reg;
    struct uffdio_api api = {
        .api = UFFD_API,
    };
    pthread_t fault, discard;
    long uffd;
    char *area;

    page_size = sysconf(_SC_PAGE_SIZE);

    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd == -1) {
        fprintf(stderr, "Could not create uffd: %s\n", strerror(errno));
        exit(-1);
    }
    if (ioctl(uffd, UFFDIO_API, &api) == -1) {
        fprintf(stderr, "UFFDIO_API failed: %s\n", strerror(errno));
        exit(-1);
    }

    area = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED) {
        fprintf(stderr, "Could not allocate memory\n");
        exit(-1);
    }

    reg.range.start = (uint64_t) area;
    reg.range.len = page_size;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1) {
        fprintf(stderr, "UFFDIO_REGISTER failed: %s\n", strerror(errno));
        exit(-1);
    }

    /* thread to provide zeropages */
    if (pthread_create(&fault, NULL, fault_handler_thread,
                       (void *) uffd)) {
        fprintf(stderr, "Could not create fault handling thread\n");
        exit(-1);
    }

    /* thread to discard the page */
    if (pthread_create(&discard, NULL, discard_thread,
                       (void *) area)) {
        fprintf(stderr, "Could not create discard thread\n");
        exit(-1);
    }

    /* keep reading/writing the page */
    while (true) {
        area[7] = area[1];
        usleep(10000);
        printf("Progress!\n");
    }
    return 0;
}


-- 
Thanks,

David / dhildenb


