Re: [PATCH RESEND v6 00/16] mm: Page fault enhancements

From: David Hildenbrand <david@redhat.com>
To: Peter Xu <peterx@redhat.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Martin Cracauer <cracauer@cons.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	Johannes Weiner <hannes@cmpxchg.org>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
	Bobby Powers <bobbypowers@gmail.com>,
	Maya Gokhale <gokhale2@llnl.gov>,
	Jerome Glisse <jglisse@redhat.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Matthew Wilcox <willy@infradead.org>,
	Marty McFadden <mcfadden8@llnl.gov>, Mel Gorman <mgorman@suse.de>,
	Hugh Dickins <hughd@google.com>,
	Brian Geffon <bgeffon@google.com>,
	Denis Plotnikov <dplotnikov@virtuozzo.com>,
	Pavel Emelyanov <xemul@virtuozzo.com>,
	"dgilbert@redhat.com" <dgilbert@redhat.com>
Subject: Re: [PATCH RESEND v6 00/16] mm: Page fault enhancements
Date: Sat, 7 Mar 2020 21:33:08 +0100	[thread overview]
Message-ID: <1eb7bdd4-348f-da87-47a1-0b022b70e918@redhat.com> (raw)
In-Reply-To: <20200220155353.8676-1-peterx@redhat.com>

On 20.02.20 16:53, Peter Xu wrote:
> [Resend v6]
> 
> This is v6 of the series.  It is majorly a rebase to 5.6-rc2, nothing
> else to be expected (plus some tests after the rebase).  Instead of
> rewrite the cover letter I decided to use what we have for v5.
> 
> Adding extra CCs for both Bobby Powers <bobbypowers@gmail.com> and
> Brian Geffon <bgeffon@google.com>.
> 
> Online repo: https://github.com/xzpeter/linux/tree/mm-pf-signal-retry
> 
> Any review comment is appreciated.  Thanks,

If I am not completely missing something (and all my testing today was
wrong) there is a very simple reason why I *LOVE* this series and it made
my weekend. It makes userfaultfd with concurrent discarding (e.g.,
MADV_DONTNEED) of pages actually usable.

The issue in current code is that between placing a page and waking
up a waiter, somebody can zap the new placed page and trigger
re-fault, triggering a SIGBUS and crashing an application where all
memory is supposed to be accessible. And there is no real way to protect
from that, because when the fault handler will be woken up and retry
is not deterministic (e.g., making madvise(MADV_DONTNEED) and
UFFDIO_ZEROPAGE mutually exclusive does not help).

Find a simple reproducer at the end of this mail.

Before this series:
[root@localhost ~]# ./a.out 
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
[   34.849604] FAULT_FLAG_ALLOW_RETRY missing 70
[   34.850466] CPU: 1 PID: 651 Comm: a.out Not tainted 5.6.0-rc2+ #92
[   34.851525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
[   34.852818] Call Trace:
[   34.853045]  dump_stack+0x8f/0xd0
[   34.853338]  handle_userfault.cold+0x1a/0x2e
[   34.853704]  ? find_held_lock+0x2b/0x80
[   34.854031]  ? __handle_mm_fault+0x18c5/0x1900
[   34.854409]  __handle_mm_fault+0x18d4/0x1900
[   34.854784]  handle_mm_fault+0x169/0x360
[   34.855120]  do_user_addr_fault+0x20d/0x490
[   34.855478]  async_page_fault+0x43/0x50
[   34.855809] RIP: 0033:0x401659
[   34.856069] Code: ba 1f 00 00 00 be 01 00 00 00 bf 10 21 40 00 e8 ad fa ff ff bf ff ff ff ff e8 93 fa ff ff 48 8b8
[   34.857629] RSP: 002b:00007ffcfd536ec0 EFLAGS: 00010246
[   34.858076] RAX: 00007fcba86a4000 RBX: 0000000000000000 RCX: 00007fcba85784ef
[   34.858675] RDX: 00007fcba86a4007 RSI: 00000000016524e0 RDI: 00007fcba864b320
[   34.859272] RBP: 00007ffcfd536f20 R08: 000000000000000a R09: 0000000000000070
[   34.859876] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000401120
[   34.860472] R13: 00007ffcfd537000 R14: 0000000000000000 R15: 0000000000000000

After this series:
Well, "Progress!" all day long.

Can we please have a way to identify that this "feature" is available?
I'd appreciate a new read-only UFFD_FEAT_ , so we can detect this from
user space easily and use concurrent discards without crashing our applications.

Questions:
1. I assume KVM will do multiple retries as well, and have the same behavior, right?

2. What will happen if I don't place a page on a pagefault, but only do a UFFDIO_WAKE?
   For now we were able to trigger a signal this way. If the behavior is changed, can
   we make this configurable via a UFFD_FEAT?

--- snip ---
#include <string.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <stdio.h>
#include <pthread.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <poll.h>
#include <linux/userfaultfd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>

static int page_size;

static void *fault_handler_thread(void *arg)
{
    const long uffd = (long) arg;
    struct pollfd pollfd = {
        .fd = uffd,
        .events = POLLIN,
    };
    int ret;

    while (true) {
        struct uffdio_zeropage zeropage = {};
        struct uffd_msg msg;
        ssize_t nread;

        if (poll(&pollfd, 1, -1) == -1) {
            fprintf(stderr, "POLL failed: %s\n", strerror(errno));
            exit(-1);
        }
        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg)) {
            fprintf(stderr, "READ failed\n");
            exit(-1);
        }
        if (msg.event != UFFD_EVENT_PAGEFAULT) {
            fprintf(stderr, "Not UFFD_EVENT_PAGEFAULT\n");
            exit(-1);
        }

        zeropage.range.start = msg.arg.pagefault.address;
        zeropage.range.len = page_size;
        do {
            ret = ioctl(uffd, UFFDIO_ZEROPAGE, &zeropage);
            if (ret && errno != EAGAIN) {
                fprintf(stderr, "UFFDIO_ZEROPAGE failed:%s\n", strerror(errno));
                exit(-1);
            }
        } while (ret);
    }
}
static void *discard_thread(void *arg)
{
    while (true) {
        if (madvise(arg, page_size, MADV_DONTNEED)) {
            fprintf(stderr, "MADV_DONTNEED failed:%s\n", strerror(errno));
            exit(-1);
        }
        usleep(1000);
    }
}

int main(void)
{
    struct uffdio_register reg;
    struct uffdio_api api = {
        .api = UFFD_API,
    };
    pthread_t fault, discard;
    long uffd;
    char *area;

    page_size = sysconf(_SC_PAGE_SIZE);

    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd == -1) {
        fprintf(stderr, "Could not create uffd: %s\n", strerror(errno));
        exit(-1);
    }
    if (ioctl(uffd, UFFDIO_API, &api) == -1) {
        fprintf(stderr, "UFFDIO_API failed: %s\n", strerror(errno));
        exit(-1);
    }

    area = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED) {
        fprintf(stderr, "Could not allocate memory");
        exit(-1);
    }

    reg.range.start = (uint64_t) area;
    reg.range.len = page_size,
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1) {
        fprintf(stderr, "UFFDIO_REGISTER failed: %s\n", strerror(errno));
        exit(-1);
    }

    /* thread to provide zeropages */
    if (pthread_create(&fault, NULL, fault_handler_thread,
                       (void *) uffd)) {
        fprintf(stderr, "Could not create fault handling thread");
        exit(-1);
    }

    /* thread to discard the page */
    if (pthread_create(&discard, NULL, discard_thread,
                       (void *) area)) {
        fprintf(stderr, "Could not create discard thread");
        exit(-1);
    }

    /* keep reading/writing the page */
    while (true) {
        area[7] = area[1];
        usleep(10000);
        printf("Progress!\n");
    }
    return 0;
}

-- 
Thanks,

David / dhildenb