From: Grigory Makarevich
Subject: Re: Demand paging for VM on KVM
Date: Thu, 20 Mar 2014 11:27:13 -0700
Message-ID:
References: <532AEABA.2070000@redhat.com> <20140320173229.GB4000@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
To: kvm@vger.kernel.org
In-Reply-To: <20140320173229.GB4000@redhat.com>
List-ID:

- Resending to kvm@, as the previous attempt bounced.

Andrea, Paolo,

Thanks a lot for the comments.

I like the idea of userfaultfd a lot. For my prototype I had to solve the
problem of accessing an "ondemand" page from paths where exiting to user
space is not safe (the emulator is one example). I solved it by sending a
message over a netlink socket, blocking the calling thread, and waking it
up once the page is delivered. userfaultfd might be a cleaner way to
achieve the same goal.

My concerns regarding a "general" mm solution are:

- Will it work with any memory mapping scheme, or only with anonymous
  memory?

- Before blocking the calling thread while serving the page fault in the
  host kernel, one would need to carefully release the mmu semaphore
  (otherwise user space might get into trouble serving the page fault),
  which may not be trivial.

Regarding the qemu part:

- Yes, indeed, user space would need to be careful when accessing
  "ondemand" pages. However, that should not be a problem, considering
  that qemu would need to know in advance all "ondemand" regions. Though
  I would expect some refactoring of the qemu internal API might be
  required.
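To make the discussion a bit more concrete, below is roughly how I picture
the user-space side of the proposed exit. To be clear, none of this is an
existing ABI: the exit reason value, the payload layout, and the ioctl
that clears the "ondemand" bit are placeholders for whatever the actual
patch set would define.

#include <stdint.h>
#include <sys/ioctl.h>

#define KVM_EXIT_MEMORY_NOT_PRESENT  0x100   /* proposed exit reason; value made up  */
#define KVM_CLEAR_ONDEMAND_PAGE      0x101   /* proposed ioctl; name and number made up */

struct memory_not_present_info {             /* proposed exit payload; layout made up */
        uint64_t gpa;                        /* guest physical address of the missing page */
        uint32_t slot;                       /* memory slot the fault falls into */
};

/* user-space pager: fetch the page contents (e.g. over the network) and
 * copy them into the host virtual address backing 'gpa' */
extern void fetch_and_fill_page(uint64_t gpa);

static void handle_memory_not_present(int vcpu_fd,
                                      const struct memory_not_present_info *info)
{
        /* 1. bring the page contents in; the vcpu stays parked in user space */
        fetch_and_fill_page(info->gpa);

        /* 2. tell KVM the page is present so it can clear the "ondemand" bit
         *    and fault the page in normally on the next access */
        ioctl(vcpu_fd, KVM_CLEAR_ONDEMAND_PAGE, &info->gpa);

        /* 3. the caller simply re-enters KVM_RUN and the guest resumes */
}

The vcpu thread is parked in user space between steps 1 and 2, which is
exactly why paths that cannot safely exit (such as the emulator) needed
the netlink/blocking workaround in my prototype.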
Thanks a lot,

Best,
Grigory

On Thu, Mar 20, 2014 at 10:32 AM, Andrea Arcangeli wrote:
> Hi,
>
> On Thu, Mar 20, 2014 at 02:18:50PM +0100, Paolo Bonzini wrote:
>> On 20/03/2014 00:27, Grigory Makarevich wrote:
>> > Hi All,
>> >
>> > I have been exploring different ways to implement on-demand paging
>> > for VMs running in KVM.
>> >
>> > The core of the idea is to introduce an additional exit,
>> > KVM_EXIT_MEMORY_NOT_PRESENT, to inform the VMM's user space that it
>> > must handle an access to a "not yet present" guest page.
>> > Each memory slot may be instructed to keep track of an ondemand bit
>> > per page. If a page is marked as "ondemand", a page fault will
>> > generate an exit to the host's user space with information about the
>> > faulting page. Once the page is filled, the VMM instructs KVM to
>> > clear the "ondemand" bit for the page.
>> >
>> > I have a working prototype and would like to consider upstreaming
>> > the corresponding KVM changes.
>
> That was the original idea before userfaultfd was introduced. The
> problem then is what happens when qemu does an O_DIRECT read from the
> missing memory. It's not just a matter of adding an additional exit:
> the whole qemu userland would need to become aware, in various places,
> of a new kind of error returned by legacy syscalls like read(2), not
> just by the KVM ioctl, which would be easy to control by adding a new
> exit reason.
>
>> >
>> > To start the discussion before sending the actual patch set, I'd
>> > like to send the patch for kvm's api.txt. Please let me know what
>> > you think.
>>
>> Hi, Andrea Arcangeli is considering a similar infrastructure at the
>> generic mm level. Last time I discussed it with him, his idea was
>> roughly to have:
>>
>> * a "userfaultfd" syscall that would take a memory range and return a
>> file descriptor; the file descriptor becomes readable when the first
>> access happens on a page in the region, and the read gives the address
>> of the access. Any thread that accesses a still-unmapped region
>> remains blocked until the address of the faulting page is written back
>> to the userfaultfd, or gets a SIGBUS if the userfaultfd is closed.
>>
>
> Yes, by avoiding a return to userland (no exit to userland through
> KVM_EXIT_MEMORY_NOT_PRESENT anymore), userfaultfd will allow the
> kernel, inside the vcpu/IO thread, to talk directly to the migration
> thread (or, in Grigory's case, to the ondemand paging manager thread).
> The kernel will sleep waiting for the page to be present without
> returning to userland. The migration/ondemand thread will then notify
> the kernel through the userfaultfd, once it is finished (i.e. after the
> network transfer and remap_anon_pages have completed), to wake up any
> vcpu/IO thread that was waiting for the page.
>
> This should solve all troubles with O_DIRECT or similar syscalls that
> may access the missing KVM memory from the I/O thread, and it will
> handle the spte fault case more efficiently too, by avoiding a kernel
> exit/enter cycle, as KVM_EXIT_MEMORY_NOT_PRESENT will not be required
> anymore.
>
> It's not finished yet, so I have no 100% proof that this will work
> exactly as described above, but I don't expect trouble as the design is
> pretty straightforward.
>
> The only slight difference compared to the description above is that
> userfaultfd won't take a range of memory. Instead, the userfault ranges
> will still be marked by MADV_USERFAULT. The other option would be to
> specify the ranges using iovecs, but having to specify them in the
> syscall invocation felt less flexible than allowing random mangling of
> the userfault ranges with madvise at runtime.
>
> The userfaultfd will just bind to the whole mm, so no matter which
> thread faults on memory marked MADV_USERFAULT, the faulting thread will
> engage in the userfaultfd protocol without exiting to userland.
>
> The actual syscall API will require review later anyway; that's not the
> primary concern at this point.
>
>> * a remap_anon_pages syscall that would be used in the userfaultfd I/O
>> handler to make the page accessible. The handler would build the page
>> in a "shadow" area with the actual contents of guest memory, and then
>> remap the shadow area onto the actual guest memory.
>>
>> Andrea, please correct me.
>>
>> QEMU would use this infrastructure for post-copy migration and
>> possibly also for live snapshotting of guests. The advantage of making
>> this generic rather than KVM-based is that QEMU could use it also in
>> system-emulation mode (and of course anything else needing a read
>> barrier could use it too).
>
> Correct.
>
> Comments welcome,
> Andrea
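To check my own understanding, here is roughly how I picture the
migration/ondemand thread on top of the interface described above. Since
MADV_USERFAULT, userfaultfd() and remap_anon_pages() are not finished
yet, the prototypes, the MADV_USERFAULT value and the read-the-address /
write-it-back protocol below are purely my assumptions about the eventual
API, not something that exists today.

#include <stdint.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/mman.h>

/* None of these exist yet; prototypes and values are assumptions only. */
extern int userfaultfd(void);                                   /* proposed syscall */
extern int remap_anon_pages(void *dst, void *src, size_t len);  /* proposed syscall */
#define MADV_USERFAULT  0x100                                   /* proposed madvise flag */

#define PAGE_SIZE_4K    ((uint64_t)4096)

/* fill 'page' with the real contents of guest memory at address 'addr',
 * e.g. received from the source host during postcopy migration */
extern void fetch_page(void *page, uint64_t addr);

static void pager_thread(void *guest_mem, size_t guest_len, void *shadow_page)
{
        uint64_t addr;
        int ufd;

        /* mark the guest range as "not present until userland provides it" */
        madvise(guest_mem, guest_len, MADV_USERFAULT);

        /* binds to the whole mm: any thread faulting on MADV_USERFAULT
         * memory sleeps in the kernel instead of exiting to userland */
        ufd = userfaultfd();

        /* the read blocks until some vcpu/IO thread touches a missing page,
         * then returns the faulting address */
        while (read(ufd, &addr, sizeof(addr)) == sizeof(addr)) {
                addr &= ~(PAGE_SIZE_4K - 1);

                /* build the page in a shadow area, then remap the shadow
                 * page onto the faulting guest address */
                fetch_page(shadow_page, addr);
                remap_anon_pages((void *)addr, shadow_page, PAGE_SIZE_4K);

                /* writing the address back wakes the threads waiting on it */
                write(ufd, &addr, sizeof(addr));
        }

        close(ufd);
}

If I read the description correctly, the vcpu/IO threads never appear in
this loop at all: they simply sleep in the kernel until the final write()
wakes them, which is exactly what removes the need for
KVM_EXIT_MEMORY_NOT_PRESENT.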