From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andrea Arcangeli
Subject: Re: Demand paging for VM on KVM
Date: Thu, 20 Mar 2014 18:32:29 +0100
Message-ID: <20140320173229.GB4000@redhat.com>
References: <532AEABA.2070000@redhat.com>
In-Reply-To: <532AEABA.2070000@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
To: Paolo Bonzini
Cc: Grigory Makarevich, kvm@vger.kernel.org, gleb@redhat.com, Eric Northup
Sender: kvm-owner@vger.kernel.org

Hi,

On Thu, Mar 20, 2014 at 02:18:50PM +0100, Paolo Bonzini wrote:
> On 20/03/2014 00:27, Grigory Makarevich wrote:
> > Hi All,
> >
> > I have been exploring different ways to implement on-demand paging for
> > VMs running in KVM.
> >
> > The core of the idea is to introduce an additional exit,
> > KVM_EXIT_MEMORY_NOT_PRESENT, to inform the VMM's user space that a
> > "not yet present" guest page has been accessed.  Each memory slot may
> > be instructed to keep track of an ondemand bit per page.  If a page is
> > marked "ondemand", a page fault on it generates an exit to the host's
> > user space with the information about the faulting page.  Once the
> > page is filled, the VMM instructs KVM to clear the "ondemand" bit for
> > that page.
> >
> > I have a working prototype and would like to consider upstreaming the
> > corresponding KVM changes.

That was the original idea before userfaultfd was introduced.  The
problem then is what happens when qemu does an O_DIRECT read into the
missing memory.  It's not just a matter of adding an additional exit:
the whole qemu userland would need to become aware, in various places,
of a new kind of error returned by legacy syscalls like read(2), not
just by the KVM ioctl, which would be easy to control by adding a new
exit reason.

> > To start up the discussion before sending the actual patch set, I'd
> > like to send the patch for kvm's api.txt.  Please let me know what
> > you think.
>
> Hi, Andrea Arcangeli is considering a similar infrastructure at the
> generic mm level.  Last time I discussed it with him, his idea was
> roughly to have:
>
> * a "userfaultfd" syscall that would take a memory range and return a
> file descriptor; the file descriptor becomes readable when the first
> access happens on a page in the region, and the read gives the address
> of the access.  Any thread that accesses a still-unmapped region remains
> blocked until the address of the faulting page is written back to the
> userfaultfd, or gets a SIGBUS if the userfaultfd is closed.

Yes.  By avoiding a return to userland (no exit through
KVM_EXIT_MEMORY_NOT_PRESENT anymore), userfaultfd lets the kernel, on
behalf of the faulting vcpu/IO thread, talk directly to the migration
thread (or, in Grigory's case, to the ondemand paging manager thread).
The kernel sleeps waiting for the page to become present without
returning to userland.  Once the migration/ondemand thread is done
(i.e. after the network transfer and remap_anon_pages have completed),
it notifies the kernel through the userfaultfd to wake up any vcpu/IO
thread that was waiting for the page.
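
To make the flow concrete, here is a rough, untested sketch of what the
migration/ondemand manager thread could look like on top of this
interface.  Nothing below is final: the syscall numbers, the
MADV_USERFAULT value and the remap_anon_pages argument order are
placeholders made up for illustration only; the part that matches the
description above is the protocol itself (read the faulting address,
fill a shadow page, remap it in place, write the address back to wake
the faulting thread).

	#define _GNU_SOURCE		/* for syscall() */
	#include <stdint.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define __NR_userfaultfd	350	/* placeholder, no number allocated yet */
	#define __NR_remap_anon_pages	351	/* placeholder, no number allocated yet */
	#define MADV_USERFAULT		18	/* placeholder, not in <sys/mman.h> yet */

	static void userfault_manager(void *guest_mem, size_t guest_len,
				      size_t page_size)
	{
		uint64_t addr;
		long ufd;

		/* Mark the guest memory so faults engage the userfault protocol. */
		madvise(guest_mem, guest_len, MADV_USERFAULT);

		/* Bind a userfaultfd to the whole mm (flags argument assumed). */
		ufd = syscall(__NR_userfaultfd, 0);

		for (;;) {
			void *shadow;

			/* Blocks until some vcpu/IO thread faults on a missing
			   page; the read returns the faulting address. */
			if (read(ufd, &addr, sizeof(addr)) != sizeof(addr))
				break;
			addr &= ~((uint64_t)page_size - 1);

			/* Build the page contents in a private shadow area,
			   e.g. by receiving it from the migration source. */
			shadow = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
				      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			/* ... fill the shadow page here ... */

			/* Atomically remap the shadow page over the faulting
			   address (argument order is a guess). */
			syscall(__NR_remap_anon_pages, addr, (uint64_t)shadow,
				page_size);
			munmap(shadow, page_size);

			/* Writing the address back wakes up whoever was
			   blocked on it. */
			write(ufd, &addr, sizeof(addr));
		}
	}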
This should solve all the trouble with O_DIRECT or similar syscalls
that may access the missing KVM memory from the I/O thread (see the
small example at the end of this mail), and it will handle the spte
fault case more efficiently too, by avoiding a kernel exit/enter, as
KVM_EXIT_MEMORY_NOT_PRESENT won't be required anymore.  It's not
finished yet, so I have no 100% proof that it will work exactly as
described above, but I don't expect trouble as the design is pretty
straightforward.

The only slight difference compared to the description above is that
userfaultfd won't take a range of memory.  Instead the userfault ranges
will still be marked with MADV_USERFAULT.  The other option would have
been to specify the ranges as iovecs, but having to specify them in the
syscall invocation felt less flexible than allowing arbitrary mangling
of the userfault ranges with madvise at runtime.  The userfaultfd will
simply bind to the whole mm, so no matter which thread faults on memory
marked MADV_USERFAULT, the faulting thread will engage in the
userfaultfd protocol without exiting to userland.  The actual syscall
API will require review later anyway; that's not the primary concern at
this point.

> * a remap_anon_pages syscall that would be used in the userfaultfd I/O
> handler to make the page accessible.  The handler would build the page
> in a "shadow" area with the actual contents of guest memory, and then
> remap the shadow area onto the actual guest memory.
>
> Andrea, please correct me.
>
> QEMU would use this infrastructure for post-copy migration and possibly
> also for live snapshotting of the guests.  The advantage in making this
> generic rather than KVM-based is that QEMU could use it also in
> system-emulation mode (and of course anything else needing a read
> barrier could use it too).

Correct.

Comments welcome,
Andrea
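
P.S. To show the O_DIRECT point concretely: nothing has to change in the
I/O paths.  In the sketch below, "guest_buf" is assumed to point into
the region marked MADV_USERFAULT and "disk.img" is just an illustrative
filename.  If the destination page is still missing, the pread simply
sleeps in the kernel fault path until the manager thread above fills
the page, instead of returning some new error that every read(2)/pread(2)
caller would have to learn about.

	#define _GNU_SOURCE		/* for O_DIRECT */
	#include <fcntl.h>
	#include <unistd.h>

	static ssize_t read_into_guest(void *guest_buf)
	{
		ssize_t ret;
		int fd = open("disk.img", O_RDONLY | O_DIRECT);

		if (fd < 0)
			return -1;
		/* Blocks transparently if guest_buf points at a page that is
		   not present yet; completes once the userfault manager has
		   filled it. */
		ret = pread(fd, guest_buf, 4096, 0);
		close(fd);
		return ret;
	}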