From: Grigory Makarevich
Subject: Re: Demand paging for VM on KVM
Date: Thu, 20 Mar 2014 11:27:13 -0700
Message-ID:
References: <532AEABA.2070000@redhat.com> <20140320173229.GB4000@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
To: kvm@vger.kernel.org
In-Reply-To: <20140320173229.GB4000@redhat.com>
List-ID:

- Resending to kvm@, as the previous attempt bounced.

Andrea, Paolo,

Thanks a lot for the comments.

I like the idea of userfaultfd a lot. For my prototype I had to solve the
problem of accessing an "ondemand" page from paths where exiting to user
space is not safe (the emulator is one example). I solved it by sending a
message over a netlink socket, blocking the calling thread, and waking it
up once the page is delivered. userfaultfd might be a cleaner way to
achieve the same goal.

My concerns regarding a "general" mm solution are:

- Will it work with any memory mapping scheme, or only with anonymous
  memory?

- Before blocking the calling thread while serving the page fault in the
  host kernel, one would need to carefully release the mmu semaphore
  (otherwise user space might get into trouble serving the page fault),
  which may not be trivial.

Regarding the qemu part:

- Yes, indeed, user space would need to be careful when accessing
  "ondemand" pages. However, that should not be a problem, considering
  that qemu would need to know in advance all "ondemand" regions. Though
  I would expect some refactoring of the qemu internal API might be
  required.
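To make the discussion a bit more concrete, below is roughly how I picture
the user-space side of the proposed exit. To be clear, none of this is an
existing ABI: the exit reason value, the payload layout, and the ioctl
that clears the "ondemand" bit are placeholders for whatever the actual
patch set would define.

#include <stdint.h>
#include <sys/ioctl.h>

#define KVM_EXIT_MEMORY_NOT_PRESENT  0x100   /* proposed exit reason; value made up  */
#define KVM_CLEAR_ONDEMAND_PAGE      0x101   /* proposed ioctl; name and number made up */

struct memory_not_present_info {             /* proposed exit payload; layout made up */
        uint64_t gpa;                        /* guest physical address of the missing page */
        uint32_t slot;                       /* memory slot the fault falls into */
};

/* user-space pager: fetch the page contents (e.g. over the network) and
 * copy them into the host virtual address backing 'gpa' */
extern void fetch_and_fill_page(uint64_t gpa);

static void handle_memory_not_present(int vcpu_fd,
                                      const struct memory_not_present_info *info)
{
        /* 1. bring the page contents in; the vcpu stays parked in user space */
        fetch_and_fill_page(info->gpa);

        /* 2. tell KVM the page is present so it can clear the "ondemand" bit
         *    and fault the page in normally on the next access */
        ioctl(vcpu_fd, KVM_CLEAR_ONDEMAND_PAGE, &info->gpa);

        /* 3. the caller simply re-enters KVM_RUN and the guest resumes */
}

The vcpu thread is parked in user space between steps 1 and 2, which is
exactly why paths that cannot safely exit (such as the emulator) needed
the netlink/blocking workaround in my prototype.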
Thanks a lot,

Best,
Grigory

On Thu, Mar 20, 2014 at 10:32 AM, Andrea Arcangeli wrote:
> Hi,
>
> On Thu, Mar 20, 2014 at 02:18:50PM +0100, Paolo Bonzini wrote:
>> On 20/03/2014 00:27, Grigory Makarevich wrote:
>> > Hi All,
>> >
>> > I have been exploring different ways to implement on-demand paging
>> > for VMs running in KVM.
>> >
>> > The core of the idea is to introduce an additional exit,
>> > KVM_EXIT_MEMORY_NOT_PRESENT, to inform the VMM's user space that it
>> > must handle an access to a "not yet present" guest page.
>> > Each memory slot may be instructed to keep track of an ondemand bit
>> > per page. If a page is marked as "ondemand", a page fault will
>> > generate an exit to the host's user space with information about the
>> > faulting page. Once the page is filled, the VMM instructs KVM to
>> > clear the "ondemand" bit for the page.
>> >
>> > I have a working prototype and would like to consider upstreaming
>> > the corresponding KVM changes.
>
> That was the original idea before userfaultfd was introduced. The
> problem then is what happens when qemu does an O_DIRECT read from the
> missing memory. It's not just a matter of adding an additional exit:
> the whole qemu userland would need to become aware, in various places,
> of a new kind of error returned by legacy syscalls like read(2), not
> just by the KVM ioctl, which would be easy to control by adding a new
> exit reason.
>
>> >
>> > To start the discussion before sending the actual patch set, I'd
>> > like to send the patch for kvm's api.txt. Please let me know what
>> > you think.
>>
>> Hi, Andrea Arcangeli is considering a similar infrastructure at the
>> generic mm level. Last time I discussed it with him, his idea was
>> roughly to have:
>>
>> * a "userfaultfd" syscall that would take a memory range and return a
>> file descriptor; the file descriptor becomes readable when the first
>> access happens on a page in the region, and the read gives the address
>> of the access. Any thread that accesses a still-unmapped region
>> remains blocked until the address of the faulting page is written back
>> to the userfaultfd, or gets a SIGBUS if the userfaultfd is closed.
>>
>
> Yes, by avoiding a return to userland (no exit to userland through
> KVM_EXIT_MEMORY_NOT_PRESENT anymore), userfaultfd will allow the
> kernel, inside the vcpu/IO thread, to talk directly to the migration
> thread (or, in Grigory's case, to the ondemand paging manager thread).
> The kernel will sleep waiting for the page to be present without
> returning to userland. The migration/ondemand thread will then notify
> the kernel through the userfaultfd, once it is finished (i.e. after the
> network transfer and remap_anon_pages have completed), to wake up any
> vcpu/IO thread that was waiting for the page.
>
> This should solve all troubles with O_DIRECT or similar syscalls that
> may access the missing KVM memory from the I/O thread, and it will
> handle the spte fault case more efficiently too, by avoiding a kernel
> exit/enter cycle, as KVM_EXIT_MEMORY_NOT_PRESENT will not be required
> anymore.
>
> It's not finished yet, so I have no 100% proof that this will work
> exactly as described above, but I don't expect trouble as the design is
> pretty straightforward.
>
> The only slight difference compared to the description above is that
> userfaultfd won't take a range of memory. Instead, the userfault ranges
> will still be marked by MADV_USERFAULT. The other option would be to
> specify the ranges using iovecs, but having to specify them in the
> syscall invocation felt less flexible than allowing random mangling of
> the userfault ranges with madvise at runtime.
>
> The userfaultfd will just bind to the whole mm, so no matter which
> thread faults on memory marked MADV_USERFAULT, the faulting thread will
> engage in the userfaultfd protocol without exiting to userland.
>
> The actual syscall API will require review later anyway; that's not the
> primary concern at this point.
>
>> * a remap_anon_pages syscall that would be used in the userfaultfd I/O
>> handler to make the page accessible. The handler would build the page
>> in a "shadow" area with the actual contents of guest memory, and then
>> remap the shadow area onto the actual guest memory.
>>
>> Andrea, please correct me.
>>
>> QEMU would use this infrastructure for post-copy migration and
>> possibly also for live snapshotting of guests. The advantage of making
>> this generic rather than KVM-based is that QEMU could use it also in
>> system-emulation mode (and of course anything else needing a read
>> barrier could use it too).
>
> Correct.
>
> Comments welcome,
> Andrea
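To check my own understanding, here is roughly how I picture the
migration/ondemand thread on top of the interface described above. Since
MADV_USERFAULT, userfaultfd() and remap_anon_pages() are not finished
yet, the prototypes, the MADV_USERFAULT value and the read-the-address /
write-it-back protocol below are purely my assumptions about the eventual
API, not something that exists today.

#include <stdint.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/mman.h>

/* None of these exist yet; prototypes and values are assumptions only. */
extern int userfaultfd(void);                                   /* proposed syscall */
extern int remap_anon_pages(void *dst, void *src, size_t len);  /* proposed syscall */
#define MADV_USERFAULT  0x100                                   /* proposed madvise flag */

#define PAGE_SIZE_4K    ((uint64_t)4096)

/* fill 'page' with the real contents of guest memory at address 'addr',
 * e.g. received from the source host during postcopy migration */
extern void fetch_page(void *page, uint64_t addr);

static void pager_thread(void *guest_mem, size_t guest_len, void *shadow_page)
{
        uint64_t addr;
        int ufd;

        /* mark the guest range as "not present until userland provides it" */
        madvise(guest_mem, guest_len, MADV_USERFAULT);

        /* binds to the whole mm: any thread faulting on MADV_USERFAULT
         * memory sleeps in the kernel instead of exiting to userland */
        ufd = userfaultfd();

        /* the read blocks until some vcpu/IO thread touches a missing page,
         * then returns the faulting address */
        while (read(ufd, &addr, sizeof(addr)) == sizeof(addr)) {
                addr &= ~(PAGE_SIZE_4K - 1);

                /* build the page in a shadow area, then remap the shadow
                 * page onto the faulting guest address */
                fetch_page(shadow_page, addr);
                remap_anon_pages((void *)addr, shadow_page, PAGE_SIZE_4K);

                /* writing the address back wakes the threads waiting on it */
                write(ufd, &addr, sizeof(addr));
        }

        close(ufd);
}

If I read the description correctly, the vcpu/IO threads never appear in
this loop at all: they simply sleep in the kernel until the final write()
wakes them, which is exactly what removes the need for
KVM_EXIT_MEMORY_NOT_PRESENT.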