* Re: Demand paging for VM on KVM
       [not found] <CAJMTq5=LXMp2jBaxPMBWX_3-+RC5j98n=Nz8TRe3AXFwRY1Beg@mail.gmail.com>
@ 2014-03-20 13:18 ` Paolo Bonzini
  2014-03-20 17:32   ` Andrea Arcangeli
  0 siblings, 1 reply; 4+ messages in thread

From: Paolo Bonzini @ 2014-03-20 13:18 UTC (permalink / raw)
To: Grigory Makarevich, kvm, gleb; +Cc: Eric Northup, Andrea Arcangeli

On 20/03/2014 00:27, Grigory Makarevich wrote:
> Hi All,
>
> I have been exploring different ways to implement on-demand paging for
> VMs running in KVM.
>
> The core of the idea is to introduce an additional exit,
> KVM_EXIT_MEMORY_NOT_PRESENT, to let the VMM's user space handle
> accesses to "not yet present" guest pages.
> Each memory slot may be instructed to keep track of an "ondemand" bit
> per page. If a page is marked "ondemand", a page fault on it will
> generate an exit to the host's user space with the information about
> the faulting page. Once the page is filled, the VMM instructs KVM to
> clear the "ondemand" bit for the page.
>
> I have a working prototype and would like to consider upstreaming the
> corresponding KVM changes.
>
> To start up the discussion before sending the actual patch set, I'd
> like to send the patch for KVM's api.txt. Please let me know what
> you think.

Hi, Andrea Arcangeli is considering a similar infrastructure at the
generic mm level. Last time I discussed it with him, his idea was
roughly to have:

* a "userfaultfd" syscall that would take a memory range and return a
file descriptor; the file descriptor becomes readable when the first
access happens on a page in the region, and the read gives the address
of the access. Any thread that accesses a still-unmapped region remains
blocked until the address of the faulting page is written back to the
userfaultfd, or gets a SIGBUS if the userfaultfd is closed.

* a remap_anon_pages syscall that would be used in the userfaultfd I/O
handler to make the page accessible.
The handler would build the page in a "shadow" area with the actual
contents of guest memory, and then remap the shadow area onto the
actual guest memory.

Andrea, please correct me.

QEMU would use this infrastructure for post-copy migration and possibly
also for live snapshotting of the guests. The advantage in making this
generic rather than KVM-based is that QEMU could use it also in
system-emulation mode (and of course anything else needing a read
barrier could use it too).

Paolo

^ permalink raw reply	[flat|nested] 4+ messages in thread
* Re: Demand paging for VM on KVM
  2014-03-20 13:18 ` Demand paging for VM on KVM Paolo Bonzini
@ 2014-03-20 17:32 ` Andrea Arcangeli
  2014-03-20 18:27   ` Grigory Makarevich
  [not found]   ` <CAJMTq5nGcZoNEgEhP6mPQqhSbLFyf4J5YRd0cszWLMak-LJ0DA@mail.gmail.com>
  0 siblings, 2 replies; 4+ messages in thread

From: Andrea Arcangeli @ 2014-03-20 17:32 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Grigory Makarevich, kvm, gleb, Eric Northup

Hi,

On Thu, Mar 20, 2014 at 02:18:50PM +0100, Paolo Bonzini wrote:
> On 20/03/2014 00:27, Grigory Makarevich wrote:
> > Hi All,
> >
> > I have been exploring different ways to implement on-demand paging for
> > VMs running in KVM.
> >
> > The core of the idea is to introduce an additional exit,
> > KVM_EXIT_MEMORY_NOT_PRESENT, to let the VMM's user space handle
> > accesses to "not yet present" guest pages.
> > Each memory slot may be instructed to keep track of an "ondemand" bit
> > per page. If a page is marked "ondemand", a page fault on it will
> > generate an exit to the host's user space with the information about
> > the faulting page. Once the page is filled, the VMM instructs KVM to
> > clear the "ondemand" bit for the page.
> >
> > I have a working prototype and would like to consider upstreaming the
> > corresponding KVM changes.

That was the original idea before userfaultfd was introduced. The
problem is then what happens when qemu is doing an O_DIRECT read from
the missing memory. It's not just a matter of adding an additional
exit: the whole qemu userland would need to become aware, in various
places, of a new kind of error out of legacy syscalls like read(2),
not just out of the KVM ioctl, which would be easy to control by
adding a new exit reason.

> >
> > To start up the discussion before sending the actual patch set, I'd
> > like to send the patch for KVM's api.txt. Please let me know what
> > you think.
>
> Hi, Andrea Arcangeli is considering a similar infrastructure at the
> generic mm level.
> Last time I discussed it with him, his idea was
> roughly to have:
>
> * a "userfaultfd" syscall that would take a memory range and return a
> file descriptor; the file descriptor becomes readable when the first
> access happens on a page in the region, and the read gives the address
> of the access. Any thread that accesses a still-unmapped region remains
> blocked until the address of the faulting page is written back to the
> userfaultfd, or gets a SIGBUS if the userfaultfd is closed.
>

Yes: by avoiding a return to userland (no exit to userland through
KVM_EXIT_MEMORY_NOT_PRESENT anymore), the userfaultfd will allow the
kernel, inside the vcpu/IO thread, to talk directly to the migration
thread (or in Grigory's case, to the ondemand paging manager thread).
The kernel will sleep waiting for the page to be present without
returning to userland. Then the migration/ondemand thread will notify
the kernel through the userfaultfd to wake up any vcpu/IO thread that
was waiting for the page, once finished (i.e. after the network
transfer and remap_anon_pages have completed).

This should solve all troubles with O_DIRECT or similar syscalls that
may access the missing KVM memory from the I/O thread, and it will
handle the spte fault case more efficiently too, by avoiding a kernel
exit/enter, as KVM_EXIT_MEMORY_NOT_PRESENT will not be required
anymore.

It's not finished yet, so I have no 100% proof that this will work
exactly as described above, but I don't expect trouble as the design
is pretty straightforward.

The only slight difference compared to the description above is that
userfaultfd won't take a range of memory. Instead, the userfault
ranges will still be marked by MADV_USERFAULT. The other option would
be to specify the ranges using iovecs, but it felt less flexible to
have to specify them in the syscall invocation instead of allowing
random mangling of the userfault ranges with madvise at runtime.
The userfaultfd will just bind to the whole mm, so no matter which
thread faults on memory marked MADV_USERFAULT, the faulting thread
will engage in the userfaultfd protocol without exiting to userland.

The actual syscall API will require review later anyway; that's not
the primary concern at this point.

> * a remap_anon_pages syscall that would be used in the userfaultfd I/O
> handler to make the page accessible. The handler would build the page
> in a "shadow" area with the actual contents of guest memory, and then
> remap the shadow area onto the actual guest memory.
>
> Andrea, please correct me.
>
> QEMU would use this infrastructure for post-copy migration and possibly
> also for live snapshotting of the guests. The advantage in making this
> generic rather than KVM-based is that QEMU could use it also in
> system-emulation mode (and of course anything else needing a read
> barrier could use it too).

Correct.

Comments welcome,
Andrea
* Re: Demand paging for VM on KVM
  2014-03-20 17:32 ` Andrea Arcangeli
@ 2014-03-20 18:27 ` Grigory Makarevich
  [not found] ` <CAJMTq5nGcZoNEgEhP6mPQqhSbLFyf4J5YRd0cszWLMak-LJ0DA@mail.gmail.com>
  1 sibling, 0 replies; 4+ messages in thread

From: Grigory Makarevich @ 2014-03-20 18:27 UTC (permalink / raw)
To: kvm

- Resending to kvm@, as the previous attempt bounced.

Andrea, Paolo,

Thanks a lot for the comments.

I like the idea of userfaultfd a lot. For my prototype I had to solve
the problem of accessing an ondemand page from paths where exiting is
not safe (the emulator is one example). I solved it by sending a
message over a netlink socket, blocking the calling thread, and
waking it up once the page is delivered. userfaultfd might be a
cleaner way to achieve the same goal.

My concerns regarding a "general" mm solution are:

- Will it work with any memory mapping scheme, or only with anonymous
memory?

- Before blocking the calling thread while serving the page fault in
the host kernel, one would need to carefully release the mmu
semaphore (otherwise, user space might be in trouble serving the page
fault), which may not be that trivial.

Regarding the qemu part of that:

- Yes, indeed, user space would need to be careful accessing
"ondemand" pages. However, that should not be a problem, considering
that qemu would need to know in advance all "ondemand" regions.
Though, I would expect some refactoring of the qemu internal api
might be required.

Thanks a lot,

Best,
Grigory

On Thu, Mar 20, 2014 at 10:32 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hi,
>
> On Thu, Mar 20, 2014 at 02:18:50PM +0100, Paolo Bonzini wrote:
>> On 20/03/2014 00:27, Grigory Makarevich wrote:
>> > Hi All,
>> >
>> > I have been exploring different ways to implement on-demand paging for
>> > VMs running in KVM.
>> >
>> > The core of the idea is to introduce an additional exit,
>> > KVM_EXIT_MEMORY_NOT_PRESENT, to let the VMM's user space handle
>> > accesses to "not yet present" guest pages.
>> > Each memory slot may be instructed to keep track of an "ondemand"
>> > bit per page. If a page is marked "ondemand", a page fault on it
>> > will generate an exit to the host's user space with the information
>> > about the faulting page. Once the page is filled, the VMM instructs
>> > KVM to clear the "ondemand" bit for the page.
>> >
>> > I have a working prototype and would like to consider upstreaming
>> > the corresponding KVM changes.
>
> That was the original idea before userfaultfd was introduced. The
> problem is then what happens when qemu is doing an O_DIRECT read from
> the missing memory. It's not just a matter of adding an additional
> exit: the whole qemu userland would need to become aware, in various
> places, of a new kind of error out of legacy syscalls like read(2),
> not just out of the KVM ioctl, which would be easy to control by
> adding a new exit reason.
>
>> >
>> > To start up the discussion before sending the actual patch set, I'd
>> > like to send the patch for KVM's api.txt. Please let me know what
>> > you think.
>>
>> Hi, Andrea Arcangeli is considering a similar infrastructure at the
>> generic mm level. Last time I discussed it with him, his idea was
>> roughly to have:
>>
>> * a "userfaultfd" syscall that would take a memory range and return a
>> file descriptor; the file descriptor becomes readable when the first
>> access happens on a page in the region, and the read gives the address
>> of the access. Any thread that accesses a still-unmapped region remains
>> blocked until the address of the faulting page is written back to the
>> userfaultfd, or gets a SIGBUS if the userfaultfd is closed.
>>
>
> Yes: by avoiding a return to userland (no exit to userland through
> KVM_EXIT_MEMORY_NOT_PRESENT anymore), the userfaultfd will allow the
> kernel, inside the vcpu/IO thread, to talk directly to the migration
> thread (or in Grigory's case, to the ondemand paging manager thread).
> The kernel will sleep waiting for the page to be present without
> returning to userland. Then the migration/ondemand thread will notify
> the kernel through the userfaultfd to wake up any vcpu/IO thread that
> was waiting for the page, once finished (i.e. after the network
> transfer and remap_anon_pages have completed).
>
> This should solve all troubles with O_DIRECT or similar syscalls that
> may access the missing KVM memory from the I/O thread, and it will
> handle the spte fault case more efficiently too, by avoiding a kernel
> exit/enter, as KVM_EXIT_MEMORY_NOT_PRESENT will not be required
> anymore.
>
> It's not finished yet, so I have no 100% proof that this will work
> exactly as described above, but I don't expect trouble as the design
> is pretty straightforward.
>
> The only slight difference compared to the description above is that
> userfaultfd won't take a range of memory. Instead, the userfault
> ranges will still be marked by MADV_USERFAULT. The other option would
> be to specify the ranges using iovecs, but it felt less flexible to
> have to specify them in the syscall invocation instead of allowing
> random mangling of the userfault ranges with madvise at runtime.
>
> The userfaultfd will just bind to the whole mm, so no matter which
> thread faults on memory marked MADV_USERFAULT, the faulting thread
> will engage in the userfaultfd protocol without exiting to userland.
>
> The actual syscall API will require review later anyway; that's not
> the primary concern at this point.
>
>> * a remap_anon_pages syscall that would be used in the userfaultfd I/O
>> handler to make the page accessible. The handler would build the page
>> in a "shadow" area with the actual contents of guest memory, and then
>> remap the shadow area onto the actual guest memory.
>>
>> Andrea, please correct me.
>>
>> QEMU would use this infrastructure for post-copy migration and possibly
>> also for live snapshotting of the guests.
>> The advantage in making this
>> generic rather than KVM-based is that QEMU could use it also in
>> system-emulation mode (and of course anything else needing a read
>> barrier could use it too).
>
> Correct.
>
> Comments welcome,
> Andrea
[parent not found: <CAJMTq5nGcZoNEgEhP6mPQqhSbLFyf4J5YRd0cszWLMak-LJ0DA@mail.gmail.com>]
* Re: Demand paging for VM on KVM
       [not found] ` <CAJMTq5nGcZoNEgEhP6mPQqhSbLFyf4J5YRd0cszWLMak-LJ0DA@mail.gmail.com>
@ 2014-03-31 18:03 ` Andrea Arcangeli
  0 siblings, 0 replies; 4+ messages in thread

From: Andrea Arcangeli @ 2014-03-31 18:03 UTC (permalink / raw)
To: Grigory Makarevich; +Cc: Paolo Bonzini, kvm, gleb, Eric Northup, Mike Waychison

Hi Grigory,

On Thu, Mar 20, 2014 at 10:50:07AM -0700, Grigory Makarevich wrote:
> Andrea, Paolo,
>
> Thanks a lot for the comments.
>
> I like the idea of userfaultfd a lot. For my prototype I had to solve
> the problem of accessing an ondemand page from paths where exiting is
> not safe (the emulator is one example). I solved it by sending a
> message over a netlink socket, blocking the calling thread, and
> waking it up once the page is delivered. userfaultfd might be a
> cleaner way to achieve the same goal.

I'm glad you like the idea of userfaultfd; the way the syscall works
tends to mirror the eventfd(2) syscall, but the protocol talked on the
fd is clearly different. So you also made the vcpu talk from the
kernel to the "ondemand" thread through some file descriptor and
protocol. So the difference is that it shouldn't be limited to the
emulator but should work for all syscalls that the IO thread may
invoke and that could hit on missing guest physical memory.

> My concerns regarding a "general" mm solution are:
>
> - Will it work with any memory mapping scheme, or only with anonymous
> memory?

Currently I have the hooks only into anonymous memory, but there's no
reason why this shouldn't work for other kinds of page faults. The
main issue is not with the technicality of waiting in the page fault,
but with the semantics of a page not being mapped. If there's a hole
in a filebacked mapping, things are different than if there's a hole
in anonymous memory, as the VM can unmap and free a filebacked page at
any time. But if you're ok to notify the migration thread even in such
a case, it could work the same.
It would be more tricky if we had to differentiate an initial fault
after the vma is created (in order to notify the migration thread only
for initial faults) from faults triggered after the VM unmapped and
freed the page as a result of VM pressure. That would require putting
a placeholder in the pagetable instead of keeping the VM code
identical to now (which zeroes the pagetable entry when a page is
unmapped from a filebacked mapping).

There is also some similarity between the userfault mechanism and
volatile ranges, but volatile ranges are handled in a fine-grained way
in the pagetables, and it looked like there was no code to share in
the end. Volatile ranges provide a very different functionality from
userfault; for example, they don't need to provide transparent
behavior if the memory is given as a parameter to syscalls, as far as
I know. They are used only to store data in memory accessed by
userland through the pagetables (not syscalls). The objective of
userfault is also fundamentally different from the objective of
volatile ranges: we cannot ever lose any data, while their whole point
is to lose data if there's VM pressure.

However, if you want to extend the userfaultfd functionality to trap
the first access in the pagetables for filebacked pages, we would also
need to mangle the pagetables with some placeholders to differentiate
the first fault, and we may want to revisit whether there's some code
or placeholder to share across the two features. Ideally, support for
filebacked ranges could be added at a second stage.

> - Before blocking the calling thread while serving the page fault in
> the host kernel, one would need to carefully release the mmu
> semaphore (otherwise, user space might be in trouble serving the page
> fault), which may not be that trivial.

Yes, all locks must be dropped before waiting for the migration
thread, and that includes the mmap_sem.
I don't think we'll need to add much complexity though, because we can
rely on the behavior of page faults, which can be repeated endlessly
until they succeed. It's certainly more trivial for real faults than
for gup (because gup can work on more than one page at a time, so it
may require some rolling back or special lock retaking during the gup
loop). With real faults the only trick, as an optimization, will be to
repeat the fault without returning to userland.

> Regarding the qemu part of that:
>
> - Yes, indeed, user space would need to be careful accessing
> "ondemand" pages. However, that should not be a problem, considering
> that qemu would need to know in advance all "ondemand" regions.
> Though, I would expect some refactoring of the qemu internal api
> might be required.

I don't think the migration thread would risk messing with missing
memory, but the refactoring of some API and wire protocol will still
be needed to make the protocol bidirectional and to handle the
postcopy mechanism. David is working on it.

Paolo also pointed out one case that won't be entirely transparent: if
gdb debugged the migration thread, breakpointing into it, and the gdb
user then tried to touch missing memory with ptrace after stopping the
migration thread, gdb would soft lockup. It would require a SIGKILL to
unblock. There's no real way to fix it in my view, and overall it
looks quite reasonable to end up in a soft lockup in such a scenario;
gdb has other ways to interfere with the app in bad ways, e.g. by
corrupting its memory. Ideally the gdb user should know what they are
doing.

Thanks!
Andrea