* Re: Demand paging for VM on KVM
       [not found] <CAJMTq5=LXMp2jBaxPMBWX_3-+RC5j98n=Nz8TRe3AXFwRY1Beg@mail.gmail.com>
@ 2014-03-20 13:18 ` Paolo Bonzini
  2014-03-20 17:32   ` Andrea Arcangeli
  0 siblings, 1 reply; 4+ messages in thread

From: Paolo Bonzini @ 2014-03-20 13:18 UTC (permalink / raw)
To: Grigory Makarevich, kvm, gleb; +Cc: Eric Northup, Andrea Arcangeli

On 20/03/2014 00:27, Grigory Makarevich wrote:
> Hi All,
>
> I have been exploring different ways to implement on-demand paging for
> VMs running in KVM.
>
> The core of the idea is to introduce an additional exit,
> KVM_EXIT_MEMORY_NOT_PRESENT, to let the VMM's user space handle
> accesses to "not yet present" guest pages.
> Each memory slot may be instructed to keep track of an "ondemand" bit
> per page. If a page is marked "ondemand", a page fault on it will
> generate an exit to the host's user space with the information about
> the faulting page. Once the page is filled, the VMM instructs KVM to
> clear the "ondemand" bit for the page.
>
> I have a working prototype and would like to consider upstreaming the
> corresponding KVM changes.
>
> To start up the discussion before sending the actual patch set, I'd
> like to send the patch for KVM's api.txt. Please let me know what
> you think.

Hi, Andrea Arcangeli is considering a similar infrastructure at the
generic mm level. Last time I discussed it with him, his idea was
roughly to have:

* a "userfaultfd" syscall that would take a memory range and return a
file descriptor; the file descriptor becomes readable when the first
access happens on a page in the region, and the read gives the address
of the access. Any thread that accesses a still-unmapped region remains
blocked until the address of the faulting page is written back to the
userfaultfd, or gets a SIGBUS if the userfaultfd is closed.

* a remap_anon_pages syscall that would be used in the userfaultfd I/O
handler to make the page accessible.
The handler would build the page in a "shadow" area with the actual
contents of guest memory, and then remap the shadow area onto the
actual guest memory.

Andrea, please correct me.

QEMU would use this infrastructure for post-copy migration and possibly
also for live snapshotting of the guests. The advantage in making this
generic rather than KVM-based is that QEMU could use it also in
system-emulation mode (and of course anything else needing a read
barrier could use it too).

Paolo

^ permalink raw reply	[flat|nested] 4+ messages in thread
* Re: Demand paging for VM on KVM
  2014-03-20 13:18 ` Demand paging for VM on KVM Paolo Bonzini
@ 2014-03-20 17:32 ` Andrea Arcangeli
  2014-03-20 18:27   ` Grigory Makarevich
  [not found]   ` <CAJMTq5nGcZoNEgEhP6mPQqhSbLFyf4J5YRd0cszWLMak-LJ0DA@mail.gmail.com>
  0 siblings, 2 replies; 4+ messages in thread

From: Andrea Arcangeli @ 2014-03-20 17:32 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Grigory Makarevich, kvm, gleb, Eric Northup

Hi,

On Thu, Mar 20, 2014 at 02:18:50PM +0100, Paolo Bonzini wrote:
> On 20/03/2014 00:27, Grigory Makarevich wrote:
> > Hi All,
> >
> > I have been exploring different ways to implement on-demand paging for
> > VMs running in KVM.
> >
> > The core of the idea is to introduce an additional exit,
> > KVM_EXIT_MEMORY_NOT_PRESENT, to let the VMM's user space handle
> > accesses to "not yet present" guest pages.
> > Each memory slot may be instructed to keep track of an "ondemand" bit
> > per page. If a page is marked "ondemand", a page fault on it will
> > generate an exit to the host's user space with the information about
> > the faulting page. Once the page is filled, the VMM instructs KVM to
> > clear the "ondemand" bit for the page.
> >
> > I have a working prototype and would like to consider upstreaming the
> > corresponding KVM changes.

That was the original idea before userfaultfd was introduced. The
problem is then what happens when qemu is doing an O_DIRECT read from
the missing memory. It's not just a matter of adding an additional
exit: the whole qemu userland would need to become aware, in various
places, of a new kind of error out of legacy syscalls like read(2),
not just out of the KVM ioctl, which would be easy to control by
adding a new exit reason.

> >
> > To start up the discussion before sending the actual patch set, I'd
> > like to send the patch for KVM's api.txt. Please let me know what
> > you think.
>
> Hi, Andrea Arcangeli is considering a similar infrastructure at the
> generic mm level.
> Last time I discussed it with him, his idea was
> roughly to have:
>
> * a "userfaultfd" syscall that would take a memory range and return a
> file descriptor; the file descriptor becomes readable when the first
> access happens on a page in the region, and the read gives the address
> of the access. Any thread that accesses a still-unmapped region remains
> blocked until the address of the faulting page is written back to the
> userfaultfd, or gets a SIGBUS if the userfaultfd is closed.
>

Yes: by avoiding a return to userland (no exit to userland through
KVM_EXIT_MEMORY_NOT_PRESENT anymore), the userfaultfd will allow the
kernel, inside the vcpu/IO thread, to talk directly to the migration
thread (or in Grigory's case, to the ondemand paging manager thread).
The kernel will sleep waiting for the page to be present without
returning to userland. Then the migration/ondemand thread will notify
the kernel through the userfaultfd to wake up any vcpu/IO thread that
was waiting for the page, once finished (i.e. after the network
transfer and remap_anon_pages have completed).

This should solve all troubles with O_DIRECT or similar syscalls that
may access the missing KVM memory from the I/O thread, and it will
handle the spte fault case more efficiently too, by avoiding a kernel
exit/enter, as KVM_EXIT_MEMORY_NOT_PRESENT will not be required
anymore.

It's not finished yet, so I have no 100% proof that this will work
exactly as described above, but I don't expect trouble as the design
is pretty straightforward.

The only slight difference compared to the description above is that
userfaultfd won't take a range of memory. Instead, the userfault
ranges will still be marked by MADV_USERFAULT. The other option would
be to specify the ranges using iovecs, but it felt less flexible to
have to specify them in the syscall invocation instead of allowing
random mangling of the userfault ranges with madvise at runtime.
The userfaultfd will just bind to the whole mm, so no matter which
thread faults on memory marked MADV_USERFAULT, the faulting thread
will engage in the userfaultfd protocol without exiting to userland.

The actual syscall API will require review later anyway; that's not
the primary concern at this point.

> * a remap_anon_pages syscall that would be used in the userfaultfd I/O
> handler to make the page accessible. The handler would build the page
> in a "shadow" area with the actual contents of guest memory, and then
> remap the shadow area onto the actual guest memory.
>
> Andrea, please correct me.
>
> QEMU would use this infrastructure for post-copy migration and possibly
> also for live snapshotting of the guests. The advantage in making this
> generic rather than KVM-based is that QEMU could use it also in
> system-emulation mode (and of course anything else needing a read
> barrier could use it too).

Correct.

Comments welcome,
Andrea
* Re: Demand paging for VM on KVM
  2014-03-20 17:32 ` Andrea Arcangeli
@ 2014-03-20 18:27 ` Grigory Makarevich
  [not found] ` <CAJMTq5nGcZoNEgEhP6mPQqhSbLFyf4J5YRd0cszWLMak-LJ0DA@mail.gmail.com>
  1 sibling, 0 replies; 4+ messages in thread

From: Grigory Makarevich @ 2014-03-20 18:27 UTC (permalink / raw)
To: kvm

- Resending to kvm@, as the previous attempt bounced.

Andrea, Paolo,

Thanks a lot for the comments.

I like the idea of userfaultfd a lot. For my prototype I had to solve
the problem of accessing an ondemand page from paths where exiting is
not safe (the emulator is one example). I solved it by sending a
message over a netlink socket, blocking the calling thread, and
waking it up once the page is delivered. userfaultfd might be a
cleaner way to achieve the same goal.

My concerns regarding a "general" mm solution are:

- Will it work with any memory mapping scheme, or only with anonymous
memory?

- Before blocking the calling thread while serving the page fault in
the host kernel, one would need to carefully release the mmu
semaphore (otherwise, user space might be in trouble serving the page
fault), which may not be that trivial.

Regarding the qemu part of that:

- Yes, indeed, user space would need to be careful accessing
"ondemand" pages. However, that should not be a problem, considering
that qemu would need to know in advance all "ondemand" regions.
Though, I would expect some refactoring of the qemu internal api
might be required.

Thanks a lot,

Best,
Grigory

On Thu, Mar 20, 2014 at 10:32 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hi,
>
> On Thu, Mar 20, 2014 at 02:18:50PM +0100, Paolo Bonzini wrote:
>> On 20/03/2014 00:27, Grigory Makarevich wrote:
>> > Hi All,
>> >
>> > I have been exploring different ways to implement on-demand paging for
>> > VMs running in KVM.
>> >
>> > The core of the idea is to introduce an additional exit,
>> > KVM_EXIT_MEMORY_NOT_PRESENT, to let the VMM's user space handle
>> > accesses to "not yet present" guest pages.
>> > Each memory slot may be instructed to keep track of an "ondemand"
>> > bit per page. If a page is marked "ondemand", a page fault on it
>> > will generate an exit to the host's user space with the information
>> > about the faulting page. Once the page is filled, the VMM instructs
>> > KVM to clear the "ondemand" bit for the page.
>> >
>> > I have a working prototype and would like to consider upstreaming
>> > the corresponding KVM changes.
>
> That was the original idea before userfaultfd was introduced. The
> problem is then what happens when qemu is doing an O_DIRECT read from
> the missing memory. It's not just a matter of adding an additional
> exit: the whole qemu userland would need to become aware, in various
> places, of a new kind of error out of legacy syscalls like read(2),
> not just out of the KVM ioctl, which would be easy to control by
> adding a new exit reason.
>
>> >
>> > To start up the discussion before sending the actual patch set, I'd
>> > like to send the patch for KVM's api.txt. Please let me know what
>> > you think.
>>
>> Hi, Andrea Arcangeli is considering a similar infrastructure at the
>> generic mm level. Last time I discussed it with him, his idea was
>> roughly to have:
>>
>> * a "userfaultfd" syscall that would take a memory range and return a
>> file descriptor; the file descriptor becomes readable when the first
>> access happens on a page in the region, and the read gives the address
>> of the access. Any thread that accesses a still-unmapped region remains
>> blocked until the address of the faulting page is written back to the
>> userfaultfd, or gets a SIGBUS if the userfaultfd is closed.
>>
>
> Yes: by avoiding a return to userland (no exit to userland through
> KVM_EXIT_MEMORY_NOT_PRESENT anymore), the userfaultfd will allow the
> kernel, inside the vcpu/IO thread, to talk directly to the migration
> thread (or in Grigory's case, to the ondemand paging manager thread).
> The kernel will sleep waiting for the page to be present without
> returning to userland. Then the migration/ondemand thread will notify
> the kernel through the userfaultfd to wake up any vcpu/IO thread that
> was waiting for the page, once finished (i.e. after the network
> transfer and remap_anon_pages have completed).
>
> This should solve all troubles with O_DIRECT or similar syscalls that
> may access the missing KVM memory from the I/O thread, and it will
> handle the spte fault case more efficiently too, by avoiding a kernel
> exit/enter, as KVM_EXIT_MEMORY_NOT_PRESENT will not be required
> anymore.
>
> It's not finished yet, so I have no 100% proof that this will work
> exactly as described above, but I don't expect trouble as the design
> is pretty straightforward.
>
> The only slight difference compared to the description above is that
> userfaultfd won't take a range of memory. Instead, the userfault
> ranges will still be marked by MADV_USERFAULT. The other option would
> be to specify the ranges using iovecs, but it felt less flexible to
> have to specify them in the syscall invocation instead of allowing
> random mangling of the userfault ranges with madvise at runtime.
>
> The userfaultfd will just bind to the whole mm, so no matter which
> thread faults on memory marked MADV_USERFAULT, the faulting thread
> will engage in the userfaultfd protocol without exiting to userland.
>
> The actual syscall API will require review later anyway; that's not
> the primary concern at this point.
>
>> * a remap_anon_pages syscall that would be used in the userfaultfd I/O
>> handler to make the page accessible. The handler would build the page
>> in a "shadow" area with the actual contents of guest memory, and then
>> remap the shadow area onto the actual guest memory.
>>
>> Andrea, please correct me.
>>
>> QEMU would use this infrastructure for post-copy migration and possibly
>> also for live snapshotting of the guests.
>> The advantage in making this
>> generic rather than KVM-based is that QEMU could use it also in
>> system-emulation mode (and of course anything else needing a read
>> barrier could use it too).
>
> Correct.
>
> Comments welcome,
> Andrea
[parent not found: <CAJMTq5nGcZoNEgEhP6mPQqhSbLFyf4J5YRd0cszWLMak-LJ0DA@mail.gmail.com>]
* Re: Demand paging for VM on KVM
       [not found] ` <CAJMTq5nGcZoNEgEhP6mPQqhSbLFyf4J5YRd0cszWLMak-LJ0DA@mail.gmail.com>
@ 2014-03-31 18:03 ` Andrea Arcangeli
  0 siblings, 0 replies; 4+ messages in thread

From: Andrea Arcangeli @ 2014-03-31 18:03 UTC (permalink / raw)
To: Grigory Makarevich; +Cc: Paolo Bonzini, kvm, gleb, Eric Northup, Mike Waychison

Hi Grigory,

On Thu, Mar 20, 2014 at 10:50:07AM -0700, Grigory Makarevich wrote:
> Andrea, Paolo,
>
> Thanks a lot for the comments.
>
> I like the idea of userfaultfd a lot. For my prototype I had to solve
> the problem of accessing an ondemand page from paths where exiting is
> not safe (the emulator is one example). I solved it by sending a
> message over a netlink socket, blocking the calling thread, and
> waking it up once the page is delivered. userfaultfd might be a
> cleaner way to achieve the same goal.

I'm glad you like the idea of userfaultfd; the way the syscall works
tends to mirror the eventfd(2) syscall, but the protocol talked on the
fd is clearly different. So you also made the vcpu talk from the
kernel to the "ondemand" thread through some file descriptor and
protocol. So the difference is that it shouldn't be limited to the
emulator but should work for all syscalls that the IO thread may
invoke and that could hit on missing guest physical memory.

> My concerns regarding a "general" mm solution are:
>
> - Will it work with any memory mapping scheme, or only with anonymous
> memory?

Currently I have the hooks only into anonymous memory, but there's no
reason why this shouldn't work for other kinds of page faults. The
main issue is not with the technicality of waiting in the page fault,
but with the semantics of a page not being mapped. If there's a hole
in a filebacked mapping, things are different than if there's a hole
in anonymous memory, as the VM can unmap and free a filebacked page at
any time. But if you're ok to notify the migration thread even in such
a case, it could work the same.
It would be more tricky if we had to differentiate an initial fault
after the vma is created (in order to notify the migration thread only
for initial faults) from faults triggered after the VM unmapped and
freed the page as a result of VM pressure. That would require putting
a placeholder in the pagetable instead of keeping the VM code
identical to now (which zeroes the pagetable entry when a page is
unmapped from a filebacked mapping).

There is also some similarity between the userfault mechanism and
volatile ranges, but volatile ranges are handled in a fine-grained way
in the pagetables, and it looked like there was no code to share in
the end. Volatile ranges provide a very different functionality from
userfault; for example, they don't need to provide transparent
behavior if the memory is given as a parameter to syscalls, as far as
I know. They are used only to store data in memory accessed by
userland through the pagetables (not syscalls). The objective of
userfault is also fundamentally different from the objective of
volatile ranges: we cannot ever lose any data, while their whole point
is to lose data if there's VM pressure.

However, if you want to extend the userfaultfd functionality to trap
the first access in the pagetables for filebacked pages, we would also
need to mangle the pagetables with some placeholders to differentiate
the first fault, and we may want to revisit whether there's some code
or placeholder to share across the two features. Ideally, support for
filebacked ranges could be added at a second stage.

> - Before blocking the calling thread while serving the page fault in
> the host kernel, one would need to carefully release the mmu
> semaphore (otherwise, user space might be in trouble serving the page
> fault), which may not be that trivial.

Yes, all locks must be dropped before waiting for the migration
thread, and that includes the mmap_sem.
I don't think we'll need to add much complexity though, because we can
rely on the behavior of page faults, which can be repeated endlessly
until they succeed. It's certainly more trivial for real faults than
for gup (because gup can work on more than one page at a time, so it
may require some rolling back or special lock retaking during the gup
loop). With real faults the only trick, as an optimization, will be to
repeat the fault without returning to userland.

> Regarding the qemu part of that:
>
> - Yes, indeed, user space would need to be careful accessing
> "ondemand" pages. However, that should not be a problem, considering
> that qemu would need to know in advance all "ondemand" regions.
> Though, I would expect some refactoring of the qemu internal api
> might be required.

I don't think the migration thread would risk messing with missing
memory, but the refactoring of some API and wire protocol will still
be needed to make the protocol bidirectional and to handle the
postcopy mechanism. David is working on it.

Paolo also pointed out one case that won't be entirely transparent: if
gdb debugged the migration thread, breakpointing into it, and the gdb
user then tried to touch missing memory with ptrace after stopping the
migration thread, gdb would soft lockup. It would require a SIGKILL to
unblock. There's no real way to fix it in my view, and overall it
looks quite reasonable to end up in a soft lockup in such a scenario;
gdb has other ways to interfere with the app in bad ways, e.g. by
corrupting its memory. Ideally the gdb user should know what they are
doing.

Thanks!
Andrea