kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC] Improving userfaultfd scalability for live migration
@ 2022-12-01 19:37 James Houghton
  2022-12-03  1:03 ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: James Houghton @ 2022-12-01 19:37 UTC (permalink / raw)
  To: Andrea Arcangeli, Peter Xu, Paolo Bonzini, Axel Rasmussen,
	Sean Christopherson, Linux MM, kvm, chao.p.peng

One of the main challenges of using userfaultfd is its performance. I
have some ideas for how userfaultfd could be made scalable for
post-copy live migration. I'm not sending a series for now; I want to
make sure the general approach here is something upstream would be
interested in.

== Background ==
The main scalability bottleneck comes from queueing faults with
userfaultfd (i.e., interacting with fault_wqh/fault_pending_wqh).
Doing so requires us to take those wait_queues' locks exclusively. I
think we have these options to deal with this:

1. Avoid queueing faults (if possible)
2. Reduce contention (have lots of VMAs, 1 userfaultfd per VMA)
3. Allow multiple userfaultfds on a VMA.
4. Remove contention in the wait_queues (i.e., implement a "lockless
wait_queue", whatever that might be).

#2 can help a little bit, but we have two problems: we don't want TONS
of VMAs, and it doesn't help the case where we have lots of faults on
the same VMA. #3 could be possible, but it would be complicated and
wouldn't completely fix the problem. #4, which I doubt is even
feasible, would introduce a lot of complexity.

#1, however, is quite doable. The main codepath for post-copy, the
path that is taken when a vCPU attempts to access unmapped memory, is
(for x86, but similar for other architectures): handle_ept_violation
-> hva_to_pfn -> GUP -> handle_userfault. I'll call this the "EPT
violation path" or "mem fault path." Other post-copy paths include at
least: (i) KVM attempts to access guest memory via.
copy_{to,from}_user -> #pf -> handle_mm_fault -> handle_userfault, and
(ii) other callers of gfn_to_pfn* or hva_to_pfn* outside of the EPT
violation path (e.g., instruction emulation).

We want the EPT violation path to be fast, as it is taken the vast
majority of the time. Note that this case is run by the vCPU thread
itself / the thread that called KVM_RUN, and, if GUP "fails", in most
cases KVM_RUN will exit with -EFAULT. We can use this to our
advantage.

If we can get KVM_RUN to exit with information about which page we
need to fetch, we can do post-copy, and we never have to queue a page
fault with userfaultfd!

== Getting the faulting GPA to userspace ==
KVM_EXIT_MEMORY_FAULT was introduced recently [1] (not yet merged),
and it provides the main functionality we need. We can extend it
easily to support our use case here, and I think we have at least two
options:
- Introduce something like KVM_CAP_MEM_FAULT_REPORTING, which causes
KVM_RUN to exit with exit reason KVM_EXIT_MEMORY_FAULT when it would
otherwise just return -EFAULT (i.e., when kvm_handle_bad_page returns
-EFAULT).
- We're already introducing a new CAP, so just tie the above behavior
to whether or not one of the CAPs (below) is being used.

== Potential Solutions ==
We need the solution to handle both the EPT violation case and other
cases properly. Today, we can easily handle the EPT violation case if
we just use UFFD_FEATURE_SIGBUS, but that doesn't fix the other cases
(e.g. instruction emulation might fail, and we won't know what to do
to resolve it).

In both of the following solutions, hva_to_pfn needs to know if the
caller is the EPT violation/mem fault path. To do that, we'll probably
need to add a parameter to __gfn_to_pfn_memslot, gfn_to_pfn_prot, and
maybe some other functions. I'm not sure what the cleanest way to do
this is. It's possible that the new parameter here could be more
general than "if we came from a mem fault": whether the caller wants
GUP to fail quickly or not.

Now that hva_to_pfn knows if it is being called from memfault, we can
talk about how we can make it fail quickly in the userfaultfd case.

-- Introduce KVM_CAP_USERFAULT_NOWAIT
In hva_to_pfn_slow, if we came from a mem_fault, we can include a new
flag in our call to GUP: FOLL_USERFAULT_NOWAIT. Then, in GUP, it can
pass a new fault flag if it must call into a page fault routine:
FAULT_USERFAULT_NOWAIT. That will make its way to handle_userfault(),
and we can exit quickly (say, with VM_FAULT_SIGBUS, but any
VM_FAULT_ERROR would do).

Userspace can then take appropriate action: if they registered for
MISSING faults, we can UFFDIO_COPY and, if they registered for MINOR
faults, we can UFFDIO_CONTINUE. However, userspace no longer knows
which kind of fault it was if they registered for both kinds. I don't
see this as a problem.

-- Introduce KVM_CAP_MEM_FAULT_NOWAIT
In KVM, if this CAP is specified, never call hva_to_pfn_slow from the
mem fault path, and always return KVM_PFN_ERR_FAULT if fast GUP fails.
Fast GUP can fail for all sorts of reasons, so the actions userspace
can take to resolve these are more complicated:
1) If userspace knows that we never UFFDIO_COPY'd or UFFDIO_CONTINUE'd
the page, we can do that now and restart the vCPU.
2) If userspace has previously UFFDIO_COPY/CONTINUE'd, we need to get
the kernel to make the page ready again. We could read from the
faulting address, but that might set up a read-only mapping, so
instead, we can use MADV_POPULATE_WRITE to set up a RW mapping, and
fast GUP should succeed.

If MADV_POPULATE_WRITE races with a different thread that is
UFFDIO_COPY/CONTINUEing the same page and happens to win, it will drop
into handle_userfault and be woken up with a UFFDIO_WAKE later via the
same path that handles the non-mem-fault case.

This solution might seem bizarre, but it makes it so that the mem
fault path never needs to grab the mmap lock for reading. (We still
have to grab it for reading in UFFDIO_COPY/CONTINUE.)

== Problems ==
The major problem here is that this only solves the scalability
problem for the KVM demand paging case. Other userfaultfd users, if
they have scalability problems, will need to find another approach.

- James Houghton

[1]: https://lore.kernel.org/all/20221025151344.3784230-4-chao.p.peng@linux.intel.com/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Improving userfaultfd scalability for live migration
  2022-12-01 19:37 [RFC] Improving userfaultfd scalability for live migration James Houghton
@ 2022-12-03  1:03 ` Sean Christopherson
  2022-12-05 15:27   ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2022-12-03  1:03 UTC (permalink / raw)
  To: James Houghton
  Cc: Andrea Arcangeli, Peter Xu, Paolo Bonzini, Axel Rasmussen,
	Linux MM, kvm, chao.p.peng

On Thu, Dec 01, 2022, James Houghton wrote:
> #1, however, is quite doable. The main codepath for post-copy, the
> path that is taken when a vCPU attempts to access unmapped memory, is
> (for x86, but similar for other architectures): handle_ept_violation
> -> hva_to_pfn -> GUP -> handle_userfault. I'll call this the "EPT
> violation path" or "mem fault path." Other post-copy paths include at
> least: (i) KVM attempts to access guest memory via.
> copy_{to,from}_user -> #pf -> handle_mm_fault -> handle_userfault, and
> (ii) other callers of gfn_to_pfn* or hva_to_pfn* outside of the EPT
> violation path (e.g., instruction emulation).
> 
> We want the EPT violation path to be fast, as it is taken the vast
> majority of the time.

...

> == Getting the faulting GPA to userspace ==
> KVM_EXIT_MEMORY_FAULT was introduced recently [1] (not yet merged),
> and it provides the main functionality we need. We can extend it
> easily to support our use case here, and I think we have at least two
> options:
> - Introduce something like KVM_CAP_MEM_FAULT_REPORTING, which causes
> KVM_RUN to exit with exit reason KVM_EXIT_MEMORY_FAULT when it would
> otherwise just return -EFAULT (i.e., when kvm_handle_bad_page returns
> -EFAULT).
> - We're already introducing a new CAP, so just tie the above behavior
> to whether or not one of the CAPs (below) is being used.

We might even be able to get away with a third option: unconditionally return
KVM_EXIT_MEMORY_FAULT instead of -EFAULT when the error occurs when accessing
guest memory.

> == Problems ==
> The major problem here is that this only solves the scalability
> problem for the KVM demand paging case. Other userfaultfd users, if
> they have scalability problems, will need to find another approach.

It may not fully solve KVM's problem either.  E.g. if the VM is running nested
VMs, many (most?) of the user faults could be triggered by FNAME(walk_addr_generic)
via __get_user() when walking L1's EPT tables.

Disclaimer: I know _very_ little about UFFD.

Rather than add yet another flag to gup(), what about flag to say the task doesn't
want to wait for UFFD faults?  If desired/necessary, KVM could even toggle the flag
in KVM_RUN so that faults that occur outside of KVM ultimately don't send an actual
SIGBUGS.

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 07c81ab3fd4d..7f66b56dd6e7 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -394,7 +394,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
         * shmem_vm_ops->fault method is invoked even during
         * coredumping without mmap_lock and it ends up here.
         */
-       if (current->flags & (PF_EXITING|PF_DUMPCORE))
+       if (current->flags & (PF_EXITING|PF_DUMPCORE|PF_NO_UFFD_WAIT))
                goto out;
 
        /*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffb6eb55cd13..4c6c53ac6531 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1729,7 +1729,7 @@ extern struct pid *cad_pid;
 #define PF_MEMALLOC            0x00000800      /* Allocating memory */
 #define PF_NPROC_EXCEEDED      0x00001000      /* set_user() noticed that RLIMIT_NPROC was exceeded */
 #define PF_USED_MATH           0x00002000      /* If unset the fpu must be initialized before use */
-#define PF__HOLE__00004000     0x00004000
+#define PF_NO_UFFD_WAIT                0x00004000
 #define PF_NOFREEZE            0x00008000      /* This thread should not be frozen */
 #define PF__HOLE__00010000     0x00010000
 #define PF_KSWAPD              0x00020000      /* I am kswapd */


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC] Improving userfaultfd scalability for live migration
  2022-12-03  1:03 ` Sean Christopherson
@ 2022-12-05 15:27   ` Peter Xu
  2022-12-05 17:31     ` David Matlack
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2022-12-05 15:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: James Houghton, Andrea Arcangeli, Paolo Bonzini, Axel Rasmussen,
	Linux MM, kvm, chao.p.peng

On Sat, Dec 03, 2022 at 01:03:38AM +0000, Sean Christopherson wrote:
> On Thu, Dec 01, 2022, James Houghton wrote:
> > #1, however, is quite doable. The main codepath for post-copy, the
> > path that is taken when a vCPU attempts to access unmapped memory, is
> > (for x86, but similar for other architectures): handle_ept_violation
> > -> hva_to_pfn -> GUP -> handle_userfault. I'll call this the "EPT
> > violation path" or "mem fault path." Other post-copy paths include at
> > least: (i) KVM attempts to access guest memory via.
> > copy_{to,from}_user -> #pf -> handle_mm_fault -> handle_userfault, and
> > (ii) other callers of gfn_to_pfn* or hva_to_pfn* outside of the EPT
> > violation path (e.g., instruction emulation).
> > 
> > We want the EPT violation path to be fast, as it is taken the vast
> > majority of the time.
> 
> ...
> 
> > == Getting the faulting GPA to userspace ==
> > KVM_EXIT_MEMORY_FAULT was introduced recently [1] (not yet merged),
> > and it provides the main functionality we need. We can extend it
> > easily to support our use case here, and I think we have at least two
> > options:
> > - Introduce something like KVM_CAP_MEM_FAULT_REPORTING, which causes
> > KVM_RUN to exit with exit reason KVM_EXIT_MEMORY_FAULT when it would
> > otherwise just return -EFAULT (i.e., when kvm_handle_bad_page returns
> > -EFAULT).
> > - We're already introducing a new CAP, so just tie the above behavior
> > to whether or not one of the CAPs (below) is being used.
> 
> We might even be able to get away with a third option: unconditionally return
> KVM_EXIT_MEMORY_FAULT instead of -EFAULT when the error occurs when accessing
> guest memory.
> 
> > == Problems ==
> > The major problem here is that this only solves the scalability
> > problem for the KVM demand paging case. Other userfaultfd users, if
> > they have scalability problems, will need to find another approach.
> 
> It may not fully solve KVM's problem either.  E.g. if the VM is running nested
> VMs, many (most?) of the user faults could be triggered by FNAME(walk_addr_generic)
> via __get_user() when walking L1's EPT tables.
> 
> Disclaimer: I know _very_ little about UFFD.
> 
> Rather than add yet another flag to gup(), what about flag to say the task doesn't
> want to wait for UFFD faults?  If desired/necessary, KVM could even toggle the flag
> in KVM_RUN so that faults that occur outside of KVM ultimately don't send an actual
> SIGBUGS.
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 07c81ab3fd4d..7f66b56dd6e7 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -394,7 +394,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>          * shmem_vm_ops->fault method is invoked even during
>          * coredumping without mmap_lock and it ends up here.
>          */
> -       if (current->flags & (PF_EXITING|PF_DUMPCORE))
> +       if (current->flags & (PF_EXITING|PF_DUMPCORE|PF_NO_UFFD_WAIT))
>                 goto out;

I'll have a closer read on the nested part, but note that this path already
has the mmap lock then it invalidates the goal if we want to avoid taking
it from the first place, or maybe we don't care?

If we want to avoid taking the mmap lock at all (hence the fast-gup
approach), I'd also suggest we don't make it related to uffd at all but
instead an interface to say "let's check whether the page tables are there
(walk pgtable by fast-gup only), if not return to userspace".

Because IIUC fast-gup has nothing to do with uffd, so it can also be a more
generic interface.  It's just that if the userspace knows what it's doing
(postcopy-ing), it knows then the faults can potentially be resolved by
userfaultfd at this stage.

>  
>         /*
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ffb6eb55cd13..4c6c53ac6531 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1729,7 +1729,7 @@ extern struct pid *cad_pid;
>  #define PF_MEMALLOC            0x00000800      /* Allocating memory */
>  #define PF_NPROC_EXCEEDED      0x00001000      /* set_user() noticed that RLIMIT_NPROC was exceeded */
>  #define PF_USED_MATH           0x00002000      /* If unset the fpu must be initialized before use */
> -#define PF__HOLE__00004000     0x00004000
> +#define PF_NO_UFFD_WAIT                0x00004000
>  #define PF_NOFREEZE            0x00008000      /* This thread should not be frozen */
>  #define PF__HOLE__00010000     0x00010000
>  #define PF_KSWAPD              0x00020000      /* I am kswapd */
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Improving userfaultfd scalability for live migration
  2022-12-05 15:27   ` Peter Xu
@ 2022-12-05 17:31     ` David Matlack
  2022-12-05 18:03       ` David Matlack
  2022-12-05 18:20       ` Sean Christopherson
  0 siblings, 2 replies; 16+ messages in thread
From: David Matlack @ 2022-12-05 17:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: Sean Christopherson, James Houghton, Andrea Arcangeli,
	Paolo Bonzini, Axel Rasmussen, Linux MM, kvm, chao.p.peng

On Mon, Dec 5, 2022 at 7:30 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Sat, Dec 03, 2022 at 01:03:38AM +0000, Sean Christopherson wrote:
> > On Thu, Dec 01, 2022, James Houghton wrote:
> > > #1, however, is quite doable. The main codepath for post-copy, the
> > > path that is taken when a vCPU attempts to access unmapped memory, is
> > > (for x86, but similar for other architectures): handle_ept_violation
> > > -> hva_to_pfn -> GUP -> handle_userfault. I'll call this the "EPT
> > > violation path" or "mem fault path." Other post-copy paths include at
> > > least: (i) KVM attempts to access guest memory via.
> > > copy_{to,from}_user -> #pf -> handle_mm_fault -> handle_userfault, and
> > > (ii) other callers of gfn_to_pfn* or hva_to_pfn* outside of the EPT
> > > violation path (e.g., instruction emulation).
> > >
> > > We want the EPT violation path to be fast, as it is taken the vast
> > > majority of the time.
> >
> > ...
> >
> > > == Getting the faulting GPA to userspace ==
> > > KVM_EXIT_MEMORY_FAULT was introduced recently [1] (not yet merged),
> > > and it provides the main functionality we need. We can extend it
> > > easily to support our use case here, and I think we have at least two
> > > options:
> > > - Introduce something like KVM_CAP_MEM_FAULT_REPORTING, which causes
> > > KVM_RUN to exit with exit reason KVM_EXIT_MEMORY_FAULT when it would
> > > otherwise just return -EFAULT (i.e., when kvm_handle_bad_page returns
> > > -EFAULT).
> > > - We're already introducing a new CAP, so just tie the above behavior
> > > to whether or not one of the CAPs (below) is being used.
> >
> > We might even be able to get away with a third option: unconditionally return
> > KVM_EXIT_MEMORY_FAULT instead of -EFAULT when the error occurs when accessing
> > guest memory.
> >
> > > == Problems ==
> > > The major problem here is that this only solves the scalability
> > > problem for the KVM demand paging case. Other userfaultfd users, if
> > > they have scalability problems, will need to find another approach.
> >
> > It may not fully solve KVM's problem either.  E.g. if the VM is running nested
> > VMs, many (most?) of the user faults could be triggered by FNAME(walk_addr_generic)
> > via __get_user() when walking L1's EPT tables.

We could always modify FNAME(walk_addr_generic) to return out to user
space in the same way if that is indeed another bottleneck.

> >
> > Disclaimer: I know _very_ little about UFFD.
> >
> > Rather than add yet another flag to gup(), what about flag to say the task doesn't
> > want to wait for UFFD faults?  If desired/necessary, KVM could even toggle the flag
> > in KVM_RUN so that faults that occur outside of KVM ultimately don't send an actual
> > SIGBUGS.

There are some copy_to/from_user() calls in KVM that cannot easily
exit out to KVM_RUN (for example, in the guts of the emulator IIRC).
But we could use your approach just to wrap the specific call sites
that can return from KVM_RUN.

> >
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 07c81ab3fd4d..7f66b56dd6e7 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -394,7 +394,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >          * shmem_vm_ops->fault method is invoked even during
> >          * coredumping without mmap_lock and it ends up here.
> >          */
> > -       if (current->flags & (PF_EXITING|PF_DUMPCORE))
> > +       if (current->flags & (PF_EXITING|PF_DUMPCORE|PF_NO_UFFD_WAIT))
> >                 goto out;
>
> I'll have a closer read on the nested part, but note that this path already
> has the mmap lock then it invalidates the goal if we want to avoid taking
> it from the first place, or maybe we don't care?
>
> If we want to avoid taking the mmap lock at all (hence the fast-gup
> approach), I'd also suggest we don't make it related to uffd at all but
> instead an interface to say "let's check whether the page tables are there
> (walk pgtable by fast-gup only), if not return to userspace".
>
> Because IIUC fast-gup has nothing to do with uffd, so it can also be a more
> generic interface.  It's just that if the userspace knows what it's doing
> (postcopy-ing), it knows then the faults can potentially be resolved by
> userfaultfd at this stage.

Are there any cases where fast-gup can fail while uffd is enabled but
it's not due to uffd? e.g. if a page is swapped out? I don't know what
userspace would do in those situations to make forward progress.



>
> >
> >         /*
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index ffb6eb55cd13..4c6c53ac6531 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1729,7 +1729,7 @@ extern struct pid *cad_pid;
> >  #define PF_MEMALLOC            0x00000800      /* Allocating memory */
> >  #define PF_NPROC_EXCEEDED      0x00001000      /* set_user() noticed that RLIMIT_NPROC was exceeded */
> >  #define PF_USED_MATH           0x00002000      /* If unset the fpu must be initialized before use */
> > -#define PF__HOLE__00004000     0x00004000
> > +#define PF_NO_UFFD_WAIT                0x00004000
> >  #define PF_NOFREEZE            0x00008000      /* This thread should not be frozen */
> >  #define PF__HOLE__00010000     0x00010000
> >  #define PF_KSWAPD              0x00020000      /* I am kswapd */
> >
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Improving userfaultfd scalability for live migration
  2022-12-05 17:31     ` David Matlack
@ 2022-12-05 18:03       ` David Matlack
  2022-12-05 18:23         ` Sean Christopherson
  2022-12-05 18:20       ` Sean Christopherson
  1 sibling, 1 reply; 16+ messages in thread
From: David Matlack @ 2022-12-05 18:03 UTC (permalink / raw)
  To: Peter Xu
  Cc: Sean Christopherson, James Houghton, Andrea Arcangeli,
	Paolo Bonzini, Axel Rasmussen, Linux MM, kvm, chao.p.peng

On Mon, Dec 5, 2022 at 9:31 AM David Matlack <dmatlack@google.com> wrote:
>
> On Mon, Dec 5, 2022 at 7:30 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Sat, Dec 03, 2022 at 01:03:38AM +0000, Sean Christopherson wrote:
> > > On Thu, Dec 01, 2022, James Houghton wrote:
> > > > #1, however, is quite doable. The main codepath for post-copy, the
> > > > path that is taken when a vCPU attempts to access unmapped memory, is
> > > > (for x86, but similar for other architectures): handle_ept_violation
> > > > -> hva_to_pfn -> GUP -> handle_userfault. I'll call this the "EPT
> > > > violation path" or "mem fault path." Other post-copy paths include at
> > > > least: (i) KVM attempts to access guest memory via.
> > > > copy_{to,from}_user -> #pf -> handle_mm_fault -> handle_userfault, and
> > > > (ii) other callers of gfn_to_pfn* or hva_to_pfn* outside of the EPT
> > > > violation path (e.g., instruction emulation).
> > > >
> > > > We want the EPT violation path to be fast, as it is taken the vast
> > > > majority of the time.
> > >
> > > ...
> > >
> > > > == Getting the faulting GPA to userspace ==
> > > > KVM_EXIT_MEMORY_FAULT was introduced recently [1] (not yet merged),
> > > > and it provides the main functionality we need. We can extend it
> > > > easily to support our use case here, and I think we have at least two
> > > > options:
> > > > - Introduce something like KVM_CAP_MEM_FAULT_REPORTING, which causes
> > > > KVM_RUN to exit with exit reason KVM_EXIT_MEMORY_FAULT when it would
> > > > otherwise just return -EFAULT (i.e., when kvm_handle_bad_page returns
> > > > -EFAULT).
> > > > - We're already introducing a new CAP, so just tie the above behavior
> > > > to whether or not one of the CAPs (below) is being used.
> > >
> > > We might even be able to get away with a third option: unconditionally return
> > > KVM_EXIT_MEMORY_FAULT instead of -EFAULT when the error occurs when accessing
> > > guest memory.
> > >
> > > > == Problems ==
> > > > The major problem here is that this only solves the scalability
> > > > problem for the KVM demand paging case. Other userfaultfd users, if
> > > > they have scalability problems, will need to find another approach.
> > >
> > > It may not fully solve KVM's problem either.  E.g. if the VM is running nested
> > > VMs, many (most?) of the user faults could be triggered by FNAME(walk_addr_generic)
> > > via __get_user() when walking L1's EPT tables.
>
> We could always modify FNAME(walk_addr_generic) to return out to user
> space in the same way if that is indeed another bottleneck.

Scratch that, walk_addr_generic ultimately has a ton of callers
throughout KVM, so it would be difficult to plumb the error handling
out to userspace.

That being said, I'm not sure this is really going to be a bottleneck.
Internally we are using the netlink socket for these faults without
issue and, from a theoretical perspective, only a small fraction of
guest memory (and thus a small fraction of uffd faults) are guest EPT
tables. And any guest-initiated access of a not-present guest EPT
table will still go through handle_ept_violation(), not
walk_addr_generic().

>
> > >
> > > Disclaimer: I know _very_ little about UFFD.
> > >
> > > Rather than add yet another flag to gup(), what about flag to say the task doesn't
> > > want to wait for UFFD faults?  If desired/necessary, KVM could even toggle the flag
> > > in KVM_RUN so that faults that occur outside of KVM ultimately don't send an actual
> > > SIGBUGS.
>
> There are some copy_to/from_user() calls in KVM that cannot easily
> exit out to KVM_RUN (for example, in the guts of the emulator IIRC).
> But we could use your approach just to wrap the specific call sites
> that can return from KVM_RUN.
>
> > >
> > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > index 07c81ab3fd4d..7f66b56dd6e7 100644
> > > --- a/fs/userfaultfd.c
> > > +++ b/fs/userfaultfd.c
> > > @@ -394,7 +394,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> > >          * shmem_vm_ops->fault method is invoked even during
> > >          * coredumping without mmap_lock and it ends up here.
> > >          */
> > > -       if (current->flags & (PF_EXITING|PF_DUMPCORE))
> > > +       if (current->flags & (PF_EXITING|PF_DUMPCORE|PF_NO_UFFD_WAIT))
> > >                 goto out;
> >
> > I'll have a closer read on the nested part, but note that this path already
> > has the mmap lock then it invalidates the goal if we want to avoid taking
> > it from the first place, or maybe we don't care?
> >
> > If we want to avoid taking the mmap lock at all (hence the fast-gup
> > approach), I'd also suggest we don't make it related to uffd at all but
> > instead an interface to say "let's check whether the page tables are there
> > (walk pgtable by fast-gup only), if not return to userspace".
> >
> > Because IIUC fast-gup has nothing to do with uffd, so it can also be a more
> > generic interface.  It's just that if the userspace knows what it's doing
> > (postcopy-ing), it knows then the faults can potentially be resolved by
> > userfaultfd at this stage.
>
> Are there any cases where fast-gup can fail while uffd is enabled but
> it's not due to uffd? e.g. if a page is swapped out? I don't know what
> userspace would do in those situations to make forward progress.
>
>
>
> >
> > >
> > >         /*
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index ffb6eb55cd13..4c6c53ac6531 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -1729,7 +1729,7 @@ extern struct pid *cad_pid;
> > >  #define PF_MEMALLOC            0x00000800      /* Allocating memory */
> > >  #define PF_NPROC_EXCEEDED      0x00001000      /* set_user() noticed that RLIMIT_NPROC was exceeded */
> > >  #define PF_USED_MATH           0x00002000      /* If unset the fpu must be initialized before use */
> > > -#define PF__HOLE__00004000     0x00004000
> > > +#define PF_NO_UFFD_WAIT                0x00004000
> > >  #define PF_NOFREEZE            0x00008000      /* This thread should not be frozen */
> > >  #define PF__HOLE__00010000     0x00010000
> > >  #define PF_KSWAPD              0x00020000      /* I am kswapd */
> > >
> >
> > --
> > Peter Xu
> >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Improving userfaultfd scalability for live migration
  2022-12-05 17:31     ` David Matlack
  2022-12-05 18:03       ` David Matlack
@ 2022-12-05 18:20       ` Sean Christopherson
  2022-12-05 21:19         ` James Houghton
  1 sibling, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2022-12-05 18:20 UTC (permalink / raw)
  To: David Matlack
  Cc: Peter Xu, James Houghton, Andrea Arcangeli, Paolo Bonzini,
	Axel Rasmussen, Linux MM, kvm, chao.p.peng

On Mon, Dec 05, 2022, David Matlack wrote:
> On Mon, Dec 5, 2022 at 7:30 AM Peter Xu <peterx@redhat.com> wrote:
> > > > == Getting the faulting GPA to userspace ==
> > > > KVM_EXIT_MEMORY_FAULT was introduced recently [1] (not yet merged),
> > > > and it provides the main functionality we need. We can extend it
> > > > easily to support our use case here, and I think we have at least two
> > > > options:
> > > > - Introduce something like KVM_CAP_MEM_FAULT_REPORTING, which causes
> > > > KVM_RUN to exit with exit reason KVM_EXIT_MEMORY_FAULT when it would
> > > > otherwise just return -EFAULT (i.e., when kvm_handle_bad_page returns
> > > > -EFAULT).
> > > > - We're already introducing a new CAP, so just tie the above behavior
> > > > to whether or not one of the CAPs (below) is being used.
> > >
> > > We might even be able to get away with a third option: unconditionally return
> > > KVM_EXIT_MEMORY_FAULT instead of -EFAULT when the error occurs when accessing
> > > guest memory.
> > >
> > > > == Problems ==
> > > > The major problem here is that this only solves the scalability
> > > > problem for the KVM demand paging case. Other userfaultfd users, if
> > > > they have scalability problems, will need to find another approach.
> > >
> > > It may not fully solve KVM's problem either.  E.g. if the VM is running nested
> > > VMs, many (most?) of the user faults could be triggered by FNAME(walk_addr_generic)
> > > via __get_user() when walking L1's EPT tables.
> 
> We could always modify FNAME(walk_addr_generic) to return out to user
> space in the same way if that is indeed another bottleneck.

Yes, but given that there's a decent chance that solving this problem will add
new ABI, I want to make sure that we are confident that we won't end up with gaps
in the ABI.  I.e. I don't want to punt the nested case to the future.

> > > Disclaimer: I know _very_ little about UFFD.
> > >
> > > Rather than add yet another flag to gup(), what about flag to say the task doesn't
> > > want to wait for UFFD faults?  If desired/necessary, KVM could even toggle the flag
> > > in KVM_RUN so that faults that occur outside of KVM ultimately don't send an actual
> > > SIGBUGS.
> 
> There are some copy_to/from_user() calls in KVM that cannot easily
> exit out to KVM_RUN (for example, in the guts of the emulator IIRC).
> But we could use your approach just to wrap the specific call sites
> that can return from KVM_RUN.

Yeah, it would definitely need to be opt-in. 

> > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > index 07c81ab3fd4d..7f66b56dd6e7 100644
> > > --- a/fs/userfaultfd.c
> > > +++ b/fs/userfaultfd.c
> > > @@ -394,7 +394,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> > >          * shmem_vm_ops->fault method is invoked even during
> > >          * coredumping without mmap_lock and it ends up here.
> > >          */
> > > -       if (current->flags & (PF_EXITING|PF_DUMPCORE))
> > > +       if (current->flags & (PF_EXITING|PF_DUMPCORE|PF_NO_UFFD_WAIT))
> > >                 goto out;
> >
> > I'll have a closer read on the nested part, but note that this path already
> > has the mmap lock then it invalidates the goal if we want to avoid taking
> > it from the first place, or maybe we don't care?
> >
> > If we want to avoid taking the mmap lock at all (hence the fast-gup
> > approach), I'd also suggest we don't make it related to uffd at all but
> > instead an interface to say "let's check whether the page tables are there
> > (walk pgtable by fast-gup only), if not return to userspace".

Ooh, good point.  If KVM provided a way for userspace to toggle a "fast-only" flag,
then hva_to_pfn() could bail if hva_to_pfn_fast() failed, and I think KVM could
just do pagefault_disable/enable() around compatible KVM uaccesses?

> > Because IIUC fast-gup has nothing to do with uffd, so it can also be a more
> > generic interface.  It's just that if the userspace knows what it's doing
> > (postcopy-ing), it knows then the faults can potentially be resolved by
> > userfaultfd at this stage.
> 
> Are there any cases where fast-gup can fail while uffd is enabled but
> it's not due to uffd? e.g. if a page is swapped out?

Undoubtedly.  COW, NUMA balancing, KSM?, etc.  Nit, I don't think "due to uffd"
is the right terminology, I think the right phrasing is something like "but can't
be resolved by userspace", or maybe "but weren't induced by userspace".  UFFD
itself never causes faults.

> I don't know what userspace would do in those situations to make forward progress.

Access the page from userspace?  E.g. a "LOCK AND -1" would resolve read and write
faults without modifying guest memory.

That won't work for guests backed by "restricted mem", a.k.a. UPM guests, but
restricted mem really should be able to prevent those types of faults in the first
place.  SEV guests are the one case I can think of where that approach won't work,
since writes will corrupt the guest.  SEV guests can likely be special cased though.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Improving userfaultfd scalability for live migration
  2022-12-05 18:03       ` David Matlack
@ 2022-12-05 18:23         ` Sean Christopherson
  0 siblings, 0 replies; 16+ messages in thread
From: Sean Christopherson @ 2022-12-05 18:23 UTC (permalink / raw)
  To: David Matlack
  Cc: Peter Xu, James Houghton, Andrea Arcangeli, Paolo Bonzini,
	Axel Rasmussen, Linux MM, kvm, chao.p.peng

On Mon, Dec 05, 2022, David Matlack wrote:
> On Mon, Dec 5, 2022 at 9:31 AM David Matlack <dmatlack@google.com> wrote:
> >
> > On Mon, Dec 5, 2022 at 7:30 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Sat, Dec 03, 2022 at 01:03:38AM +0000, Sean Christopherson wrote:
> > > > On Thu, Dec 01, 2022, James Houghton wrote:
> > > > > == Problems ==
> > > > > The major problem here is that this only solves the scalability
> > > > > problem for the KVM demand paging case. Other userfaultfd users, if
> > > > > they have scalability problems, will need to find another approach.
> > > >
> > > > It may not fully solve KVM's problem either.  E.g. if the VM is running nested
> > > > VMs, many (most?) of the user faults could be triggered by FNAME(walk_addr_generic)
> > > > via __get_user() when walking L1's EPT tables.
> >
> > We could always modify FNAME(walk_addr_generic) to return out to user
> > space in the same way if that is indeed another bottleneck.
> 
> Scratch that, walk_addr_generic ultimately has a ton of callers
> throughout KVM, so it would be difficult to plumb the error handling
> out to userspace.

It would be easy enough to plumb a "fast-only" flag into walk_addr_generic().

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Improving userfaultfd scalability for live migration
  2022-12-05 18:20       ` Sean Christopherson
@ 2022-12-05 21:19         ` James Houghton
  2022-12-06  1:06           ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: James Houghton @ 2022-12-05 21:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Matlack, Peter Xu, Andrea Arcangeli, Paolo Bonzini,
	Axel Rasmussen, Linux MM, kvm, chao.p.peng

On Mon, Dec 5, 2022 at 1:20 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Dec 05, 2022, David Matlack wrote:
> > On Mon, Dec 5, 2022 at 7:30 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > == Getting the faulting GPA to userspace ==
> > > > > KVM_EXIT_MEMORY_FAULT was introduced recently [1] (not yet merged),
> > > > > and it provides the main functionality we need. We can extend it
> > > > > easily to support our use case here, and I think we have at least two
> > > > > options:
> > > > > - Introduce something like KVM_CAP_MEM_FAULT_REPORTING, which causes
> > > > > KVM_RUN to exit with exit reason KVM_EXIT_MEMORY_FAULT when it would
> > > > > otherwise just return -EFAULT (i.e., when kvm_handle_bad_page returns
> > > > > -EFAULT).
> > > > > - We're already introducing a new CAP, so just tie the above behavior
> > > > > to whether or not one of the CAPs (below) is being used.
> > > >
> > > > We might even be able to get away with a third option: unconditionally return
> > > > KVM_EXIT_MEMORY_FAULT instead of -EFAULT when the error occurs when accessing
> > > > guest memory.

Wouldn't we need a new CAP for this?

> > > >
> > > > > == Problems ==
> > > > > The major problem here is that this only solves the scalability
> > > > > problem for the KVM demand paging case. Other userfaultfd users, if
> > > > > they have scalability problems, will need to find another approach.
> > > >
> > > > It may not fully solve KVM's problem either.  E.g. if the VM is running nested
> > > > VMs, many (most?) of the user faults could be triggered by FNAME(walk_addr_generic)
> > > > via __get_user() when walking L1's EPT tables.
> >
> > We could always modify FNAME(walk_addr_generic) to return out to user
> > space in the same way if that is indeed another bottleneck.
>
> Yes, but given that there's a decent chance that solving this problem will add
> new ABI, I want to make sure that we are confident that we won't end up with gaps
> in the ABI.  I.e. I don't want to punt the nested case to the future.
>
> > > > Disclaimer: I know _very_ little about UFFD.
> > > >
> > > > Rather than add yet another flag to gup(), what about flag to say the task doesn't
> > > > want to wait for UFFD faults?  If desired/necessary, KVM could even toggle the flag
> > > > in KVM_RUN so that faults that occur outside of KVM ultimately don't send an actual
> > > > SIGBUGS.

I really like this idea! Having KVM_RUN toggle it in
handle_ept_violation/etc. seems like it would work the best. If we
toggled it in userspace before KVM_RUN, we would still open ourselves
up to KVM_RUN exiting without post-copy information (like, if GUP
failed during instruction emulation), IIUC.

> >
> > There are some copy_to/from_user() calls in KVM that cannot easily
> > exit out to KVM_RUN (for example, in the guts of the emulator IIRC).
> > But we could use your approach just to wrap the specific call sites
> > that can return from KVM_RUN.
>
> Yeah, it would definitely need to be opt-in.
>
> > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > > index 07c81ab3fd4d..7f66b56dd6e7 100644
> > > > --- a/fs/userfaultfd.c
> > > > +++ b/fs/userfaultfd.c
> > > > @@ -394,7 +394,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> > > >          * shmem_vm_ops->fault method is invoked even during
> > > >          * coredumping without mmap_lock and it ends up here.
> > > >          */
> > > > -       if (current->flags & (PF_EXITING|PF_DUMPCORE))
> > > > +       if (current->flags & (PF_EXITING|PF_DUMPCORE|PF_NO_UFFD_WAIT))
> > > >                 goto out;
> > >
> > > I'll have a closer read on the nested part, but note that this path already
> > > has the mmap lock then it invalidates the goal if we want to avoid taking
> > > it from the first place, or maybe we don't care?

Not taking the mmap lock would be helpful, but we still have to take
it in UFFDIO_CONTINUE, so it's ok if we have to still take it here.

The main goal is to avoid the locks in the userfaultfd wait_queues. If
we could completely avoid taking the mmap lock for reading in the
common post-copy case, we would avoid potential latency spikes if
someone (e.g. khugepaged) came around and grabbed the mmap lock for
writing.

It seems pretty difficult to make UFFDIO_CONTINUE *not* take the mmap
lock for reading, but I suppose it could be done with something like
the per-VMA lock work [2]. If we could avoid taking the lock in
UFFDIO_CONTINUE, then it seems plausible that we could avoid taking it
in slow GUP too. So really whether or not we are taking the mmap lock
(for reading) in the mem fault path isn't a huge deal by itself.

> > >
> > > If we want to avoid taking the mmap lock at all (hence the fast-gup
> > > approach), I'd also suggest we don't make it related to uffd at all but
> > > instead an interface to say "let's check whether the page tables are there
> > > (walk pgtable by fast-gup only), if not return to userspace".
>
> Ooh, good point.  If KVM provided a way for userspace to toggle a "fast-only" flag,
> then hva_to_pfn() could bail if hva_to_pfn_fast() failed, and I think KVM could
> just do pagefault_disable/enable() around compatible KVM uaccesses?
>
> > > Because IIUC fast-gup has nothing to do with uffd, so it can also be a more
> > > generic interface.  It's just that if the userspace knows what it's doing
> > > (postcopy-ing), it knows then the faults can potentially be resolved by
> > > userfaultfd at this stage.
> >
> > Are there any cases where fast-gup can fail while uffd is enabled but
> > it's not due to uffd? e.g. if a page is swapped out?
>
> Undoubtedly.  COW, NUMA balancing, KSM?, etc.  Nit, I don't think "due to uffd"
> is the right terminology, I think the right phrasing is something like "but can't
> be resolved by userspace", or maybe "but weren't induced by userspace".  UFFD
> itself never causes faults.
>
> > I don't know what userspace would do in those situations to make forward progress.
>
> Access the page from userspace?  E.g. a "LOCK AND -1" would resolve read and write
> faults without modifying guest memory.
>
> That won't work for guests backed by "restricted mem", a.k.a. UPM guests, but
> restricted mem really should be able to prevent those types of faults in the first
> place.  SEV guests are the one case I can think of where that approach won't work,
> since writes will corrupt the guest.  SEV guests can likely be special cased though.

As I mentioned in the original email, I think MADV_POPULATE_WRITE
would work here (Peter suggested this to me last week, thanks Peter!).
It would basically call slow GUP for us. So instead of hva_to_pfn_fast
(fails) -> hva_to_pfn_slow -> slow GUP, we do hva_to_pfn_fast (fails)
-> exit to userspace -> MADV_POPULATE_WRITE (-> slow GUP) -> KVM_RUN
-> hva_to_pfn_fast (succeeds).

[2]: https://lwn.net/Articles/906852/

- James

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Improving userfaultfd scalability for live migration
  2022-12-05 21:19         ` James Houghton
@ 2022-12-06  1:06           ` Sean Christopherson
  2022-12-06 17:35             ` James Houghton
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2022-12-06  1:06 UTC (permalink / raw)
  To: James Houghton
  Cc: David Matlack, Peter Xu, Andrea Arcangeli, Paolo Bonzini,
	Axel Rasmussen, Linux MM, kvm, chao.p.peng

On Mon, Dec 05, 2022, James Houghton wrote:
> On Mon, Dec 5, 2022 at 1:20 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Dec 05, 2022, David Matlack wrote:
> > > On Mon, Dec 5, 2022 at 7:30 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > > == Getting the faulting GPA to userspace ==
> > > > > > KVM_EXIT_MEMORY_FAULT was introduced recently [1] (not yet merged),
> > > > > > and it provides the main functionality we need. We can extend it
> > > > > > easily to support our use case here, and I think we have at least two
> > > > > > options:
> > > > > > - Introduce something like KVM_CAP_MEM_FAULT_REPORTING, which causes
> > > > > > KVM_RUN to exit with exit reason KVM_EXIT_MEMORY_FAULT when it would
> > > > > > otherwise just return -EFAULT (i.e., when kvm_handle_bad_page returns
> > > > > > -EFAULT).
> > > > > > - We're already introducing a new CAP, so just tie the above behavior
> > > > > > to whether or not one of the CAPs (below) is being used.
> > > > >
> > > > > We might even be able to get away with a third option: unconditionally return
> > > > > KVM_EXIT_MEMORY_FAULT instead of -EFAULT when the error occurs when accessing
> > > > > guest memory.
> 
> Wouldn't we need a new CAP for this?

Maybe?  I did say "might" :-)  -EFAULT is sooo useless for userspace in these
cases that there's a chance we can get away with an unconditional change.  Probably
not worth the risk of breaking userspace though as KVM will likely end up with a
helper to fill in the exit info.

> > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > > > index 07c81ab3fd4d..7f66b56dd6e7 100644
> > > > > --- a/fs/userfaultfd.c
> > > > > +++ b/fs/userfaultfd.c
> > > > > @@ -394,7 +394,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> > > > >          * shmem_vm_ops->fault method is invoked even during
> > > > >          * coredumping without mmap_lock and it ends up here.
> > > > >          */
> > > > > -       if (current->flags & (PF_EXITING|PF_DUMPCORE))
> > > > > +       if (current->flags & (PF_EXITING|PF_DUMPCORE|PF_NO_UFFD_WAIT))
> > > > >                 goto out;
> > > >
> > > > I'll have a closer read on the nested part, but note that this path already
> > > > has the mmap lock then it invalidates the goal if we want to avoid taking
> > > > it from the first place, or maybe we don't care?
> 
> Not taking the mmap lock would be helpful, but we still have to take
> it in UFFDIO_CONTINUE, so it's ok if we have to still take it here.

IIUC, Peter is suggesting that the kernel not even get to the point where UFFD
is involved.  The "fault" would get propagated to userspace by KVM, userspace
fixes the fault (gets the page from the source, does MADV_POPULATE_WRITE), and
resumes the vCPU.

> The main goal is to avoid the locks in the userfaultfd wait_queues. If
> we could completely avoid taking the mmap lock for reading in the
> common post-copy case, we would avoid potential latency spikes if
> someone (e.g. khugepaged) came around and grabbed the mmap lock for
> writing.
> 
> It seems pretty difficult to make UFFDIO_CONTINUE *not* take the mmap
> lock for reading, but I suppose it could be done with something like
> the per-VMA lock work [2]. If we could avoid taking the lock in
> UFFDIO_CONTINUE, then it seems plausible that we could avoid taking it
> in slow GUP too. So really whether or not we are taking the mmap lock
> (for reading) in the mem fault path isn't a huge deal by itself.

...

> > > I don't know what userspace would do in those situations to make forward progress.
> >
> > Access the page from userspace?  E.g. a "LOCK AND -1" would resolve read and write
> > faults without modifying guest memory.
> >
> > That won't work for guests backed by "restricted mem", a.k.a. UPM guests, but
> > restricted mem really should be able to prevent those types of faults in the first
> > place.  SEV guests are the one case I can think of where that approach won't work,
> > since writes will corrupt the guest.  SEV guests can likely be special cased though.
> 
> As I mentioned in the original email, I think MADV_POPULATE_WRITE
> would work here (Peter suggested this to me last week, thanks Peter!).
> It would basically call slow GUP for us. So instead of hva_to_pfn_fast
> (fails) -> hva_to_pfn_slow -> slow GUP, we do hva_to_pfn_fast (fails)
> -> exit to userspace -> MADV_POPULATE_WRITE (-> slow GUP) -> KVM_RUN
> -> hva_to_pfn_fast (succeeds).a

Ah, nice.  Missed that (obviously).

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Improving userfaultfd scalability for live migration
  2022-12-06  1:06           ` Sean Christopherson
@ 2022-12-06 17:35             ` James Houghton
  2022-12-06 18:00               ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: James Houghton @ 2022-12-06 17:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Matlack, Peter Xu, Andrea Arcangeli, Paolo Bonzini,
	Axel Rasmussen, Linux MM, kvm, chao.p.peng

On Mon, Dec 5, 2022 at 8:06 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Dec 05, 2022, James Houghton wrote:
> > On Mon, Dec 5, 2022 at 1:20 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Mon, Dec 05, 2022, David Matlack wrote:
> > > > On Mon, Dec 5, 2022 at 7:30 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > ...
> > > > > I'll have a closer read on the nested part, but note that this path already
> > > > > has the mmap lock then it invalidates the goal if we want to avoid taking
> > > > > it from the first place, or maybe we don't care?
> >
> > Not taking the mmap lock would be helpful, but we still have to take
> > it in UFFDIO_CONTINUE, so it's ok if we have to still take it here.
>
> IIUC, Peter is suggesting that the kernel not even get to the point where UFFD
> is involved.  The "fault" would get propagated to userspace by KVM, userspace
> fixes the fault (gets the page from the source, does MADV_POPULATE_WRITE), and
> resumes the vCPU.

If we haven't UFFDIO_CONTINUE'd some address range yet,
MADV_POPULATE_WRITE for that range will drop into handle_userfault and
go to sleep. Not good!

So, going with the no-slow-GUP approach, resolving faults is done like this:
- If we haven't UFFDIO_CONTINUE'd yet, do that now and restart
KVM_RUN. The PTEs will be none/blank right now. This is the common
case.
- If we have UFFDIO_CONTINUE'd already, if we were to do it again, we
would get EEXIST. (In this case, we probably have some type of swap
entry in the page tables.) We have to change the page tables to make
fast GUP succeed now *without* using UFFDIO_CONTINUE now.
MADV_POPULATE_WRITE seems to be the right tool for the job. This case
happens if the kernel has swapped the memory out, is migrating it, has
poisoned it, etc. If MADV_POPULATE_WRITE fails, we probably need to
crash or inject a memory error.

So with this approach, we never need to take the mmap_lock for reading
in hva_to_pfn, but we still need to take it in UFFDIO_CONTINUE.
Without removing the mmap_lock from *both*, we don't gain much.

So if we disregard this tiny mmap_lock benefit, the other approach
(the PF_NO_UFFD_WAIT approach) seems better. When KVM_RUN exits:
- If we haven't UFFDIO_CONTINUE'd yet, do that now and restart KVM_RUN.
- If we have, then something bad has happened. Slow GUP already ran
and failed, so we need to treat this in the same way we treat a
MADV_POPULATE_WRITE failure above: userspace might just want to crash
(or inject a memory error or something).

- James

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Improving userfaultfd scalability for live migration
  2022-12-06 17:35             ` James Houghton
@ 2022-12-06 18:00               ` Sean Christopherson
  2022-12-06 20:41                 ` James Houghton
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2022-12-06 18:00 UTC (permalink / raw)
  To: James Houghton
  Cc: David Matlack, Peter Xu, Andrea Arcangeli, Paolo Bonzini,
	Axel Rasmussen, Linux MM, kvm, chao.p.peng

On Tue, Dec 06, 2022, James Houghton wrote:
> On Mon, Dec 5, 2022 at 8:06 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Dec 05, 2022, James Houghton wrote:
> > > On Mon, Dec 5, 2022 at 1:20 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Mon, Dec 05, 2022, David Matlack wrote:
> > > > > On Mon, Dec 5, 2022 at 7:30 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > > ...
> > > > > > I'll have a closer read on the nested part, but note that this path already
> > > > > > has the mmap lock then it invalidates the goal if we want to avoid taking
> > > > > > it from the first place, or maybe we don't care?
> > >
> > > Not taking the mmap lock would be helpful, but we still have to take
> > > it in UFFDIO_CONTINUE, so it's ok if we have to still take it here.
> >
> > IIUC, Peter is suggesting that the kernel not even get to the point where UFFD
> > is involved.  The "fault" would get propagated to userspace by KVM, userspace
> > fixes the fault (gets the page from the source, does MADV_POPULATE_WRITE), and
> > resumes the vCPU.
> 
> If we haven't UFFDIO_CONTINUE'd some address range yet,
> MADV_POPULATE_WRITE for that range will drop into handle_userfault and
> go to sleep. Not good!

Ah, right, userspace would still need to register UFFD for the region to handle
non-KVM (or incompatible KVM) accesses and could loop back on itself.

> So, going with the no-slow-GUP approach, resolving faults is done like this:
> - If we haven't UFFDIO_CONTINUE'd yet, do that now and restart
> KVM_RUN. The PTEs will be none/blank right now. This is the common
> case.
> - If we have UFFDIO_CONTINUE'd already, if we were to do it again, we
> would get EEXIST. (In this case, we probably have some type of swap
> entry in the page tables.) We have to change the page tables to make
> fast GUP succeed now *without* using UFFDIO_CONTINUE now.
> MADV_POPULATE_WRITE seems to be the right tool for the job. This case
> happens if the kernel has swapped the memory out, is migrating it, has
> poisoned it, etc. If MADV_POPULATE_WRITE fails, we probably need to
> crash or inject a memory error.
> 
> So with this approach, we never need to take the mmap_lock for reading
> in hva_to_pfn, but we still need to take it in UFFDIO_CONTINUE.
> Without removing the mmap_lock from *both*, we don't gain much.
> 
> So if we disregard this tiny mmap_lock benefit, the other approach
> (the PF_NO_UFFD_WAIT approach) seems better.

Can you elaborate on what makes it better?  Or maybe generate a list of pros and
cons?  I can think of (dis)advantages for both approaches, but I haven't identified
anything that would be a blocking issue for either approach.  Doesn't mean there
isn't one or more blocking issues, just that I haven't thought of any :-)

> When KVM_RUN exits:
> - If we haven't UFFDIO_CONTINUE'd yet, do that now and restart KVM_RUN.
> - If we have, then something bad has happened. Slow GUP already ran
> and failed, so we need to treat this in the same way we treat a
> MADV_POPULATE_WRITE failure above: userspace might just want to crash
> (or inject a memory error or something).
> 
> - James

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Improving userfaultfd scalability for live migration
  2022-12-06 18:00               ` Sean Christopherson
@ 2022-12-06 20:41                 ` James Houghton
  2022-12-08  1:56                   ` David Matlack
  0 siblings, 1 reply; 16+ messages in thread
From: James Houghton @ 2022-12-06 20:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Matlack, Peter Xu, Andrea Arcangeli, Paolo Bonzini,
	Axel Rasmussen, Linux MM, kvm, chao.p.peng, Oliver Upton

On Tue, Dec 6, 2022 at 1:01 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Dec 06, 2022, James Houghton wrote:
> > On Mon, Dec 5, 2022 at 8:06 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Mon, Dec 05, 2022, James Houghton wrote:
> > > > On Mon, Dec 5, 2022 at 1:20 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > >
> > > > > On Mon, Dec 05, 2022, David Matlack wrote:
> > > > > > On Mon, Dec 5, 2022 at 7:30 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > > > ...
> > > > > > > I'll have a closer read on the nested part, but note that this path already
> > > > > > > has the mmap lock then it invalidates the goal if we want to avoid taking
> > > > > > > it from the first place, or maybe we don't care?
> > > >
> > > > Not taking the mmap lock would be helpful, but we still have to take
> > > > it in UFFDIO_CONTINUE, so it's ok if we have to still take it here.
> > >
> > > IIUC, Peter is suggesting that the kernel not even get to the point where UFFD
> > > is involved.  The "fault" would get propagated to userspace by KVM, userspace
> > > fixes the fault (gets the page from the source, does MADV_POPULATE_WRITE), and
> > > resumes the vCPU.
> >
> > If we haven't UFFDIO_CONTINUE'd some address range yet,
> > MADV_POPULATE_WRITE for that range will drop into handle_userfault and
> > go to sleep. Not good!
>
> Ah, right, userspace would still need to register UFFD for the region to handle
> non-KVM (or incompatible KVM) accesses and could loop back on itself.
>
> > So, going with the no-slow-GUP approach, resolving faults is done like this:
> > - If we haven't UFFDIO_CONTINUE'd yet, do that now and restart
> > KVM_RUN. The PTEs will be none/blank right now. This is the common
> > case.
> > - If we have UFFDIO_CONTINUE'd already, if we were to do it again, we
> > would get EEXIST. (In this case, we probably have some type of swap
> > entry in the page tables.) We have to change the page tables to make
> > fast GUP succeed now *without* using UFFDIO_CONTINUE now.
> > MADV_POPULATE_WRITE seems to be the right tool for the job. This case
> > happens if the kernel has swapped the memory out, is migrating it, has
> > poisoned it, etc. If MADV_POPULATE_WRITE fails, we probably need to
> > crash or inject a memory error.
> >
> > So with this approach, we never need to take the mmap_lock for reading
> > in hva_to_pfn, but we still need to take it in UFFDIO_CONTINUE.
> > Without removing the mmap_lock from *both*, we don't gain much.
> >
> > So if we disregard this tiny mmap_lock benefit, the other approach
> > (the PF_NO_UFFD_WAIT approach) seems better.
>
> Can you elaborate on what makes it better?  Or maybe generate a list of pros and
> cons?  I can think of (dis)advantages for both approaches, but I haven't identified
> anything that would be a blocking issue for either approach.  Doesn't mean there
> isn't one or more blocking issues, just that I haven't thought of any :-)

Let's see.... so using no-slow-GUP over no UFFD waiting:
- No need to take mmap_lock in mem fault path.
- Change the relevant __gfn_to_pfn_memslot callers
(kvm_faultin_pfn/user_mem_abort/others?) to set `atomic = true` if the
new CAP is used.
- No need for a new PF_NO_UFFD_WAIT (would be toggled somewhere
in/near kvm_faultin_pfn/user_mem_abort).
- Userspace has to indirectly figure out the state of the page tables
to know what action to take (which introduces some weirdness, like if
anyone MADV_DONTNEEDs some guest memory, we need to know).
- While userfaultfd is registered (so like during post-copy), any
hva_to_pfn() calls that were resolvable with slow GUP before (without
dropping into handle_userfault()) will now need to be resolved by
userspace manually with a call to MADV_POPULATE_WRITE. This extra trip
to userspace could slow things down.

Both of these seem pretty simple to implement in the kernel; the most
complicated part is just returning KVM_EXIT_MEMORY_FAULT in more
places / for other architectures (I care about x86 and arm64).

Right now both approaches seem fine to me. Not having to take the
mmap_lock in the fault path, while being such a minor difference now,
could be a huge benefit if we can later get around to making
UFFDIO_CONTINUE not need the mmap lock. Disregarding that, not
requiring userspace to guess the state of the page tables seems
helpful (less bug-prone, I guess).

>
> > When KVM_RUN exits:
> > - If we haven't UFFDIO_CONTINUE'd yet, do that now and restart KVM_RUN.
> > - If we have, then something bad has happened. Slow GUP already ran
> > and failed, so we need to treat this in the same way we treat a
> > MADV_POPULATE_WRITE failure above: userspace might just want to crash
> > (or inject a memory error or something).
> >
> > - James

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Improving userfaultfd scalability for live migration
  2022-12-06 20:41                 ` James Houghton
@ 2022-12-08  1:56                   ` David Matlack
  2022-12-08 17:50                     ` James Houghton
  0 siblings, 1 reply; 16+ messages in thread
From: David Matlack @ 2022-12-08  1:56 UTC (permalink / raw)
  To: James Houghton
  Cc: Sean Christopherson, Peter Xu, Andrea Arcangeli, Paolo Bonzini,
	Axel Rasmussen, Linux MM, kvm, chao.p.peng, Oliver Upton

On Tue, Dec 6, 2022 at 12:41 PM James Houghton <jthoughton@google.com> wrote:
> On Tue, Dec 6, 2022 at 1:01 PM Sean Christopherson <seanjc@google.com> wrote:
> > Can you elaborate on what makes it better?  Or maybe generate a list of pros and
> > cons?  I can think of (dis)advantages for both approaches, but I haven't identified
> > anything that would be a blocking issue for either approach.  Doesn't mean there
> > isn't one or more blocking issues, just that I haven't thought of any :-)
>
> Let's see.... so using no-slow-GUP over no UFFD waiting:
> - No need to take mmap_lock in mem fault path.
> - Change the relevant __gfn_to_pfn_memslot callers
> (kvm_faultin_pfn/user_mem_abort/others?) to set `atomic = true` if the
> new CAP is used.
> - No need for a new PF_NO_UFFD_WAIT (would be toggled somewhere
> in/near kvm_faultin_pfn/user_mem_abort).
> - Userspace has to indirectly figure out the state of the page tables
> to know what action to take (which introduces some weirdness, like if
> anyone MADV_DONTNEEDs some guest memory, we need to know).

I'm no expert but I believe a guest access to MADV_DONTNEED'd GFN
would just cause a new page to be allocated by the kernel. So I think
userspace can still blindly do MADV_POPULATE_WRITE in this case. Were
there any other scenarios you had in mind?

> - While userfaultfd is registered (so like during post-copy), any
> hva_to_pfn() calls that were resolvable with slow GUP before (without
> dropping into handle_userfault()) will now need to be resolved by
> userspace manually with a call to MADV_POPULATE_WRITE. This extra trip
> to userspace could slow things down.

Is there any way to enable fast-gup to identify when a PTE is not
present due to userfaultfd specifically without taking the mmap_lock
(e.g. using an unused bit in the PTE)? Then we could avoid extra trips
to userspace for MADV_POPULATE_WRITE.

>
> Both of these seem pretty simple to implement in the kernel; the most
> complicated part is just returning KVM_EXIT_MEMORY_FAULT in more
> places / for other architectures (I care about x86 and arm64).
>
> Right now both approaches seem fine to me. Not having to take the
> mmap_lock in the fault path, while being such a minor difference now,
> could be a huge benefit if we can later get around to making
> UFFDIO_CONTINUE not need the mmap lock. Disregarding that, not
> requiring userspace to guess the state of the page tables seems
> helpful (less bug-prone, I guess).
>
> >
> > > When KVM_RUN exits:
> > > - If we haven't UFFDIO_CONTINUE'd yet, do that now and restart KVM_RUN.
> > > - If we have, then something bad has happened. Slow GUP already ran
> > > and failed, so we need to treat this in the same way we treat a
> > > MADV_POPULATE_WRITE failure above: userspace might just want to crash
> > > (or inject a memory error or something).
> > >
> > > - James

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Improving userfaultfd scalability for live migration
  2022-12-08  1:56                   ` David Matlack
@ 2022-12-08 17:50                     ` James Houghton
  2023-01-04  0:57                       ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: James Houghton @ 2022-12-08 17:50 UTC (permalink / raw)
  To: David Matlack
  Cc: Sean Christopherson, Peter Xu, Andrea Arcangeli, Paolo Bonzini,
	Axel Rasmussen, Linux MM, kvm, chao.p.peng, Oliver Upton

On Wed, Dec 7, 2022 at 8:57 PM David Matlack <dmatlack@google.com> wrote:
>
> On Tue, Dec 6, 2022 at 12:41 PM James Houghton <jthoughton@google.com> wrote:
> > On Tue, Dec 6, 2022 at 1:01 PM Sean Christopherson <seanjc@google.com> wrote:
> > > Can you elaborate on what makes it better?  Or maybe generate a list of pros and
> > > cons?  I can think of (dis)advantages for both approaches, but I haven't identified
> > > anything that would be a blocking issue for either approach.  Doesn't mean there
> > > isn't one or more blocking issues, just that I haven't thought of any :-)
> >
> > Let's see.... so using no-slow-GUP over no UFFD waiting:
> > - No need to take mmap_lock in mem fault path.
> > - Change the relevant __gfn_to_pfn_memslot callers
> > (kvm_faultin_pfn/user_mem_abort/others?) to set `atomic = true` if the
> > new CAP is used.
> > - No need for a new PF_NO_UFFD_WAIT (would be toggled somewhere
> > in/near kvm_faultin_pfn/user_mem_abort).
> > - Userspace has to indirectly figure out the state of the page tables
> > to know what action to take (which introduces some weirdness, like if
> > anyone MADV_DONTNEEDs some guest memory, we need to know).
>
> I'm no expert but I believe a guest access to MADV_DONTNEED'd GFN
> would just cause a new page to be allocated by the kernel. So I think
> userspace can still blindly do MADV_POPULATE_WRITE in this case. Were
> there any other scenarios you had in mind?

MADV_POPULATE_WRITE would drop into handle_userfault() if we're using
uffd minor faults after we do MADV_DONTNEED. For uffd minor faults, if
the PTE is none (i.e., completely blank, no swap information or
anything), then we drop into handle_userfault().

I partially take back what I said. We have to be careful about someone
messing with our page tables no matter which API we choose.

Here is a better description of the weirdness that we have to put up
with given each choice, with this assumption that, normally, we want
to UFFDIO_CONTINUE a page exactly once:

- For the no-slow-GUP choice, if someone MADV_DONTNEEDed memory and we
didn't know about it, we would get stuck in MADV_POPULATE_WRITE. By
using UFFD_FEATURE_THREAD_ID, we can tell if we got a userfault for a
thread that is in the middle of a MADV_POPULATE_WRITE, and we can try
to unblock the thread by doing an extra UFFDIO_CONTINUE.

- For the PF_NO_UFFD_WAIT choice, if someone MADV_DONTNEEDed memory,
we would just keep trying to start the vCPU without doing anything (we
assume some other thread has UFFDIO_CONTINUEd for us). This is
basically the same as if we were stuck in MADV_POPULATE_WRITE, and we
can try to unblock the thread in a fashion similar to how we would in
the other case.

So really these approaches have similar requirements for what
userspace needs to track. So I think I prefer the no-slow-GUP approach
then.

>
> > - While userfaultfd is registered (so like during post-copy), any
> > hva_to_pfn() calls that were resolvable with slow GUP before (without
> > dropping into handle_userfault()) will now need to be resolved by
> > userspace manually with a call to MADV_POPULATE_WRITE. This extra trip
> > to userspace could slow things down.
>
> Is there any way to enable fast-gup to identify when a PTE is not
> present due to userfaultfd specifically without taking the mmap_lock
> (e.g. using an unused bit in the PTE)? Then we could avoid extra trips
> to userspace for MADV_POPULATE_WRITE.

To know if you would have dropped into handle_userfault(), you have to
at least check the VMA flags, so at the moment, no. :(

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Improving userfaultfd scalability for live migration
  2022-12-08 17:50                     ` James Houghton
@ 2023-01-04  0:57                       ` Sean Christopherson
  2023-01-04  1:05                         ` James Houghton
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2023-01-04  0:57 UTC (permalink / raw)
  To: James Houghton
  Cc: David Matlack, Peter Xu, Andrea Arcangeli, Paolo Bonzini,
	Axel Rasmussen, Linux MM, kvm, chao.p.peng, Oliver Upton

On Thu, Dec 08, 2022, James Houghton wrote:
> - For the no-slow-GUP choice, if someone MADV_DONTNEEDed memory and we
> didn't know about it, we would get stuck in MADV_POPULATE_WRITE. By
> using UFFD_FEATURE_THREAD_ID, we can tell if we got a userfault for a
> thread that is in the middle of a MADV_POPULATE_WRITE, and we can try
> to unblock the thread by doing an extra UFFDIO_CONTINUE.
> 
> - For the PF_NO_UFFD_WAIT choice, if someone MADV_DONTNEEDed memory,
> we would just keep trying to start the vCPU without doing anything (we
> assume some other thread has UFFDIO_CONTINUEd for us). This is
> basically the same as if we were stuck in MADV_POPULATE_WRITE, and we
> can try to unblock the thread in a fashion similar to how we would in
> the other case.
> 
> So really these approaches have similar requirements for what
> userspace needs to track. So I think I prefer the no-slow-GUP approach
> then.

Are you planning on sending a patch (RFC?) for the no-slow-GUP approach?  It sounds
like there's a rough consensus that that's a viable, minimally invasive solution.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Improving userfaultfd scalability for live migration
  2023-01-04  0:57                       ` Sean Christopherson
@ 2023-01-04  1:05                         ` James Houghton
  0 siblings, 0 replies; 16+ messages in thread
From: James Houghton @ 2023-01-04  1:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Matlack, Peter Xu, Andrea Arcangeli, Paolo Bonzini,
	Axel Rasmussen, Linux MM, kvm, chao.p.peng, Oliver Upton

On Wed, Jan 4, 2023 at 12:57 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Dec 08, 2022, James Houghton wrote:
> > - For the no-slow-GUP choice, if someone MADV_DONTNEEDed memory and we
> > didn't know about it, we would get stuck in MADV_POPULATE_WRITE. By
> > using UFFD_FEATURE_THREAD_ID, we can tell if we got a userfault for a
> > thread that is in the middle of a MADV_POPULATE_WRITE, and we can try
> > to unblock the thread by doing an extra UFFDIO_CONTINUE.
> >
> > - For the PF_NO_UFFD_WAIT choice, if someone MADV_DONTNEEDed memory,
> > we would just keep trying to start the vCPU without doing anything (we
> > assume some other thread has UFFDIO_CONTINUEd for us). This is
> > basically the same as if we were stuck in MADV_POPULATE_WRITE, and we
> > can try to unblock the thread in a fashion similar to how we would in
> > the other case.
> >
> > So really these approaches have similar requirements for what
> > userspace needs to track. So I think I prefer the no-slow-GUP approach
> > then.
>
> Are you planning on sending a patch (RFC?) for the no-slow-GUP approach?  It sounds
> like there's a rough consensus that that's a viable, minimally invasive solution.

Yes, soon. amoorthy@google.com is working on it.

Also a small correction regarding userspace needing to track
MADV_DONTNEEDs, userfaultfd already handles that with
UFFD_EVENT_REMOVE, so it's a non-issue. Even more reason to take the
no-slow-GUP approach.

- James

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-01-04  1:06 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-01 19:37 [RFC] Improving userfaultfd scalability for live migration James Houghton
2022-12-03  1:03 ` Sean Christopherson
2022-12-05 15:27   ` Peter Xu
2022-12-05 17:31     ` David Matlack
2022-12-05 18:03       ` David Matlack
2022-12-05 18:23         ` Sean Christopherson
2022-12-05 18:20       ` Sean Christopherson
2022-12-05 21:19         ` James Houghton
2022-12-06  1:06           ` Sean Christopherson
2022-12-06 17:35             ` James Houghton
2022-12-06 18:00               ` Sean Christopherson
2022-12-06 20:41                 ` James Houghton
2022-12-08  1:56                   ` David Matlack
2022-12-08 17:50                     ` James Houghton
2023-01-04  0:57                       ` Sean Christopherson
2023-01-04  1:05                         ` James Houghton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).