From: James Houghton <jthoughton@google.com>
To: Ben Gardon <bgardon@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Shuah Khan <shuah@kernel.org>,
	linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>,
	kvm <kvm@vger.kernel.org>,
	linux-kselftest@vger.kernel.org,
	Axel Rasmussen <axelrasmussen@google.com>,
	Jue Wang <juew@google.com>, Peter Xu <peterx@redhat.com>,
	Andrea Arcangeli <aarcange@redhat.com>
Subject: Re: [PATCH 1/2] KVM: Deliver VM fault signals to userspace
Date: Fri, 21 May 2021 15:25:31 -0400
Message-ID: <CADrL8HU4Yj_uAWKCLanUmQoDS6rsoo1GJiJhiCY=e0Lwm_zyiw@mail.gmail.com>
In-Reply-To: <CANgfPd-=+-0+UzXD+zpqX50SGEi_wCLVJfsv63Rq9GYR+4-dCw@mail.gmail.com>

Thanks Ben.

First I'd like to clarify that the 3-4x speedup was measured without
KVM (calling UFFDIO_COPY directly from the SIGBUS handler, which is
not async-signal-safe); with KVM it drops to about a 30% improvement
(single vCPU). This isn't
that important though, as the real purpose of the change is to allow
userfaultfd page-ins to scale better. To test scaling, I've updated
the KVM demand paging self-test to fix the non-partitioned userfaultfd
case (I will send this out when I send a new patchset that addresses
your concerns). It turns out that we contend on the threads'
sighand->siglock (because pthread_create uses CLONE_SIGHAND). Needless
to say, I need to do more testing.
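
For concreteness, the handler in the signal-safe variant just records
where the fault happened and returns; the actual page-in happens after
KVM_RUN exits. A minimal sketch (illustrative, not the exact test
code):

#include <signal.h>

static __thread unsigned long sigbus_addr;	/* per-vCPU thread */

static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	/* Async-signal-safe: just remember where we faulted. */
	sigbus_addr = (unsigned long)info->si_addr;
}

	/* registration, done once per vCPU thread: */
	struct sigaction sa = {
		.sa_sigaction = sigbus_handler,
		.sa_flags = SA_SIGINFO,
	};
	sigaction(SIGBUS, &sa, NULL);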

Removing the siglock contention should lead to better page-in
performance at scale, but this patch won't be useful unless I can
actually demonstrate this. There are a couple of benefits I forgot to
mention in the commit message:
1. NUMA locality for page-in threads is much easier to maintain when
using UFFD_FEATURE_SIGBUS.
2. The number of threads that perform page-ins automatically scales
with the number of vCPUs.

Regarding situations where the kernel is unable to return to
userspace: thanks for pointing this out. If we can solve the siglock
contention problem, this style of page-in could end up being quite
desirable, but only where we can actually exit to userspace. So we
could implement a SIGBUS-like userfaultfd feature where
handle_userfault returns VM_FAULT_SIGBUS only if it knows the caller
is prepared to handle it (in this patchset, that would mean the
caller passed a non-NULL `fault_error` to get_user_pages), and
otherwise falls back to the normal path and puts the thread to sleep.
I'll work on this.
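
Roughly, I'm imagining something like this in handle_userfault() (a
sketch only: fault_can_return_error() is a hypothetical predicate,
and plumbing that information down into the fault path is the open
question):

	if (ctx->features & UFFD_FEATURE_SIGBUS) {
		/*
		 * Hypothetical: only short-circuit with VM_FAULT_SIGBUS
		 * when the faulting context can propagate the error
		 * (e.g. KVM's gup path passed a non-NULL fault_error).
		 */
		if (fault_can_return_error(vmf))
			return VM_FAULT_SIGBUS;
		/* otherwise fall through and sleep like a normal fault */
	}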

- James


On Wed, May 19, 2021 at 6:41 PM Ben Gardon <bgardon@google.com> wrote:
>
> On Wed, May 19, 2021 at 2:04 PM James Houghton <jthoughton@google.com> wrote:
> >
> > This patch has been written to support page-ins using userfaultfd's
> > SIGBUS feature.  When a userfaultfd is created with UFFD_FEATURE_SIGBUS,
> > `handle_userfault` will return VM_FAULT_SIGBUS instead of putting the
> > calling thread to sleep. Normal (non-guest) threads that access memory
> > that has been registered with a UFFD_FEATURE_SIGBUS userfaultfd receive
> > a SIGBUS.
> >
> > When a vCPU gets an EPT page fault in a userfaultfd-registered region,
> > KVM calls into `handle_userfault` to resolve the page fault. With
> > UFFD_FEATURE_SIGBUS, VM_FAULT_SIGBUS is returned, but a SIGBUS is never
> > delivered to the userspace thread. This patch propagates the
> > VM_FAULT_SIGBUS error up to KVM, where we then send the signal.
> >
> > Upon receiving a VM_FAULT_SIGBUS, the KVM_RUN ioctl will exit to
> > userspace. This functionality already exists. This allows a hypervisor
> > to do page-ins with UFFD_FEATURE_SIGBUS:
> >
> > 1. Setup a SIGBUS handler to save the address of the SIGBUS (to a
> >    thread-local variable).
> > 2. Enter the guest.
> > 3. Immediately after KVM_RUN returns, check if the address has been set.
> > 4. If an address has been set, we exited due to a page fault that we can
> >    now handle.
> > 5. Userspace can do anything it wants to make the memory available,
> >    using MODE_DONTWAKE for the UFFDIO memory installation ioctls.
> > 6. Re-enter the guest. If the memory still isn't ready, this process
> >    will repeat.
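
[For concreteness, steps 2-6 end up looking roughly like this in the
vCPU loop. Illustrative only: page_size, src_buf, and the uffd/vcpu
fds are assumed to be set up elsewhere, and sigbus_addr is the
thread-local written by the SIGBUS handler in step 1.]

	for (;;) {
		sigbus_addr = 0;
		ioctl(vcpu_fd, KVM_RUN, NULL);		/* step 2 */
		if (sigbus_addr) {			/* steps 3-4 */
			struct uffdio_copy copy = {
				.dst = sigbus_addr & ~(page_size - 1),
				.src = (unsigned long)src_buf,
				.len = page_size,
				.mode = UFFDIO_COPY_MODE_DONTWAKE, /* step 5 */
			};
			ioctl(uffd, UFFDIO_COPY, &copy);
			continue;			/* step 6 */
		}
		/* ... handle the real KVM_RUN exit reason ... */
		break;
	}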
> >
> > This style of demand paging is significantly faster than the standard
> > poll/read/wake mechanism userfaultfd uses and completely bypasses the
> > userfaultfd waitq. For a single vCPU, page-in throughput increases by
> > about 3-4x.
>
> Wow that's an awesome improvement! My impression is that the
> improvement is even more significant with more vCPUs because we avoid
> wait queue contention. Is that right?
>
> How does this mode deal with situations where returning back to
> userspace isn't feasible? For example, if we're buried deep in some
> nested instruction emulation path, there may be no way to return back
> to userspace without creating unintended side effects. Do we have the
> facility to do a regular UFFD request in a case like that?
>
> As an aside, if you can think of a way to split this patch it would be
> easier to review. I realize most of the changes are propagating the
> fault_error parameter around though, so splitting the patch might not
> be easy.
>
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > Suggested-by: Jue Wang <juew@google.com>
> > ---
> >  include/linux/hugetlb.h |  2 +-
> >  include/linux/mm.h      |  3 ++-
> >  mm/gup.c                | 57 +++++++++++++++++++++++++++--------------
> >  mm/hugetlb.c            |  5 +++-
> >  virt/kvm/kvm_main.c     | 30 +++++++++++++++++++++-
> >  5 files changed, 74 insertions(+), 23 deletions(-)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index b92f25ccef58..a777fb254df0 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -119,7 +119,7 @@ int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_ar
> >  long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
> >                          struct page **, struct vm_area_struct **,
> >                          unsigned long *, unsigned long *, long, unsigned int,
> > -                        int *);
> > +                        int *, int *);
> >  void unmap_hugepage_range(struct vm_area_struct *,
> >                           unsigned long, unsigned long, struct page *);
> >  void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 322ec61d0da7..1dcd1ac81992 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1824,7 +1824,8 @@ long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
> >  long pin_user_pages_locked(unsigned long start, unsigned long nr_pages,
> >                     unsigned int gup_flags, struct page **pages, int *locked);
> >  long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> > -                   struct page **pages, unsigned int gup_flags);
> > +                   struct page **pages, unsigned int gup_flags,
> > +                   int *fault_error);
> >  long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> >                     struct page **pages, unsigned int gup_flags);
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 0697134b6a12..ab55a67aef78 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -881,7 +881,8 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
> >   * is, *@locked will be set to 0 and -EBUSY returned.
> >   */
> >  static int faultin_page(struct vm_area_struct *vma,
> > -               unsigned long address, unsigned int *flags, int *locked)
> > +               unsigned long address, unsigned int *flags, int *locked,
> > +               int *fault_error)
> >  {
> >         unsigned int fault_flags = 0;
> >         vm_fault_t ret;
> > @@ -906,6 +907,8 @@ static int faultin_page(struct vm_area_struct *vma,
> >         }
> >
> >         ret = handle_mm_fault(vma, address, fault_flags, NULL);
> > +       if (fault_error)
> > +               *fault_error = ret;
> >         if (ret & VM_FAULT_ERROR) {
> >                 int err = vm_fault_to_errno(ret, *flags);
> >
> > @@ -996,6 +999,8 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
> >   * @vmas:      array of pointers to vmas corresponding to each page.
> >   *             Or NULL if the caller does not require them.
> >   * @locked:     whether we're still with the mmap_lock held
> > + * @fault_error: VM fault error from handle_mm_fault. NULL if the caller
> > + *             does not require this error.
> >   *
> >   * Returns either number of pages pinned (which may be less than the
> >   * number requested), or an error. Details about the return value:
> > @@ -1040,6 +1045,13 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
> >   * when it's been released.  Otherwise, it must be held for either
> >   * reading or writing and will not be released.
> >   *
> > + * If @fault_error != NULL, __get_user_pages will return the VM fault error
> > + * from handle_mm_fault() in this argument in the event of a VM fault error.
> > + * On success (ret == nr_pages) fault_error is zero.
> > + * On failure (ret != nr_pages) fault_error may still be 0 if the error did
> > + * not originate from handle_mm_fault().
> > + *
> > + *
> >   * In most cases, get_user_pages or get_user_pages_fast should be used
> >   * instead of __get_user_pages. __get_user_pages should be used only if
> >   * you need some special @gup_flags.
> > @@ -1047,7 +1059,8 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
> >  static long __get_user_pages(struct mm_struct *mm,
> >                 unsigned long start, unsigned long nr_pages,
> >                 unsigned int gup_flags, struct page **pages,
> > -               struct vm_area_struct **vmas, int *locked)
> > +               struct vm_area_struct **vmas, int *locked,
> > +               int *fault_error)
> >  {
> >         long ret = 0, i = 0;
> >         struct vm_area_struct *vma = NULL;
> > @@ -1097,7 +1110,7 @@ static long __get_user_pages(struct mm_struct *mm,
> >                         if (is_vm_hugetlb_page(vma)) {
> >                                 i = follow_hugetlb_page(mm, vma, pages, vmas,
> >                                                 &start, &nr_pages, i,
> > -                                               gup_flags, locked);
> > +                                               gup_flags, locked, fault_error);
> >                                 if (locked && *locked == 0) {
> >                                         /*
> >                                          * We've got a VM_FAULT_RETRY
> > @@ -1124,7 +1137,8 @@ static long __get_user_pages(struct mm_struct *mm,
> >
> >                 page = follow_page_mask(vma, start, foll_flags, &ctx);
> >                 if (!page) {
> > -                       ret = faultin_page(vma, start, &foll_flags, locked);
> > +                       ret = faultin_page(vma, start, &foll_flags, locked,
> > +                                          fault_error);
> >                         switch (ret) {
> >                         case 0:
> >                                 goto retry;
> > @@ -1280,7 +1294,8 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
> >                                                 struct page **pages,
> >                                                 struct vm_area_struct **vmas,
> >                                                 int *locked,
> > -                                               unsigned int flags)
> > +                                               unsigned int flags,
> > +                                               int *fault_error)
> >  {
> >         long ret, pages_done;
> >         bool lock_dropped;
> > @@ -1311,7 +1326,7 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
> >         lock_dropped = false;
> >         for (;;) {
> >                 ret = __get_user_pages(mm, start, nr_pages, flags, pages,
> > -                                      vmas, locked);
> > +                                      vmas, locked, fault_error);
> >                 if (!locked)
> >                         /* VM_FAULT_RETRY couldn't trigger, bypass */
> >                         return ret;
> > @@ -1371,7 +1386,7 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
> >
> >                 *locked = 1;
> >                 ret = __get_user_pages(mm, start, 1, flags | FOLL_TRIED,
> > -                                      pages, NULL, locked);
> > +                                      pages, NULL, locked, fault_error);
> >                 if (!*locked) {
> >                         /* Continue to retry until we succeeded */
> >                         BUG_ON(ret != 0);
> > @@ -1458,7 +1473,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
> >          * not result in a stack expansion that recurses back here.
> >          */
> >         return __get_user_pages(mm, start, nr_pages, gup_flags,
> > -                               NULL, NULL, locked);
> > +                               NULL, NULL, locked, NULL);
> >  }
> >
> >  /*
> > @@ -1524,7 +1539,7 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
> >  static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
> >                 unsigned long nr_pages, struct page **pages,
> >                 struct vm_area_struct **vmas, int *locked,
> > -               unsigned int foll_flags)
> > +               unsigned int foll_flags, int *fault_error)
> >  {
> >         struct vm_area_struct *vma;
> >         unsigned long vm_flags;
> > @@ -1590,7 +1605,8 @@ struct page *get_dump_page(unsigned long addr)
> >         if (mmap_read_lock_killable(mm))
> >                 return NULL;
> >         ret = __get_user_pages_locked(mm, addr, 1, &page, NULL, &locked,
> > -                                     FOLL_FORCE | FOLL_DUMP | FOLL_GET);
> > +                                     FOLL_FORCE | FOLL_DUMP | FOLL_GET,
> > +                                     NULL);
> >         if (locked)
> >                 mmap_read_unlock(mm);
> >
> > @@ -1704,11 +1720,11 @@ static long __gup_longterm_locked(struct mm_struct *mm,
> >
> >         if (!(gup_flags & FOLL_LONGTERM))
> >                 return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
> > -                                              NULL, gup_flags);
> > +                                              NULL, gup_flags, NULL);
> >         flags = memalloc_pin_save();
> >         do {
> >                 rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
> > -                                            NULL, gup_flags);
> > +                                            NULL, gup_flags, NULL);
> >                 if (rc <= 0)
> >                         break;
> >                 rc = check_and_migrate_movable_pages(rc, pages, gup_flags);
> > @@ -1764,7 +1780,8 @@ static long __get_user_pages_remote(struct mm_struct *mm,
> >
> >         return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
> >                                        locked,
> > -                                      gup_flags | FOLL_TOUCH | FOLL_REMOTE);
> > +                                      gup_flags | FOLL_TOUCH | FOLL_REMOTE,
> > +                                      NULL);
> >  }
> >
> >  /**
> > @@ -1941,7 +1958,7 @@ long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
> >
> >         return __get_user_pages_locked(current->mm, start, nr_pages,
> >                                        pages, NULL, locked,
> > -                                      gup_flags | FOLL_TOUCH);
> > +                                      gup_flags | FOLL_TOUCH, NULL);
> >  }
> >  EXPORT_SYMBOL(get_user_pages_locked);
> >
> > @@ -1961,7 +1978,8 @@ EXPORT_SYMBOL(get_user_pages_locked);
> >   * (e.g. FOLL_FORCE) are not required.
> >   */
> >  long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> > -                            struct page **pages, unsigned int gup_flags)
> > +                            struct page **pages, unsigned int gup_flags,
> > +                            int *fault_error)
> >  {
> >         struct mm_struct *mm = current->mm;
> >         int locked = 1;
> > @@ -1978,7 +1996,8 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> >
> >         mmap_read_lock(mm);
> >         ret = __get_user_pages_locked(mm, start, nr_pages, pages, NULL,
> > -                                     &locked, gup_flags | FOLL_TOUCH);
> > +                                     &locked, gup_flags | FOLL_TOUCH,
> > +                                     fault_error);
> >         if (locked)
> >                 mmap_read_unlock(mm);
> >         return ret;
> > @@ -2550,7 +2569,7 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
> >                 mmap_read_unlock(current->mm);
> >         } else {
> >                 ret = get_user_pages_unlocked(start, nr_pages,
> > -                                             pages, gup_flags);
> > +                                             pages, gup_flags, NULL);
> >         }
> >
> >         return ret;
> > @@ -2880,7 +2899,7 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> >                 return -EINVAL;
> >
> >         gup_flags |= FOLL_PIN;
> > -       return get_user_pages_unlocked(start, nr_pages, pages, gup_flags);
> > +       return get_user_pages_unlocked(start, nr_pages, pages, gup_flags, NULL);
> >  }
> >  EXPORT_SYMBOL(pin_user_pages_unlocked);
> >
> > @@ -2909,6 +2928,6 @@ long pin_user_pages_locked(unsigned long start, unsigned long nr_pages,
> >         gup_flags |= FOLL_PIN;
> >         return __get_user_pages_locked(current->mm, start, nr_pages,
> >                                        pages, NULL, locked,
> > -                                      gup_flags | FOLL_TOUCH);
> > +                                      gup_flags | FOLL_TOUCH, NULL);
> >  }
> >  EXPORT_SYMBOL(pin_user_pages_locked);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 3db405dea3dc..889ac33d57d5 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -5017,7 +5017,8 @@ static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
> >  long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >                          struct page **pages, struct vm_area_struct **vmas,
> >                          unsigned long *position, unsigned long *nr_pages,
> > -                        long i, unsigned int flags, int *locked)
> > +                        long i, unsigned int flags, int *locked,
> > +                        int *fault_error)
> >  {
> >         unsigned long pfn_offset;
> >         unsigned long vaddr = *position;
> > @@ -5103,6 +5104,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >                         }
> >                         ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
> >                         if (ret & VM_FAULT_ERROR) {
> > +                               if (fault_error)
> > +                                       *fault_error = ret;
> >                                 err = vm_fault_to_errno(ret, flags);
> >                                 remainder = 0;
> >                                 break;
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 2799c6660cce..0a20d926ae32 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2004,6 +2004,30 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> >         return false;
> >  }
> >
> > +static void kvm_send_vm_fault_signal(int fault_error, int errno,
> > +                                    unsigned long address,
> > +                                    struct task_struct *tsk)
> > +{
> > +       kernel_siginfo_t info;
> > +
> > +       clear_siginfo(&info);
> > +
> > +       if (fault_error == VM_FAULT_SIGBUS)
> > +               info.si_signo = SIGBUS;
> > +       else if (fault_error == VM_FAULT_SIGSEGV)
> > +               info.si_signo = SIGSEGV;
> > +       else
> > +               /* Other fault errors should not result in a signal. */
> > +               return;
> > +
> > +       info.si_errno = errno;
> > +       info.si_code = BUS_ADRERR;
> > +       info.si_addr = (void __user *)address;
> > +       info.si_addr_lsb = PAGE_SHIFT;
> > +
> > +       send_sig_info(info.si_signo, &info, tsk);
> > +}
> > +
> >  /*
> >   * The slow path to get the pfn of the specified host virtual address,
> >   * 1 indicates success, -errno is returned if error is detected.
> > @@ -2014,6 +2038,7 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> >         unsigned int flags = FOLL_HWPOISON;
> >         struct page *page;
> >         int npages = 0;
> > +       int fault_error = 0;
> >
> >         might_sleep();
> >
> > @@ -2025,7 +2050,10 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> >         if (async)
> >                 flags |= FOLL_NOWAIT;
> >
> > -       npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > +       npages = get_user_pages_unlocked(addr, 1, &page, flags, &fault_error);
> > +       if (fault_error & VM_FAULT_ERROR)
> > +               kvm_send_vm_fault_signal(fault_error, npages, addr, current);
> > +
> >         if (npages != 1)
> >                 return npages;
> >
> > --
> > 2.31.1.751.gd2f1c929bd-goog
> >
