From: James Houghton <jthoughton@google.com>
To: Nadav Amit <nadav.amit@gmail.com>
Cc: Anish Moorthy <amoorthy@google.com>, Peter Xu <peterx@redhat.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	maz@kernel.org, oliver.upton@linux.dev,
	Sean Christopherson <seanjc@google.com>,
	bgardon@google.com, dmatlack@google.com, ricarkol@google.com,
	kvm <kvm@vger.kernel.org>,
	kvmarm@lists.linux.dev
Subject: Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
Date: Thu, 27 Apr 2023 09:38:49 -0700
Message-ID: <CADrL8HVtbfe2OwsELmrrG8eKEQRPfYD1muj4qM2bOHyRU5AgjQ@mail.gmail.com>
In-Reply-To: <307D798E-9135-41F7-80C7-1E0758259F95@gmail.com>

On Mon, Apr 24, 2023 at 5:54 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
>
>
> > On Apr 24, 2023, at 5:15 PM, Anish Moorthy <amoorthy@google.com> wrote:
> >
> > On Mon, Apr 24, 2023 at 12:44 PM Nadav Amit <nadav.amit@gmail.com> wrote:
> >>
> >>
> >>
> >>> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
> >>>
> >>> On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote:
> >>>>
> >>>> If I understand the problem correctly, it sounds as if the proper solution
> >>>> should be some kind of range-lock. If that is too heavy, or if the interface
> >>>> can be changed/extended to wake a single address (instead of a range),
> >>>> simpler hashed locks can be used.
> >>>
> >>> Some sort of range-based locking system does seem relevant, although I
> >>> don't see how that would necessarily speed up the delivery of faults
> >>> to UFFD readers: I'll have to think about it more.
> >>
> >> Perhaps I misread your issue. Based on the scalability issues you raised,
> >> I assumed that the problem you encountered is related to lock contention.
> >> I do not know whether you profiled it, but some information would be
> >> useful.
> >
> > No, you had it right: the issue at hand is contention on the uffd wait
> > queues. I'm just not sure what the range-based locking would really be
> > doing. Events would still have to be delivered to userspace in an
> > ordered manner, so it seems to me that each uffd would still need to
> > maintain a queue (and the associated contention).
>
> There are 2 queues. One for the pending faults that have not yet been reported
> to userspace, and one for the faults that we might need to wake up. The second
> one can have range locks.
>
> Perhaps some hybrid approach would be best: do not block on page-faults that
> KVM runs into, which would remove the need to enqueue on fault_wqh.

Hi Nadav,

If we don't block on the page faults that KVM runs into, what are you
suggesting that these threads do?

1. If you're saying that we should kick the threads out to userspace
and then read the page fault event, then I would say that it's just
unnecessary complexity. (This seems to be what you mean, based on
what you said below.)
2. If you're saying they should busy-wait, then unfortunately we can't
afford that.
3. If it's neither of those, could you clarify?

>
> But I do not know whether reporting through KVM instead of a
> userfaultfd-based mechanism is very clean. I think that an io_uring-based
> solution, such as the one I proposed before, would be more generic. Actually,
> now that I better understand your use-case, you do not need a core to poll;
> you would just be able to read the page-fault information from the io_uring.
>
> Then, you can report whether the page-fault blocked or not in a flag.

This is a fine idea, but I don't think the required complexity is
worth it. The memory fault info reporting piece of this series is
relatively uncontentious, so let's assume we have it at our disposal.

Now, the complexity to make KVM only attempt fast GUP (and EFAULT if
it fails) is really minimal. We automatically know that we don't need
to WAKE and which address to make ready. Userspace is also able to
resolve the fault itself: UFFDIO_CONTINUE if we haven't already done
one, or MADV_POPULATE_WRITE if we have (which forces the userspace
page tables to be populated, going through userfaultfd to do so if a
UFFDIO_CONTINUE still hasn't been done).
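
To make that concrete, here is a minimal, untested sketch of what the
userspace resolution path could look like. resolve_fault() is a
made-up helper; I'm assuming the region is registered in minor mode
and that addr/len come from the annotated fault exit. Error handling
is elided.

#include <errno.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: "uffd" is the userfaultfd for the region. */
static void resolve_fault(int uffd, unsigned long addr, unsigned long len)
{
	struct uffdio_continue cont = {
		.range = { .start = addr, .len = len },
		/* No WAKE needed: the vCPU thread is already in userspace. */
		.mode = UFFDIO_CONTINUE_MODE_DONTWAKE,
	};

	if (ioctl(uffd, UFFDIO_CONTINUE, &cont) < 0 && errno == EEXIST)
		/* Already CONTINUE'd: force the userspace page tables
		 * to be populated instead. */
		madvise((void *)addr, len, MADV_POPULATE_WRITE);
}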

It sounds like what you're suggesting is something like this (rough
sketch after the list):
1. KVM attempts fast GUP then slow GUP.
2. In slow GUP, queue a "non-blocking" userfault, but don't go to
sleep (return with VM_FAULT_SIGBUS or something).
3. The vCPU thread gets kicked out to userspace with EFAULT (+ fault
info if we've enabled it).
4. Read a fault from the userfaultfd or io_uring.
5. Make the page ready, and if it were non-blocking, then don't WAKE.
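
If I've understood that correctly, the userspace half might look like
the sketch below. To be clear, this is illustrative only:
UFFD_PAGEFAULT_FLAG_NOBLOCK and the two helpers do not exist today;
the flag stands in for the "non-blocking" annotation from step 2, and
includes/error handling are elided.

struct uffd_msg msg;

while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
	if (msg.event != UFFD_EVENT_PAGEFAULT)
		continue;

	/* Step 5: make the page ready (e.g. UFFDIO_CONTINUE/COPY). */
	make_page_ready(msg.arg.pagefault.address);

	/* A "non-blocking" fault has no sleeper to wake; the vCPU
	 * thread was kicked out to userspace in step 3. */
	if (!(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_NOBLOCK))
		wake_range(uffd, msg.arg.pagefault.address);
}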

I have some questions/thoughts with this approach:
1. Is io_uring the only way to make reading from a userfaultfd scale?
Maybe it's possible to avoid using a wait_queue for "non-blocking"
faults, but then we'd need a special read() API specifically to
*avoid* the standard fault_pending_wqh queue. Either approach will be
quite complex.
2. We'll still need to annotate KVM in the same-ish place to tell
userfaultfd that the fault should be non-blocking, but we'll probably
*also* need something like GUP_USERFAULT_NONBLOCK and/or
FAULT_FLAG_USERFAULT_NOBLOCK or something. (UFFD_FEATURE_SIGBUS does
not exactly solve this problem either.)
3. If the vCPU thread is getting kicked out to userspace, it seems
like there is no way for it to find/read the #pf it generated. This
seems problematic.

>
> >
> > With respect to the "sharding" idea, I collected some more runs of the
> > self test (full command in [1]). This time I omitted the "-a" flag, so
> > that every vCPU accesses a different range of guest memory with its
> > own UFFD, and set the number of reader threads per UFFD to 1.
>
> Just wondering, did you run the benchmark with DONTWAKE? Sounds as if the
> wake is not needed.
>

Anish's selftest only WAKEs when it's necessary[1]. IOW, we only WAKE
when we actually read the #pf from the userfaultfd. If we were to WAKE
for each fault, we wouldn't get much of a scalability improvement at
all (we would still be contending on the wait_queue locks, just not
quite as much as before).

[1]: https://lore.kernel.org/kvm/20230412213510.1220557-23-amoorthy@google.com/
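
For reference, the pattern is roughly the following. This is a
simplified illustration, not the literal selftest code; src_page and
page_size are assumed to be set up elsewhere:

/* Resolve a fault, waking only if a thread actually went to sleep on
 * this address, i.e. the fault came through the userfaultfd. */
static void uffd_resolve(int uffd, unsigned long addr, bool read_from_uffd)
{
	struct uffdio_copy copy = {
		.dst = addr,
		.src = (unsigned long)src_page,
		.len = page_size,
		.mode = read_from_uffd ? 0 : UFFDIO_COPY_MODE_DONTWAKE,
	};

	ioctl(uffd, UFFDIO_COPY, &copy);
}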

Thanks for your insights/suggestions, Nadav.

- James
