From: Peter Xu <peterx@redhat.com>
To: Axel Rasmussen <axelrasmussen@google.com>
Cc: Anish Moorthy <amoorthy@google.com>,
	pbonzini@redhat.com, maz@kernel.org, oliver.upton@linux.dev,
	seanjc@google.com, jthoughton@google.com, bgardon@google.com,
	dmatlack@google.com, ricarkol@google.com, kvm@vger.kernel.org,
	kvmarm@lists.linux.dev
Subject: Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
Date: Wed, 19 Apr 2023 17:05:15 -0400
Message-ID: <ZEBXi5tZZNxA+jRs@x1n>
In-Reply-To: <CAJHvVchBqQ8iVHgF9cVZDusMKQM2AjtNx2z=i9ZHP2BosN4tBg@mail.gmail.com>

On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote:
> On Wed, Apr 19, 2023 at 12:56 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, Anish,
> >
> > On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> > > KVM's demand paging selftest is extended to demonstrate the performance
> > > benefits of using the two new capabilities to bypass the userfaultfd
> > > wait queue. The performance samples below (rates in thousands of
> > > pages/s, n = 5) were generated using [2] on an x86 machine with 256
> > > cores.
> > >
> > > vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
> > > 1       150     340
> > > 2       191     477
> > > 4       210     809
> > > 8       155     1239
> > > 16      130     1595
> > > 32      108     2299
> > > 64      86      3482
> > > 128     62      4134
> > > 256     36      4012
> >
> > The numbers look very promising.  Though..
> >
> > >
> > > [1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
> > > [2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
> > >     A quick rundown of the new flags (also detailed in later commits)
> > >         -a registers all of guest memory to a single uffd.
> >
> > ... this is the worst-case scenario.  I'd say it's slightly unfair to
> > first introduce a bottleneck and then compare against it. :)
> >
> > Jokes aside: I think it would make more sense for such a performance
> > solution to be measured on real systems showing real benefits, because
> > so far it's not convincing enough when shown only with the selftest,
> > especially with only one uffd.
> >
> > I don't remember whether I discussed this with James before, but...
> >
> > I know that having multiple uffds in production also means scattered
> > guest memory and scattered VMAs all over the place.  However, splitting
> > the guest's large memory into at least a few (or even tens of) VMAs may
> > still be worth trying?  Do you think that would already resolve some of
> > the contention on userfaultfd, either on the queue or elsewhere?
> 
> We considered sharding into several UFFDs. I do think it helps, but
> also I think there are two main problems with it:
> 
> - One is, I think there's a limit to how much you'd want to do that.
> E.g. splitting guest memory in 1/2, or in 1/10, could be reasonable,
> but 1/100 or 1/1000 might become ridiculous in terms of the
> "scattering" of VMAs and so on like you mentioned. Especially for very
> large VMs (e.g. consider Google offers VMs with ~11T of RAM [1]) I'm
> not sure splitting just "slightly" is enough to get good performance.
> 
> - Another is, sharding UFFDs sort of assumes accesses are randomly
> distributed across the guest physical address space. I'm not sure this
> is guaranteed for all possible VMs / customer workloads. In other
> words, even if we shard across several UFFDs, we may end up with a
> small number of them being "hot".

I have never tried to monitor this, but my feeling is that it's actually
hard to maintain physical contiguity of the pages being used and accessed,
at least on Linux.

The more likely case, to me, is that the system's pages become very
scattered within a few hours after boot unless special care is taken,
e.g., by using hugetlb pages or reservations for a specific purpose.

I also think that is normally optimal for the system: e.g., NUMA balancing
keeps nodes/CPUs using local memory, which helps spread memory
consumption, so each core can access different pages that are local to it.

But I agree I cannot guarantee that it will always work.  If you or Anish
could provide some data points to further support this issue, that would
be very interesting and helpful, IMHO, though not required.
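
For concreteness, a minimal sketch of the kind of sharding discussed
above, using only the stock userfaultfd API (MISSING mode over one big
mapping; the shard count, helper names and alignment assumptions are
illustrative, and error handling is mostly omitted):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define NR_SHARDS 16	/* illustrative; tune per machine / VM size */

/* Open a uffd and register one slice of guest memory with it. */
static int create_uffd_for_range(void *start, size_t len)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg;
	int uffd;

	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
		return -1;

	memset(&reg, 0, sizeof(reg));
	reg.range.start = (unsigned long)start;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	/* Each registration splits the VMA: this is the "scattering". */
	if (ioctl(uffd, UFFDIO_REGISTER, &reg))
		return -1;

	return uffd;	/* a dedicated handler thread then polls this fd */
}

/*
 * Split guest memory into NR_SHARDS contiguous slices, one uffd each.
 * Assumes 'size' is a multiple of NR_SHARDS times the page size.
 */
static int shard_guest_memory(void *mem, size_t size, int uffds[NR_SHARDS])
{
	size_t shard = size / NR_SHARDS;

	for (int i = 0; i < NR_SHARDS; i++) {
		uffds[i] = create_uffd_for_range((char *)mem + i * shard,
						 shard);
		if (uffds[i] < 0)
			return -1;
	}
	return 0;
}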

> 
> A benefit of Anish's series is that it solves the problem more
> fundamentally, and allows demand paging with no "global" locking. So,
> it will scale better regardless of VM size, or access pattern.
> 
> [1]: https://cloud.google.com/compute/docs/memory-optimized-machines
> 
> >
> > With a bunch of VMAs and userfaultfds (paired with uffd fault handler
> > threads and totally separate uffd queues), I'd expect other things,
> > e.g. the network bandwidth, to become the limit to some extent already,
> > without teaching each vCPU thread to report uffd faults itself.
> >
> > This is pure speculation on my part, though; I think that's also why
> > it would be great if such a solution could be tested, more or less, in
> > a real migration scenario to show its real benefits.
> 
> I wonder, is there an existing open source QEMU/KVM based live
> migration stress test?

I am not aware of any.

> 
> I think we could share numbers from some of our internal benchmarks,
> or at the very least give relative numbers (e.g. +50% increase), but
> since a lot of the software stack is proprietary (e.g. we don't use
> QEMU), it may not be that useful or reproducible for folks.

Those numbers can still be helpful.  I was not asking for reproducibility,
but for some test that better justifies this feature.

IMHO the demand paging test (at least the current one) may or may not be a
good test to show the value of this specific feature.  With one uffd, it
obviously bottlenecks on that single uffd, so it doesn't tell us whether
scaling the number of uffds could help.

But it's not friendly to multi-uffd either, because that is the other
extreme, where all memory accesses are spread evenly across the cores, so
the feature probably won't produce a result that proves its worth.
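
For reference, the per-uffd fault handler setup being compared against
looks roughly like the sketch below: one thread per uffd, draining its own
private queue and resolving MISSING faults with UFFDIO_COPY.  The page
size, source buffer and naming are illustrative and error handling is
mostly omitted; the selftest's reader threads use epoll, plain poll() is
used here for brevity, and the MINOR/shmem case would use UFFDIO_CONTINUE
instead (see the sketch further below).

#include <linux/userfaultfd.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define PAGE_SIZE_BYTES 4096UL	/* assume 4K pages for the sketch */

/* One of these runs per uffd, so each shard has its own queue. */
static void *uffd_handler_thread(void *arg)
{
	int uffd = *(int *)arg;
	static char src_page[PAGE_SIZE_BYTES];	/* stand-in for fetched data */
	struct pollfd pfd = { .fd = uffd, .events = POLLIN };

	for (;;) {
		struct uffd_msg msg;
		struct uffdio_copy copy;

		if (poll(&pfd, 1, -1) <= 0)
			continue;
		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			continue;
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;

		/* Resolve the MISSING fault by copying in the page. */
		copy.dst = msg.arg.pagefault.address & ~(PAGE_SIZE_BYTES - 1);
		copy.src = (unsigned long)src_page;
		copy.len = PAGE_SIZE_BYTES;
		copy.mode = 0;
		copy.copy = 0;
		ioctl(uffd, UFFDIO_COPY, &copy);
	}
	return NULL;
}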

From another angle, when a kernel feature is proposed it is always nice
(and sometimes mandatory) to have at least one user of it (besides the
unit tests).  I think that user can also be proprietary software.  It
doesn't need to be in production already, but some POC would definitely be
very helpful in moving the feature forward towards community acceptance.
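
To make the POC point slightly more concrete: with the proposed caps, the
vCPU thread itself would handle the memory-fault exit and resolve the page
before re-entering the guest.  A hedged sketch of just the resolution step
for the MINOR/shmem case is below; the exit reports the faulting gpa per
this series, while translating it to the host virtual address and having
the shmem page already populated are VMM-side assumptions of this sketch.

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/*
 * Called from the vCPU thread after KVM_RUN returns with the memory-fault
 * exit proposed in this series.  'fault_hva' is the host virtual address
 * the VMM computed from the reported gpa; the backing shmem page is
 * assumed to be populated already (MINOR-mode registration).
 */
static int resolve_minor_fault(int uffd, unsigned long fault_hva,
			       unsigned long page_size)
{
	struct uffdio_continue cont = {
		.range = {
			.start = fault_hva & ~(page_size - 1),
			.len = page_size,
		},
		.mode = 0,
	};

	/* Map the existing page into the faulting VMA, then re-run the vCPU. */
	return ioctl(uffd, UFFDIO_CONTINUE, &cont);
}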

Thanks,

-- 
Peter Xu

