Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.

From: Peter Xu <peterx@redhat.com>
To: Anish Moorthy <amoorthy@google.com>
Cc: Nadav Amit <nadav.amit@gmail.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	maz@kernel.org, oliver.upton@linux.dev,
	Sean Christopherson <seanjc@google.com>,
	James Houghton <jthoughton@google.com>,
	bgardon@google.com, dmatlack@google.com, ricarkol@google.com,
	kvm <kvm@vger.kernel.org>,
	kvmarm@lists.linux.dev
Subject: Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
Date: Thu, 27 Apr 2023 16:26:44 -0400
Message-ID: <ZErahL/7DKimG+46@x1n>
In-Reply-To: <CAF7b7mr-_U6vU1iOwukdmOoaT0G1ttyxD62cv=vebnQeXL3R0w@mail.gmail.com>

Hi, Anish,

On Mon, Apr 24, 2023 at 05:15:49PM -0700, Anish Moorthy wrote:
> On Mon, Apr 24, 2023 at 12:44 PM Nadav Amit <nadav.amit@gmail.com> wrote:
> >
> >
> >
> > > On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
> > >
> > > On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote:
> > >>
> > >> If I understand the problem correctly, it sounds as if the proper solution
> > >> should be some kind of a range-locks. If it is too heavy or the interface can
> > >> be changed/extended to wake a single address (instead of a range),
> > >> simpler hashed-locks can be used.
> > >
> > > Some sort of range-based locking system does seem relevant, although I
> > > don't see how that would necessarily speed up the delivery of faults
> > > to UFFD readers: I'll have to think about it more.
> >
> > Perhaps I misread your issue. Based on the scalability issues you raised,
> > I assumed that the problem you encountered is related to lock contention.
> > I do not know whether you profiled it, but some information would be
> > useful.
> 
> No, you had it right: the issue at hand is contention on the uffd wait
> queues. I'm just not sure what the range-based locking would really be
> doing. Events would still have to be delivered to userspace in an
> ordered manner, so it seems to me that each uffd would still need to
> maintain a queue (and the associated contention).
> 
> With respect to the "sharding" idea, I collected some more runs of the
> self test (full command in [1]). This time I omitted the "-a" flag, so
> that every vCPU accesses a different range of guest memory with its
> own UFFD, and set the number of reader threads per UFFD to 1.
> 
> vCPUs   Average Paging Rate   Average Paging Rate
>         (w/o new caps)        (w/ new caps)
>     1   180                   307
>     2    85                   220
>     4    80                   206
>     8    39                   163
>    16    18                   104
>    32     8                    73
>    64     4                    57
>   128     1                    37
>   256     1                    16
> 
> I'm reporting paging rate on a per-vcpu rather than total basis, which
> is why the numbers look so different than the ones in the cover
> letter. I'm actually not sure why the demand paging rate falls off
> with the number of vCPUs (maybe a prioritization issue on my side?),
> but even when UFFDs aren't being contended for it's clear that demand
> paging via memory fault exits is significantly faster.
> 
> I'll try to get some perf traces as well: that will take a little bit
> of time though, as to do it for cycler will involve patching our VMM
> first.
> 
> [1] ./demand_paging_test -b 64M -u MINOR -s shmem -v <n> -r 1 [-w]

Thanks for running this test (and thanks to Nadav for all his input), and
sorry for the late response.

These numbers caught my eye, and I'm very curious why even 2 vcpus scale
that badly.

I gave it a shot on a test machine and I got something slightly different:

  Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 cores, 40 threads)
  $ ./demand_paging_test -b 512M -u MINOR -s shmem -v N
  |-------+----------+--------|
  | n_thr | per-vcpu | total  |
  |-------+----------+--------|
  |     1 | 39.5K    | 39.5K  |
  |     2 | 33.8K    | 67.6K  |
  |     4 | 31.8K    | 127.2K |
  |     8 | 30.8K    | 246.1K |
  |    16 | 21.9K    | 351.0K |
  |-------+----------+--------|

I used more RAM per vcpu (512M rather than 64M) because I have fewer cores.
I didn't try 32+ vcpus, to make sure no two threads contend for the same
core/thread, since I only have 40 hardware threads there, but we can still
compare with the 1-16 vcpu rows of your table.

While testing I noticed bad numbers, plus another bug where NSEC_PER_SEC is
not used properly, so I applied this before running the test:

https://lore.kernel.org/all/20230427201112.2164776-1-peterx@redhat.com/
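
For readers who haven't opened the patch: the exact bug is described at the
link above and is not reproduced here.  As a rough illustration of why a
wrong nanosecond divisor matters for a pages/sec figure, here is a minimal
sketch.  The helper names, the 4K page size, the 3.5s runtime and the wrong
divisor value are all assumptions made up for the example, not taken from
the selftest.

```python
# Illustrative only: how a wrong nanoseconds-per-second constant skews
# a paging-rate calculation.  All concrete values are hypothetical.
NSEC_PER_SEC = 1_000_000_000


def elapsed_seconds(tv_sec, tv_nsec, nsec_per_sec=NSEC_PER_SEC):
    """Convert a (seconds, nanoseconds) pair into fractional seconds."""
    return tv_sec + tv_nsec / nsec_per_sec


def paging_rate(pages, tv_sec, tv_nsec, nsec_per_sec=NSEC_PER_SEC):
    """Pages faulted in per second over the measured interval."""
    return pages / elapsed_seconds(tv_sec, tv_nsec, nsec_per_sec)


# Assumed workload: 512M of guest memory in 4K pages, faulted in 3.5s.
pages = 512 * 1024 * 1024 // 4096

good = paging_rate(pages, 3, 500_000_000)
# Hypothetical mistake: a divisor that is off by a factor of ten turns
# 0.5s of nanoseconds into 5s, so the reported rate is far too low.
bad = paging_rate(pages, 3, 500_000_000, nsec_per_sec=100_000_000)

print(f"correct divisor: {good:.0f} pages/sec")
print(f"wrong divisor:   {bad:.0f} pages/sec")
```

The size of the error depends on the nanosecond part of the elapsed time,
so a bug of this kind distorts different runs by different amounts.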

I think it means it still doesn't scale that well, but not that badly
either: there's no obvious 1/2 drop when going to 2 vcpus.  There are still
a bunch of paths triggered in the test, so I don't expect it to scale fully
linearly either.  From my numbers I just didn't see a drop as drastic as
yours.  I'm not sure whether it's simply a broken number in the test,
parameter differences (e.g. you used only 64M per vcpu), or hardware
differences.
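
To put the two sets of numbers side by side, here is a small sketch (not
part of the selftest) that normalizes each per-vcpu rate to the 1-vcpu rate
of the same run.  The values are copied from the two tables in this thread;
the function and variable names are made up for the example, and the two
machines report rates on different scales, so only the ratios are
comparable.

```python
# Per-vcpu paging rates from this thread, keyed by vcpu count.
# Only the 1-16 vcpu rows are used, since that is the common range.
peter = {1: 39.5, 2: 33.8, 4: 31.8, 8: 30.8, 16: 21.9}
anish_without_caps = {1: 180, 2: 85, 4: 80, 8: 39, 16: 18}
anish_with_caps = {1: 307, 2: 220, 4: 206, 8: 163, 16: 104}


def relative_to_one_vcpu(rates):
    """Express every per-vcpu rate as a fraction of the 1-vcpu rate."""
    base = rates[1]
    return {n: rate / base for n, rate in rates.items()}


runs = {
    "peter": peter,
    "anish (w/o new caps)": anish_without_caps,
    "anish (w/ new caps)": anish_with_caps,
}

for name, rates in runs.items():
    rel = relative_to_one_vcpu(rates)
    row = "  ".join(f"{n}:{frac:.2f}" for n, frac in rel.items())
    print(f"{name:22s} {row}")
```

A ratio of 1.00 at every vcpu count would be perfectly linear scaling; a
ratio below 0.50 at 2 vcpus is the "1/2 drop" mentioned above.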

-- 
Peter Xu