Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.

From: Anish Moorthy <amoorthy@google.com>
To: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Xu <peterx@redhat.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	maz@kernel.org, oliver.upton@linux.dev,
	Sean Christopherson <seanjc@google.com>,
	James Houghton <jthoughton@google.com>,
	bgardon@google.com, dmatlack@google.com, ricarkol@google.com,
	kvm <kvm@vger.kernel.org>,
	kvmarm@lists.linux.dev
Date: Mon, 24 Apr 2023 17:15:49 -0700
Message-ID: <CAF7b7mr-_U6vU1iOwukdmOoaT0G1ttyxD62cv=vebnQeXL3R0w@mail.gmail.com>
In-Reply-To: <84DD9212-31FB-4AF6-80DD-9BA5AEA0EC1A@gmail.com>

On Mon, Apr 24, 2023 at 12:44 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
>
>
> > On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
> >
> > On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote:
> >>
> >> If I understand the problem correctly, it sounds as if the proper solution
> >> should be some kind of range locks. If it is too heavy or the interface can
> >> be changed/extended to wake a single address (instead of a range),
> >> simpler hashed-locks can be used.
> >
> > Some sort of range-based locking system does seem relevant, although I
> > don't see how that would necessarily speed up the delivery of faults
> > to UFFD readers: I'll have to think about it more.
>
> Perhaps I misread your issue. Based on the scalability issues you raised,
> I assumed that the problem you encountered is related to lock contention.
> I do not know whether you profiled it, but some information would be
> useful.

No, you had it right: the issue at hand is contention on the uffd wait
queues. I'm just not sure what the range-based locking would really be
doing. Events would still have to be delivered to userspace in an
ordered manner, so it seems to me that each uffd would still need to
maintain a queue (and the associated contention).

With respect to the "sharding" idea, I collected some more runs of the
self test (full command in [1]). This time I omitted the "-a" flag, so
that every vCPU accesses a different range of guest memory with its
own UFFD, and set the number of reader threads per UFFD to 1.

vCPUs   Average Paging Rate   Average Paging Rate
        (w/o new caps)        (w/ new caps)
1       180                   307
2       85                    220
4       80                    206
8       39                    163
16      18                    104
32      8                     73
64      4                     57
128     1                     37
256     1                     16

I'm reporting paging rate on a per-vCPU rather than total basis, which
is why the numbers look so different from the ones in the cover
letter. I'm actually not sure why the demand paging rate falls off
with the number of vCPUs (maybe a prioritization issue on my side?),
but even when UFFDs aren't being contended for, it's clear that demand
paging via memory fault exits is significantly faster (e.g. 57 vs. 4
at 64 vCPUs).
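
In case it helps when comparing against the cover letter, here is a small
sketch of how the per-vCPU figures in the table above convert to totals and
to a speedup ratio. The struct and helper names (paging_row, total_rate,
speedup) are made up for illustration and are not part of the selftest; the
numbers are just the table rows copied in.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the table above: per-vCPU paging rates, as reported. */
struct paging_row {
	unsigned int vcpus;
	unsigned int rate_without_caps;	/* per-vCPU rate, w/o new caps */
	unsigned int rate_with_caps;	/* per-vCPU rate, w/ new caps */
};

/* Illustrative copy of the table; not generated by the selftest. */
static const struct paging_row paging_rows[] = {
	{   1, 180, 307 },
	{   2,  85, 220 },
	{   4,  80, 206 },
	{   8,  39, 163 },
	{  16,  18, 104 },
	{  32,   8,  73 },
	{  64,   4,  57 },
	{ 128,   1,  37 },
	{ 256,   1,  16 },
};

#define NUM_PAGING_ROWS (sizeof(paging_rows) / sizeof(paging_rows[0]))

/* Aggregate rate across all vCPUs = per-vCPU rate * number of vCPUs. */
static unsigned long total_rate(unsigned int vcpus, unsigned int per_vcpu_rate)
{
	assert(vcpus > 0);
	return (unsigned long)vcpus * per_vcpu_rate;
}

/* Ratio of the "w/ new caps" column to the "w/o new caps" column. */
static double speedup(const struct paging_row *row)
{
	assert(row->rate_without_caps > 0);
	return (double)row->rate_with_caps / row->rate_without_caps;
}
```

For example, total_rate(64, 57) gives the aggregate rate with the new caps
at 64 vCPUs. The ratios at 128 and 256 vCPUs are coarse, since the
denominator there is a single-digit number.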

I'll try to get some perf traces as well. That will take a little bit
of time though, since doing it for cycler will involve patching our VMM
first.

[1] ./demand_paging_test -b 64M -u MINOR -s shmem -v <n> -r 1 [-w]

> It is certainly not my call. But if you ask me, introducing a solution for
> a concrete use-case that requires API changes/enhancements is not
> guaranteed to be the best solution. It may be better first to fully
> understand the existing overheads and agree that there is no alternative
> cleaner and more general solution with similar performance.
>
> Considering the mess that KVM async-PF introduced, I
> would be very careful before introducing such API changes. I did not look
> too much into the details, but some things anyhow look slightly strange
> (which might be since I am out-of-touch with KVM). For instance, returning
> -EFAULT from KVM_RUN? I would have assumed -EAGAIN would be more
> appropriate since the invocation did succeed.

I'm not quite sure whether you're focusing on
KVM_CAP_MEMORY_FAULT_INFO or KVM_CAP_ABSENT_MAPPING_FAULT here. But to
my knowledge, none of the KVM folks have objections to either:
hopefully it stays that way, but we'll have to see :)
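
To make the -EFAULT point a bit more concrete, here is a rough sketch of how
a vCPU thread might tell an annotated memory fault apart from a plain
failure of KVM_RUN. Everything in it is a stand-in for illustration: the
struct, the enum and classify_run_result() are hypothetical and do not
reflect the actual kvm_run layout or the UAPI in the series. It only assumes
that KVM_RUN fails with errno set to EFAULT and that the faulting range is
made available to userspace.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL fault annotation; not the real kvm_run layout. */
struct sketch_fault_info {
	bool valid;	/* true if the range below was filled in */
	uint64_t gpa;	/* start of the faulting guest-physical range */
	uint64_t len;	/* length of the range in bytes */
};

/* What the (hypothetical) vCPU loop should do next. */
enum sketch_action {
	SKETCH_CONTINUE,		/* KVM_RUN succeeded, handle the exit */
	SKETCH_RESOLVE_AND_RETRY,	/* page in the range, then re-enter */
	SKETCH_FATAL,			/* unannotated or unrelated failure */
};

/*
 * ret and err are the return value and errno of the KVM_RUN ioctl.
 * Only an EFAULT that comes with a valid, non-empty range is treated
 * as something userspace can resolve and retry.
 */
static enum sketch_action classify_run_result(int ret, int err,
					      const struct sketch_fault_info *info)
{
	if (ret == 0)
		return SKETCH_CONTINUE;
	if (err == EFAULT && info && info->valid && info->len > 0)
		return SKETCH_RESOLVE_AND_RETRY;
	return SKETCH_FATAL;
}
```

The point of the sketch is that the error code alone does not have to carry
the whole meaning: a bare EFAULT with no annotation still ends up in the
fatal path, as it would today.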