Re: RFC: KVM: x86/mmu: Eager Page Splitting

From: Paolo Bonzini <pbonzini@redhat.com>
To: David Matlack <dmatlack@google.com>
Cc: kvm list <kvm@vger.kernel.org>, Ben Gardon <bgardon@google.com>,
	Junaid Shahid <junaids@google.com>,
	Sean Christopherson <seanjc@google.com>,
	Oliver Upton <oupton@google.com>,
	Harish Barathvajasankar <hbarath@google.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Peter Xu <peterx@redhat.com>,
	Peter Shier <pshier@google.com>
Subject: Re: RFC: KVM: x86/mmu: Eager Page Splitting
Date: Fri, 5 Nov 2021 09:44:14 +0100
Message-ID: <c9bd3bca-f901-d8db-c23d-5292ab7bd247@redhat.com>
In-Reply-To: <CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com>

On 11/4/21 23:45, David Matlack wrote:
> The goal of this RFC is to get feedback on "Eager Page Splitting",
> an optimization that has been in use in Google Cloud since 2016 to 
> reduce the performance impact of live migration on customer 
> workloads. We wanted to get feedback on the feature before delving 
> too far into porting it to the latest upstream kernel for submission.
> If there is interest in adding this feature to KVM we plan to follow
> up in the coming months with patches.

Hi David!

I'm definitely interested in eager page splitting upstream, but with a
twist: in order to limit the proliferation of knobs, I would rather
enable it only when KVM_DIRTY_LOG_INITIALLY_SET is set, and do the split
on the first KVM_CLEAR_DIRTY_LOG ioctl.

Initially-all-set does not require write protection when dirty logging
is enabled; instead, it delays write protection to the first
KVM_CLEAR_DIRTY_LOG.  In fact, I believe that eager page splitting can
be enabled unconditionally for initially-all-set.  You would still have
the benefit of moving the page splitting out of the vCPU run path, and
because you can spread the cost of splitting over multiple calls, most
of the disadvantages go away.
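
As a rough illustration of spreading the work over multiple calls, the
sketch below plans the page ranges that userspace could pass to
successive KVM_CLEAR_DIRTY_LOG ioctls.  It relies on the documented
constraint that first_page must be a multiple of 64 and that num_pages
must also be a multiple of 64 unless the range ends at the end of the
memslot.  The names plan_clear_chunks() and struct clear_chunk are
hypothetical and exist only for this example; no ioctl is issued.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: the page range of one KVM_CLEAR_DIRTY_LOG call. */
struct clear_chunk {
    uint64_t first_page; /* always a multiple of 64 */
    uint32_t num_pages;  /* multiple of 64, except for the slot's last chunk */
};

/*
 * Plan ranges that cover a memslot of slot_pages pages, using at most
 * chunk_pages pages per call.  chunk_pages is rounded down to a multiple
 * of 64.  Returns the number of chunks written to out[], or 0 if
 * chunk_pages rounds down to zero.
 */
static size_t plan_clear_chunks(uint64_t slot_pages, uint32_t chunk_pages,
                                struct clear_chunk *out, size_t max_out)
{
    uint64_t first = 0;
    size_t n = 0;

    assert(out != NULL);

    chunk_pages &= ~63u; /* keep every non-final chunk 64-page aligned */
    if (chunk_pages == 0)
        return 0;

    while (first < slot_pages && n < max_out) {
        uint64_t left = slot_pages - first;

        out[n].first_page = first;
        /* Only the final chunk may be shorter than chunk_pages. */
        out[n].num_pages = left < chunk_pages ? (uint32_t)left : chunk_pages;
        first += out[n].num_pages;
        n++;
    }
    return n;
}
```

Each planned range would then bound how much splitting a single ioctl
has to do, so the caller decides how long any one call may take.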

Initially-all-set is already the best-performing method for bitmap-based
dirty page tracking, so it makes sense to focus on it.  Even if Google
might not be using initially-all-set internally, adding eager page
splitting to the upstream code would remove most of the delta related to
it.  The rest of the delta can be tackled later; I'm not super
interested in adding eager page splitting for the older methods (clear
on KVM_GET_DIRTY_LOG, and manual-clear without initially-all-set), but
it should be useful for the ring buffer method and that *should* share
most of the code with the older methods.

> In order to avoid allocating while holding the MMU lock, vCPUs 
> preallocate everything they need to handle the fault and store it in 
> kvm_mmu_memory_cache structs. Eager Page Splitting does the same 
> thing but since it runs outside of a vCPU thread it needs its own 
> copies of kvm_mmu_memory_cache structs. This requires refactoring the
> way kvm_mmu_memory_cache structs are passed around in the MMU code
> and adding kvm_mmu_memory_cache structs to kvm_arch.

That's okay, we can move more arguments to structs if needed in the same
way as struct kvm_page_fault; or we can use kvm_get_running_vcpu() if
it's easier or more appropriate.

> * Increases the duration of the VM ioctls that enable dirty logging. 
> This does not affect customer performance but may have unintended 
> consequences depending on how userspace invokes the ioctl. For 
> example, eagerly splitting a 1.5TB memslot takes 30 seconds.

This issue goes away (or becomes easier to manage) if it's done in
KVM_CLEAR_DIRTY_LOG.
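
To put a number on this, here is a back-of-the-envelope sketch based on
the figure quoted above (30 seconds for a 1.5TB memslot).  It assumes
that 1.5TB means 1536 GiB and that splitting cost scales linearly with
the size of the range; neither assumption is confirmed in this thread.
The function name split_cost_us() is hypothetical.

```c
#include <stdint.h>

/* Figures quoted in the thread: a 1.5TB memslot takes 30 seconds. */
#define SLOT_GIB       1536ULL     /* 1.5 TiB in GiB (assumed) */
#define SPLIT_TOTAL_US 30000000ULL /* 30 seconds in microseconds */

/*
 * Estimated time spent splitting in one call that covers chunk_gib GiB,
 * ASSUMING the cost is linear in the size of the range.  The result is
 * truncated to whole microseconds.
 */
static uint64_t split_cost_us(uint64_t chunk_gib)
{
    return SPLIT_TOTAL_US * chunk_gib / SLOT_GIB;
}
```

Under those assumptions a call covering 1 GiB would spend roughly 20 ms
splitting, and one covering 64 GiB about 1.25 s.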

> "RFC: Split EPT huge pages in advance of dirty logging" [1] was a 
> previous proposal to proactively split large pages off of the vCPU 
> threads. However it required faulting in every page in the migration 
> thread, a vCPU-like thread in QEMU, which requires extra userspace 
> support and also is less efficient since it requires faulting.

Yeah, this is best done on the kernel side.

> The last alternative is to perform dirty tracking at a 2M 
> granularity. This would reduce the amount of splitting work required
> by 512x, making the current approach of splitting on fault less
> impactful to customer performance. We are in the early stages of 
> investigating 2M dirty tracking internally but it will be a while 
> before it is proven and ready for production. Furthermore there may 
> be scenarios where dirty tracking at 4K would be preferable to reduce
> the amount of memory that needs to be demand-faulted during precopy.

Granularity of dirty tracking is somewhat orthogonal to this anyway,
since you'd have to split 1G pages down to 2M.  So please let me know if
you're okay with the above twist, and let's go ahead with the plan!
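
The split work at each level can be estimated with a small sketch.  It
assumes x86-64 paging with 512 entries per page table and a memslot that
is fully mapped with 1G pages; struct split_work and
count_split_tables() are hypothetical names used only here.

```c
#include <stdint.h>

#define ENTRIES_PER_TABLE 512ULL /* x86-64: 512 entries per 4K page table */

/* Hypothetical result type: page-table pages allocated per split level. */
struct split_work {
    uint64_t tables_1g_to_2m; /* tables needed to map each 1G region with 2M entries */
    uint64_t tables_2m_to_4k; /* extra tables needed to map each 2M region with 4K entries */
};

/* Count the page tables needed to split a slot of slot_gib GiB of 1G mappings. */
static struct split_work count_split_tables(uint64_t slot_gib)
{
    struct split_work w;

    w.tables_1g_to_2m = slot_gib;                     /* one table per 1G page */
    w.tables_2m_to_4k = slot_gib * ENTRIES_PER_TABLE; /* one table per 2M page */
    return w;
}
```

For a 1536 GiB slot this gives 1536 tables to reach 2M and a further
786432 tables (3 GiB of 4K page-table pages) to reach 4K, so the 2M
step is 512 times cheaper but still has to happen.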

Paolo