From: David Matlack <dmatlack@google.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm list <kvm@vger.kernel.org>, Ben Gardon <bgardon@google.com>,
	Junaid Shahid <junaids@google.com>,
	Sean Christopherson <seanjc@google.com>,
	Oliver Upton <oupton@google.com>,
	Harish Barathvajasankar <hbarath@google.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Peter Xu <peterx@redhat.com>,
	Peter Shier <pshier@google.com>
Subject: RFC: KVM: x86/mmu: Eager Page Splitting
Date: Thu, 4 Nov 2021 15:45:30 -0700
Message-ID: <CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com>

The goal of this RFC is to get feedback on "Eager Page Splitting", an
optimization that has been in use in Google Cloud since 2016 to reduce
the performance impact of live migration on customer workloads. We
wanted to get feedback on the feature before delving too far into
porting it to the latest upstream kernel for submission. If there is
interest in adding this feature to KVM we plan to follow up in the
coming months with patches.

Background
==========
When KVM is tracking writes for dirty logging, it write-protects any
2MiB or 1GiB pages that are mapped into the guest. When a vCPU writes
to such a page, a write-protection fault occurs, which KVM handles by
allocating lower-level page tables, mapping in the faulting address
with a PT-level (4KiB) SPTE, and recording that the 4KiB page is
dirty. Handling these faults is done under the MMU lock. In the TDP
MMU, where the MMU lock is a rwlock rather than a spinlock, splitting
is done while holding the lock in read (shared) mode, and atomic
compare-exchanges are used when modifying SPTEs to detect races and
retry.
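
As a minimal sketch of that update pattern (the helper below is
illustrative and simplified, not the actual KVM code):

static bool try_replace_spte(u64 *sptep, u64 old_spte, u64 new_spte)
{
	/* Fails, and the caller retries, if another CPU raced us. */
	return cmpxchg64(sptep, old_spte, new_spte) == old_spte;
}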

Motivation
==========
The write-protection faults that break 2MiB and 1GiB mappings down
into 4KiB mappings are taken on the critical path of guest execution,
which negatively impacts guest performance. The impact scales with
the number of vCPUs, since each vCPU contends for the MMU lock.

This overhead can be seen by running dirty_log_perf_test with 1GiB
per vCPU and comparing the time it takes the test to dirty memory
after enabling dirty logging when the backing source is `anonymous`
(4KiB pages) versus `anonymous_hugetlb_1gb` (1GiB pages).

        |        First Pass Dirty Memory Time           |
        |       tdp_mmu=N       |     tdp_mmu=Y         |
vCPUs   | 4KiB      | 1GiB      | 4KiB      | 1GiB      |
------- | --------- | --------- | --------- | --------- |
1       | 0.063s    | 0.241s    | 0.061s    | 0.238s    |
2       | 0.066s    | 0.280s    | 0.066s    | 0.278s    |
4       | 0.069s    | 0.359s    | 0.070s    | 0.298s    |
8       | 0.112s    | 0.982s    | 0.109s    | 0.345s    |
16      | 0.197s    | 3.153s    | 0.183s    | 1.020s    |
32      | 0.368s    | 9.293s    | 0.425s    | 2.610s    |
64      | 0.456s    | 23.291s   | 0.354s    | 4.212s    |
128     | 0.334s    | 55.030s   | 0.419s    | 7.169s    |
256     | 0.576s    | 141.332s  | 0.492s    | 13.874s   |
416     | 0.881s    | 338.185s  | 0.785s    | 14.582s   |

The performance overhead is egregious with the legacy MMU, as
expected, since every fault requires contending for exclusive access
to the MMU lock. However, even with the TDP MMU, where the MMU lock
is held in read mode, perf recording confirms there is still
contention, due to hammering of atomic operations, that scales with
the number of vCPUs:

+   28.16%  [k] _raw_read_lock
+   28.10%  [k] direct_page_fault
+   21.52%  [k] tdp_mmu_set_spte_atomic
+    6.90%  [k] __handle_changed_spte
+    3.93%  [k] __get_current_cr3_fast
+    3.47%  [k] lockless_pages_from_mm

_raw_read_lock, direct_page_fault, tdp_mmu_set_spte_atomic, and
__handle_changed_spte each spend 99+% of their time on atomic
operations. Note that the _raw_read_lock path specifically reflects
pure reader/reader contention; it is not spinning waiting for writers.
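
A profile like the one above can be collected with, e.g., `perf
record -a -g` run for the duration of the test and summarized with
`perf report`.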

As of Nov 2021, Google Cloud supports live migrating VMs with up to
416 vCPUs and 28 GiB per vCPU. The customers using large instances
tend to run applications that are sensitive to abrupt performance
degradations. For these VMs we use Eager Page Splitting in conjunction
with the Direct MMU (our internal predecessor to the TDP MMU).

Design
======
Eager Page Splitting occurs when dirty logging is enabled on a region
of memory. Before KVM write-protects all large page mappings, we
attempt the following two steps (*):

1. Iterate through all 1GiB SPTEs and for each:
    a. Allocate a new shadow page table.
    b. Populate it with 2MiB SPTEs mapping the 1GiB page.
    c. Replace the 1GiB SPTE with a link to the shadow page table.
2. Iterate through all 2MiB SPTEs and for each:
    a. Allocate a new shadow page table.
    b. Populate it with 4KiB SPTEs mapping the 2MiB page.
    c. Replace the 2MiB SPTE with a link to the shadow page table.

(*) We could split 1GiB pages directly down to 4KiB, and then split
2MiB down to 4KiB. But the two-step approach is a bit simpler to
understand and implement.

The implementation of splitting a 1GiB SPTE into 512 2MiB SPTEs and a
2MiB SPTE into 512 4KiB SPTEs is generalized into a new function that
splits an SPTE into the next level down.

The lower-level SPTEs receive the same set of permissions as the
upper-level SPTE. The exception is execution permission, which is
granted for 4KiB SPTEs if the large SPTE was forced non-executable by
the NX HugePages mitigation.
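
As a rough sketch, the core of that function could look like the
following (helper names here are made up for exposition; only
kvm_mmu_memory_cache_alloc() is the upstream cache API):

/*
 * Illustrative sketch of splitting one large SPTE into 512 SPTEs one
 * level lower.
 */
static void split_large_spte(u64 *large_sptep, u64 large_spte, int level,
			     struct kvm_mmu_memory_cache *cache)
{
	/* Page table page was preallocated before taking the MMU lock. */
	u64 *child_pt = kvm_mmu_memory_cache_alloc(cache);
	int i;

	for (i = 0; i < 512; i++) {
		/*
		 * Each child inherits the parent's PFN range and
		 * permissions; exec permission is restored on 4KiB
		 * children whose parent was forced NX by the NX
		 * HugePages mitigation.
		 */
		child_pt[i] = make_child_spte(large_spte, level, i);
	}

	/* Replace the large SPTE with a link to the new page table. */
	replace_spte(large_sptep, large_spte, make_nonleaf_spte(child_pt));
}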

Eager Page Splitting can gracefully fall back to write-protection in
scenarios where we decide splitting is not worth the complexity,
since the existing write-protection logic runs after Eager Page
Splitting. This makes it possible to build Eager Page Splitting
iteratively (e.g. supporting only direct-map GFNs at first, then
adding rmap support, etc.). It also means that it is trivial to make
Eager Page Splitting an optional capability if desired.
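
In illustrative pseudocode (function names are made up; only the
ordering matters), enabling dirty logging on a memslot becomes:

static void enable_dirty_logging(struct kvm *kvm,
				 struct kvm_memory_slot *slot)
{
	/* Best-effort: may skip SPTEs it does not (yet) know how to split. */
	eager_split_large_pages(kvm, slot);

	/* Existing logic: write-protect any remaining large mappings. */
	write_protect_large_pages(kvm, slot);
}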

Allocations
-----------
In order to avoid allocating while holding the MMU lock, vCPUs
preallocate everything they need to handle the fault and store it in
kvm_mmu_memory_cache structs. Eager Page Splitting does the same
thing, but since it runs outside of a vCPU thread, it needs its own
copies of the kvm_mmu_memory_cache structs. This requires refactoring
the way kvm_mmu_memory_cache structs are passed around in the MMU
code and adding kvm_mmu_memory_cache structs to kvm_arch.

Before splitting a large page, Eager Page Splitting checks whether it
has enough memory in its caches to fully split the page. If it does
not, it flushes TLBs, drops the MMU lock, calls cond_resched(), tops
up the caches, reacquires the MMU lock, and continues where it left
off.
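
A sketch of that pattern, using the upstream memory cache API but
with the surrounding flow simplified (error handling and the iterator
bookkeeping are omitted):

	if (kvm_mmu_memory_cache_nr_free_objects(cache) < min_needed) {
		/* Flush batched SPTE changes before dropping the lock. */
		kvm_flush_remote_tlbs(kvm);
		read_unlock(&kvm->mmu_lock);
		cond_resched();
		kvm_mmu_topup_memory_cache(cache, min_needed);
		read_lock(&kvm->mmu_lock);
		/* ...resume the walk at the last GFN processed... */
	}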

Pros/Cons
---------
The benefits of Eager Page Splitting are:

  * Eliminates the overhead of enabling dirty logging when using large
pages. vCPUs will only take write-protection faults on 4KiB pages,
which can be handled without acquiring the MMU lock (fast_page_fault),
regardless of which page sizes were used before the migration started.
  * Eager Page Splitting is more efficient overall. We do not have to
pay the VM-exit and fault-handling costs for every 4KiB page of guest
memory, the splitter benefits from cache locality, and TLB flushes
can be batched. When using the TDP MMU, the new child page table can
be populated without atomics.

The downsides of Eager Page Splitting are:

  * Introduces a new and complex path for modifying KVM's page tables
outside of the vCPU fault path.
  * Increases the duration of the VM ioctls that enable dirty logging.
This does not affect customer performance but may have unintended
consequences depending on how userspace invokes the ioctl. For
example, eagerly splitting a 1.5TB memslot takes 30 seconds.
  * May increase memory usage, since it allocates the worst-case
number of page tables needed to map all memory at 4KiB (see the rough
estimate below).
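
As a rough estimate of the latter: mapping memory entirely with 4KiB
SPTEs costs one 4KiB page table per 2MiB of guest memory, i.e. about
1/512 (~0.2%) of guest memory in page tables, plus a comparatively
negligible number of higher-level tables. For the 1.5TB memslot
mentioned above, that works out to roughly 3GiB of page tables.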

Alternatives
============
"RFC: Split EPT huge pages in advance of dirty logging" [1] was a
previous proposal to proactively split large pages off of the vCPU
threads. However it required faulting in every page in the migration
thread, a vCPU-like thread in QEMU, which requires extra userspace
support and also is less efficient since it requires faulting.

Another alternative is to modify the vCPU fault handler to map in the
entire large page when handling write-protection faults on large
pages, rather than just mapping in the faulting 4KiB region. This
middle-ground approach would be more efficient than the current
solution, but it has yet to be prototyped or proven in a production
environment.

The last alternative is to perform dirty tracking at 2MiB
granularity. This would reduce the amount of splitting work required
by 512x, making the current approach of splitting on fault less
impactful to customer performance. We are in the early stages of
investigating 2MiB dirty tracking internally, but it will be a while
before it is proven and ready for production. Furthermore, there may
be scenarios where dirty tracking at 4KiB would be preferable, to
reduce the amount of memory that needs to be demand-faulted during
precopy.

Credit
======
Eager Page Splitting was originally designed and implemented by Peter
Feiner <pfeiner@google.com> for Google's internal kernel in 2016.

Appendix
========
In order to collect the performance results, I ran the following
commands on an Intel Cascade Lake host that is used to run Google
Cloud's m2-ultramem-416 VMs:

  echo Y > /sys/module/kvm/parameters/tdp_mmu
  ./dirty_log_perf_test -v${vcpus} -s anonymous
  ./dirty_log_perf_test -v${vcpus} -s anonymous_hugetlb_1gb
  echo N > /sys/module/kvm/parameters/tdp_mmu
  ./dirty_log_perf_test -v${vcpus} -s anonymous
  ./dirty_log_perf_test -v${vcpus} -s anonymous_hugetlb_1gb

[1] https://lists.gnu.org/archive/html/qemu-devel/2020-02/msg04774.html
