* [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU
@ 2022-01-19 23:07 David Matlack
  2022-01-19 23:07 ` [PATCH v2 01/18] KVM: x86/mmu: Rename rmap_write_protect() to kvm_vcpu_write_protect_gfn() David Matlack
                   ` (18 more replies)
  0 siblings, 19 replies; 21+ messages in thread
From: David Matlack @ 2022-01-19 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

This series implements Eager Page Splitting for the TDP MMU.

"Eager Page Splitting" is an optimization that has been in use in Google
Cloud since 2016 to reduce the performance impact of live migration on
customer workloads. It was originally designed and implemented by Peter
Feiner <pfeiner@google.com>.

For background and performance motivation for this feature, please
see "RFC: KVM: x86/mmu: Eager Page Splitting" [1].

Implementation
==============

This series implements support for splitting all huge pages mapped by
the TDP MMU. Pages mapped by the shadow MMU are not split, although I
plan to add that support in a future patchset.

Eager page splitting is triggered in two ways (see the userspace sketch
after this list):

- KVM_SET_USER_MEMORY_REGION ioctl: If this ioctl is invoked to enable
  dirty logging on a memslot and KVM_DIRTY_LOG_INITIALLY_SET is not
  enabled, KVM will attempt to split all huge pages in the memslot down
  to the 4K level.

- KVM_CLEAR_DIRTY_LOG ioctl: If this ioctl is invoked and
  KVM_DIRTY_LOG_INITIALLY_SET is enabled, KVM will attempt to split all
  huge pages cleared by the ioctl down to the 4K level before attempting
  to write-protect them.
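
For illustration, here is a rough userspace sketch of the two entry
points. This is not code from this series; vm_fd, hva, size, and bitmap
are placeholders and error handling is omitted:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/*
 * Illustrative only: vm_fd is an open VM fd, hva/size describe the
 * memslot's backing memory, and bitmap selects which pages to clear
 * (at least size / 4096 / 8 bytes, with bits set for pages to clear).
 */
static void enable_then_clear_dirty_log(int vm_fd, void *hva, __u64 size,
                                        void *bitmap)
{
        struct kvm_userspace_memory_region region = {
                .slot            = 0,
                .flags           = KVM_MEM_LOG_DIRTY_PAGES,
                .guest_phys_addr = 0,
                .memory_size     = size,
                .userspace_addr  = (__u64)(unsigned long)hva,
        };
        struct kvm_clear_dirty_log clear = {
                .slot         = 0,
                .first_page   = 0,
                .num_pages    = size / 4096,
                .dirty_bitmap = bitmap,
        };

        /*
         * Path 1: enable dirty logging on the memslot. Without
         * KVM_DIRTY_LOG_INITIALLY_SET, KVM splits the slot's huge pages
         * during this ioctl.
         */
        ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

        /*
         * Path 2: clear (part of) the dirty log. With
         * KVM_DIRTY_LOG_INITIALLY_SET enabled via
         * KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2, KVM splits the huge pages
         * covered by the cleared range here, before write-protecting
         * them.
         */
        ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
}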

Eager page splitting is enabled by default in both paths but can be
disabled via module param eager_page_split=N.

Splitting for pages mapped by the TDP MMU is done under the MMU lock in
read mode. The lock is dropped and the thread rescheduled if contention
or need_resched() is detected.
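
As a rough sketch of that walk (helper names are approximated from this
series and from the existing TDP MMU code; the real split logic is in
the later patches):

        for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
retry:
                /* Yield mmu_lock on contention or need_resched(). */
                if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
                        continue;

                if (!is_shadow_present_pte(iter.old_spte) ||
                    !is_large_pte(iter.old_spte))
                        continue;

                /* Replace the huge SPTE with a page table of 4K SPTEs. */
                if (tdp_mmu_split_huge_page_atomic(kvm, &iter))
                        goto retry;
        }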

To allocate memory for the lower level page tables, we attempt to
allocate without dropping the MMU lock using GFP_NOWAIT to avoid doing
direct reclaim or invoking filesystem callbacks. If that fails we drop
the lock and perform a normal GFP_KERNEL allocation.
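
A simplified sketch of that allocation strategy, with illustrative
helper names (not necessarily the ones used in the patches):

static struct kvm_mmu_page *alloc_sp_for_split(struct kvm *kvm,
                                               struct tdp_iter *iter)
{
        struct kvm_mmu_page *sp;

        /*
         * GFP_NOWAIT avoids direct reclaim and filesystem callbacks, so
         * this allocation is safe with mmu_lock held for read.
         */
        sp = __alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
        if (sp)
                return sp;

        /* Fall back to a normal, sleepable allocation with the lock dropped. */
        rcu_read_unlock();
        read_unlock(&kvm->mmu_lock);

        sp = __alloc_sp_for_split(GFP_KERNEL_ACCOUNT);

        read_lock(&kvm->mmu_lock);
        rcu_read_lock();

        /* The caller must restart its walk since the lock was dropped. */
        iter->yielded = true;

        return sp;
}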

Performance
===========

Eager page splitting moves the cost of splitting huge pages off of the
vCPU thread and onto the thread invoking one of the aforementioned
ioctls. This is useful because:

 - Splitting on the vCPU thread interrupts vCPU execution and is
   disruptive to customers, whereas splitting on VM ioctl threads can
   run in parallel with vCPU execution.

 - Splitting on the VM ioctl thread is more efficient because it does
   not require performing VM-exit handling and page table walks for
   every 4K page.

To measure the performance impact of Eager Page Splitting I ran
dirty_log_perf_test with 96 virtual CPUs, 1GiB per vCPU, and 1GiB
HugeTLB memory.

When KVM_DIRTY_LOG_INITIALLY_SET is set, we can see that the first
KVM_CLEAR_DIRTY_LOG iteration gets longer because KVM is splitting
huge pages. But the time it takes for vCPUs to dirty their memory
is significantly shorter since they do not have to take write-
protection faults.

           | Iteration 1 clear dirty log time | Iteration 2 dirty memory time
---------- | -------------------------------- | -----------------------------
Before     | 0.049572219s                     | 2.751442902s
After      | 1.667811687s                     | 0.127016504s

Eager page splitting does make subsequent KVM_CLEAR_DIRTY_LOG ioctls
about 4% slower since it always walks the page tables looking for pages
to split.  This can be avoided but will require extra memory and/or code
complexity to track when splitting can be skipped.

           | Iteration 3 clear dirty log time
---------- | --------------------------------
Before     | 1.374501209s
After      | 1.422478617s

When not using KVM_DIRTY_LOG_INITIALLY_SET, KVM performs splitting on
the entire memslot during the KVM_SET_USER_MEMORY_REGION ioctl that
enables dirty logging. This shows up as an increase in the time it
takes to enable dirty logging. In return, vCPUs avoid taking
write-protection faults, which again shows up in the dirty memory time.

           | Enabling dirty logging time      | Iteration 1 dirty memory time
---------- | -------------------------------- | -----------------------------
Before     | 0.001683739s                     | 2.943733325s
After      | 1.546904175s                     | 0.145979748s

Testing
=======

- Ran all kvm-unit-tests and KVM selftests on debug and non-debug kernels.

- Ran dirty_log_perf_test with different backing sources (anonymous,
  anonymous_thp, anonymous_hugetlb_2mb, anonymous_hugetlb_1gb) with and
  without Eager Page Splitting enabled.

- Added a tracepoint locally to time the GFP_NOWAIT allocations. Across
  40 runs of dirty_log_perf_test using 1GiB HugeTLB with 96 vCPUs there
  were only 4 allocations that took longer than 20 microseconds and the
  longest was 60 microseconds. None of the GFP_NOWAIT allocations
  failed.

- I have not been able to trigger a GFP_NOWAIT allocation failure (to
  exercise the fallback path). However, I did manually modify the code
  to force every allocation to fall back by removing the GFP_NOWAIT
  allocation altogether to make sure the logic works correctly.

- Live migrated a 32 vCPU 32 GiB Linux VM running a workload that
  aggressively writes to guest memory with Eager Page Splitting enabled.
  Observed pages being split via tracepoint and the pages_{4k,2m,1g}
  stats.

Version Log
===========

v2:

[Overall Changes]
 - Additional testing by live migrating a Linux VM (see above).
 - Add Sean's, Ben's, and Peter's Reviewed-by tags.
 - Use () when referring to functions in commit message and comments [Sean]
 - Add TDP MMU to shortlog where applicable [Sean]
 - Fix grammatical errors in commit messages [Sean]
 - Break 80+ char function declarations across multiple lines [Sean]

[PATCH v1 03/13] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
 - Remove useless empty line [Peter]
 - Tighten up the wording in comments [Sean]
 - Move result of rcu_dereference() to a local variable to cut down line lengths [Sean]

[PATCH v1 04/13] KVM: x86/mmu: Factor out logic to atomically install a new page table
 - Add prep patch to return 0/-EBUSY instead of bool [Sean]
 - Add prep patch to rename {un,}link_page to {un,}link_sp [Sean]
 - Fold tdp_mmu_link_page() into tdp_mmu_install_sp_atomic() [Sean]

[PATCH v1 05/13] KVM: x86/mmu: Move restore_acc_track_spte to spte.c
 - Make inline [Sean]
 - Eliminate WARN_ON_ONCE [Sean]
 - Eliminate unnecessary local variable new_spte [Sean].

[PATCH v1 06/13] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root
 - Eliminate unnecessary local variable root_pt [Sean]

[PATCH v1 07/13] KVM: x86/mmu: Derive page role from parent
 - Eliminate redundant role overrides [Sean]

[PATCH v1 08/13] KVM: x86/mmu: Refactor TDP MMU child page initialization
 - Rename alloc_tdp_mmu_page*() functions [Sean]

[PATCH v1 09/13] KVM: x86/mmu: Split huge pages when dirty logging is enabled
 - Drop access from make_huge_page_split_spte() [Sean]
 - Drop is_mmio_spte() check from make_huge_page_split_spte() [Sean]
 - Change WARN_ON to WARN_ON_ONCE in make_huge_page_split_spte() [Sean]
 - Improve comment for making 4K SPTEs executable [Sean]
 - Rename mark_spte_executable() to mark_spte_executable() [Sean]
 - Put return type on same line as tdp_mmu_split_huge_page_atomic() [Sean]
 - Drop child_spte local variable in tdp_mmu_split_huge_page_atomic() [Sean]
 - Make alloc_tdp_mmu_page_for_split() play nice with
   commit 3a0f64de479c ("KVM: x86/mmu: Don't advance iterator after restart due to yielding") [Sean]
 - Free unused sp after dropping RCU [Sean]
 - Rename module param to something shorter [Sean]
 - Document module param somewhere [Sean]
 - Fix rcu_read_unlock in tdp_mmu_split_huge_pages_root() [me]
 - Document TLB flush behavior [Peter]

[PATCH v1 10/13] KVM: Push MMU locking down into kvm_arch_mmu_enable_log_dirty_pt_masked
 - Drop [Peter]

[PATCH v1 11/13] KVM: x86/mmu: Split huge pages during CLEAR_DIRTY_LOG
 - Hold the lock in write-mode when splitting [Peter]
 - Document TLB flush behavior [Peter]

[PATCH v1 12/13] KVM: x86/mmu: Add tracepoint for splitting huge pages
 - Record if split succeeded or failed [Sean]

v1: https://lore.kernel.org/kvm/20211213225918.672507-1-dmatlack@google.com/

[Overall Changes]
 - Use "huge page" instead of "large page" [Sean Christopherson]

[RFC PATCH 02/15] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect
 - Add Ben's Reviewed-by.
 - Add Peter's Reviewed-by.

[RFC PATCH 03/15] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
 - Add comment when updating old_spte [Ben Gardon]
 - Follow kernel style of else case in zap_gfn_range [Ben Gardon]
 - Don't delete old_spte update after zapping in kvm_tdp_mmu_map [me]

[RFC PATCH 04/15] KVM: x86/mmu: Factor out logic to atomically install a new page table
 - Add blurb to commit message describing intentional drop of tracepoint [Ben Gardon]
 - Consolidate "u64 spte = make_nonleaf_spte(...);" onto one line [Sean Christopherson]
 - Do not free the sp if set fails  [Sean Christopherson]

[RFC PATCH 05/15] KVM: x86/mmu: Abstract mmu caches out to a separate struct
 - Drop to adopt Sean's proposed allocation scheme.

[RFC PATCH 06/15] KVM: x86/mmu: Derive page role from parent
 - No changes.

[RFC PATCH 07/15] KVM: x86/mmu: Pass in vcpu->arch.mmu_caches instead of vcpu
 - Drop to adopt Sean's proposed allocation scheme.

[RFC PATCH 08/15] KVM: x86/mmu: Helper method to check for large and present sptes
 - Drop this commit and the helper function [Sean Christopherson]

[RFC PATCH 09/15] KVM: x86/mmu: Move restore_acc_track_spte to spte.c
 - Add Ben's Reviewed-by.

[RFC PATCH 10/15] KVM: x86/mmu: Abstract need_resched logic from tdp_mmu_iter_cond_resched
 - Drop to adopt Sean's proposed allocation scheme.

[RFC PATCH 11/15] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root
 - Add Ben's Reviewed-by.

[RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
 - Add a module parameter to control Eager Page Splitting [Peter Xu]
 - Change level to large_spte_level [Ben Gardon]
 - Get rid of BUG_ONs [Ben Gardon]
 - Change += to |= and add a comment [Ben Gardon]
 - Do not flush TLBs when dropping the MMU lock. [Sean Christopherson]
 - Allocate memory directly from the kernel instead of using mmu_caches [Sean Christopherson]

[RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
 - Fix deadlock by refactoring MMU locking and dropping write lock before splitting. [kernel test robot]
 - Did not follow Sean's suggestion of skipping write-protection if splitting
   succeeds as it would require extra complexity since we aren't splitting
   pages in the shadow MMU yet.

[RFC PATCH 14/15] KVM: x86/mmu: Add tracepoint for splitting large pages
 - No changes.

[RFC PATCH 15/15] KVM: x86/mmu: Update page stats when splitting large pages
 - Squash into patch that first introduces page splitting.

Note: I opted not to change TDP MMU functions to return int instead of
bool per Sean's suggestion. I agree this change should be done but can
be left to a separate series.

RFC: https://lore.kernel.org/kvm/20211119235759.1304274-1-dmatlack@google.com/

[1] https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/#t

David Matlack (18):
  KVM: x86/mmu: Rename rmap_write_protect() to
    kvm_vcpu_write_protect_gfn()
  KVM: x86/mmu: Rename __rmap_write_protect() to rmap_write_protect()
  KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
  KVM: x86/mmu: Change tdp_mmu_{set,zap}_spte_atomic() to return
    0/-EBUSY
  KVM: x86/mmu: Rename TDP MMU functions that handle shadow pages
  KVM: x86/mmu: Rename handle_removed_tdp_mmu_page() to
    handle_removed_pt()
  KVM: x86/mmu: Consolidate logic to atomically install a new TDP MMU
    page table
  KVM: x86/mmu: Remove unnecessary warnings from
    restore_acc_track_spte()
  KVM: x86/mmu: Drop new_spte local variable from
    restore_acc_track_spte()
  KVM: x86/mmu: Move restore_acc_track_spte() to spte.h
  KVM: x86/mmu: Refactor TDP MMU iterators to take kvm_mmu_page root
  KVM: x86/mmu: Remove redundant role overrides for TDP MMU shadow pages
  KVM: x86/mmu: Derive page role for TDP MMU shadow pages from parent
  KVM: x86/mmu: Separate TDP MMU shadow page allocation and
    initialization
  KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty
    logging is enabled
  KVM: x86/mmu: Split huge pages mapped by the TDP MMU during
    KVM_CLEAR_DIRTY_LOG
  KVM: x86/mmu: Add tracepoint for splitting huge pages
  KVM: selftests: Add an option to disable MANUAL_PROTECT_ENABLE and
    INITIALLY_SET

 .../admin-guide/kernel-parameters.txt         |  26 ++
 arch/x86/include/asm/kvm_host.h               |   7 +
 arch/x86/kvm/mmu/mmu.c                        |  79 ++--
 arch/x86/kvm/mmu/mmutrace.h                   |  23 +
 arch/x86/kvm/mmu/spte.c                       |  59 +++
 arch/x86/kvm/mmu/spte.h                       |  16 +
 arch/x86/kvm/mmu/tdp_iter.c                   |   8 +-
 arch/x86/kvm/mmu/tdp_iter.h                   |  10 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    | 419 +++++++++++++-----
 arch/x86/kvm/mmu/tdp_mmu.h                    |   5 +
 arch/x86/kvm/x86.c                            |   6 +
 arch/x86/kvm/x86.h                            |   2 +
 .../selftests/kvm/dirty_log_perf_test.c       |  13 +-
 13 files changed, 520 insertions(+), 153 deletions(-)


base-commit: edb9e50dbe18394d0fc9d0494f5b6046fc912d33
-- 
2.35.0.rc0.227.g00780c9af4-goog



* [PATCH v2 01/18] KVM: x86/mmu: Rename rmap_write_protect() to kvm_vcpu_write_protect_gfn()
  2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
@ 2022-01-19 23:07 ` David Matlack
  2022-01-19 23:07 ` [PATCH v2 02/18] KVM: x86/mmu: Rename __rmap_write_protect() to rmap_write_protect() David Matlack
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-01-19 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

rmap_write_protect() is a poor name because it also write-protects SPTEs
in the TDP MMU, not just SPTEs in the rmap. It is also confusing that
rmap_write_protect() is not a simple wrapper around
__rmap_write_protect(), since that is the common pattern for functions
with double-underscore names.

Rename rmap_write_protect() to kvm_vcpu_write_protect_gfn() to convey
that KVM is write-protecting a specific gfn in the context of a vCPU.

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8ed2b42a7aa3..b541683c29c7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1421,7 +1421,7 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 	return write_protected;
 }
 
-static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
+static bool kvm_vcpu_write_protect_gfn(struct kvm_vcpu *vcpu, u64 gfn)
 {
 	struct kvm_memory_slot *slot;
 
@@ -2024,7 +2024,7 @@ static int mmu_sync_children(struct kvm_vcpu *vcpu,
 		bool protected = false;
 
 		for_each_sp(pages, sp, parents, i)
-			protected |= rmap_write_protect(vcpu, sp->gfn);
+			protected |= kvm_vcpu_write_protect_gfn(vcpu, sp->gfn);
 
 		if (protected) {
 			kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, true);
@@ -2149,7 +2149,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	hlist_add_head(&sp->hash_link, sp_list);
 	if (!direct) {
 		account_shadowed(vcpu->kvm, sp);
-		if (level == PG_LEVEL_4K && rmap_write_protect(vcpu, gfn))
+		if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
 	}
 	trace_kvm_mmu_get_page(sp, true);

base-commit: edb9e50dbe18394d0fc9d0494f5b6046fc912d33
-- 
2.35.0.rc0.227.g00780c9af4-goog



* [PATCH v2 02/18] KVM: x86/mmu: Rename __rmap_write_protect() to rmap_write_protect()
  2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
  2022-01-19 23:07 ` [PATCH v2 01/18] KVM: x86/mmu: Rename rmap_write_protect() to kvm_vcpu_write_protect_gfn() David Matlack
@ 2022-01-19 23:07 ` David Matlack
  2022-01-19 23:07 ` [PATCH v2 03/18] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails David Matlack
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-01-19 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

The function formerly known as rmap_write_protect() has been renamed to
kvm_vcpu_write_protect_gfn(), so we can get rid of the double
underscores in front of __rmap_write_protect().

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b541683c29c7..fb6718714caa 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1229,9 +1229,9 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
 	return mmu_spte_update(sptep, spte);
 }
 
-static bool __rmap_write_protect(struct kvm *kvm,
-				 struct kvm_rmap_head *rmap_head,
-				 bool pt_protect)
+static bool rmap_write_protect(struct kvm *kvm,
+			       struct kvm_rmap_head *rmap_head,
+			       bool pt_protect)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
@@ -1311,7 +1311,7 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 	while (mask) {
 		rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
 					PG_LEVEL_4K, slot);
-		__rmap_write_protect(kvm, rmap_head, false);
+		rmap_write_protect(kvm, rmap_head, false);
 
 		/* clear the first set bit */
 		mask &= mask - 1;
@@ -1410,7 +1410,7 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 	if (kvm_memslots_have_rmaps(kvm)) {
 		for (i = min_level; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
 			rmap_head = gfn_to_rmap(gfn, i, slot);
-			write_protected |= __rmap_write_protect(kvm, rmap_head, true);
+			write_protected |= rmap_write_protect(kvm, rmap_head, true);
 		}
 	}
 
@@ -5802,7 +5802,7 @@ static bool slot_rmap_write_protect(struct kvm *kvm,
 				    struct kvm_rmap_head *rmap_head,
 				    const struct kvm_memory_slot *slot)
 {
-	return __rmap_write_protect(kvm, rmap_head, false);
+	return rmap_write_protect(kvm, rmap_head, false);
 }
 
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
-- 
2.35.0.rc0.227.g00780c9af4-goog



* [PATCH v2 03/18] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
  2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
  2022-01-19 23:07 ` [PATCH v2 01/18] KVM: x86/mmu: Rename rmap_write_protect() to kvm_vcpu_write_protect_gfn() David Matlack
  2022-01-19 23:07 ` [PATCH v2 02/18] KVM: x86/mmu: Rename __rmap_write_protect() to rmap_write_protect() David Matlack
@ 2022-01-19 23:07 ` David Matlack
  2022-01-19 23:07 ` [PATCH v2 04/18] KVM: x86/mmu: Change tdp_mmu_{set,zap}_spte_atomic() to return 0/-EBUSY David Matlack
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-01-19 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

Consolidate a bunch of code that was manually re-reading the spte if the
cmpxchg failed. There is no extra cost to doing this because we already
have the spte value as a result of the cmpxchg (and in fact this
eliminates re-reading the spte), and none of the call sites depend on
iter->old_spte retaining the stale spte value.

Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 52 +++++++++++++++++---------------------
 1 file changed, 23 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bc9e3553fba2..cb7c4e3e9001 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -492,16 +492,23 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
  * and handle the associated bookkeeping.  Do not mark the page dirty
  * in KVM's dirty bitmaps.
  *
+ * If setting the SPTE fails because it has changed, iter->old_spte will be
+ * refreshed to the current value of the spte.
+ *
  * @kvm: kvm instance
  * @iter: a tdp_iter instance currently on the SPTE that should be set
  * @new_spte: The value the SPTE should be set to
  * Returns: true if the SPTE was set, false if it was not. If false is returned,
- *	    this function will have no side-effects.
+ *          this function will have no side-effects other than setting
+ *          iter->old_spte to the last known value of spte.
  */
 static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 					   struct tdp_iter *iter,
 					   u64 new_spte)
 {
+	u64 *sptep = rcu_dereference(iter->sptep);
+	u64 old_spte;
+
 	WARN_ON_ONCE(iter->yielded);
 
 	lockdep_assert_held_read(&kvm->mmu_lock);
@@ -517,9 +524,17 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
 	 * does not hold the mmu_lock.
 	 */
-	if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
-		      new_spte) != iter->old_spte)
+	old_spte = cmpxchg64(sptep, iter->old_spte, new_spte);
+	if (old_spte != iter->old_spte) {
+		/*
+		 * The page table entry was modified by a different logical
+		 * CPU. Refresh iter->old_spte with the current value so the
+		 * caller operates on fresh data, e.g. if it retries
+		 * tdp_mmu_set_spte_atomic().
+		 */
+		iter->old_spte = old_spte;
 		return false;
+	}
 
 	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
 			      new_spte, iter->level, true);
@@ -751,11 +766,6 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			tdp_mmu_set_spte(kvm, &iter, 0);
 			flush = true;
 		} else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
-			/*
-			 * The iter must explicitly re-read the SPTE because
-			 * the atomic cmpxchg failed.
-			 */
-			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
 			goto retry;
 		}
 	}
@@ -1193,14 +1203,9 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
 
-		if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte)) {
-			/*
-			 * The iter must explicitly re-read the SPTE because
-			 * the atomic cmpxchg failed.
-			 */
-			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+		if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
 			goto retry;
-		}
+
 		spte_set = true;
 	}
 
@@ -1261,14 +1266,9 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 				continue;
 		}
 
-		if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte)) {
-			/*
-			 * The iter must explicitly re-read the SPTE because
-			 * the atomic cmpxchg failed.
-			 */
-			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+		if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
 			goto retry;
-		}
+
 		spte_set = true;
 	}
 
@@ -1392,14 +1392,8 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 			continue;
 
 		/* Note, a successful atomic zap also does a remote TLB flush. */
-		if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
-			/*
-			 * The iter must explicitly re-read the SPTE because
-			 * the atomic cmpxchg failed.
-			 */
-			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+		if (!tdp_mmu_zap_spte_atomic(kvm, &iter))
 			goto retry;
-		}
 	}
 
 	rcu_read_unlock();
-- 
2.35.0.rc0.227.g00780c9af4-goog



* [PATCH v2 04/18] KVM: x86/mmu: Change tdp_mmu_{set,zap}_spte_atomic() to return 0/-EBUSY
  2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (2 preceding siblings ...)
  2022-01-19 23:07 ` [PATCH v2 03/18] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails David Matlack
@ 2022-01-19 23:07 ` David Matlack
  2022-01-19 23:07 ` [PATCH v2 05/18] KVM: x86/mmu: Rename TDP MMU functions that handle shadow pages David Matlack
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-01-19 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

tdp_mmu_set_spte_atomic() and tdp_mmu_zap_spte_atomic() return a bool
with true indicating the SPTE modification was successful and false
indicating failure. Change these functions to return an int instead
since that is the common practice.

Opportunistically fix up the kernel-doc style for the Return section
above tdp_mmu_set_spte_atomic().

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 47 +++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index cb7c4e3e9001..3dc2e2a6d439 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -498,13 +498,15 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
  * @kvm: kvm instance
  * @iter: a tdp_iter instance currently on the SPTE that should be set
  * @new_spte: The value the SPTE should be set to
- * Returns: true if the SPTE was set, false if it was not. If false is returned,
- *          this function will have no side-effects other than setting
- *          iter->old_spte to the last known value of spte.
+ * Return:
+ * * 0      - If the SPTE was set.
+ * * -EBUSY - If the SPTE cannot be set. In this case this function will have
+ *            no side-effects other than setting iter->old_spte to the last
+ *            known value of the spte.
  */
-static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
-					   struct tdp_iter *iter,
-					   u64 new_spte)
+static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
+					  struct tdp_iter *iter,
+					  u64 new_spte)
 {
 	u64 *sptep = rcu_dereference(iter->sptep);
 	u64 old_spte;
@@ -518,7 +520,7 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	 * may modify it.
 	 */
 	if (is_removed_spte(iter->old_spte))
-		return false;
+		return -EBUSY;
 
 	/*
 	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
@@ -533,27 +535,30 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 		 * tdp_mmu_set_spte_atomic().
 		 */
 		iter->old_spte = old_spte;
-		return false;
+		return -EBUSY;
 	}
 
 	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
 			      new_spte, iter->level, true);
 	handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);
 
-	return true;
+	return 0;
 }
 
-static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
-					   struct tdp_iter *iter)
+static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
+					  struct tdp_iter *iter)
 {
+	int ret;
+
 	/*
 	 * Freeze the SPTE by setting it to a special,
 	 * non-present value. This will stop other threads from
 	 * immediately installing a present entry in its place
 	 * before the TLBs are flushed.
 	 */
-	if (!tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE))
-		return false;
+	ret = tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE);
+	if (ret)
+		return ret;
 
 	kvm_flush_remote_tlbs_with_address(kvm, iter->gfn,
 					   KVM_PAGES_PER_HPAGE(iter->level));
@@ -568,7 +573,7 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 	 */
 	WRITE_ONCE(*rcu_dereference(iter->sptep), 0);
 
-	return true;
+	return 0;
 }
 
 
@@ -765,7 +770,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		if (!shared) {
 			tdp_mmu_set_spte(kvm, &iter, 0);
 			flush = true;
-		} else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
+		} else if (tdp_mmu_zap_spte_atomic(kvm, &iter)) {
 			goto retry;
 		}
 	}
@@ -923,7 +928,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 
 	if (new_spte == iter->old_spte)
 		ret = RET_PF_SPURIOUS;
-	else if (!tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
+	else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
 		return RET_PF_RETRY;
 
 	/*
@@ -989,7 +994,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 */
 		if (is_shadow_present_pte(iter.old_spte) &&
 		    is_large_pte(iter.old_spte)) {
-			if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
+			if (tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
 				break;
 
 			/*
@@ -1015,7 +1020,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			new_spte = make_nonleaf_spte(child_pt,
 						     !shadow_accessed_mask);
 
-			if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, new_spte)) {
+			if (!tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, new_spte)) {
 				tdp_mmu_link_page(vcpu->kvm, sp,
 						  fault->huge_page_disallowed &&
 						  fault->req_level >= iter.level);
@@ -1203,7 +1208,7 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
 
-		if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
+		if (tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
 			goto retry;
 
 		spte_set = true;
@@ -1266,7 +1271,7 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 				continue;
 		}
 
-		if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
+		if (tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
 			goto retry;
 
 		spte_set = true;
@@ -1392,7 +1397,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 			continue;
 
 		/* Note, a successful atomic zap also does a remote TLB flush. */
-		if (!tdp_mmu_zap_spte_atomic(kvm, &iter))
+		if (tdp_mmu_zap_spte_atomic(kvm, &iter))
 			goto retry;
 	}
 
-- 
2.35.0.rc0.227.g00780c9af4-goog



* [PATCH v2 05/18] KVM: x86/mmu: Rename TDP MMU functions that handle shadow pages
  2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (3 preceding siblings ...)
  2022-01-19 23:07 ` [PATCH v2 04/18] KVM: x86/mmu: Change tdp_mmu_{set,zap}_spte_atomic() to return 0/-EBUSY David Matlack
@ 2022-01-19 23:07 ` David Matlack
  2022-01-19 23:07 ` [PATCH v2 06/18] KVM: x86/mmu: Rename handle_removed_tdp_mmu_page() to handle_removed_pt() David Matlack
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-01-19 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

Rename 3 functions in tdp_mmu.c that handle shadow pages:

  alloc_tdp_mmu_page()  -> tdp_mmu_alloc_sp()
  tdp_mmu_link_page()   -> tdp_mmu_link_sp()
  tdp_mmu_unlink_page() -> tdp_mmu_unlink_sp()

These changes make tdp_mmu_ a consistent prefix before the verb in the
function name, and make it clearer that these functions deal with
kvm_mmu_page structs rather than struct pages.

One could argue that "shadow page" is the wrong term for a page table in
the TDP MMU since it never actually shadows a guest page table.
However, "shadow page" (or "sp" for short) has evolved to become the
standard term in KVM when referring to a kvm_mmu_page struct, and its
associated page table and other metadata, regardless of whether the page
table shadows a guest page table. So this commit just makes the TDP MMU
more consistent with the rest of KVM.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3dc2e2a6d439..15cce503ffde 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -172,8 +172,8 @@ static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
 	return role;
 }
 
-static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-					       int level)
+static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, gfn_t gfn,
+					     int level)
 {
 	struct kvm_mmu_page *sp;
 
@@ -207,7 +207,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 			goto out;
 	}
 
-	root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
+	root = tdp_mmu_alloc_sp(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
 	refcount_set(&root->tdp_mmu_root_count, 1);
 
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
@@ -252,15 +252,15 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
 }
 
 /**
- * tdp_mmu_link_page - Add a new page to the list of pages used by the TDP MMU
+ * tdp_mmu_link_sp() - Add a new shadow page to the list of used pages
  *
  * @kvm: kvm instance
  * @sp: the new page
  * @account_nx: This page replaces a NX large page and should be marked for
  *		eventual reclaim.
  */
-static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp,
-			      bool account_nx)
+static void tdp_mmu_link_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
+			    bool account_nx)
 {
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
 	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
@@ -270,7 +270,7 @@ static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 }
 
 /**
- * tdp_mmu_unlink_page - Remove page from the list of pages used by the TDP MMU
+ * tdp_mmu_unlink_sp() - Remove a shadow page from the list of used pages
  *
  * @kvm: kvm instance
  * @sp: the page to be removed
@@ -278,8 +278,8 @@ static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp,
  *	    the MMU lock and the operation must synchronize with other
  *	    threads that might be adding or removing pages.
  */
-static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp,
-				bool shared)
+static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
+			      bool shared)
 {
 	if (shared)
 		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
@@ -321,7 +321,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
 
 	trace_kvm_mmu_prepare_zap_page(sp);
 
-	tdp_mmu_unlink_page(kvm, sp, shared);
+	tdp_mmu_unlink_sp(kvm, sp, shared);
 
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
 		u64 *sptep = rcu_dereference(pt) + i;
@@ -1014,16 +1014,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			if (is_removed_spte(iter.old_spte))
 				break;
 
-			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level - 1);
+			sp = tdp_mmu_alloc_sp(vcpu, iter.gfn, iter.level - 1);
 			child_pt = sp->spt;
 
 			new_spte = make_nonleaf_spte(child_pt,
 						     !shadow_accessed_mask);
 
 			if (!tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, new_spte)) {
-				tdp_mmu_link_page(vcpu->kvm, sp,
-						  fault->huge_page_disallowed &&
-						  fault->req_level >= iter.level);
+				tdp_mmu_link_sp(vcpu->kvm, sp,
+						fault->huge_page_disallowed &&
+						fault->req_level >= iter.level);
 
 				trace_kvm_mmu_get_page(sp, true);
 			} else {
-- 
2.35.0.rc0.227.g00780c9af4-goog



* [PATCH v2 06/18] KVM: x86/mmu: Rename handle_removed_tdp_mmu_page() to handle_removed_pt()
  2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (4 preceding siblings ...)
  2022-01-19 23:07 ` [PATCH v2 05/18] KVM: x86/mmu: Rename TDP MMU functions that handle shadow pages David Matlack
@ 2022-01-19 23:07 ` David Matlack
  2022-01-19 23:07 ` [PATCH v2 07/18] KVM: x86/mmu: Consolidate logic to atomically install a new TDP MMU page table David Matlack
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-01-19 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

First remove tdp_mmu_ from the name since it is redundant given that it
is a static function in tdp_mmu.c. There is a pattern of using tdp_mmu_
as a prefix in the names of static TDP MMU functions, but all of the
other handle_*() variants do not include such a prefix. So drop it
entirely.

Then change "page" to "pt" to convey that this is operating on a page
table rather than an struct page. Purposely use "pt" instead of "sp"
since this function takes the raw RCU-protected page table pointer as an
argument rather than  a pointer to the struct kvm_mmu_page.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 15cce503ffde..902dd6e49e50 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -295,7 +295,7 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
 }
 
 /**
- * handle_removed_tdp_mmu_page - handle a pt removed from the TDP structure
+ * handle_removed_pt() - handle a page table removed from the TDP structure
  *
  * @kvm: kvm instance
  * @pt: the page removed from the paging structure
@@ -311,8 +311,7 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
  * this thread will be responsible for ensuring the page is freed. Hence the
  * early rcu_dereferences in the function.
  */
-static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
-					bool shared)
+static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 {
 	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(pt));
 	int level = sp->role.level;
@@ -472,8 +471,7 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	 * the paging structure.
 	 */
 	if (was_present && !was_leaf && (pfn_changed || !is_present))
-		handle_removed_tdp_mmu_page(kvm,
-				spte_to_child_pt(old_spte, level), shared);
+		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
 }
 
 static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-- 
2.35.0.rc0.227.g00780c9af4-goog



* [PATCH v2 07/18] KVM: x86/mmu: Consolidate logic to atomically install a new TDP MMU page table
  2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (5 preceding siblings ...)
  2022-01-19 23:07 ` [PATCH v2 06/18] KVM: x86/mmu: Rename handle_removed_tdp_mmu_page() to handle_removed_pt() David Matlack
@ 2022-01-19 23:07 ` David Matlack
  2022-01-19 23:07 ` [PATCH v2 08/18] KVM: x86/mmu: Remove unnecessary warnings from restore_acc_track_spte() David Matlack
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-01-19 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

Consolidate the logic to atomically replace an SPTE with an SPTE that
points to a new page table into a single helper function. This will be
used in a follow-up commit to split huge pages, which involves replacing
each huge page SPTE with an SPTE that points to a page table.

Opportunistically drop the call to trace_kvm_mmu_get_page() in
kvm_tdp_mmu_map() since it is redundant with the identical tracepoint in
tdp_mmu_alloc_sp().

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 68 ++++++++++++++++++++------------------
 1 file changed, 36 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 902dd6e49e50..f6144db48367 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -251,24 +251,6 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
 	}
 }
 
-/**
- * tdp_mmu_link_sp() - Add a new shadow page to the list of used pages
- *
- * @kvm: kvm instance
- * @sp: the new page
- * @account_nx: This page replaces a NX large page and should be marked for
- *		eventual reclaim.
- */
-static void tdp_mmu_link_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
-			    bool account_nx)
-{
-	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
-	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
-	if (account_nx)
-		account_huge_nx_page(kvm, sp);
-	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
-}
-
 /**
  * tdp_mmu_unlink_sp() - Remove a shadow page from the list of used pages
  *
@@ -959,6 +941,38 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 	return ret;
 }
 
+/*
+ * tdp_mmu_link_sp_atomic - Atomically replace the given spte with an spte
+ * pointing to the provided page table.
+ *
+ * @kvm: kvm instance
+ * @iter: a tdp_iter instance currently on the SPTE that should be set
+ * @sp: The new TDP page table to install.
+ * @account_nx: True if this page table is being installed to split a
+ *              non-executable huge page.
+ *
+ * Returns: 0 if the new page table was installed. Non-0 if the page table
+ *          could not be installed (e.g. the atomic compare-exchange failed).
+ */
+static int tdp_mmu_link_sp_atomic(struct kvm *kvm, struct tdp_iter *iter,
+				  struct kvm_mmu_page *sp, bool account_nx)
+{
+	u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
+	int ret;
+
+	ret = tdp_mmu_set_spte_atomic(kvm, iter, spte);
+	if (ret)
+		return ret;
+
+	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
+	if (account_nx)
+		account_huge_nx_page(kvm, sp);
+	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+
+	return 0;
+}
+
 /*
  * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
  * page tables and SPTEs to translate the faulting guest physical address.
@@ -968,8 +982,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
-	u64 *child_pt;
-	u64 new_spte;
 	int ret;
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -1004,6 +1016,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		}
 
 		if (!is_shadow_present_pte(iter.old_spte)) {
+			bool account_nx = fault->huge_page_disallowed &&
+					  fault->req_level >= iter.level;
+
 			/*
 			 * If SPTE has been frozen by another thread, just
 			 * give up and retry, avoiding unnecessary page table
@@ -1013,18 +1028,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 				break;
 
 			sp = tdp_mmu_alloc_sp(vcpu, iter.gfn, iter.level - 1);
-			child_pt = sp->spt;
-
-			new_spte = make_nonleaf_spte(child_pt,
-						     !shadow_accessed_mask);
-
-			if (!tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, new_spte)) {
-				tdp_mmu_link_sp(vcpu->kvm, sp,
-						fault->huge_page_disallowed &&
-						fault->req_level >= iter.level);
-
-				trace_kvm_mmu_get_page(sp, true);
-			} else {
+			if (tdp_mmu_link_sp_atomic(vcpu->kvm, &iter, sp, account_nx)) {
 				tdp_mmu_free_sp(sp);
 				break;
 			}
-- 
2.35.0.rc0.227.g00780c9af4-goog



* [PATCH v2 08/18] KVM: x86/mmu: Remove unnecessary warnings from restore_acc_track_spte()
  2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (6 preceding siblings ...)
  2022-01-19 23:07 ` [PATCH v2 07/18] KVM: x86/mmu: Consolidate logic to atomically install a new TDP MMU page table David Matlack
@ 2022-01-19 23:07 ` David Matlack
  2022-01-19 23:07 ` [PATCH v2 09/18] KVM: x86/mmu: Drop new_spte local variable " David Matlack
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-01-19 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

The warnings in restore_acc_track_spte() can be removed because the only
caller checks is_access_track_spte(), and is_access_track_spte() checks
!spte_ad_enabled(). In other words, the warnings can never be triggered.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fb6718714caa..4485554336d7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -653,9 +653,6 @@ static u64 restore_acc_track_spte(u64 spte)
 	u64 saved_bits = (spte >> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT)
 			 & SHADOW_ACC_TRACK_SAVED_BITS_MASK;
 
-	WARN_ON_ONCE(spte_ad_enabled(spte));
-	WARN_ON_ONCE(!is_access_track_spte(spte));
-
 	new_spte &= ~shadow_acc_track_mask;
 	new_spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
 		      SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
-- 
2.35.0.rc0.227.g00780c9af4-goog



* [PATCH v2 09/18] KVM: x86/mmu: Drop new_spte local variable from restore_acc_track_spte()
  2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (7 preceding siblings ...)
  2022-01-19 23:07 ` [PATCH v2 08/18] KVM: x86/mmu: Remove unnecessary warnings from restore_acc_track_spte() David Matlack
@ 2022-01-19 23:07 ` David Matlack
  2022-01-19 23:07 ` [PATCH v2 10/18] KVM: x86/mmu: Move restore_acc_track_spte() to spte.h David Matlack
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-01-19 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

The new_spte local variable is unnecessary. Deleting it can save a line
of code and simplify the remaining lines a bit.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4485554336d7..7a70c238cd26 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -649,16 +649,15 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
 /* Restore an acc-track PTE back to a regular PTE */
 static u64 restore_acc_track_spte(u64 spte)
 {
-	u64 new_spte = spte;
 	u64 saved_bits = (spte >> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT)
 			 & SHADOW_ACC_TRACK_SAVED_BITS_MASK;
 
-	new_spte &= ~shadow_acc_track_mask;
-	new_spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
-		      SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
-	new_spte |= saved_bits;
+	spte &= ~shadow_acc_track_mask;
+	spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
+		  SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
+	spte |= saved_bits;
 
-	return new_spte;
+	return spte;
 }
 
 /* Returns the Accessed status of the PTE and resets it at the same time. */
-- 
2.35.0.rc0.227.g00780c9af4-goog



* [PATCH v2 10/18] KVM: x86/mmu: Move restore_acc_track_spte() to spte.h
  2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (8 preceding siblings ...)
  2022-01-19 23:07 ` [PATCH v2 09/18] KVM: x86/mmu: Drop new_spte local variable " David Matlack
@ 2022-01-19 23:07 ` David Matlack
  2022-01-19 23:07 ` [PATCH v2 11/18] KVM: x86/mmu: Refactor TDP MMU iterators to take kvm_mmu_page root David Matlack
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-01-19 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

restore_acc_track_spte() is pure SPTE bit manipulation, making it a good
fit for spte.h. And now that the WARN_ON_ONCE() calls have been removed,
there isn't any good reason to not inline it.

This move also prepares for a follow-up commit that will need to call
restore_acc_track_spte() from spte.c.

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c  | 14 --------------
 arch/x86/kvm/mmu/spte.h | 15 +++++++++++++++
 2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7a70c238cd26..51aa38bdb858 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -646,20 +646,6 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
 	return __get_spte_lockless(sptep);
 }
 
-/* Restore an acc-track PTE back to a regular PTE */
-static u64 restore_acc_track_spte(u64 spte)
-{
-	u64 saved_bits = (spte >> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT)
-			 & SHADOW_ACC_TRACK_SAVED_BITS_MASK;
-
-	spte &= ~shadow_acc_track_mask;
-	spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
-		  SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
-	spte |= saved_bits;
-
-	return spte;
-}
-
 /* Returns the Accessed status of the PTE and resets it at the same time. */
 static bool mmu_spte_age(u64 *sptep)
 {
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index be6a007a4af3..e8ac1fab3185 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -367,6 +367,21 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
 u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
 u64 mark_spte_for_access_track(u64 spte);
+
+/* Restore an acc-track PTE back to a regular PTE */
+static inline u64 restore_acc_track_spte(u64 spte)
+{
+	u64 saved_bits = (spte >> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT)
+			 & SHADOW_ACC_TRACK_SAVED_BITS_MASK;
+
+	spte &= ~shadow_acc_track_mask;
+	spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
+		  SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
+	spte |= saved_bits;
+
+	return spte;
+}
+
 u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn);
 
 void kvm_mmu_reset_all_pte_masks(void);
-- 
2.35.0.rc0.227.g00780c9af4-goog



* [PATCH v2 11/18] KVM: x86/mmu: Refactor TDP MMU iterators to take kvm_mmu_page root
  2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (9 preceding siblings ...)
  2022-01-19 23:07 ` [PATCH v2 10/18] KVM: x86/mmu: Move restore_acc_track_spte() to spte.h David Matlack
@ 2022-01-19 23:07 ` David Matlack
  2022-01-19 23:07 ` [PATCH v2 12/18] KVM: x86/mmu: Remove redundant role overrides for TDP MMU shadow pages David Matlack
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-01-19 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

Instead of passing a pointer to the root page table and the root level
separately, pass in a pointer to the root kvm_mmu_page struct.  This
reduces the number of arguments by 1, cutting down on line lengths.

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_iter.c |  8 +++++---
 arch/x86/kvm/mmu/tdp_iter.h | 10 +++++-----
 arch/x86/kvm/mmu/tdp_mmu.c  | 14 +++++---------
 3 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index caa96c270b95..be3f096db2eb 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -40,17 +40,19 @@ void tdp_iter_restart(struct tdp_iter *iter)
  * Sets a TDP iterator to walk a pre-order traversal of the paging structure
  * rooted at root_pt, starting with the walk to translate next_last_level_gfn.
  */
-void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
+void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
 		    int min_level, gfn_t next_last_level_gfn)
 {
+	int root_level = root->role.level;
+
 	WARN_ON(root_level < 1);
 	WARN_ON(root_level > PT64_ROOT_MAX_LEVEL);
 
 	iter->next_last_level_gfn = next_last_level_gfn;
 	iter->root_level = root_level;
 	iter->min_level = min_level;
-	iter->pt_path[iter->root_level - 1] = (tdp_ptep_t)root_pt;
-	iter->as_id = kvm_mmu_page_as_id(sptep_to_sp(root_pt));
+	iter->pt_path[iter->root_level - 1] = (tdp_ptep_t)root->spt;
+	iter->as_id = kvm_mmu_page_as_id(root);
 
 	tdp_iter_restart(iter);
 }
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index e19cabbcb65c..216ebbe76ddd 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -57,17 +57,17 @@ struct tdp_iter {
  * Iterates over every SPTE mapping the GFN range [start, end) in a
  * preorder traversal.
  */
-#define for_each_tdp_pte_min_level(iter, root, root_level, min_level, start, end) \
-	for (tdp_iter_start(&iter, root, root_level, min_level, start); \
+#define for_each_tdp_pte_min_level(iter, root, min_level, start, end) \
+	for (tdp_iter_start(&iter, root, min_level, start); \
 	     iter.valid && iter.gfn < end;		     \
 	     tdp_iter_next(&iter))
 
-#define for_each_tdp_pte(iter, root, root_level, start, end) \
-	for_each_tdp_pte_min_level(iter, root, root_level, PG_LEVEL_4K, start, end)
+#define for_each_tdp_pte(iter, root, start, end) \
+	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end)
 
 tdp_ptep_t spte_to_child_pt(u64 pte, int level);
 
-void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
+void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
 		    int min_level, gfn_t next_last_level_gfn);
 void tdp_iter_next(struct tdp_iter *iter);
 void tdp_iter_restart(struct tdp_iter *iter);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f6144db48367..38ec5a61dbff 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -624,7 +624,7 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 }
 
 #define tdp_root_for_each_pte(_iter, _root, _start, _end) \
-	for_each_tdp_pte(_iter, _root->spt, _root->role.level, _start, _end)
+	for_each_tdp_pte(_iter, _root, _start, _end)
 
 #define tdp_root_for_each_leaf_pte(_iter, _root, _start, _end)	\
 	tdp_root_for_each_pte(_iter, _root, _start, _end)		\
@@ -634,8 +634,7 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 		else
 
 #define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end)		\
-	for_each_tdp_pte(_iter, __va(_mmu->root_hpa),		\
-			 _mmu->shadow_root_level, _start, _end)
+	for_each_tdp_pte(_iter, to_shadow_page(_mmu->root_hpa), _start, _end)
 
 /*
  * Yield if the MMU lock is contended or this thread needs to return control
@@ -724,8 +723,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	rcu_read_lock();
 
-	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
-				   min_level, start, end) {
+	for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
 retry:
 		if (can_yield &&
 		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
@@ -1197,8 +1195,7 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
 
-	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
-				   min_level, start, end) {
+	for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
 retry:
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
 			continue;
@@ -1437,8 +1434,7 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	rcu_read_lock();
 
-	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
-				   min_level, gfn, gfn + 1) {
+	for_each_tdp_pte_min_level(iter, root, min_level, gfn, gfn + 1) {
 		if (!is_shadow_present_pte(iter.old_spte) ||
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
-- 
2.35.0.rc0.227.g00780c9af4-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 12/18] KVM: x86/mmu: Remove redundant role overrides for TDP MMU shadow pages
  2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (10 preceding siblings ...)
  2022-01-19 23:07 ` [PATCH v2 11/18] KVM: x86/mmu: Refactor TDP MMU iterators to take kvm_mmu_page root David Matlack
@ 2022-01-19 23:07 ` David Matlack
  2022-01-19 23:07 ` [PATCH v2 13/18] KVM: x86/mmu: Derive page role for TDP MMU shadow pages from parent David Matlack
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-01-19 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

The vCPU's mmu_role already has the correct values for direct,
has_4_byte_gpte, access, and ad_disabled. Remove the code that was
redundantly overwriting these fields with the same values.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 38ec5a61dbff..90b6fbec83cc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -164,10 +164,6 @@ static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
 
 	role = vcpu->arch.mmu->mmu_role.base;
 	role.level = level;
-	role.direct = true;
-	role.has_4_byte_gpte = false;
-	role.access = ACC_ALL;
-	role.ad_disabled = !shadow_accessed_mask;
 
 	return role;
 }
-- 
2.35.0.rc0.227.g00780c9af4-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 13/18] KVM: x86/mmu: Derive page role for TDP MMU shadow pages from parent
  2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (11 preceding siblings ...)
  2022-01-19 23:07 ` [PATCH v2 12/18] KVM: x86/mmu: Remove redundant role overrides for TDP MMU shadow pages David Matlack
@ 2022-01-19 23:07 ` David Matlack
  2022-01-19 23:07 ` [PATCH v2 14/18] KVM: x86/mmu: Separate TDP MMU shadow page allocation and initialization David Matlack
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-01-19 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

Derive the page role from the parent shadow page, since the only thing
that changes is the level. This is in preparation for splitting huge
pages during VM ioctls, which do not have access to the vCPU MMU context.

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++++++++++++++++------------------
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 90b6fbec83cc..5c1f1777e3d3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -157,19 +157,8 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 		if (kvm_mmu_page_as_id(_root) != _as_id) {		\
 		} else
 
-static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
-						   int level)
-{
-	union kvm_mmu_page_role role;
-
-	role = vcpu->arch.mmu->mmu_role.base;
-	role.level = level;
-
-	return role;
-}
-
 static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, gfn_t gfn,
-					     int level)
+					     union kvm_mmu_page_role role)
 {
 	struct kvm_mmu_page *sp;
 
@@ -177,7 +166,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, gfn_t gfn,
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
-	sp->role.word = page_role_for_level(vcpu, level).word;
+	sp->role = role;
 	sp->gfn = gfn;
 	sp->tdp_mmu_page = true;
 
@@ -186,16 +175,28 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return sp;
 }
 
-hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
+static struct kvm_mmu_page *tdp_mmu_alloc_child_sp(struct kvm_vcpu *vcpu,
+						   struct tdp_iter *iter)
 {
+	struct kvm_mmu_page *parent_sp;
 	union kvm_mmu_page_role role;
+
+	parent_sp = sptep_to_sp(rcu_dereference(iter->sptep));
+
+	role = parent_sp->role;
+	role.level--;
+
+	return tdp_mmu_alloc_sp(vcpu, iter->gfn, role);
+}
+
+hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
+{
+	union kvm_mmu_page_role role = vcpu->arch.mmu->mmu_role.base;
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_mmu_page *root;
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
-	role = page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level);
-
 	/* Check for an existing root before allocating a new one. */
 	for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
 		if (root->role.word == role.word &&
@@ -203,7 +204,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 			goto out;
 	}
 
-	root = tdp_mmu_alloc_sp(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
+	root = tdp_mmu_alloc_sp(vcpu, 0, role);
 	refcount_set(&root->tdp_mmu_root_count, 1);
 
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
@@ -1021,7 +1022,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			if (is_removed_spte(iter.old_spte))
 				break;
 
-			sp = tdp_mmu_alloc_sp(vcpu, iter.gfn, iter.level - 1);
+			sp = tdp_mmu_alloc_child_sp(vcpu, &iter);
 			if (tdp_mmu_link_sp_atomic(vcpu->kvm, &iter, sp, account_nx)) {
 				tdp_mmu_free_sp(sp);
 				break;
-- 
2.35.0.rc0.227.g00780c9af4-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 14/18] KVM: x86/mmu: Separate TDP MMU shadow page allocation and initialization
  2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (12 preceding siblings ...)
  2022-01-19 23:07 ` [PATCH v2 13/18] KVM: x86/mmu: Derive page role for TDP MMU shadow pages from parent David Matlack
@ 2022-01-19 23:07 ` David Matlack
  2022-01-19 23:07 ` [PATCH v2 15/18] KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled David Matlack
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-01-19 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

Separate the allocation of shadow pages from their initialization.  This
is in preparation for splitting huge pages outside of the vCPU fault
context, which requires a different allocation mechanism.

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5c1f1777e3d3..b526a1873f30 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -157,13 +157,19 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 		if (kvm_mmu_page_as_id(_root) != _as_id) {		\
 		} else
 
-static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, gfn_t gfn,
-					     union kvm_mmu_page_role role)
+static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu_page *sp;
 
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+
+	return sp;
+}
+
+static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, gfn_t gfn,
+			      union kvm_mmu_page_role role)
+{
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
 	sp->role = role;
@@ -171,12 +177,10 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, gfn_t gfn,
 	sp->tdp_mmu_page = true;
 
 	trace_kvm_mmu_get_page(sp, true);
-
-	return sp;
 }
 
-static struct kvm_mmu_page *tdp_mmu_alloc_child_sp(struct kvm_vcpu *vcpu,
-						   struct tdp_iter *iter)
+static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
+				  struct tdp_iter *iter)
 {
 	struct kvm_mmu_page *parent_sp;
 	union kvm_mmu_page_role role;
@@ -186,7 +190,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_child_sp(struct kvm_vcpu *vcpu,
 	role = parent_sp->role;
 	role.level--;
 
-	return tdp_mmu_alloc_sp(vcpu, iter->gfn, role);
+	tdp_mmu_init_sp(child_sp, iter->gfn, role);
 }
 
 hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
@@ -204,7 +208,9 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 			goto out;
 	}
 
-	root = tdp_mmu_alloc_sp(vcpu, 0, role);
+	root = tdp_mmu_alloc_sp(vcpu);
+	tdp_mmu_init_sp(root, 0, role);
+
 	refcount_set(&root->tdp_mmu_root_count, 1);
 
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
@@ -1022,7 +1028,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			if (is_removed_spte(iter.old_spte))
 				break;
 
-			sp = tdp_mmu_alloc_child_sp(vcpu, &iter);
+			sp = tdp_mmu_alloc_sp(vcpu);
+			tdp_mmu_init_child_sp(sp, &iter);
+
 			if (tdp_mmu_link_sp_atomic(vcpu->kvm, &iter, sp, account_nx)) {
 				tdp_mmu_free_sp(sp);
 				break;
-- 
2.35.0.rc0.227.g00780c9af4-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 15/18] KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
  2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (13 preceding siblings ...)
  2022-01-19 23:07 ` [PATCH v2 14/18] KVM: x86/mmu: Separate TDP MMU shadow page allocation and initialization David Matlack
@ 2022-01-19 23:07 ` David Matlack
  2022-01-19 23:07 ` [PATCH v2 16/18] KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG David Matlack
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-01-19 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.

Eager page splitting is best-effort only. This commit only adds support
for the TDP MMU, and even there splitting may fail due to out-of-memory
conditions. Failing to split a huge page is fine from a correctness
standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.

Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:

 1. Splitting on the vCPU thread interrupts vCPU execution and is
    disruptive to customers, whereas splitting on VM ioctl threads can
    run in parallel with vCPU execution.

 2. Splitting all huge pages at once is more efficient because it does
    not require performing VM-exit handling or walking the page table for
    every 4KiB page in the memslot, and greatly reduces the amount of
    contention on the mmu_lock.

For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it took vCPUs to write to
all of their memory after dirty logging was enabled decreased by 95%,
from 2.94s to 0.14s.

Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.

Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has to split all huge pages up front. For example,
the time it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
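
As an editor's illustration (not part of this patch), the standalone
sketch below models the per-child address computation that
make_huge_page_split_spte() performs in the diff that follows. The names
child_addr(), pages_per_level[] and SKETCH_PAGE_SHIFT are local stand-ins
for the kernel's KVM_PAGES_PER_HPAGE() and PAGE_SHIFT.

#include <stdint.h>
#include <stdio.h>

#define SKETCH_PAGE_SHIFT 12

/* Number of 4KiB pages covered by one entry at the 4K, 2M and 1G levels. */
static const uint64_t pages_per_level[] = { 1, 512, 512 * 512 };

/* Address mapped by child 'index' when splitting a huge mapping one level. */
static uint64_t child_addr(uint64_t huge_addr, int child_level, int index)
{
	return huge_addr |
	       ((uint64_t)index * pages_per_level[child_level] << SKETCH_PAGE_SHIFT);
}

int main(void)
{
	/* Splitting a 1GiB mapping at 0x40000000: 2MiB child 1 covers 0x40200000. */
	printf("%#llx\n", (unsigned long long)child_addr(0x40000000, 1, 1));

	/* Splitting that 2MiB child: 4KiB child 3 covers 0x40203000. */
	printf("%#llx\n", (unsigned long long)child_addr(0x40200000, 0, 3));
	return 0;
}

Each new page table is filled with 512 such entries, all retaining the
write and dirty bits of the huge SPTE, which is why linking in the new
table does not require a TLB flush.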

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 .../admin-guide/kernel-parameters.txt         |  24 +++
 arch/x86/include/asm/kvm_host.h               |   3 +
 arch/x86/kvm/mmu/mmu.c                        |  24 +++
 arch/x86/kvm/mmu/spte.c                       |  59 ++++++
 arch/x86/kvm/mmu/spte.h                       |   1 +
 arch/x86/kvm/mmu/tdp_mmu.c                    | 173 ++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h                    |   5 +
 arch/x86/kvm/x86.c                            |   6 +
 8 files changed, 295 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 86a2456d94ba..f5e9c4a45aef 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2330,6 +2330,30 @@
 	kvm.ignore_msrs=[KVM] Ignore guest accesses to unhandled MSRs.
 			Default is 0 (don't ignore, but inject #GP)
 
+	kvm.eager_page_split=
+			[KVM,X86] Controls whether or not KVM will try to
+			proactively split all huge pages during dirty logging.
+			Eager page splitting reduces interruptions to vCPU
+			execution by eliminating the write-protection faults
+			and MMU lock contention that would otherwise be
+			required to split huge pages lazily.
+
+			VM workloads that rarely perform writes or that write
+			only to a small region of VM memory may benefit from
+			disabling eager page splitting to allow huge pages to
+			still be used for reads.
+
+			The behavior of eager page splitting depends on whether
+			KVM_DIRTY_LOG_INITIALLY_SET is enabled or disabled. If
+			disabled, all huge pages in a memslot will be eagerly
+			split when dirty logging is enabled on that memslot. If
+			enabled, huge pages will not be eagerly split.
+
+			Eager page splitting currently only supports splitting
+			huge pages mapped by the TDP MMU.
+
+			Default is Y (on).
+
 	kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
 				   Default is false (don't support).
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 682ad02a4e58..97560980456d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1579,6 +1579,9 @@ void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 				      const struct kvm_memory_slot *memslot,
 				      int start_level);
+void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
+				       const struct kvm_memory_slot *memslot,
+				       int target_level);
 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 51aa38bdb858..a273536e8b25 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5834,6 +5834,30 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
+void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
+				       const struct kvm_memory_slot *memslot,
+				       int target_level)
+{
+	u64 start = memslot->base_gfn;
+	u64 end = start + memslot->npages;
+
+	if (is_tdp_mmu_enabled(kvm)) {
+		read_lock(&kvm->mmu_lock);
+		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
+		read_unlock(&kvm->mmu_lock);
+	}
+
+	/*
+	 * No TLB flush is necessary here. KVM will flush TLBs after
+	 * write-protecting and/or clearing dirty on the newly split SPTEs to
+	 * ensure that guest writes are reflected in the dirty log before the
+	 * ioctl to enable dirty logging on this memslot completes. Since the
+	 * split SPTEs retain the write and dirty bits of the huge SPTE, it is
+	 * safe for KVM to decide if a TLB flush is necessary based on the split
+	 * SPTEs.
+	 */
+}
+
 static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 					 struct kvm_rmap_head *rmap_head,
 					 const struct kvm_memory_slot *slot)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index f8677404c93c..17f226bb8ec2 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -191,6 +191,65 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	return wrprot;
 }
 
+static u64 make_spte_executable(u64 spte)
+{
+	bool is_access_track = is_access_track_spte(spte);
+
+	if (is_access_track)
+		spte = restore_acc_track_spte(spte);
+
+	spte &= ~shadow_nx_mask;
+	spte |= shadow_x_mask;
+
+	if (is_access_track)
+		spte = mark_spte_for_access_track(spte);
+
+	return spte;
+}
+
+/*
+ * Construct an SPTE that maps a sub-page of the given huge page SPTE where
+ * `index` identifies which sub-page.
+ *
+ * This is used during huge page splitting to build the SPTEs that make up the
+ * new page table.
+ */
+u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
+{
+	u64 child_spte;
+	int child_level;
+
+	if (WARN_ON_ONCE(!is_shadow_present_pte(huge_spte)))
+		return 0;
+
+	if (WARN_ON_ONCE(!is_large_pte(huge_spte)))
+		return 0;
+
+	child_spte = huge_spte;
+	child_level = huge_level - 1;
+
+	/*
+	 * The child_spte already has the base address of the huge page being
+	 * split. So we just have to OR in the offset to the page at the next
+	 * lower level for the given index.
+	 */
+	child_spte |= (index * KVM_PAGES_PER_HPAGE(child_level)) << PAGE_SHIFT;
+
+	if (child_level == PG_LEVEL_4K) {
+		child_spte &= ~PT_PAGE_SIZE_MASK;
+
+		/*
+		 * When splitting to a 4K page, mark the page executable as the
+		 * NX hugepage mitigation no longer applies.
+		 */
+		if (is_nx_huge_page_enabled())
+			child_spte = make_spte_executable(child_spte);
+	}
+
+	return child_spte;
+}
+
+
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
 {
 	u64 spte = SPTE_MMU_PRESENT_MASK;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index e8ac1fab3185..9d6bb74bbb54 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -364,6 +364,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
 	       bool host_writable, u64 *new_spte);
+u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
 u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
 u64 mark_spte_for_access_track(u64 spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b526a1873f30..88f723fc0d1f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1242,6 +1242,179 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
 	return spte_set;
 }
 
+static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
+{
+	struct kvm_mmu_page *sp;
+
+	gfp |= __GFP_ZERO;
+
+	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
+	if (!sp)
+		return NULL;
+
+	sp->spt = (void *)__get_free_page(gfp);
+	if (!sp->spt) {
+		kmem_cache_free(mmu_page_header_cache, sp);
+		return NULL;
+	}
+
+	return sp;
+}
+
+static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
+						       struct tdp_iter *iter)
+{
+	struct kvm_mmu_page *sp;
+
+	lockdep_assert_held_read(&kvm->mmu_lock);
+
+	/*
+	 * Since we are allocating while under the MMU lock we have to be
+	 * careful about GFP flags. Use GFP_NOWAIT to avoid blocking on direct
+	 * reclaim and to avoid making any filesystem callbacks (which can end
+	 * up invoking KVM MMU notifiers, resulting in a deadlock).
+	 *
+	 * If this allocation fails we drop the lock and retry with reclaim
+	 * allowed.
+	 */
+	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
+	if (sp)
+		return sp;
+
+	rcu_read_unlock();
+	read_unlock(&kvm->mmu_lock);
+
+	iter->yielded = true;
+	sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
+
+	read_lock(&kvm->mmu_lock);
+	rcu_read_lock();
+
+	return sp;
+}
+
+static int tdp_mmu_split_huge_page_atomic(struct kvm *kvm,
+					  struct tdp_iter *iter,
+					  struct kvm_mmu_page *sp)
+{
+	const u64 huge_spte = iter->old_spte;
+	const int level = iter->level;
+	int ret, i;
+
+	tdp_mmu_init_child_sp(sp, iter);
+
+	/*
+	 * No need for atomics when writing to sp->spt since the page table has
+	 * not been linked in yet and thus is not reachable from any other CPU.
+	 */
+	for (i = 0; i < PT64_ENT_PER_PAGE; i++)
+		sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i);
+
+	/*
+	 * Replace the huge spte with a pointer to the populated lower level
+	 * page table. Since we are making this change without a TLB flush vCPUs
+	 * will see a mix of the split mappings and the original huge mapping,
+	 * depending on what's currently in their TLB. This is fine from a
+	 * correctness standpoint since the translation will be the same either
+	 * way.
+	 */
+	ret = tdp_mmu_link_sp_atomic(kvm, iter, sp, false);
+	if (ret)
+		return ret;
+
+	/*
+	 * tdp_mmu_link_sp_atomic() will handle subtracting the huge page we
+	 * are overwriting from the page stats. But we have to manually update
+	 * the page stats with the new present child pages.
+	 */
+	kvm_update_page_stats(kvm, level - 1, PT64_ENT_PER_PAGE);
+
+	return 0;
+}
+
+static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
+					 struct kvm_mmu_page *root,
+					 gfn_t start, gfn_t end,
+					 int target_level)
+{
+	struct kvm_mmu_page *sp = NULL;
+	struct tdp_iter iter;
+	int ret = 0;
+
+	rcu_read_lock();
+
+	/*
+	 * Traverse the page table splitting all huge pages above the target
+	 * level into one lower level. For example, if we encounter a 1GB page
+	 * we split it into 512 2MB pages.
+	 *
+	 * Since the TDP iterator uses a pre-order traversal, we are guaranteed
+	 * to visit an SPTE before ever visiting its children, which means we
+	 * will correctly recursively split huge pages that are more than one
+	 * level above the target level (e.g. splitting a 1GB to 512 2MB pages,
+	 * and then splitting each of those to 512 4KB pages).
+	 */
+	for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
+retry:
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
+			continue;
+
+		if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
+			continue;
+
+		if (!sp) {
+			sp = tdp_mmu_alloc_sp_for_split(kvm, &iter);
+			if (!sp) {
+				ret = -ENOMEM;
+				break;
+			}
+
+			if (iter.yielded)
+				continue;
+		}
+
+		if (tdp_mmu_split_huge_page_atomic(kvm, &iter, sp))
+			goto retry;
+
+		sp = NULL;
+	}
+
+	rcu_read_unlock();
+
+	/*
+	 * It's possible to exit the loop having never used the last sp if, for
+	 * example, a vCPU doing HugePage NX splitting wins the race and
+	 * installs its own sp in place of the last sp we tried to split.
+	 */
+	if (sp)
+		tdp_mmu_free_sp(sp);
+
+
+	return ret;
+}
+
+/*
+ * Try to split all huge pages mapped by the TDP MMU down to the target level.
+ */
+void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
+				      const struct kvm_memory_slot *slot,
+				      gfn_t start, gfn_t end,
+				      int target_level)
+{
+	struct kvm_mmu_page *root;
+	int r = 0;
+
+	lockdep_assert_held_read(&kvm->mmu_lock);
+
+	for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true) {
+		r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level);
+		if (r) {
+			kvm_tdp_mmu_put_root(kvm, root, true);
+			break;
+		}
+	}
+}
+
 /*
  * Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
  * AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 3899004a5d91..4a8756507829 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -71,6 +71,11 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn,
 				   int min_level);
 
+void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
+				      const struct kvm_memory_slot *slot,
+				      gfn_t start, gfn_t end,
+				      int target_level);
+
 static inline void kvm_tdp_mmu_walk_lockless_begin(void)
 {
 	rcu_read_lock();
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 55518b7d3b96..f5aad3e8e0a0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -192,6 +192,9 @@ bool __read_mostly enable_pmu = true;
 EXPORT_SYMBOL_GPL(enable_pmu);
 module_param(enable_pmu, bool, 0444);
 
+static bool __read_mostly eager_page_split = true;
+module_param(eager_page_split, bool, 0644);
+
 /*
  * Restoring the host value for MSRs that are only consumed when running in
  * usermode, e.g. SYSCALL MSRs and TSC_AUX, can be deferred until the CPU
@@ -11948,6 +11951,9 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 		if (kvm_dirty_log_manual_protect_and_init_set(kvm))
 			return;
 
+		if (READ_ONCE(eager_page_split))
+			kvm_mmu_slot_try_split_huge_pages(kvm, new, PG_LEVEL_4K);
+
 		if (kvm_x86_ops.cpu_dirty_log_size) {
 			kvm_mmu_slot_leaf_clear_dirty(kvm, new);
 			kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_2M);
-- 
2.35.0.rc0.227.g00780c9af4-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 16/18] KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG
  2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (14 preceding siblings ...)
  2022-01-19 23:07 ` [PATCH v2 15/18] KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled David Matlack
@ 2022-01-19 23:07 ` David Matlack
  2022-01-19 23:07 ` [PATCH v2 17/18] KVM: x86/mmu: Add tracepoint for splitting huge pages David Matlack
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-01-19 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not
write-protected when dirty logging is enabled on the memslot. Instead,
they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for
the first time, and only for the specific sub-region being cleared.

Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to
write-protecting them, to avoid causing write-protection faults on vCPU
threads. This also allows userspace to smear the cost of huge page
splitting across multiple ioctls, rather than splitting the entire
memslot at once as happens when initially-all-set is not used.
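
As an editor's sketch (not part of this patch) of the "smearing" described
above, userspace can issue KVM_CLEAR_DIRTY_LOG over sub-ranges of a slot so
that each ioctl pays the splitting and write-protection cost for only that
sub-range. The vm_fd, slot, slot_npages and bitmap arguments are assumed to
be set up by the caller, slot_npages is assumed to be a multiple of
chunk_pages, and error handling is omitted.

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* first_page and chunk_pages must be multiples of 64 per the KVM ABI. */
static void clear_dirty_in_chunks(int vm_fd, __u32 slot, __u64 slot_npages,
				  void *bitmap, __u32 chunk_pages)
{
	struct kvm_clear_dirty_log clear = { .slot = slot };
	__u64 first;

	for (first = 0; first < slot_npages; first += chunk_pages) {
		clear.first_page = first;
		clear.num_pages = chunk_pages;

		/* Bit i of the bitmap corresponds to page first_page + i. */
		memset(bitmap, 0xff, chunk_pages / 8);
		clear.dirty_bitmap = bitmap;

		/*
		 * Huge pages in this sub-range are split before being
		 * write-protected, so vCPU threads do not take faults
		 * just to split them.
		 */
		ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
	}
}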

Signed-off-by: David Matlack <dmatlack@google.com>
---
 .../admin-guide/kernel-parameters.txt         |  4 +-
 arch/x86/include/asm/kvm_host.h               |  4 ++
 arch/x86/kvm/mmu/mmu.c                        | 25 ++++++-
 arch/x86/kvm/mmu/tdp_mmu.c                    | 67 +++++++++++--------
 arch/x86/kvm/mmu/tdp_mmu.h                    |  2 +-
 arch/x86/kvm/x86.c                            |  2 +-
 arch/x86/kvm/x86.h                            |  2 +
 7 files changed, 73 insertions(+), 33 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f5e9c4a45aef..1b54e410e206 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2347,7 +2347,9 @@
 			KVM_DIRTY_LOG_INITIALLY_SET is enabled or disabled. If
 			disabled, all huge pages in a memslot will be eagerly
 			split when dirty logging is enabled on that memslot. If
-			enabled, huge pages will not be eagerly split.
+			enabled, eager page splitting will be performed during
+			the KVM_CLEAR_DIRTY_LOG ioctl, and only for the pages being
+			cleared.
 
 			Eager page splitting currently only supports splitting
 			huge pages mapped by the TDP MMU.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 97560980456d..e089f34a66eb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1582,6 +1582,10 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
 				       const struct kvm_memory_slot *memslot,
 				       int target_level);
+void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
+				  const struct kvm_memory_slot *memslot,
+				  u64 start, u64 end,
+				  int target_level);
 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a273536e8b25..62caf5b6d82e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1360,6 +1360,9 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 		gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
 		gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
 
+		if (READ_ONCE(eager_page_split))
+			kvm_mmu_try_split_huge_pages(kvm, slot, start, end, PG_LEVEL_4K);
+
 		kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
 
 		/* Cross two large pages? */
@@ -5834,16 +5837,32 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
+/* Must be called with the mmu_lock held in write-mode. */
+void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
+				   const struct kvm_memory_slot *memslot,
+				   u64 start, u64 end,
+				   int target_level)
+{
+	if (is_tdp_mmu_enabled(kvm))
+		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end,
+						 target_level, false);
+
+	/*
+	 * A TLB flush is unnecessary at this point for the same reasons as in
+	 * kvm_mmu_slot_try_split_huge_pages().
+	 */
+}
+
 void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
-				       const struct kvm_memory_slot *memslot,
-				       int target_level)
+					const struct kvm_memory_slot *memslot,
+					int target_level)
 {
 	u64 start = memslot->base_gfn;
 	u64 end = start + memslot->npages;
 
 	if (is_tdp_mmu_enabled(kvm)) {
 		read_lock(&kvm->mmu_lock);
-		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
+		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
 		read_unlock(&kvm->mmu_lock);
 	}
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 88f723fc0d1f..d5e713b849e9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -943,27 +943,33 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 }
 
 /*
- * tdp_mmu_link_sp_atomic - Atomically replace the given spte with an spte
- * pointing to the provided page table.
+ * tdp_mmu_link_sp - Replace the given spte with an spte pointing to the
+ * provided page table.
  *
  * @kvm: kvm instance
  * @iter: a tdp_iter instance currently on the SPTE that should be set
  * @sp: The new TDP page table to install.
  * @account_nx: True if this page table is being installed to split a
  *              non-executable huge page.
+ * @shared: This operation is running under the MMU lock in read mode.
  *
  * Returns: 0 if the new page table was installed. Non-0 if the page table
  *          could not be installed (e.g. the atomic compare-exchange failed).
  */
-static int tdp_mmu_link_sp_atomic(struct kvm *kvm, struct tdp_iter *iter,
-				  struct kvm_mmu_page *sp, bool account_nx)
+static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
+			   struct kvm_mmu_page *sp, bool account_nx,
+			   bool shared)
 {
 	u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
-	int ret;
+	int ret = 0;
 
-	ret = tdp_mmu_set_spte_atomic(kvm, iter, spte);
-	if (ret)
-		return ret;
+	if (shared) {
+		ret = tdp_mmu_set_spte_atomic(kvm, iter, spte);
+		if (ret)
+			return ret;
+	} else {
+		tdp_mmu_set_spte(kvm, iter, spte);
+	}
 
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
 	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
@@ -1031,7 +1037,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			sp = tdp_mmu_alloc_sp(vcpu);
 			tdp_mmu_init_child_sp(sp, &iter);
 
-			if (tdp_mmu_link_sp_atomic(vcpu->kvm, &iter, sp, account_nx)) {
+			if (tdp_mmu_link_sp(vcpu->kvm, &iter, sp, account_nx, true)) {
 				tdp_mmu_free_sp(sp);
 				break;
 			}
@@ -1262,12 +1268,11 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
 }
 
 static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
-						       struct tdp_iter *iter)
+						       struct tdp_iter *iter,
+						       bool shared)
 {
 	struct kvm_mmu_page *sp;
 
-	lockdep_assert_held_read(&kvm->mmu_lock);
-
 	/*
 	 * Since we are allocating while under the MMU lock we have to be
 	 * careful about GFP flags. Use GFP_NOWAIT to avoid blocking on direct
@@ -1282,20 +1287,27 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 		return sp;
 
 	rcu_read_unlock();
-	read_unlock(&kvm->mmu_lock);
+
+	if (shared)
+		read_unlock(&kvm->mmu_lock);
+	else
+		write_unlock(&kvm->mmu_lock);
 
 	iter->yielded = true;
 	sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
 
-	read_lock(&kvm->mmu_lock);
+	if (shared)
+		read_lock(&kvm->mmu_lock);
+	else
+		write_lock(&kvm->mmu_lock);
+
 	rcu_read_lock();
 
 	return sp;
 }
 
-static int tdp_mmu_split_huge_page_atomic(struct kvm *kvm,
-					  struct tdp_iter *iter,
-					  struct kvm_mmu_page *sp)
+static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
+				   struct kvm_mmu_page *sp, bool shared)
 {
 	const u64 huge_spte = iter->old_spte;
 	const int level = iter->level;
@@ -1318,7 +1330,7 @@ static int tdp_mmu_split_huge_page_atomic(struct kvm *kvm,
 	 * correctness standpoint since the translation will be the same either
 	 * way.
 	 */
-	ret = tdp_mmu_link_sp_atomic(kvm, iter, sp, false);
+	ret = tdp_mmu_link_sp(kvm, iter, sp, false, shared);
 	if (ret)
 		return ret;
 
@@ -1335,7 +1347,7 @@ static int tdp_mmu_split_huge_page_atomic(struct kvm *kvm,
 static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 					 struct kvm_mmu_page *root,
 					 gfn_t start, gfn_t end,
-					 int target_level)
+					 int target_level, bool shared)
 {
 	struct kvm_mmu_page *sp = NULL;
 	struct tdp_iter iter;
@@ -1356,14 +1368,14 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 	 */
 	for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
 retry:
-		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
 			continue;
 
 		if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
 			continue;
 
 		if (!sp) {
-			sp = tdp_mmu_alloc_sp_for_split(kvm, &iter);
+			sp = tdp_mmu_alloc_sp_for_split(kvm, &iter, shared);
 			if (!sp) {
 				ret = -ENOMEM;
 				break;
@@ -1373,7 +1385,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 				continue;
 		}
 
-		if (tdp_mmu_split_huge_page_atomic(kvm, &iter, sp))
+		if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
 			goto retry;
 
 		sp = NULL;
@@ -1393,23 +1405,24 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 	return ret;
 }
 
+
 /*
  * Try to split all huge pages mapped by the TDP MMU down to the target level.
  */
 void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
 				      const struct kvm_memory_slot *slot,
 				      gfn_t start, gfn_t end,
-				      int target_level)
+				      int target_level, bool shared)
 {
 	struct kvm_mmu_page *root;
 	int r = 0;
 
-	lockdep_assert_held_read(&kvm->mmu_lock);
+	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true) {
-		r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level);
+	for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, shared) {
+		r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level, shared);
 		if (r) {
-			kvm_tdp_mmu_put_root(kvm, root, true);
+			kvm_tdp_mmu_put_root(kvm, root, shared);
 			break;
 		}
 	}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 4a8756507829..ed9f6fbf5f25 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -74,7 +74,7 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
 				      const struct kvm_memory_slot *slot,
 				      gfn_t start, gfn_t end,
-				      int target_level);
+				      int target_level, bool shared);
 
 static inline void kvm_tdp_mmu_walk_lockless_begin(void)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f5aad3e8e0a0..e2ee6fc92dbc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -192,7 +192,7 @@ bool __read_mostly enable_pmu = true;
 EXPORT_SYMBOL_GPL(enable_pmu);
 module_param(enable_pmu, bool, 0444);
 
-static bool __read_mostly eager_page_split = true;
+bool __read_mostly eager_page_split = true;
 module_param(eager_page_split, bool, 0644);
 
 /*
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 1ebd5a7594da..d1836f69f20c 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -352,6 +352,8 @@ extern int pi_inject_timer;
 
 extern bool report_ignored_msrs;
 
+extern bool eager_page_split;
+
 static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
 {
 	return pvclock_scale_delta(nsec, vcpu->arch.virtual_tsc_mult,
-- 
2.35.0.rc0.227.g00780c9af4-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 17/18] KVM: x86/mmu: Add tracepoint for splitting huge pages
  2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (15 preceding siblings ...)
  2022-01-19 23:07 ` [PATCH v2 16/18] KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG David Matlack
@ 2022-01-19 23:07 ` David Matlack
  2022-01-19 23:07 ` [PATCH v2 18/18] KVM: selftests: Add an option to disable MANUAL_PROTECT_ENABLE and INITIALLY_SET David Matlack
  2022-02-01 18:03 ` [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU Paolo Bonzini
  18 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-01-19 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

Add a tracepoint that records whenever KVM eagerly splits a huge page,
along with the error status of the split, which indicates whether the
split succeeded or failed and why.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmutrace.h | 23 +++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c  | 10 +++++++---
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index de5e8e4e1aa7..12247b96af01 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -416,6 +416,29 @@ TRACE_EVENT(
 	)
 );
 
+TRACE_EVENT(
+	kvm_mmu_split_huge_page,
+	TP_PROTO(u64 gfn, u64 spte, int level, int errno),
+	TP_ARGS(gfn, spte, level, errno),
+
+	TP_STRUCT__entry(
+		__field(u64, gfn)
+		__field(u64, spte)
+		__field(int, level)
+		__field(int, errno)
+	),
+
+	TP_fast_assign(
+		__entry->gfn = gfn;
+		__entry->spte = spte;
+		__entry->level = level;
+		__entry->errno = errno;
+	),
+
+	TP_printk("gfn %llx spte %llx level %d errno %d",
+		  __entry->gfn, __entry->spte, __entry->level, __entry->errno)
+);
+
 #endif /* _TRACE_KVMMMU_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d5e713b849e9..1ed8e20270f0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1332,7 +1332,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	 */
 	ret = tdp_mmu_link_sp(kvm, iter, sp, false, shared);
 	if (ret)
-		return ret;
+		goto out;
 
 	/*
 	 * tdp_mmu_link_sp_atomic() will handle subtracting the huge page we
@@ -1341,7 +1341,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	 */
 	kvm_update_page_stats(kvm, level - 1, PT64_ENT_PER_PAGE);
 
-	return 0;
+out:
+	trace_kvm_mmu_split_huge_page(iter->gfn, huge_spte, level, ret);
+	return ret;
 }
 
 static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
@@ -1378,6 +1380,9 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 			sp = tdp_mmu_alloc_sp_for_split(kvm, &iter, shared);
 			if (!sp) {
 				ret = -ENOMEM;
+				trace_kvm_mmu_split_huge_page(iter.gfn,
+							      iter.old_spte,
+							      iter.level, ret);
 				break;
 			}
 
@@ -1401,7 +1406,6 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 	if (sp)
 		tdp_mmu_free_sp(sp);
 
-
 	return ret;
 }
 
-- 
2.35.0.rc0.227.g00780c9af4-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 18/18] KVM: selftests: Add an option to disable MANUAL_PROTECT_ENABLE and INITIALLY_SET
  2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (16 preceding siblings ...)
  2022-01-19 23:07 ` [PATCH v2 17/18] KVM: x86/mmu: Add tracepoint for splitting huge pages David Matlack
@ 2022-01-19 23:07 ` David Matlack
  2022-02-01 18:03 ` [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU Paolo Bonzini
  18 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-01-19 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

Add an option to dirty_log_perf_test.c to disable
KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE and KVM_DIRTY_LOG_INITIALLY_SET so
the legacy dirty logging code path can be tested.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 tools/testing/selftests/kvm/dirty_log_perf_test.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index 1954b964d1cf..101759ac93a4 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -298,12 +298,18 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 static void help(char *name)
 {
 	puts("");
-	printf("usage: %s [-h] [-i iterations] [-p offset] "
+	printf("usage: %s [-h] [-i iterations] [-p offset] [-g] "
 	       "[-m mode] [-b vcpu bytes] [-v vcpus] [-o] [-s mem type]"
 	       "[-x memslots]\n", name);
 	puts("");
 	printf(" -i: specify iteration counts (default: %"PRIu64")\n",
 	       TEST_HOST_LOOP_N);
+	printf(" -g: Do not enable KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2. This\n"
+	       "     makes KVM_GET_DIRTY_LOG clear the dirty log (i.e.\n"
+	       "     KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE is not enabled)\n"
+	       "     and writes will be tracked as soon as dirty logging is\n"
+	       "     enabled on the memslot (i.e. KVM_DIRTY_LOG_INITIALLY_SET\n"
+	       "     is not enabled).\n");
 	printf(" -p: specify guest physical test memory offset\n"
 	       "     Warning: a low offset can conflict with the loaded test code.\n");
 	guest_modes_help();
@@ -343,8 +349,11 @@ int main(int argc, char *argv[])
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "hi:p:m:b:f:v:os:x:")) != -1) {
+	while ((opt = getopt(argc, argv, "ghi:p:m:b:f:v:os:x:")) != -1) {
 		switch (opt) {
+		case 'g':
+			dirty_log_manual_caps = 0;
+			break;
 		case 'i':
 			p.iterations = atoi(optarg);
 			break;
-- 
2.35.0.rc0.227.g00780c9af4-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU
  2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (17 preceding siblings ...)
  2022-01-19 23:07 ` [PATCH v2 18/18] KVM: selftests: Add an option to disable MANUAL_PROTECT_ENABLE and INITIALLY_SET David Matlack
@ 2022-02-01 18:03 ` Paolo Bonzini
  2022-02-01 18:24   ` David Matlack
  18 siblings, 1 reply; 21+ messages in thread
From: Paolo Bonzini @ 2022-02-01 18:03 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On 1/20/22 00:07, David Matlack wrote:
> This series implements Eager Page Splitting for the TDP MMU.
> 
> "Eager Page Splitting" is an optimization that has been in use in Google
> Cloud since 2016 to reduce the performance impact of live migration on
> customer workloads. It was originally designed and implemented by Peter
> Feiner <pfeiner@google.com>.
> 
> For background and performance motivation for this feature, please
> see "RFC: KVM: x86/mmu: Eager Page Splitting" [1].
> 
> Implementation
> ==============
> 
> This series implements support for splitting all huge pages mapped by
> the TDP MMU. Pages mapped by the shadow MMU are not split, although I
> plan to add the support in a future patchset.
> 
> Eager page splitting is triggered in two ways:
> 
> - KVM_SET_USER_MEMORY_REGION ioctl: If this ioctl is invoked to enable
>    dirty logging on a memslot and KVM_DIRTY_LOG_INITIALLY_SET is not
>    enabled, KVM will attempt to split all huge pages in the memslot down
>    to the 4K level.
> 
> - KVM_CLEAR_DIRTY_LOG ioctl: If this ioctl is invoked and
>    KVM_DIRTY_LOG_INITIALLY_SET is enabled, KVM will attempt to split all
>    huge pages cleared by the ioctl down to the 4K level before attempting
>    to write-protect them.
> 
> Eager page splitting is enabled by default in both paths but can be
> disabled via module param eager_page_split=N.
> 
> Splitting for pages mapped by the TDP MMU is done under the MMU lock in
> read mode. The lock is dropped and the thread rescheduled if contention
> or need_resched() is detected.
> 
> To allocate memory for the lower level page tables, we attempt to
> allocate without dropping the MMU lock using GFP_NOWAIT to avoid doing
> direct reclaim or invoking filesystem callbacks. If that fails we drop
> the lock and perform a normal GFP_KERNEL allocation.
> 
> Performance
> ===========
> 
> Eager page splitting moves the cost of splitting huge pages off of the
> vCPU thread and onto the thread invoking one of the aforementioned
> ioctls. This is useful because:
> 
>   - Splitting on the vCPU thread interrupts vCPUs execution and is
>     disruptive to customers whereas splitting on VM ioctl threads can
>     run in parallel with vCPU execution.
> 
>   - Splitting on the VM ioctl thread is more efficient because it does
>     no require performing VM-exit handling and page table walks for every
>     4K page.
> 
> The measure the performance impact of Eager Page Splitting I ran
> dirty_log_perf_test with 96 virtual CPUs, 1GiB per vCPU, and 1GiB
> HugeTLB memory.
> 
> When KVM_DIRTY_LOG_INITIALLY_SET is set, we can see that the first
> KVM_CLEAR_DIRTY_LOG iteration gets longer because KVM is splitting
> huge pages. But the time it takes for vCPUs to dirty their memory
> is significantly shorter since they do not have to take write-
> protection faults.
> 
>             | Iteration 1 clear dirty log time | Iteration 2 dirty memory time
> ---------- | -------------------------------- | -----------------------------
> Before     | 0.049572219s                     | 2.751442902s
> After      | 1.667811687s                     | 0.127016504s
> 
> Eager page splitting does make subsequent KVM_CLEAR_DIRTY_LOG ioctls
> about 4% slower since it always walks the page tables looking for pages
> to split.  This can be avoided but will require extra memory and/or code
> complexity to track when splitting can be skipped.
> 
>             | Iteration 3 clear dirty log time
> ---------- | --------------------------------
> Before     | 1.374501209s
> After      | 1.422478617s
> 
> When not using KVM_DIRTY_LOG_INITIALLY_SET, KVM performs splitting on
> the entire memslot during the KVM_SET_USER_MEMORY_REGION ioctl that
> enables dirty logging. We can see that as an increase in the time it
> takes to enable dirty logging. This allows vCPUs to avoid taking
> write-protection faults which we again see in the dirty memory time.
> 
>             | Enabling dirty logging time      | Iteration 1 dirty memory time
> ---------- | -------------------------------- | -----------------------------
> Before     | 0.001683739s                     | 2.943733325s
> After      | 1.546904175s                     | 0.145979748s
> 
> Testing
> =======
> 
> - Ran all kvm-unit-tests and KVM selftests on debug and non-debug kernels.
> 
> - Ran dirty_log_perf_test with different backing sources (anonymous,
>    anonymous_thp, anonymous_hugetlb_2mb, anonymous_hugetlb_1gb) with and
>    without Eager Page Splitting enabled.
> 
> - Added a tracepoint locally to time the GFP_NOWAIT allocations. Across
>    40 runs of dirty_log_perf_test using 1GiB HugeTLB with 96 vCPUs there
>    were only 4 allocations that took longer than 20 microseconds and the
>    longest was 60 microseconds. None of the GFP_NOWAIT allocations
>    failed.
> 
> - I have not been able to trigger a GFP_NOWAIT allocation failure (to
>    exercise the fallback path). However I did manually modify the code
>    to force every allocation to fallback by removing the GFP_NOWAIT
>    allocation altogether to make sure the logic works correctly.
> 
> - Live migrated a 32 vCPU 32 GiB Linux VM running a workload that
>    aggressively writes to guest memory with Eager Page Splitting enabled.
>    Observed pages being split via tracepoint and the pages_{4k,2m,1g}
>    stats.

Queued, thanks!

Paolo

> Version Log
> ===========
> 
> v2:
> 
> [Overall Changes]
>   - Additional testing by live migrating a Linux VM (see above).
>   - Add Sean's, Ben's, and Peter's Reviewed-by tags.
>   - Use () when referring to functions in commit message and comments [Sean]
>   - Add TDP MMU to shortlog where applicable [Sean]
>   - Fix gramatical errors in commit messages [Sean]
>   - Break 80+ char function declarations across multiple lines [Sean]
> 
> [PATCH v1 03/13] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
>   - Remove useless empty line [Peter]
>   - Tighten up the wording in comments [Sean]
>   - Move result of rcu_dereference() to a local variable to cut down line lengths [Sean]
> 
> [PATCH v1 04/13] KVM: x86/mmu: Factor out logic to atomically install a new page table
>   - Add prep patch to return 0/-EBUSY instead of bool [Sean]
>   - Add prep patch to rename {un,}link_page to {un,}link_sp [Sean]
>   - Fold tdp_mmu_link_page() into tdp_mmu_install_sp_atomic() [Sean]
> 
> [PATCH v1 05/13] KVM: x86/mmu: Move restore_acc_track_spte to spte.c
>   - Make inline [Sean]
>   - Eliminate WARN_ON_ONCE [Sean]
>   - Eliminate unnecessary local variable new_spte [Sean].
> 
> [PATCH v1 06/13] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root
>   - Eliminate unnecessary local variable root_pt [Sean]
> 
> [PATCH v1 07/13] KVM: x86/mmu: Derive page role from parent
>   - Eliminate redundant role overrides [Sean]
> 
> [PATCH v1 08/13] KVM: x86/mmu: Refactor TDP MMU child page initialization
>   - Rename alloc_tdp_mmu_page*() functions [Sean]
> 
> [PATCH v1 09/13] KVM: x86/mmu: Split huge pages when dirty logging is enabled
>   - Drop access from make_huge_page_split_spte() [Sean]
>   - Drop is_mmio_spte() check from make_huge_page_split_spte() [Sean]
>   - Change WARN_ON to WARN_ON_ONCE in make_huge_page_split_spte() [Sean]
>   - Improve comment for making 4K SPTEs executable [Sean]
>   - Rename mark_spte_executable() to mark_spte_executable() [Sean]
>   - Put return type on same line as tdp_mmu_split_huge_page_atomic() [Sean]
>   - Drop child_spte local variable in tdp_mmu_split_huge_page_atomic() [Sean]
>   - Make alloc_tdp_mmu_page_for_split() play nice with
>     commit 3a0f64de479c ("KVM: x86/mmu: Don't advance iterator after restart due to yielding") [Sean]
>   - Free unused sp after dropping RCU [Sean]
>   - Rename module param to something shorter [Sean]
>   - Document module param somewhere [Sean]
>   - Fix rcu_read_unlock in tdp_mmu_split_huge_pages_root() [me]
>   - Document TLB flush behavior [Peter]
> 
> [PATCH v1 10/13] KVM: Push MMU locking down into kvm_arch_mmu_enable_log_dirty_pt_masked
>   - Drop [Peter]
> 
> [PATCH v1 11/13] KVM: x86/mmu: Split huge pages during CLEAR_DIRTY_LOG
>   - Hold the lock in write-mode when splitting [Peter]
>   - Document TLB flush behavior [Peter]
> 
> [PATCH v1 12/13] KVM: x86/mmu: Add tracepoint for splitting huge pages
>   - Record if split succeeded or failed [Sean]
> 
> v1: https://lore.kernel.org/kvm/20211213225918.672507-1-dmatlack@google.com/
> 
> [Overall Changes]
>   - Use "huge page" instead of "large page" [Sean Christopherson]
> 
> [RFC PATCH 02/15] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect
>   - Add Ben's Reviewed-by.
>   - Add Peter's Reviewed-by.
> 
> [RFC PATCH 03/15] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
>   - Add comment when updating old_spte [Ben Gardon]
>   - Follow kernel style of else case in zap_gfn_range [Ben Gardon]
>   - Don't delete old_spte update after zapping in kvm_tdp_mmu_map [me]
> 
> [RFC PATCH 04/15] KVM: x86/mmu: Factor out logic to atomically install a new page table
>   - Add blurb to commit message describing intentional drop of tracepoint [Ben Gardon]
>   - Consolidate "u64 spte = make_nonleaf_spte(...);" onto one line [Sean Christopherson]
>   - Do not free the sp if set fails [Sean Christopherson]
> 
> [RFC PATCH 05/15] KVM: x86/mmu: Abstract mmu caches out to a separate struct
>   - Drop to adopt Sean's proposed allocation scheme.
> 
> [RFC PATCH 06/15] KVM: x86/mmu: Derive page role from parent
>   - No changes.
> 
> [RFC PATCH 07/15] KVM: x86/mmu: Pass in vcpu->arch.mmu_caches instead of vcpu
>   - Drop to adopt Sean's proposed allocation scheme.
> 
> [RFC PATCH 08/15] KVM: x86/mmu: Helper method to check for large and present sptes
>   - Drop this commit and the helper function [Sean Christopherson]
> 
> [RFC PATCH 09/15] KVM: x86/mmu: Move restore_acc_track_spte to spte.c
>   - Add Ben's Reviewed-by.
> 
> [RFC PATCH 10/15] KVM: x86/mmu: Abstract need_resched logic from tdp_mmu_iter_cond_resched
>   - Drop to adopt Sean's proposed allocation scheme.
> 
> [RFC PATCH 11/15] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root
>   - Add Ben's Reviewed-by.
> 
> [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
>   - Add a module parameter to control Eager Page Splitting [Peter Xu]
>   - Change level to large_spte_level [Ben Gardon]
>   - Get rid of BUG_ONs [Ben Gardon]
>   - Change += to |= and add a comment [Ben Gardon]
>   - Do not flush TLBs when dropping the MMU lock. [Sean Christopherson]
>   - Allocate memory directly from the kernel instead of using mmu_caches [Sean Christopherson]
> 
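The last point above refers to the allocation scheme the series adopted: try a non-sleeping allocation while still holding the MMU lock for read, and only fall back to a normal sleeping allocation after dropping the lock. A rough kernel-style sketch of that pattern, with simplified locking and a made-up function name (alloc_child_pt_for_split is not the actual KVM helper):

/*
 * Sketch only: allocate a page for a new lower-level page table, preferring
 * a non-sleeping allocation under the read lock and falling back to a
 * sleeping allocation after dropping the lock. Error handling and the RCU
 * bookkeeping the real code needs are omitted.
 */
static u64 *alloc_child_pt_for_split(struct kvm *kvm, bool *dropped_lock)
{
	u64 *child_pt;

	*dropped_lock = false;

	/* GFP_NOWAIT: no direct reclaim, no filesystem callbacks. */
	child_pt = (void *)get_zeroed_page(GFP_NOWAIT | __GFP_ACCOUNT);
	if (child_pt)
		return child_pt;

	/* Slow path: drop the lock so the allocation is allowed to sleep. */
	read_unlock(&kvm->mmu_lock);
	*dropped_lock = true;

	child_pt = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);

	read_lock(&kvm->mmu_lock);
	return child_pt;
}

A caller that sees *dropped_lock set has to restart its walk, since the mapping may have changed while the lock was dropped.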
> [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
>   - Fix deadlock by refactoring MMU locking and dropping write lock before splitting. [kernel test robot]
>   - Did not follow Sean's suggestion of skipping write-protection if splitting
>     succeeds as it would require extra complexity since we aren't splitting
>     pages in the shadow MMU yet.
> 
> [RFC PATCH 14/15] KVM: x86/mmu: Add tracepoint for splitting large pages
>   - No changes.
> 
> [RFC PATCH 15/15] KVM: x86/mmu: Update page stats when splitting large pages
>   - Squash into patch that first introduces page splitting.
> 
> Note: I opted not to change TDP MMU functions to return int instead of
> bool per Sean's suggestion. I agree this change should be done, but it can
> be left to a separate series.
> 
> RFC: https://lore.kernel.org/kvm/20211119235759.1304274-1-dmatlack@google.com/
> 
> [1] https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/#t
> 
> David Matlack (18):
>    KVM: x86/mmu: Rename rmap_write_protect() to
>      kvm_vcpu_write_protect_gfn()
>    KVM: x86/mmu: Rename __rmap_write_protect() to rmap_write_protect()
>    KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
>    KVM: x86/mmu: Change tdp_mmu_{set,zap}_spte_atomic() to return
>      0/-EBUSY
>    KVM: x86/mmu: Rename TDP MMU functions that handle shadow pages
>    KVM: x86/mmu: Rename handle_removed_tdp_mmu_page() to
>      handle_removed_pt()
>    KVM: x86/mmu: Consolidate logic to atomically install a new TDP MMU
>      page table
>    KVM: x86/mmu: Remove unnecessary warnings from
>      restore_acc_track_spte()
>    KVM: x86/mmu: Drop new_spte local variable from
>      restore_acc_track_spte()
>    KVM: x86/mmu: Move restore_acc_track_spte() to spte.h
>    KVM: x86/mmu: Refactor TDP MMU iterators to take kvm_mmu_page root
>    KVM: x86/mmu: Remove redundant role overrides for TDP MMU shadow pages
>    KVM: x86/mmu: Derive page role for TDP MMU shadow pages from parent
>    KVM: x86/mmu: Separate TDP MMU shadow page allocation and
>      initialization
>    KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty
>      logging is enabled
>    KVM: x86/mmu: Split huge pages mapped by the TDP MMU during
>      KVM_CLEAR_DIRTY_LOG
>    KVM: x86/mmu: Add tracepoint for splitting huge pages
>    KVM: selftests: Add an option to disable MANUAL_PROTECT_ENABLE and
>      INITIALLY_SET
> 
>   .../admin-guide/kernel-parameters.txt         |  26 ++
>   arch/x86/include/asm/kvm_host.h               |   7 +
>   arch/x86/kvm/mmu/mmu.c                        |  79 ++--
>   arch/x86/kvm/mmu/mmutrace.h                   |  23 +
>   arch/x86/kvm/mmu/spte.c                       |  59 +++
>   arch/x86/kvm/mmu/spte.h                       |  16 +
>   arch/x86/kvm/mmu/tdp_iter.c                   |   8 +-
>   arch/x86/kvm/mmu/tdp_iter.h                   |  10 +-
>   arch/x86/kvm/mmu/tdp_mmu.c                    | 419 +++++++++++++-----
>   arch/x86/kvm/mmu/tdp_mmu.h                    |   5 +
>   arch/x86/kvm/x86.c                            |   6 +
>   arch/x86/kvm/x86.h                            |   2 +
>   .../selftests/kvm/dirty_log_perf_test.c       |  13 +-
>   13 files changed, 520 insertions(+), 153 deletions(-)
> 
> 
> base-commit: edb9e50dbe18394d0fc9d0494f5b6046fc912d33


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU
  2022-02-01 18:03 ` [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU Paolo Bonzini
@ 2022-02-01 18:24   ` David Matlack
  0 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-02-01 18:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm list, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Tue, Feb 1, 2022 at 10:03 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 1/20/22 00:07, David Matlack wrote:
> > This series implements Eager Page Splitting for the TDP MMU.
> >
> > "Eager Page Splitting" is an optimization that has been in use in Google
> > Cloud since 2016 to reduce the performance impact of live migration on
> > customer workloads. It was originally designed and implemented by Peter
> > Feiner <pfeiner@google.com>.
> >
> > For background and performance motivation for this feature, please
> > see "RFC: KVM: x86/mmu: Eager Page Splitting" [1].
> >
> > Implementation
> > ==============
> >
> > This series implements support for splitting all huge pages mapped by
> > the TDP MMU. Pages mapped by the shadow MMU are not split, although I
> > plan to add the support in a future patchset.
> >
> > Eager page splitting is triggered in two ways:
> >
> > - KVM_SET_USER_MEMORY_REGION ioctl: If this ioctl is invoked to enable
> >    dirty logging on a memslot and KVM_DIRTY_LOG_INITIALLY_SET is not
> >    enabled, KVM will attempt to split all huge pages in the memslot down
> >    to the 4K level.
> >
> > - KVM_CLEAR_DIRTY_LOG ioctl: If this ioctl is invoked and
> >    KVM_DIRTY_LOG_INITIALLY_SET is enabled, KVM will attempt to split all
> >    huge pages cleared by the ioctl down to the 4K level before attempting
> >    to write-protect them.
> >
> > Eager page splitting is enabled by default in both paths but can be
> > disabled via module param eager_page_split=N.
> >
> > Splitting for pages mapped by the TDP MMU is done under the MMU lock in
> > read mode. The lock is dropped and the thread rescheduled if contention
> > or need_resched() is detected.
> >
> > To allocate memory for the lower level page tables, we attempt to
> > allocate without dropping the MMU lock using GFP_NOWAIT to avoid doing
> > direct reclaim or invoking filesystem callbacks. If that fails we drop
> > the lock and perform a normal GFP_KERNEL allocation.
> >
> > Performance
> > ===========
> >
> > Eager page splitting moves the cost of splitting huge pages off of the
> > vCPU thread and onto the thread invoking one of the aforementioned
> > ioctls. This is useful because:
> >
> >   - Splitting on the vCPU thread interrupts vCPUs execution and is
> >     disruptive to customers whereas splitting on VM ioctl threads can
> >     run in parallel with vCPU execution.
> >
> >   - Splitting on the VM ioctl thread is more efficient because it does
> >     no require performing VM-exit handling and page table walks for every
> >     4K page.
> >
> > The measure the performance impact of Eager Page Splitting I ran
> > dirty_log_perf_test with 96 virtual CPUs, 1GiB per vCPU, and 1GiB
> > HugeTLB memory.
> >
> > When KVM_DIRTY_LOG_INITIALLY_SET is set, we can see that the first
> > KVM_CLEAR_DIRTY_LOG iteration gets longer because KVM is splitting
> > huge pages. But the time it takes for vCPUs to dirty their memory
> > is significantly shorter since they do not have to take write-
> > protection faults.
> >
> >             | Iteration 1 clear dirty log time | Iteration 2 dirty memory time
> > ---------- | -------------------------------- | -----------------------------
> > Before     | 0.049572219s                     | 2.751442902s
> > After      | 1.667811687s                     | 0.127016504s
> >
> > Eager page splitting does make subsequent KVM_CLEAR_DIRTY_LOG ioctls
> > about 4% slower since it always walks the page tables looking for pages
> > to split.  This can be avoided but will require extra memory and/or code
> > complexity to track when splitting can be skipped.
> >
> >             | Iteration 3 clear dirty log time
> > ---------- | --------------------------------
> > Before     | 1.374501209s
> > After      | 1.422478617s
> >
> > When not using KVM_DIRTY_LOG_INITIALLY_SET, KVM performs splitting on
> > the entire memslot during the KVM_SET_USER_MEMORY_REGION ioctl that
> > enables dirty logging. We can see that as an increase in the time it
> > takes to enable dirty logging. This allows vCPUs to avoid taking
> > write-protection faults which we again see in the dirty memory time.
> >
> >             | Enabling dirty logging time      | Iteration 1 dirty memory time
> > ---------- | -------------------------------- | -----------------------------
> > Before     | 0.001683739s                     | 2.943733325s
> > After      | 1.546904175s                     | 0.145979748s
> >
> > Testing
> > =======
> >
> > - Ran all kvm-unit-tests and KVM selftests on debug and non-debug kernels.
> >
> > - Ran dirty_log_perf_test with different backing sources (anonymous,
> >    anonymous_thp, anonymous_hugetlb_2mb, anonymous_hugetlb_1gb) with and
> >    without Eager Page Splitting enabled.
> >
> > - Added a tracepoint locally to time the GFP_NOWAIT allocations. Across
> >    40 runs of dirty_log_perf_test using 1GiB HugeTLB with 96 vCPUs there
> >    were only 4 allocations that took longer than 20 microseconds and the
> >    longest was 60 microseconds. None of the GFP_NOWAIT allocations
> >    failed.
> >
> > - I have not been able to trigger a GFP_NOWAIT allocation failure (to
> >    exercise the fallback path). However I did manually modify the code
> >    to force every allocation to fallback by removing the GFP_NOWAIT
> >    allocation altogether to make sure the logic works correctly.
> >
> > - Live migrated a 32 vCPU 32 GiB Linux VM running a workload that
> >    aggressively writes to guest memory with Eager Page Splitting enabled.
> >    Observed pages being split via tracepoint and the pages_{4k,2m,1g}
> >    stats.
>
> Queued, thanks!

Thanks Paolo!

I should have the shadow MMU implementation out for review shortly.


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2022-02-01 18:24 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-19 23:07 [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
2022-01-19 23:07 ` [PATCH v2 01/18] KVM: x86/mmu: Rename rmap_write_protect() to kvm_vcpu_write_protect_gfn() David Matlack
2022-01-19 23:07 ` [PATCH v2 02/18] KVM: x86/mmu: Rename __rmap_write_protect() to rmap_write_protect() David Matlack
2022-01-19 23:07 ` [PATCH v2 03/18] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails David Matlack
2022-01-19 23:07 ` [PATCH v2 04/18] KVM: x86/mmu: Change tdp_mmu_{set,zap}_spte_atomic() to return 0/-EBUSY David Matlack
2022-01-19 23:07 ` [PATCH v2 05/18] KVM: x86/mmu: Rename TDP MMU functions that handle shadow pages David Matlack
2022-01-19 23:07 ` [PATCH v2 06/18] KVM: x86/mmu: Rename handle_removed_tdp_mmu_page() to handle_removed_pt() David Matlack
2022-01-19 23:07 ` [PATCH v2 07/18] KVM: x86/mmu: Consolidate logic to atomically install a new TDP MMU page table David Matlack
2022-01-19 23:07 ` [PATCH v2 08/18] KVM: x86/mmu: Remove unnecessary warnings from restore_acc_track_spte() David Matlack
2022-01-19 23:07 ` [PATCH v2 09/18] KVM: x86/mmu: Drop new_spte local variable " David Matlack
2022-01-19 23:07 ` [PATCH v2 10/18] KVM: x86/mmu: Move restore_acc_track_spte() to spte.h David Matlack
2022-01-19 23:07 ` [PATCH v2 11/18] KVM: x86/mmu: Refactor TDP MMU iterators to take kvm_mmu_page root David Matlack
2022-01-19 23:07 ` [PATCH v2 12/18] KVM: x86/mmu: Remove redundant role overrides for TDP MMU shadow pages David Matlack
2022-01-19 23:07 ` [PATCH v2 13/18] KVM: x86/mmu: Derive page role for TDP MMU shadow pages from parent David Matlack
2022-01-19 23:07 ` [PATCH v2 14/18] KVM: x86/mmu: Separate TDP MMU shadow page allocation and initialization David Matlack
2022-01-19 23:07 ` [PATCH v2 15/18] KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled David Matlack
2022-01-19 23:07 ` [PATCH v2 16/18] KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG David Matlack
2022-01-19 23:07 ` [PATCH v2 17/18] KVM: x86/mmu: Add tracepoint for splitting huge pages David Matlack
2022-01-19 23:07 ` [PATCH v2 18/18] KVM: selftests: Add an option to disable MANUAL_PROTECT_ENABLE and INITIALLY_SET David Matlack
2022-02-01 18:03 ` [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU Paolo Bonzini
2022-02-01 18:24   ` David Matlack
