* [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU
@ 2021-11-19 23:57 David Matlack
  2021-11-19 23:57 ` [RFC PATCH 01/15] KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn David Matlack
                   ` (15 more replies)
  0 siblings, 16 replies; 77+ messages in thread
From: David Matlack @ 2021-11-19 23:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, David Matlack

This series is a first pass at implementing Eager Page Splitting for the
TDP MMU. For context on the motivation and design of Eager Page
Splitting, please see the RFC design proposal and discussion [1].

Paolo, I went ahead and added splitting in both the initially-all-set
case (only splitting the region passed to CLEAR_DIRTY_LOG) and the
case where we are not using initially-all-set (splitting the entire
memslot when dirty logging is enabled) to give you an idea of what
both look like.
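
Roughly, the two entry points look like this (sketch only; see patches
12 and 13 for the real code, and note that the CLEAR_DIRTY_LOG hook in
patch 13 is paraphrased here, so treat that call as an approximation):

  /*
   * No initially-all-set: split the whole memslot when dirty logging
   * is enabled (patch 12).
   */
  kvm_mmu_slot_try_split_large_pages(kvm, memslot, PG_LEVEL_4K);

  /*
   * initially-all-set: split only the GFN range being cleared when
   * userspace issues KVM_CLEAR_DIRTY_LOG (patch 13), e.g. via the TDP
   * MMU helper added in patch 12:
   */
  kvm_tdp_mmu_try_split_large_pages(kvm, memslot, start, end,
                                    PG_LEVEL_4K);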

Note: I will be on vacation all of next week so I will not be able to
respond to reviews until Monday November 29. I thought it would be
useful to seed discussion and reviews with an early version of the code
rather than putting it off another week. But feel free to also ignore
this until I get back :)

This series compiles and passes the most basic splitting test:

$ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 2 -i 4

But please operate under the assumption that this code is probably
buggy.

[1] https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/#t

David Matlack (15):
  KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn
  KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect
  KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
  KVM: x86/mmu: Factor out logic to atomically install a new page table
  KVM: x86/mmu: Abstract mmu caches out to a separate struct
  KVM: x86/mmu: Derive page role from parent
  KVM: x86/mmu: Pass in vcpu->arch.mmu_caches instead of vcpu
  KVM: x86/mmu: Helper method to check for large and present sptes
  KVM: x86/mmu: Move restore_acc_track_spte to spte.c
  KVM: x86/mmu: Abstract need_resched logic from
    tdp_mmu_iter_cond_resched
  KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root
  KVM: x86/mmu: Split large pages when dirty logging is enabled
  KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
  KVM: x86/mmu: Add tracepoint for splitting large pages
  KVM: x86/mmu: Update page stats when splitting large pages

 arch/x86/include/asm/kvm_host.h |  22 ++-
 arch/x86/kvm/mmu/mmu.c          | 185 +++++++++++++-----
 arch/x86/kvm/mmu/mmu_internal.h |   3 +
 arch/x86/kvm/mmu/mmutrace.h     |  20 ++
 arch/x86/kvm/mmu/spte.c         |  64 +++++++
 arch/x86/kvm/mmu/spte.h         |   7 +
 arch/x86/kvm/mmu/tdp_iter.c     |   5 +-
 arch/x86/kvm/mmu/tdp_iter.h     |  10 +-
 arch/x86/kvm/mmu/tdp_mmu.c      | 322 +++++++++++++++++++++++---------
 arch/x86/kvm/mmu/tdp_mmu.h      |   5 +
 arch/x86/kvm/x86.c              |   6 +
 11 files changed, 501 insertions(+), 148 deletions(-)

-- 
2.34.0.rc2.393.gf8c9666880-goog



* [RFC PATCH 01/15] KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn
  2021-11-19 23:57 [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
@ 2021-11-19 23:57 ` David Matlack
  2021-11-22 18:52   ` Ben Gardon
  2021-11-26 12:18   ` Peter Xu
  2021-11-19 23:57 ` [RFC PATCH 02/15] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect David Matlack
                   ` (14 subsequent siblings)
  15 siblings, 2 replies; 77+ messages in thread
From: David Matlack @ 2021-11-19 23:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, David Matlack

rmap_write_protect is a poor name because we may not even touch the rmap
if the TDP MMU is in use. It is also confusing that rmap_write_protect
is not a simple wrapper around __rmap_write_protect, which is the
typical relationship between a function and its double-underscore
variant.

Rename it to kvm_vcpu_write_protect_gfn to convey that we are
write-protecting a specific gfn in the context of a vCPU.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8f0035517450..16ffb571bc75 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1427,7 +1427,7 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 	return write_protected;
 }
 
-static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
+static bool kvm_vcpu_write_protect_gfn(struct kvm_vcpu *vcpu, u64 gfn)
 {
 	struct kvm_memory_slot *slot;
 
@@ -2026,7 +2026,7 @@ static int mmu_sync_children(struct kvm_vcpu *vcpu,
 		bool protected = false;
 
 		for_each_sp(pages, sp, parents, i)
-			protected |= rmap_write_protect(vcpu, sp->gfn);
+			protected |= kvm_vcpu_write_protect_gfn(vcpu, sp->gfn);
 
 		if (protected) {
 			kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, true);
@@ -2153,7 +2153,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	hlist_add_head(&sp->hash_link, sp_list);
 	if (!direct) {
 		account_shadowed(vcpu->kvm, sp);
-		if (level == PG_LEVEL_4K && rmap_write_protect(vcpu, gfn))
+		if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
 	}
 	trace_kvm_mmu_get_page(sp, true);
-- 
2.34.0.rc2.393.gf8c9666880-goog



* [RFC PATCH 02/15] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect
  2021-11-19 23:57 [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
  2021-11-19 23:57 ` [RFC PATCH 01/15] KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn David Matlack
@ 2021-11-19 23:57 ` David Matlack
  2021-11-22 18:52   ` Ben Gardon
  2021-11-26 12:18   ` Peter Xu
  2021-11-19 23:57 ` [RFC PATCH 03/15] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails David Matlack
                   ` (13 subsequent siblings)
  15 siblings, 2 replies; 77+ messages in thread
From: David Matlack @ 2021-11-19 23:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, David Matlack

Now that rmap_write_protect has been renamed, there is no need for the
double underscores in front of __rmap_write_protect.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 16ffb571bc75..1146f87044a6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1235,9 +1235,9 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
 	return mmu_spte_update(sptep, spte);
 }
 
-static bool __rmap_write_protect(struct kvm *kvm,
-				 struct kvm_rmap_head *rmap_head,
-				 bool pt_protect)
+static bool rmap_write_protect(struct kvm *kvm,
+			       struct kvm_rmap_head *rmap_head,
+			       bool pt_protect)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
@@ -1317,7 +1317,7 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 	while (mask) {
 		rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
 					PG_LEVEL_4K, slot);
-		__rmap_write_protect(kvm, rmap_head, false);
+		rmap_write_protect(kvm, rmap_head, false);
 
 		/* clear the first set bit */
 		mask &= mask - 1;
@@ -1416,7 +1416,7 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 	if (kvm_memslots_have_rmaps(kvm)) {
 		for (i = min_level; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
 			rmap_head = gfn_to_rmap(gfn, i, slot);
-			write_protected |= __rmap_write_protect(kvm, rmap_head, true);
+			write_protected |= rmap_write_protect(kvm, rmap_head, true);
 		}
 	}
 
@@ -5780,7 +5780,7 @@ static bool slot_rmap_write_protect(struct kvm *kvm,
 				    struct kvm_rmap_head *rmap_head,
 				    const struct kvm_memory_slot *slot)
 {
-	return __rmap_write_protect(kvm, rmap_head, false);
+	return rmap_write_protect(kvm, rmap_head, false);
 }
 
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
-- 
2.34.0.rc2.393.gf8c9666880-goog



* [RFC PATCH 03/15] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
  2021-11-19 23:57 [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
  2021-11-19 23:57 ` [RFC PATCH 01/15] KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn David Matlack
  2021-11-19 23:57 ` [RFC PATCH 02/15] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect David Matlack
@ 2021-11-19 23:57 ` David Matlack
  2021-11-22 18:52   ` Ben Gardon
  2021-11-19 23:57 ` [RFC PATCH 04/15] KVM: x86/mmu: Factor out logic to atomically install a new page table David Matlack
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 77+ messages in thread
From: David Matlack @ 2021-11-19 23:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, David Matlack

Consolidate the code that manually re-reads the SPTE when the cmpxchg
fails. There is no extra cost to doing this because we already have the
SPTE value as a result of the cmpxchg (in fact this eliminates the
re-read entirely), and none of the call sites depend on iter->old_spte
retaining the stale SPTE value.
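
For reference, the property this relies on can be demonstrated with a
small userspace analogue (illustrative only, not kernel code;
__atomic_compare_exchange_n stands in for cmpxchg64 here):

  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          uint64_t spte = 0x111;      /* stand-in for *iter->sptep */
          uint64_t old_spte = 0x222;  /* stale iter->old_spte */

          /*
           * On failure the builtin writes the value actually found in
           * memory back into 'old_spte', just as cmpxchg64() returns it,
           * so no separate re-read of the SPTE is needed.
           */
          if (!__atomic_compare_exchange_n(&spte, &old_spte, 0x333, false,
                                           __ATOMIC_SEQ_CST,
                                           __ATOMIC_SEQ_CST))
                  printf("cmpxchg failed, current value: 0x%" PRIx64 "\n",
                         old_spte);

          return 0;
  }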

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 56 ++++++++++++--------------------------
 1 file changed, 18 insertions(+), 38 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 377a96718a2e..cc9fe33c9b36 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -492,16 +492,22 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
  * and handle the associated bookkeeping.  Do not mark the page dirty
  * in KVM's dirty bitmaps.
  *
+ * If setting the SPTE fails because it has changed, iter->old_spte will be
+ * updated with the current value of the SPTE.
+ *
  * @kvm: kvm instance
  * @iter: a tdp_iter instance currently on the SPTE that should be set
  * @new_spte: The value the SPTE should be set to
  * Returns: true if the SPTE was set, false if it was not. If false is returned,
- *	    this function will have no side-effects.
+ *          this function will have no side-effects other than updating
+ *          iter->old_spte to the latest value of spte.
  */
 static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 					   struct tdp_iter *iter,
 					   u64 new_spte)
 {
+	u64 old_spte;
+
 	lockdep_assert_held_read(&kvm->mmu_lock);
 
 	/*
@@ -515,9 +521,11 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
 	 * does not hold the mmu_lock.
 	 */
-	if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
-		      new_spte) != iter->old_spte)
+	old_spte = cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte, new_spte);
+	if (old_spte != iter->old_spte) {
+		iter->old_spte = old_spte;
 		return false;
+	}
 
 	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
 			      new_spte, iter->level, true);
@@ -747,14 +755,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		if (!shared) {
 			tdp_mmu_set_spte(kvm, &iter, 0);
 			flush = true;
-		} else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
-			/*
-			 * The iter must explicitly re-read the SPTE because
-			 * the atomic cmpxchg failed.
-			 */
-			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+		} else if (!tdp_mmu_zap_spte_atomic(kvm, &iter))
 			goto retry;
-		}
 	}
 
 	rcu_read_unlock();
@@ -978,13 +980,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		    is_large_pte(iter.old_spte)) {
 			if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
 				break;
-
-			/*
-			 * The iter must explicitly re-read the spte here
-			 * because the new value informs the !present
-			 * path below.
-			 */
-			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
 		}
 
 		if (!is_shadow_present_pte(iter.old_spte)) {
@@ -1190,14 +1185,9 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
 
-		if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte)) {
-			/*
-			 * The iter must explicitly re-read the SPTE because
-			 * the atomic cmpxchg failed.
-			 */
-			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+		if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
 			goto retry;
-		}
+
 		spte_set = true;
 	}
 
@@ -1258,14 +1248,9 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 				continue;
 		}
 
-		if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte)) {
-			/*
-			 * The iter must explicitly re-read the SPTE because
-			 * the atomic cmpxchg failed.
-			 */
-			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+		if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
 			goto retry;
-		}
+
 		spte_set = true;
 	}
 
@@ -1391,14 +1376,9 @@ static bool zap_collapsible_spte_range(struct kvm *kvm,
 							    pfn, PG_LEVEL_NUM))
 			continue;
 
-		if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
-			/*
-			 * The iter must explicitly re-read the SPTE because
-			 * the atomic cmpxchg failed.
-			 */
-			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+		if (!tdp_mmu_zap_spte_atomic(kvm, &iter))
 			goto retry;
-		}
+
 		flush = true;
 	}
 
-- 
2.34.0.rc2.393.gf8c9666880-goog



* [RFC PATCH 04/15] KVM: x86/mmu: Factor out logic to atomically install a new page table
  2021-11-19 23:57 [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (2 preceding siblings ...)
  2021-11-19 23:57 ` [RFC PATCH 03/15] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails David Matlack
@ 2021-11-19 23:57 ` David Matlack
  2021-11-22 18:52   ` Ben Gardon
  2021-12-01 19:13   ` Sean Christopherson
  2021-11-19 23:57 ` [RFC PATCH 05/15] KVM: x86/mmu: Abstract mmu caches out to a separate struct David Matlack
                   ` (11 subsequent siblings)
  15 siblings, 2 replies; 77+ messages in thread
From: David Matlack @ 2021-11-19 23:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, David Matlack

Factor out the logic to atomically replace an SPTE with an SPTE that
points to a new page table. This will be used in a follow-up commit to
split a large page SPTE into a page table one level lower.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 53 ++++++++++++++++++++++++++------------
 1 file changed, 37 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index cc9fe33c9b36..9ee3f4f7fdf5 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -945,6 +945,39 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 	return ret;
 }
 
+/*
+ * tdp_mmu_install_sp_atomic - Atomically replace the given spte with an
+ * spte pointing to the provided page table.
+ *
+ * @kvm: kvm instance
+ * @iter: a tdp_iter instance currently on the SPTE that should be set
+ * @sp: The new TDP page table to install.
+ * @account_nx: True if this page table is being installed to split a
+ *              non-executable huge page.
+ *
+ * Returns: True if the new page table was installed. False if spte being
+ *          replaced changed, causing the atomic compare-exchange to fail.
+ *          If this function returns false the sp will be freed before
+ *          returning.
+ */
+static bool tdp_mmu_install_sp_atomic(struct kvm *kvm,
+				      struct tdp_iter *iter,
+				      struct kvm_mmu_page *sp,
+				      bool account_nx)
+{
+	u64 spte;
+
+	spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
+
+	if (tdp_mmu_set_spte_atomic(kvm, iter, spte)) {
+		tdp_mmu_link_page(kvm, sp, account_nx);
+		return true;
+	} else {
+		tdp_mmu_free_sp(sp);
+		return false;
+	}
+}
+
 /*
  * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
  * page tables and SPTEs to translate the faulting guest physical address.
@@ -954,8 +987,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
-	u64 *child_pt;
-	u64 new_spte;
 	int ret;
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -983,6 +1014,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		}
 
 		if (!is_shadow_present_pte(iter.old_spte)) {
+			bool account_nx = fault->huge_page_disallowed &&
+					  fault->req_level >= iter.level;
+
 			/*
 			 * If SPTE has been frozen by another thread, just
 			 * give up and retry, avoiding unnecessary page table
@@ -992,21 +1026,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 				break;
 
 			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level - 1);
-			child_pt = sp->spt;
-
-			new_spte = make_nonleaf_spte(child_pt,
-						     !shadow_accessed_mask);
-
-			if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, new_spte)) {
-				tdp_mmu_link_page(vcpu->kvm, sp,
-						  fault->huge_page_disallowed &&
-						  fault->req_level >= iter.level);
-
-				trace_kvm_mmu_get_page(sp, true);
-			} else {
-				tdp_mmu_free_sp(sp);
+			if (!tdp_mmu_install_sp_atomic(vcpu->kvm, &iter, sp, account_nx))
 				break;
-			}
 		}
 	}
 
-- 
2.34.0.rc2.393.gf8c9666880-goog



* [RFC PATCH 05/15] KVM: x86/mmu: Abstract mmu caches out to a separate struct
  2021-11-19 23:57 [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (3 preceding siblings ...)
  2021-11-19 23:57 ` [RFC PATCH 04/15] KVM: x86/mmu: Factor out logic to atomically install a new page table David Matlack
@ 2021-11-19 23:57 ` David Matlack
  2021-11-22 18:55   ` Ben Gardon
  2021-11-19 23:57 ` [RFC PATCH 06/15] KVM: x86/mmu: Derive page role from parent David Matlack
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 77+ messages in thread
From: David Matlack @ 2021-11-19 23:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, David Matlack

Move the vCPU's kvm_mmu_memory_cache structs into a separate wrapper
struct. This is in preparation for eagerly splitting all large pages
during VM-ioctls (i.e. not in the vCPU fault path), which will require
adding kvm_mmu_memory_cache structs to struct kvm_arch.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h | 12 ++++---
 arch/x86/kvm/mmu/mmu.c          | 59 ++++++++++++++++++++++-----------
 arch/x86/kvm/mmu/tdp_mmu.c      |  7 ++--
 3 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1fcb345bc107..2a7564703ea6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -612,6 +612,13 @@ struct kvm_vcpu_xen {
 	u64 runstate_times[4];
 };
 
+struct kvm_mmu_memory_caches {
+	struct kvm_mmu_memory_cache pte_list_desc_cache;
+	struct kvm_mmu_memory_cache shadow_page_cache;
+	struct kvm_mmu_memory_cache gfn_array_cache;
+	struct kvm_mmu_memory_cache page_header_cache;
+};
+
 struct kvm_vcpu_arch {
 	/*
 	 * rip and regs accesses must go through
@@ -681,10 +688,7 @@ struct kvm_vcpu_arch {
 	 */
 	struct kvm_mmu *walk_mmu;
 
-	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
-	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
-	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
-	struct kvm_mmu_memory_cache mmu_page_header_cache;
+	struct kvm_mmu_memory_caches mmu_caches;
 
 	/*
 	 * QEMU userspace and the guest each have their own FPU state.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1146f87044a6..537952574211 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -732,38 +732,60 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
 
 static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 {
+	struct kvm_mmu_memory_caches *mmu_caches;
 	int r;
 
+	mmu_caches = &vcpu->arch.mmu_caches;
+
 	/* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
-	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
+	r = kvm_mmu_topup_memory_cache(&mmu_caches->pte_list_desc_cache,
 				       1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
 	if (r)
 		return r;
-	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
+	r = kvm_mmu_topup_memory_cache(&mmu_caches->shadow_page_cache,
 				       PT64_ROOT_MAX_LEVEL);
 	if (r)
 		return r;
 	if (maybe_indirect) {
-		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_gfn_array_cache,
+		r = kvm_mmu_topup_memory_cache(&mmu_caches->gfn_array_cache,
 					       PT64_ROOT_MAX_LEVEL);
 		if (r)
 			return r;
 	}
-	return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
+	return kvm_mmu_topup_memory_cache(&mmu_caches->page_header_cache,
 					  PT64_ROOT_MAX_LEVEL);
 }
 
 static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
-	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
-	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
-	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
-	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
+	struct kvm_mmu_memory_caches *mmu_caches;
+
+	mmu_caches = &vcpu->arch.mmu_caches;
+
+	kvm_mmu_free_memory_cache(&mmu_caches->pte_list_desc_cache);
+	kvm_mmu_free_memory_cache(&mmu_caches->shadow_page_cache);
+	kvm_mmu_free_memory_cache(&mmu_caches->gfn_array_cache);
+	kvm_mmu_free_memory_cache(&mmu_caches->page_header_cache);
+}
+
+static void mmu_init_memory_caches(struct kvm_mmu_memory_caches *caches)
+{
+	caches->pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
+	caches->pte_list_desc_cache.gfp_zero = __GFP_ZERO;
+
+	caches->page_header_cache.kmem_cache = mmu_page_header_cache;
+	caches->page_header_cache.gfp_zero = __GFP_ZERO;
+
+	caches->shadow_page_cache.gfp_zero = __GFP_ZERO;
 }
 
 static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
 {
-	return kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_list_desc_cache);
+	struct kvm_mmu_memory_caches *mmu_caches;
+
+	mmu_caches = &vcpu->arch.mmu_caches;
+
+	return kvm_mmu_memory_cache_alloc(&mmu_caches->pte_list_desc_cache);
 }
 
 static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
@@ -1071,7 +1093,7 @@ static bool rmap_can_add(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu_memory_cache *mc;
 
-	mc = &vcpu->arch.mmu_pte_list_desc_cache;
+	mc = &vcpu->arch.mmu_caches.pte_list_desc_cache;
 	return kvm_mmu_memory_cache_nr_free_objects(mc);
 }
 
@@ -1742,12 +1764,15 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 
 static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
 {
+	struct kvm_mmu_memory_caches *mmu_caches;
 	struct kvm_mmu_page *sp;
 
-	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	mmu_caches = &vcpu->arch.mmu_caches;
+
+	sp = kvm_mmu_memory_cache_alloc(&mmu_caches->page_header_cache);
+	sp->spt = kvm_mmu_memory_cache_alloc(&mmu_caches->shadow_page_cache);
 	if (!direct)
-		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
+		sp->gfns = kvm_mmu_memory_cache_alloc(&mmu_caches->gfn_array_cache);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
 	/*
@@ -5544,13 +5569,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
 {
 	int ret;
 
-	vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
-	vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
-
-	vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
-	vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
-
-	vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+	mmu_init_memory_caches(&vcpu->arch.mmu_caches);
 
 	vcpu->arch.mmu = &vcpu->arch.root_mmu;
 	vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9ee3f4f7fdf5..b70707a7fe87 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -175,10 +175,13 @@ static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
 static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 					       int level)
 {
+	struct kvm_mmu_memory_caches *mmu_caches;
 	struct kvm_mmu_page *sp;
 
-	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	mmu_caches = &vcpu->arch.mmu_caches;
+
+	sp = kvm_mmu_memory_cache_alloc(&mmu_caches->page_header_cache);
+	sp->spt = kvm_mmu_memory_cache_alloc(&mmu_caches->shadow_page_cache);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
 	sp->role.word = page_role_for_level(vcpu, level).word;
-- 
2.34.0.rc2.393.gf8c9666880-goog



* [RFC PATCH 06/15] KVM: x86/mmu: Derive page role from parent
  2021-11-19 23:57 [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (4 preceding siblings ...)
  2021-11-19 23:57 ` [RFC PATCH 05/15] KVM: x86/mmu: Abstract mmu caches out to a separate struct David Matlack
@ 2021-11-19 23:57 ` David Matlack
  2021-11-20 12:53   ` Paolo Bonzini
  2021-11-19 23:57 ` [RFC PATCH 07/15] KVM: x86/mmu: Pass in vcpu->arch.mmu_caches instead of vcpu David Matlack
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 77+ messages in thread
From: David Matlack @ 2021-11-19 23:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, David Matlack

Derive the page role from the parent shadow page, since the only thing
that changes is the level. This is in preparation for eagerly splitting
large pages during VM-ioctls, which do not have access to the vCPU MMU
context.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 43 ++++++++++++++++++++------------------
 1 file changed, 23 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b70707a7fe87..1a409992a57f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -157,23 +157,8 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 		if (kvm_mmu_page_as_id(_root) != _as_id) {		\
 		} else
 
-static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
-						   int level)
-{
-	union kvm_mmu_page_role role;
-
-	role = vcpu->arch.mmu->mmu_role.base;
-	role.level = level;
-	role.direct = true;
-	role.gpte_is_8_bytes = true;
-	role.access = ACC_ALL;
-	role.ad_disabled = !shadow_accessed_mask;
-
-	return role;
-}
-
 static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-					       int level)
+					       union kvm_mmu_page_role role)
 {
 	struct kvm_mmu_memory_caches *mmu_caches;
 	struct kvm_mmu_page *sp;
@@ -184,7 +169,7 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 	sp->spt = kvm_mmu_memory_cache_alloc(&mmu_caches->shadow_page_cache);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
-	sp->role.word = page_role_for_level(vcpu, level).word;
+	sp->role = role;
 	sp->gfn = gfn;
 	sp->tdp_mmu_page = true;
 
@@ -193,6 +178,19 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return sp;
 }
 
+static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu, struct tdp_iter *iter)
+{
+	struct kvm_mmu_page *parent_sp;
+	union kvm_mmu_page_role role;
+
+	parent_sp = sptep_to_sp(rcu_dereference(iter->sptep));
+
+	role = parent_sp->role;
+	role.level--;
+
+	return alloc_tdp_mmu_page(vcpu, iter->gfn, role);
+}
+
 hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 {
 	union kvm_mmu_page_role role;
@@ -201,7 +199,12 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
-	role = page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level);
+	role = vcpu->arch.mmu->mmu_role.base;
+	role.level = vcpu->arch.mmu->shadow_root_level;
+	role.direct = true;
+	role.gpte_is_8_bytes = true;
+	role.access = ACC_ALL;
+	role.ad_disabled = !shadow_accessed_mask;
 
 	/* Check for an existing root before allocating a new one. */
 	for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
@@ -210,7 +213,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 			goto out;
 	}
 
-	root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
+	root = alloc_tdp_mmu_page(vcpu, 0, role);
 	refcount_set(&root->tdp_mmu_root_count, 1);
 
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
@@ -1028,7 +1031,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			if (is_removed_spte(iter.old_spte))
 				break;
 
-			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level - 1);
+			sp = alloc_child_tdp_mmu_page(vcpu, &iter);
 			if (!tdp_mmu_install_sp_atomic(vcpu->kvm, &iter, sp, account_nx))
 				break;
 		}
-- 
2.34.0.rc2.393.gf8c9666880-goog



* [RFC PATCH 07/15] KVM: x86/mmu: Pass in vcpu->arch.mmu_caches instead of vcpu
  2021-11-19 23:57 [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (5 preceding siblings ...)
  2021-11-19 23:57 ` [RFC PATCH 06/15] KVM: x86/mmu: Derive page role from parent David Matlack
@ 2021-11-19 23:57 ` David Matlack
  2021-11-22 18:56   ` Ben Gardon
  2021-11-19 23:57 ` [RFC PATCH 08/15] KVM: x86/mmu: Helper method to check for large and present sptes David Matlack
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 77+ messages in thread
From: David Matlack @ 2021-11-19 23:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, David Matlack

Pass in vcpu->arch.mmu_caches to alloc_{,child_}tdp_mmu_page() instead
of the vcpu. This is in preparation for eagerly splitting large pages
during VM-ioctls, which do not have access to the vCPU mmu_caches.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1a409992a57f..ff4d83ad7580 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -157,14 +157,11 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 		if (kvm_mmu_page_as_id(_root) != _as_id) {		\
 		} else
 
-static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-					       union kvm_mmu_page_role role)
+static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_mmu_memory_caches *mmu_caches,
+					       gfn_t gfn, union kvm_mmu_page_role role)
 {
-	struct kvm_mmu_memory_caches *mmu_caches;
 	struct kvm_mmu_page *sp;
 
-	mmu_caches = &vcpu->arch.mmu_caches;
-
 	sp = kvm_mmu_memory_cache_alloc(&mmu_caches->page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(&mmu_caches->shadow_page_cache);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
@@ -178,7 +175,8 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return sp;
 }
 
-static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu, struct tdp_iter *iter)
+static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_mmu_memory_caches *mmu_caches,
+						     struct tdp_iter *iter)
 {
 	struct kvm_mmu_page *parent_sp;
 	union kvm_mmu_page_role role;
@@ -188,7 +186,7 @@ static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu, stru
 	role = parent_sp->role;
 	role.level--;
 
-	return alloc_tdp_mmu_page(vcpu, iter->gfn, role);
+	return alloc_tdp_mmu_page(mmu_caches, iter->gfn, role);
 }
 
 hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
@@ -213,7 +211,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 			goto out;
 	}
 
-	root = alloc_tdp_mmu_page(vcpu, 0, role);
+	root = alloc_tdp_mmu_page(&vcpu->arch.mmu_caches, 0, role);
 	refcount_set(&root->tdp_mmu_root_count, 1);
 
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
@@ -1031,7 +1029,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			if (is_removed_spte(iter.old_spte))
 				break;
 
-			sp = alloc_child_tdp_mmu_page(vcpu, &iter);
+			sp = alloc_child_tdp_mmu_page(&vcpu->arch.mmu_caches, &iter);
 			if (!tdp_mmu_install_sp_atomic(vcpu->kvm, &iter, sp, account_nx))
 				break;
 		}
-- 
2.34.0.rc2.393.gf8c9666880-goog



* [RFC PATCH 08/15] KVM: x86/mmu: Helper method to check for large and present sptes
  2021-11-19 23:57 [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (6 preceding siblings ...)
  2021-11-19 23:57 ` [RFC PATCH 07/15] KVM: x86/mmu: Pass in vcpu->arch.mmu_caches instead of vcpu David Matlack
@ 2021-11-19 23:57 ` David Matlack
  2021-11-22 18:56   ` Ben Gardon
  2021-12-01 18:34   ` Sean Christopherson
  2021-11-19 23:57 ` [RFC PATCH 09/15] KVM: x86/mmu: Move restore_acc_track_spte to spte.c David Matlack
                   ` (7 subsequent siblings)
  15 siblings, 2 replies; 77+ messages in thread
From: David Matlack @ 2021-11-19 23:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, David Matlack

Consolidate the is_shadow_present_pte() and is_large_pte() checks into a
single helper. This will be used in a follow-up commit to check for
present large pages during Eager Page Splitting.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/spte.h    | 5 +++++
 arch/x86/kvm/mmu/tdp_mmu.c | 3 +--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index cc432f9a966b..e73c41d31816 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -257,6 +257,11 @@ static inline bool is_large_pte(u64 pte)
 	return pte & PT_PAGE_SIZE_MASK;
 }
 
+static inline bool is_large_present_pte(u64 pte)
+{
+	return is_shadow_present_pte(pte) && is_large_pte(pte);
+}
+
 static inline bool is_last_spte(u64 pte, int level)
 {
 	return (level == PG_LEVEL_4K) || is_large_pte(pte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ff4d83ad7580..f8c4337f1fcf 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1011,8 +1011,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 * than the target, that SPTE must be cleared and replaced
 		 * with a non-leaf SPTE.
 		 */
-		if (is_shadow_present_pte(iter.old_spte) &&
-		    is_large_pte(iter.old_spte)) {
+		if (is_large_present_pte(iter.old_spte)) {
 			if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
 				break;
 		}
-- 
2.34.0.rc2.393.gf8c9666880-goog



* [RFC PATCH 09/15] KVM: x86/mmu: Move restore_acc_track_spte to spte.c
  2021-11-19 23:57 [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (7 preceding siblings ...)
  2021-11-19 23:57 ` [RFC PATCH 08/15] KVM: x86/mmu: Helper method to check for large and present sptes David Matlack
@ 2021-11-19 23:57 ` David Matlack
  2021-11-22 18:56   ` Ben Gardon
  2021-11-19 23:57 ` [RFC PATCH 10/15] KVM: x86/mmu: Abstract need_resched logic from tdp_mmu_iter_cond_resched David Matlack
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 77+ messages in thread
From: David Matlack @ 2021-11-19 23:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, David Matlack

restore_acc_track_spte is purely an SPTE manipulation, making it a good
fit for spte.c. It is also needed in spte.c in a follow-up commit so we
can construct child SPTEs during large page splitting.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c  | 18 ------------------
 arch/x86/kvm/mmu/spte.c | 18 ++++++++++++++++++
 arch/x86/kvm/mmu/spte.h |  1 +
 3 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 537952574211..54f0d2228135 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -652,24 +652,6 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
 	return __get_spte_lockless(sptep);
 }
 
-/* Restore an acc-track PTE back to a regular PTE */
-static u64 restore_acc_track_spte(u64 spte)
-{
-	u64 new_spte = spte;
-	u64 saved_bits = (spte >> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT)
-			 & SHADOW_ACC_TRACK_SAVED_BITS_MASK;
-
-	WARN_ON_ONCE(spte_ad_enabled(spte));
-	WARN_ON_ONCE(!is_access_track_spte(spte));
-
-	new_spte &= ~shadow_acc_track_mask;
-	new_spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
-		      SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
-	new_spte |= saved_bits;
-
-	return new_spte;
-}
-
 /* Returns the Accessed status of the PTE and resets it at the same time. */
 static bool mmu_spte_age(u64 *sptep)
 {
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 0c76c45fdb68..df2cdb8bcf77 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -268,6 +268,24 @@ u64 mark_spte_for_access_track(u64 spte)
 	return spte;
 }
 
+/* Restore an acc-track PTE back to a regular PTE */
+u64 restore_acc_track_spte(u64 spte)
+{
+	u64 new_spte = spte;
+	u64 saved_bits = (spte >> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT)
+			 & SHADOW_ACC_TRACK_SAVED_BITS_MASK;
+
+	WARN_ON_ONCE(spte_ad_enabled(spte));
+	WARN_ON_ONCE(!is_access_track_spte(spte));
+
+	new_spte &= ~shadow_acc_track_mask;
+	new_spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
+		      SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
+	new_spte |= saved_bits;
+
+	return new_spte;
+}
+
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
 {
 	BUG_ON((u64)(unsigned)access_mask != access_mask);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index e73c41d31816..3e4943ee5a01 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -342,6 +342,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
 u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
 u64 mark_spte_for_access_track(u64 spte);
+u64 restore_acc_track_spte(u64 spte);
 u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn);
 
 void kvm_mmu_reset_all_pte_masks(void);
-- 
2.34.0.rc2.393.gf8c9666880-goog



* [RFC PATCH 10/15] KVM: x86/mmu: Abstract need_resched logic from tdp_mmu_iter_cond_resched
  2021-11-19 23:57 [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (8 preceding siblings ...)
  2021-11-19 23:57 ` [RFC PATCH 09/15] KVM: x86/mmu: Move restore_acc_track_spte to spte.c David Matlack
@ 2021-11-19 23:57 ` David Matlack
  2021-11-22 18:56   ` Ben Gardon
  2021-11-19 23:57 ` [RFC PATCH 11/15] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root David Matlack
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 77+ messages in thread
From: David Matlack @ 2021-11-19 23:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, David Matlack

Abstract out the logic that checks whether or not we should reschedule
(including the extra check that ensures we make forward progress) to a
helper method. This will be used in a follow-up commit to reschedule
during large page splitting.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f8c4337f1fcf..2221e074d8ea 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -645,6 +645,15 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 	for_each_tdp_pte(_iter, __va(_mmu->root_hpa),		\
 			 _mmu->shadow_root_level, _start, _end)
 
+static inline bool tdp_mmu_iter_need_resched(struct kvm *kvm, struct tdp_iter *iter)
+{
+	/* Ensure forward progress has been made before yielding. */
+	if (iter->next_last_level_gfn == iter->yielded_gfn)
+		return false;
+
+	return need_resched() || rwlock_needbreak(&kvm->mmu_lock);
+}
+
 /*
  * Yield if the MMU lock is contended or this thread needs to return control
  * to the scheduler.
@@ -664,11 +673,7 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
 					     struct tdp_iter *iter, bool flush,
 					     bool shared)
 {
-	/* Ensure forward progress has been made before yielding. */
-	if (iter->next_last_level_gfn == iter->yielded_gfn)
-		return false;
-
-	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
+	if (tdp_mmu_iter_need_resched(kvm, iter)) {
 		rcu_read_unlock();
 
 		if (flush)
-- 
2.34.0.rc2.393.gf8c9666880-goog



* [RFC PATCH 11/15] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root
  2021-11-19 23:57 [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (9 preceding siblings ...)
  2021-11-19 23:57 ` [RFC PATCH 10/15] KVM: x86/mmu: Abstract need_resched logic from tdp_mmu_iter_cond_resched David Matlack
@ 2021-11-19 23:57 ` David Matlack
  2021-11-22 18:56   ` Ben Gardon
  2021-11-19 23:57 ` [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled David Matlack
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 77+ messages in thread
From: David Matlack @ 2021-11-19 23:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, David Matlack

Instead of passing a pointer to the root page table and the root level
separately, pass in a pointer to the kvm_mmu_page that backs the root.
This reduces the number of arguments by 1, cutting down on line lengths.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_iter.c |  5 ++++-
 arch/x86/kvm/mmu/tdp_iter.h | 10 +++++-----
 arch/x86/kvm/mmu/tdp_mmu.c  | 14 +++++---------
 3 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index b3ed302c1a35..92b3a075525a 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -39,9 +39,12 @@ void tdp_iter_restart(struct tdp_iter *iter)
  * Sets a TDP iterator to walk a pre-order traversal of the paging structure
  * rooted at root_pt, starting with the walk to translate next_last_level_gfn.
  */
-void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
+void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
 		    int min_level, gfn_t next_last_level_gfn)
 {
+	u64 *root_pt = root->spt;
+	int root_level = root->role.level;
+
 	WARN_ON(root_level < 1);
 	WARN_ON(root_level > PT64_ROOT_MAX_LEVEL);
 
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index b1748b988d3a..ec1f58013428 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -51,17 +51,17 @@ struct tdp_iter {
  * Iterates over every SPTE mapping the GFN range [start, end) in a
  * preorder traversal.
  */
-#define for_each_tdp_pte_min_level(iter, root, root_level, min_level, start, end) \
-	for (tdp_iter_start(&iter, root, root_level, min_level, start); \
+#define for_each_tdp_pte_min_level(iter, root, min_level, start, end) \
+	for (tdp_iter_start(&iter, root, min_level, start); \
 	     iter.valid && iter.gfn < end;		     \
 	     tdp_iter_next(&iter))
 
-#define for_each_tdp_pte(iter, root, root_level, start, end) \
-	for_each_tdp_pte_min_level(iter, root, root_level, PG_LEVEL_4K, start, end)
+#define for_each_tdp_pte(iter, root, start, end) \
+	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end)
 
 tdp_ptep_t spte_to_child_pt(u64 pte, int level);
 
-void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
+void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
 		    int min_level, gfn_t next_last_level_gfn);
 void tdp_iter_next(struct tdp_iter *iter);
 void tdp_iter_restart(struct tdp_iter *iter);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 2221e074d8ea..5ca0fa659245 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -632,7 +632,7 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 }
 
 #define tdp_root_for_each_pte(_iter, _root, _start, _end) \
-	for_each_tdp_pte(_iter, _root->spt, _root->role.level, _start, _end)
+	for_each_tdp_pte(_iter, _root, _start, _end)
 
 #define tdp_root_for_each_leaf_pte(_iter, _root, _start, _end)	\
 	tdp_root_for_each_pte(_iter, _root, _start, _end)		\
@@ -642,8 +642,7 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 		else
 
 #define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end)		\
-	for_each_tdp_pte(_iter, __va(_mmu->root_hpa),		\
-			 _mmu->shadow_root_level, _start, _end)
+	for_each_tdp_pte(_iter, to_shadow_page(_mmu->root_hpa), _start, _end)
 
 static inline bool tdp_mmu_iter_need_resched(struct kvm *kvm, struct tdp_iter *iter)
 {
@@ -738,8 +737,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	rcu_read_lock();
 
-	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
-				   min_level, start, end) {
+	for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
 retry:
 		if (can_yield &&
 		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
@@ -1201,8 +1199,7 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
 
-	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
-				   min_level, start, end) {
+	for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
 retry:
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
 			continue;
@@ -1450,8 +1447,7 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	rcu_read_lock();
 
-	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
-				   min_level, gfn, gfn + 1) {
+	for_each_tdp_pte_min_level(iter, root, min_level, gfn, gfn + 1) {
 		if (!is_shadow_present_pte(iter.old_spte) ||
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
-- 
2.34.0.rc2.393.gf8c9666880-goog



* [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
  2021-11-19 23:57 [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (10 preceding siblings ...)
  2021-11-19 23:57 ` [RFC PATCH 11/15] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root David Matlack
@ 2021-11-19 23:57 ` David Matlack
  2021-11-22  5:05   ` Nikunj A. Dadhania
                     ` (2 more replies)
  2021-11-19 23:57 ` [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG David Matlack
                   ` (3 subsequent siblings)
  15 siblings, 3 replies; 77+ messages in thread
From: David Matlack @ 2021-11-19 23:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, David Matlack

When dirty logging is enabled without initially-all-set, attempt to
split all large pages in the memslot down to 4KB pages so that vCPUs
do not have to take expensive write-protection faults to split large
pages.

Large page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to
out-of-memory conditions. Failure to split a large page is fine from a
correctness standpoint because we still always follow it up by
write-protecting any remaining large pages.
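
Roughly, the intended ordering on the dirty-log-enable path is the
following (sketch only; the real hook is the small x86.c change later in
this patch, and the wrapper function name below is made up purely for
illustration):

  static void enable_dirty_logging_sketch(struct kvm *kvm,
                                          const struct kvm_memory_slot *slot)
  {
          /* Best effort: pre-split large pages so vCPUs don't have to. */
          kvm_mmu_slot_try_split_large_pages(kvm, slot, PG_LEVEL_4K);

          /*
           * Correctness: any large pages that could not be split (e.g.
           * because topping up kvm->arch.split_caches failed) are still
           * write-protected here, exactly as before this change.
           */
          kvm_mmu_slot_remove_write_access(kvm, slot, PG_LEVEL_4K);
  }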

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |   6 ++
 arch/x86/kvm/mmu/mmu.c          |  83 +++++++++++++++++++++
 arch/x86/kvm/mmu/mmu_internal.h |   3 +
 arch/x86/kvm/mmu/spte.c         |  46 ++++++++++++
 arch/x86/kvm/mmu/spte.h         |   1 +
 arch/x86/kvm/mmu/tdp_mmu.c      | 123 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h      |   5 ++
 arch/x86/kvm/x86.c              |   6 ++
 8 files changed, 273 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2a7564703ea6..432a4df817ec 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1232,6 +1232,9 @@ struct kvm_arch {
 	hpa_t	hv_root_tdp;
 	spinlock_t hv_root_tdp_lock;
 #endif
+
+	/* MMU caches used when splitting large pages during VM-ioctls. */
+	struct kvm_mmu_memory_caches split_caches;
 };
 
 struct kvm_vm_stat {
@@ -1588,6 +1591,9 @@ void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 				      const struct kvm_memory_slot *memslot,
 				      int start_level);
+void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
+					const struct kvm_memory_slot *memslot,
+					int target_level);
 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 54f0d2228135..6768ef9c0891 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -738,6 +738,66 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 					  PT64_ROOT_MAX_LEVEL);
 }
 
+static inline void assert_split_caches_invariants(struct kvm *kvm)
+{
+	/*
+	 * The split caches must only be modified while holding the slots_lock,
+	 * since it is only used during memslot VM-ioctls.
+	 */
+	lockdep_assert_held(&kvm->slots_lock);
+
+	/*
+	 * Only the TDP MMU supports large page splitting using
+	 * kvm->arch.split_caches, which is why we only have to allocate
+	 * page_header_cache and shadow_page_cache. Assert that the TDP
+	 * MMU is at least enabled when the split cache is allocated.
+	 */
+	BUG_ON(!is_tdp_mmu_enabled(kvm));
+}
+
+int mmu_topup_split_caches(struct kvm *kvm)
+{
+	struct kvm_mmu_memory_caches *split_caches = &kvm->arch.split_caches;
+	int r;
+
+	assert_split_caches_invariants(kvm);
+
+	r = kvm_mmu_topup_memory_cache(&split_caches->page_header_cache, 1);
+	if (r)
+		goto out;
+
+	r = kvm_mmu_topup_memory_cache(&split_caches->shadow_page_cache, 1);
+	if (r)
+		goto out;
+
+	return 0;
+
+out:
+	pr_warn("Failed to top-up split caches. Will not split large pages.\n");
+	return r;
+}
+
+static void mmu_free_split_caches(struct kvm *kvm)
+{
+	assert_split_caches_invariants(kvm);
+
+	kvm_mmu_free_memory_cache(&kvm->arch.split_caches.page_header_cache);
+	kvm_mmu_free_memory_cache(&kvm->arch.split_caches.shadow_page_cache);
+}
+
+bool mmu_split_caches_need_topup(struct kvm *kvm)
+{
+	assert_split_caches_invariants(kvm);
+
+	if (kvm->arch.split_caches.page_header_cache.nobjs == 0)
+		return true;
+
+	if (kvm->arch.split_caches.shadow_page_cache.nobjs == 0)
+		return true;
+
+	return false;
+}
+
 static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu_memory_caches *mmu_caches;
@@ -5696,6 +5756,7 @@ void kvm_mmu_init_vm(struct kvm *kvm)
 
 	spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
 
+	mmu_init_memory_caches(&kvm->arch.split_caches);
 	kvm_mmu_init_tdp_mmu(kvm);
 
 	node->track_write = kvm_mmu_pte_write;
@@ -5819,6 +5880,28 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
+void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
+					const struct kvm_memory_slot *memslot,
+					int target_level)
+{
+	u64 start, end;
+
+	if (!is_tdp_mmu_enabled(kvm))
+		return;
+
+	if (mmu_topup_split_caches(kvm))
+		return;
+
+	start = memslot->base_gfn;
+	end = start + memslot->npages;
+
+	read_lock(&kvm->mmu_lock);
+	kvm_tdp_mmu_try_split_large_pages(kvm, memslot, start, end, target_level);
+	read_unlock(&kvm->mmu_lock);
+
+	mmu_free_split_caches(kvm);
+}
+
 static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 					 struct kvm_rmap_head *rmap_head,
 					 const struct kvm_memory_slot *slot)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 52c6527b1a06..89b9b907c567 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -161,4 +161,7 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
+int mmu_topup_split_caches(struct kvm *kvm);
+bool mmu_split_caches_need_topup(struct kvm *kvm);
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index df2cdb8bcf77..6bb9b597a854 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -191,6 +191,52 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	return wrprot;
 }
 
+static u64 mark_spte_executable(u64 spte)
+{
+	bool is_access_track = is_access_track_spte(spte);
+
+	if (is_access_track)
+		spte = restore_acc_track_spte(spte);
+
+	spte &= ~shadow_nx_mask;
+	spte |= shadow_x_mask;
+
+	if (is_access_track)
+		spte = mark_spte_for_access_track(spte);
+
+	return spte;
+}
+
+/*
+ * Construct an SPTE that maps a sub-page of the given large SPTE. This is
+ * used during large page splitting, to build the SPTEs that make up the new
+ * page table.
+ */
+u64 make_large_page_split_spte(u64 large_spte, int level, int index, unsigned int access)
+{
+	u64 child_spte;
+	int child_level;
+
+	BUG_ON(is_mmio_spte(large_spte));
+	BUG_ON(!is_large_present_pte(large_spte));
+
+	child_spte = large_spte;
+	child_level = level - 1;
+
+	child_spte += (index * KVM_PAGES_PER_HPAGE(child_level)) << PAGE_SHIFT;
+
+	if (child_level == PG_LEVEL_4K) {
+		child_spte &= ~PT_PAGE_SIZE_MASK;
+
+		/* Allow execution for 4K pages if it was disabled for NX HugePages. */
+		if (is_nx_huge_page_enabled() && access & ACC_EXEC_MASK)
+			child_spte = mark_spte_executable(child_spte);
+	}
+
+	return child_spte;
+}
+
+
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
 {
 	u64 spte = SPTE_MMU_PRESENT_MASK;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 3e4943ee5a01..4efb4837e38d 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -339,6 +339,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
 	       bool host_writable, u64 *new_spte);
+u64 make_large_page_split_spte(u64 large_spte, int level, int index, unsigned int access);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
 u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
 u64 mark_spte_for_access_track(u64 spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5ca0fa659245..366857b9fb3b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -695,6 +695,39 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
 	return false;
 }
 
+static inline bool
+tdp_mmu_need_split_caches_topup_or_resched(struct kvm *kvm, struct tdp_iter *iter)
+{
+	if (mmu_split_caches_need_topup(kvm))
+		return true;
+
+	return tdp_mmu_iter_need_resched(kvm, iter);
+}
+
+static inline int
+tdp_mmu_topup_split_caches_resched(struct kvm *kvm, struct tdp_iter *iter, bool flush)
+{
+	int r;
+
+	rcu_read_unlock();
+
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+
+	read_unlock(&kvm->mmu_lock);
+
+	cond_resched();
+	r = mmu_topup_split_caches(kvm);
+
+	read_lock(&kvm->mmu_lock);
+
+	rcu_read_lock();
+	WARN_ON(iter->gfn > iter->next_last_level_gfn);
+	tdp_iter_restart(iter);
+
+	return r;
+}
+
 /*
  * Tears down the mappings for the range of gfns, [start, end), and frees the
  * non-root pages mapping GFNs strictly within that range. Returns true if
@@ -1241,6 +1274,96 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
 	return spte_set;
 }
 
+static bool tdp_mmu_split_large_page_atomic(struct kvm *kvm, struct tdp_iter *iter)
+{
+	const u64 large_spte = iter->old_spte;
+	const int level = iter->level;
+	struct kvm_mmu_page *child_sp;
+	u64 child_spte;
+	int i;
+
+	BUG_ON(mmu_split_caches_need_topup(kvm));
+
+	child_sp = alloc_child_tdp_mmu_page(&kvm->arch.split_caches, iter);
+
+	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
+		child_spte = make_large_page_split_spte(large_spte, level, i, ACC_ALL);
+
+		/*
+		 * No need for atomics since child_sp has not been installed
+		 * in the table yet and thus is not reachable by any other
+		 * thread.
+		 */
+		child_sp->spt[i] = child_spte;
+	}
+
+	return tdp_mmu_install_sp_atomic(kvm, iter, child_sp, false);
+}
+
+static void tdp_mmu_split_large_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
+					   gfn_t start, gfn_t end, int target_level)
+{
+	struct tdp_iter iter;
+	bool flush = false;
+	int r;
+
+	rcu_read_lock();
+
+	/*
+	 * Traverse the page table splitting all large pages above the target
+	 * level into one lower level. For example, if we encounter a 1GB page
+	 * we split it into 512 2MB pages.
+	 *
+	 * Since the TDP iterator uses a pre-order traversal, we are guaranteed
+	 * to visit an SPTE before ever visiting its children, which means we
+	 * will correctly recursively split large pages that are more than one
+	 * level above the target level (e.g. splitting 1GB to 2MB to 4KB).
+	 */
+	for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
+retry:
+		if (tdp_mmu_need_split_caches_topup_or_resched(kvm, &iter)) {
+			r = tdp_mmu_topup_split_caches_resched(kvm, &iter, flush);
+			flush = false;
+
+			/*
+			 * If topping up the split caches failed, we can't split
+			 * any more pages. Bail out of the loop.
+			 */
+			if (r)
+				break;
+
+			continue;
+		}
+
+		if (!is_large_present_pte(iter.old_spte))
+			continue;
+
+		if (!tdp_mmu_split_large_page_atomic(kvm, &iter))
+			goto retry;
+
+		flush = true;
+	}
+
+	rcu_read_unlock();
+
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+}
+
+void kvm_tdp_mmu_try_split_large_pages(struct kvm *kvm,
+				       const struct kvm_memory_slot *slot,
+				       gfn_t start, gfn_t end,
+				       int target_level)
+{
+	struct kvm_mmu_page *root;
+
+	lockdep_assert_held_read(&kvm->mmu_lock);
+
+	for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
+		tdp_mmu_split_large_pages_root(kvm, root, start, end, target_level);
+
+}
+
 /*
  * Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
  * AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 476b133544dd..7812087836b2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -72,6 +72,11 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn,
 				   int min_level);
 
+void kvm_tdp_mmu_try_split_large_pages(struct kvm *kvm,
+				       const struct kvm_memory_slot *slot,
+				       gfn_t start, gfn_t end,
+				       int target_level);
+
 static inline void kvm_tdp_mmu_walk_lockless_begin(void)
 {
 	rcu_read_lock();
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 04e8dabc187d..4702ebfd394b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11735,6 +11735,12 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 		if (kvm_dirty_log_manual_protect_and_init_set(kvm))
 			return;
 
+		/*
+		 * Attempt to split all large pages into 4K pages so that vCPUs
+		 * do not have to take write-protection faults.
+		 */
+		kvm_mmu_slot_try_split_large_pages(kvm, new, PG_LEVEL_4K);
+
 		if (kvm_x86_ops.cpu_dirty_log_size) {
 			kvm_mmu_slot_leaf_clear_dirty(kvm, new);
 			kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_2M);
-- 
2.34.0.rc2.393.gf8c9666880-goog


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
  2021-11-19 23:57 [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (11 preceding siblings ...)
  2021-11-19 23:57 ` [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled David Matlack
@ 2021-11-19 23:57 ` David Matlack
  2021-11-26 12:17   ` Peter Xu
  2021-12-01 19:22   ` Sean Christopherson
  2021-11-19 23:57 ` [RFC PATCH 14/15] KVM: x86/mmu: Add tracepoint for splitting large pages David Matlack
                   ` (2 subsequent siblings)
  15 siblings, 2 replies; 77+ messages in thread
From: David Matlack @ 2021-11-19 23:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, David Matlack

When using initially-all-set, large pages are not write-protected when
dirty logging is enabled on the memslot. Instead they are
write-protected once userspace invokes CLEAR_DIRTY_LOG for the first
time, and only for the specific sub-region of the memslot that userspace
wishes to clear.

Enhance CLEAR_DIRTY_LOG to also try to split large pages prior to
write-protecting them, to avoid causing write-protection faults on vCPU
threads. This also allows userspace to smear the cost of large page
splitting across multiple ioctls, rather than splitting the entire
memslot up front as is done when not using initially-all-set.

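For context, here is a minimal sketch of the userspace side that drives
this path (illustrative only, not part of this series; vm_fd, slot_id,
first_page, num_pages and bitmap are assumed to come from the VMM, and
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is assumed to already be enabled):

#include <err.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Clear dirty state for a 64-page-aligned sub-region of a memslot. With
 * this patch, KVM tries to split any large pages backing the sub-region
 * before write-protecting it.
 */
static void clear_dirty_region(int vm_fd, __u32 slot_id, __u64 first_page,
			       __u32 num_pages, unsigned long *bitmap)
{
	struct kvm_clear_dirty_log clear = {
		.slot = slot_id,
		.first_page = first_page,	/* must be a multiple of 64 */
		.num_pages = num_pages,		/* multiple of 64, or runs to slot end */
		.dirty_bitmap = bitmap,		/* 1 bit per page; set bits are cleared */
	};

	if (ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear) < 0)
		err(1, "KVM_CLEAR_DIRTY_LOG");
}
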
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  4 ++++
 arch/x86/kvm/mmu/mmu.c          | 30 ++++++++++++++++++++++--------
 2 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 432a4df817ec..6b5bf99f57af 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1591,6 +1591,10 @@ void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 				      const struct kvm_memory_slot *memslot,
 				      int start_level);
+void kvm_mmu_try_split_large_pages(struct kvm *kvm,
+				   const struct kvm_memory_slot *memslot,
+				   u64 start, u64 end,
+				   int target_level);
 void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
 					const struct kvm_memory_slot *memslot,
 					int target_level);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6768ef9c0891..4e78ef2dd352 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 		gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
 		gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
 
+		/*
+		 * Try to proactively split any large pages down to 4KB so that
+		 * vCPUs don't have to take write-protection faults.
+		 */
+		kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
+
 		kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
 
 		/* Cross two large pages? */
@@ -5880,21 +5886,17 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
-void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
-					const struct kvm_memory_slot *memslot,
-					int target_level)
+void kvm_mmu_try_split_large_pages(struct kvm *kvm,
+				   const struct kvm_memory_slot *memslot,
+				   u64 start, u64 end,
+				   int target_level)
 {
-	u64 start, end;
-
 	if (!is_tdp_mmu_enabled(kvm))
 		return;
 
 	if (mmu_topup_split_caches(kvm))
 		return;
 
-	start = memslot->base_gfn;
-	end = start + memslot->npages;
-
 	read_lock(&kvm->mmu_lock);
 	kvm_tdp_mmu_try_split_large_pages(kvm, memslot, start, end, target_level);
 	read_unlock(&kvm->mmu_lock);
@@ -5902,6 +5904,18 @@ void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
 	mmu_free_split_caches(kvm);
 }
 
+void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
+					const struct kvm_memory_slot *memslot,
+					int target_level)
+{
+	u64 start, end;
+
+	start = memslot->base_gfn;
+	end = start + memslot->npages;
+
+	kvm_mmu_try_split_large_pages(kvm, memslot, start, end, target_level);
+}
+
 static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 					 struct kvm_rmap_head *rmap_head,
 					 const struct kvm_memory_slot *slot)
-- 
2.34.0.rc2.393.gf8c9666880-goog


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [RFC PATCH 14/15] KVM: x86/mmu: Add tracepoint for splitting large pages
  2021-11-19 23:57 [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (12 preceding siblings ...)
  2021-11-19 23:57 ` [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG David Matlack
@ 2021-11-19 23:57 ` David Matlack
  2021-11-19 23:57 ` [RFC PATCH 15/15] KVM: x86/mmu: Update page stats when " David Matlack
  2021-11-26 14:13 ` [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU Peter Xu
  15 siblings, 0 replies; 77+ messages in thread
From: David Matlack @ 2021-11-19 23:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, David Matlack

Add a tracepoint that records whenever we split a large page.

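As a usage note (not part of the patch): assuming this lands in
mmutrace.h under the existing kvmmmu TRACE_SYSTEM, the event can be
enabled at runtime like the other KVM MMU tracepoints, e.g. via
events/kvmmmu/kvm_mmu_split_large_page/enable in tracefs.
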
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmutrace.h | 20 ++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c  |  2 ++
 2 files changed, 22 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index b8151bbca36a..4adb794470ae 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -416,6 +416,26 @@ TRACE_EVENT(
 	)
 );
 
+TRACE_EVENT(
+	kvm_mmu_split_large_page,
+	TP_PROTO(u64 gfn, u64 spte, int level),
+	TP_ARGS(gfn, spte, level),
+
+	TP_STRUCT__entry(
+		__field(u64, gfn)
+		__field(u64, spte)
+		__field(int, level)
+	),
+
+	TP_fast_assign(
+		__entry->gfn = gfn;
+		__entry->spte = spte;
+		__entry->level = level;
+	),
+
+	TP_printk("gfn %llx spte %llx level %d", __entry->gfn, __entry->spte, __entry->level)
+);
+
 #endif /* _TRACE_KVMMMU_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 366857b9fb3b..8f60d942c789 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1284,6 +1284,8 @@ static bool tdp_mmu_split_large_page_atomic(struct kvm *kvm, struct tdp_iter *it
 
 	BUG_ON(mmu_split_caches_need_topup(kvm));
 
+	trace_kvm_mmu_split_large_page(iter->gfn, large_spte, level);
+
 	child_sp = alloc_child_tdp_mmu_page(&kvm->arch.split_caches, iter);
 
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
-- 
2.34.0.rc2.393.gf8c9666880-goog


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [RFC PATCH 15/15] KVM: x86/mmu: Update page stats when splitting large pages
  2021-11-19 23:57 [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (13 preceding siblings ...)
  2021-11-19 23:57 ` [RFC PATCH 14/15] KVM: x86/mmu: Add tracepoint for splitting large pages David Matlack
@ 2021-11-19 23:57 ` David Matlack
  2021-12-01 19:36   ` Sean Christopherson
  2021-11-26 14:13 ` [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU Peter Xu
  15 siblings, 1 reply; 77+ messages in thread
From: David Matlack @ 2021-11-19 23:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, David Matlack

When splitting large pages we need to update the page stats to reflect
all of the new pages at the lower level. We do not need to change the
page stats for the large page that was removed, as that is already
handled by tdp_mmu_set_spte_atomic.

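For example (illustrative, with PT64_ENT_PER_PAGE == 512): splitting one
1GB page installs 512 new 2MB SPTEs, so the 2MB page count goes up by
512; splitting one of those 2MB pages then adds another 512 to the 4KB
page count.
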
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 8f60d942c789..4c313613a939 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1299,7 +1299,12 @@ static bool tdp_mmu_split_large_page_atomic(struct kvm *kvm, struct tdp_iter *it
 		child_sp->spt[i] = child_spte;
 	}
 
-	return tdp_mmu_install_sp_atomic(kvm, iter, child_sp, false);
+	if (!tdp_mmu_install_sp_atomic(kvm, iter, child_sp, false))
+		return false;
+
+	kvm_update_page_stats(kvm, level - 1, PT64_ENT_PER_PAGE);
+
+	return true;
 }
 
 static void tdp_mmu_split_large_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
-- 
2.34.0.rc2.393.gf8c9666880-goog


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 06/15] KVM: x86/mmu: Derive page role from parent
  2021-11-19 23:57 ` [RFC PATCH 06/15] KVM: x86/mmu: Derive page role from parent David Matlack
@ 2021-11-20 12:53   ` Paolo Bonzini
  2021-11-27  2:07     ` Lai Jiangshan
  2021-11-30 23:31     ` David Matlack
  0 siblings, 2 replies; 77+ messages in thread
From: Paolo Bonzini @ 2021-11-20 12:53 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On 11/20/21 00:57, David Matlack wrote:
> Derive the page role from the parent shadow page, since the only thing
> that changes is the level. This is in preparation for eagerly splitting
> large pages during VM-ioctls which does not have access to the vCPU
> MMU context.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>   arch/x86/kvm/mmu/tdp_mmu.c | 43 ++++++++++++++++++++------------------
>   1 file changed, 23 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b70707a7fe87..1a409992a57f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -157,23 +157,8 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
>   		if (kvm_mmu_page_as_id(_root) != _as_id) {		\
>   		} else
>   
> -static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
> -						   int level)
> -{
> -	union kvm_mmu_page_role role;
> -
> -	role = vcpu->arch.mmu->mmu_role.base;
> -	role.level = level;
> -	role.direct = true;
> -	role.gpte_is_8_bytes = true;
> -	role.access = ACC_ALL;
> -	role.ad_disabled = !shadow_accessed_mask;
> -
> -	return role;
> -}
> -
>   static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> -					       int level)
> +					       union kvm_mmu_page_role role)
>   {
>   	struct kvm_mmu_memory_caches *mmu_caches;
>   	struct kvm_mmu_page *sp;
> @@ -184,7 +169,7 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>   	sp->spt = kvm_mmu_memory_cache_alloc(&mmu_caches->shadow_page_cache);
>   	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>   
> -	sp->role.word = page_role_for_level(vcpu, level).word;
> +	sp->role = role;
>   	sp->gfn = gfn;
>   	sp->tdp_mmu_page = true;
>   
> @@ -193,6 +178,19 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>   	return sp;
>   }
>   
> +static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu, struct tdp_iter *iter)
> +{
> +	struct kvm_mmu_page *parent_sp;
> +	union kvm_mmu_page_role role;
> +
> +	parent_sp = sptep_to_sp(rcu_dereference(iter->sptep));
> +
> +	role = parent_sp->role;
> +	role.level--;
> +
> +	return alloc_tdp_mmu_page(vcpu, iter->gfn, role);
> +}
> +
>   hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
>   {
>   	union kvm_mmu_page_role role;
> @@ -201,7 +199,12 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
>   
>   	lockdep_assert_held_write(&kvm->mmu_lock);
>   
> -	role = page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level);
> +	role = vcpu->arch.mmu->mmu_role.base;
> +	role.level = vcpu->arch.mmu->shadow_root_level;
> +	role.direct = true;
> +	role.gpte_is_8_bytes = true;
> +	role.access = ACC_ALL;
> +	role.ad_disabled = !shadow_accessed_mask;

I have a similar patch for the old MMU, but it was also replacing 
shadow_root_level with shadow_root_role.  I'll see if I can adapt it to 
the TDP MMU, since the shadow_root_role is obviously the same for both.

Paolo

>   	/* Check for an existing root before allocating a new one. */
>   	for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
> @@ -210,7 +213,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
>   			goto out;
>   	}
>   
> -	root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
> +	root = alloc_tdp_mmu_page(vcpu, 0, role);
>   	refcount_set(&root->tdp_mmu_root_count, 1);
>   
>   	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> @@ -1028,7 +1031,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   			if (is_removed_spte(iter.old_spte))
>   				break;
>   
> -			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level - 1);
> +			sp = alloc_child_tdp_mmu_page(vcpu, &iter);
>   			if (!tdp_mmu_install_sp_atomic(vcpu->kvm, &iter, sp, account_nx))
>   				break;
>   		}
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
  2021-11-19 23:57 ` [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled David Matlack
@ 2021-11-22  5:05   ` Nikunj A. Dadhania
  2021-11-30 23:33     ` David Matlack
  2021-11-22 19:30   ` Ben Gardon
  2021-11-26 12:01   ` Peter Xu
  2 siblings, 1 reply; 77+ messages in thread
From: Nikunj A. Dadhania @ 2021-11-22  5:05 UTC (permalink / raw)
  To: David Matlack, Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, nikunj



On 11/20/2021 5:27 AM, David Matlack wrote:
> When dirty logging is enabled without initially-all-set, attempt to
> split all large pages in the memslot down to 4KB pages so that vCPUs
> do not have to take expensive write-protection faults to split large
> pages.
> 
> Large page splitting is best-effort only. This commit only adds the
> support for the TDP MMU, and even there splitting may fail due to out
> of memory conditions. Failure to split a large page is fine from a
> correctness standpoint because we still always follow it up by write-
> protecting any remaining large pages.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>

> +int mmu_topup_split_caches(struct kvm *kvm)
> +{
> +	struct kvm_mmu_memory_caches *split_caches = &kvm->arch.split_caches;
> +	int r;
> +
> +	assert_split_caches_invariants(kvm);
> +
> +	r = kvm_mmu_topup_memory_cache(&split_caches->page_header_cache, 1);
> +	if (r)
> +		goto out;
> +
> +	r = kvm_mmu_topup_memory_cache(&split_caches->shadow_page_cache, 1);
> +	if (r)
> +		goto out;
> +
> +	return 0;
> +
> +out:
> +	pr_warn("Failed to top-up split caches. Will not split large pages.\n");
> +	return r;
> +}
> +
> +static void mmu_free_split_caches(struct kvm *kvm)
> +{
> +	assert_split_caches_invariants(kvm);
> +
> +	kvm_mmu_free_memory_cache(&kvm->arch.split_caches.pte_list_desc_cache);
                                                              ^^^^^^^^^^^^^^
I believe this should be page_header_cache.
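i.e., presumably:

	kvm_mmu_free_memory_cache(&kvm->arch.split_caches.page_header_cache);

so that the cache that was topped up is also the one that gets freed.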

> +	kvm_mmu_free_memory_cache(&kvm->arch.split_caches.shadow_page_cache);
> +}

Regards
Nikunj


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 01/15] KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn
  2021-11-19 23:57 ` [RFC PATCH 01/15] KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn David Matlack
@ 2021-11-22 18:52   ` Ben Gardon
  2021-11-26 12:18   ` Peter Xu
  1 sibling, 0 replies; 77+ messages in thread
From: Ben Gardon @ 2021-11-22 18:52 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Fri, Nov 19, 2021 at 3:58 PM David Matlack <dmatlack@google.com> wrote:
>
> rmap_write_protect is a poor name because we may not even touch the rmap
> if the TDP MMU is in use. It is also confusing that rmap_write_protect
> is not a simpler wrapper around __rmap_write_protect, since that is the
> typical flow for functions with double-underscore names.
>
> Rename it to kvm_vcpu_write_protect_gfn to convey that we are
> write-protecting a specific gfn in the context of a vCPU.
>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Ben Gardon <bgardon@google.com>


> ---
>  arch/x86/kvm/mmu/mmu.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8f0035517450..16ffb571bc75 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1427,7 +1427,7 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
>         return write_protected;
>  }
>
> -static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
> +static bool kvm_vcpu_write_protect_gfn(struct kvm_vcpu *vcpu, u64 gfn)
>  {
>         struct kvm_memory_slot *slot;
>
> @@ -2026,7 +2026,7 @@ static int mmu_sync_children(struct kvm_vcpu *vcpu,
>                 bool protected = false;
>
>                 for_each_sp(pages, sp, parents, i)
> -                       protected |= rmap_write_protect(vcpu, sp->gfn);
> +                       protected |= kvm_vcpu_write_protect_gfn(vcpu, sp->gfn);
>
>                 if (protected) {
>                         kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, true);
> @@ -2153,7 +2153,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>         hlist_add_head(&sp->hash_link, sp_list);
>         if (!direct) {
>                 account_shadowed(vcpu->kvm, sp);
> -               if (level == PG_LEVEL_4K && rmap_write_protect(vcpu, gfn))
> +               if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
>                         kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
>         }
>         trace_kvm_mmu_get_page(sp, true);
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 02/15] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect
  2021-11-19 23:57 ` [RFC PATCH 02/15] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect David Matlack
@ 2021-11-22 18:52   ` Ben Gardon
  2021-11-26 12:18   ` Peter Xu
  1 sibling, 0 replies; 77+ messages in thread
From: Ben Gardon @ 2021-11-22 18:52 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Fri, Nov 19, 2021 at 3:58 PM David Matlack <dmatlack@google.com> wrote:
>
> Now that rmap_write_protect has been renamed, there is no need for the
> double underscores in front of __rmap_write_protect.
>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Ben Gardon <bgardon@google.com>


> ---
>  arch/x86/kvm/mmu/mmu.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 16ffb571bc75..1146f87044a6 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1235,9 +1235,9 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
>         return mmu_spte_update(sptep, spte);
>  }
>
> -static bool __rmap_write_protect(struct kvm *kvm,
> -                                struct kvm_rmap_head *rmap_head,
> -                                bool pt_protect)
> +static bool rmap_write_protect(struct kvm *kvm,
> +                              struct kvm_rmap_head *rmap_head,
> +                              bool pt_protect)
>  {
>         u64 *sptep;
>         struct rmap_iterator iter;
> @@ -1317,7 +1317,7 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
>         while (mask) {
>                 rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
>                                         PG_LEVEL_4K, slot);
> -               __rmap_write_protect(kvm, rmap_head, false);
> +               rmap_write_protect(kvm, rmap_head, false);
>
>                 /* clear the first set bit */
>                 mask &= mask - 1;
> @@ -1416,7 +1416,7 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
>         if (kvm_memslots_have_rmaps(kvm)) {
>                 for (i = min_level; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
>                         rmap_head = gfn_to_rmap(gfn, i, slot);
> -                       write_protected |= __rmap_write_protect(kvm, rmap_head, true);
> +                       write_protected |= rmap_write_protect(kvm, rmap_head, true);
>                 }
>         }
>
> @@ -5780,7 +5780,7 @@ static bool slot_rmap_write_protect(struct kvm *kvm,
>                                     struct kvm_rmap_head *rmap_head,
>                                     const struct kvm_memory_slot *slot)
>  {
> -       return __rmap_write_protect(kvm, rmap_head, false);
> +       return rmap_write_protect(kvm, rmap_head, false);
>  }
>
>  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 03/15] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
  2021-11-19 23:57 ` [RFC PATCH 03/15] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails David Matlack
@ 2021-11-22 18:52   ` Ben Gardon
  2021-11-30 23:25     ` David Matlack
  0 siblings, 1 reply; 77+ messages in thread
From: Ben Gardon @ 2021-11-22 18:52 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Fri, Nov 19, 2021 at 3:58 PM David Matlack <dmatlack@google.com> wrote:
>
> Consolidate a bunch of code that was manually re-reading the spte if the
> cmpxchg fails. There is no extra cost of doing this because we already
> have the spte value as a result of the cmpxchg (and in fact this
> eliminates re-reading the spte), and none of the call sites depend on
> iter->old_spte retaining the stale spte value.
>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 56 ++++++++++++--------------------------
>  1 file changed, 18 insertions(+), 38 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 377a96718a2e..cc9fe33c9b36 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -492,16 +492,22 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>   * and handle the associated bookkeeping.  Do not mark the page dirty
>   * in KVM's dirty bitmaps.
>   *
> + * If setting the SPTE fails because it has changed, iter->old_spte will be
> + * updated with the updated value of the spte.
> + *
>   * @kvm: kvm instance
>   * @iter: a tdp_iter instance currently on the SPTE that should be set
>   * @new_spte: The value the SPTE should be set to
>   * Returns: true if the SPTE was set, false if it was not. If false is returned,
> - *         this function will have no side-effects.
> + *          this function will have no side-effects other than updating
> + *          iter->old_spte to the latest value of spte.
>   */
>  static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>                                            struct tdp_iter *iter,
>                                            u64 new_spte)
>  {
> +       u64 old_spte;
> +
>         lockdep_assert_held_read(&kvm->mmu_lock);
>
>         /*
> @@ -515,9 +521,11 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>          * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
>          * does not hold the mmu_lock.
>          */
> -       if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
> -                     new_spte) != iter->old_spte)
> +       old_spte = cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte, new_spte);

This probably deserves a comment:

/*
 * If the old_spte values differ, the cmpxchg failed. Update
 * iter->old_spte with the value inserted by another thread.
 */

> +       if (old_spte != iter->old_spte) {
> +               iter->old_spte = old_spte;
>                 return false;
> +       }
>
>         __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
>                               new_spte, iter->level, true);
> @@ -747,14 +755,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>                 if (!shared) {
>                         tdp_mmu_set_spte(kvm, &iter, 0);
>                         flush = true;
> -               } else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
> -                       /*
> -                        * The iter must explicitly re-read the SPTE because
> -                        * the atomic cmpxchg failed.
> -                        */
> -                       iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));

I think kernel style is to include the curly braces on the else if, if
the if had them.


> +               } else if (!tdp_mmu_zap_spte_atomic(kvm, &iter))
>                         goto retry;
> -               }
>         }
>
>         rcu_read_unlock();
> @@ -978,13 +980,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                     is_large_pte(iter.old_spte)) {
>                         if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
>                                 break;
> -
> -                       /*
> -                        * The iter must explicitly re-read the spte here
> -                        * because the new value informs the !present
> -                        * path below.
> -                        */
> -                       iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
>                 }
>
>                 if (!is_shadow_present_pte(iter.old_spte)) {
> @@ -1190,14 +1185,9 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>
>                 new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
>
> -               if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte)) {
> -                       /*
> -                        * The iter must explicitly re-read the SPTE because
> -                        * the atomic cmpxchg failed.
> -                        */
> -                       iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> +               if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
>                         goto retry;
> -               }
> +
>                 spte_set = true;
>         }
>
> @@ -1258,14 +1248,9 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>                                 continue;
>                 }
>
> -               if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte)) {
> -                       /*
> -                        * The iter must explicitly re-read the SPTE because
> -                        * the atomic cmpxchg failed.
> -                        */
> -                       iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> +               if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
>                         goto retry;
> -               }
> +
>                 spte_set = true;
>         }
>
> @@ -1391,14 +1376,9 @@ static bool zap_collapsible_spte_range(struct kvm *kvm,
>                                                             pfn, PG_LEVEL_NUM))
>                         continue;
>
> -               if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
> -                       /*
> -                        * The iter must explicitly re-read the SPTE because
> -                        * the atomic cmpxchg failed.
> -                        */
> -                       iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> +               if (!tdp_mmu_zap_spte_atomic(kvm, &iter))
>                         goto retry;
> -               }
> +
>                 flush = true;
>         }
>
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 04/15] KVM: x86/mmu: Factor out logic to atomically install a new page table
  2021-11-19 23:57 ` [RFC PATCH 04/15] KVM: x86/mmu: Factor out logic to atomically install a new page table David Matlack
@ 2021-11-22 18:52   ` Ben Gardon
  2021-11-30 23:27     ` David Matlack
  2021-12-01 19:13   ` Sean Christopherson
  1 sibling, 1 reply; 77+ messages in thread
From: Ben Gardon @ 2021-11-22 18:52 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Fri, Nov 19, 2021 at 3:58 PM David Matlack <dmatlack@google.com> wrote:
>
> Factor out the logic to atomically replace an SPTE with an SPTE that
> points to a new page table. This will be used in a follow-up commit to
> split a large page SPTE into one level lower.
>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 53 ++++++++++++++++++++++++++------------
>  1 file changed, 37 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index cc9fe33c9b36..9ee3f4f7fdf5 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -945,6 +945,39 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>         return ret;
>  }
>
> +/*
> + * tdp_mmu_install_sp_atomic - Atomically replace the given spte with an
> + * spte pointing to the provided page table.
> + *
> + * @kvm: kvm instance
> + * @iter: a tdp_iter instance currently on the SPTE that should be set
> + * @sp: The new TDP page table to install.
> + * @account_nx: True if this page table is being installed to split a
> + *              non-executable huge page.
> + *
> + * Returns: True if the new page table was installed. False if spte being
> + *          replaced changed, causing the atomic compare-exchange to fail.
> + *          If this function returns false the sp will be freed before
> + *          returning.
> + */
> +static bool tdp_mmu_install_sp_atomic(struct kvm *kvm,
> +                                     struct tdp_iter *iter,
> +                                     struct kvm_mmu_page *sp,
> +                                     bool account_nx)
> +{
> +       u64 spte;
> +
> +       spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
> +
> +       if (tdp_mmu_set_spte_atomic(kvm, iter, spte)) {
> +               tdp_mmu_link_page(kvm, sp, account_nx);
> +               return true;
> +       } else {
> +               tdp_mmu_free_sp(sp);
> +               return false;
> +       }
> +}
> +
>  /*
>   * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
>   * page tables and SPTEs to translate the faulting guest physical address.
> @@ -954,8 +987,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>         struct kvm_mmu *mmu = vcpu->arch.mmu;
>         struct tdp_iter iter;
>         struct kvm_mmu_page *sp;
> -       u64 *child_pt;
> -       u64 new_spte;
>         int ret;
>
>         kvm_mmu_hugepage_adjust(vcpu, fault);
> @@ -983,6 +1014,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                 }
>
>                 if (!is_shadow_present_pte(iter.old_spte)) {
> +                       bool account_nx = fault->huge_page_disallowed &&
> +                                         fault->req_level >= iter.level;
> +
>                         /*
>                          * If SPTE has been frozen by another thread, just
>                          * give up and retry, avoiding unnecessary page table
> @@ -992,21 +1026,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                                 break;
>
>                         sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level - 1);
> -                       child_pt = sp->spt;
> -
> -                       new_spte = make_nonleaf_spte(child_pt,
> -                                                    !shadow_accessed_mask);
> -
> -                       if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, new_spte)) {
> -                               tdp_mmu_link_page(vcpu->kvm, sp,
> -                                                 fault->huge_page_disallowed &&
> -                                                 fault->req_level >= iter.level);
> -
> -                               trace_kvm_mmu_get_page(sp, true);

This refactoring drops this trace point. Is that intentional?


> -                       } else {
> -                               tdp_mmu_free_sp(sp);
> +                       if (!tdp_mmu_install_sp_atomic(vcpu->kvm, &iter, sp, account_nx))
>                                 break;
> -                       }
>                 }
>         }
>
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 05/15] KVM: x86/mmu: Abstract mmu caches out to a separate struct
  2021-11-19 23:57 ` [RFC PATCH 05/15] KVM: x86/mmu: Abstract mmu caches out to a separate struct David Matlack
@ 2021-11-22 18:55   ` Ben Gardon
  2021-11-22 18:55     ` Ben Gardon
  2021-11-30 23:28     ` David Matlack
  0 siblings, 2 replies; 77+ messages in thread
From: Ben Gardon @ 2021-11-22 18:55 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Fri, Nov 19, 2021 at 3:58 PM David Matlack <dmatlack@google.com> wrote:
>
> Move the kvm_mmu_memory_cache structs into a separate wrapper struct.
> This is in preparation for eagerly splitting all large pages during
> VM-ioctls (i.e. not in the vCPU fault path) which will require adding
> kvm_mmu_memory_cache structs to struct kvm_arch.
>
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Ben Gardon

I don't think this patch creates any functional change. If that's the
intent, it'd be worth noting.


> ---
>  arch/x86/include/asm/kvm_host.h | 12 ++++---
>  arch/x86/kvm/mmu/mmu.c          | 59 ++++++++++++++++++++++-----------
>  arch/x86/kvm/mmu/tdp_mmu.c      |  7 ++--
>  3 files changed, 52 insertions(+), 26 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 1fcb345bc107..2a7564703ea6 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -612,6 +612,13 @@ struct kvm_vcpu_xen {
>         u64 runstate_times[4];
>  };
>
> +struct kvm_mmu_memory_caches {
> +       struct kvm_mmu_memory_cache pte_list_desc_cache;
> +       struct kvm_mmu_memory_cache shadow_page_cache;
> +       struct kvm_mmu_memory_cache gfn_array_cache;
> +       struct kvm_mmu_memory_cache page_header_cache;
> +};
> +
>  struct kvm_vcpu_arch {
>         /*
>          * rip and regs accesses must go through
> @@ -681,10 +688,7 @@ struct kvm_vcpu_arch {
>          */
>         struct kvm_mmu *walk_mmu;
>
> -       struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> -       struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> -       struct kvm_mmu_memory_cache mmu_gfn_array_cache;
> -       struct kvm_mmu_memory_cache mmu_page_header_cache;
> +       struct kvm_mmu_memory_caches mmu_caches;
>
>         /*
>          * QEMU userspace and the guest each have their own FPU state.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1146f87044a6..537952574211 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -732,38 +732,60 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
>
>  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>  {
> +       struct kvm_mmu_memory_caches *mmu_caches;
>         int r;
>
> +       mmu_caches = &vcpu->arch.mmu_caches;
> +
>         /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
> -       r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> +       r = kvm_mmu_topup_memory_cache(&mmu_caches->pte_list_desc_cache,
>                                        1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
>         if (r)
>                 return r;
> -       r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> +       r = kvm_mmu_topup_memory_cache(&mmu_caches->shadow_page_cache,
>                                        PT64_ROOT_MAX_LEVEL);
>         if (r)
>                 return r;
>         if (maybe_indirect) {
> -               r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_gfn_array_cache,
> +               r = kvm_mmu_topup_memory_cache(&mmu_caches->gfn_array_cache,
>                                                PT64_ROOT_MAX_LEVEL);
>                 if (r)
>                         return r;
>         }
> -       return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
> +       return kvm_mmu_topup_memory_cache(&mmu_caches->page_header_cache,
>                                           PT64_ROOT_MAX_LEVEL);
>  }
>
>  static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
>  {
> -       kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> -       kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> -       kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
> -       kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> +       struct kvm_mmu_memory_caches *mmu_caches;
> +
> +       mmu_caches = &vcpu->arch.mmu_caches;
> +
> +       kvm_mmu_free_memory_cache(&mmu_caches->pte_list_desc_cache);
> +       kvm_mmu_free_memory_cache(&mmu_caches->shadow_page_cache);
> +       kvm_mmu_free_memory_cache(&mmu_caches->gfn_array_cache);
> +       kvm_mmu_free_memory_cache(&mmu_caches->page_header_cache);
> +}
> +
> +static void mmu_init_memory_caches(struct kvm_mmu_memory_caches *caches)
> +{
> +       caches->pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> +       caches->pte_list_desc_cache.gfp_zero = __GFP_ZERO;
> +
> +       caches->page_header_cache.kmem_cache = mmu_page_header_cache;
> +       caches->page_header_cache.gfp_zero = __GFP_ZERO;
> +
> +       caches->shadow_page_cache.gfp_zero = __GFP_ZERO;
>  }
>
>  static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
>  {
> -       return kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_list_desc_cache);
> +       struct kvm_mmu_memory_caches *mmu_caches;
> +
> +       mmu_caches = &vcpu->arch.mmu_caches;
> +
> +       return kvm_mmu_memory_cache_alloc(&mmu_caches->pte_list_desc_cache);
>  }
>
>  static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
> @@ -1071,7 +1093,7 @@ static bool rmap_can_add(struct kvm_vcpu *vcpu)
>  {
>         struct kvm_mmu_memory_cache *mc;
>
> -       mc = &vcpu->arch.mmu_pte_list_desc_cache;
> +       mc = &vcpu->arch.mmu_caches.pte_list_desc_cache;
>         return kvm_mmu_memory_cache_nr_free_objects(mc);
>  }
>
> @@ -1742,12 +1764,15 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
>
>  static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
>  {
> +       struct kvm_mmu_memory_caches *mmu_caches;
>         struct kvm_mmu_page *sp;
>
> -       sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> -       sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> +       mmu_caches = &vcpu->arch.mmu_caches;
> +
> +       sp = kvm_mmu_memory_cache_alloc(&mmu_caches->page_header_cache);
> +       sp->spt = kvm_mmu_memory_cache_alloc(&mmu_caches->shadow_page_cache);
>         if (!direct)
> -               sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
> +               sp->gfns = kvm_mmu_memory_cache_alloc(&mmu_caches->gfn_array_cache);
>         set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>
>         /*
> @@ -5544,13 +5569,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
>  {
>         int ret;
>
> -       vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> -       vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
> -
> -       vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> -       vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> -
> -       vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> +       mmu_init_memory_caches(&vcpu->arch.mmu_caches);
>
>         vcpu->arch.mmu = &vcpu->arch.root_mmu;
>         vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 9ee3f4f7fdf5..b70707a7fe87 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -175,10 +175,13 @@ static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
>  static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>                                                int level)
>  {
> +       struct kvm_mmu_memory_caches *mmu_caches;
>         struct kvm_mmu_page *sp;
>
> -       sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> -       sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> +       mmu_caches = &vcpu->arch.mmu_caches;
> +
> +       sp = kvm_mmu_memory_cache_alloc(&mmu_caches->page_header_cache);
> +       sp->spt = kvm_mmu_memory_cache_alloc(&mmu_caches->shadow_page_cache);
>         set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>
>         sp->role.word = page_role_for_level(vcpu, level).word;
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 05/15] KVM: x86/mmu: Abstract mmu caches out to a separate struct
  2021-11-22 18:55   ` Ben Gardon
@ 2021-11-22 18:55     ` Ben Gardon
  2021-11-30 23:28     ` David Matlack
  1 sibling, 0 replies; 77+ messages in thread
From: Ben Gardon @ 2021-11-22 18:55 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Mon, Nov 22, 2021 at 10:55 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Fri, Nov 19, 2021 at 3:58 PM David Matlack <dmatlack@google.com> wrote:
> >
> > Move the kvm_mmu_memory_cache structs into a separate wrapper struct.
> > This is in preparation for eagerly splitting all large pages during
> > VM-ioctls (i.e. not in the vCPU fault path) which will require adding
> > kvm_mmu_memory_cache structs to struct kvm_arch.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
>
> Reviewed-by: Ben Gardon

Woops
Reviewed-by: Ben Gardon <bgardon@google.com>

>
> I don't think this patch creates any functional change. If that's the
> intent, it'd be worth noting.
>
>
> > ---
> >  arch/x86/include/asm/kvm_host.h | 12 ++++---
> >  arch/x86/kvm/mmu/mmu.c          | 59 ++++++++++++++++++++++-----------
> >  arch/x86/kvm/mmu/tdp_mmu.c      |  7 ++--
> >  3 files changed, 52 insertions(+), 26 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 1fcb345bc107..2a7564703ea6 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -612,6 +612,13 @@ struct kvm_vcpu_xen {
> >         u64 runstate_times[4];
> >  };
> >
> > +struct kvm_mmu_memory_caches {
> > +       struct kvm_mmu_memory_cache pte_list_desc_cache;
> > +       struct kvm_mmu_memory_cache shadow_page_cache;
> > +       struct kvm_mmu_memory_cache gfn_array_cache;
> > +       struct kvm_mmu_memory_cache page_header_cache;
> > +};
> > +
> >  struct kvm_vcpu_arch {
> >         /*
> >          * rip and regs accesses must go through
> > @@ -681,10 +688,7 @@ struct kvm_vcpu_arch {
> >          */
> >         struct kvm_mmu *walk_mmu;
> >
> > -       struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> > -       struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> > -       struct kvm_mmu_memory_cache mmu_gfn_array_cache;
> > -       struct kvm_mmu_memory_cache mmu_page_header_cache;
> > +       struct kvm_mmu_memory_caches mmu_caches;
> >
> >         /*
> >          * QEMU userspace and the guest each have their own FPU state.
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 1146f87044a6..537952574211 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -732,38 +732,60 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> >
> >  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> >  {
> > +       struct kvm_mmu_memory_caches *mmu_caches;
> >         int r;
> >
> > +       mmu_caches = &vcpu->arch.mmu_caches;
> > +
> >         /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
> > -       r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> > +       r = kvm_mmu_topup_memory_cache(&mmu_caches->pte_list_desc_cache,
> >                                        1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> >         if (r)
> >                 return r;
> > -       r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > +       r = kvm_mmu_topup_memory_cache(&mmu_caches->shadow_page_cache,
> >                                        PT64_ROOT_MAX_LEVEL);
> >         if (r)
> >                 return r;
> >         if (maybe_indirect) {
> > -               r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_gfn_array_cache,
> > +               r = kvm_mmu_topup_memory_cache(&mmu_caches->gfn_array_cache,
> >                                                PT64_ROOT_MAX_LEVEL);
> >                 if (r)
> >                         return r;
> >         }
> > -       return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
> > +       return kvm_mmu_topup_memory_cache(&mmu_caches->page_header_cache,
> >                                           PT64_ROOT_MAX_LEVEL);
> >  }
> >
> >  static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> >  {
> > -       kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> > -       kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> > -       kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
> > -       kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> > +       struct kvm_mmu_memory_caches *mmu_caches;
> > +
> > +       mmu_caches = &vcpu->arch.mmu_caches;
> > +
> > +       kvm_mmu_free_memory_cache(&mmu_caches->pte_list_desc_cache);
> > +       kvm_mmu_free_memory_cache(&mmu_caches->shadow_page_cache);
> > +       kvm_mmu_free_memory_cache(&mmu_caches->gfn_array_cache);
> > +       kvm_mmu_free_memory_cache(&mmu_caches->page_header_cache);
> > +}
> > +
> > +static void mmu_init_memory_caches(struct kvm_mmu_memory_caches *caches)
> > +{
> > +       caches->pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> > +       caches->pte_list_desc_cache.gfp_zero = __GFP_ZERO;
> > +
> > +       caches->page_header_cache.kmem_cache = mmu_page_header_cache;
> > +       caches->page_header_cache.gfp_zero = __GFP_ZERO;
> > +
> > +       caches->shadow_page_cache.gfp_zero = __GFP_ZERO;
> >  }
> >
> >  static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
> >  {
> > -       return kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_list_desc_cache);
> > +       struct kvm_mmu_memory_caches *mmu_caches;
> > +
> > +       mmu_caches = &vcpu->arch.mmu_caches;
> > +
> > +       return kvm_mmu_memory_cache_alloc(&mmu_caches->pte_list_desc_cache);
> >  }
> >
> >  static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
> > @@ -1071,7 +1093,7 @@ static bool rmap_can_add(struct kvm_vcpu *vcpu)
> >  {
> >         struct kvm_mmu_memory_cache *mc;
> >
> > -       mc = &vcpu->arch.mmu_pte_list_desc_cache;
> > +       mc = &vcpu->arch.mmu_caches.pte_list_desc_cache;
> >         return kvm_mmu_memory_cache_nr_free_objects(mc);
> >  }
> >
> > @@ -1742,12 +1764,15 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
> >
> >  static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
> >  {
> > +       struct kvm_mmu_memory_caches *mmu_caches;
> >         struct kvm_mmu_page *sp;
> >
> > -       sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> > -       sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> > +       mmu_caches = &vcpu->arch.mmu_caches;
> > +
> > +       sp = kvm_mmu_memory_cache_alloc(&mmu_caches->page_header_cache);
> > +       sp->spt = kvm_mmu_memory_cache_alloc(&mmu_caches->shadow_page_cache);
> >         if (!direct)
> > -               sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
> > +               sp->gfns = kvm_mmu_memory_cache_alloc(&mmu_caches->gfn_array_cache);
> >         set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> >
> >         /*
> > @@ -5544,13 +5569,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> >  {
> >         int ret;
> >
> > -       vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> > -       vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
> > -
> > -       vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> > -       vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> > -
> > -       vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > +       mmu_init_memory_caches(&vcpu->arch.mmu_caches);
> >
> >         vcpu->arch.mmu = &vcpu->arch.root_mmu;
> >         vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 9ee3f4f7fdf5..b70707a7fe87 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -175,10 +175,13 @@ static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
> >  static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> >                                                int level)
> >  {
> > +       struct kvm_mmu_memory_caches *mmu_caches;
> >         struct kvm_mmu_page *sp;
> >
> > -       sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> > -       sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> > +       mmu_caches = &vcpu->arch.mmu_caches;
> > +
> > +       sp = kvm_mmu_memory_cache_alloc(&mmu_caches->page_header_cache);
> > +       sp->spt = kvm_mmu_memory_cache_alloc(&mmu_caches->shadow_page_cache);
> >         set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> >
> >         sp->role.word = page_role_for_level(vcpu, level).word;
> > --
> > 2.34.0.rc2.393.gf8c9666880-goog
> >

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 07/15] KVM: x86/mmu: Pass in vcpu->arch.mmu_caches instead of vcpu
  2021-11-19 23:57 ` [RFC PATCH 07/15] KVM: x86/mmu: Pass in vcpu->arch.mmu_caches instead of vcpu David Matlack
@ 2021-11-22 18:56   ` Ben Gardon
  0 siblings, 0 replies; 77+ messages in thread
From: Ben Gardon @ 2021-11-22 18:56 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Fri, Nov 19, 2021 at 3:58 PM David Matlack <dmatlack@google.com> wrote:
>
> Pass in vcpu->arch.mmu_caches to alloc_{,child_}tdp_mmu_page() instead
> of the vcpu. This is in preparation for eagerly splitting large pages
> during VM-ioctls, which do not have access to the vCPU's mmu_caches.
>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Ben Gardon <bgardon@google.com>


> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 1a409992a57f..ff4d83ad7580 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -157,14 +157,11 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
>                 if (kvm_mmu_page_as_id(_root) != _as_id) {              \
>                 } else
>
> -static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> -                                              union kvm_mmu_page_role role)
> +static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_mmu_memory_caches *mmu_caches,
> +                                              gfn_t gfn, union kvm_mmu_page_role role)
>  {
> -       struct kvm_mmu_memory_caches *mmu_caches;
>         struct kvm_mmu_page *sp;
>
> -       mmu_caches = &vcpu->arch.mmu_caches;
> -
>         sp = kvm_mmu_memory_cache_alloc(&mmu_caches->page_header_cache);
>         sp->spt = kvm_mmu_memory_cache_alloc(&mmu_caches->shadow_page_cache);
>         set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> @@ -178,7 +175,8 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>         return sp;
>  }
>
> -static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu, struct tdp_iter *iter)
> +static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_mmu_memory_caches *mmu_caches,
> +                                                    struct tdp_iter *iter)
>  {
>         struct kvm_mmu_page *parent_sp;
>         union kvm_mmu_page_role role;
> @@ -188,7 +186,7 @@ static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu, stru
>         role = parent_sp->role;
>         role.level--;
>
> -       return alloc_tdp_mmu_page(vcpu, iter->gfn, role);
> +       return alloc_tdp_mmu_page(mmu_caches, iter->gfn, role);
>  }
>
>  hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> @@ -213,7 +211,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
>                         goto out;
>         }
>
> -       root = alloc_tdp_mmu_page(vcpu, 0, role);
> +       root = alloc_tdp_mmu_page(&vcpu->arch.mmu_caches, 0, role);
>         refcount_set(&root->tdp_mmu_root_count, 1);
>
>         spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> @@ -1031,7 +1029,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                         if (is_removed_spte(iter.old_spte))
>                                 break;
>
> -                       sp = alloc_child_tdp_mmu_page(vcpu, &iter);
> +                       sp = alloc_child_tdp_mmu_page(&vcpu->arch.mmu_caches, &iter);
>                         if (!tdp_mmu_install_sp_atomic(vcpu->kvm, &iter, sp, account_nx))
>                                 break;
>                 }
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 08/15] KVM: x86/mmu: Helper method to check for large and present sptes
  2021-11-19 23:57 ` [RFC PATCH 08/15] KVM: x86/mmu: Helper method to check for large and present sptes David Matlack
@ 2021-11-22 18:56   ` Ben Gardon
  2021-12-01 18:34   ` Sean Christopherson
  1 sibling, 0 replies; 77+ messages in thread
From: Ben Gardon @ 2021-11-22 18:56 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Fri, Nov 19, 2021 at 3:58 PM David Matlack <dmatlack@google.com> wrote:
>
> Consolidate the is_shadow_present_pte() and is_large_pte() checks into a
> single helper. This will be used in a follow-up commit to check for
> present large pages during Eager Page Splitting.
>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Ben Gardon <bgardon@google.com>

> ---
>  arch/x86/kvm/mmu/spte.h    | 5 +++++
>  arch/x86/kvm/mmu/tdp_mmu.c | 3 +--
>  2 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index cc432f9a966b..e73c41d31816 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -257,6 +257,11 @@ static inline bool is_large_pte(u64 pte)
>         return pte & PT_PAGE_SIZE_MASK;
>  }
>
> +static inline bool is_large_present_pte(u64 pte)
> +{
> +       return is_shadow_present_pte(pte) && is_large_pte(pte);
> +}
> +
>  static inline bool is_last_spte(u64 pte, int level)
>  {
>         return (level == PG_LEVEL_4K) || is_large_pte(pte);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index ff4d83ad7580..f8c4337f1fcf 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1011,8 +1011,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                  * than the target, that SPTE must be cleared and replaced
>                  * with a non-leaf SPTE.
>                  */
> -               if (is_shadow_present_pte(iter.old_spte) &&
> -                   is_large_pte(iter.old_spte)) {
> +               if (is_large_present_pte(iter.old_spte)) {

I'm amazed there's only one instance of a check for present and large.


>                         if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
>                                 break;
>                 }
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 09/15] KVM: x86/mmu: Move restore_acc_track_spte to spte.c
  2021-11-19 23:57 ` [RFC PATCH 09/15] KVM: x86/mmu: Move restore_acc_track_spte to spte.c David Matlack
@ 2021-11-22 18:56   ` Ben Gardon
  0 siblings, 0 replies; 77+ messages in thread
From: Ben Gardon @ 2021-11-22 18:56 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Fri, Nov 19, 2021 at 3:58 PM David Matlack <dmatlack@google.com> wrote:
>
> restore_acc_track_spte is purely an SPTE manipulation, making it a good
> fit for spte.c. It is also needed in spte.c in a follow-up commit so we
> can construct child SPTEs during large page splitting.
>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Ben Gardon <bgardon@google.com>

Love it.


> ---
>  arch/x86/kvm/mmu/mmu.c  | 18 ------------------
>  arch/x86/kvm/mmu/spte.c | 18 ++++++++++++++++++
>  arch/x86/kvm/mmu/spte.h |  1 +
>  3 files changed, 19 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 537952574211..54f0d2228135 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -652,24 +652,6 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
>         return __get_spte_lockless(sptep);
>  }
>
> -/* Restore an acc-track PTE back to a regular PTE */
> -static u64 restore_acc_track_spte(u64 spte)
> -{
> -       u64 new_spte = spte;
> -       u64 saved_bits = (spte >> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT)
> -                        & SHADOW_ACC_TRACK_SAVED_BITS_MASK;
> -
> -       WARN_ON_ONCE(spte_ad_enabled(spte));
> -       WARN_ON_ONCE(!is_access_track_spte(spte));
> -
> -       new_spte &= ~shadow_acc_track_mask;
> -       new_spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
> -                     SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
> -       new_spte |= saved_bits;
> -
> -       return new_spte;
> -}
> -
>  /* Returns the Accessed status of the PTE and resets it at the same time. */
>  static bool mmu_spte_age(u64 *sptep)
>  {
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 0c76c45fdb68..df2cdb8bcf77 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -268,6 +268,24 @@ u64 mark_spte_for_access_track(u64 spte)
>         return spte;
>  }
>
> +/* Restore an acc-track PTE back to a regular PTE */
> +u64 restore_acc_track_spte(u64 spte)
> +{
> +       u64 new_spte = spte;
> +       u64 saved_bits = (spte >> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT)
> +                        & SHADOW_ACC_TRACK_SAVED_BITS_MASK;
> +
> +       WARN_ON_ONCE(spte_ad_enabled(spte));
> +       WARN_ON_ONCE(!is_access_track_spte(spte));
> +
> +       new_spte &= ~shadow_acc_track_mask;
> +       new_spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
> +                     SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
> +       new_spte |= saved_bits;
> +
> +       return new_spte;
> +}
> +
>  void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
>  {
>         BUG_ON((u64)(unsigned)access_mask != access_mask);
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index e73c41d31816..3e4943ee5a01 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -342,6 +342,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
>  u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
>  u64 mark_spte_for_access_track(u64 spte);
> +u64 restore_acc_track_spte(u64 spte);
>  u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn);
>
>  void kvm_mmu_reset_all_pte_masks(void);
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 10/15] KVM: x86/mmu: Abstract need_resched logic from tdp_mmu_iter_cond_resched
  2021-11-19 23:57 ` [RFC PATCH 10/15] KVM: x86/mmu: Abstract need_resched logic from tdp_mmu_iter_cond_resched David Matlack
@ 2021-11-22 18:56   ` Ben Gardon
  0 siblings, 0 replies; 77+ messages in thread
From: Ben Gardon @ 2021-11-22 18:56 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Fri, Nov 19, 2021 at 3:58 PM David Matlack <dmatlack@google.com> wrote:
>
> Abstract out the logic that checks whether or not we should reschedule
> (including the extra check that ensures we make forward progress) to a
> helper method. This will be used in a follow-up commit to reschedule
> during large page splitting.
>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Ben Gardon <bgardon@google.com>


> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index f8c4337f1fcf..2221e074d8ea 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -645,6 +645,15 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
>         for_each_tdp_pte(_iter, __va(_mmu->root_hpa),           \
>                          _mmu->shadow_root_level, _start, _end)
>
> +static inline bool tdp_mmu_iter_need_resched(struct kvm *kvm, struct tdp_iter *iter)
> +{
> +       /* Ensure forward progress has been made before yielding. */
> +       if (iter->next_last_level_gfn == iter->yielded_gfn)
> +               return false;
> +
> +       return need_resched() || rwlock_needbreak(&kvm->mmu_lock);
> +}
> +
>  /*
>   * Yield if the MMU lock is contended or this thread needs to return control
>   * to the scheduler.
> @@ -664,11 +673,7 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
>                                              struct tdp_iter *iter, bool flush,
>                                              bool shared)
>  {
> -       /* Ensure forward progress has been made before yielding. */
> -       if (iter->next_last_level_gfn == iter->yielded_gfn)
> -               return false;
> -
> -       if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> +       if (tdp_mmu_iter_need_resched(kvm, iter)) {
>                 rcu_read_unlock();
>
>                 if (flush)
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 11/15] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root
  2021-11-19 23:57 ` [RFC PATCH 11/15] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root David Matlack
@ 2021-11-22 18:56   ` Ben Gardon
  0 siblings, 0 replies; 77+ messages in thread
From: Ben Gardon @ 2021-11-22 18:56 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Fri, Nov 19, 2021 at 3:58 PM David Matlack <dmatlack@google.com> wrote:
>
> Instead of passing a pointer to the root page table and the root level
> separately, pass in a pointer to the kvm_mmu_page that backs the root.
> This reduces the number of arguments by 1, cutting down on line lengths.
>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Ben Gardon <bgardon@google.com>

> ---
>  arch/x86/kvm/mmu/tdp_iter.c |  5 ++++-
>  arch/x86/kvm/mmu/tdp_iter.h | 10 +++++-----
>  arch/x86/kvm/mmu/tdp_mmu.c  | 14 +++++---------
>  3 files changed, 14 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> index b3ed302c1a35..92b3a075525a 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.c
> +++ b/arch/x86/kvm/mmu/tdp_iter.c
> @@ -39,9 +39,12 @@ void tdp_iter_restart(struct tdp_iter *iter)
>   * Sets a TDP iterator to walk a pre-order traversal of the paging structure
>   * rooted at root_pt, starting with the walk to translate next_last_level_gfn.
>   */
> -void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
> +void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,

I think this is an artifact of the days when I thought we could avoid
allocating struct kvm_mmu_pages for the TDP MMU.
Trying to do that turned out to be a huge pain though and the memory
savings weren't great.
Happy to see this cleaned up.


>                     int min_level, gfn_t next_last_level_gfn)
>  {
> +       u64 *root_pt = root->spt;
> +       int root_level = root->role.level;
> +
>         WARN_ON(root_level < 1);
>         WARN_ON(root_level > PT64_ROOT_MAX_LEVEL);
>
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index b1748b988d3a..ec1f58013428 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -51,17 +51,17 @@ struct tdp_iter {
>   * Iterates over every SPTE mapping the GFN range [start, end) in a
>   * preorder traversal.
>   */
> -#define for_each_tdp_pte_min_level(iter, root, root_level, min_level, start, end) \
> -       for (tdp_iter_start(&iter, root, root_level, min_level, start); \
> +#define for_each_tdp_pte_min_level(iter, root, min_level, start, end) \
> +       for (tdp_iter_start(&iter, root, min_level, start); \
>              iter.valid && iter.gfn < end;                   \
>              tdp_iter_next(&iter))
>
> -#define for_each_tdp_pte(iter, root, root_level, start, end) \
> -       for_each_tdp_pte_min_level(iter, root, root_level, PG_LEVEL_4K, start, end)
> +#define for_each_tdp_pte(iter, root, start, end) \
> +       for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end)
>
>  tdp_ptep_t spte_to_child_pt(u64 pte, int level);
>
> -void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
> +void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
>                     int min_level, gfn_t next_last_level_gfn);
>  void tdp_iter_next(struct tdp_iter *iter);
>  void tdp_iter_restart(struct tdp_iter *iter);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 2221e074d8ea..5ca0fa659245 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -632,7 +632,7 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
>  }
>
>  #define tdp_root_for_each_pte(_iter, _root, _start, _end) \
> -       for_each_tdp_pte(_iter, _root->spt, _root->role.level, _start, _end)
> +       for_each_tdp_pte(_iter, _root, _start, _end)
>
>  #define tdp_root_for_each_leaf_pte(_iter, _root, _start, _end) \
>         tdp_root_for_each_pte(_iter, _root, _start, _end)               \
> @@ -642,8 +642,7 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
>                 else
>
>  #define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end)                \
> -       for_each_tdp_pte(_iter, __va(_mmu->root_hpa),           \
> -                        _mmu->shadow_root_level, _start, _end)
> +       for_each_tdp_pte(_iter, to_shadow_page(_mmu->root_hpa), _start, _end)
>
>  static inline bool tdp_mmu_iter_need_resched(struct kvm *kvm, struct tdp_iter *iter)
>  {
> @@ -738,8 +737,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>
>         rcu_read_lock();
>
> -       for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
> -                                  min_level, start, end) {
> +       for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
>  retry:
>                 if (can_yield &&
>                     tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
> @@ -1201,8 +1199,7 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>
>         BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
>
> -       for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
> -                                  min_level, start, end) {
> +       for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
>  retry:
>                 if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
>                         continue;
> @@ -1450,8 +1447,7 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
>
>         rcu_read_lock();
>
> -       for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
> -                                  min_level, gfn, gfn + 1) {
> +       for_each_tdp_pte_min_level(iter, root, min_level, gfn, gfn + 1) {
>                 if (!is_shadow_present_pte(iter.old_spte) ||
>                     !is_last_spte(iter.old_spte, iter.level))
>                         continue;
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
  2021-11-19 23:57 ` [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled David Matlack
  2021-11-22  5:05   ` Nikunj A. Dadhania
@ 2021-11-22 19:30   ` Ben Gardon
  2021-11-30 23:44     ` David Matlack
  2021-11-26 12:01   ` Peter Xu
  2 siblings, 1 reply; 77+ messages in thread
From: Ben Gardon @ 2021-11-22 19:30 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Fri, Nov 19, 2021 at 3:58 PM David Matlack <dmatlack@google.com> wrote:
>
> When dirty logging is enabled without initially-all-set, attempt to
> split all large pages in the memslot down to 4KB pages so that vCPUs
> do not have to take expensive write-protection faults to split large
> pages.
>
> Large page splitting is best-effort only. This commit only adds support
> for the TDP MMU, and even there splitting may fail due to out-of-memory
> conditions. Failure to split a large page is fine from a
> correctness standpoint because we still always follow it up by write-
> protecting any remaining large pages.
>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |   6 ++
>  arch/x86/kvm/mmu/mmu.c          |  83 +++++++++++++++++++++
>  arch/x86/kvm/mmu/mmu_internal.h |   3 +
>  arch/x86/kvm/mmu/spte.c         |  46 ++++++++++++
>  arch/x86/kvm/mmu/spte.h         |   1 +
>  arch/x86/kvm/mmu/tdp_mmu.c      | 123 ++++++++++++++++++++++++++++++++
>  arch/x86/kvm/mmu/tdp_mmu.h      |   5 ++
>  arch/x86/kvm/x86.c              |   6 ++
>  8 files changed, 273 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2a7564703ea6..432a4df817ec 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1232,6 +1232,9 @@ struct kvm_arch {
>         hpa_t   hv_root_tdp;
>         spinlock_t hv_root_tdp_lock;
>  #endif
> +
> +       /* MMU caches used when splitting large pages during VM-ioctls. */
> +       struct kvm_mmu_memory_caches split_caches;
>  };
>
>  struct kvm_vm_stat {
> @@ -1588,6 +1591,9 @@ void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
>  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>                                       const struct kvm_memory_slot *memslot,
>                                       int start_level);
> +void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
> +                                       const struct kvm_memory_slot *memslot,
> +                                       int target_level);
>  void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>                                    const struct kvm_memory_slot *memslot);
>  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 54f0d2228135..6768ef9c0891 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -738,6 +738,66 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>                                           PT64_ROOT_MAX_LEVEL);
>  }
>
> +static inline void assert_split_caches_invariants(struct kvm *kvm)
> +{
> +       /*
> +        * The split caches must only be modified while holding the slots_lock,
> +        * since it is only used during memslot VM-ioctls.
> +        */
> +       lockdep_assert_held(&kvm->slots_lock);
> +
> +       /*
> +        * Only the TDP MMU supports large page splitting using
> +        * kvm->arch.split_caches, which is why we only have to allocate
> +        * page_header_cache and shadow_page_cache. Assert that the TDP
> +        * MMU is at least enabled when the split cache is allocated.
> +        */
> +       BUG_ON(!is_tdp_mmu_enabled(kvm));
> +}
> +
> +int mmu_topup_split_caches(struct kvm *kvm)
> +{
> +       struct kvm_mmu_memory_caches *split_caches = &kvm->arch.split_caches;
> +       int r;
> +
> +       assert_split_caches_invariants(kvm);
> +
> +       r = kvm_mmu_topup_memory_cache(&split_caches->page_header_cache, 1);
> +       if (r)
> +               goto out;
> +
> +       r = kvm_mmu_topup_memory_cache(&split_caches->shadow_page_cache, 1);
> +       if (r)
> +               goto out;
> +
> +       return 0;
> +
> +out:
> +       pr_warn("Failed to top-up split caches. Will not split large pages.\n");
> +       return r;
> +}
> +
> +static void mmu_free_split_caches(struct kvm *kvm)
> +{
> +       assert_split_caches_invariants(kvm);
> +
> +       kvm_mmu_free_memory_cache(&kvm->arch.split_caches.pte_list_desc_cache);
> +       kvm_mmu_free_memory_cache(&kvm->arch.split_caches.shadow_page_cache);
> +}
> +
> +bool mmu_split_caches_need_topup(struct kvm *kvm)
> +{
> +       assert_split_caches_invariants(kvm);
> +
> +       if (kvm->arch.split_caches.page_header_cache.nobjs == 0)
> +               return true;
> +
> +       if (kvm->arch.split_caches.shadow_page_cache.nobjs == 0)
> +               return true;
> +
> +       return false;
> +}
> +
>  static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
>  {
>         struct kvm_mmu_memory_caches *mmu_caches;
> @@ -5696,6 +5756,7 @@ void kvm_mmu_init_vm(struct kvm *kvm)
>
>         spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
>
> +       mmu_init_memory_caches(&kvm->arch.split_caches);
>         kvm_mmu_init_tdp_mmu(kvm);
>
>         node->track_write = kvm_mmu_pte_write;
> @@ -5819,6 +5880,28 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>                 kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
>  }
>
> +void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
> +                                       const struct kvm_memory_slot *memslot,
> +                                       int target_level)
> +{
> +       u64 start, end;
> +
> +       if (!is_tdp_mmu_enabled(kvm))
> +               return;
> +
> +       if (mmu_topup_split_caches(kvm))
> +               return;
> +
> +       start = memslot->base_gfn;
> +       end = start + memslot->npages;
> +
> +       read_lock(&kvm->mmu_lock);
> +       kvm_tdp_mmu_try_split_large_pages(kvm, memslot, start, end, target_level);
> +       read_unlock(&kvm->mmu_lock);
> +
> +       mmu_free_split_caches(kvm);
> +}
> +
>  static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>                                          struct kvm_rmap_head *rmap_head,
>                                          const struct kvm_memory_slot *slot)
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 52c6527b1a06..89b9b907c567 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -161,4 +161,7 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> +int mmu_topup_split_caches(struct kvm *kvm);
> +bool mmu_split_caches_need_topup(struct kvm *kvm);
> +
>  #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index df2cdb8bcf77..6bb9b597a854 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -191,6 +191,52 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>         return wrprot;
>  }
>
> +static u64 mark_spte_executable(u64 spte)
> +{
> +       bool is_access_track = is_access_track_spte(spte);
> +
> +       if (is_access_track)
> +               spte = restore_acc_track_spte(spte);
> +
> +       spte &= ~shadow_nx_mask;
> +       spte |= shadow_x_mask;
> +
> +       if (is_access_track)
> +               spte = mark_spte_for_access_track(spte);
> +
> +       return spte;
> +}
> +
> +/*
> + * Construct an SPTE that maps a sub-page of the given large SPTE. This is
> + * used during large page splitting, to build the SPTEs that make up the new
> + * page table.
> + */
> +u64 make_large_page_split_spte(u64 large_spte, int level, int index, unsigned int access)

Just because this always trips me up reading code, I'd suggest naming
the argument large_spte_level or something.
Avoiding a variable called "level" in this function makes it much more explicit.

> +{
> +       u64 child_spte;
> +       int child_level;
> +
> +       BUG_ON(is_mmio_spte(large_spte));
> +       BUG_ON(!is_large_present_pte(large_spte));

In the interest of not crashing the host, I think it would be safe to
WARN and return 0 here.
BUG is fine too if that's preferred.

> +
> +       child_spte = large_spte;
> +       child_level = level - 1;
> +
> +       child_spte += (index * KVM_PAGES_PER_HPAGE(child_level)) << PAGE_SHIFT;

This += makes me nervous. It at least merits a comment explaining
what's going on.
I'd find a |= more readable since it's more explicit and SPTEs aren't
really numbers.
You could also be really explicit about extracting the PFN, adding to it,
clearing the PFN bits, and then putting it back in; I bet the compiler
would optimize out the extra bit fiddling.
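
Roughly something like this (untested sketch, assuming spte_to_pfn() and
PT64_BASE_ADDR_MASK from spte.h):

        kvm_pfn_t child_pfn = spte_to_pfn(large_spte) +
                              index * KVM_PAGES_PER_HPAGE(child_level);

        /* Drop the old PFN bits and put the child's PFN in their place. */
        child_spte = large_spte & ~PT64_BASE_ADDR_MASK;
        child_spte |= (u64)child_pfn << PAGE_SHIFT;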

> +
> +       if (child_level == PG_LEVEL_4K) {
> +               child_spte &= ~PT_PAGE_SIZE_MASK;
> +
> +               /* Allow execution for 4K pages if it was disabled for NX HugePages. */
> +               if (is_nx_huge_page_enabled() && access & ACC_EXEC_MASK)
> +                       child_spte = mark_spte_executable(child_spte);
> +       }
> +
> +       return child_spte;
> +}
> +
> +
>  u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
>  {
>         u64 spte = SPTE_MMU_PRESENT_MASK;
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 3e4943ee5a01..4efb4837e38d 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -339,6 +339,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>                unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
>                u64 old_spte, bool prefetch, bool can_unsync,
>                bool host_writable, u64 *new_spte);
> +u64 make_large_page_split_spte(u64 large_spte, int level, int index, unsigned int access);
>  u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
>  u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
>  u64 mark_spte_for_access_track(u64 spte);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5ca0fa659245..366857b9fb3b 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -695,6 +695,39 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
>         return false;
>  }
>
> +static inline bool
> +tdp_mmu_need_split_caches_topup_or_resched(struct kvm *kvm, struct tdp_iter *iter)
> +{
> +       if (mmu_split_caches_need_topup(kvm))
> +               return true;
> +
> +       return tdp_mmu_iter_need_resched(kvm, iter);
> +}
> +
> +static inline int
> +tdp_mmu_topup_split_caches_resched(struct kvm *kvm, struct tdp_iter *iter, bool flush)

This functionality could be shoe-horned into
tdp_mmu_iter_cond_resched, reducing code duplication.
I don't know if the extra parameters / complexity on that function
would be worth it, but I'm slightly inclined in that direction.

> +{
> +       int r;
> +
> +       rcu_read_unlock();
> +
> +       if (flush)
> +               kvm_flush_remote_tlbs(kvm);
> +
> +       read_unlock(&kvm->mmu_lock);
> +
> +       cond_resched();
> +       r = mmu_topup_split_caches(kvm);

Ah, right. I was confused by this for a second, but it's safe because
the caches are protected by the slots lock.

> +
> +       read_lock(&kvm->mmu_lock);
> +
> +       rcu_read_lock();
> +       WARN_ON(iter->gfn > iter->next_last_level_gfn);
> +       tdp_iter_restart(iter);
> +
> +       return r;
> +}
> +
>  /*
>   * Tears down the mappings for the range of gfns, [start, end), and frees the
>   * non-root pages mapping GFNs strictly within that range. Returns true if
> @@ -1241,6 +1274,96 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
>         return spte_set;
>  }
>
> +static bool tdp_mmu_split_large_page_atomic(struct kvm *kvm, struct tdp_iter *iter)
> +{
> +       const u64 large_spte = iter->old_spte;
> +       const int level = iter->level;
> +       struct kvm_mmu_page *child_sp;
> +       u64 child_spte;
> +       int i;
> +
> +       BUG_ON(mmu_split_caches_need_topup(kvm));

I think it would be safe to just WARN and return here as well.

> +
> +       child_sp = alloc_child_tdp_mmu_page(&kvm->arch.split_caches, iter);
> +
> +       for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
> +               child_spte = make_large_page_split_spte(large_spte, level, i, ACC_ALL);

Relating to my other comment above on make_large_page_split_spte, you
could also iterate through the range of PFNs here and pass that as an
argument to the helper function.
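
E.g. a rough, untested sketch, where make_large_page_split_spte() below is
the hypothetical reworked helper that takes the child PFN instead of
(level, index):

        kvm_pfn_t child_pfn = spte_to_pfn(large_spte);

        for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
                child_sp->spt[i] = make_large_page_split_spte(large_spte,
                                                              child_pfn,
                                                              ACC_ALL);
                child_pfn += KVM_PAGES_PER_HPAGE(level - 1);
        }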

> +
> +               /*
> +                * No need for atomics since child_sp has not been installed
> +                * in the table yet and thus is not reachable by any other
> +                * thread.
> +                */
> +               child_sp->spt[i] = child_spte;
> +       }
> +
> +       return tdp_mmu_install_sp_atomic(kvm, iter, child_sp, false);
> +}
> +
> +static void tdp_mmu_split_large_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
> +                                          gfn_t start, gfn_t end, int target_level)
> +{
> +       struct tdp_iter iter;
> +       bool flush = false;
> +       int r;
> +
> +       rcu_read_lock();
> +
> +       /*
> +        * Traverse the page table splitting all large pages above the target
> +        * level into one lower level. For example, if we encounter a 1GB page
> +        * we split it into 512 2MB pages.
> +        *
> +        * Since the TDP iterator uses a pre-order traversal, we are guaranteed
> +        * to visit an SPTE before ever visiting its children, which means we
> +        * will correctly recursively split large pages that are more than one
> +        * level above the target level (e.g. splitting 1GB to 2MB to 4KB).
> +        */
> +       for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
> +retry:
> +               if (tdp_mmu_need_split_caches_topup_or_resched(kvm, &iter)) {
> +                       r = tdp_mmu_topup_split_caches_resched(kvm, &iter, flush);
> +                       flush = false;
> +
> +                       /*
> +                        * If topping up the split caches failed, we can't split
> +                        * any more pages. Bail out of the loop.
> +                        */
> +                       if (r)
> +                               break;
> +
> +                       continue;
> +               }
> +
> +               if (!is_large_present_pte(iter.old_spte))
> +                       continue;
> +
> +               if (!tdp_mmu_split_large_page_atomic(kvm, &iter))
> +                       goto retry;
> +
> +               flush = true;
> +       }
> +
> +       rcu_read_unlock();
> +
> +       if (flush)
> +               kvm_flush_remote_tlbs(kvm);
> +}
> +
> +void kvm_tdp_mmu_try_split_large_pages(struct kvm *kvm,
> +                                      const struct kvm_memory_slot *slot,
> +                                      gfn_t start, gfn_t end,
> +                                      int target_level)
> +{
> +       struct kvm_mmu_page *root;
> +
> +       lockdep_assert_held_read(&kvm->mmu_lock);
> +
> +       for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
> +               tdp_mmu_split_large_pages_root(kvm, root, start, end, target_level);
> +
> +}
> +
>  /*
>   * Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
>   * AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 476b133544dd..7812087836b2 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -72,6 +72,11 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
>                                    struct kvm_memory_slot *slot, gfn_t gfn,
>                                    int min_level);
>
> +void kvm_tdp_mmu_try_split_large_pages(struct kvm *kvm,
> +                                      const struct kvm_memory_slot *slot,
> +                                      gfn_t start, gfn_t end,
> +                                      int target_level);
> +
>  static inline void kvm_tdp_mmu_walk_lockless_begin(void)
>  {
>         rcu_read_lock();
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 04e8dabc187d..4702ebfd394b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11735,6 +11735,12 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
>                 if (kvm_dirty_log_manual_protect_and_init_set(kvm))
>                         return;
>
> +               /*
> +                * Attempt to split all large pages into 4K pages so that vCPUs
> +                * do not have to take write-protection faults.
> +                */
> +               kvm_mmu_slot_try_split_large_pages(kvm, new, PG_LEVEL_4K);

Thank you for parameterizing the target level here. I'm working on a
proof of concept for 2M dirty tracking right now (still in exploratory
phase) and this parameter will help future-proof the splitting
algorithm if we ever decide we don't want to split everything to 4k
for dirty logging.

> +
>                 if (kvm_x86_ops.cpu_dirty_log_size) {
>                         kvm_mmu_slot_leaf_clear_dirty(kvm, new);
>                         kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_2M);
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
  2021-11-19 23:57 ` [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled David Matlack
  2021-11-22  5:05   ` Nikunj A. Dadhania
  2021-11-22 19:30   ` Ben Gardon
@ 2021-11-26 12:01   ` Peter Xu
  2021-11-30 23:56     ` David Matlack
  2 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2021-11-26 12:01 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

Hi, David,

On Fri, Nov 19, 2021 at 11:57:56PM +0000, David Matlack wrote:
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2a7564703ea6..432a4df817ec 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1232,6 +1232,9 @@ struct kvm_arch {
>  	hpa_t	hv_root_tdp;
>  	spinlock_t hv_root_tdp_lock;
>  #endif
> +
> +	/* MMU caches used when splitting large pages during VM-ioctls. */
> +	struct kvm_mmu_memory_caches split_caches;

Are mmu_gfn_array_cache and mmu_pte_list_desc_cache wasted here?  I saw that
"struct kvm_mmu_memory_cache" still takes up a few hundred bytes each, so I
just want to make sure we won't waste them in vain.
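
Just to illustrate what I mean (the struct name is made up), a dedicated
struct carrying only the two caches the split path actually uses would avoid
hauling the unused ones around:

        struct kvm_mmu_split_caches {
                struct kvm_mmu_memory_cache page_header_cache;
                struct kvm_mmu_memory_cache shadow_page_cache;
        };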

[...]

> +int mmu_topup_split_caches(struct kvm *kvm)
> +{
> +	struct kvm_mmu_memory_caches *split_caches = &kvm->arch.split_caches;
> +	int r;
> +
> +	assert_split_caches_invariants(kvm);
> +
> +	r = kvm_mmu_topup_memory_cache(&split_caches->page_header_cache, 1);
> +	if (r)
> +		goto out;
> +
> +	r = kvm_mmu_topup_memory_cache(&split_caches->shadow_page_cache, 1);
> +	if (r)
> +		goto out;

Is it intended to top up with only one cache object?  IIUC this means we'll
proactively yield the CPU after each huge page split, right after the object
is consumed.

I'm wondering whether it would be more efficient to make it a slightly larger
number, so we don't overload the memory but also make the loop a bit more
efficient.
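
For example (illustration only; the cache's objects[] array caps this at
KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE, which is 40 on x86):

        r = kvm_mmu_topup_memory_cache(&split_caches->page_header_cache,
                                       KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE);
        if (r)
                goto out;

        r = kvm_mmu_topup_memory_cache(&split_caches->shadow_page_cache,
                                       KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE);
        if (r)
                goto out;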

> +
> +	return 0;
> +
> +out:
> +	pr_warn("Failed to top-up split caches. Will not split large pages.\n");
> +	return r;
> +}

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
  2021-11-19 23:57 ` [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG David Matlack
@ 2021-11-26 12:17   ` Peter Xu
  2021-12-01  0:16     ` David Matlack
  2021-12-01 19:22   ` Sean Christopherson
  1 sibling, 1 reply; 77+ messages in thread
From: Peter Xu @ 2021-11-26 12:17 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Fri, Nov 19, 2021 at 11:57:57PM +0000, David Matlack wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6768ef9c0891..4e78ef2dd352 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>  		gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
>  		gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
>  
> +		/*
> +		 * Try to proactively split any large pages down to 4KB so that
> +		 * vCPUs don't have to take write-protection faults.
> +		 */
> +		kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
> +
>  		kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
>  
>  		/* Cross two large pages? */

Is it intended to try splitting every time, even if we could have split it
already?  As I remember, Paolo mentioned that we can skip splitting if it's
not the 1st CLEAR_LOG on the same range, and IIUC that makes sense.

But indeed I don't see a trivial way to know whether this is the first clear
of this range.  Maybe we can maintain "how many huge pages are there under the
current kvm_mmu_page node" somehow?  Then if the root sp's counter is 0, we
can skip it.  Just a wild idea..

Or maybe it's intended to try splitting unconditionally for some reason?  If
so it would be great to mention that either in the commit message or in
comments.
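
For the counter idea, a very rough sketch of what I mean (the field name is
made up, it would have to be maintained wherever the TDP MMU installs or zaps
a large SPTE, and it only says whether *any* large SPTEs remain under a root,
not whether the cleared range contains one):

        /* New field in struct kvm_mmu_page, counting large SPTEs under a root: */
        atomic64_t nr_large_sptes;

        /* Then the split path could skip roots that have nothing to split: */
        for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true) {
                if (!atomic64_read(&root->nr_large_sptes))
                        continue;
                tdp_mmu_split_large_pages_root(kvm, root, start, end,
                                               target_level);
        }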

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 01/15] KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn
  2021-11-19 23:57 ` [RFC PATCH 01/15] KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn David Matlack
  2021-11-22 18:52   ` Ben Gardon
@ 2021-11-26 12:18   ` Peter Xu
  1 sibling, 0 replies; 77+ messages in thread
From: Peter Xu @ 2021-11-26 12:18 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Fri, Nov 19, 2021 at 11:57:45PM +0000, David Matlack wrote:
> rmap_write_protect is a poor name because we may not even touch the rmap
> if the TDP MMU is in use. It is also confusing that rmap_write_protect
> is not a simpler wrapper around __rmap_write_protect, since that is the
> typical flow for functions with double-underscore names.
> 
> Rename it to kvm_vcpu_write_protect_gfn to convey that we are
> write-protecting a specific gfn in the context of a vCPU.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 02/15] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect
  2021-11-19 23:57 ` [RFC PATCH 02/15] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect David Matlack
  2021-11-22 18:52   ` Ben Gardon
@ 2021-11-26 12:18   ` Peter Xu
  1 sibling, 0 replies; 77+ messages in thread
From: Peter Xu @ 2021-11-26 12:18 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Fri, Nov 19, 2021 at 11:57:46PM +0000, David Matlack wrote:
> Now that rmap_write_protect has been renamed, there is no need for the
> double underscores in front of __rmap_write_protect.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU
  2021-11-19 23:57 [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (14 preceding siblings ...)
  2021-11-19 23:57 ` [RFC PATCH 15/15] KVM: x86/mmu: Update page stats when " David Matlack
@ 2021-11-26 14:13 ` Peter Xu
  2021-11-30 23:22   ` David Matlack
  15 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2021-11-26 14:13 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

Hi, David,

On Fri, Nov 19, 2021 at 11:57:44PM +0000, David Matlack wrote:
> This series is a first pass at implementing Eager Page Splitting for the
> TDP MMU. For context on the motivation and design of Eager Page
> Splitting, please see the RFC design proposal and discussion [1].
> 
> Paolo, I went ahead and added splitting in both the intially-all-set
> case (only splitting the region passed to CLEAR_DIRTY_LOG) and the
> case where we are not using initially-all-set (splitting the entire
> memslot when dirty logging is enabled) to give you an idea of what
> both look like.
> 
> Note: I will be on vacation all of next week so I will not be able to
> respond to reviews until Monday November 29. I thought it would be
> useful to seed discussion and reviews with an early version of the code
> rather than putting it off another week. But feel free to also ignore
> this until I get back :)
> 
> This series compiles and passes the most basic splitting test:
> 
> $ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 2 -i 4
> 
> But please operate under the assumption that this code is probably
> buggy.
> 
> [1] https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/#t

Will there be more numbers to show in the formal patchset?  It's interesting
to know how "First Pass Dirty Memory Time" will change compared to the RFC
numbers; I can get a feel for it, but still. :) Also, not only how it speeds
up guest dirty apps, but also some general measurement of how it slows down
KVM_SET_USER_MEMORY_REGION (!init-all-set) or CLEAR_LOG (init-all-set) would
be even nicer (for CLEAR, I guess the 1st and 2nd+ rounds will have different
overhead).

Besides that, I'm also wondering whether we should still have a knob for it,
in case the use case is one where eagerly splitting huge pages may not help
at all.  What I'm thinking:

  - Read-mostly guest workload: splitting huge pages will speed up the rare
    writes, but in the meantime drag readers down due to huge->small page
    mappings.

  - Writes-over-very-limited-region workload: say we have a 1T guest and the
    app in the guest only writes a 10G part of it.  Hmm, not sure whether
    that exists..

  - Postcopy targeted: precopy may only run a few iterations just to send the
    static pages, so the migration duration will be relatively short, and the
    writes just don't spread a lot across the whole guest mem.

I don't really think any of these examples is strong enough, as they're all
very much corner cases, but it shows why I wanted to raise the question of
whether unconditional eager splitting is the best approach.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 06/15] KVM: x86/mmu: Derive page role from parent
  2021-11-20 12:53   ` Paolo Bonzini
@ 2021-11-27  2:07     ` Lai Jiangshan
  2021-11-27 10:26       ` Paolo Bonzini
  2021-11-30 23:31     ` David Matlack
  1 sibling, 1 reply; 77+ messages in thread
From: Lai Jiangshan @ 2021-11-27  2:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Matlack, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Xu, Peter Shier

On Sat, Nov 20, 2021 at 9:02 PM Paolo Bonzini <pbonzini@redhat.com> wrote:

>
> I have a similar patch for the old MMU, but it was also replacing
> shadow_root_level with shadow_root_role.  I'll see if I can adapt it to
> the TDP MMU, since the shadow_root_role is obviously the same for both.
>

Hello, Paolo

I'm sorry to ask something unrelated to this patchset, but related
to my pending work.

I will still continue to do something on shadow_root_level.  But I
would like to wait until your shadow_root_role work is queued.
And is it part of the work splitting up struct kvm_mmu?

Thanks
Lai

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 06/15] KVM: x86/mmu: Derive page role from parent
  2021-11-27  2:07     ` Lai Jiangshan
@ 2021-11-27 10:26       ` Paolo Bonzini
  0 siblings, 0 replies; 77+ messages in thread
From: Paolo Bonzini @ 2021-11-27 10:26 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: David Matlack, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Xu, Peter Shier

On 11/27/21 03:07, Lai Jiangshan wrote:
> On Sat, Nov 20, 2021 at 9:02 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
>>
>> I have a similar patch for the old MMU, but it was also replacing
>> shadow_root_level with shadow_root_role.  I'll see if I can adapt it to
>> the TDP MMU, since the shadow_root_role is obviously the same for both.
>>
> 
> Hello, Paolo
> 
> I'm sorry to ask something unrelated to this patchset, but related
> to my pending work.
> 
> I will still continue to do something on shadow_root_level.  But I
> would like to wait until your shadow_root_role work is queued.
> And is it a part of work splitting the struct kvm_mmu?

Yes, more or less.  I'm basically splitting the "CPU role" (the basic 
and extended role from the processor registers) used for emulation, from 
the "MMU role" (the basic role used for the root shadow page tables). 
Then shadow_root_level/root_level become respectively mmu_role->level 
and cpu_role->base.level.

Paolo


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU
  2021-11-26 14:13 ` [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU Peter Xu
@ 2021-11-30 23:22   ` David Matlack
  2021-12-01  4:10     ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: David Matlack @ 2021-11-30 23:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Fri, Nov 26, 2021 at 6:13 AM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, David,
>
> On Fri, Nov 19, 2021 at 11:57:44PM +0000, David Matlack wrote:
> > This series is a first pass at implementing Eager Page Splitting for the
> > TDP MMU. For context on the motivation and design of Eager Page
> > Splitting, please see the RFC design proposal and discussion [1].
> >
> > Paolo, I went ahead and added splitting in both the intially-all-set
> > case (only splitting the region passed to CLEAR_DIRTY_LOG) and the
> > case where we are not using initially-all-set (splitting the entire
> > memslot when dirty logging is enabled) to give you an idea of what
> > both look like.
> >
> > Note: I will be on vacation all of next week so I will not be able to
> > respond to reviews until Monday November 29. I thought it would be
> > useful to seed discussion and reviews with an early version of the code
> > rather than putting it off another week. But feel free to also ignore
> > this until I get back :)
> >
> > This series compiles and passes the most basic splitting test:
> >
> > $ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 2 -i 4
> >
> > But please operate under the assumption that this code is probably
> > buggy.
> >
> > [1] https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/#t
>
> Will there be more numbers to show in the formal patchset?

Yes definitely. I didn't have a lot of time to test this series, hence
the RFC status. I'll include more thorough testing and performance
evaluation in the cover letter for v1.


> It's interesting to
> know how "First Pass Dirty Memory Time" will change comparing to the rfc
> numbers; I can have a feel of it, but still. :) Also, not only how it speedup
> guest dirty apps, but also some general measurement on how it slows down
> KVM_SET_USER_MEMORY_REGION (!init-all-set) or CLEAR_LOG (init-all-set) would be
> even nicer (for CLEAR, I guess the 1st/2nd+ round will have different overhead).
>
> Besides that, I'm also wondering whether we should still have a knob for it, as
> I'm wondering what if the use case is the kind where eager split huge page may
> not help at all.  What I'm thinking:
>
>   - Read-mostly guest overload; split huge page will speed up rare writes, but
>     at the meantime drag readers down due to huge->small page mappings.
>
>   - Writes-over-very-limited-region workload: say we have 1T guest and the app
>     in the guest only writes 10G part of it.  Hmm not sure whether it exists..
>
>   - Postcopy targeted: it means precopy may only run a few iterations just to
>     send the static pages, so the migration duration will be relatively short,
>     and the write just didn't spread a lot to the whole guest mem.
>
> I don't really think any of the example is strong enough as they're all very
> corner cased, but just to show what I meant to raise this question on whether
> unconditionally eager split is the best approach.

I'd be happy to add a knob if there's a userspace that wants to use
it. I think the main challenge though is knowing when it is safe to
disable eager splitting. For a small deployment where you know the VM
workload, it might make sense. But for a public cloud provider the
only feasible way would be to dynamically monitor the guest writing
patterns. But then we're back at square one because that would require
dirty logging. And even then, there's no guaranteed way to predict
future guest write patterns based on past patterns.

The way forward here might be to do a hybrid of 2M and 4K dirty
tracking (and maybe even 1G). For example, first start dirty logging
at 2M granularity, and then log at 4K for any specific regions or
memslots that aren't making progress. We'd still use Eager Page
Splitting unconditionally though, first to split to 2M and then to
split to 4K.

>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 03/15] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
  2021-11-22 18:52   ` Ben Gardon
@ 2021-11-30 23:25     ` David Matlack
  0 siblings, 0 replies; 77+ messages in thread
From: David Matlack @ 2021-11-30 23:25 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Mon, Nov 22, 2021 at 10:52 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Fri, Nov 19, 2021 at 3:58 PM David Matlack <dmatlack@google.com> wrote:
> >
> > Consolidate a bunch of code that was manually re-reading the spte if the
> > cmpxchg fails. There is no extra cost of doing this because we already
> > have the spte value as a result of the cmpxchg (and in fact this
> > eliminates re-reading the spte), and none of the call sites depend on
> > iter->old_spte retaining the stale spte value.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 56 ++++++++++++--------------------------
> >  1 file changed, 18 insertions(+), 38 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 377a96718a2e..cc9fe33c9b36 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -492,16 +492,22 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> >   * and handle the associated bookkeeping.  Do not mark the page dirty
> >   * in KVM's dirty bitmaps.
> >   *
> > + * If setting the SPTE fails because it has changed, iter->old_spte will be
> > + * updated with the updated value of the spte.
> > + *
> >   * @kvm: kvm instance
> >   * @iter: a tdp_iter instance currently on the SPTE that should be set
> >   * @new_spte: The value the SPTE should be set to
> >   * Returns: true if the SPTE was set, false if it was not. If false is returned,
> > - *         this function will have no side-effects.
> > + *          this function will have no side-effects other than updating
> > + *          iter->old_spte to the latest value of spte.
> >   */
> >  static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
> >                                            struct tdp_iter *iter,
> >                                            u64 new_spte)
> >  {
> > +       u64 old_spte;
> > +
> >         lockdep_assert_held_read(&kvm->mmu_lock);
> >
> >         /*
> > @@ -515,9 +521,11 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
> >          * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
> >          * does not hold the mmu_lock.
> >          */
> > -       if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
> > -                     new_spte) != iter->old_spte)
> > +       old_spte = cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte, new_spte);
>
> This probably deserves a comment:
>
> /*
>  * If the old_spte values differ, the cmpxchg failed. Update
> iter->old_spte with the value inserted by
>  * another thread.
>  */

Will do.
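
i.e. something like:

        old_spte = cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte, new_spte);

        /*
         * If the old_spte values differ, the cmpxchg failed. Update
         * iter->old_spte with the value inserted by another thread.
         */
        if (old_spte != iter->old_spte) {
                iter->old_spte = old_spte;
                return false;
        }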

>
> > +       if (old_spte != iter->old_spte) {
> > +               iter->old_spte = old_spte;
> >                 return false;
> > +       }
> >
> >         __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> >                               new_spte, iter->level, true);
> > @@ -747,14 +755,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> >                 if (!shared) {
> >                         tdp_mmu_set_spte(kvm, &iter, 0);
> >                         flush = true;
> > -               } else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
> > -                       /*
> > -                        * The iter must explicitly re-read the SPTE because
> > -                        * the atomic cmpxchg failed.
> > -                        */
> > -                       iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
>
> I think kernel style is to include the curly braces on the else if, if
> the if had them.

You are correct! Will fix in v1.
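
i.e.:

                } else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
                        goto retry;
                }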

>
>
> > +               } else if (!tdp_mmu_zap_spte_atomic(kvm, &iter))
> >                         goto retry;
> > -               }
> >         }
> >
> >         rcu_read_unlock();
> > @@ -978,13 +980,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >                     is_large_pte(iter.old_spte)) {
> >                         if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
> >                                 break;
> > -
> > -                       /*
> > -                        * The iter must explicitly re-read the spte here
> > -                        * because the new value informs the !present
> > -                        * path below.
> > -                        */
> > -                       iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> >                 }
> >
> >                 if (!is_shadow_present_pte(iter.old_spte)) {
> > @@ -1190,14 +1185,9 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> >
> >                 new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
> >
> > -               if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte)) {
> > -                       /*
> > -                        * The iter must explicitly re-read the SPTE because
> > -                        * the atomic cmpxchg failed.
> > -                        */
> > -                       iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> > +               if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
> >                         goto retry;
> > -               }
> > +
> >                 spte_set = true;
> >         }
> >
> > @@ -1258,14 +1248,9 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> >                                 continue;
> >                 }
> >
> > -               if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte)) {
> > -                       /*
> > -                        * The iter must explicitly re-read the SPTE because
> > -                        * the atomic cmpxchg failed.
> > -                        */
> > -                       iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> > +               if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
> >                         goto retry;
> > -               }
> > +
> >                 spte_set = true;
> >         }
> >
> > @@ -1391,14 +1376,9 @@ static bool zap_collapsible_spte_range(struct kvm *kvm,
> >                                                             pfn, PG_LEVEL_NUM))
> >                         continue;
> >
> > -               if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
> > -                       /*
> > -                        * The iter must explicitly re-read the SPTE because
> > -                        * the atomic cmpxchg failed.
> > -                        */
> > -                       iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> > +               if (!tdp_mmu_zap_spte_atomic(kvm, &iter))
> >                         goto retry;
> > -               }
> > +
> >                 flush = true;
> >         }
> >
> > --
> > 2.34.0.rc2.393.gf8c9666880-goog
> >

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 04/15] KVM: x86/mmu: Factor out logic to atomically install a new page table
  2021-11-22 18:52   ` Ben Gardon
@ 2021-11-30 23:27     ` David Matlack
  0 siblings, 0 replies; 77+ messages in thread
From: David Matlack @ 2021-11-30 23:27 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Mon, Nov 22, 2021 at 10:53 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Fri, Nov 19, 2021 at 3:58 PM David Matlack <dmatlack@google.com> wrote:
> >
> > Factor out the logic to atomically replace an SPTE with an SPTE that
> > points to a new page table. This will be used in a follow-up commit to
> > split a large page SPTE into one level lower.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 53 ++++++++++++++++++++++++++------------
> >  1 file changed, 37 insertions(+), 16 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index cc9fe33c9b36..9ee3f4f7fdf5 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -945,6 +945,39 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> >         return ret;
> >  }
> >
> > +/*
> > + * tdp_mmu_install_sp_atomic - Atomically replace the given spte with an
> > + * spte pointing to the provided page table.
> > + *
> > + * @kvm: kvm instance
> > + * @iter: a tdp_iter instance currently on the SPTE that should be set
> > + * @sp: The new TDP page table to install.
> > + * @account_nx: True if this page table is being installed to split a
> > + *              non-executable huge page.
> > + *
> > + * Returns: True if the new page table was installed. False if spte being
> > + *          replaced changed, causing the atomic compare-exchange to fail.
> > + *          If this function returns false the sp will be freed before
> > + *          returning.
> > + */
> > +static bool tdp_mmu_install_sp_atomic(struct kvm *kvm,
> > +                                     struct tdp_iter *iter,
> > +                                     struct kvm_mmu_page *sp,
> > +                                     bool account_nx)
> > +{
> > +       u64 spte;
> > +
> > +       spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
> > +
> > +       if (tdp_mmu_set_spte_atomic(kvm, iter, spte)) {
> > +               tdp_mmu_link_page(kvm, sp, account_nx);
> > +               return true;
> > +       } else {
> > +               tdp_mmu_free_sp(sp);
> > +               return false;
> > +       }
> > +}
> > +
> >  /*
> >   * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
> >   * page tables and SPTEs to translate the faulting guest physical address.
> > @@ -954,8 +987,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >         struct kvm_mmu *mmu = vcpu->arch.mmu;
> >         struct tdp_iter iter;
> >         struct kvm_mmu_page *sp;
> > -       u64 *child_pt;
> > -       u64 new_spte;
> >         int ret;
> >
> >         kvm_mmu_hugepage_adjust(vcpu, fault);
> > @@ -983,6 +1014,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >                 }
> >
> >                 if (!is_shadow_present_pte(iter.old_spte)) {
> > +                       bool account_nx = fault->huge_page_disallowed &&
> > +                                         fault->req_level >= iter.level;
> > +
> >                         /*
> >                          * If SPTE has been frozen by another thread, just
> >                          * give up and retry, avoiding unnecessary page table
> > @@ -992,21 +1026,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >                                 break;
> >
> >                         sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level - 1);
> > -                       child_pt = sp->spt;
> > -
> > -                       new_spte = make_nonleaf_spte(child_pt,
> > -                                                    !shadow_accessed_mask);
> > -
> > -                       if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, new_spte)) {
> > -                               tdp_mmu_link_page(vcpu->kvm, sp,
> > -                                                 fault->huge_page_disallowed &&
> > -                                                 fault->req_level >= iter.level);
> > -
> > -                               trace_kvm_mmu_get_page(sp, true);
>
> This refactoring drops this trace point. Is that intentional?

Yes it was intentional, but I forgot to describe it in the commit
message. Good catch.

This tracepoint is redundant with the one in alloc_tdp_mmu_page().

I'll update the commit message for v1.

>
>
> > -                       } else {
> > -                               tdp_mmu_free_sp(sp);
> > +                       if (!tdp_mmu_install_sp_atomic(vcpu->kvm, &iter, sp, account_nx))
> >                                 break;
> > -                       }
> >                 }
> >         }
> >
> > --
> > 2.34.0.rc2.393.gf8c9666880-goog
> >

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 05/15] KVM: x86/mmu: Abstract mmu caches out to a separate struct
  2021-11-22 18:55   ` Ben Gardon
  2021-11-22 18:55     ` Ben Gardon
@ 2021-11-30 23:28     ` David Matlack
  1 sibling, 0 replies; 77+ messages in thread
From: David Matlack @ 2021-11-30 23:28 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Mon, Nov 22, 2021 at 10:55 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Fri, Nov 19, 2021 at 3:58 PM David Matlack <dmatlack@google.com> wrote:
> >
> > Move the kvm_mmu_memory_cache structs into a separate wrapper struct.
> > This is in preparation for eagerly splitting all large pages during
> > VM-ioctls (i.e. not in the vCPU fault path) which will require adding
> > kvm_mmu_memory_cache structs to struct kvm_arch.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
>
> Reviewed-by: Ben Gardon
>
> I don't think this patch creates any functional change. If that's the
> intent, it'd be worth noting.

Will do!

>
>
> > ---
> >  arch/x86/include/asm/kvm_host.h | 12 ++++---
> >  arch/x86/kvm/mmu/mmu.c          | 59 ++++++++++++++++++++++-----------
> >  arch/x86/kvm/mmu/tdp_mmu.c      |  7 ++--
> >  3 files changed, 52 insertions(+), 26 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 1fcb345bc107..2a7564703ea6 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -612,6 +612,13 @@ struct kvm_vcpu_xen {
> >         u64 runstate_times[4];
> >  };
> >
> > +struct kvm_mmu_memory_caches {
> > +       struct kvm_mmu_memory_cache pte_list_desc_cache;
> > +       struct kvm_mmu_memory_cache shadow_page_cache;
> > +       struct kvm_mmu_memory_cache gfn_array_cache;
> > +       struct kvm_mmu_memory_cache page_header_cache;
> > +};
> > +
> >  struct kvm_vcpu_arch {
> >         /*
> >          * rip and regs accesses must go through
> > @@ -681,10 +688,7 @@ struct kvm_vcpu_arch {
> >          */
> >         struct kvm_mmu *walk_mmu;
> >
> > -       struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> > -       struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> > -       struct kvm_mmu_memory_cache mmu_gfn_array_cache;
> > -       struct kvm_mmu_memory_cache mmu_page_header_cache;
> > +       struct kvm_mmu_memory_caches mmu_caches;
> >
> >         /*
> >          * QEMU userspace and the guest each have their own FPU state.
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 1146f87044a6..537952574211 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -732,38 +732,60 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> >
> >  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> >  {
> > +       struct kvm_mmu_memory_caches *mmu_caches;
> >         int r;
> >
> > +       mmu_caches = &vcpu->arch.mmu_caches;
> > +
> >         /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
> > -       r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> > +       r = kvm_mmu_topup_memory_cache(&mmu_caches->pte_list_desc_cache,
> >                                        1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> >         if (r)
> >                 return r;
> > -       r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > +       r = kvm_mmu_topup_memory_cache(&mmu_caches->shadow_page_cache,
> >                                        PT64_ROOT_MAX_LEVEL);
> >         if (r)
> >                 return r;
> >         if (maybe_indirect) {
> > -               r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_gfn_array_cache,
> > +               r = kvm_mmu_topup_memory_cache(&mmu_caches->gfn_array_cache,
> >                                                PT64_ROOT_MAX_LEVEL);
> >                 if (r)
> >                         return r;
> >         }
> > -       return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
> > +       return kvm_mmu_topup_memory_cache(&mmu_caches->page_header_cache,
> >                                           PT64_ROOT_MAX_LEVEL);
> >  }
> >
> >  static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> >  {
> > -       kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> > -       kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> > -       kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
> > -       kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> > +       struct kvm_mmu_memory_caches *mmu_caches;
> > +
> > +       mmu_caches = &vcpu->arch.mmu_caches;
> > +
> > +       kvm_mmu_free_memory_cache(&mmu_caches->pte_list_desc_cache);
> > +       kvm_mmu_free_memory_cache(&mmu_caches->shadow_page_cache);
> > +       kvm_mmu_free_memory_cache(&mmu_caches->gfn_array_cache);
> > +       kvm_mmu_free_memory_cache(&mmu_caches->page_header_cache);
> > +}
> > +
> > +static void mmu_init_memory_caches(struct kvm_mmu_memory_caches *caches)
> > +{
> > +       caches->pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> > +       caches->pte_list_desc_cache.gfp_zero = __GFP_ZERO;
> > +
> > +       caches->page_header_cache.kmem_cache = mmu_page_header_cache;
> > +       caches->page_header_cache.gfp_zero = __GFP_ZERO;
> > +
> > +       caches->shadow_page_cache.gfp_zero = __GFP_ZERO;
> >  }
> >
> >  static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
> >  {
> > -       return kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_list_desc_cache);
> > +       struct kvm_mmu_memory_caches *mmu_caches;
> > +
> > +       mmu_caches = &vcpu->arch.mmu_caches;
> > +
> > +       return kvm_mmu_memory_cache_alloc(&mmu_caches->pte_list_desc_cache);
> >  }
> >
> >  static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
> > @@ -1071,7 +1093,7 @@ static bool rmap_can_add(struct kvm_vcpu *vcpu)
> >  {
> >         struct kvm_mmu_memory_cache *mc;
> >
> > -       mc = &vcpu->arch.mmu_pte_list_desc_cache;
> > +       mc = &vcpu->arch.mmu_caches.pte_list_desc_cache;
> >         return kvm_mmu_memory_cache_nr_free_objects(mc);
> >  }
> >
> > @@ -1742,12 +1764,15 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
> >
> >  static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
> >  {
> > +       struct kvm_mmu_memory_caches *mmu_caches;
> >         struct kvm_mmu_page *sp;
> >
> > -       sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> > -       sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> > +       mmu_caches = &vcpu->arch.mmu_caches;
> > +
> > +       sp = kvm_mmu_memory_cache_alloc(&mmu_caches->page_header_cache);
> > +       sp->spt = kvm_mmu_memory_cache_alloc(&mmu_caches->shadow_page_cache);
> >         if (!direct)
> > -               sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
> > +               sp->gfns = kvm_mmu_memory_cache_alloc(&mmu_caches->gfn_array_cache);
> >         set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> >
> >         /*
> > @@ -5544,13 +5569,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> >  {
> >         int ret;
> >
> > -       vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> > -       vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
> > -
> > -       vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> > -       vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> > -
> > -       vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > +       mmu_init_memory_caches(&vcpu->arch.mmu_caches);
> >
> >         vcpu->arch.mmu = &vcpu->arch.root_mmu;
> >         vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 9ee3f4f7fdf5..b70707a7fe87 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -175,10 +175,13 @@ static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
> >  static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> >                                                int level)
> >  {
> > +       struct kvm_mmu_memory_caches *mmu_caches;
> >         struct kvm_mmu_page *sp;
> >
> > -       sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> > -       sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> > +       mmu_caches = &vcpu->arch.mmu_caches;
> > +
> > +       sp = kvm_mmu_memory_cache_alloc(&mmu_caches->page_header_cache);
> > +       sp->spt = kvm_mmu_memory_cache_alloc(&mmu_caches->shadow_page_cache);
> >         set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> >
> >         sp->role.word = page_role_for_level(vcpu, level).word;
> > --
> > 2.34.0.rc2.393.gf8c9666880-goog
> >

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 06/15] KVM: x86/mmu: Derive page role from parent
  2021-11-20 12:53   ` Paolo Bonzini
  2021-11-27  2:07     ` Lai Jiangshan
@ 2021-11-30 23:31     ` David Matlack
  2021-12-01  0:45       ` Sean Christopherson
  1 sibling, 1 reply; 77+ messages in thread
From: David Matlack @ 2021-11-30 23:31 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Sat, Nov 20, 2021 at 4:53 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 11/20/21 00:57, David Matlack wrote:
> > Derive the page role from the parent shadow page, since the only thing
> > that changes is the level. This is in preparation for eagerly splitting
> > large pages during VM-ioctls which does not have access to the vCPU
> > MMU context.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >   arch/x86/kvm/mmu/tdp_mmu.c | 43 ++++++++++++++++++++------------------
> >   1 file changed, 23 insertions(+), 20 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index b70707a7fe87..1a409992a57f 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -157,23 +157,8 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> >               if (kvm_mmu_page_as_id(_root) != _as_id) {              \
> >               } else
> >
> > -static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
> > -                                                int level)
> > -{
> > -     union kvm_mmu_page_role role;
> > -
> > -     role = vcpu->arch.mmu->mmu_role.base;
> > -     role.level = level;
> > -     role.direct = true;
> > -     role.gpte_is_8_bytes = true;
> > -     role.access = ACC_ALL;
> > -     role.ad_disabled = !shadow_accessed_mask;
> > -
> > -     return role;
> > -}
> > -
> >   static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> > -                                            int level)
> > +                                            union kvm_mmu_page_role role)
> >   {
> >       struct kvm_mmu_memory_caches *mmu_caches;
> >       struct kvm_mmu_page *sp;
> > @@ -184,7 +169,7 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> >       sp->spt = kvm_mmu_memory_cache_alloc(&mmu_caches->shadow_page_cache);
> >       set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> >
> > -     sp->role.word = page_role_for_level(vcpu, level).word;
> > +     sp->role = role;
> >       sp->gfn = gfn;
> >       sp->tdp_mmu_page = true;
> >
> > @@ -193,6 +178,19 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> >       return sp;
> >   }
> >
> > +static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu, struct tdp_iter *iter)
> > +{
> > +     struct kvm_mmu_page *parent_sp;
> > +     union kvm_mmu_page_role role;
> > +
> > +     parent_sp = sptep_to_sp(rcu_dereference(iter->sptep));
> > +
> > +     role = parent_sp->role;
> > +     role.level--;
> > +
> > +     return alloc_tdp_mmu_page(vcpu, iter->gfn, role);
> > +}
> > +
> >   hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> >   {
> >       union kvm_mmu_page_role role;
> > @@ -201,7 +199,12 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> >
> >       lockdep_assert_held_write(&kvm->mmu_lock);
> >
> > -     role = page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level);
> > +     role = vcpu->arch.mmu->mmu_role.base;
> > +     role.level = vcpu->arch.mmu->shadow_root_level;
> > +     role.direct = true;
> > +     role.gpte_is_8_bytes = true;
> > +     role.access = ACC_ALL;
> > +     role.ad_disabled = !shadow_accessed_mask;
>
> I have a similar patch for the old MMU, but it was also replacing
> shadow_root_level with shadow_root_role.  I'll see if I can adapt it to
> the TDP MMU, since the shadow_root_role is obviously the same for both.

While writing this patch I started wondering whether we could do an
even more general refactor and replace root_hpa and shadow_root_level
with a pointer to the root kvm_mmu_page struct. But I didn't get a
chance to look into it further.
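
Something like this, at least for the TDP MMU (root_sp would be a new,
purely hypothetical field in struct kvm_mmu, not something in this series):

/* Hypothetical sketch: derive both values from the root kvm_mmu_page. */
static inline hpa_t mmu_root_hpa(struct kvm_mmu *mmu)
{
        return __pa(mmu->root_sp->spt);
}

static inline int mmu_shadow_root_level(struct kvm_mmu *mmu)
{
        return mmu->root_sp->role.level;
}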


>
> Paolo
>
> >       /* Check for an existing root before allocating a new one. */
> >       for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
> > @@ -210,7 +213,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> >                       goto out;
> >       }
> >
> > -     root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
> > +     root = alloc_tdp_mmu_page(vcpu, 0, role);
> >       refcount_set(&root->tdp_mmu_root_count, 1);
> >
> >       spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > @@ -1028,7 +1031,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >                       if (is_removed_spte(iter.old_spte))
> >                               break;
> >
> > -                     sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level - 1);
> > +                     sp = alloc_child_tdp_mmu_page(vcpu, &iter);
> >                       if (!tdp_mmu_install_sp_atomic(vcpu->kvm, &iter, sp, account_nx))
> >                               break;
> >               }
> >

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
  2021-11-22  5:05   ` Nikunj A. Dadhania
@ 2021-11-30 23:33     ` David Matlack
  0 siblings, 0 replies; 77+ messages in thread
From: David Matlack @ 2021-11-30 23:33 UTC (permalink / raw)
  To: Nikunj A. Dadhania
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Xu, Peter Shier

On Sun, Nov 21, 2021 at 9:05 PM Nikunj A. Dadhania <nikunj@amd.com> wrote:
>
>
>
> On 11/20/2021 5:27 AM, David Matlack wrote:
> > When dirty logging is enabled without initially-all-set, attempt to
> > split all large pages in the memslot down to 4KB pages so that vCPUs
> > do not have to take expensive write-protection faults to split large
> > pages.
> >
> > Large page splitting is best-effort only. This commit only adds the
> > support for the TDP MMU, and even there splitting may fail due to out
> > of memory conditions. Failures to split a large page is fine from a
> > correctness standpoint because we still always follow it up by write-
> > protecting any remaining large pages.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
>
> > +int mmu_topup_split_caches(struct kvm *kvm)
> > +{
> > +     struct kvm_mmu_memory_caches *split_caches = &kvm->arch.split_caches;
> > +     int r;
> > +
> > +     assert_split_caches_invariants(kvm);
> > +
> > +     r = kvm_mmu_topup_memory_cache(&split_caches->page_header_cache, 1);
> > +     if (r)
> > +             goto out;
> > +
> > +     r = kvm_mmu_topup_memory_cache(&split_caches->shadow_page_cache, 1);
> > +     if (r)
> > +             goto out;
> > +
> > +     return 0;
> > +
> > +out:
> > +     pr_warn("Failed to top-up split caches. Will not split large pages.\n");
> > +     return r;
> > +}
> > +
> > +static void mmu_free_split_caches(struct kvm *kvm)
> > +{
> > +     assert_split_caches_invariants(kvm);
> > +
> > +     kvm_mmu_free_memory_cache(&kvm->arch.split_caches.pte_list_desc_cache);
>                                                               ^^^^^^^^^^^^^^
> I believe this should be page_header_cache.

Oh wow, thanks for catching that. You are correct. Will fix in v1.
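
i.e. the fixed version will free page_header_cache, matching what
mmu_topup_split_caches() tops up:

static void mmu_free_split_caches(struct kvm *kvm)
{
        assert_split_caches_invariants(kvm);

        kvm_mmu_free_memory_cache(&kvm->arch.split_caches.page_header_cache);
        kvm_mmu_free_memory_cache(&kvm->arch.split_caches.shadow_page_cache);
}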

>
> > +     kvm_mmu_free_memory_cache(&kvm->arch.split_caches.shadow_page_cache);
> > +}
>
> Regards
> Nikunj
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
  2021-11-22 19:30   ` Ben Gardon
@ 2021-11-30 23:44     ` David Matlack
  0 siblings, 0 replies; 77+ messages in thread
From: David Matlack @ 2021-11-30 23:44 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Mon, Nov 22, 2021 at 11:31 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Fri, Nov 19, 2021 at 3:58 PM David Matlack <dmatlack@google.com> wrote:
> >
> > When dirty logging is enabled without initially-all-set, attempt to
> > split all large pages in the memslot down to 4KB pages so that vCPUs
> > do not have to take expensive write-protection faults to split large
> > pages.
> >
> > Large page splitting is best-effort only. This commit only adds the
> > support for the TDP MMU, and even there splitting may fail due to out
> > of memory conditions. Failures to split a large page is fine from a
> > correctness standpoint because we still always follow it up by write-
> > protecting any remaining large pages.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |   6 ++
> >  arch/x86/kvm/mmu/mmu.c          |  83 +++++++++++++++++++++
> >  arch/x86/kvm/mmu/mmu_internal.h |   3 +
> >  arch/x86/kvm/mmu/spte.c         |  46 ++++++++++++
> >  arch/x86/kvm/mmu/spte.h         |   1 +
> >  arch/x86/kvm/mmu/tdp_mmu.c      | 123 ++++++++++++++++++++++++++++++++
> >  arch/x86/kvm/mmu/tdp_mmu.h      |   5 ++
> >  arch/x86/kvm/x86.c              |   6 ++
> >  8 files changed, 273 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 2a7564703ea6..432a4df817ec 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1232,6 +1232,9 @@ struct kvm_arch {
> >         hpa_t   hv_root_tdp;
> >         spinlock_t hv_root_tdp_lock;
> >  #endif
> > +
> > +       /* MMU caches used when splitting large pages during VM-ioctls. */
> > +       struct kvm_mmu_memory_caches split_caches;
> >  };
> >
> >  struct kvm_vm_stat {
> > @@ -1588,6 +1591,9 @@ void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
> >  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> >                                       const struct kvm_memory_slot *memslot,
> >                                       int start_level);
> > +void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
> > +                                       const struct kvm_memory_slot *memslot,
> > +                                       int target_level);
> >  void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
> >                                    const struct kvm_memory_slot *memslot);
> >  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 54f0d2228135..6768ef9c0891 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -738,6 +738,66 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> >                                           PT64_ROOT_MAX_LEVEL);
> >  }
> >
> > +static inline void assert_split_caches_invariants(struct kvm *kvm)
> > +{
> > +       /*
> > +        * The split caches must only be modified while holding the slots_lock,
> > +        * since it is only used during memslot VM-ioctls.
> > +        */
> > +       lockdep_assert_held(&kvm->slots_lock);
> > +
> > +       /*
> > +        * Only the TDP MMU supports large page splitting using
> > +        * kvm->arch.split_caches, which is why we only have to allocate
> > +        * page_header_cache and shadow_page_cache. Assert that the TDP
> > +        * MMU is at least enabled when the split cache is allocated.
> > +        */
> > +       BUG_ON(!is_tdp_mmu_enabled(kvm));
> > +}
> > +
> > +int mmu_topup_split_caches(struct kvm *kvm)
> > +{
> > +       struct kvm_mmu_memory_caches *split_caches = &kvm->arch.split_caches;
> > +       int r;
> > +
> > +       assert_split_caches_invariants(kvm);
> > +
> > +       r = kvm_mmu_topup_memory_cache(&split_caches->page_header_cache, 1);
> > +       if (r)
> > +               goto out;
> > +
> > +       r = kvm_mmu_topup_memory_cache(&split_caches->shadow_page_cache, 1);
> > +       if (r)
> > +               goto out;
> > +
> > +       return 0;
> > +
> > +out:
> > +       pr_warn("Failed to top-up split caches. Will not split large pages.\n");
> > +       return r;
> > +}
> > +
> > +static void mmu_free_split_caches(struct kvm *kvm)
> > +{
> > +       assert_split_caches_invariants(kvm);
> > +
> > +       kvm_mmu_free_memory_cache(&kvm->arch.split_caches.pte_list_desc_cache);
> > +       kvm_mmu_free_memory_cache(&kvm->arch.split_caches.shadow_page_cache);
> > +}
> > +
> > +bool mmu_split_caches_need_topup(struct kvm *kvm)
> > +{
> > +       assert_split_caches_invariants(kvm);
> > +
> > +       if (kvm->arch.split_caches.page_header_cache.nobjs == 0)
> > +               return true;
> > +
> > +       if (kvm->arch.split_caches.shadow_page_cache.nobjs == 0)
> > +               return true;
> > +
> > +       return false;
> > +}
> > +
> >  static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> >  {
> >         struct kvm_mmu_memory_caches *mmu_caches;
> > @@ -5696,6 +5756,7 @@ void kvm_mmu_init_vm(struct kvm *kvm)
> >
> >         spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
> >
> > +       mmu_init_memory_caches(&kvm->arch.split_caches);
> >         kvm_mmu_init_tdp_mmu(kvm);
> >
> >         node->track_write = kvm_mmu_pte_write;
> > @@ -5819,6 +5880,28 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> >                 kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
> >  }
> >
> > +void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
> > +                                       const struct kvm_memory_slot *memslot,
> > +                                       int target_level)
> > +{
> > +       u64 start, end;
> > +
> > +       if (!is_tdp_mmu_enabled(kvm))
> > +               return;
> > +
> > +       if (mmu_topup_split_caches(kvm))
> > +               return;
> > +
> > +       start = memslot->base_gfn;
> > +       end = start + memslot->npages;
> > +
> > +       read_lock(&kvm->mmu_lock);
> > +       kvm_tdp_mmu_try_split_large_pages(kvm, memslot, start, end, target_level);
> > +       read_unlock(&kvm->mmu_lock);
> > +
> > +       mmu_free_split_caches(kvm);
> > +}
> > +
> >  static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> >                                          struct kvm_rmap_head *rmap_head,
> >                                          const struct kvm_memory_slot *slot)
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 52c6527b1a06..89b9b907c567 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -161,4 +161,7 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> >  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >
> > +int mmu_topup_split_caches(struct kvm *kvm);
> > +bool mmu_split_caches_need_topup(struct kvm *kvm);
> > +
> >  #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> > index df2cdb8bcf77..6bb9b597a854 100644
> > --- a/arch/x86/kvm/mmu/spte.c
> > +++ b/arch/x86/kvm/mmu/spte.c
> > @@ -191,6 +191,52 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >         return wrprot;
> >  }
> >
> > +static u64 mark_spte_executable(u64 spte)
> > +{
> > +       bool is_access_track = is_access_track_spte(spte);
> > +
> > +       if (is_access_track)
> > +               spte = restore_acc_track_spte(spte);
> > +
> > +       spte &= ~shadow_nx_mask;
> > +       spte |= shadow_x_mask;
> > +
> > +       if (is_access_track)
> > +               spte = mark_spte_for_access_track(spte);
> > +
> > +       return spte;
> > +}
> > +
> > +/*
> > + * Construct an SPTE that maps a sub-page of the given large SPTE. This is
> > + * used during large page splitting, to build the SPTEs that make up the new
> > + * page table.
> > + */
> > +u64 make_large_page_split_spte(u64 large_spte, int level, int index, unsigned int access)
>
> Just because this always trips me up reading code, I'd suggest naming
> the argument large_spte_level or something.
> Avoiding a variable called "level" in this function makes it much more explicit.

Will do.
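
i.e. the declaration would become:

u64 make_large_page_split_spte(u64 large_spte, int large_spte_level, int index,
                               unsigned int access);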

>
> > +{
> > +       u64 child_spte;
> > +       int child_level;
> > +
> > +       BUG_ON(is_mmio_spte(large_spte));
> > +       BUG_ON(!is_large_present_pte(large_spte));
>
> In the interest of not crashing the host, I think it would be safe to
> WARN and return 0 here.
> BUG is fine too if that's preferred.

Ack. I'll take a look and see if I can avoid the BUG_ONs. They're
optional sanity checks anyway.
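
If we keep the checks at all, something like this would avoid crashing the
host:

        /* Sanity checks only; warn and bail instead of taking down the host. */
        if (WARN_ON_ONCE(is_mmio_spte(large_spte)) ||
            WARN_ON_ONCE(!is_large_present_pte(large_spte)))
                return 0;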

>
> > +
> > +       child_spte = large_spte;
> > +       child_level = level - 1;
> > +
> > +       child_spte += (index * KVM_PAGES_PER_HPAGE(child_level)) << PAGE_SHIFT;
>
> This += makes me nervous. It at least merits a comment explaining
> what's going on.
> I'd find a |= more readable to make it more explicit and since sptes
> aren't numbers.
> You could probably also be really explicit about extracting the PFN
> and adding to it, clearing the PFN bits and then putting it back in
> and I bet the compiler would optimize out the extra bit fiddling.

I can change it to |= and add a comment. I prefer not to extract the
PFN and replace it, since there's really no reason to. One of the nice
things about this function is that we don't have to construct the
child SPTE from scratch; we just have to slightly adjust the parent
SPTE. As for the address, the base address from the large SPTE is
already there; we just need to add in the offset of the lower-level
page.
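
Concretely, something like:

        /*
         * The child SPTE starts as a copy of the large SPTE, so the base PFN
         * is already in place. A large page is aligned to its own size, so
         * the low address bits are clear and OR'ing in the offset of sub-page
         * @index is equivalent to adding it.
         */
        child_spte |= (index * KVM_PAGES_PER_HPAGE(child_level)) << PAGE_SHIFT;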

>
> > +
> > +       if (child_level == PG_LEVEL_4K) {
> > +               child_spte &= ~PT_PAGE_SIZE_MASK;
> > +
> > +               /* Allow execution for 4K pages if it was disabled for NX HugePages. */
> > +               if (is_nx_huge_page_enabled() && access & ACC_EXEC_MASK)
> > +                       child_spte = mark_spte_executable(child_spte);
> > +       }
> > +
> > +       return child_spte;
> > +}
> > +
> > +
> >  u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
> >  {
> >         u64 spte = SPTE_MMU_PRESENT_MASK;
> > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > index 3e4943ee5a01..4efb4837e38d 100644
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -339,6 +339,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >                unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
> >                u64 old_spte, bool prefetch, bool can_unsync,
> >                bool host_writable, u64 *new_spte);
> > +u64 make_large_page_split_spte(u64 large_spte, int level, int index, unsigned int access);
> >  u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
> >  u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
> >  u64 mark_spte_for_access_track(u64 spte);
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 5ca0fa659245..366857b9fb3b 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -695,6 +695,39 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
> >         return false;
> >  }
> >
> > +static inline bool
> > +tdp_mmu_need_split_caches_topup_or_resched(struct kvm *kvm, struct tdp_iter *iter)
> > +{
> > +       if (mmu_split_caches_need_topup(kvm))
> > +               return true;
> > +
> > +       return tdp_mmu_iter_need_resched(kvm, iter);
> > +}
> > +
> > +static inline int
> > +tdp_mmu_topup_split_caches_resched(struct kvm *kvm, struct tdp_iter *iter, bool flush)
>
> This functionality could be shoe-horned into
> tdp_mmu_iter_cond_resched, reducing code duplication.
> I don't know if the extra parameters / complexity on that function
> would be worth it, but I'm slightly inclined in that direction.

OK, I'll take a look and see if I can combine them in a nice way.

>
> > +{
> > +       int r;
> > +
> > +       rcu_read_unlock();
> > +
> > +       if (flush)
> > +               kvm_flush_remote_tlbs(kvm);
> > +
> > +       read_unlock(&kvm->mmu_lock);
> > +
> > +       cond_resched();
> > +       r = mmu_topup_split_caches(kvm);
>
> Ah, right. I was confused by this for a second, but it's safe because
> the caches are protected by the slots lock.
>
> > +
> > +       read_lock(&kvm->mmu_lock);
> > +
> > +       rcu_read_lock();
> > +       WARN_ON(iter->gfn > iter->next_last_level_gfn);
> > +       tdp_iter_restart(iter);
> > +
> > +       return r;
> > +}
> > +
> >  /*
> >   * Tears down the mappings for the range of gfns, [start, end), and frees the
> >   * non-root pages mapping GFNs strictly within that range. Returns true if
> > @@ -1241,6 +1274,96 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
> >         return spte_set;
> >  }
> >
> > +static bool tdp_mmu_split_large_page_atomic(struct kvm *kvm, struct tdp_iter *iter)
> > +{
> > +       const u64 large_spte = iter->old_spte;
> > +       const int level = iter->level;
> > +       struct kvm_mmu_page *child_sp;
> > +       u64 child_spte;
> > +       int i;
> > +
> > +       BUG_ON(mmu_split_caches_need_topup(kvm));
>
> I think it would be safe to just WARN and return here as well.
>
> > +
> > +       child_sp = alloc_child_tdp_mmu_page(&kvm->arch.split_caches, iter);
> > +
> > +       for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
> > +               child_spte = make_large_page_split_spte(large_spte, level, i, ACC_ALL);
>
> Relating to my other comment above on make_large_page_split_spte, you
> could also iterate through the range of PFNs here and pass that as an
> argument to the helper function.
>
> > +
> > +               /*
> > +                * No need for atomics since child_sp has not been installed
> > +                * in the table yet and thus is not reachable by any other
> > +                * thread.
> > +                */
> > +               child_sp->spt[i] = child_spte;
> > +       }
> > +
> > +       return tdp_mmu_install_sp_atomic(kvm, iter, child_sp, false);
> > +}
> > +
> > +static void tdp_mmu_split_large_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > +                                          gfn_t start, gfn_t end, int target_level)
> > +{
> > +       struct tdp_iter iter;
> > +       bool flush = false;
> > +       int r;
> > +
> > +       rcu_read_lock();
> > +
> > +       /*
> > +        * Traverse the page table splitting all large pages above the target
> > +        * level into one lower level. For example, if we encounter a 1GB page
> > +        * we split it into 512 2MB pages.
> > +        *
> > +        * Since the TDP iterator uses a pre-order traversal, we are guaranteed
> > +        * to visit an SPTE before ever visiting its children, which means we
> > +        * will correctly recursively split large pages that are more than one
> > +        * level above the target level (e.g. splitting 1GB to 2MB to 4KB).
> > +        */
> > +       for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
> > +retry:
> > +               if (tdp_mmu_need_split_caches_topup_or_resched(kvm, &iter)) {
> > +                       r = tdp_mmu_topup_split_caches_resched(kvm, &iter, flush);
> > +                       flush = false;
> > +
> > +                       /*
> > +                        * If topping up the split caches failed, we can't split
> > +                        * any more pages. Bail out of the loop.
> > +                        */
> > +                       if (r)
> > +                               break;
> > +
> > +                       continue;
> > +               }
> > +
> > +               if (!is_large_present_pte(iter.old_spte))
> > +                       continue;
> > +
> > +               if (!tdp_mmu_split_large_page_atomic(kvm, &iter))
> > +                       goto retry;
> > +
> > +               flush = true;
> > +       }
> > +
> > +       rcu_read_unlock();
> > +
> > +       if (flush)
> > +               kvm_flush_remote_tlbs(kvm);
> > +}
> > +
> > +void kvm_tdp_mmu_try_split_large_pages(struct kvm *kvm,
> > +                                      const struct kvm_memory_slot *slot,
> > +                                      gfn_t start, gfn_t end,
> > +                                      int target_level)
> > +{
> > +       struct kvm_mmu_page *root;
> > +
> > +       lockdep_assert_held_read(&kvm->mmu_lock);
> > +
> > +       for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
> > +               tdp_mmu_split_large_pages_root(kvm, root, start, end, target_level);
> > +
> > +}
> > +
> >  /*
> >   * Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
> >   * AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index 476b133544dd..7812087836b2 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -72,6 +72,11 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
> >                                    struct kvm_memory_slot *slot, gfn_t gfn,
> >                                    int min_level);
> >
> > +void kvm_tdp_mmu_try_split_large_pages(struct kvm *kvm,
> > +                                      const struct kvm_memory_slot *slot,
> > +                                      gfn_t start, gfn_t end,
> > +                                      int target_level);
> > +
> >  static inline void kvm_tdp_mmu_walk_lockless_begin(void)
> >  {
> >         rcu_read_lock();
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 04e8dabc187d..4702ebfd394b 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -11735,6 +11735,12 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
> >                 if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> >                         return;
> >
> > +               /*
> > +                * Attempt to split all large pages into 4K pages so that vCPUs
> > +                * do not have to take write-protection faults.
> > +                */
> > +               kvm_mmu_slot_try_split_large_pages(kvm, new, PG_LEVEL_4K);
>
> Thank you for parameterizing the target level here. I'm working on a
> proof of concept for 2M dirty tracking right now (still in exploratory
> phase) and this parameter will help future-proof the splitting
> algorithm if we ever decide we don't want to split everything to 4k
> for dirty logging.

Exactly my thinking as well! :)

>
> > +
> >                 if (kvm_x86_ops.cpu_dirty_log_size) {
> >                         kvm_mmu_slot_leaf_clear_dirty(kvm, new);
> >                         kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_2M);
> > --
> > 2.34.0.rc2.393.gf8c9666880-goog
> >

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
  2021-11-26 12:01   ` Peter Xu
@ 2021-11-30 23:56     ` David Matlack
  2021-12-01  1:00       ` Sean Christopherson
  0 siblings, 1 reply; 77+ messages in thread
From: David Matlack @ 2021-11-30 23:56 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Fri, Nov 26, 2021 at 4:01 AM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, David,
>
> On Fri, Nov 19, 2021 at 11:57:56PM +0000, David Matlack wrote:
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 2a7564703ea6..432a4df817ec 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1232,6 +1232,9 @@ struct kvm_arch {
> >       hpa_t   hv_root_tdp;
> >       spinlock_t hv_root_tdp_lock;
> >  #endif
> > +
> > +     /* MMU caches used when splitting large pages during VM-ioctls. */
> > +     struct kvm_mmu_memory_caches split_caches;
>
> Are mmu_gfn_array_cache and mmu_pte_list_desc_cache wasted here?  I saw that
> "struct kvm_mmu_memory_cache" still takes up quite a few hundreds of bytes,
> just want to make sure we won't waste them in vain.

Yes, they are wasted right now. But there are a couple of things to keep in mind:

1. They are also wasted in every vCPU (in the per-vCPU caches) that
does not use the shadow MMU.
2. They will (I think) be used eventually when I add Eager Page
Splitting support to the shadow MMU.
3. split_caches is per-VM so it's only a few hundred bytes per VM.

If we really want to save the memory, the right way forward might be to
make each kvm_mmu_memory_cache a pointer instead of an embedded
struct. Then we can allocate each dynamically only as needed. I can
add that to my TODO list but I don't think it'd be worth blocking this
on it given the points above.
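
e.g. (sketch only, not part of this series):

/*
 * Sketch: make each cache a pointer that is allocated on demand, so a
 * configuration that never uses a given cache doesn't pay for it.
 */
struct kvm_mmu_memory_caches {
        struct kvm_mmu_memory_cache *pte_list_desc_cache;
        struct kvm_mmu_memory_cache *shadow_page_cache;
        struct kvm_mmu_memory_cache *gfn_array_cache;
        struct kvm_mmu_memory_cache *page_header_cache;
};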

>
> [...]
>
> > +int mmu_topup_split_caches(struct kvm *kvm)
> > +{
> > +     struct kvm_mmu_memory_caches *split_caches = &kvm->arch.split_caches;
> > +     int r;
> > +
> > +     assert_split_caches_invariants(kvm);
> > +
> > +     r = kvm_mmu_topup_memory_cache(&split_caches->page_header_cache, 1);
> > +     if (r)
> > +             goto out;
> > +
> > +     r = kvm_mmu_topup_memory_cache(&split_caches->shadow_page_cache, 1);
> > +     if (r)
> > +             goto out;
>
> Is it intended to only top-up with one cache object?  IIUC this means we'll try
> to proactively yield the cpu for each of the huge page split right after the
> object is consumed.
>
> Wondering whether it be more efficient to make it a slightly larger number, so
> we don't overload the memory but also make the loop a bit more efficient.

IIUC, 1 here is just the min needed for kvm_mmu_topup_memory_cache to
return success. I chose 1 for each because it's the minimum necessary
to make forward progress (split one large page).

No matter what you pass for min, kvm_mmu_topup_memory_cache() will
still try to allocate up to KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
objects.
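
Roughly, the existing helper looks like this (paraphrased from kvm_main.c,
possibly not verbatim):

int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
{
        void *obj;

        /* min is only the success threshold; the cache is still filled up. */
        if (mc->nobjs >= min)
                return 0;

        while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
                obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
                if (!obj)
                        return mc->nobjs >= min ? 0 : -ENOMEM;
                mc->objects[mc->nobjs++] = obj;
        }
        return 0;
}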


>
> > +
> > +     return 0;
> > +
> > +out:
> > +     pr_warn("Failed to top-up split caches. Will not split large pages.\n");
> > +     return r;
> > +}
>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
  2021-11-26 12:17   ` Peter Xu
@ 2021-12-01  0:16     ` David Matlack
  2021-12-01  0:17       ` David Matlack
  0 siblings, 1 reply; 77+ messages in thread
From: David Matlack @ 2021-12-01  0:16 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Fri, Nov 26, 2021 at 4:17 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Nov 19, 2021 at 11:57:57PM +0000, David Matlack wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6768ef9c0891..4e78ef2dd352 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> >               gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> >               gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> >
> > +             /*
> > +              * Try to proactively split any large pages down to 4KB so that
> > +              * vCPUs don't have to take write-protection faults.
> > +              */
> > +             kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
> > +
> >               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> >
> >               /* Cross two large pages? */
>
> Is it intended to try split every time even if we could have split it already?
> As I remember Paolo mentioned that we can skip split if it's not the 1st
> CLEAR_LOG on the same range, and IIUC that makes sense.
>
> But indeed I don't see a trivial way to know whether this is the first clear of
> this range.  Maybe we can maintain "how many huge pages are there under current
> kvm_mmu_page node" somehow?  Then if root sp has the counter==0, then we can
> skip it.  Just a wild idea..
>
> Or maybe it's intended to try split unconditionally for some reason?  If so
> it'll be great to mention that either in the commit message or in comments.

Thanks for calling this out. Could the same be said about the existing
code that unconditionally tries to write-protect 2M+ pages? I aimed to
keep parity with the write-protection calls (always try to split
before write-protecting), but I agree there might be opportunities to
skip the split altogether.

By the way, looking at this code again I think I see some potential bugs:
 - I don't think I ever free split_caches in the initially-all-set case.
 - What happens if splitting fails the CLEAR_LOG but succeeds the
CLEAR_LOG? We would end up propagating the write-protection on the 2M
page down to the 4K page. This might cause issues if using PML.

>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
  2021-12-01  0:16     ` David Matlack
@ 2021-12-01  0:17       ` David Matlack
  2021-12-01  4:03         ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: David Matlack @ 2021-12-01  0:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Tue, Nov 30, 2021 at 4:16 PM David Matlack <dmatlack@google.com> wrote:
>
> On Fri, Nov 26, 2021 at 4:17 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Nov 19, 2021 at 11:57:57PM +0000, David Matlack wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 6768ef9c0891..4e78ef2dd352 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > >               gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> > >               gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> > >
> > > +             /*
> > > +              * Try to proactively split any large pages down to 4KB so that
> > > +              * vCPUs don't have to take write-protection faults.
> > > +              */
> > > +             kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
> > > +
> > >               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> > >
> > >               /* Cross two large pages? */
> >
> > Is it intended to try split every time even if we could have split it already?
> > As I remember Paolo mentioned that we can skip split if it's not the 1st
> > CLEAR_LOG on the same range, and IIUC that makes sense.
> >
> > But indeed I don't see a trivial way to know whether this is the first clear of
> > this range.  Maybe we can maintain "how many huge pages are there under current
> > kvm_mmu_page node" somehow?  Then if root sp has the counter==0, then we can
> > skip it.  Just a wild idea..
> >
> > Or maybe it's intended to try split unconditionally for some reason?  If so
> > it'll be great to mention that either in the commit message or in comments.
>
> Thanks for calling this out. Could the same be said about the existing
> code that unconditionally tries to write-protect 2M+ pages? I aimed to
> keep parity with the write-protection calls (always try to split
> before write-protecting) but I agree there might be opportunities
> available to skip altogether.
>
> By the way, looking at this code again I think I see some potential bugs:
>  - I don't think I ever free split_caches in the initially-all-set case.
>  - What happens if splitting fails the CLEAR_LOG but succeeds the
> CLEAR_LOG?

Gah, meant to say "first CLEAR_LOG" and "second CLEAR_LOG" here.

> We would end up propagating the write-protection on the 2M
> page down to the 4K page. This might cause issues if using PML.
>
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 06/15] KVM: x86/mmu: Derive page role from parent
  2021-11-30 23:31     ` David Matlack
@ 2021-12-01  0:45       ` Sean Christopherson
  2021-12-01 21:56         ` David Matlack
  0 siblings, 1 reply; 77+ messages in thread
From: Sean Christopherson @ 2021-12-01  0:45 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Tue, Nov 30, 2021, David Matlack wrote:
> > I have a similar patch for the old MMU, but it was also replacing
> > shadow_root_level with shadow_root_role.  I'll see if I can adapt it to
> > the TDP MMU, since the shadow_root_role is obviously the same for both.
> 
> While I was writing this patch it got me wondering if we can do an
> even more general refactor and replace root_hpa and shadow_root_level
> with a pointer to the root kvm_mmu_page struct. But I didn't get a
> chance to look into it further.

For the TDP MMU, yes, as root_hpa == __pa(sp->spt) in all cases.  For the legacy/full
MMU, not without additional refactoring since root_hpa doesn't point at a kvm_mmu_page
when KVM shadows a non-paging guest with PAE paging (uses pae_root), or when KVM
shadows nested NPT and the guest is using fewer paging levels than the host (uses
pml5_root or pml4_root).

	if (mmu->shadow_root_level == PT64_ROOT_5LEVEL)
		mmu->root_hpa = __pa(mmu->pml5_root);
	else if (mmu->shadow_root_level == PT64_ROOT_4LEVEL)
		mmu->root_hpa = __pa(mmu->pml4_root);
	else
		mmu->root_hpa = __pa(mmu->pae_root);

That's definitely a solvable problem, e.g. it wouldn't be a problem to burn a few
kvm_mmu_page for the special root.  The biggest issue is probably the sheer amount
of code that would need to be updated.  I do think it would be a good change, but
I think we'd want to do it in a release that isn't expected to have many other MMU
changes.

shadow_root_level can also be replaced by mmu_role.base.level.  I've never bothered
to do the replacement because there's zero memory savings and it would undoubtedly
take me some time to retrain my brain :-)
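
As a rough sketch (assuming mmu_role.base.level is kept in sync with the shadow
root the way shadow_root_level is today, and with a made-up helper name), the
replacement could boil down to:

static inline int kvm_shadow_root_level(struct kvm_mmu *mmu)
{
	/* Hypothetical accessor; mmu_role.base.level mirrors shadow_root_level. */
	return mmu->mmu_role.base.level;
}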

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
  2021-11-30 23:56     ` David Matlack
@ 2021-12-01  1:00       ` Sean Christopherson
  2021-12-01  1:29         ` David Matlack
  0 siblings, 1 reply; 77+ messages in thread
From: Sean Christopherson @ 2021-12-01  1:00 UTC (permalink / raw)
  To: David Matlack
  Cc: Peter Xu, Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel,
	Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Tue, Nov 30, 2021, David Matlack wrote:
> On Fri, Nov 26, 2021 at 4:01 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, David,
> >
> > On Fri, Nov 19, 2021 at 11:57:56PM +0000, David Matlack wrote:
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 2a7564703ea6..432a4df817ec 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -1232,6 +1232,9 @@ struct kvm_arch {
> > >       hpa_t   hv_root_tdp;
> > >       spinlock_t hv_root_tdp_lock;
> > >  #endif
> > > +
> > > +     /* MMU caches used when splitting large pages during VM-ioctls. */
> > > +     struct kvm_mmu_memory_caches split_caches;
> >
> > Are mmu_gfn_array_cache and mmu_pte_list_desc_cache wasted here?  I saw that
> > "struct kvm_mmu_memory_cache" still takes up quite a few hundreds of bytes,
> > just want to make sure we won't waste them in vain.
> 
> Yes they are wasted right now. But there's a couple of things to keep in mind:
> 
> 1. They are also wasted in every vCPU (in the per-vCPU caches) that
> does not use the shadow MMU.
> 2. They will (I think) be used eventually when I add Eager Page
> Splitting support to the shadow MMU.
> 3. split_caches is per-VM so it's only a few hundred bytes per VM.
> 
> If we really want to save the memory the right way forward might be to
> make each kvm_mmu_memory_cache a pointer instead of an embedded
> struct. Then we can allocate each dynamically only as needed. I can
> add that to my TODO list but I don't think it'd be worth blocking this
> on it given the points above.
> 
> >
> > [...]
> >
> > > +int mmu_topup_split_caches(struct kvm *kvm)
> > > +{
> > > +     struct kvm_mmu_memory_caches *split_caches = &kvm->arch.split_caches;
> > > +     int r;
> > > +
> > > +     assert_split_caches_invariants(kvm);
> > > +
> > > +     r = kvm_mmu_topup_memory_cache(&split_caches->page_header_cache, 1);
> > > +     if (r)
> > > +             goto out;
> > > +
> > > +     r = kvm_mmu_topup_memory_cache(&split_caches->shadow_page_cache, 1);
> > > +     if (r)
> > > +             goto out;
> >
> > Is it intended to only top-up with one cache object?  IIUC this means we'll try
> > to proactively yield the cpu for each of the huge page split right after the
> > object is consumed.
> >
> > Wondering whether it be more efficient to make it a slightly larger number, so
> > we don't overload the memory but also make the loop a bit more efficient.
> 
> IIUC, 1 here is just the min needed for kvm_mmu_topup_memory_cache to
> return success. I chose 1 for each because it's the minimum necessary
> to make forward progress (split one large page).

The @min parameter is minimum number of pages that _must_ be available in the
cache, i.e. it's the maximum number of pages that can theoretically be used by
whatever upcoming operation is going to be consuming pages from the cache.

So '1' is technically correct, but I think it's the wrong choice given the behavior
of this code.  E.g. if there's 1 object in the cache, the initial top-up will do
nothing, and then tdp_mmu_split_large_pages_root() will almost immediately drop
mmu_lock to topup the cache.  Since the in-loop usage explicitly checks for an
empty cache, i.e. any non-zero @min will have identical behavior, I think it makes
sense to use KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE _and_ add a comment explaining why.
 
> No matter what you pass for min kvm_mmu_topup_memory_cache() will
> still always try to allocate KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
> objects.

No, it will try to allocate KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE if and only if there
are fewer than @min objects in the cache. 
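
For reference, the generic topup path behaves roughly like this (a simplified
sketch rather than a verbatim copy of virt/kvm/kvm_main.c): it returns early
only when the cache already holds @min objects, otherwise it fills the cache to
capacity and reports failure only if it couldn't reach @min.

int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
{
	void *obj;

	/* Nothing to do if the cache already satisfies the minimum. */
	if (mc->nobjs >= min)
		return 0;

	/* Otherwise fill all the way up; only reaching @min is mandatory. */
	while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
		obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
		if (!obj)
			return mc->nobjs >= min ? 0 : -ENOMEM;
		mc->objects[mc->nobjs++] = obj;
	}
	return 0;
}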

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
  2021-12-01  1:00       ` Sean Christopherson
@ 2021-12-01  1:29         ` David Matlack
  2021-12-01  2:29           ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: David Matlack @ 2021-12-01  1:29 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Xu, Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel,
	Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Tue, Nov 30, 2021 at 5:01 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Nov 30, 2021, David Matlack wrote:
> > On Fri, Nov 26, 2021 at 4:01 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Hi, David,
> > >
> > > On Fri, Nov 19, 2021 at 11:57:56PM +0000, David Matlack wrote:
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 2a7564703ea6..432a4df817ec 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -1232,6 +1232,9 @@ struct kvm_arch {
> > > >       hpa_t   hv_root_tdp;
> > > >       spinlock_t hv_root_tdp_lock;
> > > >  #endif
> > > > +
> > > > +     /* MMU caches used when splitting large pages during VM-ioctls. */
> > > > +     struct kvm_mmu_memory_caches split_caches;
> > >
> > > Are mmu_gfn_array_cache and mmu_pte_list_desc_cache wasted here?  I saw that
> > > "struct kvm_mmu_memory_cache" still takes up quite a few hundreds of bytes,
> > > just want to make sure we won't waste them in vain.
> >
> > Yes they are wasted right now. But there's a couple of things to keep in mind:
> >
> > 1. They are also wasted in every vCPU (in the per-vCPU caches) that
> > does not use the shadow MMU.
> > 2. They will (I think) be used eventually when I add Eager Page
> > Splitting support to the shadow MMU.
> > 3. split_caches is per-VM so it's only a few hundred bytes per VM.
> >
> > If we really want to save the memory the right way forward might be to
> > make each kvm_mmu_memory_cache a pointer instead of an embedded
> > struct. Then we can allocate each dynamically only as needed. I can
> > add that to my TODO list but I don't think it'd be worth blocking this
> > on it given the points above.
> >
> > >
> > > [...]
> > >
> > > > +int mmu_topup_split_caches(struct kvm *kvm)
> > > > +{
> > > > +     struct kvm_mmu_memory_caches *split_caches = &kvm->arch.split_caches;
> > > > +     int r;
> > > > +
> > > > +     assert_split_caches_invariants(kvm);
> > > > +
> > > > +     r = kvm_mmu_topup_memory_cache(&split_caches->page_header_cache, 1);
> > > > +     if (r)
> > > > +             goto out;
> > > > +
> > > > +     r = kvm_mmu_topup_memory_cache(&split_caches->shadow_page_cache, 1);
> > > > +     if (r)
> > > > +             goto out;
> > >
> > > Is it intended to only top-up with one cache object?  IIUC this means we'll try
> > > to proactively yield the cpu for each of the huge page split right after the
> > > object is consumed.
> > >
> > > Wondering whether it be more efficient to make it a slightly larger number, so
> > > we don't overload the memory but also make the loop a bit more efficient.
> >
> > IIUC, 1 here is just the min needed for kvm_mmu_topup_memory_cache to
> > return success. I chose 1 for each because it's the minimum necessary
> > to make forward progress (split one large page).
>
> The @min parameter is minimum number of pages that _must_ be available in the
> cache, i.e. it's the maximum number of pages that can theoretically be used by
> whatever upcoming operation is going to be consuming pages from the cache.
>
> So '1' is technically correct, but I think it's the wrong choice given the behavior
> of this code.  E.g. if there's 1 object in the cache, the initial top-up will do
> nothing,

This scenario will not happen though, since we free the caches after
splitting. So, the next time userspace enables dirty logging on a
memslot and we go to do the initial top-up the caches will have 0
objects.

> and then tdp_mmu_split_large_pages_root() will almost immediately drop
> mmu_lock to topup the cache.  Since the in-loop usage explicitly checks for an
> empty cache, i.e. any non-zero @min will have identical behavior, I think it makes
> sense to use KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE _and_ add a comment explaining why.

If we set the min to KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
kvm_mmu_topup_memory_cache will return ENOMEM if it can't allocate at
least KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE objects, even though we really
only need 1 to make forward progress.

It's a total edge case but there could be a scenario where userspace
sets the cgroup memory limits so tight that we can't allocate
KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE objects when splitting the last few
pages and in the end we only needed 1 or 2 objects to finish
splitting. In this case we'd end up with a spurious pr_warn and may
not split the last few pages depending on which cache failed to get
topped up.


>
> > No matter what you pass for min kvm_mmu_topup_memory_cache() will
> > still always try to allocate KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
> > objects.
>
> No, it will try to allocate KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE if and only if there
> are fewer than @min objects in the cache.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
  2021-12-01  1:29         ` David Matlack
@ 2021-12-01  2:29           ` Peter Xu
  2021-12-01 18:29             ` Sean Christopherson
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2021-12-01  2:29 UTC (permalink / raw)
  To: David Matlack
  Cc: Sean Christopherson, Paolo Bonzini, kvm, Ben Gardon,
	Joerg Roedel, Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Tue, Nov 30, 2021 at 05:29:10PM -0800, David Matlack wrote:
> On Tue, Nov 30, 2021 at 5:01 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Nov 30, 2021, David Matlack wrote:
> > > On Fri, Nov 26, 2021 at 4:01 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > Hi, David,
> > > >
> > > > On Fri, Nov 19, 2021 at 11:57:56PM +0000, David Matlack wrote:
> > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > > index 2a7564703ea6..432a4df817ec 100644
> > > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > > @@ -1232,6 +1232,9 @@ struct kvm_arch {
> > > > >       hpa_t   hv_root_tdp;
> > > > >       spinlock_t hv_root_tdp_lock;
> > > > >  #endif
> > > > > +
> > > > > +     /* MMU caches used when splitting large pages during VM-ioctls. */
> > > > > +     struct kvm_mmu_memory_caches split_caches;
> > > >
> > > > Are mmu_gfn_array_cache and mmu_pte_list_desc_cache wasted here?  I saw that
> > > > "struct kvm_mmu_memory_cache" still takes up quite a few hundreds of bytes,
> > > > just want to make sure we won't waste them in vain.
> > >
> > > Yes they are wasted right now. But there's a couple of things to keep in mind:
> > >
> > > 1. They are also wasted in every vCPU (in the per-vCPU caches) that
> > > does not use the shadow MMU.
> > > 2. They will (I think) be used eventually when I add Eager Page
> > > Splitting support to the shadow MMU.
> > > 3. split_caches is per-VM so it's only a few hundred bytes per VM.
> > >
> > > If we really want to save the memory the right way forward might be to
> > > make each kvm_mmu_memory_cache a pointer instead of an embedded
> > > struct. Then we can allocate each dynamically only as needed. I can
> > > add that to my TODO list but I don't think it'd be worth blocking this
> > > on it given the points above.

Yeah I never meant to block this series just for this. :)

If there's a plan to move forward with shadow mmu support and they'll be needed
eventually, then keeping it as is works for me.  Maybe we could add a comment
above the structure until the shadow mmu support lands?  It depends on whether
the shadow mmu support is actually on the schedule, I think.

> > >
> > > >
> > > > [...]
> > > >
> > > > > +int mmu_topup_split_caches(struct kvm *kvm)
> > > > > +{
> > > > > +     struct kvm_mmu_memory_caches *split_caches = &kvm->arch.split_caches;
> > > > > +     int r;
> > > > > +
> > > > > +     assert_split_caches_invariants(kvm);
> > > > > +
> > > > > +     r = kvm_mmu_topup_memory_cache(&split_caches->page_header_cache, 1);
> > > > > +     if (r)
> > > > > +             goto out;
> > > > > +
> > > > > +     r = kvm_mmu_topup_memory_cache(&split_caches->shadow_page_cache, 1);
> > > > > +     if (r)
> > > > > +             goto out;
> > > >
> > > > Is it intended to only top-up with one cache object?  IIUC this means we'll try
> > > > to proactively yield the cpu for each of the huge page split right after the
> > > > object is consumed.
> > > >
> > > > Wondering whether it be more efficient to make it a slightly larger number, so
> > > > we don't overload the memory but also make the loop a bit more efficient.
> > >
> > > IIUC, 1 here is just the min needed for kvm_mmu_topup_memory_cache to
> > > return success. I chose 1 for each because it's the minimum necessary
> > > to make forward progress (split one large page).
> >
> > The @min parameter is minimum number of pages that _must_ be available in the
> > cache, i.e. it's the maximum number of pages that can theoretically be used by
> > whatever upcoming operation is going to be consuming pages from the cache.
> >
> > So '1' is technically correct, but I think it's the wrong choice given the behavior
> > of this code.  E.g. if there's 1 object in the cache, the initial top-up will do
> > nothing,
> 
> This scenario will not happen though, since we free the caches after
> splitting. So, the next time userspace enables dirty logging on a
> memslot and we go to do the initial top-up the caches will have 0
> objects.
> 
> > and then tdp_mmu_split_large_pages_root() will almost immediately drop
> > mmu_lock to topup the cache.  Since the in-loop usage explicitly checks for an
> > empty cache, i.e. any non-zero @min will have identical behavior, I think it makes
> > sense to use KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE _and_ add a comment explaining why.
> 
> If we set the min to KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> kvm_mmu_topup_memory_cache will return ENOMEM if it can't allocate at
> least KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE objects, even though we really
> only need 1 to make forward progress.
> 
> It's a total edge case but there could be a scenario where userspace
> sets the cgroup memory limits so tight that we can't allocate
> KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE objects when splitting the last few
> pages and in the end we only needed 1 or 2 objects to finish
> splitting. In this case we'd end up with a spurious pr_warn and may
> not split the last few pages depending on which cache failed to get
> topped up.

IMHO when -ENOMEM happens, instead of continuing to try with 1 shadow sp we
should just bail out even earlier.

Say, if we only have 10 (<40) pages left for shadow sp use, we'd better save
them to be consumed lazily by follow-up page faults when the guest accesses any
of the huge pages, rather than taking them all to split the next contiguous
huge pages on the assumption that it'll be helpful..

From that POV I have a slight preference for Sean's suggestion because that'll
make us fail earlier.  But I agree it shouldn't be a big deal.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
  2021-12-01  0:17       ` David Matlack
@ 2021-12-01  4:03         ` Peter Xu
  2021-12-01 22:14           ` David Matlack
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2021-12-01  4:03 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Tue, Nov 30, 2021 at 04:17:01PM -0800, David Matlack wrote:
> On Tue, Nov 30, 2021 at 4:16 PM David Matlack <dmatlack@google.com> wrote:
> >
> > On Fri, Nov 26, 2021 at 4:17 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Fri, Nov 19, 2021 at 11:57:57PM +0000, David Matlack wrote:
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 6768ef9c0891..4e78ef2dd352 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > > >               gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> > > >               gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> > > >
> > > > +             /*
> > > > +              * Try to proactively split any large pages down to 4KB so that
> > > > +              * vCPUs don't have to take write-protection faults.
> > > > +              */
> > > > +             kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
> > > > +
> > > >               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> > > >
> > > >               /* Cross two large pages? */
> > >
> > > Is it intended to try split every time even if we could have split it already?
> > > As I remember Paolo mentioned that we can skip split if it's not the 1st
> > > CLEAR_LOG on the same range, and IIUC that makes sense.
> > >
> > > But indeed I don't see a trivial way to know whether this is the first clear of
> > > this range.  Maybe we can maintain "how many huge pages are there under current
> > > kvm_mmu_page node" somehow?  Then if root sp has the counter==0, then we can
> > > skip it.  Just a wild idea..
> > >
> > > Or maybe it's intended to try split unconditionally for some reason?  If so
> > > it'll be great to mention that either in the commit message or in comments.
> >
> > Thanks for calling this out. Could the same be said about the existing
> > code that unconditionally tries to write-protect 2M+ pages?

They're different because write protection is undone (the spte becomes writable
again) whenever a vcpu thread writes to the page, so it always needs to be redone.

Huge page splitting is different - once a page is split during dirty tracking it
is never merged back, so it's a one-time thing.

> > I aimed to keep parity with the write-protection calls (always try to split
> > before write-protecting) but I agree there might be opportunities available
> > to skip altogether.

So IMHO it's not about parity but about how easily it can be implemented, and
whether it'll be worth adding that complexity.

Besides the above per-sp accounting idea, there are other ways to do this too,
e.g., keeping a bitmap showing which ranges have already been split (rough
sketch below): a 2M granule would be enough for that bitmap on x86.  We'd
initialize it to all ones when dirty logging starts for a memslot.
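
Something like the following, just to illustrate the idea -- the field and
helper names are made up and none of this exists today:

static bool memslot_needs_split(struct kvm_memory_slot *slot, gfn_t gfn)
{
	/* Hypothetical per-memslot bitmap, one bit per 2M region. */
	unsigned long idx = (gfn - slot->base_gfn) /
			    KVM_PAGES_PER_HPAGE(PG_LEVEL_2M);

	return test_bit(idx, slot->arch.need_split_bitmap);
}

static void memslot_mark_split(struct kvm_memory_slot *slot, gfn_t gfn)
{
	unsigned long idx = (gfn - slot->base_gfn) /
			    KVM_PAGES_PER_HPAGE(PG_LEVEL_2M);

	clear_bit(idx, slot->arch.need_split_bitmap);
}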

But again maybe it turns out we don't really want that complexity.

IMHO a good start could be the perf numbers (which I asked for in the cover
letter) comparing the overhead of the 2nd+ iterations of CLEAR_LOG with/without
eager page splitting.

> >
> > By the way, looking at this code again I think I see some potential bugs:
> >  - I don't think I ever free split_caches in the initially-all-set case.

I saw that it's freed in kvm_mmu_try_split_large_pages(), no?

> >  - What happens if splitting fails the CLEAR_LOG but succeeds the
> > CLEAR_LOG?
> 
> Gah, meant to say "first CLEAR_LOG" and "second CLEAR_LOG" here.
> 
> > We would end up propagating the write-protection on the 2M
> > page down to the 4K page. This might cause issues if using PML.

Hmm, looks correct.. I'm wondering what will happen with that.

Firstly this should be rare, as the 1st split should succeed in 99% of cases.

Then if the split failed on the 1st attempt, we'll have wr-protected sptes even
though pml is in use.  When the guest writes, we'll go through the fast page
fault path and record the writes too, I think, since we apply the dirty bit to
the new spte, so it'll just skip pml.  Looks like we'll be using a mixture of
pml+wp, but all dirty pages should still be captured as expected?..

There could be leftover wp when stopping dirty logging, but that doesn't seem
directly harmful either.  It'll make things a bit messy, at least.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU
  2021-11-30 23:22   ` David Matlack
@ 2021-12-01  4:10     ` Peter Xu
  2021-12-01  4:19       ` Peter Xu
  2021-12-01 21:46       ` David Matlack
  0 siblings, 2 replies; 77+ messages in thread
From: Peter Xu @ 2021-12-01  4:10 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Tue, Nov 30, 2021 at 03:22:29PM -0800, David Matlack wrote:
> On Fri, Nov 26, 2021 at 6:13 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, David,
> >
> > On Fri, Nov 19, 2021 at 11:57:44PM +0000, David Matlack wrote:
> > > This series is a first pass at implementing Eager Page Splitting for the
> > > TDP MMU. For context on the motivation and design of Eager Page
> > > Splitting, please see the RFC design proposal and discussion [1].
> > >
> > > Paolo, I went ahead and added splitting in both the intially-all-set
> > > case (only splitting the region passed to CLEAR_DIRTY_LOG) and the
> > > case where we are not using initially-all-set (splitting the entire
> > > memslot when dirty logging is enabled) to give you an idea of what
> > > both look like.
> > >
> > > Note: I will be on vacation all of next week so I will not be able to
> > > respond to reviews until Monday November 29. I thought it would be
> > > useful to seed discussion and reviews with an early version of the code
> > > rather than putting it off another week. But feel free to also ignore
> > > this until I get back :)
> > >
> > > This series compiles and passes the most basic splitting test:
> > >
> > > $ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 2 -i 4
> > >
> > > But please operate under the assumption that this code is probably
> > > buggy.
> > >
> > > [1] https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/#t
> >
> > Will there be more numbers to show in the formal patchset?
> 
> Yes definitely. I didn't have a lot of time to test this series, hence
> the RFC status. I'll include more thorough testing and performance
> evaluation in the cover letter for v1.
> 
> 
> > It's interesting to
> > know how "First Pass Dirty Memory Time" will change comparing to the rfc
> > numbers; I can have a feel of it, but still. :) Also, not only how it speedup
> > guest dirty apps, but also some general measurement on how it slows down
> > KVM_SET_USER_MEMORY_REGION (!init-all-set) or CLEAR_LOG (init-all-set) would be
> > even nicer (for CLEAR, I guess the 1st/2nd+ round will have different overhead).
> >
> > Besides that, I'm also wondering whether we should still have a knob for it, as
> > I'm wondering what if the use case is the kind where eager split huge page may
> > not help at all.  What I'm thinking:
> >
> >   - Read-mostly guest workload; split huge page will speed up rare writes, but
> >     at the meantime drag readers down due to huge->small page mappings.
> >
> >   - Writes-over-very-limited-region workload: say we have 1T guest and the app
> >     in the guest only writes 10G part of it.  Hmm not sure whether it exists..
> >
> >   - Postcopy targeted: it means precopy may only run a few iterations just to
> >     send the static pages, so the migration duration will be relatively short,
> >     and the write just didn't spread a lot to the whole guest mem.
> >
> > I don't really think any of the example is strong enough as they're all very
> > corner cased, but just to show what I meant to raise this question on whether
> > unconditionally eager split is the best approach.
> 
> I'd be happy to add a knob if there's a userspace that wants to use
> it. I think the main challenge though is knowing when it is safe to
> disable eager splitting.

Isn't it a performance feature?  Why would it not be safe?

> For a small deployment where you know the VM workload, it might make
> sense. But for a public cloud provider the only feasible way would be to
> dynamically monitor the guest writing patterns. But then we're back at square
> one because that would require dirty logging. And even then, there's no
> guaranteed way to predict future guest write patterns based on past patterns.

Agreed, what I was thinking of was not public cloud usage, but cases where we
can do specific tuning for specific scenarios.  It normally won't matter a lot
for small or medium sized VMs, but it might for extreme use cases.

> 
> The way forward here might be to do a hybrid of 2M and 4K dirty
> tracking (and maybe even 1G). For example, first start dirty logging
> at 2M granularity, and then log at 4K for any specific regions or
> memslots that aren't making progress. We'd still use Eager Page
> Splitting unconditionally though, first to split to 2M and then to
> split to 4K.

Do you mean we'd also offer a different-granularity dirty bitmap to userspace
too?

I remember you mentioned 2mb dirty tracking in your rfc series, but I didn't
expect it could be dynamically switched during tracking.  That sounds like a
very interesting idea.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU
  2021-12-01  4:10     ` Peter Xu
@ 2021-12-01  4:19       ` Peter Xu
  2021-12-01 21:46       ` David Matlack
  1 sibling, 0 replies; 77+ messages in thread
From: Peter Xu @ 2021-12-01  4:19 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Wed, Dec 01, 2021 at 12:10:38PM +0800, Peter Xu wrote:
> On Tue, Nov 30, 2021 at 03:22:29PM -0800, David Matlack wrote:
> > On Fri, Nov 26, 2021 at 6:13 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Hi, David,
> > >
> > > On Fri, Nov 19, 2021 at 11:57:44PM +0000, David Matlack wrote:
> > > > This series is a first pass at implementing Eager Page Splitting for the
> > > > TDP MMU. For context on the motivation and design of Eager Page
> > > > Splitting, please see the RFC design proposal and discussion [1].
> > > >
> > > > Paolo, I went ahead and added splitting in both the intially-all-set
> > > > case (only splitting the region passed to CLEAR_DIRTY_LOG) and the
> > > > case where we are not using initially-all-set (splitting the entire
> > > > memslot when dirty logging is enabled) to give you an idea of what
> > > > both look like.
> > > >
> > > > Note: I will be on vacation all of next week so I will not be able to
> > > > respond to reviews until Monday November 29. I thought it would be
> > > > useful to seed discussion and reviews with an early version of the code
> > > > rather than putting it off another week. But feel free to also ignore
> > > > this until I get back :)
> > > >
> > > > This series compiles and passes the most basic splitting test:
> > > >
> > > > $ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 2 -i 4
> > > >
> > > > But please operate under the assumption that this code is probably
> > > > buggy.
> > > >
> > > > [1] https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/#t
> > >
> > > Will there be more numbers to show in the formal patchset?
> > 
> > Yes definitely. I didn't have a lot of time to test this series, hence
> > the RFC status. I'll include more thorough testing and performance
> > evaluation in the cover letter for v1.
> > 
> > 
> > > It's interesting to
> > > know how "First Pass Dirty Memory Time" will change comparing to the rfc
> > > numbers; I can have a feel of it, but still. :) Also, not only how it speedup
> > > guest dirty apps, but also some general measurement on how it slows down
> > > KVM_SET_USER_MEMORY_REGION (!init-all-set) or CLEAR_LOG (init-all-set) would be
> > > even nicer (for CLEAR, I guess the 1st/2nd+ round will have different overhead).
> > >
> > > Besides that, I'm also wondering whether we should still have a knob for it, as
> > > I'm wondering what if the use case is the kind where eager split huge page may
> > > not help at all.  What I'm thinking:
> > >
> > >   - Read-mostly guest workload; split huge page will speed up rare writes, but
> > >     at the meantime drag readers down due to huge->small page mappings.
> > >
> > >   - Writes-over-very-limited-region workload: say we have 1T guest and the app
> > >     in the guest only writes 10G part of it.  Hmm not sure whether it exists..
> > >
> > >   - Postcopy targeted: it means precopy may only run a few iterations just to
> > >     send the static pages, so the migration duration will be relatively short,
> > >     and the write just didn't spread a lot to the whole guest mem.
> > >
> > > I don't really think any of the example is strong enough as they're all very
> > > corner cased, but just to show what I meant to raise this question on whether
> > > unconditionally eager split is the best approach.
> > 
> > I'd be happy to add a knob if there's a userspace that wants to use
> > it. I think the main challenge though is knowing when it is safe to
> > disable eager splitting.
> 
> Isn't it a performance feature?  Why would it not be safe?
> 
> > For a small deployment where you know the VM workload, it might make
> > sense. But for a public cloud provider the only feasible way would be to
> > dynamically monitor the guest writing patterns. But then we're back at square
> > one because that would require dirty logging. And even then, there's no
> > guaranteed way to predict future guest write patterns based on past patterns.
> 
> Agreed, what I was thinking of was not public cloud usage, but cases where we
> can do specific tuning for specific scenarios.  It normally won't matter a lot
> for small or medium sized VMs, but it might for extreme use cases.

PS: I think even with a tunable, one static per-module parameter should be more
than enough for what I can imagine for now.

> 
> > 
> > The way forward here might be to do a hybrid of 2M and 4K dirty
> > tracking (and maybe even 1G). For example, first start dirty logging
> > at 2M granularity, and then log at 4K for any specific regions or
> > memslots that aren't making progress. We'd still use Eager Page
> > Splitting unconditionally though, first to split to 2M and then to
> > split to 4K.
> 
> Do you mean we'd also offer a different-granularity dirty bitmap to userspace
> too?
> 
> I remember you mentioned 2mb dirty tracking in your rfc series, but I didn't
> expect it could be dynamically switched during tracking.  That sounds like a
> very interesting idea.
> 
> Thanks,
> 
> -- 
> Peter Xu

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
  2021-12-01  2:29           ` Peter Xu
@ 2021-12-01 18:29             ` Sean Christopherson
  2021-12-01 21:36               ` David Matlack
  0 siblings, 1 reply; 77+ messages in thread
From: Sean Christopherson @ 2021-12-01 18:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Matlack, Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel,
	Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Wed, Dec 01, 2021, Peter Xu wrote:
> On Tue, Nov 30, 2021 at 05:29:10PM -0800, David Matlack wrote:
> > On Tue, Nov 30, 2021 at 5:01 PM Sean Christopherson <seanjc@google.com> wrote:
> > > So '1' is technically correct, but I think it's the wrong choice given the behavior
> > > of this code.  E.g. if there's 1 object in the cache, the initial top-up will do
> > > nothing,
> > 
> > This scenario will not happen though, since we free the caches after
> > splitting. So, the next time userspace enables dirty logging on a
> > memslot and we go to do the initial top-up the caches will have 0
> > objects.

Ah.

> > > and then tdp_mmu_split_large_pages_root() will almost immediately drop
> > > mmu_lock to topup the cache.  Since the in-loop usage explicitly checks for an
> > > empty cache, i.e. any non-zero @min will have identical behavior, I think it makes
> > > sense to use KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE _and_ add a comment explaining why.
> > 
> > If we set the min to KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> > kvm_mmu_topup_memory_cache will return ENOMEM if it can't allocate at
> > least KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE objects, even though we really
> > only need 1 to make forward progress.
> > 
> > It's a total edge case but there could be a scenario where userspace
> > sets the cgroup memory limits so tight that we can't allocate
> > KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE objects when splitting the last few
> > pages and in the end we only needed 1 or 2 objects to finish
> > splitting. In this case we'd end up with a spurious pr_warn and may
> > not split the last few pages depending on which cache failed to get
> > topped up.
> 
> IMHO when -ENOMEM happens, instead of continuing to try with 1 shadow sp we
> should just bail out even earlier.
> 
> Say, if we only have 10 (<40) pages left for shadow sp use, we'd better save
> them to be consumed lazily by follow-up page faults when the guest accesses any
> of the huge pages, rather than taking them all to split the next contiguous
> huge pages on the assumption that it'll be helpful..
> 
> From that POV I have a slight preference for Sean's suggestion because that'll
> make us fail earlier.  But I agree it shouldn't be a big deal.

Hmm, in this particular case, I think using the caches is the wrong approach.  The
behavior of pre-filling the caches makes sense for vCPUs because faults may need
multiple objects and filling the cache ensures the entire fault can be handled
without dropping mmu_lock.  And any extra/unused objects can be used by future
faults.  For page splitting, neither of those really holds true.  If there are a
lot of pages to split, KVM will have to drop mmu_lock to refill the cache.  And if
there are few pages to split, or the caches are refilled toward the end of the walk,
KVM may end up with a pile of unused objects it needs to free.

Since this code already needs to handle failure, and more importantly, it's a
best-effort optimization, I think trying to use the caches is a square peg, round
hole scenario.

Rather than use the caches, we could do allocation 100% on-demand and never drop
mmu_lock to do allocation.  The one caveat is that direct reclaim would need to be
disallowed so that the allocation won't sleep.  That would mean that eager splitting
would fail under heavy memory pressure when it otherwise might succeed by reclaiming.
That would mean vCPUs get penalized as they'd need to do the splitting on fault and
potentially do direct reclaim as well.  It's not obvious that that would be a problem
in practice, e.g. the vCPU is probably already seeing a fair amount of disruption due
to memory pressure, and slowing down vCPUs might alleviate some of that pressure.

Not using the cache would also reduce the extra complexity, e.g. no need for
special mmu_cache handling or a variant of tdp_mmu_iter_cond_resched().

I'm thinking something like this (very incomplete):

static void init_tdp_mmu_page(struct kvm_mmu_page *sp, u64 *spt, gfn_t gfn,
			      union kvm_mmu_page_role role)
{
	sp->spt = spt;
	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);

	sp->role = role;
	sp->gfn = gfn;
	sp->tdp_mmu_page = true;

	trace_kvm_mmu_get_page(sp, true);
}

static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
					       union kvm_mmu_page_role role)
{
	struct kvm_mmu_page *sp;
	u64 *spt;

	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
	spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);

	init_tdp_mmu_page(sp, spt, gfn, role);

	return sp;
}

static union kvm_mmu_page_role get_child_page_role(struct tdp_iter *iter)
{
	struct kvm_mmu_page *parent = sptep_to_sp(rcu_dereference(iter->sptep));
	union kvm_mmu_page_role role = parent->role;

	role.level--;
	return role;
}

static bool tdp_mmu_install_sp_atomic(struct kvm *kvm,
				      struct tdp_iter *iter,
				      struct kvm_mmu_page *sp,
				      bool account_nx)
{
	u64 spte;

	spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);

	if (tdp_mmu_set_spte_atomic(kvm, iter, spte)) {
		tdp_mmu_link_page(kvm, sp, account_nx);
		return true;
	}
	return false;
}

static void tdp_mmu_split_large_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
					   gfn_t start, gfn_t end, int target_level)
{
	/*
	 * Disallow direct reclaim, allocations will be made while holding
	 * mmu_lock and must not sleep.
	 */
	gfp_t gfp = (GFP_KERNEL_ACCOUNT | __GFP_ZERO) & ~__GFP_DIRECT_RECLAIM;
	struct kvm_mmu_page *sp = NULL;
	struct tdp_iter iter;
	bool flush = false;
	u64 *spt = NULL;
	int r;

	rcu_read_lock();

	/*
	 * Traverse the page table splitting all large pages above the target
	 * level into one lower level. For example, if we encounter a 1GB page
	 * we split it into 512 2MB pages.
	 *
	 * Since the TDP iterator uses a pre-order traversal, we are guaranteed
	 * to visit an SPTE before ever visiting its children, which means we
	 * will correctly recursively split large pages that are more than one
	 * level above the target level (e.g. splitting 1GB to 2MB to 4KB).
	 */
	for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
retry:
		if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true))
			continue;

		if (!is_shadow_present_pte(iter.old_spte) ||
		    !is_large_pte(iter.old_spte))
			continue;

		if (!sp) {
			sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
			if (!sp)
				break;
			spt = (void *)__get_free_page(gfp);
			if (!spt)
				break;
		}

		init_tdp_mmu_page(sp, spt, iter.gfn,
				  get_child_page_role(&iter));

		if (!tdp_mmu_split_large_page(kvm, &iter, sp))
			goto retry;

		sp = NULL;
		spt = NULL;
	}

	free_page((unsigned long)spt);
	kmem_cache_free(mmu_page_header_cache, sp);

	rcu_read_unlock();

	if (flush)
		kvm_flush_remote_tlbs(kvm);
}

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 08/15] KVM: x86/mmu: Helper method to check for large and present sptes
  2021-11-19 23:57 ` [RFC PATCH 08/15] KVM: x86/mmu: Helper method to check for large and present sptes David Matlack
  2021-11-22 18:56   ` Ben Gardon
@ 2021-12-01 18:34   ` Sean Christopherson
  2021-12-01 21:13     ` David Matlack
  1 sibling, 1 reply; 77+ messages in thread
From: Sean Christopherson @ 2021-12-01 18:34 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Fri, Nov 19, 2021, David Matlack wrote:
> Consolidate is_large_pte and is_present_pte into a single helper. This
> will be used in a follow-up commit to check for present large-pages
> during Eager Page Splitting.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/spte.h    | 5 +++++
>  arch/x86/kvm/mmu/tdp_mmu.c | 3 +--
>  2 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index cc432f9a966b..e73c41d31816 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -257,6 +257,11 @@ static inline bool is_large_pte(u64 pte)
>  	return pte & PT_PAGE_SIZE_MASK;
>  }
>  
> +static inline bool is_large_present_pte(u64 pte)
> +{
> +	return is_shadow_present_pte(pte) && is_large_pte(pte);
> +}
> +
>  static inline bool is_last_spte(u64 pte, int level)
>  {
>  	return (level == PG_LEVEL_4K) || is_large_pte(pte);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index ff4d83ad7580..f8c4337f1fcf 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1011,8 +1011,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  		 * than the target, that SPTE must be cleared and replaced
>  		 * with a non-leaf SPTE.
>  		 */
> -		if (is_shadow_present_pte(iter.old_spte) &&
> -		    is_large_pte(iter.old_spte)) {
> +		if (is_large_present_pte(iter.old_spte)) {

I strongly object to this helper.  PRESENT in hardware and shadow-present are two
very different things, the name is_large_present_pte() doesn't capture that detail.
Yeah, we could name it is_large_shadow_present_pte(), but for me at least that
requires more effort to read, and it's not like this is replacing 10s of instances.

>  			if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
>  				break;
>  		}
> -- 
> 2.34.0.rc2.393.gf8c9666880-goog
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 04/15] KVM: x86/mmu: Factor out logic to atomically install a new page table
  2021-11-19 23:57 ` [RFC PATCH 04/15] KVM: x86/mmu: Factor out logic to atomically install a new page table David Matlack
  2021-11-22 18:52   ` Ben Gardon
@ 2021-12-01 19:13   ` Sean Christopherson
  2021-12-01 21:52     ` David Matlack
  1 sibling, 1 reply; 77+ messages in thread
From: Sean Christopherson @ 2021-12-01 19:13 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Fri, Nov 19, 2021, David Matlack wrote:
> Factor out the logic to atomically replace an SPTE with an SPTE that
> points to a new page table. This will be used in a follow-up commit to
> split a large page SPTE into one level lower.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 53 ++++++++++++++++++++++++++------------
>  1 file changed, 37 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index cc9fe33c9b36..9ee3f4f7fdf5 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -945,6 +945,39 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>  	return ret;
>  }
>  
> +/*
> + * tdp_mmu_install_sp_atomic - Atomically replace the given spte with an
> + * spte pointing to the provided page table.
> + *
> + * @kvm: kvm instance
> + * @iter: a tdp_iter instance currently on the SPTE that should be set
> + * @sp: The new TDP page table to install.
> + * @account_nx: True if this page table is being installed to split a
> + *              non-executable huge page.
> + *
> + * Returns: True if the new page table was installed. False if spte being
> + *          replaced changed, causing the atomic compare-exchange to fail.
> + *          If this function returns false the sp will be freed before
> + *          returning.
> + */
> +static bool tdp_mmu_install_sp_atomic(struct kvm *kvm,
> +				      struct tdp_iter *iter,
> +				      struct kvm_mmu_page *sp,
> +				      bool account_nx)
> +{
> +	u64 spte;
> +
> +	spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);

This can easily go on one line.

	u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
> +
> +	if (tdp_mmu_set_spte_atomic(kvm, iter, spte)) {
> +		tdp_mmu_link_page(kvm, sp, account_nx);
> +		return true;
> +	} else {
> +		tdp_mmu_free_sp(sp);
> +		return false;

I don't think this helper should free the sp on failure, even if that's what all
paths end up doing.  When reading the calling code, it really looks like the sp
is being leaked because the allocation and free are in different contexts.  That
the sp is consumed on success is fairly intuitive given the "install" action, but
freeing on failure not so much.

And for the eager splitting, freeing on failure is wasteful.  It's extremely
unlikely to happen often, so in practice it's unlikely to be an issue, but it's
certainly odd since the loop is likely going to immediately allocate another sp,
either for the current spte or for the next spte.

Side topic, tdp_mmu_set_spte_atomic() and friends really should return 0/-EBUSY.
Boolean returns for errors usually end in tears sooner or later.
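
For what it's worth, that shape would look something like the below
(illustrative only, reusing the helpers from the patch, with the free-on-failure
dropped and an int return):

static int tdp_mmu_install_sp_atomic(struct kvm *kvm,
				     struct tdp_iter *iter,
				     struct kvm_mmu_page *sp,
				     bool account_nx)
{
	u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);

	/* On failure the caller keeps ownership of @sp and may reuse it. */
	if (!tdp_mmu_set_spte_atomic(kvm, iter, spte))
		return -EBUSY;

	tdp_mmu_link_page(kvm, sp, account_nx);
	return 0;
}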

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
  2021-11-19 23:57 ` [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG David Matlack
  2021-11-26 12:17   ` Peter Xu
@ 2021-12-01 19:22   ` Sean Christopherson
  2021-12-01 19:49     ` Ben Gardon
  2021-12-01 22:17     ` David Matlack
  1 sibling, 2 replies; 77+ messages in thread
From: Sean Christopherson @ 2021-12-01 19:22 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Fri, Nov 19, 2021, David Matlack wrote:
> When using initially-all-set, large pages are not write-protected when
> dirty logging is enabled on the memslot. Instead they are
> write-protected once userspace invokes CLEAR_DIRTY_LOG for the first
> time, and only for the specific sub-region of the memslot that userspace
> wishes to clear.
> 
> Enhance CLEAR_DIRTY_LOG to also try to split large pages prior to
> write-protecting to avoid causing write-protection faults on vCPU
> threads. This also allows userspace to smear the cost of large page
> splitting across multiple ioctls rather than splitting the entire
> memslot when not using initially-all-set.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  4 ++++
>  arch/x86/kvm/mmu/mmu.c          | 30 ++++++++++++++++++++++--------
>  2 files changed, 26 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 432a4df817ec..6b5bf99f57af 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1591,6 +1591,10 @@ void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
>  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>  				      const struct kvm_memory_slot *memslot,
>  				      int start_level);
> +void kvm_mmu_try_split_large_pages(struct kvm *kvm,

I would prefer we use hugepage when possible, mostly because that's the terminology
used by the kernel.  KVM is comically inconsistent, but if we make an effort to use
hugepage when adding new code, hopefully someday we'll have enough inertia to commit
fully to hugepage.

> +				   const struct kvm_memory_slot *memslot,
> +				   u64 start, u64 end,
> +				   int target_level);
>  void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
>  					const struct kvm_memory_slot *memslot,
>  					int target_level);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6768ef9c0891..4e78ef2dd352 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>  		gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
>  		gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
>  
> +		/*
> +		 * Try to proactively split any large pages down to 4KB so that
> +		 * vCPUs don't have to take write-protection faults.
> +		 */
> +		kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);

This should return a value.  If splitting succeeds, there should be no hugepages
and so walking the page tables to write-protect 2M is unnecessary.  Same for the
previous patch, although skipping the write-protect path is a little less
straightforward in that case.

> +
>  		kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
>  
>  		/* Cross two large pages? */
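
To make that concrete, assuming kvm_mmu_try_split_large_pages() were changed to
return 0 only when every hugepage in [start, end) was split (an assumed
semantic, not what this patch implements), the call site above could become:

		/*
		 * Hypothetical: skip the 2M write-protect walk entirely when
		 * eager splitting already broke down every hugepage in the
		 * range.
		 */
		if (kvm_mmu_try_split_large_pages(kvm, slot, start, end,
						  PG_LEVEL_4K))
			kvm_mmu_slot_gfn_write_protect(kvm, slot, start,
						       PG_LEVEL_2M);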

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 15/15] KVM: x86/mmu: Update page stats when splitting large pages
  2021-11-19 23:57 ` [RFC PATCH 15/15] KVM: x86/mmu: Update page stats when " David Matlack
@ 2021-12-01 19:36   ` Sean Christopherson
  2021-12-01 21:11     ` David Matlack
  0 siblings, 1 reply; 77+ messages in thread
From: Sean Christopherson @ 2021-12-01 19:36 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Fri, Nov 19, 2021, David Matlack wrote:
> When splitting large pages we need to update the pages stats to reflect
> all of the new pages at the lower level. We do not need to change the
> page stats for the large page that was removed as that is already
> handled by tdp_mmu_set_spte_atomic.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 8f60d942c789..4c313613a939 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1299,7 +1299,12 @@ static bool tdp_mmu_split_large_page_atomic(struct kvm *kvm, struct tdp_iter *it
>  		child_sp->spt[i] = child_spte;
>  	}
>  
> -	return tdp_mmu_install_sp_atomic(kvm, iter, child_sp, false);
> +	if (!tdp_mmu_install_sp_atomic(kvm, iter, child_sp, false))
> +		return false;
> +
> +	kvm_update_page_stats(kvm, level - 1, PT64_ENT_PER_PAGE);

This should be done when tdp_mmu_split_large_page_atomic() is introduced, otherwise
this series is effectively introducing a bug and then fixing it.  At a very quick
glance, I don't see anything that would prevent squashing this in.

> +
> +	return true;
>  }
>  
>  static void tdp_mmu_split_large_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
> -- 
> 2.34.0.rc2.393.gf8c9666880-goog
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
  2021-12-01 19:22   ` Sean Christopherson
@ 2021-12-01 19:49     ` Ben Gardon
  2021-12-01 20:16       ` Sean Christopherson
  2021-12-01 22:17     ` David Matlack
  1 sibling, 1 reply; 77+ messages in thread
From: Ben Gardon @ 2021-12-01 19:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Matlack, Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Wed, Dec 1, 2021 at 11:22 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Nov 19, 2021, David Matlack wrote:
> > When using initially-all-set, large pages are not write-protected when
> > dirty logging is enabled on the memslot. Instead they are
> > write-protected once userspace invokes CLEAR_DIRTY_LOG for the first
> > time, and only for the specific sub-region of the memslot that userspace
> > wishes to clear.
> >
> > Enhance CLEAR_DIRTY_LOG to also try to split large pages prior to
> > write-protecting to avoid causing write-protection faults on vCPU
> > threads. This also allows userspace to smear the cost of large page
> > splitting across multiple ioctls rather than splitting the entire
> > memslot when not using initially-all-set.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  4 ++++
> >  arch/x86/kvm/mmu/mmu.c          | 30 ++++++++++++++++++++++--------
> >  2 files changed, 26 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 432a4df817ec..6b5bf99f57af 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1591,6 +1591,10 @@ void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
> >  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> >                                     const struct kvm_memory_slot *memslot,
> >                                     int start_level);
> > +void kvm_mmu_try_split_large_pages(struct kvm *kvm,
>
> I would prefer we use hugepage when possible, mostly because that's the terminology
> used by the kernel.  KVM is comically inconsistent, but if we make an effort to use
> hugepage when adding new code, hopefully someday we'll have enough inertia to commit
> fully to hugepage.

In my mind "huge page" implies 2M and "large page" is generic to 2m
and 1g. (IDK if we settled on a name for 1G pages)
I've definitely been guilty of reinforcing this inconsistent
terminology. (Though it was consistent in my head, of course.) If we
want to pick one and use it everywhere, I'm happy to get onboard with
a standard terminology.

>
> > +                                const struct kvm_memory_slot *memslot,
> > +                                u64 start, u64 end,
> > +                                int target_level);
> >  void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
> >                                       const struct kvm_memory_slot *memslot,
> >                                       int target_level);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6768ef9c0891..4e78ef2dd352 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> >               gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> >               gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> >
> > +             /*
> > +              * Try to proactively split any large pages down to 4KB so that
> > +              * vCPUs don't have to take write-protection faults.
> > +              */
> > +             kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
>
> This should return a value.  If splitting succeeds, there should be no hugepages
> and so walking the page tables to write-protect 2M is unnecessary.  Same for the
> previous patch, although skipping the write-protect path is a little less
> straightforward in that case.
>
> > +
> >               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> >
> >               /* Cross two large pages? */

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
  2021-12-01 19:49     ` Ben Gardon
@ 2021-12-01 20:16       ` Sean Christopherson
  2021-12-01 22:11         ` Ben Gardon
  0 siblings, 1 reply; 77+ messages in thread
From: Sean Christopherson @ 2021-12-01 20:16 UTC (permalink / raw)
  To: Ben Gardon
  Cc: David Matlack, Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Wed, Dec 01, 2021, Ben Gardon wrote:
> On Wed, Dec 1, 2021 at 11:22 AM Sean Christopherson <seanjc@google.com> wrote:
> > I would prefer we use hugepage when possible, mostly because that's the terminology
> > used by the kernel.  KVM is comically inconsistent, but if we make an effort to use
> > hugepage when adding new code, hopefully someday we'll have enough inertia to commit
> > fully to hugepage.
> 
> In my mind "huge page" implies 2M and "large page" is generic to 2m
> and 1g. (IDK if we settled on a name for 1G pages)

What about 4m PSE pages?  :-)

I'm mostly joking, but it does raise the point that trying to provide unique names
for each size is a bit of a fool's errand, especially on non-x86 architectures that
support a broader variety of hugepage sizes.  IMO, the least ambiguous way to refer
to hugepages is to say that everything that isn't a 4k page (or whatever PAGE_SIZE
is on the architecture) is a hugepage, and then explicitly state the size of the
page if it matters.

> I've definitely been guilty of reinforcing this inconsistent
> terminology. (Though it was consistent in my head, of course.) If we
> want to pick one and use it everywhere, I'm happy to get onboard with
> a standard terminology.

I hear you on using "large page", I've had to undo a solid decade of "large page"
terminology from my pre-Linux days.  But for better or worse, the kernel uses
hugepage, e.g. hugetlbfs supports 1gb and 2mb pages.  I think we should follow
the kernel, especially since we have aspirations of unifying more of KVM's MMU
across multiple architectures.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 15/15] KVM: x86/mmu: Update page stats when splitting large pages
  2021-12-01 19:36   ` Sean Christopherson
@ 2021-12-01 21:11     ` David Matlack
  0 siblings, 0 replies; 77+ messages in thread
From: David Matlack @ 2021-12-01 21:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Wed, Dec 1, 2021 at 11:37 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Nov 19, 2021, David Matlack wrote:
> > When splitting large pages we need to update the page stats to reflect
> > all of the new pages at the lower level. We do not need to change the
> > page stats for the large page that was removed as that is already
> > handled by tdp_mmu_set_spte_atomic.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 8f60d942c789..4c313613a939 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1299,7 +1299,12 @@ static bool tdp_mmu_split_large_page_atomic(struct kvm *kvm, struct tdp_iter *it
> >               child_sp->spt[i] = child_spte;
> >       }
> >
> > -     return tdp_mmu_install_sp_atomic(kvm, iter, child_sp, false);
> > +     if (!tdp_mmu_install_sp_atomic(kvm, iter, child_sp, false))
> > +             return false;
> > +
> > +     kvm_update_page_stats(kvm, level - 1, PT64_ENT_PER_PAGE);
>
> This should be done when tdp_mmu_split_large_page_atomic() is introduced, otherwise
> this series is effectively introducing a bug and then fixing it.  At a very quick
> glance, I don't see anything that would prevent squashing this in.

Will do.

>
> > +
> > +     return true;
> >  }
> >
> >  static void tdp_mmu_split_large_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > --
> > 2.34.0.rc2.393.gf8c9666880-goog
> >

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 08/15] KVM: x86/mmu: Helper method to check for large and present sptes
  2021-12-01 18:34   ` Sean Christopherson
@ 2021-12-01 21:13     ` David Matlack
  0 siblings, 0 replies; 77+ messages in thread
From: David Matlack @ 2021-12-01 21:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Wed, Dec 1, 2021 at 10:34 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Nov 19, 2021, David Matlack wrote:
> > Consolidate is_large_pte and is_present_pte into a single helper. This
> > will be used in a follow-up commit to check for present large-pages
> > during Eager Page Splitting.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/spte.h    | 5 +++++
> >  arch/x86/kvm/mmu/tdp_mmu.c | 3 +--
> >  2 files changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > index cc432f9a966b..e73c41d31816 100644
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -257,6 +257,11 @@ static inline bool is_large_pte(u64 pte)
> >       return pte & PT_PAGE_SIZE_MASK;
> >  }
> >
> > +static inline bool is_large_present_pte(u64 pte)
> > +{
> > +     return is_shadow_present_pte(pte) && is_large_pte(pte);
> > +}
> > +
> >  static inline bool is_last_spte(u64 pte, int level)
> >  {
> >       return (level == PG_LEVEL_4K) || is_large_pte(pte);
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index ff4d83ad7580..f8c4337f1fcf 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1011,8 +1011,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >                * than the target, that SPTE must be cleared and replaced
> >                * with a non-leaf SPTE.
> >                */
> > -             if (is_shadow_present_pte(iter.old_spte) &&
> > -                 is_large_pte(iter.old_spte)) {
> > +             if (is_large_present_pte(iter.old_spte)) {
>
> I strongly object to this helper.  PRESENT in hardware and shadow-present are two
> very different things, the name is_large_present_pte() doesn't capture that detail.
> Yeah, we could name it is_large_shadow_present_pte(), but for me at least that
> requires more effort to read, and it's not like this is replacing 10s of instances.

Ok I'll drop it in v1.

>
> >                       if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
> >                               break;
> >               }
> > --
> > 2.34.0.rc2.393.gf8c9666880-goog
> >

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
  2021-12-01 18:29             ` Sean Christopherson
@ 2021-12-01 21:36               ` David Matlack
  2021-12-01 23:37                 ` Sean Christopherson
  0 siblings, 1 reply; 77+ messages in thread
From: David Matlack @ 2021-12-01 21:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Xu, Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel,
	Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Wed, Dec 1, 2021 at 10:29 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Dec 01, 2021, Peter Xu wrote:
> > On Tue, Nov 30, 2021 at 05:29:10PM -0800, David Matlack wrote:
> > > On Tue, Nov 30, 2021 at 5:01 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > So '1' is technically correct, but I think it's the wrong choice given the behavior
> > > > of this code.  E.g. if there's 1 object in the cache, the initial top-up will do
> > > > nothing,
> > >
> > > This scenario will not happen though, since we free the caches after
> > > splitting. So, the next time userspace enables dirty logging on a
> > > memslot and we go to do the initial top-up the caches will have 0
> > > objects.
>
> Ah.
>
> > > > and then tdp_mmu_split_large_pages_root() will almost immediately drop
> > > > mmu_lock to topup the cache.  Since the in-loop usage explicitly checks for an
> > > > empty cache, i.e. any non-zero @min will have identical behavior, I think it makes
> > > > sense to use KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE _and_ add a comment explaining why.
> > >
> > > If we set the min to KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> > > kvm_mmu_topup_memory_cache will return ENOMEM if it can't allocate at
> > > least KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE objects, even though we really
> > > only need 1 to make forward progress.
> > >
> > > It's a total edge case but there could be a scenario where userspace
> > > sets the cgroup memory limits so tight that we can't allocate
> > > KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE objects when splitting the last few
> > > pages and in the end we only needed 1 or 2 objects to finish
> > > splitting. In this case we'd end up with a spurious pr_warn and may
> > > not split the last few pages depending on which cache failed to get
> > > topped up.
> >
> > IMHO when -ENOMEM happens, instead of continuing to try with 1 shadow sp we
> > should just bail out even earlier.
> >
> > Say, if we only have 10 (<40) pages left for shadow sp's use, we'd better make
> > good use of them lazily, to be consumed in follow-up page faults when the guest
> > accesses any of the huge pages, rather than taking them all to split the next
> > contiguous huge pages on the assumption that it'll be helpful..
> >
> > From that POV I have a slight preference for Sean's suggestion because that'll
> > make us fail earlier.  But I agree it shouldn't be a big deal.
>
> Hmm, in this particular case, I think using the caches is the wrong approach.  The
> behavior of pre-filling the caches makes sense for vCPUs because faults may need
> multiple objects and filling the cache ensures the entire fault can be handled
> without dropping mmu_lock.  And any extra/unused objects can be used by future
> faults.  For page splitting, neither of those really holds true.  If there are a
> lot of pages to split, KVM will have to drop mmu_lock to refill the cache.  And if
> there are few pages to split, or the caches are refilled toward the end of the walk,
> KVM may end up with a pile of unused objects it needs to free.
>
> Since this code already needs to handle failure, and more importantly, it's a
> best-effort optimization, I think trying to use the caches is a square peg, round
> hole scenario.
>
> Rather than use the caches, we could do allocation 100% on-demand and never drop
> mmu_lock to do allocation.  The one caveat is that direct reclaim would need to be
> disallowed so that the allocation won't sleep.  That would mean that eager splitting
> would fail under heavy memory pressure when it otherwise might succeed by reclaiming.
> That would mean vCPUs get penalized as they'd need to do the splitting on fault and
> potentially do direct reclaim as well.  It's not obvious that that would be a problem
> in practice, e.g. the vCPU is probably already seeing a fair amount of disruption due
> to memory pressure, and slowing down vCPUs might alleviate some of that pressure.

Not necessarily. The vCPUs might be running just fine in the VM being
split because they are in their steady state and not faulting in any
new memory. (Memory pressure might be coming from another VM landing
on the host.)

IMO, if we have an opportunity to avoid doing direct reclaim in the
critical path of customer execution we should take it.

The on-demand approach will also increase the amount of time we have
to hold the MMU lock for page splitting. This is not too terrible for
the TDP MMU since we are holding the MMU lock in read mode, but is
going to become a problem when we add page splitting support for the
shadow MMU.

I do agree that the caches approach, as implemented, will inevitably
end up with a pile of unused objects at the end that need to be freed.
I'd be happy to take a look and see if there's any way to reduce the
amount of unused objects at the end with a bit smarter top-up logic.
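
For example, something along these lines (purely a sketch to show the idea; the
split_caches fields and the helper name below are placeholders, not the actual
code in this series): only drop mmu_lock when one of the caches is actually
empty, and top up with a min of 1 so a near-OOM top-up only fails when forward
progress is truly impossible.

static int tdp_mmu_topup_split_caches(struct kvm *kvm, bool *dropped_lock)
{
	struct kvm_mmu_memory_cache *header = &kvm->arch.split_caches.page_header_cache;
	struct kvm_mmu_memory_cache *shadow = &kvm->arch.split_caches.shadow_page_cache;
	int r;

	*dropped_lock = false;

	/* Still enough objects to split at least one more hugepage. */
	if (kvm_mmu_memory_cache_nr_free_objects(header) &&
	    kvm_mmu_memory_cache_nr_free_objects(shadow))
		return 0;

	*dropped_lock = true;
	read_unlock(&kvm->mmu_lock);

	/* min of 1: a single sp is all that's needed to split one hugepage. */
	r = kvm_mmu_topup_memory_cache(header, 1);
	if (!r)
		r = kvm_mmu_topup_memory_cache(shadow, 1);

	read_lock(&kvm->mmu_lock);
	return r;
}

The caller would still need to drop the RCU read lock and restart the TDP
iterator whenever this reports that mmu_lock was dropped, like the series
already does around its top-ups.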

>
> Not using the cache would also reduce the extra complexity, e.g. no need for
> special mmu_cache handling or a variant of tdp_mmu_iter_cond_resched().
>
> I'm thinking something like this (very incomplete):
>
> static void init_tdp_mmu_page(struct kvm_mmu_page *sp, u64 *spt, gfn_t gfn,
>                               union kvm_mmu_page_role role)
> {
>         sp->spt = spt;
>         set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>
>         sp->role = role;
>         sp->gfn = gfn;
>         sp->tdp_mmu_page = true;
>
>         trace_kvm_mmu_get_page(sp, true);
> }
>
> static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>                                                union kvm_mmu_page_role role)
> {
>         struct kvm_mmu_page *sp;
>         u64 *spt;
>
>         sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
>         spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
>
>         init_tdp_mmu_page(sp, spt, gfn, role);
> 
>         return sp;
> }
>
> static union kvm_mmu_page_role get_child_page_role(struct tdp_iter *iter)
> {
>         struct kvm_mmu_page *parent = sptep_to_sp(rcu_dereference(iter->sptep));
>         union kvm_mmu_page_role role = parent->role;
>
>         role.level--;
>         return role;
> }
>
> static bool tdp_mmu_install_sp_atomic(struct kvm *kvm,
>                                       struct tdp_iter *iter,
>                                       struct kvm_mmu_page *sp,
>                                       bool account_nx)
> {
>         u64 spte;
>
>         spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
>
>         if (tdp_mmu_set_spte_atomic(kvm, iter, spte)) {
>                 tdp_mmu_link_page(kvm, sp, account_nx);
>                 return true;
>         }
>         return false;
> }
>
> static void tdp_mmu_split_large_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
>                                            gfn_t start, gfn_t end, int target_level)
> {
>         /*
>          * Disallow direct reclaim, allocations will be made while holding
>          * mmu_lock and must not sleep.
>          */
>         gfp_t gfp = (GFP_KERNEL_ACCOUNT | __GFP_ZERO) & ~__GFP_DIRECT_RECLAIM;
>         struct kvm_mmu_page *sp = NULL;
>         struct tdp_iter iter;
>         bool flush = false;
>         u64 *spt = NULL;
>         int r;
>
>         rcu_read_lock();
>
>         /*
>          * Traverse the page table splitting all large pages above the target
>          * level into one lower level. For example, if we encounter a 1GB page
>          * we split it into 512 2MB pages.
>          *
>          * Since the TDP iterator uses a pre-order traversal, we are guaranteed
>          * to visit an SPTE before ever visiting its children, which means we
>          * will correctly recursively split large pages that are more than one
>          * level above the target level (e.g. splitting 1GB to 2MB to 4KB).
>          */
>         for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
> retry:
>                 if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true))
>                         continue;
>
>                 if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
>                         continue;
>
>                 if (!sp) {
>                         sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
>                         if (!sp)
>                                 break;
>                         spt = (void *)__get_free_page(gfp);
>                         if (!spt)
>                                 break;
>                 }
>
>                 init_tdp_mmu_page(sp, spt, iter.gfn,
>                                   get_child_page_role(&iter));
>
>                 if (!tdp_mmu_split_large_page(kvm, &iter, sp))
>                         goto retry;
>
>                 sp = NULL;
>                 spt = NULL;
>         }
>
>         free_page((unsigned long)spt);
>         /* sp may be NULL if allocation failed or the last sp was installed. */
>         if (sp)
>                 kmem_cache_free(mmu_page_header_cache, sp);
>
>         rcu_read_unlock();
>
>         if (flush)
>                 kvm_flush_remote_tlbs(kvm);
> }

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU
  2021-12-01  4:10     ` Peter Xu
  2021-12-01  4:19       ` Peter Xu
@ 2021-12-01 21:46       ` David Matlack
  1 sibling, 0 replies; 77+ messages in thread
From: David Matlack @ 2021-12-01 21:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Tue, Nov 30, 2021 at 8:10 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Nov 30, 2021 at 03:22:29PM -0800, David Matlack wrote:
> > On Fri, Nov 26, 2021 at 6:13 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Hi, David,
> > >
> > > On Fri, Nov 19, 2021 at 11:57:44PM +0000, David Matlack wrote:
> > > > This series is a first pass at implementing Eager Page Splitting for the
> > > > TDP MMU. For context on the motivation and design of Eager Page
> > > > Splitting, please see the RFC design proposal and discussion [1].
> > > >
> > > > Paolo, I went ahead and added splitting in both the intially-all-set
> > > > case (only splitting the region passed to CLEAR_DIRTY_LOG) and the
> > > > case where we are not using initially-all-set (splitting the entire
> > > > memslot when dirty logging is enabled) to give you an idea of what
> > > > both look like.
> > > >
> > > > Note: I will be on vacation all of next week so I will not be able to
> > > > respond to reviews until Monday November 29. I thought it would be
> > > > useful to seed discussion and reviews with an early version of the code
> > > > rather than putting it off another week. But feel free to also ignore
> > > > this until I get back :)
> > > >
> > > > This series compiles and passes the most basic splitting test:
> > > >
> > > > $ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 2 -i 4
> > > >
> > > > But please operate under the assumption that this code is probably
> > > > buggy.
> > > >
> > > > [1] https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/#t
> > >
> > > Will there be more numbers to show in the formal patchset?
> >
> > Yes definitely. I didn't have a lot of time to test this series, hence
> > the RFC status. I'll include more thorough testing and performance
> > evaluation in the cover letter for v1.
> >
> >
> > > It's interesting to
> > > know how "First Pass Dirty Memory Time" will change compared to the RFC
> > > numbers; I can have a feel for it, but still. :) Also, not only how it speeds up
> > > guest dirty apps, but also some general measurement of how it slows down
> > > KVM_SET_USER_MEMORY_REGION (!init-all-set) or CLEAR_LOG (init-all-set) would be
> > > even nicer (for CLEAR, I guess the 1st/2nd+ rounds will have different overhead).
> > >
> > > Besides that, I'm also wondering whether we should still have a knob for it, in
> > > case the use case is one where eagerly splitting huge pages may
> > > not help at all.  What I'm thinking:
> > >
> > >   - Read-mostly guest workload; splitting huge pages will speed up rare writes, but
> > >     at the same time drag readers down due to huge->small page mappings.
> > >
> > >   - Writes-over-very-limited-region workload: say we have a 1T guest and the app
> > >     in the guest only writes to a 10G part of it.  Hmm, not sure whether it exists..
> > >
> > >   - Postcopy targeted: it means precopy may only run a few iterations just to
> > >     send the static pages, so the migration duration will be relatively short,
> > >     and the writes just don't spread much across the whole guest mem.
> > >
> > > I don't really think any of these examples is strong enough as they're all very
> > > corner-cased, but they show why I meant to raise this question of whether
> > > unconditional eager splitting is the best approach.
> >
> > I'd be happy to add a knob if there's a userspace that wants to use
> > it. I think the main challenge though is knowing when it is safe to
> > disable eager splitting.
>
> Isn't it a performance feature?  Why would it not be safe?

Heh, "safe" is a bit overzealous. But we've found that as the vCPU
count scales in VMs, not doing Eager Page Splitting leads to
unacceptable performance degradations (per customers), especially when
using the shadow MMU where hugepage write-protection faults are done
while holding the MMU lock in write mode. So from that perspective,
it's "unsafe" to skip Eager Page Splitting unless you are absolutely
sure the guest workload will not be doing many writes.

>
> > For a small deployment where you know the VM workload, it might make
> > sense. But for a public cloud provider the only feasible way would be to
> > dynamically monitor the guest writing patterns. But then we're back at square
> > one because that would require dirty logging. And even then, there's no
> > guaranteed way to predict future guest write patterns based on past patterns.
>
> Agreed, what I was thinking of was not public cloud usage, but cases
> where we can do specific tuning for specific scenarios.  It normally won't
> matter a lot for small or medium sized VMs, but it can for extreme use cases.

Ack. I'll include a module parameter in v1 like you suggested in your other email.
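
Probably something as simple as this (the knob name and default below are
placeholders, for illustration only):

static bool eagerly_split_huge_pages = true;
module_param(eagerly_split_huge_pages, bool, 0644);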

>
> >
> > The way forward here might be to do a hybrid of 2M and 4K dirty
> > tracking (and maybe even 1G). For example, first start dirty logging
> > at 2M granularity, and then log at 4K for any specific regions or
> > memslots that aren't making progress. We'd still use Eager Page
> > Splitting unconditionally though, first to split to 2M and then to
> > split to 4K.
>
> Do you mean we'd also offer a different-granularity dirty bitmap to userspace
> too?

Perhaps. The 2M dirty tracking work is still in very early research
phases and the first version will likely not be so dynamic. But I
could imagine we eventually get to the point where we are doing some
hybrid approach.

>
> I remember you mentioned 2MB dirty tracking in your RFC series, but I didn't
> expect it could be dynamically switched during tracking.  That sounds like a very
> interesting idea.
>
> Thanks,

Thanks for all the reviews and feedback on this series!

>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 04/15] KVM: x86/mmu: Factor out logic to atomically install a new page table
  2021-12-01 19:13   ` Sean Christopherson
@ 2021-12-01 21:52     ` David Matlack
  0 siblings, 0 replies; 77+ messages in thread
From: David Matlack @ 2021-12-01 21:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Wed, Dec 1, 2021 at 11:14 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Nov 19, 2021, David Matlack wrote:
> > Factor out the logic to atomically replace an SPTE with an SPTE that
> > points to a new page table. This will be used in a follow-up commit to
> > split a large page SPTE into one level lower.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 53 ++++++++++++++++++++++++++------------
> >  1 file changed, 37 insertions(+), 16 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index cc9fe33c9b36..9ee3f4f7fdf5 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -945,6 +945,39 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> >       return ret;
> >  }
> >
> > +/*
> > + * tdp_mmu_install_sp_atomic - Atomically replace the given spte with an
> > + * spte pointing to the provided page table.
> > + *
> > + * @kvm: kvm instance
> > + * @iter: a tdp_iter instance currently on the SPTE that should be set
> > + * @sp: The new TDP page table to install.
> > + * @account_nx: True if this page table is being installed to split a
> > + *              non-executable huge page.
> > + *
> > + * Returns: True if the new page table was installed. False if spte being
> > + *          replaced changed, causing the atomic compare-exchange to fail.
> > + *          If this function returns false the sp will be freed before
> > + *          returning.
> > + */
> > +static bool tdp_mmu_install_sp_atomic(struct kvm *kvm,
> > +                                   struct tdp_iter *iter,
> > +                                   struct kvm_mmu_page *sp,
> > +                                   bool account_nx)
> > +{
> > +     u64 spte;
> > +
> > +     spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
>
> This can easily go on one line.
>
>         u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
> > +
> > +     if (tdp_mmu_set_spte_atomic(kvm, iter, spte)) {
> > +             tdp_mmu_link_page(kvm, sp, account_nx);
> > +             return true;
> > +     } else {
> > +             tdp_mmu_free_sp(sp);
> > +             return false;
>
> I don't think this helper should free the sp on failure, even if that's what all
> paths end up doing.  When reading the calling code, it really looks like the sp
> is being leaked because the allocation and free are in different contexts.  That
> the sp is consumed on success is fairly intuitive given the "install" action, but
> freeing on failure not so much.
>
> And for the eager splitting, freeing on failure is wasteful.  It's extremely
> unlikely to happen often, so in practice it's unlikely to be an issue, but it's
> certainly odd since the loop is likely going to immediately allocate another sp,
> either for the current spte or for the next spte.

Good point. I'll fix this in v1.

>
> Side topic, tdp_mmu_set_spte_atomic() and friends really should return 0/-EBUSY.
> Boolean returns for errors usually end in tears sooner or later.

Agreed. I was sticking with local style here but would like to see
more of this code switch to returning ints. I'll take a look at
including that cleanup as well in v1, if not a separate pre-series.
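
For reference, a rough sketch of the int-returning variant I have in mind,
assuming tdp_mmu_set_spte_atomic() is also converted to return 0/-EBUSY and
leaving the freeing of the sp to the caller per your other comment:

static int tdp_mmu_install_sp_atomic(struct kvm *kvm,
				     struct tdp_iter *iter,
				     struct kvm_mmu_page *sp,
				     bool account_nx)
{
	u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
	int ret;

	/*
	 * -EBUSY if the SPTE changed underneath us; ownership of sp stays
	 * with the caller, which can reuse or free it.
	 */
	ret = tdp_mmu_set_spte_atomic(kvm, iter, spte);
	if (ret)
		return ret;

	tdp_mmu_link_page(kvm, sp, account_nx);

	return 0;
}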

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 06/15] KVM: x86/mmu: Derive page role from parent
  2021-12-01  0:45       ` Sean Christopherson
@ 2021-12-01 21:56         ` David Matlack
  0 siblings, 0 replies; 77+ messages in thread
From: David Matlack @ 2021-12-01 21:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Tue, Nov 30, 2021 at 4:45 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Nov 30, 2021, David Matlack wrote:
> > > I have a similar patch for the old MMU, but it was also replacing
> > > shadow_root_level with shadow_root_role.  I'll see if I can adapt it to
> > > the TDP MMU, since the shadow_root_role is obviously the same for both.
> >
> > While I was writing this patch it got me wondering if we can do an
> > even more general refactor and replace root_hpa and shadow_root_level
> > with a pointer to the root kvm_mmu_page struct. But I didn't get a
> > chance to look into it further.
>
> For the TDP MMU, yes, as root_hpa == __pa(sp->spt) in all cases.  For the legacy/full
> MMU, not without additional refactoring since root_hpa doesn't point at a kvm_mmu_page
> when KVM shadows a non-paging guest with PAE paging (uses pae_root), or when KVM
> shadows nested NPT and the guest is using fewer paging levels than the host (uses
> pml5_root or pml4_root).
>
>         if (mmu->shadow_root_level == PT64_ROOT_5LEVEL)
>                 mmu->root_hpa = __pa(mmu->pml5_root);
>         else if (mmu->shadow_root_level == PT64_ROOT_4LEVEL)
>                 mmu->root_hpa = __pa(mmu->pml4_root);
>         else
>                 mmu->root_hpa = __pa(mmu->pae_root);
>
> That's definitely a solvable problem, e.g. it wouldn't be a problem to burn a few
> kvm_mmu_page for the special root.  The biggest issue is probably the sheer amount
> of code that would need to be updated.  I do think it would be a good change, but
> I think we'd want to do it in a release that isn't expected to have many other MMU
> changes.

Thanks for the explanation! I had a feeling this refactor would start
getting hairy when I ventured outside of the TDP MMU.

>
> shadow_root_level can also be replaced by mmu_role.base.level.  I've never bothered
> to do the replacement because there's zero memory savings and it would undoubtedly
> take me some time to retrain my brain :-)

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
  2021-12-01 20:16       ` Sean Christopherson
@ 2021-12-01 22:11         ` Ben Gardon
  0 siblings, 0 replies; 77+ messages in thread
From: Ben Gardon @ 2021-12-01 22:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Matlack, Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Wed, Dec 1, 2021 at 12:17 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Dec 01, 2021, Ben Gardon wrote:
> > On Wed, Dec 1, 2021 at 11:22 AM Sean Christopherson <seanjc@google.com> wrote:
> > > I would prefer we use hugepage when possible, mostly because that's the terminology
> > > used by the kernel.  KVM is comically inconsistent, but if we make an effort to use
> > > hugepage when adding new code, hopefully someday we'll have enough inertia to commit
> > > fully to hugepage.
> >
> > In my mind "huge page" implies 2M and "large page" is generic to 2m
> > and 1g. (IDK if we settled on a name for 1G pages)
>
> What about 4m PSE pages?  :-)
>
> I'm mostly joking, but it does raise the point that trying to provide unique names
> for each size is a bit of a fool's errand, especially on non-x86 architectures that
> support a broader variety of hugepage sizes.  IMO, the least ambiguous way to refer
> to hugepages is to say that everything that isn't a 4k page (or whatever PAGE_SIZE
> is on the architecture) is a hugepage, and then explicitly state the size of the
> page if it matters.
>
> > I've definitely been guilty of reinforcing this inconsistent
> > terminology. (Though it was consistent in my head, of course.) If we
> > want to pick one and use it everywhere, I'm happy to get onboard with
> > a standard terminology.
>
> I hear you on using "large page", I've had to undo a solid decade of "large page"
> terminology from my pre-Linux days.  But for better or worse, the kernel uses
> hugepage, e.g. hugetlbfs supports 1gb and 2mb pages.  I think we should follow
> the kernel, especially since we have aspirations of unifying more of KVM's MMU
> across multiple architectures.

Sounds good to me. I'll keep that in mind in future patches. I'm happy
to call them anything as long as we all use the same terms.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
  2021-12-01  4:03         ` Peter Xu
@ 2021-12-01 22:14           ` David Matlack
  2021-12-03  4:57             ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: David Matlack @ 2021-12-01 22:14 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Tue, Nov 30, 2021 at 8:04 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Nov 30, 2021 at 04:17:01PM -0800, David Matlack wrote:
> > On Tue, Nov 30, 2021 at 4:16 PM David Matlack <dmatlack@google.com> wrote:
> > >
> > > On Fri, Nov 26, 2021 at 4:17 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Fri, Nov 19, 2021 at 11:57:57PM +0000, David Matlack wrote:
> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > index 6768ef9c0891..4e78ef2dd352 100644
> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > > > >               gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> > > > >               gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> > > > >
> > > > > +             /*
> > > > > +              * Try to proactively split any large pages down to 4KB so that
> > > > > +              * vCPUs don't have to take write-protection faults.
> > > > > +              */
> > > > > +             kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
> > > > > +
> > > > >               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> > > > >
> > > > >               /* Cross two large pages? */
> > > >
> > > > Is it intended to try to split every time even if we could have split it already?
> > > > As I remember, Paolo mentioned that we can skip the split if it's not the 1st
> > > > CLEAR_LOG on the same range, and IIUC that makes sense.
> > > >
> > > > But indeed I don't see a trivial way to know whether this is the first clear of
> > > > this range.  Maybe we can maintain "how many huge pages are there under the current
> > > > kvm_mmu_page node" somehow?  Then if the root sp has counter==0, we can
> > > > skip it.  Just a wild idea..
> > > >
> > > > Or maybe it's intended to try to split unconditionally for some reason?  If so
> > > > it'd be great to mention that either in the commit message or in comments.
> > >
> > > Thanks for calling this out. Could the same be said about the existing
> > > code that unconditionally tries to write-protect 2M+ pages?
>
> They're different because wr-protection can be undone (restored to not-wr-protected)
> when vcpu threads write to the pages, so it always needs to be redone.

That's true for 4K pages, but not for write-protecting 2M+ pages
(which is what we're discussing here). Once KVM write-protects a 2M+
page, it should never need to write-protect it again, but we always
try to here. Same goes with splitting.

>
> For huge page splitting - once it has happened during dirty tracking it won't be
> undone, so it's a one-time thing.
>
> > > I aimed to keep parity with the write-protection calls (always try to split
> > > before write-protecting) but I agree there might be opportunities available
> > > to skip altogether.
>
> So IMHO it's not about parity, but about how easily it can be
> implemented, and whether it'll be worth adding that complexity.

Agreed.

>
> Besides the above per-sp accounting idea, we could do this in other ways
> too, e.g., keep a bitmap showing which ranges have been split: that bitmap
> can be 2M in granularity for x86 because that'll be enough.  We'd init it to
> all ones when dirty logging starts for a memslot.
>
> But again, maybe it turns out we don't really want that complexity.
>
> IMHO a good start could be the perf numbers (which I asked for in the cover letter)
> comparing the overhead of 2nd+ iterations of CLEAR_LOG with/without eager page
> splitting.

Ack. I'll be sure to include these in v1!

>
> > >
> > > By the way, looking at this code again I think I see some potential bugs:
> > >  - I don't think I ever free split_caches in the initially-all-set case.
>
> I saw that it's freed in kvm_mmu_try_split_large_pages(), no?

Ah yes you are right. I misremembered how I implemented it and thought
we kept the split_caches around across calls to CLEAR_LOG. (We
probably should TBH. The current implementation is quite wasteful.)

>
> > >  - What happens if splitting fails the CLEAR_LOG but succeeds the
> > > CLEAR_LOG?
> >
> > Gah, meant to say "first CLEAR_LOG" and "second CLEAR_LOG" here.
> >
> > > We would end up propagating the write-protection on the 2M
> > > page down to the 4K page. This might cause issues if using PML.
>
> Hmm looks correct.. I'm wondering what will happen with that.
>
> Firstly this should be rare, as the 1st split should succeed in 99% of cases.
>
> Then if the split failed at the 1st attempt, we'll have wr-protected sptes even with
> pml enabled during the split.  When they're written, we'll go through the fast page
> fault path and record the writes too, I think, as we'll apply the dirty bit to the
> new spte, so I think it'll just skip pml.  Looks like we'll be using a mixture of
> pml+wp but all dirty pages will still be captured as expected?..

That's what I was hoping for. I'll double check for v1.

>
> There could be leftover wp when stopping dirty logging, but that doesn't seem
> directly harmful either.  It'll make things a bit messed up, at least.
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
  2021-12-01 19:22   ` Sean Christopherson
  2021-12-01 19:49     ` Ben Gardon
@ 2021-12-01 22:17     ` David Matlack
  1 sibling, 0 replies; 77+ messages in thread
From: David Matlack @ 2021-12-01 22:17 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier

On Wed, Dec 1, 2021 at 11:22 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Nov 19, 2021, David Matlack wrote:
> > When using initially-all-set, large pages are not write-protected when
> > dirty logging is enabled on the memslot. Instead they are
> > write-protected once userspace invokes CLEAR_DIRTY_LOG for the first
> > time, and only for the specific sub-region of the memslot that userspace
> > wishes to clear.
> >
> > Enhance CLEAR_DIRTY_LOG to also try to split large pages prior to
> > write-protecting to avoid causing write-protection faults on vCPU
> > threads. This also allows userspace to smear the cost of large page
> > splitting across multiple ioctls rather than splitting the entire
> > memslot when not using initially-all-set.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  4 ++++
> >  arch/x86/kvm/mmu/mmu.c          | 30 ++++++++++++++++++++++--------
> >  2 files changed, 26 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 432a4df817ec..6b5bf99f57af 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1591,6 +1591,10 @@ void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
> >  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> >                                     const struct kvm_memory_slot *memslot,
> >                                     int start_level);
> > +void kvm_mmu_try_split_large_pages(struct kvm *kvm,
>
> I would prefer we use hugepage when possible, mostly because that's the terminology
> used by the kernel.  KVM is comically inconsistent, but if we make an effort to use
> hugepage when adding new code, hopefully someday we'll have enough inertia to commit
> fully to hugepage.

Will do.

>
> > +                                const struct kvm_memory_slot *memslot,
> > +                                u64 start, u64 end,
> > +                                int target_level);
> >  void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
> >                                       const struct kvm_memory_slot *memslot,
> >                                       int target_level);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6768ef9c0891..4e78ef2dd352 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> >               gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> >               gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> >
> > +             /*
> > +              * Try to proactively split any large pages down to 4KB so that
> > +              * vCPUs don't have to take write-protection faults.
> > +              */
> > +             kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
>
> This should return a value.  If splitting succeeds, there should be no hugepages
> and so walking the page tables to write-protect 2M is unnecessary.  Same for the
> previous patch, although skipping the write-protect path is a little less
> straightforward in that case.

Great idea! Will do.
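
A minimal sketch of what that could look like here, assuming
kvm_mmu_try_split_large_pages() is changed to return true if any hugepages may
remain in [start, end] (e.g. because an allocation failed), with the existing
write-protect calls simply moving inside the "if" (not the final code):

		if (kvm_mmu_try_split_large_pages(kvm, slot, start, end,
						  PG_LEVEL_4K)) {
			kvm_mmu_slot_gfn_write_protect(kvm, slot, start,
						       PG_LEVEL_2M);

			/* Cross two large pages? */
			if (ALIGN(start << PAGE_SHIFT, PMD_SIZE) !=
			    ALIGN(end << PAGE_SHIFT, PMD_SIZE))
				kvm_mmu_slot_gfn_write_protect(kvm, slot, end,
							       PG_LEVEL_2M);
		}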

>
> > +
> >               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> >
> >               /* Cross two large pages? */

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
  2021-12-01 21:36               ` David Matlack
@ 2021-12-01 23:37                 ` Sean Christopherson
  2021-12-02 17:41                   ` David Matlack
  0 siblings, 1 reply; 77+ messages in thread
From: Sean Christopherson @ 2021-12-01 23:37 UTC (permalink / raw)
  To: David Matlack
  Cc: Peter Xu, Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel,
	Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Wed, Dec 01, 2021, David Matlack wrote:
> On Wed, Dec 1, 2021 at 10:29 AM Sean Christopherson <seanjc@google.com> wrote:
> > Hmm, in this particular case, I think using the caches is the wrong approach.  The
> > behavior of pre-filling the caches makes sense for vCPUs because faults may need
> > multiple objects and filling the cache ensures the entire fault can be handled
> > without dropping mmu_lock.  And any extra/unused objects can be used by future
> > faults.  For page splitting, neither of those really holds true.  If there are a
> > lot of pages to split, KVM will have to drop mmu_lock to refill the cache.  And if
> > there are few pages to split, or the caches are refilled toward the end of the walk,
> > KVM may end up with a pile of unused objects it needs to free.
> >
> > Since this code already needs to handle failure, and more importantly, it's a
> > best-effort optimization, I think trying to use the caches is a square peg, round
> > hole scenario.
> >
> > Rather than use the caches, we could do allocation 100% on-demand and never drop
> > mmu_lock to do allocation.  The one caveat is that direct reclaim would need to be
> > disallowed so that the allocation won't sleep.  That would mean that eager splitting
> > would fail under heavy memory pressure when it otherwise might succeed by reclaiming.
> > That would mean vCPUs get penalized as they'd need to do the splitting on fault and
> > potentially do direct reclaim as well.  It's not obvious that that would be a problem
> > in practice, e.g. the vCPU is probably already seeing a fair amount of disruption due
> > to memory pressure, and slowing down vCPUs might alleviate some of that pressure.
> 
> Not necessarily. The vCPUs might be running just fine in the VM being
> split because they are in their steady state and not faulting in any
> new memory. (Memory pressure might be coming from another VM landing
> on the host.)

Hrm, true.

> IMO, if we have an opportunity to avoid doing direct reclaim in the
> critical path of customer execution we should take it.
>
> 
> The on-demand approach will also increase the amount of time we have
> to hold the MMU lock for page splitting. This is not too terrible for
> the TDP MMU since we are holding the MMU lock in read mode, but is
> going to become a problem when we add page splitting support for the
> shadow MMU.
> 
> I do agree that the caches approach, as implemented, will inevitably
> end up with a pile of unused objects at the end that need to be freed.
> I'd be happy to take a look and see if there's any way to reduce the
> amount of unused objects at the end with a bit smarter top-up logic.

It's not just the extra objects, it's the overall complexity that bothers me.
Complexity isn't really the correct word, it's more that as written, the logic
is spread over several files and is disingenuous from the perspective that the
split_cache is in kvm->arch, which implies persistence, but the caches are
completely torn down after every memslot split.

I suspect part of the problem is that the code is trying to plan for a future
where nested MMUs also support splitting large pages.  Usually I'm all for that
sort of thing, but in this case it creates a lot of APIs that should not exist,
either because the function is not needed at all, or because it's a helper buried
in tdp_mmu.c.  E.g. assert_split_caches_invariants() is overkill.

That's solvable by refactoring and shuffling code, but using kvm_mmu_memory_cache
still feels wrong.  The caches don't fully solve the might_sleep() problem since
the loop still has to drop mmu_lock purely because it needs to allocate memory,
and at the same time the caches are too aggressive because we can theoretically get
false positives on OOM scenarios, e.g. a topup could fail when trying to allocate
25 objects, when only 1 is needed.  We could enhance the cache code, which is
pretty rudimentary, but it still feels forced.
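
For context, the generic topup helper (paraphrased from memory below, so not
verbatim kernel code) treats a partial fill below 'min' as outright failure,
which is where that false positive comes from:

int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
{
	void *obj;

	if (mc->nobjs >= min)
		return 0;

	while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
		obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
		if (!obj)
			/* A partial fill below 'min' is reported as -ENOMEM. */
			return mc->nobjs >= min ? 0 : -ENOMEM;
		mc->objects[mc->nobjs++] = obj;
	}
	return 0;
}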

One thing we can take advantage of is that remote TLB flushes can be deferred
until after all roots are done, and don't need to be serviced if mmu_lock is
dropped.  The change from a hugepage to a collection of smaller pages is atomic, no
memory is freed, and there are no changes in gfn=>pfn made by the split.  If
something else comes along and modifies the newly created sp or its children,
then it will flush accordingly.  Similar to write-protecting the page, the only
requirement is that all vCPUs see the small pages before the ioctl() returns,
i.e. before userspace can query the dirty log.  Never needing to flush is one
less reason to use a variant of tdp_mmu_iter_cond_resched(). 

So, what if we do something like this?  Try to allocate on-demand without dropping
mmu_lock.  In the happy case, it will succeed and there's no need to drop mmu_lock.
If allocation fails, drop RCU and mmu_lock and retry with direct reclaim allowed.

Some ugly gotos to reduce indentation, there's probably a better way to dress
this up.  Comments obviously needed.  This also doesn't track whether or not a
flush is needed, that will sadly need to be an in/out param, assuming we want to
return success/failure.

static struct kvm_mmu_page *tdp_mmu_alloc_sp(gfp_t allow_direct_reclaim)
{
	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | allow_direct_reclaim;
	struct kvm_mmu_page *sp;
	u64 *spt;

	spt = (void *)__get_free_page(gfp);
	if (!spt)
		return NULL;

	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
	if (!sp) {
		free_page((unsigned long)spt);
		return NULL;
	}

	sp->spt = spt;

	return sp;
}

static int tdp_mmu_split_large_pages(struct kvm *kvm, struct kvm_mmu_page *root,
				     gfn_t start, gfn_t end, int target_level)
{
	struct kvm_mmu_page *sp = NULL;
	struct tdp_iter iter;

	rcu_read_lock();

	for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
retry:
		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
			continue;

		if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
			continue;

		if (likely(sp))
			goto do_split;

		sp = tdp_mmu_alloc_sp(0);
		if (!sp) {
			rcu_read_unlock();
			read_unlock(&kvm->mmu_lock);

			sp = tdp_mmu_alloc_sp(__GFP_DIRECT_RECLAIM);

			read_lock(&kvm->mmu_lock);

			if (!sp)
				return -ENOMEM;

			rcu_read_lock();
			tdp_iter_restart(&iter);
			continue;
		}

do_split:
		init_tdp_mmu_page(sp, iter.gfn, get_child_page_role(&iter));

		if (!tdp_mmu_split_large_page(kvm, &iter, sp))
			goto retry;

		sp = NULL;
	}

	rcu_read_unlock();

	return 0;
}


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
  2021-12-01 23:37                 ` Sean Christopherson
@ 2021-12-02 17:41                   ` David Matlack
  2021-12-02 18:42                     ` Sean Christopherson
  0 siblings, 1 reply; 77+ messages in thread
From: David Matlack @ 2021-12-02 17:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Xu, Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel,
	Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Wed, Dec 1, 2021 at 3:37 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Dec 01, 2021, David Matlack wrote:
> > On Wed, Dec 1, 2021 at 10:29 AM Sean Christopherson <seanjc@google.com> wrote:
> > > Hmm, in this particular case, I think using the caches is the wrong approach.  The
> > > behavior of pre-filling the caches makes sense for vCPUs because faults may need
> > > multiple objects and filling the cache ensures the entire fault can be handled
> > > without dropping mmu_lock.  And any extra/unused objects can be used by future
> > > faults.  For page splitting, neither of those really holds true.  If there are a
> > > lot of pages to split, KVM will have to drop mmu_lock to refill the cache.  And if
> > > there are few pages to split, or the caches are refilled toward the end of the walk,
> > > KVM may end up with a pile of unused objects it needs to free.
> > >
> > > Since this code already needs to handle failure, and more importantly, it's a
> > > best-effort optimization, I think trying to use the caches is a square peg, round
> > > hole scenario.
> > >
> > > Rather than use the caches, we could do allocation 100% on-demand and never drop
> > > mmu_lock to do allocation.  The one caveat is that direct reclaim would need to be
> > > disallowed so that the allocation won't sleep.  That would mean that eager splitting
> > > would fail under heavy memory pressure when it otherwise might succeed by reclaiming.
> > > That would mean vCPUs get penalized as they'd need to do the splitting on fault and
> > > potentially do direct reclaim as well.  It's not obvious that that would be a problem
> > > in practice, e.g. the vCPU is probably already seeing a fair amount of disruption due
> > > to memory pressure, and slowing down vCPUs might alleviate some of that pressure.
> >
> > Not necessarily. The vCPUs might be running just fine in the VM being
> > split because they are in their steady state and not faulting in any
> > new memory. (Memory pressure might be coming from another VM landing
> > on the host.)
>
> Hrm, true.
>
> > IMO, if we have an opportunity to avoid doing direct reclaim in the
> > critical path of customer execution we should take it.
> >
> >
> > The on-demand approach will also increase the amount of time we have
> > to hold the MMU lock for page splitting. This is not too terrible for
> > the TDP MMU since we are holding the MMU lock in read mode, but is
> > going to become a problem when we add page splitting support for the
> > shadow MMU.
> >
> > I do agree that the caches approach, as implemented, will inevitably
> > end up with a pile of unused objects at the end that need to be freed.
> > I'd be happy to take a look and see if there's any way to reduce the
> > amount of unused objects at the end with a bit smarter top-up logic.
>
> It's not just the extra objects, it's the overall complexity that bothers me.
> Complexity isn't really the correct word, it's more that as written, the logic
> is spread over several files and is disingenuous from the perspective that the
> split_cache is in kvm->arch, which implies persistence, but the caches are
> completely torn down after every memslot split.
>
> I suspect part of the problem is that the code is trying to plan for a future
> where nested MMUs also support splitting large pages.  Usually I'm all for that
> sort of thing, but in this case it creates a lot of APIs that should not exist,
> either because the function is not needed at all, or because it's a helper buried
> in tdp_mmu.c.  E.g. assert_split_caches_invariants() is overkill.
>
> That's solvable by refactoring and shuffling code, but using kvm_mmu_memory_cache
> still feels wrong.  The caches don't fully solve the might_sleep() problem since
> the loop still has to drop mmu_lock purely because it needs to allocate memory,

I thought dropping the lock to allocate memory was a good thing. It
reduces the length of time we hold the RCU read lock and mmu_lock in
read mode. Plus it avoids the retry-with-reclaim and lets us reuse the
existing sp allocation code.

Eager page splitting itself does not need to be that performant since
it's not on the critical path of vCPU execution. But holding the MMU
lock can negatively affect vCPU performance.

But your preference is to allocate without dropping the lock when possible. Why?

> and at the same time the caches are too aggressive because we can theoretically get
> false positives on OOM scenarios, e.g. a topup could fail when trying to allocate
> 25 objects, when only 1 is needed.

This is why I picked a min of 1 for the cache top-up. But this would
be true if we increased the min beyond 1.

> We could enhance the cache code, which is
> pretty rudimentary, but it still feels forced.
>
> One thing we can take advantage of is that remote TLB flushes can be deferred
> until after all roots are done, and don't need to be serviced if mmu_lock is
> dropped.

Good point. I'll revise the TLB flushing in v1 regardless.


> Changes from a hugepage to a collection of smaller pages is atomic, no
> memory is freed, and there are no changes in gfn=>pfn made by the split.  If
> something else comes along and modifies the newly created sp or its children,
> then it will flush accordingly.  Similar to write-protecting the page, the only
> requirement is that all vCPUs see the small pages before the ioctl() returns,
> i.e. before userspace can query the dirty log.  Never needing to flush is one
> less reason to use a variant of tdp_mmu_iter_cond_resched().
>
> So, what if we do something like this?  Try to allocate on-demand without dropping
> mmu_lock.  In the happy case, it will succeed and there's no need to drop mmu_lock.
> If allocation fails, drop RCU and mmu_lock and retry with direct reclaim allowed.
>
> Some ugly gotos to reduce indentation, there's probably a better way to dress
> this up.  Comments obviously needed.  This also doesn't track whether or not a
> flush is needed, that will sadly need to be an in/out param, assuming we want to
> return success/failure.
>
> static struct kvm_mmu_page *tdp_mmu_alloc_sp(gfp_t allow_direct_reclaim)
> {
>         gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | allow_direct_reclaim;
>         struct kvm_mmu_page *sp;
>         u64 *spt;
>
>         spt = (void *)__get_free_page(gfp);
>         if (!spt)
>                 return NULL;
>
>         sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
>         if (!sp) {
>                 free_page((unsigned long)spt);
>                 return NULL;
>         }
>
>         sp->spt = spt;
>
>         return sp;
> }
>
> static int tdp_mmu_split_large_pages(struct kvm *kvm, struct kvm_mmu_page *root,
>                                      gfn_t start, gfn_t end, int target_level)
> {
>         struct kvm_mmu_page *sp = NULL;
>         struct tdp_iter iter;
>
>         rcu_read_lock();
>
>         for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
> retry:
>                 if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
>                         continue;
>
>                 if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
>                         continue;
>
>                 if (likely(sp))
>                         goto do_split;
>
>                 sp = tdp_mmu_alloc_sp(0);
>                 if (!sp) {
>                         rcu_read_unlock();
>                         read_unlock(&kvm->mmu_lock);
>
>                         sp = tdp_mmu_alloc_sp(__GFP_DIRECT_RECLAIM);
>
>                         read_lock(&kvm->mmu_lock);
>
>                         if (!sp)
>                                 return -ENOMEM;
>
>                         rcu_read_lock();
>                         tdp_iter_restart(&iter);
>                         continue;
>                 }
>
> do_split:
>                 init_tdp_mmu_page(sp, iter.gfn, get_child_page_role(&iter));
>
>                 if (!tdp_mmu_split_large_page(kvm, &iter, sp))
>                         goto retry;
>
>                 sp = NULL;
>         }
>
>         rcu_read_unlock();
>
>         return 0;
> }
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
  2021-12-02 17:41                   ` David Matlack
@ 2021-12-02 18:42                     ` Sean Christopherson
  2021-12-03  0:00                       ` David Matlack
  0 siblings, 1 reply; 77+ messages in thread
From: Sean Christopherson @ 2021-12-02 18:42 UTC (permalink / raw)
  To: David Matlack
  Cc: Peter Xu, Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel,
	Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Thu, Dec 02, 2021, David Matlack wrote:
> On Wed, Dec 1, 2021 at 3:37 PM Sean Christopherson <seanjc@google.com> wrote:
> > It's not just the extra objects, it's the overall complexity that bothers me.
> > Complexity isn't really the correct word, it's more that as written, the logic
> > is spread over several files and is disingenuous from the perspective that the
> > split_cache is in kvm->arch, which implies persistence, but the caches are
> > completely torn down after every memslot split.
> >
> > I suspect part of the problem is that the code is trying to plan for a future
> > where nested MMUs also support splitting large pages.  Usually I'm all for that
> > sort of thing, but in this case it creates a lot of APIs that should not exist,
> > either because the function is not needed at all, or because it's a helper buried
> > in tdp_mmu.c.  E.g. assert_split_caches_invariants() is overkill.
> >
> > That's solvable by refactoring and shuffling code, but using kvm_mmu_memory_cache
> > still feels wrong.  The caches don't fully solve the might_sleep() problem since
> > the loop still has to drop mmu_lock purely because it needs to allocate memory,
> 
> I thought dropping the lock to allocate memory was a good thing. It
> reduces the length of time we hold the RCU read lock and mmu_lock in
> read mode. Plus it avoids the retry-with-reclaim and lets us reuse the
> existing sp allocation code.

It's not a simple reuse though, e.g. it needs new logic to detect when the caches
are empty, requires a variant of tdp_mmu_iter_cond_resched(), needs its own instance
of caches and thus initialization/destruction of the caches, etc...

> Eager page splitting itself does not need to be that performant since
> it's not on the critical path of vCPU execution. But holding the MMU
> lock can negatively affect vCPU performance.
> 
> But your preference is to allocate without dropping the lock when possible. Why?

Because they're two different things.  Lock contention is already handled by
tdp_mmu_iter_cond_resched().  If mmu_lock is not contended, holding it for a long
duration is a complete non-issue.

Dropping mmu_lock means restarting the walk at the root because a different task
may have zapped/changed upper level entries.  If every allocation is dropping
mmu_lock, that adds up to a lot of extra memory accesses, especially when using
5-level paging.
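
(Rough numbers, for illustration only: a 1 TiB slot mapped with 2 MiB pages
has 2^19, i.e. roughly 512k, huge SPTEs to split.  Restarting a 5-level walk
for every allocation re-reads three or four upper-level SPTEs each time, so
on the order of a couple million extra memory accesses that staying in place
avoids.)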

Batching allocations via mmu_caches mostly works around that problem, but IMO
it's more complex overall than the retry-on-failure approach because it bleeds
core details into several locations, e.g. the split logic needs to know intimate
details of kvm_mmu_memory_cache, and we end up with two (or one complex) versions
of tdp_mmu_iter_cond_resched().

In general, I also dislike relying on magic numbers (the capacity of the cache)
for performance.  At best, we have to justify the magic number, now and in the
future.  At worst, someone will have a use case that doesn't play nice with KVM's
chosen magic number and then we have to do more tuning, e.g. see the PTE prefetch
stuff where the magic number of '8' (well, 7) ran out of gas for modern usage.
I don't actually think tuning will be problematic for this case, but I'd rather
avoid the discussion entirely if possible.

I'm not completely opposed to using kvm_mmu_memory_cache to batch allocations,
but I think we should do so if and only if batching has measurably better
performance for things we care about.  E.g. if eager splitting takes n% longer
under heavy memory pressure, but vCPUs aren't impacted, do we care?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
  2021-12-02 18:42                     ` Sean Christopherson
@ 2021-12-03  0:00                       ` David Matlack
  2021-12-03  1:07                         ` Sean Christopherson
  0 siblings, 1 reply; 77+ messages in thread
From: David Matlack @ 2021-12-03  0:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Xu, Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel,
	Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Thu, Dec 2, 2021 at 10:43 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Dec 02, 2021, David Matlack wrote:
> > On Wed, Dec 1, 2021 at 3:37 PM Sean Christopherson <seanjc@google.com> wrote:
> > > It's not just the extra objects, it's the overall complexity that bothers me.
> > > Complexity isn't really the correct word, it's more that as written, the logic
> > > is spread over several files and is disingenuous from the perspective that the
> > > split_cache is in kvm->arch, which implies persistence, but the caches are
> > > completely torn down after every memslot split.
> > >
> > > I suspect part of the problem is that the code is trying to plan for a future
> > > where nested MMUs also support splitting large pages.  Usually I'm all for that
> > > sort of thing, but in this case it creates a lot of APIs that should not exist,
> > > either because the function is not needed at all, or because it's a helper buried
> > > in tdp_mmu.c.  E.g. assert_split_caches_invariants() is overkill.
> > >
> > > That's solvable by refactoring and shuffling code, but using kvm_mmu_memory_cache
> > > still feels wrong.  The caches don't fully solve the might_sleep() problem since
> > > the loop still has to drop mmu_lock purely because it needs to allocate memory,
> >
> > I thought dropping the lock to allocate memory was a good thing. It
> > reduces the length of time we hold the RCU read lock and mmu_lock in
> > read mode. Plus it avoids the retry-with-reclaim and lets us reuse the
> > existing sp allocation code.
>
> It's not a simple reuse though, e.g. it needs new logic to detect when the caches
> are empty, requires a variant of tdp_mmu_iter_cond_resched(), needs its own instance
> of caches and thus initialization/destruction of the caches, etc...
>
> > Eager page splitting itself does not need to be that performant since
> > it's not on the critical path of vCPU execution. But holding the MMU
> > lock can negatively affect vCPU performance.
> >
> > But your preference is to allocate without dropping the lock when possible. Why?
>
> Because they're two different things.  Lock contention is already handled by
> tdp_mmu_iter_cond_resched().  If mmu_lock is not contended, holding it for a long
> duration is a complete non-issue.

So I think you are positing that disabling reclaim will make the
allocations fast enough that the time between
tdp_mmu_iter_cond_resched checks will be acceptable. Is there really
no risk of long tail latency in kmem_cache_alloc() or
__get_free_page()? Even if such latencies are rare, they will be common at scale.

This is why I'm being so hesitant, and prefer to avoid the problem
entirely by doing all allocations outside the lock. But I'm honestly
more than happy to be convinced otherwise and go with your approach.

>
> Dropping mmu_lock means restarting the walk at the root because a different task
> may have zapped/changed upper level entries.  If every allocation is dropping
> mmu_lock, that adds up to a lot of extra memory accesses, especially when using
> 5-level paging.
>
> Batching allocations via mmu_caches mostly works around that problem, but IMO
> it's more complex overall than the retry-on-failure approach because it bleeds
> core details into several locations, e.g. the split logic needs to know intimate
> details of kvm_mmu_memory_cache, and we end up with two (or one complex) versions
> of tdp_mmu_iter_cond_resched().
>
> In general, I also dislike relying on magic numbers (the capacity of the cache)
> for performance.  At best, we have to justify the magic number, now and in the
> future.  At worst, someone will have a use case that doesn't play nice with KVM's
> chosen magic number and then we have to do more tuning, e.g. see the PTE prefetch
> stuff where the magic number of '8' (well, 7) ran out of gas for modern usage.
> I don't actually think tuning will be problematic for this case, but I'd rather
> avoid the discussion entirely if possible.
>
> I'm not completely opposed to using kvm_mmu_memory_cache to batch allocations,
> but I think we should do so if and only if batching has measurably better
> performance for things we care about.  E.g. if eager splitting takes n% longer
> under heavy memory pressure, but vCPUs aren't impacted, do we care?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
  2021-12-03  0:00                       ` David Matlack
@ 2021-12-03  1:07                         ` Sean Christopherson
  2021-12-03 17:22                           ` David Matlack
  0 siblings, 1 reply; 77+ messages in thread
From: Sean Christopherson @ 2021-12-03  1:07 UTC (permalink / raw)
  To: David Matlack
  Cc: Peter Xu, Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel,
	Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Thu, Dec 02, 2021, David Matlack wrote:
> On Thu, Dec 2, 2021 at 10:43 AM Sean Christopherson <seanjc@google.com> wrote:
> > Because they're two different things.  Lock contention is already handled by
> > tdp_mmu_iter_cond_resched().  If mmu_lock is not contended, holding it for a long
> > duration is a complete non-issue.
> 
> So I think you are positing that disabling reclaim will make the
> allocations fast enough that the time between
> tdp_mmu_iter_cond_resched checks will be acceptable.

Yep.

> Is there really no risk of long tail latency in kmem_cache_alloc() or
> __get_free_page()? Even if such latencies are rare, they will be common at scale.

If there is a potentially long latency in __get_free_page(), then we're hosed no
matter what because per alloc_pages(), it's allowed in any context, including NMI,
IRQ, and Soft-IRQ.  I've no idea how often those contexts allocate, but I assume
it's not _that_ rare given the amount of stuff that networking does in Soft-IRQ
context, e.g. see the stack trace from commit 2620fe268e80, the use of PF_MEMALLOC,
the use of GFP_ATOMIC in napi_alloc_skb, etc...  And it's not just direct
allocations, e.g. anything that uses a radix tree or XArray will potentially
trigger allocation on insertion.

But I would be very, very surprised if alloc_pages() without GFP_DIRECT_RECLAIM
has a long tail latency, otherwise allocating from any atomic context would be
doomed.
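
To make that concrete, an illustrative sketch (not code from the series):
GFP_KERNEL_ACCOUNT implies __GFP_DIRECT_RECLAIM, so the non-sleeping attempt
has to mask it off, while the fallback leaves it set:

	u64 *spt;
	gfp_t fast_gfp = (GFP_KERNEL_ACCOUNT | __GFP_ZERO) & ~__GFP_DIRECT_RECLAIM;
	gfp_t slow_gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO;

	/* Attempted under mmu_lock: will not sleep for direct reclaim. */
	spt = (u64 *)__get_free_page(fast_gfp);

	/* Fallback, taken only after dropping mmu_lock: reclaim allowed. */
	if (!spt)
		spt = (u64 *)__get_free_page(slow_gfp);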

> This is why I'm being so hesitant, and prefer to avoid the problem
> entirely by doing all allocations outside the lock. But I'm honestly
> more than happy to be convinced otherwise and go with your approach.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
  2021-12-01 22:14           ` David Matlack
@ 2021-12-03  4:57             ` Peter Xu
  0 siblings, 0 replies; 77+ messages in thread
From: Peter Xu @ 2021-12-03  4:57 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Wed, Dec 01, 2021 at 02:14:27PM -0800, David Matlack wrote:
> > > > Thanks for calling this out. Could the same be said about the existing
> > > > code that unconditionally tries to write-protect 2M+ pages?
> >
> > They're different because wr-protect can be restored (to be not-wr-protected)
> > when vcpu threads write to the pages, so they need to be always done.
> 
> That's true for 4K pages, but not for write-protecting 2M+ pages
> (which is what we're discussing here). Once KVM write-protects a 2M+
> page, it should never need to write-protect it again, but we always
> try to here. Same goes with splitting.

Ah I see, that's a fair point. :)

Yeah, let's see how it goes with the numbers. I'd hope it's trivial to do both
the 2M write-protect and the split unconditionally, because for CLEAR_LOG the
major overhead should be walking the small pages instead, afaiu.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
  2021-12-03  1:07                         ` Sean Christopherson
@ 2021-12-03 17:22                           ` David Matlack
  0 siblings, 0 replies; 77+ messages in thread
From: David Matlack @ 2021-12-03 17:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Xu, Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel,
	Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier

On Thu, Dec 2, 2021 at 5:07 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Dec 02, 2021, David Matlack wrote:
> > Is there really no risk of long tail latency in kmem_cache_alloc() or
> > __get_free_page()? Even if such latencies are rare, they will be common at scale.
>
> If there is a potentially long latency in __get_free_page(), then we're hosed no
> matter what because per alloc_pages(), it's allowed in any context, including NMI,
> IRQ, and Soft-IRQ.  I've no idea how often those contexts allocate, but I assume
> it's not _that_ rare given the amount of stuff that networking does in Soft-IRQ
> context, e.g. see the stack trace from commit 2620fe268e80, the use of PF_MEMALLOC,
> the use of GFP_ATOMIC in napi_alloc_skb, etc...  And it's not just direct
> allocations, e.g. anything that uses a radix tree or XArray will potentially
> trigger allocation on insertion.
>
> But I would be very, very surprised if alloc_pages() without GFP_DIRECT_RECLAIM
> has a long tail latency, otherwise allocating from any atomic context would be
> doomed.

In that case I agree your approach should not introduce any more MMU
lock contention than the split_caches approach in practice, and will
require a lot less new code. I'll attempt to do some testing to
confirm, but assuming that goes fine I'll go with your approach in v1.

Thanks!

^ permalink raw reply	[flat|nested] 77+ messages in thread

end of thread, other threads:[~2021-12-03 17:23 UTC | newest]

Thread overview: 77+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-19 23:57 [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
2021-11-19 23:57 ` [RFC PATCH 01/15] KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn David Matlack
2021-11-22 18:52   ` Ben Gardon
2021-11-26 12:18   ` Peter Xu
2021-11-19 23:57 ` [RFC PATCH 02/15] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect David Matlack
2021-11-22 18:52   ` Ben Gardon
2021-11-26 12:18   ` Peter Xu
2021-11-19 23:57 ` [RFC PATCH 03/15] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails David Matlack
2021-11-22 18:52   ` Ben Gardon
2021-11-30 23:25     ` David Matlack
2021-11-19 23:57 ` [RFC PATCH 04/15] KVM: x86/mmu: Factor out logic to atomically install a new page table David Matlack
2021-11-22 18:52   ` Ben Gardon
2021-11-30 23:27     ` David Matlack
2021-12-01 19:13   ` Sean Christopherson
2021-12-01 21:52     ` David Matlack
2021-11-19 23:57 ` [RFC PATCH 05/15] KVM: x86/mmu: Abstract mmu caches out to a separate struct David Matlack
2021-11-22 18:55   ` Ben Gardon
2021-11-22 18:55     ` Ben Gardon
2021-11-30 23:28     ` David Matlack
2021-11-19 23:57 ` [RFC PATCH 06/15] KVM: x86/mmu: Derive page role from parent David Matlack
2021-11-20 12:53   ` Paolo Bonzini
2021-11-27  2:07     ` Lai Jiangshan
2021-11-27 10:26       ` Paolo Bonzini
2021-11-30 23:31     ` David Matlack
2021-12-01  0:45       ` Sean Christopherson
2021-12-01 21:56         ` David Matlack
2021-11-19 23:57 ` [RFC PATCH 07/15] KVM: x86/mmu: Pass in vcpu->arch.mmu_caches instead of vcpu David Matlack
2021-11-22 18:56   ` Ben Gardon
2021-11-19 23:57 ` [RFC PATCH 08/15] KVM: x86/mmu: Helper method to check for large and present sptes David Matlack
2021-11-22 18:56   ` Ben Gardon
2021-12-01 18:34   ` Sean Christopherson
2021-12-01 21:13     ` David Matlack
2021-11-19 23:57 ` [RFC PATCH 09/15] KVM: x86/mmu: Move restore_acc_track_spte to spte.c David Matlack
2021-11-22 18:56   ` Ben Gardon
2021-11-19 23:57 ` [RFC PATCH 10/15] KVM: x86/mmu: Abstract need_resched logic from tdp_mmu_iter_cond_resched David Matlack
2021-11-22 18:56   ` Ben Gardon
2021-11-19 23:57 ` [RFC PATCH 11/15] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root David Matlack
2021-11-22 18:56   ` Ben Gardon
2021-11-19 23:57 ` [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled David Matlack
2021-11-22  5:05   ` Nikunj A. Dadhania
2021-11-30 23:33     ` David Matlack
2021-11-22 19:30   ` Ben Gardon
2021-11-30 23:44     ` David Matlack
2021-11-26 12:01   ` Peter Xu
2021-11-30 23:56     ` David Matlack
2021-12-01  1:00       ` Sean Christopherson
2021-12-01  1:29         ` David Matlack
2021-12-01  2:29           ` Peter Xu
2021-12-01 18:29             ` Sean Christopherson
2021-12-01 21:36               ` David Matlack
2021-12-01 23:37                 ` Sean Christopherson
2021-12-02 17:41                   ` David Matlack
2021-12-02 18:42                     ` Sean Christopherson
2021-12-03  0:00                       ` David Matlack
2021-12-03  1:07                         ` Sean Christopherson
2021-12-03 17:22                           ` David Matlack
2021-11-19 23:57 ` [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG David Matlack
2021-11-26 12:17   ` Peter Xu
2021-12-01  0:16     ` David Matlack
2021-12-01  0:17       ` David Matlack
2021-12-01  4:03         ` Peter Xu
2021-12-01 22:14           ` David Matlack
2021-12-03  4:57             ` Peter Xu
2021-12-01 19:22   ` Sean Christopherson
2021-12-01 19:49     ` Ben Gardon
2021-12-01 20:16       ` Sean Christopherson
2021-12-01 22:11         ` Ben Gardon
2021-12-01 22:17     ` David Matlack
2021-11-19 23:57 ` [RFC PATCH 14/15] KVM: x86/mmu: Add tracepoint for splitting large pages David Matlack
2021-11-19 23:57 ` [RFC PATCH 15/15] KVM: x86/mmu: Update page stats when " David Matlack
2021-12-01 19:36   ` Sean Christopherson
2021-12-01 21:11     ` David Matlack
2021-11-26 14:13 ` [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU Peter Xu
2021-11-30 23:22   ` David Matlack
2021-12-01  4:10     ` Peter Xu
2021-12-01  4:19       ` Peter Xu
2021-12-01 21:46       ` David Matlack

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).