* [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing
@ 2022-03-03 19:38 Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 01/30] KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU Paolo Bonzini
                   ` (30 more replies)
  0 siblings, 31 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang


Overhaul the TDP MMU's handling of zapping and TLB flushing to reduce the
number of TLB flushes, fix soft lockups and RCU stalls, avoid blocking
vCPUs for long durations while zapping paging structures, and clean up
the zapping code.

The largest cleanup is to separate the flows for zapping roots (zap
_everything_), zapping leaf SPTEs (zap guest mappings for whatever reason),
and zapping a specific SP (NX recovery).  They're currently smushed into a
single zap_gfn_range(), which was a good idea at the time, but became a
mess when trying to handle the different rules, e.g. TLB flushes aren't
needed when zapping a root because KVM can safely zap a root if and only
if it's unreachable.
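
Roughly, the series ends up with three distinct entry points, sketched
below.  Only kvm_tdp_mmu_zap_sp() is taken from the series itself (the
NX recovery patch); the other two names and signatures are placeholders
for illustration, not the exact helpers the patches add.

  /* Zap an entire root; no TLB flush needed, the root is unreachable. */
  static void example_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
                               bool shared);

  /* Zap only leaf SPTEs in [start, end); the caller handles TLB flushes. */
  static bool example_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
                                gfn_t start, gfn_t end, bool can_yield,
                                bool flush);

  /* Zap a single shadow page, e.g. for NX hugepage recovery. */
  bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);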

To solve the soft lockups, stalls, and vCPU performance issues:

 - Defer remote TLB flushes to the caller when zapping TDP MMU shadow
   pages by relying on RCU to ensure the paging structure isn't freed
   until all vCPUs have exited the guest.

 - Allow yielding when zapping TDP MMU roots in response to the root's
   last reference being put.  This requires a bit of trickery to ensure
   the root is reachable via mmu_notifier, but it's not too gross.

 - Zap roots in two passes to avoid holding RCU for potentially hundreds of
   seconds when zapping a guest with terabytes of memory that is backed
   entirely by 4KiB SPTEs.

 - Zap defunct roots asynchronously via the common workqueue so that a
   vCPU doesn't get stuck doing the work if the vCPU happens to drop the
   last reference to a root (a rough sketch of the idea follows below).
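
To make the last point concrete, here is a rough sketch of the idea
only; the workqueue, the work_struct/backpointer fields, and the helper
names below are illustrative, not necessarily what the patches use:

  /*
   * Sketch only: run the zap from a workqueue instead of in the context
   * of the vCPU that dropped the last reference.  All names are made up
   * for illustration.
   */
  static void example_zap_root_work(struct work_struct *work)
  {
          struct kvm_mmu_page *root = container_of(work, struct kvm_mmu_page,
                                                   tdp_mmu_async_work);
          struct kvm *kvm = root->tdp_mmu_async_data;

          read_lock(&kvm->mmu_lock);
          /* ... zap the root's SPTEs, yielding/rescheduling as needed ... */
          read_unlock(&kvm->mmu_lock);
  }

  static void example_schedule_zap_root(struct kvm *kvm,
                                        struct kvm_mmu_page *root)
  {
          root->tdp_mmu_async_data = kvm;
          INIT_WORK(&root->tdp_mmu_async_work, example_zap_root_work);
          queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
  }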

The selftest at the end allows populating a guest with the max amount of
memory allowed by the underlying architecture.  The most I've tested is
~64 TiB (MAXPHYADDR=46), as I don't have easy access to a system with
MAXPHYADDR=52.  The selftest compiles on arm64 and s390x, but otherwise
hasn't been tested outside of x86-64.  It will hopefully do something
useful as is, but there's a non-zero chance it won't get past init with
a high max memory.  Running on x86 without the TDP MMU is comically slow.
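
For reference, the "max amount of memory" above is bounded by the size of
the guest-physical address space, i.e. 2^MAXPHYADDR bytes; a trivial
helper (the name is made up here, the selftest does its own sizing) would
be:

  /* 1ULL << 46 == 64 TiB, 1ULL << 52 == 4 PiB. */
  static inline uint64_t guest_phys_limit_bytes(unsigned int maxphyaddr)
  {
          return 1ULL << maxphyaddr;
  }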

Testing: passes kvm-unit-tests and guest installation tests on Intel.
Haven't yet tested on AMD or run the selftests.

Thanks,

Paolo

v4:
- collected reviews and typo fixes (plus some typo fixes of my own)

- new patches to simplify reader invariants: readers are not allowed to
  acquire references to invalid roots (see the sketch after this list for
  the resulting reader-side rule)

- new version of "Allow yielding when zapping GFNs for defunct TDP MMU
  root", simplifying the atomic a bit by 1) using xchg and relying on
  its implicit memory barriers and 2) relying on readers to have the same
  behavior for the three states refcount=0/valid, refcount=0/invalid and
  refcount=1/invalid (see previous point)

- switch zapping of invalidated roots to asynchronous workers on a
  per-VM workqueue, fixing a bug in v3 where the extra reference added
  by kvm_tdp_mmu_put_root could be given back twice.  This also replaces
  "KVM: x86/mmu: Use common iterator for walking invalid TDP MMU roots"
  in v3, since it gets rid of next_invalidated_root() in a different way.

- because of the previous point, most of the logic in v3's "KVM: x86/mmu:
  Zap defunct roots via asynchronous worker" moves to the earlier patch
  "KVM: x86/mmu: Zap invalidated roots via asynchronous worker"

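To make the reader invariant above concrete, the reader-side rule boils
down to something like the sketch below (illustrative only, not a patch
in this series): a reference can be acquired only on a root that is
valid *and* has a nonzero refcount, so refcount=0/valid,
refcount=0/invalid and refcount=1/invalid all look the same to readers.

  static bool example_get_root(struct kvm_mmu_page *root)
  {
          if (root->role.invalid)
                  return false;

          return refcount_inc_not_zero(&root->tdp_mmu_root_count);
  }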

v3:
- Drop patches that were applied.
- Rebase to latest kvm/queue.
- Collect a review. [David]
- Use helper instead of goto to zap roots in two passes. [David]
- Add patches to disallow REMOVED "old" SPTE when atomically
  setting SPTE.

Paolo Bonzini (5):
  KVM: x86/mmu: only perform eager page splitting on valid roots
  KVM: x86/mmu: do not allow readers to acquire references to invalid roots
  KVM: x86/mmu: Zap invalidated roots via asynchronous worker
  KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root
  KVM: x86/mmu: Zap defunct roots via asynchronous worker

Sean Christopherson (25):
  KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU
  KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap
  KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic
  KVM: x86/mmu: Document that zapping invalidated roots doesn't need to flush
  KVM: x86/mmu: Require mmu_lock be held for write in unyielding root iter
  KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP removal
  KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier change_spte
  KVM: x86/mmu: Drop RCU after processing each root in MMU notifier hooks
  KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU
  KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path
  KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw values
  KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery
  KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU
  KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page
  KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range
  KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()
  KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched
  KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages
  KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls
  KVM: x86/mmu: Check for a REMOVED leaf SPTE before making the SPTE
  KVM: x86/mmu: WARN on any attempt to atomically update REMOVED SPTE
  KVM: selftests: Move raw KVM_SET_USER_MEMORY_REGION helper to utils
  KVM: selftests: Split out helper to allocate guest mem via memfd
  KVM: selftests: Define cpu_relax() helpers for s390 and x86
  KVM: selftests: Add test to populate a VM with the max possible guest mem

 arch/x86/include/asm/kvm_host.h               |   2 +
 arch/x86/kvm/mmu/mmu.c                        |  49 +-
 arch/x86/kvm/mmu/mmu_internal.h               |  15 +-
 arch/x86/kvm/mmu/tdp_iter.c                   |   6 +-
 arch/x86/kvm/mmu/tdp_iter.h                   |  15 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    | 559 +++++++++++-------
 arch/x86/kvm/mmu/tdp_mmu.h                    |  26 +-
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   3 +
 .../selftests/kvm/include/kvm_util_base.h     |   5 +
 .../selftests/kvm/include/s390x/processor.h   |   8 +
 .../selftests/kvm/include/x86_64/processor.h  |   5 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  66 ++-
 .../selftests/kvm/max_guest_memory_test.c     | 292 +++++++++
 .../selftests/kvm/set_memory_region_test.c    |  35 +-
 15 files changed, 794 insertions(+), 293 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/max_guest_memory_test.c

-- 
2.31.1


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v4 01/30] KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 02/30] KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap Paolo Bonzini
                   ` (29 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang, stable

From: Sean Christopherson <seanjc@google.com>

Explicitly check for present SPTEs when clearing dirty bits in the TDP
MMU.  This isn't strictly required for correctness, as setting the dirty
bit in a defunct SPTE will not change the SPTE from !PRESENT to PRESENT.
However, the guarded MMU_WARN_ON() in spte_ad_need_write_protect() would
complain if anyone actually turned on KVM's MMU debugging.

Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-3-seanjc@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index debf08212f12..4cf0cc04b2a0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1468,6 +1468,9 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
 			continue;
 
+		if (!is_shadow_present_pte(iter.old_spte))
+			continue;
+
 		if (spte_ad_need_write_protect(iter.old_spte)) {
 			if (is_writable_pte(iter.old_spte))
 				new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 02/30] KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 01/30] KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 03/30] KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic Paolo Bonzini
                   ` (28 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Fix misleading and arguably wrong comments in the TDP MMU's fast zap
flow.  The comments, and the fact that actually zapping invalid roots was
added separately, strongly suggest that zapping invalid roots is an
optimization and not required for correctness.  That is a lie.

KVM _must_ zap invalid roots before returning from kvm_mmu_zap_all_fast(),
because when it's called from kvm_mmu_invalidate_zap_pages_in_memslot(),
KVM is relying on it to fully remove all references to the memslot.  Once
the memslot is gone, KVM's mmu_notifier hooks will be unable to find the
stale references as the hva=>gfn translation is done via the memslots.
If KVM doesn't immediately zap SPTEs and userspace unmaps a range after
deleting a memslot, KVM will fail to zap in response to the mmu_notifier
due to not finding a memslot corresponding to the notifier's range, which
leads to a variation of use-after-free.

The other misleading comment (and code) explicitly states that roots
without a reference should be skipped.  While that's technically true,
it's also extremely misleading as it should be impossible for KVM to
encounter a defunct root on the list while holding mmu_lock for write.
Opportunistically add a WARN to enforce that invariant.

Fixes: b7cccd397f31 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
Fixes: 4c6654bd160d ("KVM: x86/mmu: Tear down roots before kvm_mmu_zap_all_fast returns")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c     |  8 +++++++
 arch/x86/kvm/mmu/tdp_mmu.c | 46 +++++++++++++++++++++-----------------
 2 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3e7c8ad5bed9..32c041ed65cb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5721,6 +5721,14 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 
 	write_unlock(&kvm->mmu_lock);
 
+	/*
+	 * Zap the invalidated TDP MMU roots, all SPTEs must be dropped before
+	 * returning to the caller, e.g. if the zap is in response to a memslot
+	 * deletion, mmu_notifier callbacks will be unable to reach the SPTEs
+	 * associated with the deleted memslot once the update completes, and
+	 * deferring the zap until the final reference to the root is put would
+	 * lead to use-after-free.
+	 */
 	if (is_tdp_mmu_enabled(kvm)) {
 		read_lock(&kvm->mmu_lock);
 		kvm_tdp_mmu_zap_invalidated_roots(kvm);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 4cf0cc04b2a0..b97a4125feac 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -833,12 +833,11 @@ static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
 }
 
 /*
- * Since kvm_tdp_mmu_zap_all_fast has acquired a reference to each
- * invalidated root, they will not be freed until this function drops the
- * reference. Before dropping that reference, tear down the paging
- * structure so that whichever thread does drop the last reference
- * only has to do a trivial amount of work. Since the roots are invalid,
- * no new SPTEs should be created under them.
+ * Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
+ * zap" completes.  Since kvm_tdp_mmu_invalidate_all_roots() has acquired a
+ * reference to each invalidated root, roots will not be freed until after this
+ * function drops the gifted reference, e.g. so that vCPUs don't get stuck with
+ * tearing down paging structures.
  */
 void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
 {
@@ -877,21 +876,25 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
 }
 
 /*
- * Mark each TDP MMU root as invalid so that other threads
- * will drop their references and allow the root count to
- * go to 0.
+ * Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
+ * is about to be zapped, e.g. in response to a memslots update.  The caller is
+ * responsible for invoking kvm_tdp_mmu_zap_invalidated_roots() to do the actual
+ * zapping.
  *
- * Also take a reference on all roots so that this thread
- * can do the bulk of the work required to free the roots
- * once they are invalidated. Without this reference, a
- * vCPU thread might drop the last reference to a root and
- * get stuck with tearing down the entire paging structure.
+ * Take a reference on all roots to prevent the root from being freed before it
+ * is zapped by this thread.  Freeing a root is not a correctness issue, but if
+ * a vCPU drops the last reference to a root prior to the root being zapped, it
+ * will get stuck with tearing down the entire paging structure.
  *
- * Roots which have a zero refcount should be skipped as
- * they're already being torn down.
- * Already invalid roots should be referenced again so that
- * they aren't freed before kvm_tdp_mmu_zap_all_fast is
- * done with them.
+ * Get a reference even if the root is already invalid,
+ * kvm_tdp_mmu_zap_invalidated_roots() assumes it was gifted a reference to all
+ * invalid roots, e.g. there's no epoch to identify roots that were invalidated
+ * by a previous call.  Roots stay on the list until the last reference is
+ * dropped, so even though all invalid roots are zapped, a root may not go away
+ * for quite some time, e.g. if a vCPU blocks across multiple memslot updates.
+ *
+ * Because mmu_lock is held for write, it should be impossible to observe a
+ * root with zero refcount, i.e. the list of roots cannot be stale.
  *
  * This has essentially the same effect for the TDP MMU
  * as updating mmu_valid_gen does for the shadow MMU.
@@ -901,9 +904,10 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
 	struct kvm_mmu_page *root;
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
-	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link)
-		if (refcount_inc_not_zero(&root->tdp_mmu_root_count))
+	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
+		if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
 			root->role.invalid = true;
+	}
 }
 
 /*
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 03/30] KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 01/30] KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 02/30] KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 23:39   ` Mingwei Zhang
  2022-03-03 19:38 ` [PATCH v4 04/30] KVM: x86/mmu: Document that zapping invalidated roots doesn't need to flush Paolo Bonzini
                   ` (27 subsequent siblings)
  30 siblings, 1 reply; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Explicitly ignore the result of zap_gfn_range() when putting the last
reference to a TDP MMU root, and add a pile of comments to formalize the
TDP MMU's behavior of deferring TLB flushes to alloc/reuse.  Note, this
only affects the !shared case, as zap_gfn_range() subtly never returns
true for "flush" as the flush is handled by tdp_mmu_zap_spte_atomic().

Putting the root without a flush is ok because even if there are stale
references to the root in the TLB, they are unreachable because KVM will
not run the guest with the same ASID without first flushing (where ASID
in this context refers to both SVM's explicit ASID and Intel's implicit
ASID that is constructed from VPID+PCID+EPT4A+etc...).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-5-seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c     |  8 ++++++++
 arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 32c041ed65cb..9a6df2d02777 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5083,6 +5083,14 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 	kvm_mmu_sync_roots(vcpu);
 
 	kvm_mmu_load_pgd(vcpu);
+
+	/*
+	 * Flush any TLB entries for the new root, the provenance of the root
+	 * is unknown.  Even if KVM ensures there are no stale TLB entries
+	 * for a freed root, in theory another hypervisor could have left
+	 * stale entries.  Flushing on alloc also allows KVM to skip the TLB
+	 * flush when freeing a root (see kvm_tdp_mmu_put_root()).
+	 */
 	static_call(kvm_x86_flush_tlb_current)(vcpu);
 out:
 	return r;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b97a4125feac..921fa386df99 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -93,7 +93,15 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 	list_del_rcu(&root->link);
 	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 
-	zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
+	/*
+	 * A TLB flush is not necessary as KVM performs a local TLB flush when
+	 * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
+	 * to a different pCPU.  Note, the local TLB flush on reuse also
+	 * invalidates any paging-structure-cache entries, i.e. TLB entries for
+	 * intermediate paging structures, that may be zapped, as such entries
+	 * are associated with the ASID on both VMX and SVM.
+	 */
+	(void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
 
 	call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 04/30] KVM: x86/mmu: Document that zapping invalidated roots doesn't need to flush
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (2 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 03/30] KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 05/30] KVM: x86/mmu: Require mmu_lock be held for write in unyielding root iter Paolo Bonzini
                   ` (26 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Remove the misleading flush "handling" when zapping invalidated TDP MMU
roots, and document that flushing is unnecessary for all flavors of MMUs
when zapping invalid/obsolete roots/pages.  The "handling" in the TDP MMU
is dead code, as zap_gfn_range() is called with shared=true, in which
case it will never return true due to the flushing being handled by
tdp_mmu_zap_spte_atomic().

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c     | 10 +++++++---
 arch/x86/kvm/mmu/tdp_mmu.c | 15 ++++++++++-----
 2 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9a6df2d02777..8408d7db8d2a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5674,9 +5674,13 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
 	}
 
 	/*
-	 * Trigger a remote TLB flush before freeing the page tables to ensure
-	 * KVM is not in the middle of a lockless shadow page table walk, which
-	 * may reference the pages.
+	 * Kick all vCPUs (via remote TLB flush) before freeing the page tables
+	 * to ensure KVM is not in the middle of a lockless shadow page table
+	 * walk, which may reference the pages.  The remote TLB flush itself is
+	 * not required and is simply a convenient way to kick vCPUs as needed.
+	 * KVM performs a local TLB flush when allocating a new root (see
+	 * kvm_mmu_load()), and the reload in the caller ensures no vCPUs are
+	 * running with an obsolete MMU.
 	 */
 	kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
 }
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 921fa386df99..2ce6915b70fe 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -851,7 +851,6 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
 {
 	struct kvm_mmu_page *next_root;
 	struct kvm_mmu_page *root;
-	bool flush = false;
 
 	lockdep_assert_held_read(&kvm->mmu_lock);
 
@@ -864,7 +863,16 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
 
 		rcu_read_unlock();
 
-		flush = zap_gfn_range(kvm, root, 0, -1ull, true, flush, true);
+		/*
+		 * A TLB flush is unnecessary, invalidated roots are guaranteed
+		 * to be unreachable by the guest (see kvm_tdp_mmu_put_root()
+		 * for more details), and unlike the legacy MMU, no vCPU kick
+		 * is needed to play nice with lockless shadow walks as the TDP
+		 * MMU protects its paging structures via RCU.  Note, zapping
+		 * will still flush on yield, but that's a minor performance
+		 * blip and not a functional issue.
+		 */
+		(void)zap_gfn_range(kvm, root, 0, -1ull, true, false, true);
 
 		/*
 		 * Put the reference acquired in
@@ -878,9 +886,6 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
 	}
 
 	rcu_read_unlock();
-
-	if (flush)
-		kvm_flush_remote_tlbs(kvm);
 }
 
 /*
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 05/30] KVM: x86/mmu: Require mmu_lock be held for write in unyielding root iter
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (3 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 04/30] KVM: x86/mmu: Document that zapping invalidated roots doesn't need to flush Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 06/30] KVM: x86/mmu: only perform eager page splitting on valid roots Paolo Bonzini
                   ` (25 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Assert that mmu_lock is held for write by users of the yield-unfriendly
TDP iterator.  The nature of a shared walk means that the caller needs to
play nice with other tasks modifying the page tables, which is more or
less the same thing as playing nice with yielding.  Theoretically, KVM
could gain a flow where it could legitimately take mmu_lock for read in
a non-preemptible context, but that's highly unlikely and any such case
should be viewed with a fair amount of scrutiny.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 2ce6915b70fe..30424fbceb5f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -29,13 +29,16 @@ bool kvm_mmu_init_tdp_mmu(struct kvm *kvm)
 	return true;
 }
 
-static __always_inline void kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
+/* Arbitrarily returns true so that this may be used in if statements. */
+static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
 							     bool shared)
 {
 	if (shared)
 		lockdep_assert_held_read(&kvm->mmu_lock);
 	else
 		lockdep_assert_held_write(&kvm->mmu_lock);
+
+	return true;
 }
 
 void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
@@ -172,11 +175,17 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 #define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared)		\
 	__for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, false)
 
-#define for_each_tdp_mmu_root(_kvm, _root, _as_id)				\
-	list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link,		\
-				lockdep_is_held_type(&kvm->mmu_lock, 0) ||	\
-				lockdep_is_held(&kvm->arch.tdp_mmu_pages_lock))	\
-		if (kvm_mmu_page_as_id(_root) != _as_id) {		\
+/*
+ * Iterate over all TDP MMU roots.  Requires that mmu_lock be held for write,
+ * the implication being that any flow that holds mmu_lock for read is
+ * inherently yield-friendly and should use the yield-safe variant above.
+ * Holding mmu_lock for write obviates the need for RCU protection as the list
+ * is guaranteed to be stable.
+ */
+#define for_each_tdp_mmu_root(_kvm, _root, _as_id)			\
+	list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link)	\
+		if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) &&	\
+		    kvm_mmu_page_as_id(_root) != _as_id) {		\
 		} else
 
 static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 06/30] KVM: x86/mmu: only perform eager page splitting on valid roots
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (4 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 05/30] KVM: x86/mmu: Require mmu_lock be held for write in unyielding root iter Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 20:03   ` Sean Christopherson
  2022-03-03 19:38 ` [PATCH v4 07/30] KVM: x86/mmu: do not allow readers to acquire references to invalid roots Paolo Bonzini
                   ` (24 subsequent siblings)
  30 siblings, 1 reply; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

Eager page splitting is an optimization; it does not have to be performed on
invalid roots.  It is also the only case in which a reader might acquire
a reference to an invalid root, so after this change we know that readers
will skip both dying and invalid roots.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 30424fbceb5f..d39593b9ac9e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1464,7 +1464,7 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
 
 	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, shared) {
+	for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, shared) {
 		r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level, shared);
 		if (r) {
 			kvm_tdp_mmu_put_root(kvm, root, shared);
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 07/30] KVM: x86/mmu: do not allow readers to acquire references to invalid roots
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (5 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 06/30] KVM: x86/mmu: only perform eager page splitting on valid roots Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 20:12   ` Sean Christopherson
  2022-03-03 19:38 ` [PATCH v4 08/30] KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP removal Paolo Bonzini
                   ` (23 subsequent siblings)
  30 siblings, 1 reply; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

Remove the "shared" argument of for_each_tdp_mmu_root_yield_safe, thus ensuring
that readers do not ever acquire a reference to an invalid root.  After this
patch, all readers except kvm_tdp_mmu_zap_invalidated_roots() treat
refcount=0/valid, refcount=0/invalid and refcount=1/invalid in exactly the
same way.  kvm_tdp_mmu_zap_invalidated_roots() is different but it also
does not acquire a reference to the invalid root, and it cannot see
refcount=0/invalid because it is guaranteed to run after
kvm_tdp_mmu_invalidate_all_roots().

Opportunistically add a lockdep assertion to the yield-safe iterator.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d39593b9ac9e..79bc48ddb69d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -166,14 +166,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 	for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, _only_valid);	\
 	     _root;								\
 	     _root = tdp_mmu_next_root(_kvm, _root, _shared, _only_valid))	\
-		if (kvm_mmu_page_as_id(_root) != _as_id) {			\
+		if (kvm_lockdep_assert_mmu_lock_held(_kvm, _shared) &&		\
+		    kvm_mmu_page_as_id(_root) != _as_id) {			\
 		} else
 
 #define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared)	\
 	__for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, true)
 
-#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared)		\
-	__for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, false)
+#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id)			\
+	__for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, false, false)
 
 /*
  * Iterate over all TDP MMU roots.  Requires that mmu_lock be held for write,
@@ -808,7 +809,7 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
 {
 	struct kvm_mmu_page *root;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false)
+	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
 		flush = zap_gfn_range(kvm, root, start, end, can_yield, flush,
 				      false);
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 08/30] KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP removal
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (6 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 07/30] KVM: x86/mmu: do not allow readers to acquire references to invalid roots Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 09/30] KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier change_spte Paolo Bonzini
                   ` (22 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Look for a !leaf=>leaf conversion instead of a PFN change when checking
if a SPTE change removed a TDP MMU shadow page.  Convert the PFN check
into a WARN, as KVM should never change the PFN of a shadow page (except
when it's being zapped or replaced).

From a purely theoretical perspective, it's not illegal to replace a SP
with a hugepage pointing at the same PFN.  In practice, it's impossible
as that would require mapping guest memory overtop a kernel-allocated SP.
Either way, the check is odd.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-8-seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 79bc48ddb69d..53c7987198b7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -491,9 +491,12 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 
 	/*
 	 * Recursively handle child PTs if the change removed a subtree from
-	 * the paging structure.
+	 * the paging structure.  Note the WARN on the PFN changing without the
+	 * SPTE being converted to a hugepage (leaf) or being zapped.  Shadow
+	 * pages are kernel allocations and should never be migrated.
 	 */
-	if (was_present && !was_leaf && (pfn_changed || !is_present))
+	if (was_present && !was_leaf &&
+	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
 }
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 09/30] KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier change_spte
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (7 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 08/30] KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP removal Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 10/30] KVM: x86/mmu: Drop RCU after processing each root in MMU notifier hooks Paolo Bonzini
                   ` (21 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Batch TLB flushes (with other MMUs) when handling ->change_spte()
notifications in the TDP MMU.  The MMU notifier path in question doesn't
allow yielding and correctly flushes before dropping mmu_lock.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-9-seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 53c7987198b7..9b1d64468d95 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1226,13 +1226,12 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
  */
 bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	bool flush = kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
-
-	/* FIXME: return 'flush' instead of flushing here. */
-	if (flush)
-		kvm_flush_remote_tlbs_with_address(kvm, range->start, 1);
-
-	return false;
+	/*
+	 * No need to handle the remote TLB flush under RCU protection, the
+	 * target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a
+	 * shadow page.  See the WARN on pfn_changed in __handle_changed_spte().
+	 */
+	return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
 }
 
 /*
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 10/30] KVM: x86/mmu: Drop RCU after processing each root in MMU notifier hooks
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (8 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 09/30] KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier change_spte Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 11/30] KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU Paolo Bonzini
                   ` (20 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Drop RCU protection after processing each root when handling MMU notifier
hooks that aren't the "unmap" path, i.e. aren't zapping.  Temporarily
drop RCU to let RCU do its thing between roots, and to make it clear that
there's no special behavior that relies on holding RCU across all roots.

Currently, the RCU protection is completely superficial; it's necessary
only to make rcu_dereference() of SPTE pointers happy.  A future patch
will rely on holding RCU as a proxy for vCPUs in the guest, e.g. to
ensure shadow pages aren't freed before all vCPUs do a TLB flush (or
rather, acknowledge the need for a flush), but in that case RCU needs to
be held until the flush is complete if and only if the flush is needed
because a shadow page may have been removed.  And except for the "unmap"
path, MMU notifier events cannot remove SPs (don't toggle PRESENT bit,
and can't change the PFN for a SP).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-10-seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9b1d64468d95..22b0c03b673b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1123,18 +1123,18 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
 	struct tdp_iter iter;
 	bool ret = false;
 
-	rcu_read_lock();
-
 	/*
 	 * Don't support rescheduling, none of the MMU notifiers that funnel
 	 * into this helper allow blocking; it'd be dead, wasteful code.
 	 */
 	for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
+		rcu_read_lock();
+
 		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
 			ret |= handler(kvm, &iter, range);
-	}
 
-	rcu_read_unlock();
+		rcu_read_unlock();
+	}
 
 	return ret;
 }
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 11/30] KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (9 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 10/30] KVM: x86/mmu: Drop RCU after processing each root in MMU notifier hooks Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 12/30] KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path Paolo Bonzini
                   ` (19 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Add helpers to read and write TDP MMU SPTEs instead of open coding
rcu_dereference() all over the place, and to provide a convenient
location to document why KVM doesn't exempt holding mmu_lock for write
from having to hold RCU (and any future changes to the rules).

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-11-seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/tdp_iter.c |  6 +++---
 arch/x86/kvm/mmu/tdp_iter.h | 16 ++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c  |  6 +++---
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index be3f096db2eb..6d3b3e5a5533 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -12,7 +12,7 @@ static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
 {
 	iter->sptep = iter->pt_path[iter->level - 1] +
 		SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
-	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
+	iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
 }
 
 static gfn_t round_gfn_for_level(gfn_t gfn, int level)
@@ -89,7 +89,7 @@ static bool try_step_down(struct tdp_iter *iter)
 	 * Reread the SPTE before stepping down to avoid traversing into page
 	 * tables that are no longer linked from this entry.
 	 */
-	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
+	iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
 
 	child_pt = spte_to_child_pt(iter->old_spte, iter->level);
 	if (!child_pt)
@@ -123,7 +123,7 @@ static bool try_step_side(struct tdp_iter *iter)
 	iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
 	iter->next_last_level_gfn = iter->gfn;
 	iter->sptep++;
-	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
+	iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
 
 	return true;
 }
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 216ebbe76ddd..bb9b581f1ee4 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -9,6 +9,22 @@
 
 typedef u64 __rcu *tdp_ptep_t;
 
+/*
+ * TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
+ * to be zapped while holding mmu_lock for read.  Holding RCU isn't required for
+ * correctness if mmu_lock is held for write, but plumbing "struct kvm" down to
+ * the lower depths of the TDP MMU just to make lockdep happy is a nightmare, so
+ * all accesses to SPTEs are done under RCU protection.
+ */
+static inline u64 kvm_tdp_mmu_read_spte(tdp_ptep_t sptep)
+{
+	return READ_ONCE(*rcu_dereference(sptep));
+}
+static inline void kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 val)
+{
+	WRITE_ONCE(*rcu_dereference(sptep), val);
+}
+
 /*
  * A TDP iterator performs a pre-order walk over a TDP paging structure.
  */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 22b0c03b673b..371b6a108736 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -595,7 +595,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 	 * here since the SPTE is going from non-present
 	 * to non-present.
 	 */
-	WRITE_ONCE(*rcu_dereference(iter->sptep), 0);
+	kvm_tdp_mmu_write_spte(iter->sptep, 0);
 
 	return 0;
 }
@@ -634,7 +634,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 	 */
 	WARN_ON(is_removed_spte(iter->old_spte));
 
-	WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
+	kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
 
 	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
 			      new_spte, iter->level, false);
@@ -1069,7 +1069,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			 * because the new value informs the !present
 			 * path below.
 			 */
-			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+			iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
 		}
 
 		if (!is_shadow_present_pte(iter.old_spte)) {
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 12/30] KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (10 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 11/30] KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 13/30] KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw values Paolo Bonzini
                   ` (18 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

WARN if the new_spte being set by __tdp_mmu_set_spte() is a REMOVED_SPTE,
which is called out by the comment as being disallowed but not actually
checked.  Keep the WARN on the old_spte as well, because overwriting a
REMOVED_SPTE in the non-atomic path is also disallowed (as evidenced by
the lack of splats with the existing WARN).

Fixes: 08f07c800e9d ("KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler")
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-12-seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 371b6a108736..41175ee7e111 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -626,13 +626,13 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
 	/*
-	 * No thread should be using this function to set SPTEs to the
+	 * No thread should be using this function to set SPTEs to or from the
 	 * temporary removed SPTE value.
 	 * If operating under the MMU lock in read mode, tdp_mmu_set_spte_atomic
 	 * should be used. If operating under the MMU lock in write mode, the
 	 * use of the removed SPTE should not be necessary.
 	 */
-	WARN_ON(is_removed_spte(iter->old_spte));
+	WARN_ON(is_removed_spte(iter->old_spte) || is_removed_spte(new_spte));
 
 	kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 13/30] KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw values
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (11 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 12/30] KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 14/30] KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery Paolo Bonzini
                   ` (17 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Refactor __tdp_mmu_set_spte() to work with raw values instead of a
tdp_iter object so that a future patch can modify SPTEs without doing a
walk, and without having to synthesize a tdp_iter.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-13-seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 51 +++++++++++++++++++++++---------------
 1 file changed, 31 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 41175ee7e111..0ffa62abde2d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -603,9 +603,13 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 
 /*
  * __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
- * @kvm: kvm instance
- * @iter: a tdp_iter instance currently on the SPTE that should be set
- * @new_spte: The value the SPTE should be set to
+ * @kvm:	      KVM instance
+ * @as_id:	      Address space ID, i.e. regular vs. SMM
+ * @sptep:	      Pointer to the SPTE
+ * @old_spte:	      The current value of the SPTE
+ * @new_spte:	      The new value that will be set for the SPTE
+ * @gfn:	      The base GFN that was (or will be) mapped by the SPTE
+ * @level:	      The level _containing_ the SPTE (its parent PT's level)
  * @record_acc_track: Notify the MM subsystem of changes to the accessed state
  *		      of the page. Should be set unless handling an MMU
  *		      notifier for access tracking. Leaving record_acc_track
@@ -617,12 +621,10 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
  *		      Leaving record_dirty_log unset in that case prevents page
  *		      writes from being double counted.
  */
-static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
-				      u64 new_spte, bool record_acc_track,
-				      bool record_dirty_log)
+static void __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
+			       u64 old_spte, u64 new_spte, gfn_t gfn, int level,
+			       bool record_acc_track, bool record_dirty_log)
 {
-	WARN_ON_ONCE(iter->yielded);
-
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
 	/*
@@ -632,39 +634,48 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 	 * should be used. If operating under the MMU lock in write mode, the
 	 * use of the removed SPTE should not be necessary.
 	 */
-	WARN_ON(is_removed_spte(iter->old_spte) || is_removed_spte(new_spte));
+	WARN_ON(is_removed_spte(old_spte) || is_removed_spte(new_spte));
 
-	kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
+	kvm_tdp_mmu_write_spte(sptep, new_spte);
+
+	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
 
-	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
-			      new_spte, iter->level, false);
 	if (record_acc_track)
-		handle_changed_spte_acc_track(iter->old_spte, new_spte,
-					      iter->level);
+		handle_changed_spte_acc_track(old_spte, new_spte, level);
 	if (record_dirty_log)
-		handle_changed_spte_dirty_log(kvm, iter->as_id, iter->gfn,
-					      iter->old_spte, new_spte,
-					      iter->level);
+		handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
+					      new_spte, level);
+}
+
+static inline void _tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
+				     u64 new_spte, bool record_acc_track,
+				     bool record_dirty_log)
+{
+	WARN_ON_ONCE(iter->yielded);
+
+	__tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep, iter->old_spte,
+			   new_spte, iter->gfn, iter->level,
+			   record_acc_track, record_dirty_log);
 }
 
 static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 				    u64 new_spte)
 {
-	__tdp_mmu_set_spte(kvm, iter, new_spte, true, true);
+	_tdp_mmu_set_spte(kvm, iter, new_spte, true, true);
 }
 
 static inline void tdp_mmu_set_spte_no_acc_track(struct kvm *kvm,
 						 struct tdp_iter *iter,
 						 u64 new_spte)
 {
-	__tdp_mmu_set_spte(kvm, iter, new_spte, false, true);
+	_tdp_mmu_set_spte(kvm, iter, new_spte, false, true);
 }
 
 static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 						 struct tdp_iter *iter,
 						 u64 new_spte)
 {
-	__tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
+	_tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
 }
 
 #define tdp_root_for_each_pte(_iter, _root, _start, _end) \
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 14/30] KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (12 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 13/30] KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw values Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 15/30] KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU Paolo Bonzini
                   ` (16 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

When recovering a potential hugepage that was shattered for the iTLB
multihit workaround, precisely zap only the target page instead of
iterating over the TDP MMU to find the SP that was passed in.  This will
allow future simplification of zap_gfn_range() by having it zap only
leaf SPTEs.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu_internal.h |  7 ++++++-
 arch/x86/kvm/mmu/tdp_iter.h     |  2 --
 arch/x86/kvm/mmu/tdp_mmu.c      | 36 +++++++++++++++++++++++++++++----
 arch/x86/kvm/mmu/tdp_mmu.h      | 18 +----------------
 4 files changed, 39 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index da6166b5c377..be063b6c91b7 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -30,6 +30,8 @@ extern bool dbg;
 #define INVALID_PAE_ROOT	0
 #define IS_VALID_PAE_ROOT(x)	(!!(x))
 
+typedef u64 __rcu *tdp_ptep_t;
+
 struct kvm_mmu_page {
 	/*
 	 * Note, "link" through "spt" fit in a single 64 byte cache line on
@@ -59,7 +61,10 @@ struct kvm_mmu_page {
 		refcount_t tdp_mmu_root_count;
 	};
 	unsigned int unsync_children;
-	struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
+	union {
+		struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
+		tdp_ptep_t ptep;
+	};
 	DECLARE_BITMAP(unsync_child_bitmap, 512);
 
 	struct list_head lpage_disallowed_link;
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index bb9b581f1ee4..e2a7e267a77d 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -7,8 +7,6 @@
 
 #include "mmu.h"
 
-typedef u64 __rcu *tdp_ptep_t;
-
 /*
  * TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
  * to be zapped while holding mmu_lock for read.  Holding RCU isn't required for
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 0ffa62abde2d..dc9db5057f3b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -199,13 +199,14 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
 	return sp;
 }
 
-static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, gfn_t gfn,
-			      union kvm_mmu_page_role role)
+static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
+			    gfn_t gfn, union kvm_mmu_page_role role)
 {
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
 	sp->role = role;
 	sp->gfn = gfn;
+	sp->ptep = sptep;
 	sp->tdp_mmu_page = true;
 
 	trace_kvm_mmu_get_page(sp, true);
@@ -222,7 +223,7 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
 	role = parent_sp->role;
 	role.level--;
 
-	tdp_mmu_init_sp(child_sp, iter->gfn, role);
+	tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
 }
 
 hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
@@ -244,7 +245,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 	}
 
 	root = tdp_mmu_alloc_sp(vcpu);
-	tdp_mmu_init_sp(root, 0, role);
+	tdp_mmu_init_sp(root, NULL, 0, role);
 
 	refcount_set(&root->tdp_mmu_root_count, 1);
 
@@ -736,6 +737,33 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
 	return iter->yielded;
 }
 
+bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	u64 old_spte;
+
+	/*
+	 * This helper intentionally doesn't allow zapping a root shadow page,
+	 * which doesn't have a parent page table and thus no associated entry.
+	 */
+	if (WARN_ON_ONCE(!sp->ptep))
+		return false;
+
+	rcu_read_lock();
+
+	old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
+	if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) {
+		rcu_read_unlock();
+		return false;
+	}
+
+	__tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
+			   sp->gfn, sp->role.level + 1, true, true);
+
+	rcu_read_unlock();
+
+	return true;
+}
+
 /*
  * Tears down the mappings for the range of gfns, [start, end), and frees the
  * non-root pages mapping GFNs strictly within that range. Returns true if
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 57c73d8f76ce..5e5ef2576c81 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -22,24 +22,8 @@ static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
 {
 	return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
 }
-static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
-	gfn_t end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level + 1);
-
-	/*
-	 * Don't allow yielding, as the caller may have a flush pending.  Note,
-	 * if mmu_lock is held for write, zapping will never yield in this case,
-	 * but explicitly disallow it for safety.  The TDP MMU does not yield
-	 * until it has made forward progress (steps sideways), and when zapping
-	 * a single shadow page that it's guaranteed to see (thus the mmu_lock
-	 * requirement), its "step sideways" will always step beyond the bounds
-	 * of the shadow page's gfn range and stop iterating before yielding.
-	 */
-	lockdep_assert_held_write(&kvm->mmu_lock);
-	return __kvm_tdp_mmu_zap_gfn_range(kvm, kvm_mmu_page_as_id(sp),
-					   sp->gfn, end, false, false);
-}
 
+bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
 void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 15/30] KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (13 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 14/30] KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 16/30] KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page Paolo Bonzini
                   ` (15 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Don't flush the TLBs when zapping all TDP MMU pages, as the only time KVM
uses the slow version of "zap everything" is when the VM is being
destroyed or the owning mm has exited.  In either case, KVM_RUN is
unreachable for the VM, i.e. the guest TLB entries cannot be consumed.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-15-seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index dc9db5057f3b..f59f3ff5cb75 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -860,14 +860,15 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
 
 void kvm_tdp_mmu_zap_all(struct kvm *kvm)
 {
-	bool flush = false;
 	int i;
 
+	/*
+	 * A TLB flush is unnecessary, KVM zaps everything if and only if the VM
+	 * is being destroyed or the userspace VMM has exited.  In both cases,
+	 * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
+	 */
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
-		flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush);
-
-	if (flush)
-		kvm_flush_remote_tlbs(kvm);
+		(void)kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, false);
 }
 
 static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 16/30] KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (14 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 15/30] KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-04  0:07   ` Mingwei Zhang
  2022-03-03 19:38 ` [PATCH v4 17/30] KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range Paolo Bonzini
                   ` (14 subsequent siblings)
  30 siblings, 1 reply; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Add a dedicated helper for zapping a TDP MMU root, and use it in the three
flows that do "zap_all" and intentionally do not do a TLB flush if SPTEs
are zapped (zapping an entire root is safe if and only if it cannot be in
use by any vCPU).  Because a TLB flush is never required, unconditionally
pass "false" to tdp_mmu_iter_cond_resched() when potentially yielding.

Opportunistically document why KVM must not yield when zapping roots that
are being zapped by kvm_tdp_mmu_put_root(), i.e. roots whose refcount has
reached zero, and further harden the flow to detect improper KVM behavior
with respect to roots that are supposed to be unreachable.

In addition to hardening zapping of roots, isolating zapping of roots
will allow future simplification of zap_gfn_range() by having it zap only
leaf SPTEs, and by removing its tricky "zap all" heuristic.  By having
all paths that truly need to free _all_ SPs flow through the dedicated
root zapper, the generic zapper can be freed of those concerns.
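
For reference, the walk bound used below, 1ULL << (shadow_phys_bits -
PAGE_SHIFT), is simply the first GFN the host cannot address.  A small
standalone sketch of the arithmetic (PAGE_SHIFT and the MAXPHYADDR values
are assumptions chosen for illustration):

/* Illustrates the tdp_mmu_max_gfn_host() bound for two MAXPHYADDR values. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12

static uint64_t max_gfn_host(unsigned int shadow_phys_bits)
{
	return 1ULL << (shadow_phys_bits - PAGE_SHIFT);
}

int main(void)
{
	/* MAXPHYADDR=46 -> 2^34 GFNs, i.e. 64 TiB of addressable guest memory. */
	printf("MAXPHYADDR=46: %llu GFNs (%llu GiB)\n",
	       (unsigned long long)max_gfn_host(46),
	       (unsigned long long)(max_gfn_host(46) >> (30 - PAGE_SHIFT)));
	printf("MAXPHYADDR=52: %llu GFNs\n",
	       (unsigned long long)max_gfn_host(52));
	return 0;
}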

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 98 +++++++++++++++++++++++++++++++-------
 1 file changed, 82 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f59f3ff5cb75..970376297b30 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -56,10 +56,6 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 	rcu_barrier();
 }
 
-static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
-			  gfn_t start, gfn_t end, bool can_yield, bool flush,
-			  bool shared);
-
 static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
 {
 	free_page((unsigned long)sp->spt);
@@ -82,6 +78,9 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
 	tdp_mmu_free_sp(sp);
 }
 
+static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
+			     bool shared);
+
 void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 			  bool shared)
 {
@@ -104,7 +103,7 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 	 * intermediate paging structures, that may be zapped, as such entries
 	 * are associated with the ASID on both VMX and SVM.
 	 */
-	(void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
+	tdp_mmu_zap_root(kvm, root, shared);
 
 	call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }
@@ -737,6 +736,76 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
 	return iter->yielded;
 }
 
+static inline gfn_t tdp_mmu_max_gfn_host(void)
+{
+	/*
+	 * Bound TDP MMU walks at host.MAXPHYADDR, guest accesses beyond that
+	 * will hit a #PF(RSVD) and never hit an EPT Violation/Misconfig / #NPF,
+	 * and so KVM will never install a SPTE for such addresses.
+	 */
+	return 1ULL << (shadow_phys_bits - PAGE_SHIFT);
+}
+
+static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
+			     bool shared)
+{
+	bool root_is_unreachable = !refcount_read(&root->tdp_mmu_root_count);
+	struct tdp_iter iter;
+
+	gfn_t end = tdp_mmu_max_gfn_host();
+	gfn_t start = 0;
+
+	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+
+	rcu_read_lock();
+
+	/*
+	 * No need to try to step down in the iterator when zapping an entire
+	 * root, zapping an upper-level SPTE will recurse on its children.
+	 */
+	for_each_tdp_pte_min_level(iter, root, root->role.level, start, end) {
+retry:
+		/*
+		 * Yielding isn't allowed when zapping an unreachable root as
+		 * the root won't be processed by mmu_notifier callbacks.  When
+		 * handling an unmap/release mmu_notifier command, KVM must
+		 * drop all references to relevant pages prior to completing
+		 * the callback.  Dropping mmu_lock can result in zapping SPTEs
+		 * for an unreachable root after a relevant callback completes,
+		 * which leads to use-after-free as zapping a SPTE triggers
+		 * "writeback" of dirty/accessed bits to the SPTE's associated
+		 * struct page.
+		 */
+		if (!root_is_unreachable &&
+		    tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
+			continue;
+
+		if (!is_shadow_present_pte(iter.old_spte))
+			continue;
+
+		if (!shared) {
+			tdp_mmu_set_spte(kvm, &iter, 0);
+		} else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0)) {
+			/*
+			 * cmpxchg() shouldn't fail if the root is unreachable.
+			 * Retry so as not to leak the page and its children.
+			 */
+			WARN_ONCE(root_is_unreachable,
+				  "Contended TDP MMU SPTE in unreachable root.");
+			goto retry;
+		}
+
+		/*
+		 * WARN if the root is invalid and is unreachable, all SPTEs
+		 * should've been zapped by kvm_tdp_mmu_zap_invalidated_roots(),
+		 * and inserting new SPTEs under an invalid root is a KVM bug.
+		 */
+		WARN_ON_ONCE(root_is_unreachable && root->role.invalid);
+	}
+
+	rcu_read_unlock();
+}
+
 bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	u64 old_spte;
@@ -785,8 +854,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			  gfn_t start, gfn_t end, bool can_yield, bool flush,
 			  bool shared)
 {
-	gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
-	bool zap_all = (start == 0 && end >= max_gfn_host);
+	bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
 	struct tdp_iter iter;
 
 	/*
@@ -795,12 +863,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	 */
 	int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
 
-	/*
-	 * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
-	 * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
-	 * and so KVM will never install a SPTE for such addresses.
-	 */
-	end = min(end, max_gfn_host);
+	end = min(end, tdp_mmu_max_gfn_host());
 
 	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
 
@@ -860,6 +923,7 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
 
 void kvm_tdp_mmu_zap_all(struct kvm *kvm)
 {
+	struct kvm_mmu_page *root;
 	int i;
 
 	/*
@@ -867,8 +931,10 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
 	 * is being destroyed or the userspace VMM has exited.  In both cases,
 	 * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
 	 */
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
-		(void)kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, false);
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		for_each_tdp_mmu_root_yield_safe(kvm, root, i)
+			tdp_mmu_zap_root(kvm, root, false);
+	}
 }
 
 static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
@@ -925,7 +991,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
 		 * will still flush on yield, but that's a minor performance
 		 * blip and not a functional issue.
 		 */
-		(void)zap_gfn_range(kvm, root, 0, -1ull, true, false, true);
+		tdp_mmu_zap_root(kvm, root, true);
 
 		/*
 		 * Put the reference acquired in
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 17/30] KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (15 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 16/30] KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-04  0:14   ` Mingwei Zhang
  2022-03-03 19:38 ` [PATCH v4 18/30] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range() Paolo Bonzini
                   ` (13 subsequent siblings)
  30 siblings, 1 reply; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Now that all callers of zap_gfn_range() hold mmu_lock for write, drop
support for zapping with mmu_lock held for read.  That all callers hold
mmu_lock for write isn't a random coincidence; now that the paths that
need to zap _everything_ have their own path, the only callers left are
those that need to zap for functional correctness.  And when zapping is
required for functional correctness, mmu_lock must be held for write,
otherwise the caller has no guarantees about the state of the TDP MMU
page tables after it has run, e.g. the SPTE(s) it zapped can be
immediately replaced by a vCPU faulting in a page.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-17-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 24 ++++++------------------
 1 file changed, 6 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 970376297b30..f3939ce4a115 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -844,15 +844,9 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
  * function cannot yield, it will not release the MMU lock or reschedule and
  * the caller must ensure it does not supply too large a GFN range, or the
  * operation can cause a soft lockup.
- *
- * If shared is true, this thread holds the MMU lock in read mode and must
- * account for the possibility that other threads are modifying the paging
- * structures concurrently. If shared is false, this thread should hold the
- * MMU lock in write mode.
  */
 static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
-			  gfn_t start, gfn_t end, bool can_yield, bool flush,
-			  bool shared)
+			  gfn_t start, gfn_t end, bool can_yield, bool flush)
 {
 	bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
 	struct tdp_iter iter;
@@ -865,14 +859,13 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	end = min(end, tdp_mmu_max_gfn_host());
 
-	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+	lockdep_assert_held_write(&kvm->mmu_lock);
 
 	rcu_read_lock();
 
 	for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
-retry:
 		if (can_yield &&
-		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
+		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
 			flush = false;
 			continue;
 		}
@@ -891,12 +884,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
 
-		if (!shared) {
-			tdp_mmu_set_spte(kvm, &iter, 0);
-			flush = true;
-		} else if (tdp_mmu_zap_spte_atomic(kvm, &iter)) {
-			goto retry;
-		}
+		tdp_mmu_set_spte(kvm, &iter, 0);
+		flush = true;
 	}
 
 	rcu_read_unlock();
@@ -915,8 +904,7 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
 	struct kvm_mmu_page *root;
 
 	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
-		flush = zap_gfn_range(kvm, root, start, end, can_yield, flush,
-				      false);
+		flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
 
 	return flush;
 }
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 18/30] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (16 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 17/30] KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-04  1:16   ` Mingwei Zhang
                     ` (2 more replies)
  2022-03-03 19:38 ` [PATCH v4 19/30] KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched Paolo Bonzini
                   ` (12 subsequent siblings)
  30 siblings, 3 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Zap only leaf SPTEs in the TDP MMU's zap_gfn_range(), and rename various
functions accordingly.  When removing mappings for functional correctness
(except for the stupid VFIO GPU passthrough memslots bug), zapping the
leaf SPTEs is sufficient as the paging structures themselves do not point
at guest memory and do not directly impact the final translation (in the
TDP MMU).

Note, this aligns the TDP MMU with the legacy/full MMU, which zaps only
the rmaps, a.k.a. leaf SPTEs, in kvm_zap_gfn_range() and
kvm_unmap_gfn_range().

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c     |  4 ++--
 arch/x86/kvm/mmu/tdp_mmu.c | 41 ++++++++++----------------------------
 arch/x86/kvm/mmu/tdp_mmu.h |  8 +-------
 3 files changed, 14 insertions(+), 39 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8408d7db8d2a..febdcaaa7b94 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5834,8 +5834,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 
 	if (is_tdp_mmu_enabled(kvm)) {
 		for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
-			flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start,
-							  gfn_end, flush);
+			flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start,
+						      gfn_end, true, flush);
 	}
 
 	if (flush)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f3939ce4a115..c71debdbc732 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -834,10 +834,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 }
 
 /*
- * Tears down the mappings for the range of gfns, [start, end), and frees the
- * non-root pages mapping GFNs strictly within that range. Returns true if
- * SPTEs have been cleared and a TLB flush is needed before releasing the
- * MMU lock.
+ * Zap leaf SPTEs for the range of gfns, [start, end). Returns true if SPTEs
+ * have been cleared and a TLB flush is needed before releasing the MMU lock.
  *
  * If can_yield is true, will release the MMU lock and reschedule if the
  * scheduler needs the CPU or there is contention on the MMU lock. If this
@@ -845,42 +843,25 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
  * the caller must ensure it does not supply too large a GFN range, or the
  * operation can cause a soft lockup.
  */
-static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
-			  gfn_t start, gfn_t end, bool can_yield, bool flush)
+static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
+			      gfn_t start, gfn_t end, bool can_yield, bool flush)
 {
-	bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
 	struct tdp_iter iter;
 
-	/*
-	 * No need to try to step down in the iterator when zapping all SPTEs,
-	 * zapping the top-level non-leaf SPTEs will recurse on their children.
-	 */
-	int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
-
 	end = min(end, tdp_mmu_max_gfn_host());
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
 	rcu_read_lock();
 
-	for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
+	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
 		if (can_yield &&
 		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
 			flush = false;
 			continue;
 		}
 
-		if (!is_shadow_present_pte(iter.old_spte))
-			continue;
-
-		/*
-		 * If this is a non-last-level SPTE that covers a larger range
-		 * than should be zapped, continue, and zap the mappings at a
-		 * lower level, except when zapping all SPTEs.
-		 */
-		if (!zap_all &&
-		    (iter.gfn < start ||
-		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
+		if (!is_shadow_present_pte(iter.old_spte) ||
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
 
@@ -898,13 +879,13 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
  * SPTEs have been cleared and a TLB flush is needed before releasing the
  * MMU lock.
  */
-bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
-				 gfn_t end, bool can_yield, bool flush)
+bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
+			   bool can_yield, bool flush)
 {
 	struct kvm_mmu_page *root;
 
 	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
-		flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
+		flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
 
 	return flush;
 }
@@ -1202,8 +1183,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
 				 bool flush)
 {
-	return __kvm_tdp_mmu_zap_gfn_range(kvm, range->slot->as_id, range->start,
-					   range->end, range->may_block, flush);
+	return kvm_tdp_mmu_zap_leafs(kvm, range->slot->as_id, range->start,
+				     range->end, range->may_block, flush);
 }
 
 typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 5e5ef2576c81..54bc8118c40a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -15,14 +15,8 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
 void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 			  bool shared);
 
-bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
+bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start,
 				 gfn_t end, bool can_yield, bool flush);
-static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
-					     gfn_t start, gfn_t end, bool flush)
-{
-	return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
-}
-
 bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 19/30] KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (17 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 18/30] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range() Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-04  1:19   ` Mingwei Zhang
  2022-03-03 19:38 ` [PATCH v4 20/30] KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages Paolo Bonzini
                   ` (11 subsequent siblings)
  30 siblings, 1 reply; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

When yielding in the TDP MMU iterator, service any pending TLB flush
before dropping RCU protections in anticipation of using the caller's RCU
"lock" as a proxy for vCPUs in the guest.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-19-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c71debdbc732..3a866fcb5ea9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -716,11 +716,11 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
 		return false;
 
 	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
-		rcu_read_unlock();
-
 		if (flush)
 			kvm_flush_remote_tlbs(kvm);
 
+		rcu_read_unlock();
+
 		if (shared)
 			cond_resched_rwlock_read(&kvm->mmu_lock);
 		else
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 20/30] KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (18 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 19/30] KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker Paolo Bonzini
                   ` (10 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Defer TLB flushes to the caller when freeing TDP MMU shadow pages instead
of immediately flushing.  Because the shadow pages are freed in an RCU
callback, so long as at least one CPU holds RCU, all CPUs are protected.
For vCPUs running in the guest, i.e. consuming TLB entries, KVM only
needs to ensure the caller services the pending TLB flush before dropping
its RCU protections.  I.e. use the caller's RCU as a proxy for all vCPUs
running in the guest.

Deferring the flushes allows batching flushes, e.g. when installing a
1gb hugepage and zapping a pile of SPs.  And when zapping an entire root,
deferring flushes allows skipping the flush entirely (because flushes are
not needed in that case).

Avoiding flushes when zapping an entire root is especially important as
synchronizing with other CPUs via IPI after zapping every shadow page can
cause significant performance issues for large VMs.  The issue is
exacerbated by KVM zapping entire top-level entries without dropping
RCU protection, which can lead to RCU stalls even when zapping roots
backing relatively "small" amounts of guest memory, e.g. 2tb.  Removing
the IPI bottleneck largely mitigates the RCU issues, though it's likely
still a problem for 5-level paging.  A future patch will further address
the problem by zapping roots in multiple passes to avoid holding RCU for
an extended duration.
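
The resulting contract -- zap under rcu_read_lock(), service any pending
flush before rcu_read_unlock() -- can be sketched as a toy userspace
model (the stubs below stand in for the real RCU and TLB-flush primitives
and are not KVM's API):

/* Toy model of "flush the guest TLBs before dropping RCU protection". */
#include <stdbool.h>
#include <stdio.h>

static void rcu_read_lock(void)     { puts("rcu_read_lock"); }
static void rcu_read_unlock(void)   { puts("rcu_read_unlock"); }
static void flush_remote_tlbs(void) { puts("flush remote TLBs"); }

/* Pretend zap: returns true if any SPTE was cleared, i.e. a flush is owed. */
static bool zap_some_sptes(void)    { return true; }

static void zap_flow(void)
{
	bool flush;

	rcu_read_lock();
	flush = zap_some_sptes();

	/*
	 * Zapped page tables are freed only after an RCU grace period, so
	 * flushing before rcu_read_unlock() guarantees no vCPU still holds a
	 * translation through them once they are actually freed.
	 */
	if (flush)
		flush_remote_tlbs();

	rcu_read_unlock();
}

int main(void) { zap_flow(); return 0; }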

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c      | 13 +++++++++++++
 arch/x86/kvm/mmu/tdp_iter.h |  7 +++----
 arch/x86/kvm/mmu/tdp_mmu.c  | 20 ++++++++++----------
 3 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index febdcaaa7b94..0b88592495f8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6349,6 +6349,13 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
 	rcu_idx = srcu_read_lock(&kvm->srcu);
 	write_lock(&kvm->mmu_lock);
 
+	/*
+	 * Zapping TDP MMU shadow pages, including the remote TLB flush, must
+	 * be done under RCU protection, because the pages are freed via RCU
+	 * callback.
+	 */
+	rcu_read_lock();
+
 	ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
 	to_zap = ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0;
 	for ( ; to_zap; --to_zap) {
@@ -6373,12 +6380,18 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
 
 		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
 			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
+			rcu_read_unlock();
+
 			cond_resched_rwlock_write(&kvm->mmu_lock);
 			flush = false;
+
+			rcu_read_lock();
 		}
 	}
 	kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
 
+	rcu_read_unlock();
+
 	write_unlock(&kvm->mmu_lock);
 	srcu_read_unlock(&kvm->srcu, rcu_idx);
 }
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index e2a7e267a77d..b1eaf6ec0e0b 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -9,10 +9,9 @@
 
 /*
  * TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
- * to be zapped while holding mmu_lock for read.  Holding RCU isn't required for
- * correctness if mmu_lock is held for write, but plumbing "struct kvm" down to
- * the lower depths of the TDP MMU just to make lockdep happy is a nightmare, so
- * all accesses to SPTEs are done under RCU protection.
+ * to be zapped while holding mmu_lock for read, and to allow TLB flushes to be
+ * batched without having to collect the list of zapped SPs.  Flows that can
+ * remove SPs must service pending TLB flushes prior to dropping RCU protection.
  */
 static inline u64 kvm_tdp_mmu_read_spte(tdp_ptep_t sptep)
 {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3a866fcb5ea9..5038de0c872d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -391,9 +391,6 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 				    shared);
 	}
 
-	kvm_flush_remote_tlbs_with_address(kvm, base_gfn,
-					   KVM_PAGES_PER_HPAGE(level + 1));
-
 	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }
 
@@ -817,19 +814,13 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 	if (WARN_ON_ONCE(!sp->ptep))
 		return false;
 
-	rcu_read_lock();
-
 	old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
-	if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) {
-		rcu_read_unlock();
+	if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
 		return false;
-	}
 
 	__tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
 			   sp->gfn, sp->role.level + 1, true, true);
 
-	rcu_read_unlock();
-
 	return true;
 }
 
@@ -870,6 +861,11 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
 	}
 
 	rcu_read_unlock();
+
+	/*
+	 * Because this flow zaps _only_ leaf SPTEs, the caller doesn't need
+	 * to provide RCU protection as no 'struct kvm_mmu_page' will be freed.
+	 */
 	return flush;
 }
 
@@ -1036,6 +1032,10 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 		ret = RET_PF_SPURIOUS;
 	else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
 		return RET_PF_RETRY;
+	else if (is_shadow_present_pte(iter->old_spte) &&
+		 !is_last_spte(iter->old_spte, iter->level))
+		kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
+						   KVM_PAGES_PER_HPAGE(iter->level + 1));
 
 	/*
 	 * If the page fault was caused by a write but the page is write
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (19 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 20/30] KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 20:54   ` Sean Christopherson
  2022-03-03 21:20   ` Sean Christopherson
  2022-03-03 19:38 ` [PATCH v4 22/30] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root Paolo Bonzini
                   ` (9 subsequent siblings)
  30 siblings, 2 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

Use a dedicated per-VM workqueue to zap the roots invalidated by the TDP
MMU's "fast zap" mechanism, implemented by kvm_tdp_mmu_invalidate_all_roots().
Currently this is done by kvm_tdp_mmu_zap_invalidated_roots(), but
there is no need to duplicate the code between the "normal"
kvm_tdp_mmu_put_root() path and the invalidation case.  The
only issue is that kvm_tdp_mmu_invalidate_all_roots() now
assumes that there is at least one reference in kvm->users_count;
so if the VM is dying just go through the slow path, as there is
nothing to gain by using the fast zapping.
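
A single-threaded toy model of the scheme (a plain array stands in for
the real workqueue, and the names are illustrative only): invalidation
queues one zap per root, and flushing the queue guarantees every queued
zap has finished before the "fast zap" returns.

/* Toy model: queue a zap per invalidated root, flush to guarantee completion. */
#include <stdio.h>

#define MAX_WORK 16

struct root { int id; int zapped; };

static struct root *queue[MAX_WORK];
static int nr_queued;

/* Models tdp_mmu_schedule_zap_root() called from invalidate_all_roots(). */
static void schedule_zap_root(struct root *r)
{
	queue[nr_queued++] = r;
}

/* Models kvm_tdp_mmu_zap_invalidated_roots(): all queued zaps done on return. */
static void flush_zap_queue(void)
{
	for (int i = 0; i < nr_queued; i++)
		queue[i]->zapped = 1;
	nr_queued = 0;
}

int main(void)
{
	struct root a = { .id = 1 }, b = { .id = 2 };

	schedule_zap_root(&a);
	schedule_zap_root(&b);
	flush_zap_queue();
	printf("root %d zapped=%d, root %d zapped=%d\n",
	       a.id, a.zapped, b.id, b.zapped);
	return 0;
}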

Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |   2 +
 arch/x86/kvm/mmu/mmu.c          |   6 +-
 arch/x86/kvm/mmu/mmu_internal.h |   8 +-
 arch/x86/kvm/mmu/tdp_mmu.c      | 158 +++++++++++++++-----------------
 4 files changed, 86 insertions(+), 88 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c45ab8b5c37f..fd05ad52b65c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -15,6 +15,7 @@
 #include <linux/cpumask.h>
 #include <linux/irq_work.h>
 #include <linux/irq.h>
+#include <linux/workqueue.h>
 
 #include <linux/kvm.h>
 #include <linux/kvm_para.h>
@@ -1218,6 +1219,7 @@ struct kvm_arch {
 	 * the thread holds the MMU lock in write mode.
 	 */
 	spinlock_t tdp_mmu_pages_lock;
+	struct workqueue_struct *tdp_mmu_zap_wq;
 #endif /* CONFIG_X86_64 */
 
 	/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0b88592495f8..9287ee078c49 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5730,7 +5730,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 	kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
 
 	kvm_zap_obsolete_pages(kvm);
-
 	write_unlock(&kvm->mmu_lock);
 
 	/*
@@ -5741,11 +5740,8 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 	 * Deferring the zap until the final reference to the root is put would
 	 * lead to use-after-free.
 	 */
-	if (is_tdp_mmu_enabled(kvm)) {
-		read_lock(&kvm->mmu_lock);
+	if (is_tdp_mmu_enabled(kvm))
 		kvm_tdp_mmu_zap_invalidated_roots(kvm);
-		read_unlock(&kvm->mmu_lock);
-	}
 }
 
 static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index be063b6c91b7..1bff453f7cbe 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -65,7 +65,13 @@ struct kvm_mmu_page {
 		struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
 		tdp_ptep_t ptep;
 	};
-	DECLARE_BITMAP(unsync_child_bitmap, 512);
+	union {
+		DECLARE_BITMAP(unsync_child_bitmap, 512);
+		struct {
+			struct work_struct tdp_mmu_async_work;
+			void *tdp_mmu_async_data;
+		};
+	};
 
 	struct list_head lpage_disallowed_link;
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5038de0c872d..ed1bb63b342d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -25,6 +25,8 @@ bool kvm_mmu_init_tdp_mmu(struct kvm *kvm)
 	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
 	spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
 	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
+	kvm->arch.tdp_mmu_zap_wq =
+		alloc_workqueue("kvm", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
 
 	return true;
 }
@@ -49,11 +51,15 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_pages));
 	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
 
+	flush_workqueue(kvm->arch.tdp_mmu_zap_wq);
+
 	/*
 	 * Ensure that all the outstanding RCU callbacks to free shadow pages
-	 * can run before the VM is torn down.
+	 * can run before the VM is torn down.  Work items on tdp_mmu_zap_wq
+	 * can call kvm_tdp_mmu_put_root and create new callbacks.
 	 */
 	rcu_barrier();
+	destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);
 }
 
 static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
@@ -81,6 +87,53 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
 static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 			     bool shared);
 
+static void tdp_mmu_zap_root_work(struct work_struct *work)
+{
+	struct kvm_mmu_page *root = container_of(work, struct kvm_mmu_page,
+						 tdp_mmu_async_work);
+	struct kvm *kvm = root->tdp_mmu_async_data;
+
+	read_lock(&kvm->mmu_lock);
+
+	/*
+	 * A TLB flush is not necessary as KVM performs a local TLB flush when
+	 * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
+	 * to a different pCPU.  Note, the local TLB flush on reuse also
+	 * invalidates any paging-structure-cache entries, i.e. TLB entries for
+	 * intermediate paging structures, that may be zapped, as such entries
+	 * are associated with the ASID on both VMX and SVM.
+	 */
+	tdp_mmu_zap_root(kvm, root, true);
+
+	/*
+	 * Drop the refcount using kvm_tdp_mmu_put_root() to test its logic for
+	 * avoiding an infinite loop.  By design, the root is reachable while
+	 * it's being asynchronously zapped, thus a different task can put its
+	 * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for an
+	 * asynchronously zapped root is unavoidable.
+	 */
+	kvm_tdp_mmu_put_root(kvm, root, true);
+
+	read_unlock(&kvm->mmu_lock);
+}
+
+static void tdp_mmu_schedule_zap_root(struct kvm *kvm, struct kvm_mmu_page *root)
+{
+	root->tdp_mmu_async_data = kvm;
+	INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_work);
+	queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
+}
+
+static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
+{
+	union kvm_mmu_page_role role = page->role;
+	role.invalid = true;
+
+	/* No need to use cmpxchg, only the invalid bit can change.  */
+	role.word = xchg(&page->role.word, role.word);
+	return role.invalid;
+}
+
 void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 			  bool shared)
 {
@@ -892,6 +945,13 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
 	int i;
 
 	/*
+	 * Zap all roots, including invalid roots, as all SPTEs must be dropped
+	 * before returning to the caller.  Zap directly even if the root is
+	 * also being zapped by a worker.  Walking zapped top-level SPTEs isn't
+	 * all that expensive and mmu_lock is already held, which means the
+	 * worker has yielded, i.e. flushing the work instead of zapping here
+	 * isn't guaranteed to be any faster.
+	 *
 	 * A TLB flush is unnecessary, KVM zaps everything if and only if the VM
 	 * is being destroyed or the userspace VMM has exited.  In both cases,
 	 * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
@@ -902,96 +962,28 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
 	}
 }
 
-static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
-						  struct kvm_mmu_page *prev_root)
-{
-	struct kvm_mmu_page *next_root;
-
-	if (prev_root)
-		next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
-						  &prev_root->link,
-						  typeof(*prev_root), link);
-	else
-		next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
-						   typeof(*next_root), link);
-
-	while (next_root && !(next_root->role.invalid &&
-			      refcount_read(&next_root->tdp_mmu_root_count)))
-		next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
-						  &next_root->link,
-						  typeof(*next_root), link);
-
-	return next_root;
-}
-
 /*
  * Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
- * zap" completes.  Since kvm_tdp_mmu_invalidate_all_roots() has acquired a
- * reference to each invalidated root, roots will not be freed until after this
- * function drops the gifted reference, e.g. so that vCPUs don't get stuck with
- * tearing down paging structures.
+ * zap" completes.
  */
 void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
 {
-	struct kvm_mmu_page *next_root;
-	struct kvm_mmu_page *root;
-
-	lockdep_assert_held_read(&kvm->mmu_lock);
-
-	rcu_read_lock();
-
-	root = next_invalidated_root(kvm, NULL);
-
-	while (root) {
-		next_root = next_invalidated_root(kvm, root);
-
-		rcu_read_unlock();
-
-		/*
-		 * A TLB flush is unnecessary, invalidated roots are guaranteed
-		 * to be unreachable by the guest (see kvm_tdp_mmu_put_root()
-		 * for more details), and unlike the legacy MMU, no vCPU kick
-		 * is needed to play nice with lockless shadow walks as the TDP
-		 * MMU protects its paging structures via RCU.  Note, zapping
-		 * will still flush on yield, but that's a minor performance
-		 * blip and not a functional issue.
-		 */
-		tdp_mmu_zap_root(kvm, root, true);
-
-		/*
-		 * Put the reference acquired in
-		 * kvm_tdp_mmu_invalidate_roots
-		 */
-		kvm_tdp_mmu_put_root(kvm, root, true);
-
-		root = next_root;
-
-		rcu_read_lock();
-	}
-
-	rcu_read_unlock();
+	flush_workqueue(kvm->arch.tdp_mmu_zap_wq);
 }
 
 /*
  * Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
- * is about to be zapped, e.g. in response to a memslots update.  The caller is
- * responsible for invoking kvm_tdp_mmu_zap_invalidated_roots() to do the actual
- * zapping.
- *
- * Take a reference on all roots to prevent the root from being freed before it
- * is zapped by this thread.  Freeing a root is not a correctness issue, but if
- * a vCPU drops the last reference to a root prior to the root being zapped, it
- * will get stuck with tearing down the entire paging structure.
+ * is about to be zapped, e.g. in response to a memslots update.  The actual
+ * zapping is performed asynchronously, so a reference is taken on all roots.
+ * Using a separate workqueue makes it easy to ensure that the destruction is
+ * performed before the "fast zap" completes, without keeping a separate list
+ * of invalidated roots; the list is effectively the list of work items in
+ * the workqueue.
  *
- * Get a reference even if the root is already invalid,
- * kvm_tdp_mmu_zap_invalidated_roots() assumes it was gifted a reference to all
- * invalid roots, e.g. there's no epoch to identify roots that were invalidated
- * by a previous call.  Roots stay on the list until the last reference is
- * dropped, so even though all invalid roots are zapped, a root may not go away
- * for quite some time, e.g. if a vCPU blocks across multiple memslot updates.
- *
- * Because mmu_lock is held for write, it should be impossible to observe a
- * root with zero refcount, i.e. the list of roots cannot be stale.
+ * Get a reference even if the root is already invalid, the asynchronous worker
+ * assumes it was gifted a reference to the root it processes.  Because mmu_lock
+ * is held for write, it should be impossible to observe a root with zero refcount,
+ * i.e. the list of roots cannot be stale.
  *
  * This has essentially the same effect for the TDP MMU
  * as updating mmu_valid_gen does for the shadow MMU.
@@ -1002,8 +994,10 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
-		if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
+		if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root))) {
 			root->role.invalid = true;
+			tdp_mmu_schedule_zap_root(kvm, root);
+		}
 	}
 }
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 22/30] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (20 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 23/30] KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls Paolo Bonzini
                   ` (8 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

Allow yielding when zapping SPTEs after the last reference to a valid
root is put.  Because KVM must drop all SPTEs in response to relevant
mmu_notifier events, mark defunct roots invalid and reset their refcount
prior to zapping the root.  Keeping the refcount elevated while the zap
is in-progress ensures the root is reachable via mmu_notifier until the
zap completes and the last reference to the invalid, defunct root is put.

Allowing kvm_tdp_mmu_put_root() to yield fixes soft lockup issues if the
root being put has a massive paging structure, e.g. zapping a root
that is backed entirely by 4kb pages for a guest with 32tb of memory can
take hundreds of seconds to complete.

  watchdog: BUG: soft lockup - CPU#49 stuck for 485s! [max_guest_memor:52368]
  RIP: 0010:kvm_set_pfn_dirty+0x30/0x50 [kvm]
   __handle_changed_spte+0x1b2/0x2f0 [kvm]
   handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
   __handle_changed_spte+0x1f4/0x2f0 [kvm]
   handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
   __handle_changed_spte+0x1f4/0x2f0 [kvm]
   tdp_mmu_zap_root+0x307/0x4d0 [kvm]
   kvm_tdp_mmu_put_root+0x7c/0xc0 [kvm]
   kvm_mmu_free_roots+0x22d/0x350 [kvm]
   kvm_mmu_reset_context+0x20/0x60 [kvm]
   kvm_arch_vcpu_ioctl_set_sregs+0x5a/0xc0 [kvm]
   kvm_vcpu_ioctl+0x5bd/0x710 [kvm]
   __se_sys_ioctl+0x77/0xc0
   __x64_sys_ioctl+0x1d/0x20
   do_syscall_64+0x44/0xa0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

KVM currently doesn't put a root from a non-preemptible context, so other
than the mmu_notifier wrinkle, yielding when putting a root is safe.

Yield-unfriendly iteration uses for_each_tdp_mmu_root(), which doesn't
take a reference to each root (it requires mmu_lock be held for the
entire duration of the walk).

tdp_mmu_next_root() is used only by the yield-friendly iterator.

tdp_mmu_zap_root_work() is explicitly yield friendly.

kvm_mmu_free_roots() => mmu_free_root_page() is a much bigger fan-out,
but is still yield-friendly in all call sites, as all callers can be
traced back to some combination of vcpu_run(), kvm_destroy_vm(), and/or
kvm_create_vm().
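
A simplified userspace model of the mark-invalid/refcount handoff (C11
atomics stand in for xchg() and refcount_t; this is a sketch of the
logic, not the actual KVM code):

/* Sketch: re-elevate the refcount of a defunct root so zapping can yield. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct root {
	atomic_uint refcount;
	atomic_bool invalid;
};

/* Returns the previous value, mirroring the xchg() on role.word. */
static bool mark_invalid(struct root *r)
{
	return atomic_exchange(&r->invalid, true);
}

static void zap_root(struct root *r)  { (void)r; /* may yield; root stays reachable */ }
static void free_root(struct root *r) { printf("freeing root %p\n", (void *)r); }

/* Called when the refcount has just dropped to zero. */
static void put_root(struct root *r)
{
	if (!mark_invalid(r)) {
		/* Keep the root visible to mmu_notifier flows while zapping. */
		atomic_store(&r->refcount, 1);
		zap_root(r);

		/* Someone took a reference while we yielded; they will free it. */
		if (atomic_fetch_sub(&r->refcount, 1) != 1)
			return;
	}
	free_root(r);
}

int main(void)
{
	struct root r = { .refcount = 0, .invalid = false };

	put_root(&r);
	return 0;
}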

Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-21-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 93 +++++++++++++++++++++-----------------
 1 file changed, 52 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ed1bb63b342d..408e21e4009c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -144,20 +144,46 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	WARN_ON(!root->tdp_mmu_page);
 
-	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
-	list_del_rcu(&root->link);
-	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
-
 	/*
-	 * A TLB flush is not necessary as KVM performs a local TLB flush when
-	 * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
-	 * to a different pCPU.  Note, the local TLB flush on reuse also
-	 * invalidates any paging-structure-cache entries, i.e. TLB entries for
-	 * intermediate paging structures, that may be zapped, as such entries
-	 * are associated with the ASID on both VMX and SVM.
+	 * The root now has refcount=0.  It is valid, but readers already
+	 * cannot acquire a reference to it because kvm_tdp_mmu_get_root()
+	 * rejects it.  This remains true for the rest of the execution
+	 * of this function, because readers visit valid roots only
+	 * (except for tdp_mmu_zap_root_work(), which however
+	 * does not acquire any reference itself).
+	 *
+	 * Even though there are flows that need to visit all roots for
+	 * correctness, they all take mmu_lock for write, so they cannot yet
+	 * run concurrently. The same is true after kvm_tdp_root_mark_invalid,
+	 * since the root still has refcount=0.
+	 *
+	 * However, tdp_mmu_zap_root can yield, and writers do not expect to
+	 * see refcount=0 (see for example kvm_tdp_mmu_invalidate_all_roots()).
+	 * So the root temporarily gets an extra reference, going to refcount=1
+	 * while staying invalid.  Readers still cannot acquire any reference;
+	 * but writers are now allowed to run if tdp_mmu_zap_root yields and
+	 * they might take an extra reference if they themselves yield.  Therefore,
+	 * when the reference is given back after tdp_mmu_zap_root terminates,
+	 * there is no guarantee that the refcount is still 1.  If not, whoever
+	 * puts the last reference will free the page, but they will not have to
+	 * zap the root because a root cannot go from invalid to valid.
 	 */
-	tdp_mmu_zap_root(kvm, root, shared);
+	if (!kvm_tdp_root_mark_invalid(root)) {
+		refcount_set(&root->tdp_mmu_root_count, 1);
+		tdp_mmu_zap_root(kvm, root, shared);
+
+		/*
+		 * Give back the reference that was added back above.  We now
+		 * know that the root is invalid, so go ahead and free it if
+		 * no one has taken a reference in the meanwhile.
+		 */
+		if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
+			return;
+	}
 
+	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+	list_del_rcu(&root->link);
+	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 	call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }
 
@@ -799,12 +825,23 @@ static inline gfn_t tdp_mmu_max_gfn_host(void)
 static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 			     bool shared)
 {
-	bool root_is_unreachable = !refcount_read(&root->tdp_mmu_root_count);
 	struct tdp_iter iter;
 
 	gfn_t end = tdp_mmu_max_gfn_host();
 	gfn_t start = 0;
 
+	/*
+	 * The root must have an elevated refcount so that it's reachable via
+	 * mmu_notifier callbacks, which allows this path to yield and drop
+	 * mmu_lock.  When handling an unmap/release mmu_notifier command, KVM
+	 * must drop all references to relevant pages prior to completing the
+	 * callback.  Dropping mmu_lock with an unreachable root would result
+	 * in zapping SPTEs after a relevant mmu_notifier callback completes
+	 * and lead to use-after-free as zapping a SPTE triggers "writeback" of
+	 * dirty/accessed bits to the SPTE's associated struct page.
+	 */
+	WARN_ON_ONCE(!refcount_read(&root->tdp_mmu_root_count));
+
 	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
 
 	rcu_read_lock();
@@ -815,42 +852,16 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 	 */
 	for_each_tdp_pte_min_level(iter, root, root->role.level, start, end) {
 retry:
-		/*
-		 * Yielding isn't allowed when zapping an unreachable root as
-		 * the root won't be processed by mmu_notifier callbacks.  When
-		 * handling an unmap/release mmu_notifier command, KVM must
-		 * drop all references to relevant pages prior to completing
-		 * the callback.  Dropping mmu_lock can result in zapping SPTEs
-		 * for an unreachable root after a relevant callback completes,
-		 * which leads to use-after-free as zapping a SPTE triggers
-		 * "writeback" of dirty/accessed bits to the SPTE's associated
-		 * struct page.
-		 */
-		if (!root_is_unreachable &&
-		    tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
 			continue;
 
 		if (!is_shadow_present_pte(iter.old_spte))
 			continue;
 
-		if (!shared) {
+		if (!shared)
 			tdp_mmu_set_spte(kvm, &iter, 0);
-		} else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0)) {
-			/*
-			 * cmpxchg() shouldn't fail if the root is unreachable.
-			 * Retry so as not to leak the page and its children.
-			 */
-			WARN_ONCE(root_is_unreachable,
-				  "Contended TDP MMU SPTE in unreachable root.");
+		else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0))
 			goto retry;
-		}
-
-		/*
-		 * WARN if the root is invalid and is unreachable, all SPTEs
-		 * should've been zapped by kvm_tdp_mmu_zap_invalidated_roots(),
-		 * and inserting new SPTEs under an invalid root is a KVM bug.
-		 */
-		WARN_ON_ONCE(root_is_unreachable && root->role.invalid);
 	}
 
 	rcu_read_unlock();
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 23/30] KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (21 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 22/30] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 24/30] KVM: x86/mmu: Zap defunct roots via asynchronous worker Paolo Bonzini
                   ` (7 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

When zapping a TDP MMU root, perform the zap in two passes to avoid
zapping an entire top-level SPTE while holding RCU, which can induce RCU
stalls.  In the first pass, zap SPTEs at PG_LEVEL_1G, and then
zap top-level entries in the second pass.

With 4-level paging, zapping a PGD that is fully populated with 4kb leaf
SPTEs takes up to ~7 seconds (time varies based on kernel config,
number of (v)CPUs, etc...).  With 5-level paging, that time can balloon
well into hundreds of seconds.

Before remote TLB flushes were omitted, the problem was even worse as
waiting for all active vCPUs to respond to the IPI introduced significant
overhead for VMs with large numbers of vCPUs.

By zapping 1gb SPTEs (both shadow pages and hugepages) in the first pass,
the amount of work that is done without dropping RCU protection is
strictly bounded, with the worst case latency for a single operation
being less than 100ms.

Zapping at 1gb in the first pass is not arbitrary.  First and foremost,
KVM relies on being able to zap 1gb shadow pages in a single shot when
replacing a shadow page with a hugepage.  Zapping a 1gb shadow page that
is fully populated with 4kb dirty SPTEs also triggers the worst case
latency due to writing back the struct page accessed/dirty bits for each
4kb page, i.e. the two-pass approach is guaranteed to work so long as KVM
can cleanly zap a 1gb shadow page.
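
Back-of-the-envelope math for the bound (assuming 512 entries per paging
structure): a fully populated 1gb shadow page spans 512 2mb entries and
512 * 512 = 262,144 4kb leaf SPTEs, so each non-yielding burst in the
first pass writes back dirty/accessed state for at most ~262k pages.
Zapping a fully populated 4-level root in a single burst would instead
touch 512^3 = ~134 million leaf SPTEs, and a 5-level root far more.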

  rcu: INFO: rcu_sched self-detected stall on CPU
  rcu:     52-....: (20999 ticks this GP) idle=7be/1/0x4000000000000000
                                          softirq=15759/15759 fqs=5058
   (t=21016 jiffies g=66453 q=238577)
  NMI backtrace for cpu 52
  Call Trace:
   ...
   mark_page_accessed+0x266/0x2f0
   kvm_set_pfn_accessed+0x31/0x40
   handle_removed_tdp_mmu_page+0x259/0x2e0
   __handle_changed_spte+0x223/0x2c0
   handle_removed_tdp_mmu_page+0x1c1/0x2e0
   __handle_changed_spte+0x223/0x2c0
   handle_removed_tdp_mmu_page+0x1c1/0x2e0
   __handle_changed_spte+0x223/0x2c0
   zap_gfn_range+0x141/0x3b0
   kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130
   kvm_mmu_zap_all_fast+0x121/0x190
   kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
   kvm_page_track_flush_slot+0x5c/0x80
   kvm_arch_flush_shadow_memslot+0xe/0x10
   kvm_set_memslot+0x172/0x4e0
   __kvm_set_memory_region+0x337/0x590
   kvm_vm_ioctl+0x49c/0xf80

Reported-by: David Matlack <dmatlack@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-22-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 51 +++++++++++++++++++++++++-------------
 1 file changed, 34 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 408e21e4009c..e24a1bff9218 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -822,14 +822,36 @@ static inline gfn_t tdp_mmu_max_gfn_host(void)
 	return 1ULL << (shadow_phys_bits - PAGE_SHIFT);
 }
 
-static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
-			     bool shared)
+static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
+			       bool shared, int zap_level)
 {
 	struct tdp_iter iter;
 
 	gfn_t end = tdp_mmu_max_gfn_host();
 	gfn_t start = 0;
 
+	for_each_tdp_pte_min_level(iter, root, zap_level, start, end) {
+retry:
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
+			continue;
+
+		if (!is_shadow_present_pte(iter.old_spte))
+			continue;
+
+		if (iter.level > zap_level)
+			continue;
+
+		if (!shared)
+			tdp_mmu_set_spte(kvm, &iter, 0);
+		else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0))
+			goto retry;
+	}
+}
+
+static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
+			     bool shared)
+{
+
 	/*
 	 * The root must have an elevated refcount so that it's reachable via
 	 * mmu_notifier callbacks, which allows this path to yield and drop
@@ -847,22 +869,17 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 	rcu_read_lock();
 
 	/*
-	 * No need to try to step down in the iterator when zapping an entire
-	 * root, zapping an upper-level SPTE will recurse on its children.
+	 * To avoid RCU stalls due to recursively removing huge swaths of SPs,
+	 * split the zap into two passes.  On the first pass, zap at the 1gb
+	 * level, and then zap top-level SPs on the second pass.  "1gb" is not
+	 * arbitrary, as KVM must be able to zap a 1gb shadow page without
+	 * inducing a stall to allow in-place replacement with a 1gb hugepage.
+	 *
+	 * Because zapping a SP recurses on its children, stepping down to
+	 * PG_LEVEL_4K in the iterator itself is unnecessary.
 	 */
-	for_each_tdp_pte_min_level(iter, root, root->role.level, start, end) {
-retry:
-		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
-			continue;
-
-		if (!is_shadow_present_pte(iter.old_spte))
-			continue;
-
-		if (!shared)
-			tdp_mmu_set_spte(kvm, &iter, 0);
-		else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0))
-			goto retry;
-	}
+	__tdp_mmu_zap_root(kvm, root, shared, PG_LEVEL_1G);
+	__tdp_mmu_zap_root(kvm, root, shared, root->role.level);
 
 	rcu_read_unlock();
 }
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 24/30] KVM: x86/mmu: Zap defunct roots via asynchronous worker
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (22 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 23/30] KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 22:08   ` Sean Christopherson
  2022-03-03 19:38 ` [PATCH v4 25/30] KVM: x86/mmu: Check for a REMOVED leaf SPTE before making the SPTE Paolo Bonzini
                   ` (6 subsequent siblings)
  30 siblings, 1 reply; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

Zap defunct roots, a.k.a. roots that have been invalidated after their
last reference was initially dropped, asynchronously via the per-VM
workqueue instead of forcing the work upon the unfortunate task that
happened to drop the last reference.

If a vCPU task drops the last reference, the vCPU is effectively blocked
by the host for the entire duration of the zap.  If the root being zapped
happens to be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
being active, the zap can take several hundred seconds.  Unsurprisingly,
most guests are unhappy if a vCPU disappears for hundreds of seconds.

E.g. running a synthetic selftest that triggers a vCPU root zap with
~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
Offloading the zap to a worker drops the block time to <100ms.
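
For reference, the hand-off relies on the per-VM workqueue and the
tdp_mmu_schedule_zap_root()/tdp_mmu_zap_root_work() plumbing introduced
earlier in the series; the sketch below approximates the worker side and
is not a hunk from this patch:

static void tdp_mmu_zap_root_work(struct work_struct *work)
{
	struct kvm_mmu_page *root = container_of(work, struct kvm_mmu_page,
						 tdp_mmu_async_work);
	struct kvm *kvm = root->tdp_mmu_async_data;

	read_lock(&kvm->mmu_lock);

	/* Zap in "shared" mode; vCPUs may still be running. */
	tdp_mmu_zap_root(kvm, root, true);

	/* Drop the reference handed off by kvm_tdp_mmu_put_root(). */
	kvm_tdp_mmu_put_root(kvm, root, true);

	read_unlock(&kvm->mmu_lock);
}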

Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-23-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e24a1bff9218..2456f880508d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -170,13 +170,24 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 	 */
 	if (!kvm_tdp_root_mark_invalid(root)) {
 		refcount_set(&root->tdp_mmu_root_count, 1);
-		tdp_mmu_zap_root(kvm, root, shared);
 
 		/*
-		 * Give back the reference that was added back above.  We now
+		 * If the struct kvm is alive, we might as well zap the root
+		 * in a worker.  The worker takes ownership of the reference we
+		 * just added to root and is flushed before the struct kvm dies.
+		 */
+		if (likely(refcount_read(&kvm->users_count))) {
+			tdp_mmu_schedule_zap_root(kvm, root);
+			return;
+		}
+
+		/*
+		 * The struct kvm is being destroyed, zap synchronously and give
+		 * back immediately the reference that was added above.  We now
 		 * know that the root is invalid, so go ahead and free it if
 		 * no one has taken a reference in the meanwhile.
 		 */
+		tdp_mmu_zap_root(kvm, root, shared);
 		if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
 			return;
 	}
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 25/30] KVM: x86/mmu: Check for a REMOVED leaf SPTE before making the SPTE
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (23 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 24/30] KVM: x86/mmu: Zap defunct roots via asynchronous worker Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 26/30] KVM: x86/mmu: WARN on any attempt to atomically update REMOVED SPTE Paolo Bonzini
                   ` (5 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Explicitly check for a REMOVED leaf SPTE prior to attempting to map
the final SPTE when handling a TDP MMU fault.  Functionally, this is a
nop as tdp_mmu_set_spte_atomic() will eventually detect the frozen SPTE.
Pre-checking for a REMOVED SPTE is a minor optimization, but the real goal
is to allow tdp_mmu_set_spte_atomic() to have an invariant that the "old"
SPTE is never a REMOVED SPTE.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-24-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 2456f880508d..89e6eb6640fe 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1202,7 +1202,11 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		}
 	}
 
-	if (iter.level != fault->goal_level) {
+	/*
+	 * Force the guest to retry the access if the upper level SPTEs aren't
+	 * in place, or if the target leaf SPTE is frozen by another CPU.
+	 */
+	if (iter.level != fault->goal_level || is_removed_spte(iter.old_spte)) {
 		rcu_read_unlock();
 		return RET_PF_RETRY;
 	}
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 26/30] KVM: x86/mmu: WARN on any attempt to atomically update REMOVED SPTE
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (24 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 25/30] KVM: x86/mmu: Check for a REMOVED leaf SPTE before making the SPTE Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 27/30] KVM: selftests: Move raw KVM_SET_USER_MEMORY_REGION helper to utils Paolo Bonzini
                   ` (4 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Disallow calling tdp_mmu_set_spte_atomic() with a REMOVED "old" SPTE.
This solves a conundrum introduced by commit 3255530ab191 ("KVM: x86/mmu:
Automatically update iter->old_spte if cmpxchg fails"); if the helper
doesn't update old_spte in the REMOVED case, then theoretically the
caller could get stuck in an infinite loop as it will fail indefinitely
on the REMOVED SPTE.  E.g. until recently, clear_dirty_gfn_range() didn't
check for a present SPTE and would have spun until getting rescheduled.

In practice, only the page fault path should "create" a new SPTE, all
other paths should only operate on existing, a.k.a. shadow present,
SPTEs.  Now that the page fault path pre-checks for a REMOVED SPTE in all
cases, require all other paths to indirectly pre-check by verifying the
target SPTE is a shadow-present SPTE.

Note, this does not guarantee the actual SPTE isn't REMOVED, nor is that
scenario disallowed.  The invariant is only that the caller mustn't
invoke tdp_mmu_set_spte_atomic() if the SPTE was REMOVED when last
observed by the caller.
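
To illustrate the calling convention, a sketch mirroring the zap loops
earlier in the series (new_spte and the GFN range are placeholders):
callers outside the page fault path satisfy the invariant simply by
skipping non-present SPTEs, as a REMOVED SPTE is not shadow-present.

	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
retry:
		if (!is_shadow_present_pte(iter.old_spte))
			continue;

		/*
		 * On cmpxchg failure iter.old_spte is refreshed, so the
		 * present check above filters out a now-REMOVED SPTE on
		 * the next attempt.
		 */
		if (tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
			goto retry;
	}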

Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-25-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 89e6eb6640fe..a0e24d260983 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -622,16 +622,15 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	u64 *sptep = rcu_dereference(iter->sptep);
 	u64 old_spte;
 
-	WARN_ON_ONCE(iter->yielded);
-
-	lockdep_assert_held_read(&kvm->mmu_lock);
-
 	/*
-	 * Do not change removed SPTEs. Only the thread that froze the SPTE
-	 * may modify it.
+	 * The caller is responsible for ensuring the old SPTE is not a REMOVED
+	 * SPTE.  KVM should never attempt to zap or manipulate a REMOVED SPTE,
+	 * and pre-checking before inserting a new SPTE is advantageous as it
+	 * avoids unnecessary work.
 	 */
-	if (is_removed_spte(iter->old_spte))
-		return -EBUSY;
+	WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte));
+
+	lockdep_assert_held_read(&kvm->mmu_lock);
 
 	/*
 	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 27/30] KVM: selftests: Move raw KVM_SET_USER_MEMORY_REGION helper to utils
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (25 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 26/30] KVM: x86/mmu: WARN on any attempt to atomically update REMOVED SPTE Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 28/30] KVM: selftests: Split out helper to allocate guest mem via memfd Paolo Bonzini
                   ` (3 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Move set_memory_region_test's KVM_SET_USER_MEMORY_REGION helper to KVM's
utils so that it can be used by other tests.  Provide a raw version as
well as an assert-success version to reduce the amount of boilerplate
code needed for basic usage.

No functional change intended.
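
E.g. after conversion a test uses the assert-success wrapper for the
common case and the raw variant when failure is the expected outcome
(sketch mirroring the set_memory_region_test changes below):

	/* Common case: any failure is a test bug. */
	vm_set_user_memory_region(vm, slot, 0, gpa, MEM_REGION_SIZE, hva);

	/* Raw variant when the test wants to inspect the error itself. */
	ret = __vm_set_user_memory_region(vm, slot, 0, gpa, MEM_REGION_SIZE, hva);
	TEST_ASSERT(ret == -1 && errno == EINVAL,
		    "Adding an out-of-range slot should fail with EINVAL");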

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-26-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 .../selftests/kvm/include/kvm_util_base.h     |  4 +++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 24 +++++++++++++
 .../selftests/kvm/set_memory_region_test.c    | 35 +++++--------------
 3 files changed, 36 insertions(+), 27 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index f987cf7c0d2e..573de0354175 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -147,6 +147,10 @@ void vcpu_dump(FILE *stream, struct kvm_vm *vm, uint32_t vcpuid,
 
 void vm_create_irqchip(struct kvm_vm *vm);
 
+void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
+			       uint64_t gpa, uint64_t size, void *hva);
+int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
+				uint64_t gpa, uint64_t size, void *hva);
 void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	enum vm_mem_backing_src_type src_type,
 	uint64_t guest_paddr, uint32_t slot, uint64_t npages,
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 64618032aa58..dcb8e96c6a54 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -839,6 +839,30 @@ static void vm_userspace_mem_region_hva_insert(struct rb_root *hva_tree,
 	rb_insert_color(&region->hva_node, hva_tree);
 }
 
+
+int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
+				uint64_t gpa, uint64_t size, void *hva)
+{
+	struct kvm_userspace_memory_region region = {
+		.slot = slot,
+		.flags = flags,
+		.guest_phys_addr = gpa,
+		.memory_size = size,
+		.userspace_addr = (uintptr_t)hva,
+	};
+
+	return ioctl(vm->fd, KVM_SET_USER_MEMORY_REGION, &region);
+}
+
+void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
+			       uint64_t gpa, uint64_t size, void *hva)
+{
+	int ret = __vm_set_user_memory_region(vm, slot, flags, gpa, size, hva);
+
+	TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION failed, errno = %d (%s)",
+		    errno, strerror(errno));
+}
+
 /*
  * VM Userspace Memory Region Add
  *
diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/testing/selftests/kvm/set_memory_region_test.c
index 72a1c9b4882c..73bc297dabe6 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -329,22 +329,6 @@ static void test_zero_memory_regions(void)
 }
 #endif /* __x86_64__ */
 
-static int test_memory_region_add(struct kvm_vm *vm, void *mem, uint32_t slot,
-				   uint32_t size, uint64_t guest_addr)
-{
-	struct kvm_userspace_memory_region region;
-	int ret;
-
-	region.slot = slot;
-	region.flags = 0;
-	region.guest_phys_addr = guest_addr;
-	region.memory_size = size;
-	region.userspace_addr = (uintptr_t) mem;
-	ret = ioctl(vm_get_fd(vm), KVM_SET_USER_MEMORY_REGION, &region);
-
-	return ret;
-}
-
 /*
  * Test it can be added memory slots up to KVM_CAP_NR_MEMSLOTS, then any
  * tentative to add further slots should fail.
@@ -382,23 +366,20 @@ static void test_add_max_memory_regions(void)
 	TEST_ASSERT(mem != MAP_FAILED, "Failed to mmap() host");
 	mem_aligned = (void *)(((size_t) mem + alignment - 1) & ~(alignment - 1));
 
-	for (slot = 0; slot < max_mem_slots; slot++) {
-		ret = test_memory_region_add(vm, mem_aligned +
-					     ((uint64_t)slot * MEM_REGION_SIZE),
-					     slot, MEM_REGION_SIZE,
-					     (uint64_t)slot * MEM_REGION_SIZE);
-		TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
-			    "  rc: %i errno: %i slot: %i\n",
-			    ret, errno, slot);
-	}
+	for (slot = 0; slot < max_mem_slots; slot++)
+		vm_set_user_memory_region(vm, slot, 0,
+					  ((uint64_t)slot * MEM_REGION_SIZE),
+					  MEM_REGION_SIZE,
+					  mem_aligned + (uint64_t)slot * MEM_REGION_SIZE);
 
 	/* Check it cannot be added memory slots beyond the limit */
 	mem_extra = mmap(NULL, MEM_REGION_SIZE, PROT_READ | PROT_WRITE,
 			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 	TEST_ASSERT(mem_extra != MAP_FAILED, "Failed to mmap() host");
 
-	ret = test_memory_region_add(vm, mem_extra, max_mem_slots, MEM_REGION_SIZE,
-				     (uint64_t)max_mem_slots * MEM_REGION_SIZE);
+	ret = __vm_set_user_memory_region(vm, max_mem_slots, 0,
+					  (uint64_t)max_mem_slots * MEM_REGION_SIZE,
+					  MEM_REGION_SIZE, mem_extra);
 	TEST_ASSERT(ret == -1 && errno == EINVAL,
 		    "Adding one more memory slot should fail with EINVAL");
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 28/30] KVM: selftests: Split out helper to allocate guest mem via memfd
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (26 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 27/30] KVM: selftests: Move raw KVM_SET_USER_MEMORY_REGION helper to utils Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 29/30] KVM: selftests: Define cpu_relax() helpers for s390 and x86 Paolo Bonzini
                   ` (2 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Extract the code for allocating guest memory via memfd out of
vm_userspace_mem_region_add() and into a new helper, kvm_memfd_alloc().
A future selftest to populate a guest with the maximum amount of guest
memory will abuse KVM's memslots to alias guest memory regions to a
single memfd-backed host region, i.e. needs to back a guest with memfd
memory without a 1:1 association between a memslot and a memfd instance.

No functional change intended.
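
For context, this is roughly how the selftest at the end of the series
consumes the helper (illustrative snippet, not part of this patch):

	fd = kvm_memfd_alloc(slot_size, hugepages);
	mem = mmap(NULL, slot_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	TEST_ASSERT(mem != MAP_FAILED, "mmap() failed");

	/* Alias the single memfd-backed region into many memslots. */
	for (slot = first_slot; slot < max_slots; slot++, gpa += slot_size)
		vm_set_user_memory_region(vm, slot, 0, gpa, slot_size, mem);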

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-27-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 .../selftests/kvm/include/kvm_util_base.h     |  1 +
 tools/testing/selftests/kvm/lib/kvm_util.c    | 42 +++++++++++--------
 2 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 573de0354175..92cef0ffb19e 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -123,6 +123,7 @@ int kvm_memcmp_hva_gva(void *hva, struct kvm_vm *vm, const vm_vaddr_t gva,
 		       size_t len);
 
 void kvm_vm_elf_load(struct kvm_vm *vm, const char *filename);
+int kvm_memfd_alloc(size_t size, bool hugepages);
 
 void vm_dump(FILE *stream, struct kvm_vm *vm, uint8_t indent);
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index dcb8e96c6a54..1665a220abcb 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -718,6 +718,27 @@ void kvm_vm_free(struct kvm_vm *vmp)
 	free(vmp);
 }
 
+int kvm_memfd_alloc(size_t size, bool hugepages)
+{
+	int memfd_flags = MFD_CLOEXEC;
+	int fd, r;
+
+	if (hugepages)
+		memfd_flags |= MFD_HUGETLB;
+
+	fd = memfd_create("kvm_selftest", memfd_flags);
+	TEST_ASSERT(fd != -1, "memfd_create() failed, errno: %i (%s)",
+		    errno, strerror(errno));
+
+	r = ftruncate(fd, size);
+	TEST_ASSERT(!r, "ftruncate() failed, errno: %i (%s)", errno, strerror(errno));
+
+	r = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, size);
+	TEST_ASSERT(!r, "fallocate() failed, errno: %i (%s)", errno, strerror(errno));
+
+	return fd;
+}
+
 /*
  * Memory Compare, host virtual to guest virtual
  *
@@ -970,24 +991,9 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 		region->mmap_size += alignment;
 
 	region->fd = -1;
-	if (backing_src_is_shared(src_type)) {
-		int memfd_flags = MFD_CLOEXEC;
-
-		if (src_type == VM_MEM_SRC_SHARED_HUGETLB)
-			memfd_flags |= MFD_HUGETLB;
-
-		region->fd = memfd_create("kvm_selftest", memfd_flags);
-		TEST_ASSERT(region->fd != -1,
-			    "memfd_create failed, errno: %i", errno);
-
-		ret = ftruncate(region->fd, region->mmap_size);
-		TEST_ASSERT(ret == 0, "ftruncate failed, errno: %i", errno);
-
-		ret = fallocate(region->fd,
-				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0,
-				region->mmap_size);
-		TEST_ASSERT(ret == 0, "fallocate failed, errno: %i", errno);
-	}
+	if (backing_src_is_shared(src_type))
+		region->fd = kvm_memfd_alloc(region->mmap_size,
+					     src_type == VM_MEM_SRC_SHARED_HUGETLB);
 
 	region->mmap_start = mmap(NULL, region->mmap_size,
 				  PROT_READ | PROT_WRITE,
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 29/30] KVM: selftests: Define cpu_relax() helpers for s390 and x86
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (27 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 28/30] KVM: selftests: Split out helper to allocate guest mem via memfd Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-03 19:38 ` [PATCH v4 30/30] KVM: selftests: Add test to populate a VM with the max possible guest mem Paolo Bonzini
  2022-03-08 17:25 ` [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Add cpu_relax() for s390 and x86 for use in arch-agnostic tests.  arm64
already defines its own version.
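
E.g. the max guest memory selftest added at the end of the series spins
on an atomic in its rendezvous loop:

	while (atomic_read(&rendezvous) > 0)
		cpu_relax();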

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-28-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 tools/testing/selftests/kvm/include/s390x/processor.h  | 8 ++++++++
 tools/testing/selftests/kvm/include/x86_64/processor.h | 5 +++++
 2 files changed, 13 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/s390x/processor.h b/tools/testing/selftests/kvm/include/s390x/processor.h
index e0e96a5f608c..255c9b990f4c 100644
--- a/tools/testing/selftests/kvm/include/s390x/processor.h
+++ b/tools/testing/selftests/kvm/include/s390x/processor.h
@@ -5,6 +5,8 @@
 #ifndef SELFTEST_KVM_PROCESSOR_H
 #define SELFTEST_KVM_PROCESSOR_H
 
+#include <linux/compiler.h>
+
 /* Bits in the region/segment table entry */
 #define REGION_ENTRY_ORIGIN	~0xfffUL /* region/segment table origin	   */
 #define REGION_ENTRY_PROTECT	0x200	 /* region protection bit	   */
@@ -19,4 +21,10 @@
 #define PAGE_PROTECT	0x200		/* HW read-only bit  */
 #define PAGE_NOEXEC	0x100		/* HW no-execute bit */
 
+/* Is there a portable way to do this? */
+static inline void cpu_relax(void)
+{
+	barrier();
+}
+
 #endif
diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 8a470da7b71a..37db341d4cc5 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -363,6 +363,11 @@ static inline unsigned long get_xmm(int n)
 	return 0;
 }
 
+static inline void cpu_relax(void)
+{
+	asm volatile("rep; nop" ::: "memory");
+}
+
 bool is_intel_cpu(void);
 bool is_amd_cpu(void);
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v4 30/30] KVM: selftests: Add test to populate a VM with the max possible guest mem
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (28 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 29/30] KVM: selftests: Define cpu_relax() helpers for s390 and x86 Paolo Bonzini
@ 2022-03-03 19:38 ` Paolo Bonzini
  2022-03-08 14:47   ` Paolo Bonzini
  2022-03-08 17:25 ` [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
  30 siblings, 1 reply; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-03 19:38 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

From: Sean Christopherson <seanjc@google.com>

Add a selftest that enables populating a VM with the maximum amount of
guest memory allowed by the underlying architecture.  Abuse KVM's
memslots by mapping a single host memory region into multiple memslots so
that the selftest doesn't require a system with terabytes of RAM.

Default to 512gb of guest memory, which isn't all that interesting, but
should work on all MMUs and doesn't take an exorbitant amount of memory
or time.  E.g. testing with ~64tb of guest memory takes the better part
of an hour, and requires 200gb of memory for KVM's page tables when using
4kb pages.

To inflict maximum abuse on KVM's MMU, default to 4kb pages (or whatever
the not-hugepage size is) in the backing store (memfd).  Use memfd for
the host backing store to ensure that hugepages are guaranteed when
requested, and to give the user explicit control of the size of hugepage
being tested.

By default, spin up as many vCPUs as there are available to the selftest,
and distribute the work of dirtying each 4kb chunk of memory across all
vCPUs.  Dirtying guest memory forces KVM to populate its page tables, and
also forces KVM to write back accessed/dirty information to struct page
when the guest memory is freed.

On x86, perform two passes with an MMU context reset between them to
coerce KVM into dropping all references to the MMU root, e.g. to emulate
a vCPU dropping the last reference.  Perform both passes and all
rendezvous on all architectures in the hope that arm64 and s390x can gain
similar shenanigans in the future.

Measure and report the duration of each operation, which is helpful not
only to verify the test is working as intended, but also to easily
evaluate the performance differences between page sizes.

Provide command line options to limit the amount of guest memory, set the
size of each slot (i.e. of the host memory region), set the number of
vCPUs, and to enable usage of hugepages.
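
For example (illustrative invocation, flag letters per the test's getopt
string), running 16 vCPUs against 128gb of guest memory carved into 4gb
memslots backed by hugepages:

	$ ./max_guest_memory_test -c 16 -m 128 -s 4 -u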

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-29-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   3 +
 .../selftests/kvm/max_guest_memory_test.c     | 292 ++++++++++++++++++
 3 files changed, 296 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/max_guest_memory_test.c

diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index 052ddfe4b23a..9b67343dc4ab 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -58,6 +58,7 @@
 /hardware_disable_test
 /kvm_create_max_vcpus
 /kvm_page_table_test
+/max_guest_memory_test
 /memslot_modification_stress_test
 /memslot_perf_test
 /rseq_test
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index f7fa5655e535..c06b1f8bc649 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -93,6 +93,7 @@ TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
 TEST_GEN_PROGS_x86_64 += hardware_disable_test
 TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
 TEST_GEN_PROGS_x86_64 += kvm_page_table_test
+TEST_GEN_PROGS_x86_64 += max_guest_memory_test
 TEST_GEN_PROGS_x86_64 += memslot_modification_stress_test
 TEST_GEN_PROGS_x86_64 += memslot_perf_test
 TEST_GEN_PROGS_x86_64 += rseq_test
@@ -112,6 +113,7 @@ TEST_GEN_PROGS_aarch64 += dirty_log_test
 TEST_GEN_PROGS_aarch64 += dirty_log_perf_test
 TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
 TEST_GEN_PROGS_aarch64 += kvm_page_table_test
+TEST_GEN_PROGS_aarch64 += max_guest_memory_test
 TEST_GEN_PROGS_aarch64 += memslot_modification_stress_test
 TEST_GEN_PROGS_aarch64 += memslot_perf_test
 TEST_GEN_PROGS_aarch64 += rseq_test
@@ -127,6 +129,7 @@ TEST_GEN_PROGS_s390x += demand_paging_test
 TEST_GEN_PROGS_s390x += dirty_log_test
 TEST_GEN_PROGS_s390x += kvm_create_max_vcpus
 TEST_GEN_PROGS_s390x += kvm_page_table_test
+TEST_GEN_PROGS_s390x += max_guest_memory_test
 TEST_GEN_PROGS_s390x += rseq_test
 TEST_GEN_PROGS_s390x += set_memory_region_test
 TEST_GEN_PROGS_s390x += kvm_binary_stats_test
diff --git a/tools/testing/selftests/kvm/max_guest_memory_test.c b/tools/testing/selftests/kvm/max_guest_memory_test.c
new file mode 100644
index 000000000000..360c88288295
--- /dev/null
+++ b/tools/testing/selftests/kvm/max_guest_memory_test.c
@@ -0,0 +1,292 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <pthread.h>
+#include <semaphore.h>
+#include <sys/types.h>
+#include <signal.h>
+#include <errno.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+#include <linux/atomic.h>
+
+#include "kvm_util.h"
+#include "test_util.h"
+#include "guest_modes.h"
+#include "processor.h"
+
+static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
+{
+	uint64_t gpa;
+
+	for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
+		*((volatile uint64_t *)gpa) = gpa;
+
+	GUEST_DONE();
+}
+
+struct vcpu_info {
+	struct kvm_vm *vm;
+	uint32_t id;
+	uint64_t start_gpa;
+	uint64_t end_gpa;
+};
+
+static int nr_vcpus;
+static atomic_t rendezvous;
+
+static void rendezvous_with_boss(void)
+{
+	int orig = atomic_read(&rendezvous);
+
+	if (orig > 0) {
+		atomic_dec_and_test(&rendezvous);
+		while (atomic_read(&rendezvous) > 0)
+			cpu_relax();
+	} else {
+		atomic_inc(&rendezvous);
+		while (atomic_read(&rendezvous) < 0)
+			cpu_relax();
+	}
+}
+
+static void run_vcpu(struct kvm_vm *vm, uint32_t vcpu_id)
+{
+	vcpu_run(vm, vcpu_id);
+	ASSERT_EQ(get_ucall(vm, vcpu_id, NULL), UCALL_DONE);
+}
+
+static void *vcpu_worker(void *data)
+{
+	struct vcpu_info *vcpu = data;
+	struct kvm_vm *vm = vcpu->vm;
+	struct kvm_sregs sregs;
+	struct kvm_regs regs;
+
+	vcpu_args_set(vm, vcpu->id, 3, vcpu->start_gpa, vcpu->end_gpa,
+		      vm_get_page_size(vm));
+
+	/* Snapshot regs before the first run. */
+	vcpu_regs_get(vm, vcpu->id, &regs);
+	rendezvous_with_boss();
+
+	run_vcpu(vm, vcpu->id);
+	rendezvous_with_boss();
+	vcpu_regs_set(vm, vcpu->id, &regs);
+	vcpu_sregs_get(vm, vcpu->id, &sregs);
+#ifdef __x86_64__
+	/* Toggle CR0.WP to trigger a MMU context reset. */
+	sregs.cr0 ^= X86_CR0_WP;
+#endif
+	vcpu_sregs_set(vm, vcpu->id, &sregs);
+	rendezvous_with_boss();
+
+	run_vcpu(vm, vcpu->id);
+	rendezvous_with_boss();
+
+	return NULL;
+}
+
+static pthread_t *spawn_workers(struct kvm_vm *vm, uint64_t start_gpa,
+				uint64_t end_gpa)
+{
+	struct vcpu_info *info;
+	uint64_t gpa, nr_bytes;
+	pthread_t *threads;
+	int i;
+
+	threads = malloc(nr_vcpus * sizeof(*threads));
+	TEST_ASSERT(threads, "Failed to allocate vCPU threads");
+
+	info = malloc(nr_vcpus * sizeof(*info));
+	TEST_ASSERT(info, "Failed to allocate vCPU gpa ranges");
+
+	nr_bytes = ((end_gpa - start_gpa) / nr_vcpus) &
+			~((uint64_t)vm_get_page_size(vm) - 1);
+	TEST_ASSERT(nr_bytes, "C'mon, no way you have %d CPUs", nr_vcpus);
+
+	for (i = 0, gpa = start_gpa; i < nr_vcpus; i++, gpa += nr_bytes) {
+		info[i].vm = vm;
+		info[i].id = i;
+		info[i].start_gpa = gpa;
+		info[i].end_gpa = gpa + nr_bytes;
+		pthread_create(&threads[i], NULL, vcpu_worker, &info[i]);
+	}
+	return threads;
+}
+
+static void rendezvous_with_vcpus(struct timespec *time, const char *name)
+{
+	int i, rendezvoused;
+
+	pr_info("Waiting for vCPUs to finish %s...\n", name);
+
+	rendezvoused = atomic_read(&rendezvous);
+	for (i = 0; abs(rendezvoused) != 1; i++) {
+		usleep(100);
+		if (!(i & 0x3f))
+			pr_info("\r%d vCPUs haven't rendezvoused...",
+				abs(rendezvoused) - 1);
+		rendezvoused = atomic_read(&rendezvous);
+	}
+
+	clock_gettime(CLOCK_MONOTONIC, time);
+
+	/* Release the vCPUs after getting the time of the previous action. */
+	pr_info("\rAll vCPUs finished %s, releasing...\n", name);
+	if (rendezvoused > 0)
+		atomic_set(&rendezvous, -nr_vcpus - 1);
+	else
+		atomic_set(&rendezvous, nr_vcpus + 1);
+}
+
+static void calc_default_nr_vcpus(void)
+{
+	cpu_set_t possible_mask;
+	int r;
+
+	r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask);
+	TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)",
+		    errno, strerror(errno));
+
+	nr_vcpus = CPU_COUNT(&possible_mask);
+	TEST_ASSERT(nr_vcpus > 0, "Uh, no CPUs?");
+}
+
+int main(int argc, char *argv[])
+{
+	/*
+	 * Skip the first 4gb and slot0.  slot0 maps <1gb and is used to back
+	 * the guest's code, stack, and page tables.  Because selftests creates
+	 * an IRQCHIP, a.k.a. a local APIC, KVM creates an internal memslot
+	 * just below the 4gb boundary.  This test could create memory at
+	 * 1gb-3gb, but it's simpler to skip straight to 4gb.
+	 */
+	const uint64_t size_1gb = (1 << 30);
+	const uint64_t start_gpa = (4ull * size_1gb);
+	const int first_slot = 1;
+
+	struct timespec time_start, time_run1, time_reset, time_run2;
+	uint64_t max_gpa, gpa, slot_size, max_mem, i;
+	int max_slots, slot, opt, fd;
+	bool hugepages = false;
+	pthread_t *threads;
+	struct kvm_vm *vm;
+	void *mem;
+
+	/*
+	 * Default to 2gb so that maxing out systems with MAXPHYADDR=46, which
+	 * are quite common for x86, requires changing only max_mem (KVM allows
+	 * 32k memslots, 32k * 2gb == ~64tb of guest memory).
+	 */
+	slot_size = 2 * size_1gb;
+
+	max_slots = kvm_check_cap(KVM_CAP_NR_MEMSLOTS);
+	TEST_ASSERT(max_slots > first_slot, "KVM is broken");
+
+	/* All KVM MMUs should be able to survive a 512gb guest. */
+	max_mem = 512 * size_1gb;
+
+	calc_default_nr_vcpus();
+
+	while ((opt = getopt(argc, argv, "c:h:m:s:u")) != -1) {
+		switch (opt) {
+		case 'c':
+			nr_vcpus = atoi(optarg);
+			TEST_ASSERT(nr_vcpus, "#DE");
+			break;
+		case 'm':
+			max_mem = atoi(optarg) * size_1gb;
+			TEST_ASSERT(max_mem, "#DE");
+			break;
+		case 's':
+			slot_size = atoi(optarg) * size_1gb;
+			TEST_ASSERT(slot_size, "#DE");
+			break;
+		case 'u':
+			hugepages = true;
+			break;
+		case 'h':
+		default:
+			printf("usage: %s [-c nr_vcpus] [-m max_mem_in_gb] [-s slot_size_in_gb] [-u [huge_page_size]]\n", argv[0]);
+			exit(1);
+		}
+	}
+
+	vm = vm_create_default_with_vcpus(nr_vcpus, 0, 0, guest_code, NULL);
+
+	max_gpa = vm_get_max_gfn(vm) << vm_get_page_shift(vm);
+	TEST_ASSERT(max_gpa > (4 * slot_size), "MAXPHYADDR <4gb ");
+
+	fd = kvm_memfd_alloc(slot_size, hugepages);
+	mem = mmap(NULL, slot_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap() failed");
+
+	TEST_ASSERT(!madvise(mem, slot_size, MADV_NOHUGEPAGE), "madvise() failed");
+
+	/* Pre-fault the memory to avoid taking mmap_sem on guest page faults. */
+	for (i = 0; i < slot_size; i += vm_get_page_size(vm))
+		((uint8_t *)mem)[i] = 0xaa;
+
+	gpa = 0;
+	for (slot = first_slot; slot < max_slots; slot++) {
+		gpa = start_gpa + ((slot - first_slot) * slot_size);
+		if (gpa + slot_size > max_gpa)
+			break;
+
+		if ((gpa - start_gpa) >= max_mem)
+			break;
+
+		vm_set_user_memory_region(vm, slot, 0, gpa, slot_size, mem);
+
+#ifdef __x86_64__
+		/* Identity map memory in the guest using 1gb pages. */
+		for (i = 0; i < slot_size; i += size_1gb)
+			__virt_pg_map(vm, gpa + i, gpa + i, X86_PAGE_SIZE_1G);
+#else
+		for (i = 0; i < slot_size; i += vm_get_page_size(vm))
+			virt_pg_map(vm, gpa + i, gpa + i);
+#endif
+	}
+
+	atomic_set(&rendezvous, nr_vcpus + 1);
+	threads = spawn_workers(vm, start_gpa, gpa);
+
+	pr_info("Running with %lugb of guest memory and %u vCPUs\n",
+		(gpa - start_gpa) / size_1gb, nr_vcpus);
+
+	rendezvous_with_vcpus(&time_start, "spawning");
+	rendezvous_with_vcpus(&time_run1, "run 1");
+	rendezvous_with_vcpus(&time_reset, "reset");
+	rendezvous_with_vcpus(&time_run2, "run 2");
+
+	time_run2  = timespec_sub(time_run2,   time_reset);
+	time_reset = timespec_sub(time_reset, time_run1);
+	time_run1  = timespec_sub(time_run1,   time_start);
+
+	pr_info("run1 = %ld.%.9lds, reset = %ld.%.9lds, run2 =  %ld.%.9lds\n",
+		time_run1.tv_sec, time_run1.tv_nsec,
+		time_reset.tv_sec, time_reset.tv_nsec,
+		time_run2.tv_sec, time_run2.tv_nsec);
+
+	/*
+	 * Delete even numbered slots (arbitrary) and unmap the first half of
+	 * the backing (also arbitrary) to verify KVM correctly drops all
+	 * references to the removed regions.
+	 */
+	for (slot = (slot - 1) & ~1ull; slot >= first_slot; slot -= 2)
+		vm_set_user_memory_region(vm, slot, 0, 0, 0, NULL);
+
+	munmap(mem, slot_size / 2);
+
+	/* Sanity check that the vCPUs actually ran. */
+	for (i = 0; i < nr_vcpus; i++)
+		pthread_join(threads[i], NULL);
+
+	/*
+	 * Deliberately exit without deleting the remaining memslots or closing
+	 * kvm_fd to test cleanup via mmu_notifier.release.
+	 */
+}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 06/30] KVM: x86/mmu: only perform eager page splitting on valid roots
  2022-03-03 19:38 ` [PATCH v4 06/30] KVM: x86/mmu: only perform eager page splitting on valid roots Paolo Bonzini
@ 2022-03-03 20:03   ` Sean Christopherson
  0 siblings, 0 replies; 62+ messages in thread
From: Sean Christopherson @ 2022-03-03 20:03 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> Eager page splitting is an optimization; it does not have to be performed on
> invalid roots.  It is also the only case in which a reader might acquire
> a reference to an invalid root, so after this change we know that readers
> will skip both dying and invalid roots.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---

Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 07/30] KVM: x86/mmu: do not allow readers to acquire references to invalid roots
  2022-03-03 19:38 ` [PATCH v4 07/30] KVM: x86/mmu: do not allow readers to acquire references to invalid roots Paolo Bonzini
@ 2022-03-03 20:12   ` Sean Christopherson
  0 siblings, 0 replies; 62+ messages in thread
From: Sean Christopherson @ 2022-03-03 20:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> Remove the "shared" argument of for_each_tdp_mmu_root_yield_safe, thus ensuring
> that readers do not ever acquire a reference to an invalid root.  After this
> patch, all readers except kvm_tdp_mmu_zap_invalidated_roots() treat
> refcount=0/valid, refcount=0/invalid and refcount=1/invalid in exactly the
> same way.  kvm_tdp_mmu_zap_invalidated_roots() is different but it also
> does not acquire a reference to the invalid root, and it cannot see
> refcount=0/invalid because it is guaranteed to run after
> kvm_tdp_mmu_invalidate_all_roots().
> 
> Opportunistically add a lockdep assertion to the yield-safe iterator.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---

Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker
  2022-03-03 19:38 ` [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker Paolo Bonzini
@ 2022-03-03 20:54   ` Sean Christopherson
  2022-03-03 21:06     ` Sean Christopherson
  2022-03-03 21:20   ` Sean Christopherson
  1 sibling, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2022-03-03 20:54 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0b88592495f8..9287ee078c49 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5730,7 +5730,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>  	kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
>  
>  	kvm_zap_obsolete_pages(kvm);
> -

Spurious whitespace deletion.

>  	write_unlock(&kvm->mmu_lock);
>  
>  	/*
> @@ -5741,11 +5740,8 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>  	 * Deferring the zap until the final reference to the root is put would
>  	 * lead to use-after-free.
>  	 */
> -	if (is_tdp_mmu_enabled(kvm)) {
> -		read_lock(&kvm->mmu_lock);
> +	if (is_tdp_mmu_enabled(kvm))
>  		kvm_tdp_mmu_zap_invalidated_roots(kvm);
> -		read_unlock(&kvm->mmu_lock);
> -	}
>  }
>  
>  static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)

...

> +static void tdp_mmu_schedule_zap_root(struct kvm *kvm, struct kvm_mmu_page *root)
> +{

Definitely worth doing (I'll provide more info in the "Zap defunct roots" patch):

	WARN_ON_ONCE(!root->role.invalid || root->tdp_mmu_async_data);

The assertion on role.invalid is a little overkill, but might help document when
and how this is used.

> +	root->tdp_mmu_async_data = kvm;
> +	INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_work);
> +	queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
> +}
> +
> +static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
> +{
> +	union kvm_mmu_page_role role = page->role;
> +	role.invalid = true;
> +
> +	/* No need to use cmpxchg, only the invalid bit can change.  */
> +	role.word = xchg(&page->role.word, role.word);
> +	return role.invalid;

This helper is unused.  It _could_ be used here, but I think it belongs in the
next patch.  Critically, until zapping defunct roots creates the invariant that
invalid roots are _always_ zapped via worker, kvm_tdp_mmu_invalidate_all_roots()
must not assume that an invalid root is queued for zapping.  I.e. doing this
before the "Zap defunct roots" would be wrong:

	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
		if (kvm_tdp_root_mark_invalid(root))
			continue;

		if (WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
			continue;

		tdp_mmu_schedule_zap_root(kvm, root);
	}

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker
  2022-03-03 20:54   ` Sean Christopherson
@ 2022-03-03 21:06     ` Sean Christopherson
  0 siblings, 0 replies; 62+ messages in thread
From: Sean Christopherson @ 2022-03-03 21:06 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

On Thu, Mar 03, 2022, Sean Christopherson wrote:
> On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> > +	root->tdp_mmu_async_data = kvm;
> > +	INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_work);
> > +	queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
> > +}
> > +
> > +static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
> > +{
> > +	union kvm_mmu_page_role role = page->role;
> > +	role.invalid = true;
> > +
> > +	/* No need to use cmpxchg, only the invalid bit can change.  */
> > +	role.word = xchg(&page->role.word, role.word);
> > +	return role.invalid;
> 
> This helper is unused.  It _could_ be used here, but I think it belongs in the
> next patch.  Critically, until zapping defunct roots creates the invariant that
> invalid roots are _always_ zapped via worker, kvm_tdp_mmu_invalidate_all_roots()
> must not assume that an invalid root is queued for zapping.  I.e. doing this
> before the "Zap defunct roots" would be wrong:
> 
> 	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
> 		if (kvm_tdp_root_mark_invalid(root))
> 			continue;
> 
> 		if (WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)));
> 			continue;
> 
> 		tdp_mmu_schedule_zap_root(kvm, root);
> 	}

Gah, lost my train of thought and forgot that this _can_ re-queue a root even in
this patch, it just can't re-queue a root that is _currently_ queued.

The re-queue scenario happens if a root is queued and zapped, but is kept alive
by a vCPU that hasn't yet put its reference.  If another memslot update comes along before
the (sleeping) vCPU drops its reference, this will re-queue the root.

It's not a major problem in this patch as it's a small amount of wasted effort,
but it will be an issue when the "put" path starts using the queue, as that will
create a scenario where a memslot update (or NX toggle) can come along while a
defunct root is in the zap queue.

Checking for role.invalid is wrong (as above), so for this patch I think the
easiest thing is to use tdp_mmu_async_data as a sentinel that the root was zapped
in the past and doesn't need to be re-zapped.

/*
 * Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
 * is about to be zapped, e.g. in response to a memslots update.  The actual
 * zapping is performed asynchronously, so a reference is taken on all roots.
 * Using a separate workqueue makes it easy to ensure that the destruction is
 * performed before the "fast zap" completes, without keeping a separate list
 * of invalidated roots; the list is effectively the list of work items in
 * the workqueue.
 *
 * Skip roots that were already queued for zapping, the "fast zap" path is the
 * only user of the zap queue and always flushes the queue under slots_lock,
 * i.e. the queued zap is guaranteed to have completed already.
 *
 * Because mmu_lock is held for write, it should be impossible to observe a
 * root with zero refcount, i.e. the list of roots cannot be stale.
 *
 * This has essentially the same effect for the TDP MMU
 * as updating mmu_valid_gen does for the shadow MMU.
 */
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
{
	struct kvm_mmu_page *root;

	lockdep_assert_held_write(&kvm->mmu_lock);
	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
		if (root->tdp_mmu_async_data)
			continue;

		if (WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
			continue;

		root->role.invalid = true;
		tdp_mmu_schedule_zap_root(kvm, root);
	}
}

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker
  2022-03-03 19:38 ` [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker Paolo Bonzini
  2022-03-03 20:54   ` Sean Christopherson
@ 2022-03-03 21:20   ` Sean Christopherson
  2022-03-03 21:32     ` Sean Christopherson
  1 sibling, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2022-03-03 21:20 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> The only issue is that kvm_tdp_mmu_invalidate_all_roots() now assumes that
> there is at least one reference in kvm->users_count; so if the VM is dying
> just go through the slow path, as there is nothing to gain by using the fast
> zapping.

This isn't actually implemented. :-)

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker
  2022-03-03 21:20   ` Sean Christopherson
@ 2022-03-03 21:32     ` Sean Christopherson
  2022-03-04  6:48       ` Paolo Bonzini
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2022-03-03 21:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

On Thu, Mar 03, 2022, Sean Christopherson wrote:
> On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> > The only issue is that kvm_tdp_mmu_invalidate_all_roots() now assumes that
> > there is at least one reference in kvm->users_count; so if the VM is dying
> > just go through the slow path, as there is nothing to gain by using the fast
> > zapping.
> 
> This isn't actually implemented. :-)

Oh, and when you implement it (or copy paste), can you also add a lockdep and a
comment about the check being racy, but that the race is benign?  It took me a
second to realize why it's safe to use a work queue without holding a reference
to @kvm.

static void kvm_mmu_zap_all_fast(struct kvm *kvm)
{
	lockdep_assert_held(&kvm->slots_lock);

	/*
	 * Zap using the "slow" path if the VM is being destroyed.  The "slow"
	 * path isn't actually slower, it just doesn't block vCPUs for an
	 * extended duration, which is irrelevant if the VM is dying.
	 *
	 * Note, this doesn't guarantee users_count won't go to '0' immediately
	 * after this check, but that race is benign as callers that don't hold
	 * a reference to @kvm must hold kvm_lock to prevent use-after-free.
	 */
	if (unlikely(!refcount_read(&kvm->users_count))) {
		lockdep_assert_held(&kvm_lock);
		kvm_mmu_zap_all(kvm);
		return;
	}

	write_lock(&kvm->mmu_lock);
	trace_kvm_mmu_zap_all_fast(kvm);

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 24/30] KVM: x86/mmu: Zap defunct roots via asynchronous worker
  2022-03-03 19:38 ` [PATCH v4 24/30] KVM: x86/mmu: Zap defunct roots via asynchronous worker Paolo Bonzini
@ 2022-03-03 22:08   ` Sean Christopherson
  0 siblings, 0 replies; 62+ messages in thread
From: Sean Christopherson @ 2022-03-03 22:08 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> Zap defunct roots, a.k.a. roots that have been invalidated after their
> last reference was initially dropped, asynchronously via the system work
> queue instead of forcing the work upon the unfortunate task that happened
> to drop the last reference.
> 
> If a vCPU task drops the last reference, the vCPU is effectively blocked
> by the host for the entire duration of the zap.  If the root being zapped
> happens be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
> being active, the zap can take several hundred seconds.  Unsurprisingly,
> most guests are unhappy if a vCPU disappears for hundreds of seconds.
> 
> E.g. running a synthetic selftest that triggers a vCPU root zap with
> ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
> Offloading the zap to a worker drops the block time to <100ms.
> 
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Ben Gardon <bgardon@google.com>
> Message-Id: <20220226001546.360188-23-seanjc@google.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index e24a1bff9218..2456f880508d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -170,13 +170,24 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  	 */
>  	if (!kvm_tdp_root_mark_invalid(root)) {
>  		refcount_set(&root->tdp_mmu_root_count, 1);
> -		tdp_mmu_zap_root(kvm, root, shared);
>  
>  		/*
> -		 * Give back the reference that was added back above.  We now
> +		 * If the struct kvm is alive, we might as well zap the root
> +		 * in a worker.  The worker takes ownership of the reference we
> +		 * just added to root and is flushed before the struct kvm dies.

Not a fan of the "we might as well zap the root in a worker", IMO we should require
going forward that invalidated, reachable TDP MMU roots are always zapped in a worker.

> +		 */
> +		if (likely(refcount_read(&kvm->users_count))) {
> +			tdp_mmu_schedule_zap_root(kvm, root);

Regarding the need for kvm_tdp_mmu_invalidate_all_roots() to guard against
re-queueing a root for zapping, this is the point where it becomes functionally
problematic.  When "fast zap" was the only user of tdp_mmu_schedule_zap_root(),
re-queueing was benign as the work struct was guaranteed to not be currently
queued.  But this code runs outside of slots_lock, and so a root that was "put"
but hasn't finished zapping can be observed and re-queued by the "fast zap".

I think it makes sense to create a rule/invariant that an invalidated TDP MMU root
_must_ be zapped via the work queue.

I.e. do this as fixup:

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 40bf861b622a..cff4f2102a63 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1019,8 +1019,9 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
  * of invalidated roots; the list is effectively the list of work items in
  * the workqueue.
  *
- * Skip roots that are already queued for zapping, flushing the work queue will
- * ensure invalidated roots are zapped regardless of when they were queued.
+ * Skip roots that are already invalid and thus queued for zapping, flushing
+ * the work queue will ensure invalid roots are zapped regardless of when they
+ * were queued.
  *
  * Because mmu_lock is held for write, it should be impossible to observe a
  * root with zero refcount,* i.e. the list of roots cannot be stale.
@@ -1034,13 +1035,12 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)

        lockdep_assert_held_write(&kvm->mmu_lock);
        list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
-               if (root->tdp_mmu_async_data)
+               if (kvm_tdp_root_mark_invalid(root))
                        continue;

                if (WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
                        continue;

-               root->role.invalid = true;
                tdp_mmu_schedule_zap_root(kvm, root);
        }
 }

> +			return;
> +		}
> +
> +		/*
> +		 * The struct kvm is being destroyed, zap synchronously and give
> +		 * back immediately the reference that was added above.  We now
>  		 * know that the root is invalid, so go ahead and free it if
>  		 * no one has taken a reference in the meanwhile.
>  		 */
> +		tdp_mmu_zap_root(kvm, root, shared);
>  		if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
>  			return;
>  	}
> -- 
> 2.31.1
> 
>

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 03/30] KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic
  2022-03-03 19:38 ` [PATCH v4 03/30] KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic Paolo Bonzini
@ 2022-03-03 23:39   ` Mingwei Zhang
  0 siblings, 0 replies; 62+ messages in thread
From: Mingwei Zhang @ 2022-03-03 23:39 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, David Hildenbrand,
	David Matlack, Ben Gardon

On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Explicitly ignore the result of zap_gfn_range() when putting the last
> reference to a TDP MMU root, and add a pile of comments to formalize the
> TDP MMU's behavior of deferring TLB flushes to alloc/reuse.  Note, this
> only affects the !shared case, as zap_gfn_range() subtly never returns
> true for "flush" as the flush is handled by tdp_mmu_zap_spte_atomic().
> 
> Putting the root without a flush is ok because even if there are stale
> references to the root in the TLB, they are unreachable because KVM will
> not run the guest with the same ASID without first flushing (where ASID
> in this context refers to both SVM's explicit ASID and Intel's implicit
> ASID that is constructed from VPID+PCID+EPT4A+etc...).
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Message-Id: <20220226001546.360188-5-seanjc@google.com>
> Reviewed-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     |  8 ++++++++
>  arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++++++-
>  2 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 32c041ed65cb..9a6df2d02777 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5083,6 +5083,14 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
>  	kvm_mmu_sync_roots(vcpu);
>  
>  	kvm_mmu_load_pgd(vcpu);
> +
> +	/*
> +	 * Flush any TLB entries for the new root, the provenance of the root
> +	 * is unknown.  Even if KVM ensures there are no stale TLB entries
> +	 * for a freed root, in theory another hypervisor could have left
> +	 * stale entries.  Flushing on alloc also allows KVM to skip the TLB
> +	 * flush when freeing a root (see kvm_tdp_mmu_put_root()).
> +	 */
>  	static_call(kvm_x86_flush_tlb_current)(vcpu);
>  out:
>  	return r;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b97a4125feac..921fa386df99 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -93,7 +93,15 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  	list_del_rcu(&root->link);
>  	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>  
> -	zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> +	/*
> +	 * A TLB flush is not necessary as KVM performs a local TLB flush when
> +	 * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
> +	 * to a different pCPU.  Note, the local TLB flush on reuse also
> +	 * invalidates any paging-structure-cache entries, i.e. TLB entries for
> +	 * intermediate paging structures, that may be zapped, as such entries
> +	 * are associated with the ASID on both VMX and SVM.
> +	 */
> +	(void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);

Discussed offline with Sean. I'm now comfortable with the MMU's style of
keeping multiple 'roots' and leaving the TLB unflushed for invalidated
roots.

I guess one minor improvement on the comment could be:

"A TLB flush is not necessary as each vCPU performs a local TLB flush
when allocating or assigning a new root (see kvm_mmu_load()), and when
migrating to a different pCPU."

The above could be better since "KVM performs a local TLB flush" makes
readers wonder why the 'remote' TLB flushes are missing.
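
i.e., roughly how the hunk above would read with that wording dropped in
(a sketch only, keeping the rest of the comment as-is):

	/*
	 * A TLB flush is not necessary as each vCPU performs a local TLB
	 * flush when allocating or assigning a new root (see kvm_mmu_load()),
	 * and when migrating to a different pCPU.  Note, the local TLB flush
	 * on reuse also invalidates any paging-structure-cache entries, i.e.
	 * TLB entries for intermediate paging structures, that may be zapped,
	 * as such entries are associated with the ASID on both VMX and SVM.
	 */
	(void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);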
>  
>  	call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
>  }
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 16/30] KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page
  2022-03-03 19:38 ` [PATCH v4 16/30] KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page Paolo Bonzini
@ 2022-03-04  0:07   ` Mingwei Zhang
  0 siblings, 0 replies; 62+ messages in thread
From: Mingwei Zhang @ 2022-03-04  0:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, David Hildenbrand,
	David Matlack, Ben Gardon

On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Add a dedicated helper for zapping a TDP MMU root, and use it in the three
> flows that do "zap_all" and intentionally do not do a TLB flush if SPTEs
> are zapped (zapping an entire root is safe if and only if it cannot be in
> use by any vCPU).  Because a TLB flush is never required, unconditionally
> pass "false" to tdp_mmu_iter_cond_resched() when potentially yielding.
> 
> Opportunistically document why KVM must not yield when zapping roots that
> are being zapped by kvm_tdp_mmu_put_root(), i.e. roots whose refcount has
> reached zero, and further harden the flow to detect improper KVM behavior
> with respect to roots that are supposed to be unreachable.
> 
> In addition to hardening zapping of roots, isolating zapping of roots
> will allow future simplification of zap_gfn_range() by having it zap only
> leaf SPTEs, and by removing its tricky "zap all" heuristic.  By having
> all paths that truly need to free _all_ SPs flow through the dedicated
> root zapper, the generic zapper can be freed of those concerns.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Ben Gardon <bgardon@google.com>
> Message-Id: <20220226001546.360188-16-seanjc@google.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Reviewed-by: Mingwei Zhang <mizhang@google.com>

> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 98 +++++++++++++++++++++++++++++++-------
>  1 file changed, 82 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index f59f3ff5cb75..970376297b30 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -56,10 +56,6 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>  	rcu_barrier();
>  }
>  
> -static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> -			  gfn_t start, gfn_t end, bool can_yield, bool flush,
> -			  bool shared);
> -
>  static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
>  {
>  	free_page((unsigned long)sp->spt);
> @@ -82,6 +78,9 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
>  	tdp_mmu_free_sp(sp);
>  }
>  
> +static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> +			     bool shared);
> +
>  void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  			  bool shared)
>  {
> @@ -104,7 +103,7 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  	 * intermediate paging structures, that may be zapped, as such entries
>  	 * are associated with the ASID on both VMX and SVM.
>  	 */
> -	(void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> +	tdp_mmu_zap_root(kvm, root, shared);
>  
>  	call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
>  }
> @@ -737,6 +736,76 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
>  	return iter->yielded;
>  }
>  
> +static inline gfn_t tdp_mmu_max_gfn_host(void)
> +{
> +	/*
> +	 * Bound TDP MMU walks at host.MAXPHYADDR, guest accesses beyond that
> +	 * will hit a #PF(RSVD) and never hit an EPT Violation/Misconfig / #NPF,
> +	 * and so KVM will never install a SPTE for such addresses.
> +	 */
> +	return 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> +}
> +
> +static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> +			     bool shared)
> +{
> +	bool root_is_unreachable = !refcount_read(&root->tdp_mmu_root_count);
> +	struct tdp_iter iter;
> +
> +	gfn_t end = tdp_mmu_max_gfn_host();
> +	gfn_t start = 0;
> +
> +	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> +
> +	rcu_read_lock();
> +
> +	/*
> +	 * No need to try to step down in the iterator when zapping an entire
> +	 * root, zapping an upper-level SPTE will recurse on its children.
> +	 */
> +	for_each_tdp_pte_min_level(iter, root, root->role.level, start, end) {
> +retry:
> +		/*
> +		 * Yielding isn't allowed when zapping an unreachable root as
> +		 * the root won't be processed by mmu_notifier callbacks.  When
> +		 * handling an unmap/release mmu_notifier command, KVM must
> +		 * drop all references to relevant pages prior to completing
> +		 * the callback.  Dropping mmu_lock can result in zapping SPTEs
> +		 * for an unreachable root after a relevant callback completes,
> +		 * which leads to use-after-free as zapping a SPTE triggers
> +		 * "writeback" of dirty/accessed bits to the SPTE's associated
> +		 * struct page.
> +		 */
> +		if (!root_is_unreachable &&
> +		    tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
> +			continue;
> +
> +		if (!is_shadow_present_pte(iter.old_spte))
> +			continue;
> +
> +		if (!shared) {
> +			tdp_mmu_set_spte(kvm, &iter, 0);
> +		} else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0)) {
> +			/*
> +			 * cmpxchg() shouldn't fail if the root is unreachable.
> +			 * Retry so as not to leak the page and its children.
> +			 */
> +			WARN_ONCE(root_is_unreachable,
> +				  "Contended TDP MMU SPTE in unreachable root.");
> +			goto retry;
> +		}
> +
> +		/*
> +		 * WARN if the root is invalid and is unreachable, all SPTEs
> +		 * should've been zapped by kvm_tdp_mmu_zap_invalidated_roots(),
> +		 * and inserting new SPTEs under an invalid root is a KVM bug.
> +		 */
> +		WARN_ON_ONCE(root_is_unreachable && root->role.invalid);
> +	}
> +
> +	rcu_read_unlock();
> +}
> +
>  bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>  {
>  	u64 old_spte;
> @@ -785,8 +854,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  			  gfn_t start, gfn_t end, bool can_yield, bool flush,
>  			  bool shared)
>  {
> -	gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> -	bool zap_all = (start == 0 && end >= max_gfn_host);
> +	bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
>  	struct tdp_iter iter;
>  
>  	/*
> @@ -795,12 +863,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  	 */
>  	int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
>  
> -	/*
> -	 * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> -	 * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> -	 * and so KVM will never install a SPTE for such addresses.
> -	 */
> -	end = min(end, max_gfn_host);
> +	end = min(end, tdp_mmu_max_gfn_host());
>  
>  	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
>  
> @@ -860,6 +923,7 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
>  
>  void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>  {
> +	struct kvm_mmu_page *root;
>  	int i;
>  
>  	/*
> @@ -867,8 +931,10 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>  	 * is being destroyed or the userspace VMM has exited.  In both cases,
>  	 * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
>  	 */
> -	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> -		(void)kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, false);
> +	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +		for_each_tdp_mmu_root_yield_safe(kvm, root, i)
> +			tdp_mmu_zap_root(kvm, root, false);
> +	}
>  }
>  
>  static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
> @@ -925,7 +991,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
>  		 * will still flush on yield, but that's a minor performance
>  		 * blip and not a functional issue.
>  		 */
> -		(void)zap_gfn_range(kvm, root, 0, -1ull, true, false, true);
> +		tdp_mmu_zap_root(kvm, root, true);
>  
>  		/*
>  		 * Put the reference acquired in
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 17/30] KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range
  2022-03-03 19:38 ` [PATCH v4 17/30] KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range Paolo Bonzini
@ 2022-03-04  0:14   ` Mingwei Zhang
  0 siblings, 0 replies; 62+ messages in thread
From: Mingwei Zhang @ 2022-03-04  0:14 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, David Hildenbrand,
	David Matlack, Ben Gardon

On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Now that all callers of zap_gfn_range() hold mmu_lock for write, drop
> support for zapping with mmu_lock held for read.  That all callers hold
> mmu_lock for write isn't a random coincidence; now that the paths that
> need to zap _everything_ have their own path, the only callers left are
> those that need to zap for functional correctness.  And when zapping is
> required for functional correctness, mmu_lock must be held for write,
> otherwise the caller has no guarantees about the state of the TDP MMU
> page tables after it has run, e.g. the SPTE(s) it zapped can be
> immediately replaced by a vCPU faulting in a page.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Ben Gardon <bgardon@google.com>
> Message-Id: <20220226001546.360188-17-seanjc@google.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Reviewed-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 24 ++++++------------------
>  1 file changed, 6 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 970376297b30..f3939ce4a115 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -844,15 +844,9 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>   * function cannot yield, it will not release the MMU lock or reschedule and
>   * the caller must ensure it does not supply too large a GFN range, or the
>   * operation can cause a soft lockup.
> - *
> - * If shared is true, this thread holds the MMU lock in read mode and must
> - * account for the possibility that other threads are modifying the paging
> - * structures concurrently. If shared is false, this thread should hold the
> - * MMU lock in write mode.
>   */
>  static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> -			  gfn_t start, gfn_t end, bool can_yield, bool flush,
> -			  bool shared)
> +			  gfn_t start, gfn_t end, bool can_yield, bool flush)
>  {
>  	bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
>  	struct tdp_iter iter;
> @@ -865,14 +859,13 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  
>  	end = min(end, tdp_mmu_max_gfn_host());
>  
> -	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> +	lockdep_assert_held_write(&kvm->mmu_lock);
>  
>  	rcu_read_lock();
>  
>  	for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
> -retry:
>  		if (can_yield &&
> -		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
> +		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
>  			flush = false;
>  			continue;
>  		}
> @@ -891,12 +884,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  		    !is_last_spte(iter.old_spte, iter.level))
>  			continue;
>  
> -		if (!shared) {
> -			tdp_mmu_set_spte(kvm, &iter, 0);
> -			flush = true;
> -		} else if (tdp_mmu_zap_spte_atomic(kvm, &iter)) {
> -			goto retry;
> -		}
> +		tdp_mmu_set_spte(kvm, &iter, 0);
> +		flush = true;
>  	}
>  
>  	rcu_read_unlock();
> @@ -915,8 +904,7 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
>  	struct kvm_mmu_page *root;
>  
>  	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
> -		flush = zap_gfn_range(kvm, root, start, end, can_yield, flush,
> -				      false);
> +		flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
>  
>  	return flush;
>  }
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 18/30] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()
  2022-03-03 19:38 ` [PATCH v4 18/30] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range() Paolo Bonzini
@ 2022-03-04  1:16   ` Mingwei Zhang
  2022-03-04 16:11     ` Sean Christopherson
  2022-03-11 15:09   ` Vitaly Kuznetsov
  2022-03-13 18:40   ` Mingwei Zhang
  2 siblings, 1 reply; 62+ messages in thread
From: Mingwei Zhang @ 2022-03-04  1:16 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, David Hildenbrand,
	David Matlack, Ben Gardon

On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Zap only leaf SPTEs in the TDP MMU's zap_gfn_range(), and rename various
> functions accordingly.  When removing mappings for functional correctness
> (except for the stupid VFIO GPU passthrough memslots bug), zapping the
> leaf SPTEs is sufficient as the paging structures themselves do not point
> at guest memory and do not directly impact the final translation (in the
> TDP MMU).
> 
> Note, this aligns the TDP MMU with the legacy/full MMU, which zaps only
> the rmaps, a.k.a. leaf SPTEs, in kvm_zap_gfn_range() and
> kvm_unmap_gfn_range().
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Ben Gardon <bgardon@google.com>
> Message-Id: <20220226001546.360188-18-seanjc@google.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     |  4 ++--
>  arch/x86/kvm/mmu/tdp_mmu.c | 41 ++++++++++----------------------------
>  arch/x86/kvm/mmu/tdp_mmu.h |  8 +-------
>  3 files changed, 14 insertions(+), 39 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8408d7db8d2a..febdcaaa7b94 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5834,8 +5834,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  
>  	if (is_tdp_mmu_enabled(kvm)) {
>  		for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> -			flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start,
> -							  gfn_end, flush);
> +			flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start,
> +						      gfn_end, true, flush);
>  	}
>  
>  	if (flush)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index f3939ce4a115..c71debdbc732 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -834,10 +834,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>  }
>  
>  /*
> - * Tears down the mappings for the range of gfns, [start, end), and frees the
> - * non-root pages mapping GFNs strictly within that range. Returns true if
> - * SPTEs have been cleared and a TLB flush is needed before releasing the
> - * MMU lock.
> + * Zap leafs SPTEs for the range of gfns, [start, end). Returns true if SPTEs
> + * have been cleared and a TLB flush is needed before releasing the MMU lock.

I think the original code does not _over_ zap, but the new version does.
Will that have side effects? In particular, if the range falls within a
huge page (or a HugeTLB page of one of the various sizes), then we choose
to zap the whole page even though it covers more than the range.

Regardless of side effect, I think we probably should mention that in
the comments?

>   *
>   * If can_yield is true, will release the MMU lock and reschedule if the
>   * scheduler needs the CPU or there is contention on the MMU lock. If this
> @@ -845,42 +843,25 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>   * the caller must ensure it does not supply too large a GFN range, or the
>   * operation can cause a soft lockup.
>   */
> -static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> -			  gfn_t start, gfn_t end, bool can_yield, bool flush)
> +static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> +			      gfn_t start, gfn_t end, bool can_yield, bool flush)
>  {
> -	bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
>  	struct tdp_iter iter;
>  
> -	/*
> -	 * No need to try to step down in the iterator when zapping all SPTEs,
> -	 * zapping the top-level non-leaf SPTEs will recurse on their children.
> -	 */
> -	int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
> -
>  	end = min(end, tdp_mmu_max_gfn_host());
>  
>  	lockdep_assert_held_write(&kvm->mmu_lock);
>  
>  	rcu_read_lock();
>  
> -	for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
> +	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
>  		if (can_yield &&
>  		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
>  			flush = false;
>  			continue;
>  		}
>  
> -		if (!is_shadow_present_pte(iter.old_spte))
> -			continue;
> -
> -		/*
> -		 * If this is a non-last-level SPTE that covers a larger range
> -		 * than should be zapped, continue, and zap the mappings at a
> -		 * lower level, except when zapping all SPTEs.
> -		 */
> -		if (!zap_all &&
> -		    (iter.gfn < start ||
> -		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
> +		if (!is_shadow_present_pte(iter.old_spte) ||
>  		    !is_last_spte(iter.old_spte, iter.level))
>  			continue;
>  
> @@ -898,13 +879,13 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>   * SPTEs have been cleared and a TLB flush is needed before releasing the
>   * MMU lock.
>   */
> -bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
> -				 gfn_t end, bool can_yield, bool flush)
> +bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
> +			   bool can_yield, bool flush)
>  {
>  	struct kvm_mmu_page *root;
>  
>  	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
> -		flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
> +		flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
>  
>  	return flush;
>  }
> @@ -1202,8 +1183,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
>  				 bool flush)
>  {
> -	return __kvm_tdp_mmu_zap_gfn_range(kvm, range->slot->as_id, range->start,
> -					   range->end, range->may_block, flush);
> +	return kvm_tdp_mmu_zap_leafs(kvm, range->slot->as_id, range->start,
> +				     range->end, range->may_block, flush);
>  }
>  
>  typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 5e5ef2576c81..54bc8118c40a 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -15,14 +15,8 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
>  void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  			  bool shared);
>  
> -bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
> +bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start,
>  				 gfn_t end, bool can_yield, bool flush);
> -static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
> -					     gfn_t start, gfn_t end, bool flush)
> -{
> -	return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
> -}
> -
>  bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
>  void kvm_tdp_mmu_zap_all(struct kvm *kvm);
>  void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 19/30] KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched
  2022-03-03 19:38 ` [PATCH v4 19/30] KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched Paolo Bonzini
@ 2022-03-04  1:19   ` Mingwei Zhang
  0 siblings, 0 replies; 62+ messages in thread
From: Mingwei Zhang @ 2022-03-04  1:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, David Hildenbrand,
	David Matlack, Ben Gardon

On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> When yielding in the TDP MMU iterator, service any pending TLB flush
> before dropping RCU protections in anticipation of using the caller's RCU
> "lock" as a proxy for vCPUs in the guest.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Ben Gardon <bgardon@google.com>
> Message-Id: <20220226001546.360188-19-seanjc@google.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Reviewed-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index c71debdbc732..3a866fcb5ea9 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -716,11 +716,11 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
>  		return false;
>  
>  	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> -		rcu_read_unlock();
> -
>  		if (flush)
>  			kvm_flush_remote_tlbs(kvm);
>  
> +		rcu_read_unlock();
> +
>  		if (shared)
>  			cond_resched_rwlock_read(&kvm->mmu_lock);
>  		else
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker
  2022-03-03 21:32     ` Sean Christopherson
@ 2022-03-04  6:48       ` Paolo Bonzini
  2022-03-04 16:02         ` Sean Christopherson
  0 siblings, 1 reply; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-04  6:48 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

On 3/3/22 22:32, Sean Christopherson wrote:

> The re-queue scenario happens if a root is queued and zapped, but is kept alive
> by a vCPU that hasn't yet put its reference.  If another memslot comes along before
> the (sleeping) vCPU drops its reference, this will re-queue the root.
> 
> It's not a major problem in this patch as it's a small amount of wasted effort,
> but it will be an issue when the "put" path starts using the queue, as that will
> create a scenario where a memslot update (or NX toggle) can come along while a
> defunct root is in the zap queue.

As of this patch it's not a problem because 
kvm_tdp_mmu_invalidate_all_roots()'s caller holds kvm->slots_lock, so 
kvm_tdp_mmu_invalidate_all_roots() is guaranteed to queue its work items 
on an empty workqueue.  In effect the workqueue is just a fancy list. 
But as you point out in the review of patch 24, it becomes a problem 
when there's no kvm->slots_lock to guarantee that.  Then it needs to 
check that the root isn't already invalid.

>>> The only issue is that kvm_tdp_mmu_invalidate_all_roots() now assumes that
>>> there is at least one reference in kvm->users_count; so if the VM is dying
>>> just go through the slow path, as there is nothing to gain by using the fast
>>> zapping.
>> This isn't actually implemented.:-)
> Oh, and when you implement it (or copy paste), can you also add a lockdep and a
> comment about the check being racy, but that the race is benign?  It took me a
> second to realize why it's safe to use a work queue without holding a reference
> to @kvm.

I didn't remove the paragraph from the commit message, but I think it's 
unnecessary now.  Unlike in the buggy patch, the workqueue is flushed in 
kvm_mmu_zap_all_fast() and kvm_mmu_uninit_tdp_mmu(), so it doesn't need to 
take a reference to the VM.

I think I don't even need to check kvm->users_count in the defunct root 
case, as long as kvm_mmu_uninit_tdp_mmu() flushes and destroys the 
workqueue before it checks that the lists are empty.
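
Roughly the teardown ordering being described, as a sketch (the workqueue
field name tdp_mmu_zap_wq here is illustrative, not necessarily what v5 will
use):

void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
{
	if (!kvm->arch.tdp_mmu_enabled)
		return;

	/*
	 * destroy_workqueue() also waits for queued work items, so any
	 * pending tdp_mmu_zap_root_work() puts its root reference before
	 * the empty-list assertions below run.
	 */
	destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);

	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_pages));
	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));

	/*
	 * Ensure all outstanding RCU callbacks that free shadow pages can
	 * run before the VM is torn down.
	 */
	rcu_barrier();
}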

I'll wait to hear from you later today before sending out v5.

Paolo


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker
  2022-03-04  6:48       ` Paolo Bonzini
@ 2022-03-04 16:02         ` Sean Christopherson
  2022-03-04 18:11           ` Paolo Bonzini
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2022-03-04 16:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

On Fri, Mar 04, 2022, Paolo Bonzini wrote:
> On 3/3/22 22:32, Sean Christopherson wrote:
> I didn't remove the paragraph from the commit message, but I think it's
> unnecessary now.  The workqueue is flushed in kvm_mmu_zap_all_fast() and
> kvm_mmu_uninit_tdp_mmu(), unlike the buggy patch, so it doesn't need to take
> a reference to the VM.
> 
> I think I don't even need to check kvm->users_count in the defunct root
> case, as long as kvm_mmu_uninit_tdp_mmu() flushes and destroys the workqueue
> before it checks that the lists are empty.

Yes, that should work.  IIRC, the WARN_ONs will tell us/you quite quickly if
we're wrong :-)  mmu_notifier_unregister() will call the "slow" kvm_mmu_zap_all()
and thus ensure all non-root pages zapped, but "leaking" a worker will trigger
the WARN_ON that there are no roots on the list.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 18/30] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()
  2022-03-04  1:16   ` Mingwei Zhang
@ 2022-03-04 16:11     ` Sean Christopherson
  2022-03-04 18:00       ` Mingwei Zhang
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2022-03-04 16:11 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, linux-kernel, kvm, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Hildenbrand, David Matlack,
	Ben Gardon

On Fri, Mar 04, 2022, Mingwei Zhang wrote:
> On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index f3939ce4a115..c71debdbc732 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -834,10 +834,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> >  }
> >  
> >  /*
> > - * Tears down the mappings for the range of gfns, [start, end), and frees the
> > - * non-root pages mapping GFNs strictly within that range. Returns true if
> > - * SPTEs have been cleared and a TLB flush is needed before releasing the
> > - * MMU lock.
> > + * Zap leafs SPTEs for the range of gfns, [start, end). Returns true if SPTEs
> > + * have been cleared and a TLB flush is needed before releasing the MMU lock.
> 
> I think the original code does not _over_ zapping. But the new version
> does.

No, the new version doesn't overzap.

> Will that have some side effects? In particular, if the range is
> within a huge page (or HugeTLB page of various sizes), then we choose to
> zap it even if it is more than the range.

The old version did that too.  KVM _must_ zap a hugepage that overlaps the range,
otherwise the guest would be able to access memory that has been freed/moved.  If
the operation has unmapped a subset of a hugepage, KVM needs to zap and rebuild
the portions that are still valid using smaller pages.
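
To make that concrete (purely an illustration, not code from this series): a
leaf SPTE, hugepage or not, is zapped whenever its GFN range overlaps
[start, end) at all.

/*
 * Illustration only: does the leaf SPTE at (gfn, level) overlap the zap
 * range [start, end)?
 */
static bool leaf_overlaps_range(gfn_t gfn, int level, gfn_t start, gfn_t end)
{
	gfn_t nr = KVM_PAGES_PER_HPAGE(level);	/* 1 for 4kb, 512 for 2mb, ... */

	return gfn < end && gfn + nr > start;
}

E.g. a 2mb SPTE at gfn 0x200 maps [0x200, 0x400); zapping [0x300, 0x310)
overlaps it, so the whole 2mb mapping goes away and the guest re-faults to
rebuild the still-valid portions with 4kb SPTEs.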

> Regardless of side effect, I think we probably should mention that in
> the comments?
> > -		/*
> > -		 * If this is a non-last-level SPTE that covers a larger range
> > -		 * than should be zapped, continue, and zap the mappings at a
> > -		 * lower level, except when zapping all SPTEs.
> > -		 */
> > -		if (!zap_all &&
> > -		    (iter.gfn < start ||
> > -		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
> > +		if (!is_shadow_present_pte(iter.old_spte) ||
> >  		    !is_last_spte(iter.old_spte, iter.level))

It's hard to see in the diff, but the key is the "!is_last_spte()" check.  The
check before was skipping non-leaf, a.k.a. shadow pages, if they weren't in the
range.  The new version _always_ skips shadow pages.  Hugepages will always
return true for is_last_spte() and will never be skipped.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 18/30] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()
  2022-03-04 16:11     ` Sean Christopherson
@ 2022-03-04 18:00       ` Mingwei Zhang
  2022-03-04 18:42         ` Sean Christopherson
  0 siblings, 1 reply; 62+ messages in thread
From: Mingwei Zhang @ 2022-03-04 18:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, linux-kernel, kvm, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Hildenbrand, David Matlack,
	Ben Gardon

On Fri, Mar 04, 2022, Sean Christopherson wrote:
> On Fri, Mar 04, 2022, Mingwei Zhang wrote:
> > On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index f3939ce4a115..c71debdbc732 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -834,10 +834,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > >  }
> > >  
> > >  /*
> > > - * Tears down the mappings for the range of gfns, [start, end), and frees the
> > > - * non-root pages mapping GFNs strictly within that range. Returns true if
> > > - * SPTEs have been cleared and a TLB flush is needed before releasing the
> > > - * MMU lock.
> > > + * Zap leafs SPTEs for the range of gfns, [start, end). Returns true if SPTEs
> > > + * have been cleared and a TLB flush is needed before releasing the MMU lock.
> > 
> > I think the original code does not _over_ zapping. But the new version
> > does.
> 
> No, the new version doesn't overzap.

It does overzap, but that does not matter and the semantics do not
change.
> 
> > Will that have some side effects? In particular, if the range is
> > within a huge page (or HugeTLB page of various sizes), then we choose to
> > zap it even if it is more than the range.

ACK.
> 
> The old version did that too.  KVM _must_ zap a hugepage that overlaps the range,
> otherwise the guest would be able to access memory that has been freed/moved.  If
> the operation has unmapped a subset of a hugepage, KVM needs to zap and rebuild
> the portions that are still valid using smaller pages.
> 
> > Regardless of side effect, I think we probably should mention that in
> > the comments?
> > > -		/*
> > > -		 * If this is a non-last-level SPTE that covers a larger range
> > > -		 * than should be zapped, continue, and zap the mappings at a
> > > -		 * lower level, except when zapping all SPTEs.
> > > -		 */
> > > -		if (!zap_all &&
> > > -		    (iter.gfn < start ||
> > > -		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
> > > +		if (!is_shadow_present_pte(iter.old_spte) ||
> > >  		    !is_last_spte(iter.old_spte, iter.level))
> 
> It's hard to see in the diff, but the key is the "!is_last_spte()" check.  The
> check before was skipping non-leaf, a.k.a. shadow pages, if they weren't in the
> range.  The new version _always_ skips shadow pages.  Hugepages will always
> return true for is_last_spte() and will never be skipped.

ACK

Reviewed-by: Mingwei Zhang <mizhang@google.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker
  2022-03-04 16:02         ` Sean Christopherson
@ 2022-03-04 18:11           ` Paolo Bonzini
  2022-03-05  0:34             ` Sean Christopherson
  0 siblings, 1 reply; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-04 18:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

On 3/4/22 17:02, Sean Christopherson wrote:
> On Fri, Mar 04, 2022, Paolo Bonzini wrote:
>> On 3/3/22 22:32, Sean Christopherson wrote:
>> I didn't remove the paragraph from the commit message, but I think it's
>> unnecessary now.  The workqueue is flushed in kvm_mmu_zap_all_fast() and
>> kvm_mmu_uninit_tdp_mmu(), unlike the buggy patch, so it doesn't need to take
>> a reference to the VM.
>>
>> I think I don't even need to check kvm->users_count in the defunct root
>> case, as long as kvm_mmu_uninit_tdp_mmu() flushes and destroys the workqueue
>> before it checks that the lists are empty.
> 
> Yes, that should work.  IIRC, the WARN_ONs will tell us/you quite quickly if
> we're wrong :-)  mmu_notifier_unregister() will call the "slow" kvm_mmu_zap_all()
> and thus ensure all non-root pages zapped, but "leaking" a worker will trigger
> the WARN_ON that there are no roots on the list.

Good, for the record these are the commit messages I have:

     KVM: x86/mmu: Zap invalidated roots via asynchronous worker
     
     Use the system worker threads to zap the roots invalidated
     by the TDP MMU's "fast zap" mechanism, implemented by
     kvm_tdp_mmu_invalidate_all_roots().
     
     At this point, apart from allowing some parallelism in the zapping of
     roots, the workqueue is a glorified linked list: work items are added and
     flushed entirely within a single kvm->slots_lock critical section.  However,
     the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
     assumes that it owns a reference to all invalid roots; therefore, no
     one can set the invalid bit outside kvm_mmu_zap_all_fast().  Putting the
     invalidated roots on a linked list... erm, on a workqueue ensures that
     tdp_mmu_zap_root_work() only puts back those extra references that
     kvm_mmu_zap_all_invalidated_roots() had gifted to it.

and

     KVM: x86/mmu: Zap defunct roots via asynchronous worker
     
     Zap defunct roots, a.k.a. roots that have been invalidated after their
     last reference was initially dropped, asynchronously via the existing work
     queue instead of forcing the work upon the unfortunate task that happened
     to drop the last reference.
     
     If a vCPU task drops the last reference, the vCPU is effectively blocked
     by the host for the entire duration of the zap.  If the root being zapped
     happens to be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
     being active, the zap can take several hundred seconds.  Unsurprisingly,
     most guests are unhappy if a vCPU disappears for hundreds of seconds.
     
     E.g. running a synthetic selftest that triggers a vCPU root zap with
     ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
     Offloading the zap to a worker drops the block time to <100ms.
     
     There is an important nuance to this change.  If the same work item
     was queued twice before the work function has run, it would only
     execute once and one reference would be leaked.  Therefore, now that
     queueing items is no longer protected by write_lock(&kvm->mmu_lock),
     kvm_tdp_mmu_invalidate_all_roots() has to check root->role.invalid and
     skip already invalid roots.  On the other hand, kvm_mmu_zap_all_fast()
     must return only after those skipped roots have been zapped as well.
     These two requirements can be satisfied only if _all_ places that
     change invalid to true now schedule the worker before releasing the
     mmu_lock.  There are just two, kvm_tdp_mmu_put_root() and
     kvm_tdp_mmu_invalidate_all_roots().
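
For reference, a sketch of the schedule/work pair those two messages lean on
(approximate; the names and locking here are illustrative and may differ in
v5):

static void tdp_mmu_zap_root_work(struct work_struct *work)
{
	struct kvm_mmu_page *root = container_of(work, struct kvm_mmu_page,
						 tdp_mmu_async_work);
	struct kvm *kvm = root->tdp_mmu_async_data;

	read_lock(&kvm->mmu_lock);

	/*
	 * The root is already invalid, so no new references will be taken;
	 * zap everything, yielding as needed.
	 */
	tdp_mmu_zap_root(kvm, root, true);

	/* Put back the reference that was gifted to the work item. */
	kvm_tdp_mmu_put_root(kvm, root, true);

	read_unlock(&kvm->mmu_lock);
}

static void tdp_mmu_schedule_zap_root(struct kvm *kvm, struct kvm_mmu_page *root)
{
	root->tdp_mmu_async_data = kvm;
	INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_work);
	queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
}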

Paolo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 18/30] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()
  2022-03-04 18:00       ` Mingwei Zhang
@ 2022-03-04 18:42         ` Sean Christopherson
  0 siblings, 0 replies; 62+ messages in thread
From: Sean Christopherson @ 2022-03-04 18:42 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, linux-kernel, kvm, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Hildenbrand, David Matlack,
	Ben Gardon

On Fri, Mar 04, 2022, Mingwei Zhang wrote:
> On Fri, Mar 04, 2022, Sean Christopherson wrote:
> > On Fri, Mar 04, 2022, Mingwei Zhang wrote:
> > > On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> > > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > index f3939ce4a115..c71debdbc732 100644
> > > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > @@ -834,10 +834,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > > >  }
> > > >  
> > > >  /*
> > > > - * Tears down the mappings for the range of gfns, [start, end), and frees the
> > > > - * non-root pages mapping GFNs strictly within that range. Returns true if
> > > > - * SPTEs have been cleared and a TLB flush is needed before releasing the
> > > > - * MMU lock.
> > > > + * Zap leafs SPTEs for the range of gfns, [start, end). Returns true if SPTEs
> > > > + * have been cleared and a TLB flush is needed before releasing the MMU lock.
> > > 
> > > I think the original code does not _over_ zapping. But the new version
> > > does.
> > 
> > No, the new version doesn't overzap.
> 
> It does overzap, but it does not matter and the semantic does not
> change.

Belaboring the point a bit... it very much matters, KVM must "overzap" for functional
correctness.  It's only an "overzap" from the perspective that KVM could theoretically
shatter the hugepage then zap only the relevant small pages.  But it's not an overzap
in the sense that KVM absolutely has to zap the hugepage.  Even if KVM replaces it
with a shadow page, the hugepage is still being zapped, i.e. it's gone and KVM must do
a TLB flush regardless of whether or not there's a new mapping.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker
  2022-03-04 18:11           ` Paolo Bonzini
@ 2022-03-05  0:34             ` Sean Christopherson
  2022-03-05 19:53               ` Paolo Bonzini
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2022-03-05  0:34 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

On Fri, Mar 04, 2022, Paolo Bonzini wrote:
> On 3/4/22 17:02, Sean Christopherson wrote:
> > On Fri, Mar 04, 2022, Paolo Bonzini wrote:
> > > On 3/3/22 22:32, Sean Christopherson wrote:
> > > I didn't remove the paragraph from the commit message, but I think it's
> > > unnecessary now.  The workqueue is flushed in kvm_mmu_zap_all_fast() and
> > > kvm_mmu_uninit_tdp_mmu(), unlike the buggy patch, so it doesn't need to take
> > > a reference to the VM.
> > > 
> > > I think I don't even need to check kvm->users_count in the defunct root
> > > case, as long as kvm_mmu_uninit_tdp_mmu() flushes and destroys the workqueue
> > > before it checks that the lists are empty.
> > 
> > Yes, that should work.  IIRC, the WARN_ONs will tell us/you quite quickly if
> > we're wrong :-)  mmu_notifier_unregister() will call the "slow" kvm_mmu_zap_all()
> > and thus ensure all non-root pages zapped, but "leaking" a worker will trigger
> > the WARN_ON that there are no roots on the list.
> 
> Good, for the record these are the commit messages I have:
> 
>     KVM: x86/mmu: Zap invalidated roots via asynchronous worker
>     Use the system worker threads to zap the roots invalidated
>     by the TDP MMU's "fast zap" mechanism, implemented by
>     kvm_tdp_mmu_invalidate_all_roots().
>     At this point, apart from allowing some parallelism in the zapping of
>     roots, the workqueue is a glorified linked list: work items are added and
>     flushed entirely within a single kvm->slots_lock critical section.  However,
>     the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
>     assumes that it owns a reference to all invalid roots; therefore, no
>     one can set the invalid bit outside kvm_mmu_zap_all_fast().  Putting the
>     invalidated roots on a linked list... erm, on a workqueue ensures that
>     tdp_mmu_zap_root_work() only puts back those extra references that
>     kvm_mmu_zap_all_invalidated_roots() had gifted to it.
> 
> and
> 
>     KVM: x86/mmu: Zap defunct roots via asynchronous worker
>     Zap defunct roots, a.k.a. roots that have been invalidated after their
>     last reference was initially dropped, asynchronously via the existing work
>     queue instead of forcing the work upon the unfortunate task that happened
>     to drop the last reference.
>     If a vCPU task drops the last reference, the vCPU is effectively blocked
>     by the host for the entire duration of the zap.  If the root being zapped
>     happens be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
>     being active, the zap can take several hundred seconds.  Unsurprisingly,
>     most guests are unhappy if a vCPU disappears for hundreds of seconds.
>     E.g. running a synthetic selftest that triggers a vCPU root zap with
>     ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
>     Offloading the zap to a worker drops the block time to <100ms.
>     There is an important nuance to this change.  If the same work item
>     was queued twice before the work function has run, it would only
>     execute once and one reference would be leaked.  Therefore, now that
>     queueing items is not anymore protected by write_lock(&kvm->mmu_lock),
>     kvm_tdp_mmu_invalidate_all_roots() has to check root->role.invalid and
>     skip already invalid roots.  On the other hand, kvm_mmu_zap_all_fast()
>     must return only after those skipped roots have been zapped as well.
>     These two requirements can be satisfied only if _all_ places that
>     change invalid to true now schedule the worker before releasing the
>     mmu_lock.  There are just two, kvm_tdp_mmu_put_root() and
>     kvm_tdp_mmu_invalidate_all_roots().

Very nice!

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker
  2022-03-05  0:34             ` Sean Christopherson
@ 2022-03-05 19:53               ` Paolo Bonzini
  2022-03-08 21:29                 ` Sean Christopherson
  0 siblings, 1 reply; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-05 19:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

On 3/5/22 01:34, Sean Christopherson wrote:
> On Fri, Mar 04, 2022, Paolo Bonzini wrote:
>> On 3/4/22 17:02, Sean Christopherson wrote:
>>> On Fri, Mar 04, 2022, Paolo Bonzini wrote:
>>>> On 3/3/22 22:32, Sean Christopherson wrote:
>>>> I didn't remove the paragraph from the commit message, but I think it's
>>>> unnecessary now.  The workqueue is flushed in kvm_mmu_zap_all_fast() and
>>>> kvm_mmu_uninit_tdp_mmu(), unlike the buggy patch, so it doesn't need to take
>>>> a reference to the VM.
>>>>
>>>> I think I don't even need to check kvm->users_count in the defunct root
>>>> case, as long as kvm_mmu_uninit_tdp_mmu() flushes and destroys the workqueue
>>>> before it checks that the lists are empty.
>>>
>>> Yes, that should work.  IIRC, the WARN_ONs will tell us/you quite quickly if
>>> we're wrong :-)  mmu_notifier_unregister() will call the "slow" kvm_mmu_zap_all()
>>> and thus ensure all non-root pages zapped, but "leaking" a worker will trigger
>>> the WARN_ON that there are no roots on the list.
>>
>> Good, for the record these are the commit messages I have:

I'm seeing some hangs in ~50% of installation jobs, both Windows and 
Linux.  I have not yet tried to reproduce outside the automated tests, 
or to bisect, but I'll try to push at least the first part of the series 
for 5.18.

Paolo

>>      KVM: x86/mmu: Zap invalidated roots via asynchronous worker
>>      Use the system worker threads to zap the roots invalidated
>>      by the TDP MMU's "fast zap" mechanism, implemented by
>>      kvm_tdp_mmu_invalidate_all_roots().
>>      At this point, apart from allowing some parallelism in the zapping of
>>      roots, the workqueue is a glorified linked list: work items are added and
>>      flushed entirely within a single kvm->slots_lock critical section.  However,
>>      the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
>>      assumes that it owns a reference to all invalid roots; therefore, no
>>      one can set the invalid bit outside kvm_mmu_zap_all_fast().  Putting the
>>      invalidated roots on a linked list... erm, on a workqueue ensures that
>>      tdp_mmu_zap_root_work() only puts back those extra references that
>>      kvm_mmu_zap_all_invalidated_roots() had gifted to it.
>>
>> and
>>
>>      KVM: x86/mmu: Zap defunct roots via asynchronous worker
>>      Zap defunct roots, a.k.a. roots that have been invalidated after their
>>      last reference was initially dropped, asynchronously via the existing work
>>      queue instead of forcing the work upon the unfortunate task that happened
>>      to drop the last reference.
>>      If a vCPU task drops the last reference, the vCPU is effectively blocked
>>      by the host for the entire duration of the zap.  If the root being zapped
>>      happens be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
>>      being active, the zap can take several hundred seconds.  Unsurprisingly,
>>      most guests are unhappy if a vCPU disappears for hundreds of seconds.
>>      E.g. running a synthetic selftest that triggers a vCPU root zap with
>>      ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
>>      Offloading the zap to a worker drops the block time to <100ms.
>>      There is an important nuance to this change.  If the same work item
>>      was queued twice before the work function has run, it would only
>>      execute once and one reference would be leaked.  Therefore, now that
>>      queueing items is not anymore protected by write_lock(&kvm->mmu_lock),
>>      kvm_tdp_mmu_invalidate_all_roots() has to check root->role.invalid and
>>      skip already invalid roots.  On the other hand, kvm_mmu_zap_all_fast()
>>      must return only after those skipped roots have been zapped as well.
>>      These two requirements can be satisfied only if _all_ places that
>>      change invalid to true now schedule the worker before releasing the
>>      mmu_lock.  There are just two, kvm_tdp_mmu_put_root() and
>>      kvm_tdp_mmu_invalidate_all_roots().
> 
> Very nice!
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 30/30] KVM: selftests: Add test to populate a VM with the max possible guest mem
  2022-03-03 19:38 ` [PATCH v4 30/30] KVM: selftests: Add test to populate a VM with the max possible guest mem Paolo Bonzini
@ 2022-03-08 14:47   ` Paolo Bonzini
  2022-03-08 15:36     ` Christian Borntraeger
  2022-03-08 21:09     ` Sean Christopherson
  0 siblings, 2 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-08 14:47 UTC (permalink / raw)
  To: linux-kernel, kvm, Sean Christopherson, Marc Zyngier,
	Christian Borntraeger

On 3/3/22 20:38, Paolo Bonzini wrote:
> From: Sean Christopherson<seanjc@google.com>
> 
> Add a selftest that enables populating a VM with the maximum amount of
> guest memory allowed by the underlying architecture.  Abuse KVM's
> memslots by mapping a single host memory region into multiple memslots so
> that the selftest doesn't require a system with terabytes of RAM.
> 
> Default to 512gb of guest memory, which isn't all that interesting, but
> should work on all MMUs and doesn't take an exorbitant amount of memory
> or time.  E.g. testing with ~64tb of guest memory takes the better part
> of an hour, and requires 200gb of memory for KVM's page tables when using
> 4kb pages.

I couldn't quite run this on a laptop, so I'll tune it down to 128gb and 
3/4 of the available CPUs.

> To inflict maximum abuse on KVM's MMU, default to 4kb pages (or whatever
> the not-hugepage size is) in the backing store (memfd).  Use memfd for
> the host backing store to ensure that hugepages are guaranteed when
> requested, and to give the user explicit control of the size of hugepage
> being tested.
> 
> By default, spin up as many vCPUs as there are available to the selftest,
> and distribute the work of dirtying each 4kb chunk of memory across all
> vCPUs.  Dirtying guest memory forces KVM to populate its page tables, and
> also forces KVM to write back accessed/dirty information to struct page
> when the guest memory is freed.
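
(As an aside, the per-vCPU guest loop implied here is essentially the
following — a sketch only, names illustrative:)

static void guest_dirty_slice(uint64_t base, uint64_t size, uint64_t stride)
{
	uint64_t gpa;

	/* One write per 4kb page forces KVM to create a leaf SPTE for it. */
	for (gpa = base; gpa < base + size; gpa += stride)
		*(volatile uint64_t *)gpa = gpa;
}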
> 
> On x86, perform two passes with a MMU context reset between each pass to
> coerce KVM into dropping all references to the MMU root, e.g. to emulate
> a vCPU dropping the last reference.  Perform both passes and all
> rendezvous on all architectures in the hope that arm64 and s390x can gain
> similar shenanigans in the future.

Did you actually test aarch64 (not even asking about s390 :))?  For now 
let's only add it for x86.

> +			TEST_ASSERT(nr_vcpus, "#DE");

srsly? :)

Paolo


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 30/30] KVM: selftests: Add test to populate a VM with the max possible guest mem
  2022-03-08 14:47   ` Paolo Bonzini
@ 2022-03-08 15:36     ` Christian Borntraeger
  2022-03-08 21:09     ` Sean Christopherson
  1 sibling, 0 replies; 62+ messages in thread
From: Christian Borntraeger @ 2022-03-08 15:36 UTC (permalink / raw)
  To: Paolo Bonzini, linux-kernel, kvm, Sean Christopherson,
	Marc Zyngier, Claudio Imbrenda, Janosch Frank



Am 08.03.22 um 15:47 schrieb Paolo Bonzini:
> On 3/3/22 20:38, Paolo Bonzini wrote:
>> From: Sean Christopherson<seanjc@google.com>
>>
>> Add a selftest that enables populating a VM with the maximum amount of
>> guest memory allowed by the underlying architecture.  Abuse KVM's
>> memslots by mapping a single host memory region into multiple memslots so
>> that the selftest doesn't require a system with terabytes of RAM.
>>
>> Default to 512gb of guest memory, which isn't all that interesting, but
>> should work on all MMUs and doesn't take an exorbitant amount of memory
>> or time.  E.g. testing with ~64tb of guest memory takes the better part
>> of an hour, and requires 200gb of memory for KVM's page tables when using
>> 4kb pages.
> 
> I couldn't quite run this on a laptop, so I'll tune it down to 128gb and 3/4 of the available CPUs.
> 
>> To inflicit maximum abuse on KVM' MMU, default to 4kb pages (or whatever
>> the not-hugepage size is) in the backing store (memfd).  Use memfd for
>> the host backing store to ensure that hugepages are guaranteed when
>> requested, and to give the user explicit control of the size of hugepage
>> being tested.
>>
>> By default, spin up as many vCPUs as there are available to the selftest,
>> and distribute the work of dirtying each 4kb chunk of memory across all
>> vCPUs.  Dirtying guest memory forces KVM to populate its page tables, and
>> also forces KVM to write back accessed/dirty information to struct page
>> when the guest memory is freed.
>>
>> On x86, perform two passes with a MMU context reset between each pass to
>> coerce KVM into dropping all references to the MMU root, e.g. to emulate
>> a vCPU dropping the last reference.  Perform both passes and all
>> rendezvous on all architectures in the hope that arm64 and s390x can gain
>> similar shenanigans in the future.
> 
> Did you actually test aarch64 (not even asking about s390 :))?  For now let's only add it for x86.

I do get spurious failures:
# selftests: kvm: max_guest_memory_test
# ==== Test Assertion Failure ====
#   lib/kvm_util.c:883: !ret
#   pid=575178 tid=575178 errno=22 - Invalid argument
#      1	0x000000000100385f: vm_set_user_memory_region at kvm_util.c:883
#      2	0x0000000001001ee1: main at max_guest_memory_test.c:242
#      3	0x000003ffa1033731: ?? ??:0
#      4	0x000003ffa103380d: ?? ??:0
#      5	0x0000000001002389: _start at ??:?
#   KVM_SET_USER_MEMORY_REGION failed, errno = 22 (Invalid argument)
not ok 9 selftests: kvm: max_guest_memory_test # exit=254

as the userspace address must be 1MB-aligned but the mmap is not (due to ASLR).

There are probably more issues, so it certainly is ok to skip s390 for now.
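
For reference, a minimal sketch of one possible workaround (purely
illustrative, not the selftest's actual fix; error handling omitted):
over-allocate the mapping and round the start address up to the next 1MB
boundary before registering it as a memslot, so ASLR's placement can't
violate the alignment requirement:

    #include <stdint.h>
    #include <sys/mman.h>

    #define SZ_1M (1UL << 20)

    static void *alloc_1m_aligned(size_t size)
    {
        /* Pad by 1MB so rounding up still leaves 'size' usable bytes. */
        size_t padded = size + SZ_1M;
        uintptr_t addr = (uintptr_t)mmap(NULL, padded,
                                         PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Round up to the next 1MB boundary. */
        return (void *)((addr + SZ_1M - 1) & ~(SZ_1M - 1));
    }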
> 
>> +            TEST_ASSERT(nr_vcpus, "#DE");
> 
> srsly? :)
> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing
  2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
                   ` (29 preceding siblings ...)
  2022-03-03 19:38 ` [PATCH v4 30/30] KVM: selftests: Add test to populate a VM with the max possible guest mem Paolo Bonzini
@ 2022-03-08 17:25 ` Paolo Bonzini
  30 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-08 17:25 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

On 3/3/22 20:38, Paolo Bonzini wrote:
> 
> Overhaul TDP MMU's handling of zapping and TLB flushing to reduce the
> number of TLB flushes, fix soft lockups and RCU stalls, avoid blocking
> vCPUs for long durations while zapping paging structure, and to clean up
> the zapping code.
> 
> The largest cleanup is to separate the flows for zapping roots (zap
> _everything_), zapping leaf SPTEs (zap guest mappings for whatever reason),
> and zapping a specific SP (NX recovery).  They're currently smushed into a
> single zap_gfn_range(), which was a good idea at the time, but became a
> mess when trying to handle the different rules, e.g. TLB flushes aren't
> needed when zapping a root because KVM can safely zap a root if and only
> if it's unreachable.
> 
> To solve the soft lockups, stalls, and vCPU performance issues:
> 
>   - Defer remote TLB flushes to the caller when zapping TDP MMU shadow
>     pages by relying on RCU to ensure the paging structure isn't freed
>     until all vCPUs have exited the guest.
> 
>   - Allowing yielding when zapping TDP MMU roots in response to the root's
>     last reference being put.  This requires a bit of trickery to ensure
>     the root is reachable via mmu_notifier, but it's not too gross.
> 
>   - Zap roots in two passes to avoid holding RCU for potential hundreds of
>     seconds when zapping guest with terabytes of memory that is backed
>     entirely by 4kb SPTEs.
> 
>   - Zap defunct roots asynchronously via the common work_queue so that a
>     vCPU doesn't get stuck doing the work if the vCPU happens to drop the
>     last reference to a root.
> 
> The selftest at the end allows populating a guest with the max amount of
> memory allowed by the underlying architecture.  The most I've tested is
> ~64tb (MAXPHYADDR=46) as I don't have easy access to a system with
> MAXPHYADDR=52.  The selftest compiles on arm64 and s390x, but otherwise
> hasn't been tested outside of x86-64.  It will hopefully do something
> useful as is, but there's a non-zero chance it won't get past init with
> a high max memory.  Running on x86 without the TDP MMU is comically slow.
> 
> Testing: passes kvm-unit-tests and guest installation tests on Intel.
> Haven't yet run AMD or selftests.
> 
> Thanks,
> 
> Paolo
> 
> v4:
> - collected reviews and typo fixes (plus some typo fixes of my own)
> 
> - new patches to simplify reader invariants: they are not allowed to
>    acquire references to invalid roots
> 
> - new version of "Allow yielding when zapping GFNs for defunct TDP MMU
>    root", simplifying the atomic a bit by 1) using xchg and relying on
>    its implicit memory barriers 2) relying on readers to have the same
>    behavior for the three stats refcount=0/valid, refcount=0/invalid,
>    refcount=1/invalid (see previous point)
> 
> - switch zapping of invalidated roots to asynchronous workers on a
>    per-VM workqueue, fixing a bug in v3 where the extra reference added
>    by kvm_tdp_mmu_put_root could be given back twice.  This also replaces
>    "KVM: x86/mmu: Use common iterator for walking invalid TDP MMU roots"
>    in v3, since it gets rid of next_invalidated_root() in a different way.
> 
> - because of the previous point, most of the logic in v3's "KVM: x86/mmu:
>    Zap defunct roots via asynchronous worker" moves to the earlier patch
>    "KVM: x86/mmu: Zap invalidated roots via asynchronous worker"
> 
> 
> v3:
> - Drop patches that were applied.
> - Rebase to latest kvm/queue.
> - Collect a review. [David]
> - Use helper instead of goto to zap roots in two passes. [David]
> - Add patches to disallow REMOVED "old" SPTE when atomically
>    setting SPTE.
> 
> Paolo Bonzini (5):
>    KVM: x86/mmu: only perform eager page splitting on valid roots
>    KVM: x86/mmu: do not allow readers to acquire references to invalid roots
>    KVM: x86/mmu: Zap invalidated roots via asynchronous worker
>    KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root
>    KVM: x86/mmu: Zap defunct roots via asynchronous worker
> 
> Sean Christopherson (25):
>    KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU
>    KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap
>    KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic
>    KVM: x86/mmu: Document that zapping invalidated roots doesn't need to flush
>    KVM: x86/mmu: Require mmu_lock be held for write in unyielding root iter
>    KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP removal
>    KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier change_spte
>    KVM: x86/mmu: Drop RCU after processing each root in MMU notifier hooks
>    KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU
>    KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path
>    KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw values
>    KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery
>    KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU
>    KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page
>    KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range
>    KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()
>    KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched
>    KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages
>    KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls
>    KVM: x86/mmu: Check for a REMOVED leaf SPTE before making the SPTE
>    KVM: x86/mmu: WARN on any attempt to atomically update REMOVED SPTE
>    KVM: selftests: Move raw KVM_SET_USER_MEMORY_REGION helper to utils
>    KVM: selftests: Split out helper to allocate guest mem via memfd
>    KVM: selftests: Define cpu_relax() helpers for s390 and x86
>    KVM: selftests: Add test to populate a VM with the max possible guest mem
> 
>   arch/x86/include/asm/kvm_host.h               |   2 +
>   arch/x86/kvm/mmu/mmu.c                        |  49 +-
>   arch/x86/kvm/mmu/mmu_internal.h               |  15 +-
>   arch/x86/kvm/mmu/tdp_iter.c                   |   6 +-
>   arch/x86/kvm/mmu/tdp_iter.h                   |  15 +-
>   arch/x86/kvm/mmu/tdp_mmu.c                    | 559 +++++++++++-------
>   arch/x86/kvm/mmu/tdp_mmu.h                    |  26 +-
>   tools/testing/selftests/kvm/.gitignore        |   1 +
>   tools/testing/selftests/kvm/Makefile          |   3 +
>   .../selftests/kvm/include/kvm_util_base.h     |   5 +
>   .../selftests/kvm/include/s390x/processor.h   |   8 +
>   .../selftests/kvm/include/x86_64/processor.h  |   5 +
>   tools/testing/selftests/kvm/lib/kvm_util.c    |  66 ++-
>   .../selftests/kvm/max_guest_memory_test.c     | 292 +++++++++
>   .../selftests/kvm/set_memory_region_test.c    |  35 +-
>   15 files changed, 794 insertions(+), 293 deletions(-)
>   create mode 100644 tools/testing/selftests/kvm/max_guest_memory_test.c
> 


Queued, thanks.

Paolo


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 30/30] KVM: selftests: Add test to populate a VM with the max possible guest mem
  2022-03-08 14:47   ` Paolo Bonzini
  2022-03-08 15:36     ` Christian Borntraeger
@ 2022-03-08 21:09     ` Sean Christopherson
  1 sibling, 0 replies; 62+ messages in thread
From: Sean Christopherson @ 2022-03-08 21:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Sean Christopherson, Marc Zyngier,
	Christian Borntraeger

On Tue, Mar 08, 2022, Paolo Bonzini wrote:
> On 3/3/22 20:38, Paolo Bonzini wrote:
> > On x86, perform two passes with a MMU context reset between each pass to
> > coerce KVM into dropping all references to the MMU root, e.g. to emulate
> > a vCPU dropping the last reference.  Perform both passes and all
> > rendezvous on all architectures in the hope that arm64 and s390x can gain
> > similar shenanigans in the future.
> 
> Did you actually test aarch64 (not even asking about s390 :))?  For now
> let's only add it for x86.

Nope, don't you read my cover letters?  :-D

  The selftest at the end allows populating a guest with the max amount of
  memory allowed by the underlying architecture.  The most I've tested is
  ~64tb (MAXPHYADDR=46) as I don't have easy access to a system with
  MAXPHYADDR=52.  The selftest compiles on arm64 and s390x, but otherwise
  hasn't been tested outside of x86-64.  It will hopefully do something
  useful as is, but there's a non-zero chance it won't get past init with
  a high max memory.  Running on x86 without the TDP MMU is comically slow.


> > +			TEST_ASSERT(nr_vcpus, "#DE");
> 
> srsly? :)

LOL, yes.  IIRC I added that because I screwed up computing nr_vcpus and my
test did nothing useful :-)

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker
  2022-03-05 19:53               ` Paolo Bonzini
@ 2022-03-08 21:29                 ` Sean Christopherson
  2022-03-11 17:50                   ` Paolo Bonzini
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2022-03-08 21:29 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

On Sat, Mar 05, 2022, Paolo Bonzini wrote:
> On 3/5/22 01:34, Sean Christopherson wrote:
> > On Fri, Mar 04, 2022, Paolo Bonzini wrote:
> > > On 3/4/22 17:02, Sean Christopherson wrote:
> > > > On Fri, Mar 04, 2022, Paolo Bonzini wrote:
> > > > > On 3/3/22 22:32, Sean Christopherson wrote:
> > > > > I didn't remove the paragraph from the commit message, but I think it's
> > > > > unnecessary now.  The workqueue is flushed in kvm_mmu_zap_all_fast() and
> > > > > kvm_mmu_uninit_tdp_mmu(), unlike the buggy patch, so it doesn't need to take
> > > > > a reference to the VM.
> > > > > 
> > > > > I think I don't even need to check kvm->users_count in the defunct root
> > > > > case, as long as kvm_mmu_uninit_tdp_mmu() flushes and destroys the workqueue
> > > > > before it checks that the lists are empty.
> > > > 
> > > > Yes, that should work.  IIRC, the WARN_ONs will tell us/you quite quickly if
> > > > we're wrong :-)  mmu_notifier_unregister() will call the "slow" kvm_mmu_zap_all()
> > > > and thus ensure all non-root pages zapped, but "leaking" a worker will trigger
> > > > the WARN_ON that there are no roots on the list.
> > > 
> > > Good, for the record these are the commit messages I have:
> 
> I'm seeing some hangs in ~50% of installation jobs, both Windows and Linux.
> I have not yet tried to reproduce outside the automated tests, or to bisect,
> but I'll try to push at least the first part of the series for 5.18.

Out of curiosity, what was the bug?  I see this got pushed to kvm/next.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 18/30] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()
  2022-03-03 19:38 ` [PATCH v4 18/30] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range() Paolo Bonzini
  2022-03-04  1:16   ` Mingwei Zhang
@ 2022-03-11 15:09   ` Vitaly Kuznetsov
  2022-03-13 18:40   ` Mingwei Zhang
  2 siblings, 0 replies; 62+ messages in thread
From: Vitaly Kuznetsov @ 2022-03-11 15:09 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Wanpeng Li, Jim Mattson, Joerg Roedel, David Hildenbrand,
	David Matlack, Ben Gardon, Mingwei Zhang, linux-kernel, kvm

Paolo Bonzini <pbonzini@redhat.com> writes:

> From: Sean Christopherson <seanjc@google.com>
>
> Zap only leaf SPTEs in the TDP MMU's zap_gfn_range(), and rename various
> functions accordingly.  When removing mappings for functional correctness
> (except for the stupid VFIO GPU passthrough memslots bug), zapping the
> leaf SPTEs is sufficient as the paging structures themselves do not point
> at guest memory and do not directly impact the final translation (in the
> TDP MMU).
>
> Note, this aligns the TDP MMU with the legacy/full MMU, which zaps only
> the rmaps, a.k.a. leaf SPTEs, in kvm_zap_gfn_range() and
> kvm_unmap_gfn_range().
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Ben Gardon <bgardon@google.com>
> Message-Id: <20220226001546.360188-18-seanjc@google.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

I've noticed that multi-vCPU Hyper-V guests started crashing randomly on
boot with the latest kvm/queue, and I've bisected the problem to this
particular patch. Basically, I'm no longer able to boot e.g. a 16-vCPU
guest successfully. Both Intel and AMD seem to be affected. Reverting
this commit saves the day.

Having some experience with similar-looking crashes in the past, I'd
suspect it is TLB flush related. I'd appreciate any thoughts.

> ---
>  arch/x86/kvm/mmu/mmu.c     |  4 ++--
>  arch/x86/kvm/mmu/tdp_mmu.c | 41 ++++++++++----------------------------
>  arch/x86/kvm/mmu/tdp_mmu.h |  8 +-------
>  3 files changed, 14 insertions(+), 39 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8408d7db8d2a..febdcaaa7b94 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5834,8 +5834,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  
>  	if (is_tdp_mmu_enabled(kvm)) {
>  		for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> -			flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start,
> -							  gfn_end, flush);
> +			flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start,
> +						      gfn_end, true, flush);
>  	}
>  
>  	if (flush)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index f3939ce4a115..c71debdbc732 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -834,10 +834,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>  }
>  
>  /*
> - * Tears down the mappings for the range of gfns, [start, end), and frees the
> - * non-root pages mapping GFNs strictly within that range. Returns true if
> - * SPTEs have been cleared and a TLB flush is needed before releasing the
> - * MMU lock.
> + * Zap leafs SPTEs for the range of gfns, [start, end). Returns true if SPTEs
> + * have been cleared and a TLB flush is needed before releasing the MMU lock.
>   *
>   * If can_yield is true, will release the MMU lock and reschedule if the
>   * scheduler needs the CPU or there is contention on the MMU lock. If this
> @@ -845,42 +843,25 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>   * the caller must ensure it does not supply too large a GFN range, or the
>   * operation can cause a soft lockup.
>   */
> -static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> -			  gfn_t start, gfn_t end, bool can_yield, bool flush)
> +static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> +			      gfn_t start, gfn_t end, bool can_yield, bool flush)
>  {
> -	bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
>  	struct tdp_iter iter;
>  
> -	/*
> -	 * No need to try to step down in the iterator when zapping all SPTEs,
> -	 * zapping the top-level non-leaf SPTEs will recurse on their children.
> -	 */
> -	int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
> -
>  	end = min(end, tdp_mmu_max_gfn_host());
>  
>  	lockdep_assert_held_write(&kvm->mmu_lock);
>  
>  	rcu_read_lock();
>  
> -	for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
> +	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
>  		if (can_yield &&
>  		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
>  			flush = false;
>  			continue;
>  		}
>  
> -		if (!is_shadow_present_pte(iter.old_spte))
> -			continue;
> -
> -		/*
> -		 * If this is a non-last-level SPTE that covers a larger range
> -		 * than should be zapped, continue, and zap the mappings at a
> -		 * lower level, except when zapping all SPTEs.
> -		 */
> -		if (!zap_all &&
> -		    (iter.gfn < start ||
> -		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
> +		if (!is_shadow_present_pte(iter.old_spte) ||
>  		    !is_last_spte(iter.old_spte, iter.level))
>  			continue;
>  
> @@ -898,13 +879,13 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>   * SPTEs have been cleared and a TLB flush is needed before releasing the
>   * MMU lock.
>   */
> -bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
> -				 gfn_t end, bool can_yield, bool flush)
> +bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
> +			   bool can_yield, bool flush)
>  {
>  	struct kvm_mmu_page *root;
>  
>  	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
> -		flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
> +		flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
>  
>  	return flush;
>  }
> @@ -1202,8 +1183,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
>  				 bool flush)
>  {
> -	return __kvm_tdp_mmu_zap_gfn_range(kvm, range->slot->as_id, range->start,
> -					   range->end, range->may_block, flush);
> +	return kvm_tdp_mmu_zap_leafs(kvm, range->slot->as_id, range->start,
> +				     range->end, range->may_block, flush);
>  }
>  
>  typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 5e5ef2576c81..54bc8118c40a 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -15,14 +15,8 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
>  void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  			  bool shared);
>  
> -bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
> +bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start,
>  				 gfn_t end, bool can_yield, bool flush);
> -static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
> -					     gfn_t start, gfn_t end, bool flush)
> -{
> -	return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
> -}
> -
>  bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
>  void kvm_tdp_mmu_zap_all(struct kvm *kvm);
>  void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);

-- 
Vitaly


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker
  2022-03-08 21:29                 ` Sean Christopherson
@ 2022-03-11 17:50                   ` Paolo Bonzini
  0 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2022-03-11 17:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, David Matlack, Ben Gardon,
	Mingwei Zhang

On 3/8/22 22:29, Sean Christopherson wrote:
>>>> Good, for the record these are the commit messages I have:
>> I'm seeing some hangs in ~50% of installation jobs, both Windows and Linux.
>> I have not yet tried to reproduce outside the automated tests, or to bisect,
>> but I'll try to push at least the first part of the series for 5.18.
> Out of curiosity, what was the bug?  I see this got pushed to kvm/next.
> 

Of course it was in another, "harmless" patch that was in front of it. :)

Paolo


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 18/30] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()
  2022-03-03 19:38 ` [PATCH v4 18/30] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range() Paolo Bonzini
  2022-03-04  1:16   ` Mingwei Zhang
  2022-03-11 15:09   ` Vitaly Kuznetsov
@ 2022-03-13 18:40   ` Mingwei Zhang
  2022-03-25 15:13     ` Sean Christopherson
  2 siblings, 1 reply; 62+ messages in thread
From: Mingwei Zhang @ 2022-03-13 18:40 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LKML, kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Hildenbrand, David Matlack,
	Ben Gardon

On Thu, Mar 3, 2022 at 11:39 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Zap only leaf SPTEs in the TDP MMU's zap_gfn_range(), and rename various
> functions accordingly.  When removing mappings for functional correctness
> (except for the stupid VFIO GPU passthrough memslots bug), zapping the
> leaf SPTEs is sufficient as the paging structures themselves do not point
> at guest memory and do not directly impact the final translation (in the
> TDP MMU).
>
> Note, this aligns the TDP MMU with the legacy/full MMU, which zaps only
> the rmaps, a.k.a. leaf SPTEs, in kvm_zap_gfn_range() and
> kvm_unmap_gfn_range().
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Ben Gardon <bgardon@google.com>
> Message-Id: <20220226001546.360188-18-seanjc@google.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     |  4 ++--
>  arch/x86/kvm/mmu/tdp_mmu.c | 41 ++++++++++----------------------------
>  arch/x86/kvm/mmu/tdp_mmu.h |  8 +-------
>  3 files changed, 14 insertions(+), 39 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8408d7db8d2a..febdcaaa7b94 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5834,8 +5834,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>
>         if (is_tdp_mmu_enabled(kvm)) {
>                 for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> -                       flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start,
> -                                                         gfn_end, flush);
> +                       flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start,
> +                                                     gfn_end, true, flush);
>         }
>
>         if (flush)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index f3939ce4a115..c71debdbc732 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -834,10 +834,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>  }
>
>  /*
> - * Tears down the mappings for the range of gfns, [start, end), and frees the
> - * non-root pages mapping GFNs strictly within that range. Returns true if
> - * SPTEs have been cleared and a TLB flush is needed before releasing the
> - * MMU lock.
> + * Zap leafs SPTEs for the range of gfns, [start, end). Returns true if SPTEs
> + * have been cleared and a TLB flush is needed before releasing the MMU lock.
>   *
>   * If can_yield is true, will release the MMU lock and reschedule if the
>   * scheduler needs the CPU or there is contention on the MMU lock. If this
> @@ -845,42 +843,25 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>   * the caller must ensure it does not supply too large a GFN range, or the
>   * operation can cause a soft lockup.
>   */
> -static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> -                         gfn_t start, gfn_t end, bool can_yield, bool flush)
> +static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> +                             gfn_t start, gfn_t end, bool can_yield, bool flush)
>  {
> -       bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
>         struct tdp_iter iter;
>
> -       /*
> -        * No need to try to step down in the iterator when zapping all SPTEs,
> -        * zapping the top-level non-leaf SPTEs will recurse on their children.
> -        */
> -       int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
> -
>         end = min(end, tdp_mmu_max_gfn_host());
>
>         lockdep_assert_held_write(&kvm->mmu_lock);
>
>         rcu_read_lock();
>
> -       for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
> +       for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
>                 if (can_yield &&
>                     tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
>                         flush = false;
>                         continue;
>                 }
>
> -               if (!is_shadow_present_pte(iter.old_spte))
> -                       continue;
> -
> -               /*
> -                * If this is a non-last-level SPTE that covers a larger range
> -                * than should be zapped, continue, and zap the mappings at a
> -                * lower level, except when zapping all SPTEs.
> -                */
> -               if (!zap_all &&
> -                   (iter.gfn < start ||
> -                    iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
> +               if (!is_shadow_present_pte(iter.old_spte) ||
>                     !is_last_spte(iter.old_spte, iter.level))
>                         continue;
>
> @@ -898,13 +879,13 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>   * SPTEs have been cleared and a TLB flush is needed before releasing the
>   * MMU lock.
>   */
> -bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
> -                                gfn_t end, bool can_yield, bool flush)
> +bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
> +                          bool can_yield, bool flush)
>  {
>         struct kvm_mmu_page *root;
>
>         for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
> -               flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
> +               flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);

hmm, I think we might have to be very careful here. If we only zap
leafs, then there could be side effects. For instance, the code in
disallowed_hugepage_adjust() may not work as intended. If you check
the following condition in arch/x86/kvm/mmu/mmu.c:2918

if (cur_level > PG_LEVEL_4K &&
    cur_level == fault->goal_level &&
    is_shadow_present_pte(spte) &&
    !is_large_pte(spte)) {

If we previously used 4K mappings in this range for various reasons
(dirty logging etc.) and then zap the range, the next time the guest
touches a 4K page we should map the range at the maximum level we can
for the guest.

However, if we zap only the leafs, then when the code reaches the
above location, is_shadow_present_pte(spte) will return true, since
the spte is a non-leaf entry (say a regular PMD entry). The whole if
statement will be true, and we never allow remapping guest memory
with huge pages.
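
To make the concern concrete, here is a standalone toy model of the state
after a leaf-only zap (all names and bit values are illustrative stand-ins,
not the kernel's real definitions): the PMD-level SPTE still points at a
now-empty page table, so the quoted check fires and the fault is demoted
back to 4K.

    #include <stdbool.h>
    #include <stdio.h>

    #define PG_LEVEL_4K  1
    #define PG_LEVEL_2M  2
    #define PRESENT_BIT  (1ull << 0)
    #define LARGE_BIT    (1ull << 7)

    static bool is_shadow_present_pte(unsigned long long spte) { return spte & PRESENT_BIT; }
    static bool is_large_pte(unsigned long long spte)          { return spte & LARGE_BIT; }

    int main(void)
    {
        /* After a leaf-only zap, the PMD-level SPTE still points at a
         * (now empty) page table: present, but not a large page. */
        unsigned long long pmd_spte = PRESENT_BIT;
        int cur_level = PG_LEVEL_2M, goal_level = PG_LEVEL_2M;

        bool demote = cur_level > PG_LEVEL_4K &&
                      cur_level == goal_level &&
                      is_shadow_present_pte(pmd_spte) &&
                      !is_large_pte(pmd_spte);

        printf("hugepage fault demoted to 4K: %s\n", demote ? "yes" : "no");
        return 0;
    }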

>
>         return flush;
>  }
> @@ -1202,8 +1183,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
>                                  bool flush)
>  {
> -       return __kvm_tdp_mmu_zap_gfn_range(kvm, range->slot->as_id, range->start,
> -                                          range->end, range->may_block, flush);
> +       return kvm_tdp_mmu_zap_leafs(kvm, range->slot->as_id, range->start,
> +                                    range->end, range->may_block, flush);
>  }
>
>  typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 5e5ef2576c81..54bc8118c40a 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -15,14 +15,8 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
>  void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>                           bool shared);
>
> -bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
> +bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start,
>                                  gfn_t end, bool can_yield, bool flush);
> -static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
> -                                            gfn_t start, gfn_t end, bool flush)
> -{
> -       return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
> -}
> -
>  bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
>  void kvm_tdp_mmu_zap_all(struct kvm *kvm);
>  void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
> --
> 2.31.1
>
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 18/30] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()
  2022-03-13 18:40   ` Mingwei Zhang
@ 2022-03-25 15:13     ` Sean Christopherson
  2022-03-26 18:10       ` Mingwei Zhang
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2022-03-25 15:13 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, LKML, kvm, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Hildenbrand, David Matlack,
	Ben Gardon

On Sun, Mar 13, 2022, Mingwei Zhang wrote:
> On Thu, Mar 3, 2022 at 11:39 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > @@ -898,13 +879,13 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> >   * SPTEs have been cleared and a TLB flush is needed before releasing the
> >   * MMU lock.
> >   */
> > -bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
> > -                                gfn_t end, bool can_yield, bool flush)
> > +bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
> > +                          bool can_yield, bool flush)
> >  {
> >         struct kvm_mmu_page *root;
> >
> >         for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
> > -               flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
> > +               flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
> 
> hmm, I think we might have to be very careful here. If we only zap
> leafs, then there could be side effects. For instance, the code in
> disallowed_hugepage_adjust() may not work as intended. If you check
> the following condition in arch/x86/kvm/mmu/mmu.c:2918
> 
> if (cur_level > PG_LEVEL_4K &&
>     cur_level == fault->goal_level &&
>     is_shadow_present_pte(spte) &&
>     !is_large_pte(spte)) {
> 
> If we previously use 4K mappings in this range due to various reasons
> (dirty logging etc), then afterwards, we zap the range. Then the guest
> touches a 4K and now we should map the range with whatever the maximum
> level we can for the guest.
> 
> However, if we just zap only the leafs, then when the code comes to
> the above location, is_shadow_present_pte(spte) will return true,
> since the spte is a non-leaf (say a regular PMD entry). The whole if
> statement will be true, then we never allow remapping guest memory
> with huge pages.

But that's at worst a performance issue, and arguably working as intended.  The
zap in this case is never due to the _guest_ unmapping the pfn, so odds are good
the guest will want to map back in the same pfns with the same permissions.
Zapping shadow pages so that the guest can maybe create a hugepage may end up
being a lot of extra work for no benefit.  Or it may be a net positive.  Either
way, it's not a functional issue.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 18/30] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()
  2022-03-25 15:13     ` Sean Christopherson
@ 2022-03-26 18:10       ` Mingwei Zhang
  2022-03-28 15:06         ` Sean Christopherson
  0 siblings, 1 reply; 62+ messages in thread
From: Mingwei Zhang @ 2022-03-26 18:10 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, LKML, kvm, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Hildenbrand, David Matlack,
	Ben Gardon

On Fri, Mar 25, 2022, Sean Christopherson wrote:
> On Sun, Mar 13, 2022, Mingwei Zhang wrote:
> > On Thu, Mar 3, 2022 at 11:39 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > > @@ -898,13 +879,13 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > >   * SPTEs have been cleared and a TLB flush is needed before releasing the
> > >   * MMU lock.
> > >   */
> > > -bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
> > > -                                gfn_t end, bool can_yield, bool flush)
> > > +bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
> > > +                          bool can_yield, bool flush)
> > >  {
> > >         struct kvm_mmu_page *root;
> > >
> > >         for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
> > > -               flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
> > > +               flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
> > 
> > hmm, I think we might have to be very careful here. If we only zap
> > leafs, then there could be side effects. For instance, the code in
> > disallowed_hugepage_adjust() may not work as intended. If you check
> > the following condition in arch/x86/kvm/mmu/mmu.c:2918
> > 
> > if (cur_level > PG_LEVEL_4K &&
> >     cur_level == fault->goal_level &&
> >     is_shadow_present_pte(spte) &&
> >     !is_large_pte(spte)) {
> > 
> > If we previously use 4K mappings in this range due to various reasons
> > (dirty logging etc), then afterwards, we zap the range. Then the guest
> > touches a 4K and now we should map the range with whatever the maximum
> > level we can for the guest.
> > 
> > However, if we just zap only the leafs, then when the code comes to
> > the above location, is_shadow_present_pte(spte) will return true,
> > since the spte is a non-leaf (say a regular PMD entry). The whole if
> > statement will be true, then we never allow remapping guest memory
> > with huge pages.
> 
> But that's at worst a performance issue, and arguably working as intended.  The
> zap in this case is never due to the _guest_ unmapping the pfn, so odds are good
> the guest will want to map back in the same pfns with the same permissions.
> Zapping shadow pages so that the guest can maybe create a hugepage may end up
> being a lot of extra work for no benefit.  Or it may be a net positive.  Either
> way, it's not a functional issue.

This is a performance bug rather than a functional one, but it does
affect both dirty logging (before Ben's early page promotion) and our
demand paging. So I proposed a fix here:

https://lore.kernel.org/lkml/20220323184915.1335049-2-mizhang@google.com/T/#me78d50ffac33f4f418432f7b171c50630414ef28

If we see memory corruption, I bet it can only be that we are missing
some TLB flushes, since this patch series is basically trying to avoid
immediate TLB flushing by simply changing the ASID (assigning a new root).

To debug, maybe force the TLB flushes after zap_gfn_range and see if the
problem still exists?
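
Concretely, the debugging hack being suggested could be as small as this
(untested sketch against the kvm_zap_gfn_range() hunk quoted above; it is
a diagnostic, not a proposed fix):

    if (is_tdp_mmu_enabled(kvm)) {
        for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
            flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start,
                                          gfn_end, true, flush);
        /* Debug only: force the remote TLB flush below, regardless of
         * whether the TDP MMU reported anything to flush. */
        flush = true;
    }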



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 18/30] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()
  2022-03-26 18:10       ` Mingwei Zhang
@ 2022-03-28 15:06         ` Sean Christopherson
  0 siblings, 0 replies; 62+ messages in thread
From: Sean Christopherson @ 2022-03-28 15:06 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, LKML, kvm, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Hildenbrand, David Matlack,
	Ben Gardon

On Sat, Mar 26, 2022, Mingwei Zhang wrote:
> On Fri, Mar 25, 2022, Sean Christopherson wrote:
> > On Sun, Mar 13, 2022, Mingwei Zhang wrote:
> > > On Thu, Mar 3, 2022 at 11:39 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > > > @@ -898,13 +879,13 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > > >   * SPTEs have been cleared and a TLB flush is needed before releasing the
> > > >   * MMU lock.
> > > >   */
> > > > -bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
> > > > -                                gfn_t end, bool can_yield, bool flush)
> > > > +bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
> > > > +                          bool can_yield, bool flush)
> > > >  {
> > > >         struct kvm_mmu_page *root;
> > > >
> > > >         for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
> > > > -               flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
> > > > +               flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
> > > 
> > > hmm, I think we might have to be very careful here. If we only zap
> > > leafs, then there could be side effects. For instance, the code in
> > > disallowed_hugepage_adjust() may not work as intended. If you check
> > > the following condition in arch/x86/kvm/mmu/mmu.c:2918
> > > 
> > > if (cur_level > PG_LEVEL_4K &&
> > >     cur_level == fault->goal_level &&
> > >     is_shadow_present_pte(spte) &&
> > >     !is_large_pte(spte)) {
> > > 
> > > If we previously use 4K mappings in this range due to various reasons
> > > (dirty logging etc), then afterwards, we zap the range. Then the guest
> > > touches a 4K and now we should map the range with whatever the maximum
> > > level we can for the guest.
> > > 
> > > However, if we just zap only the leafs, then when the code comes to
> > > the above location, is_shadow_present_pte(spte) will return true,
> > > since the spte is a non-leaf (say a regular PMD entry). The whole if
> > > statement will be true, then we never allow remapping guest memory
> > > with huge pages.
> > 
> > But that's at worst a performance issue, and arguably working as intended.  The
> > zap in this case is never due to the _guest_ unmapping the pfn, so odds are good
> > the guest will want to map back in the same pfns with the same permissions.
> > Zapping shadow pages so that the guest can maybe create a hugepage may end up
> > being a lot of extra work for no benefit.  Or it may be a net positive.  Either
> > way, it's not a functional issue.
> 
> This should be a performance bug instead of a functional one. But it
> does affect both dirty logging (before Ben's early page promotion) and
> our demand paging.

I'd buy the argument that KVM should zap shadow pages when zapping specifically to
recreate huge pages, but that's a different path entirely.  Disabling of dirty
logging uses a dedicated path, zap_collapsible_spte_range().

> So I proposed the fix in here:
> 
> https://lore.kernel.org/lkml/20220323184915.1335049-2-mizhang@google.com/T/#me78d50ffac33f4f418432f7b171c50630414ef28
> 
> If we see memory corruptions, I bet it could only be that we miss some
> TLB flushes, since this patch series is basically trying to avoid
> immediate TLB flushing by simply changing ASID (assigning new root).

Ya, it was a lost TLB flush goof.  My apologies for not cc'ing you on the patch.

https://lore.kernel.org/all/20220325230348.2587437-1-seanjc@google.com

> To debug, maybe force the TLB flushes after zap_gfn_range and see if the
> problem still exist?
> 
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2022-03-28 15:06 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-03 19:38 [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 01/30] KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 02/30] KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 03/30] KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic Paolo Bonzini
2022-03-03 23:39   ` Mingwei Zhang
2022-03-03 19:38 ` [PATCH v4 04/30] KVM: x86/mmu: Document that zapping invalidated roots doesn't need to flush Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 05/30] KVM: x86/mmu: Require mmu_lock be held for write in unyielding root iter Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 06/30] KVM: x86/mmu: only perform eager page splitting on valid roots Paolo Bonzini
2022-03-03 20:03   ` Sean Christopherson
2022-03-03 19:38 ` [PATCH v4 07/30] KVM: x86/mmu: do not allow readers to acquire references to invalid roots Paolo Bonzini
2022-03-03 20:12   ` Sean Christopherson
2022-03-03 19:38 ` [PATCH v4 08/30] KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP removal Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 09/30] KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier change_spte Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 10/30] KVM: x86/mmu: Drop RCU after processing each root in MMU notifier hooks Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 11/30] KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 12/30] KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 13/30] KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw values Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 14/30] KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 15/30] KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 16/30] KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page Paolo Bonzini
2022-03-04  0:07   ` Mingwei Zhang
2022-03-03 19:38 ` [PATCH v4 17/30] KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range Paolo Bonzini
2022-03-04  0:14   ` Mingwei Zhang
2022-03-03 19:38 ` [PATCH v4 18/30] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range() Paolo Bonzini
2022-03-04  1:16   ` Mingwei Zhang
2022-03-04 16:11     ` Sean Christopherson
2022-03-04 18:00       ` Mingwei Zhang
2022-03-04 18:42         ` Sean Christopherson
2022-03-11 15:09   ` Vitaly Kuznetsov
2022-03-13 18:40   ` Mingwei Zhang
2022-03-25 15:13     ` Sean Christopherson
2022-03-26 18:10       ` Mingwei Zhang
2022-03-28 15:06         ` Sean Christopherson
2022-03-03 19:38 ` [PATCH v4 19/30] KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched Paolo Bonzini
2022-03-04  1:19   ` Mingwei Zhang
2022-03-03 19:38 ` [PATCH v4 20/30] KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker Paolo Bonzini
2022-03-03 20:54   ` Sean Christopherson
2022-03-03 21:06     ` Sean Christopherson
2022-03-03 21:20   ` Sean Christopherson
2022-03-03 21:32     ` Sean Christopherson
2022-03-04  6:48       ` Paolo Bonzini
2022-03-04 16:02         ` Sean Christopherson
2022-03-04 18:11           ` Paolo Bonzini
2022-03-05  0:34             ` Sean Christopherson
2022-03-05 19:53               ` Paolo Bonzini
2022-03-08 21:29                 ` Sean Christopherson
2022-03-11 17:50                   ` Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 22/30] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 23/30] KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 24/30] KVM: x86/mmu: Zap defunct roots via asynchronous worker Paolo Bonzini
2022-03-03 22:08   ` Sean Christopherson
2022-03-03 19:38 ` [PATCH v4 25/30] KVM: x86/mmu: Check for a REMOVED leaf SPTE before making the SPTE Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 26/30] KVM: x86/mmu: WARN on any attempt to atomically update REMOVED SPTE Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 27/30] KVM: selftests: Move raw KVM_SET_USER_MEMORY_REGION helper to utils Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 28/30] KVM: selftests: Split out helper to allocate guest mem via memfd Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 29/30] KVM: selftests: Define cpu_relax() helpers for s390 and x86 Paolo Bonzini
2022-03-03 19:38 ` [PATCH v4 30/30] KVM: selftests: Add test to populate a VM with the max possible guest mem Paolo Bonzini
2022-03-08 14:47   ` Paolo Bonzini
2022-03-08 15:36     ` Christian Borntraeger
2022-03-08 21:09     ` Sean Christopherson
2022-03-08 17:25 ` [PATCH v4 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).