* [PATCH 0/2] KVM: x86/mmu: Fix TLB flushing bugs in TDP MMU
@ 2021-03-19 23:20 Sean Christopherson
  2021-03-19 23:20 ` [PATCH 1/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap Sean Christopherson
  2021-03-19 23:20 ` [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping Sean Christopherson
  0 siblings, 2 replies; 10+ messages in thread
From: Sean Christopherson @ 2021-03-19 23:20 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon

Two bug fixes involving the TDP MMU.  Found by inspection while working on
a series to consolidate MMU notifier memslot walks across architectures,
which I'll hopefully post next week.

Patch 1 fixes a bug where KVM yields, e.g. due to lock contention, without
performing a TLB flush that was left pending by a previous root.

Patch 2 fixes a much more egregious bug where NX huge page recovery fails
to handle TDP MMU flushes, as well as a bug similar to patch 1 where KVM
can yield without correctly handling a previously triggered pending TLB
flush.
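
Both patches converge on the same basic pattern: thread a "flush" state
through the walk so that a flush left pending by one root (or by the legacy
MMU) is performed before any yield.  A simplified sketch of that pattern,
paraphrased from the diffs below rather than copied from them:

	bool flush = false;

	for_each_tdp_mmu_root_yield_safe(kvm, root)
		/* Carry any pending flush into the next root's walk... */
		flush = zap_gfn_range(kvm, root, start, end, true, flush);

	/* ...so tdp_mmu_iter_cond_resched() can flush before yielding. */
	return flush;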

Sean Christopherson (2):
  KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range
    zap
  KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping

 arch/x86/kvm/mmu/mmu.c     | 15 ++++++++++-----
 arch/x86/kvm/mmu/tdp_mmu.c | 29 +++++++++++++++--------------
 arch/x86/kvm/mmu/tdp_mmu.h |  3 ++-
 3 files changed, 27 insertions(+), 20 deletions(-)

-- 
2.31.0.rc2.261.g7f71774620-goog



* [PATCH 1/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap
  2021-03-19 23:20 [PATCH 0/2] KVM: x86/mmu: Fix TLB flushing bugs in TDP MMU Sean Christopherson
@ 2021-03-19 23:20 ` Sean Christopherson
  2021-03-22 21:27   ` Ben Gardon
  2021-03-19 23:20 ` [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping Sean Christopherson
  1 sibling, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2021-03-19 23:20 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon

When flushing a range of GFNs across multiple roots, ensure any pending
flush from a previous root is honored before yielding while walking the
tables of the current root.

Note, kvm_tdp_mmu_zap_gfn_range() now intentionally overwrites its local
"flush" with the result to avoid redundant flushes.  zap_gfn_range()
preserves and returns the incoming "flush", unless of course the flush was
performed prior to yielding and no new flush was triggered.

Fixes: 1af4a96025b3 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f0c99fa04ef2..6cf08c3c537f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -86,7 +86,7 @@ static inline struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 	list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link)
 
 static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
-			  gfn_t start, gfn_t end, bool can_yield);
+			  gfn_t start, gfn_t end, bool can_yield, bool flush);
 
 void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
 {
@@ -99,7 +99,7 @@ void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
 
 	list_del(&root->link);
 
-	zap_gfn_range(kvm, root, 0, max_gfn, false);
+	zap_gfn_range(kvm, root, 0, max_gfn, false, false);
 
 	free_page((unsigned long)root->spt);
 	kmem_cache_free(mmu_page_header_cache, root);
@@ -664,20 +664,21 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
  * scheduler needs the CPU or there is contention on the MMU lock. If this
  * function cannot yield, it will not release the MMU lock or reschedule and
  * the caller must ensure it does not supply too large a GFN range, or the
- * operation can cause a soft lockup.
+ * operation can cause a soft lockup.  Note, in some use cases a flush may be
+ * required by prior actions.  Ensure the pending flush is performed prior to
+ * yielding.
  */
 static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
-			  gfn_t start, gfn_t end, bool can_yield)
+			  gfn_t start, gfn_t end, bool can_yield, bool flush)
 {
 	struct tdp_iter iter;
-	bool flush_needed = false;
 
 	rcu_read_lock();
 
 	tdp_root_for_each_pte(iter, root, start, end) {
 		if (can_yield &&
-		    tdp_mmu_iter_cond_resched(kvm, &iter, flush_needed)) {
-			flush_needed = false;
+		    tdp_mmu_iter_cond_resched(kvm, &iter, flush)) {
+			flush = false;
 			continue;
 		}
 
@@ -695,11 +696,11 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			continue;
 
 		tdp_mmu_set_spte(kvm, &iter, 0);
-		flush_needed = true;
+		flush = true;
 	}
 
 	rcu_read_unlock();
-	return flush_needed;
+	return flush;
 }
 
 /*
@@ -714,7 +715,7 @@ bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end)
 	bool flush = false;
 
 	for_each_tdp_mmu_root_yield_safe(kvm, root)
-		flush |= zap_gfn_range(kvm, root, start, end, true);
+		flush = zap_gfn_range(kvm, root, start, end, true, flush);
 
 	return flush;
 }
@@ -931,7 +932,7 @@ static int zap_gfn_range_hva_wrapper(struct kvm *kvm,
 				     struct kvm_mmu_page *root, gfn_t start,
 				     gfn_t end, unsigned long unused)
 {
-	return zap_gfn_range(kvm, root, start, end, false);
+	return zap_gfn_range(kvm, root, start, end, false, false);
 }
 
 int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
-- 
2.31.0.rc2.261.g7f71774620-goog



* [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping
  2021-03-19 23:20 [PATCH 0/2] KVM: x86/mmu: Fix TLB flushing bugs in TDP MMU Sean Christopherson
  2021-03-19 23:20 ` [PATCH 1/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap Sean Christopherson
@ 2021-03-19 23:20 ` Sean Christopherson
  2021-03-22 21:27   ` Ben Gardon
  1 sibling, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2021-03-19 23:20 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon

Fix two intertwined bugs in the NX huge page zapping that were introduced
by the incorporation of the TDP MMU.  Because there is a unified list of
NX huge pages, zapping can encounter both TDP MMU and legacy MMU pages,
and the two MMUs have different tracking for TLB flushing.  If one flavor
needs a flush, but the code for the other flavor yields, KVM will fail to
flush before yielding.

First, honor the "flush needed" return from kvm_tdp_mmu_zap_gfn_range(),
which does the flush itself if and only if it yields, and otherwise
expects the caller to do the flush.  This requires feeding the result
into kvm_mmu_remote_flush_or_zap(), and so also fixes the case where the
TDP MMU needs a flush, the legacy MMU does not, and the main loop yields.

Second, tell the TDP MMU a flush is pending if the list of zapped pages
from legacy MMUs is not empty, i.e. the legacy MMU needs a flush.  This
fixes the case where the TDP MMU yields, but it itself does not require a
flush.

Fixes: 29cf0f5007a2 ("kvm: x86/mmu: NX largepage recovery for TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c     | 15 ++++++++++-----
 arch/x86/kvm/mmu/tdp_mmu.c |  6 +++---
 arch/x86/kvm/mmu/tdp_mmu.h |  3 ++-
 3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c6ed633594a2..413d6259340e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5517,7 +5517,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 	}
 
 	if (is_tdp_mmu_enabled(kvm)) {
-		flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end);
+		flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end,
+						  false);
 		if (flush)
 			kvm_flush_remote_tlbs(kvm);
 	}
@@ -5939,6 +5940,8 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
 	struct kvm_mmu_page *sp;
 	unsigned int ratio;
 	LIST_HEAD(invalid_list);
+	bool flush = false;
+	gfn_t gfn_end;
 	ulong to_zap;
 
 	rcu_idx = srcu_read_lock(&kvm->srcu);
@@ -5960,19 +5963,21 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
 				      lpage_disallowed_link);
 		WARN_ON_ONCE(!sp->lpage_disallowed);
 		if (is_tdp_mmu_page(sp)) {
-			kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn,
-				sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level));
+			gfn_end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level);
+			flush = kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, gfn_end,
+							  flush || !list_empty(&invalid_list));
 		} else {
 			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
 			WARN_ON_ONCE(sp->lpage_disallowed);
 		}
 
 		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
-			kvm_mmu_commit_zap_page(kvm, &invalid_list);
+			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
 			cond_resched_rwlock_write(&kvm->mmu_lock);
+			flush = false;
 		}
 	}
-	kvm_mmu_commit_zap_page(kvm, &invalid_list);
+	kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
 
 	write_unlock(&kvm->mmu_lock);
 	srcu_read_unlock(&kvm->srcu, rcu_idx);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6cf08c3c537f..367f12bf1026 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -709,10 +709,10 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
  * SPTEs have been cleared and a TLB flush is needed before releasing the
  * MMU lock.
  */
-bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end)
+bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end,
+			       bool flush)
 {
 	struct kvm_mmu_page *root;
-	bool flush = false;
 
 	for_each_tdp_mmu_root_yield_safe(kvm, root)
 		flush = zap_gfn_range(kvm, root, start, end, true, flush);
@@ -725,7 +725,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
 	gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
 	bool flush;
 
-	flush = kvm_tdp_mmu_zap_gfn_range(kvm, 0, max_gfn);
+	flush = kvm_tdp_mmu_zap_gfn_range(kvm, 0, max_gfn, false);
 	if (flush)
 		kvm_flush_remote_tlbs(kvm);
 }
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 3b761c111bff..e39bee52d49e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -8,7 +8,8 @@
 hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
 void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root);
 
-bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end);
+bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end,
+			       bool flush);
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
-- 
2.31.0.rc2.261.g7f71774620-goog



* Re: [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping
  2021-03-19 23:20 ` [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping Sean Christopherson
@ 2021-03-22 21:27   ` Ben Gardon
  2021-03-23  0:15     ` Sean Christopherson
  0 siblings, 1 reply; 10+ messages in thread
From: Ben Gardon @ 2021-03-22 21:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On Fri, Mar 19, 2021 at 4:20 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Fix two intertwined bugs in the NX huge page zapping that were introduced
> by the incorporation of the TDP MMU.  Because there is a unified list of
> NX huge pages, zapping can encounter both TDP MMU and legacy MMU pages,
> and the two MMUs have different tracking for TLB flushing.  If one flavor
> needs a flush, but the code for the other flavor yields, KVM will fail to
> flush before yielding.
>
> First, honor the "flush needed" return from kvm_tdp_mmu_zap_gfn_range(),
> which does the flush itself if and only if it yields, and otherwise
> expects the caller to do the flush.  This requires feeding the result
> into kvm_mmu_remote_flush_or_zap(), and so also fixes the case where the
> TDP MMU needs a flush, the legacy MMU does not, and the main loop yields.
>
> Second, tell the TDP MMU a flush is pending if the list of zapped pages
> from legacy MMUs is not empty, i.e. the legacy MMU needs a flush.  This
> fixes the case where the TDP MMU yields, but it itself does not require a
> flush.
>
> Fixes: 29cf0f5007a2 ("kvm: x86/mmu: NX largepage recovery for TDP MMU")
> Cc: stable@vger.kernel.org
> Cc: Ben Gardon <bgardon@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-By: Ben Gardon <bgardon@google.com>

This preserves an extremely unlikely degenerate case, which could
cause an unexpected delay.
The scenario is described below, but I don't think this change needs
to be blocked on it.

> ---
>  arch/x86/kvm/mmu/mmu.c     | 15 ++++++++++-----
>  arch/x86/kvm/mmu/tdp_mmu.c |  6 +++---
>  arch/x86/kvm/mmu/tdp_mmu.h |  3 ++-
>  3 files changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c6ed633594a2..413d6259340e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5517,7 +5517,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>         }
>
>         if (is_tdp_mmu_enabled(kvm)) {
> -               flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end);
> +               flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end,
> +                                                 false);
>                 if (flush)
>                         kvm_flush_remote_tlbs(kvm);
>         }
> @@ -5939,6 +5940,8 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
>         struct kvm_mmu_page *sp;
>         unsigned int ratio;
>         LIST_HEAD(invalid_list);
> +       bool flush = false;
> +       gfn_t gfn_end;
>         ulong to_zap;
>
>         rcu_idx = srcu_read_lock(&kvm->srcu);
> @@ -5960,19 +5963,21 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
>                                       lpage_disallowed_link);
>                 WARN_ON_ONCE(!sp->lpage_disallowed);
>                 if (is_tdp_mmu_page(sp)) {
> -                       kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn,
> -                               sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level));
> +                       gfn_end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level);
> +                       flush = kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, gfn_end,
> +                                                         flush || !list_empty(&invalid_list));
>                 } else {
>                         kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
>                         WARN_ON_ONCE(sp->lpage_disallowed);
>                 }
>
>                 if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> -                       kvm_mmu_commit_zap_page(kvm, &invalid_list);
> +                       kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);

This pattern of waiting until a yield is needed or lock contention is
detected has always been a little suspect to me because
kvm_mmu_commit_zap_page does work proportional to the work done before
the yield was needed. That seems like more work than we should like to
be doing at that point.

The yield in kvm_tdp_mmu_zap_gfn_range makes that phenomenon even
worse. Because we can satisfy the need to yield without clearing out
the invalid list, we can potentially queue many more pages which will
then all need to have their zaps committed at once. This is an
admittedly contrived case which could only be hit in a high load
nested scenario.

It could be fixed by forbidding kvm_tdp_mmu_zap_gfn_range from
yielding. Since we should only need to zap one SPTE, the yield should
not be needed within the kvm_tdp_mmu_zap_gfn_range call. To ensure
that only one SPTE is zapped we would have to specify the root though.
Otherwise we could end up zapping all the entries for the same GFN
range under an unrelated root.

Anyway, I can address this in another patch.
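
Roughly what I have in mind, purely as a hypothetical sketch (the helper
name and the explicit root parameter are invented here, not part of this
series):

/* Hypothetical: zap only the range backing @sp, under @sp's own root. */
static bool tdp_mmu_zap_sp_no_yield(struct kvm *kvm, struct kvm_mmu_page *root,
				    struct kvm_mmu_page *sp, bool flush)
{
	gfn_t end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level);

	/* can_yield=false: zapping a single SP's range can't soft lockup. */
	return zap_gfn_range(kvm, root, sp->gfn, end, false, flush);
}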

>                         cond_resched_rwlock_write(&kvm->mmu_lock);
> +                       flush = false;
>                 }
>         }
> -       kvm_mmu_commit_zap_page(kvm, &invalid_list);
> +       kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
>
>         write_unlock(&kvm->mmu_lock);
>         srcu_read_unlock(&kvm->srcu, rcu_idx);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 6cf08c3c537f..367f12bf1026 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -709,10 +709,10 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>   * SPTEs have been cleared and a TLB flush is needed before releasing the
>   * MMU lock.
>   */
> -bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end,
> +                              bool flush)
>  {
>         struct kvm_mmu_page *root;
> -       bool flush = false;
>
>         for_each_tdp_mmu_root_yield_safe(kvm, root)
>                 flush = zap_gfn_range(kvm, root, start, end, true, flush);
> @@ -725,7 +725,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>         gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
>         bool flush;
>
> -       flush = kvm_tdp_mmu_zap_gfn_range(kvm, 0, max_gfn);
> +       flush = kvm_tdp_mmu_zap_gfn_range(kvm, 0, max_gfn, false);
>         if (flush)
>                 kvm_flush_remote_tlbs(kvm);
>  }
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 3b761c111bff..e39bee52d49e 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -8,7 +8,8 @@
>  hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
>  void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root);
>
> -bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end);
> +bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end,
> +                              bool flush);
>  void kvm_tdp_mmu_zap_all(struct kvm *kvm);
>
>  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> --
> 2.31.0.rc2.261.g7f71774620-goog
>


* Re: [PATCH 1/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap
  2021-03-19 23:20 ` [PATCH 1/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap Sean Christopherson
@ 2021-03-22 21:27   ` Ben Gardon
  0 siblings, 0 replies; 10+ messages in thread
From: Ben Gardon @ 2021-03-22 21:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On Fri, Mar 19, 2021 at 4:20 PM Sean Christopherson <seanjc@google.com> wrote:
>
> When flushing a range of GFNs across multiple roots, ensure any pending
> flush from a previous root is honored before yielding while walking the
> tables of the current root.
>
> Note, kvm_tdp_mmu_zap_gfn_range() now intentionally overwrites its local
> "flush" with the result to avoid redundant flushes.  zap_gfn_range()
> preserves and returns the incoming "flush", unless of course the flush was
> performed prior to yielding and no new flush was triggered.
>
> Fixes: 1af4a96025b3 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
> Cc: stable@vger.kernel.org
> Cc: Ben Gardon <bgardon@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-By: Ben Gardon <bgardon@google.com>

> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 23 ++++++++++++-----------
>  1 file changed, 12 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index f0c99fa04ef2..6cf08c3c537f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -86,7 +86,7 @@ static inline struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
>         list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link)
>
>  static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> -                         gfn_t start, gfn_t end, bool can_yield);
> +                         gfn_t start, gfn_t end, bool can_yield, bool flush);

This function is going to acquire so many arguments. Don't need to do
anything about it here, but this is going to need some kind of cleanup
at some point.
I'll have to add another "shared" type arg for running this function
under the read lock in a series I'm prepping.
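
E.g., one possible shape for that cleanup, purely as a sketch (the flag
names below are invented, not from any posted series): fold the
accumulating booleans into a single flags word.

/* Hypothetical flags replacing the growing list of bool parameters. */
#define TDP_MMU_ZAP_CAN_YIELD  BIT(0)  /* may drop mmu_lock and resched */
#define TDP_MMU_ZAP_FLUSH      BIT(1)  /* a TLB flush is already pending */
#define TDP_MMU_ZAP_SHARED     BIT(2)  /* running under read-locked mmu_lock */

static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
			  gfn_t start, gfn_t end, unsigned int zap_flags);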


>
>  void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
>  {
> @@ -99,7 +99,7 @@ void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
>
>         list_del(&root->link);
>
> -       zap_gfn_range(kvm, root, 0, max_gfn, false);
> +       zap_gfn_range(kvm, root, 0, max_gfn, false, false);
>
>         free_page((unsigned long)root->spt);
>         kmem_cache_free(mmu_page_header_cache, root);
> @@ -664,20 +664,21 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
>   * scheduler needs the CPU or there is contention on the MMU lock. If this
>   * function cannot yield, it will not release the MMU lock or reschedule and
>   * the caller must ensure it does not supply too large a GFN range, or the
> - * operation can cause a soft lockup.
> + * operation can cause a soft lockup.  Note, in some use cases a flush may be
> + * required by prior actions.  Ensure the pending flush is performed prior to
> + * yielding.
>   */
>  static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> -                         gfn_t start, gfn_t end, bool can_yield)
> +                         gfn_t start, gfn_t end, bool can_yield, bool flush)
>  {
>         struct tdp_iter iter;
> -       bool flush_needed = false;
>
>         rcu_read_lock();
>
>         tdp_root_for_each_pte(iter, root, start, end) {
>                 if (can_yield &&
> -                   tdp_mmu_iter_cond_resched(kvm, &iter, flush_needed)) {
> -                       flush_needed = false;
> +                   tdp_mmu_iter_cond_resched(kvm, &iter, flush)) {
> +                       flush = false;
>                         continue;
>                 }
>
> @@ -695,11 +696,11 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>                         continue;
>
>                 tdp_mmu_set_spte(kvm, &iter, 0);
> -               flush_needed = true;
> +               flush = true;
>         }
>
>         rcu_read_unlock();
> -       return flush_needed;
> +       return flush;
>  }
>
>  /*
> @@ -714,7 +715,7 @@ bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end)
>         bool flush = false;
>
>         for_each_tdp_mmu_root_yield_safe(kvm, root)
> -               flush |= zap_gfn_range(kvm, root, start, end, true);
> +               flush = zap_gfn_range(kvm, root, start, end, true, flush);
>
>         return flush;
>  }
> @@ -931,7 +932,7 @@ static int zap_gfn_range_hva_wrapper(struct kvm *kvm,
>                                      struct kvm_mmu_page *root, gfn_t start,
>                                      gfn_t end, unsigned long unused)
>  {
> -       return zap_gfn_range(kvm, root, start, end, false);
> +       return zap_gfn_range(kvm, root, start, end, false, false);
>  }
>
>  int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
> --
> 2.31.0.rc2.261.g7f71774620-goog
>


* Re: [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping
  2021-03-22 21:27   ` Ben Gardon
@ 2021-03-23  0:15     ` Sean Christopherson
  2021-03-23 16:26       ` Ben Gardon
  0 siblings, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2021-03-23  0:15 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On Mon, Mar 22, 2021, Ben Gardon wrote:
> On Fri, Mar 19, 2021 at 4:20 PM Sean Christopherson <seanjc@google.com> wrote:
> > @@ -5960,19 +5963,21 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
> >                                       lpage_disallowed_link);
> >                 WARN_ON_ONCE(!sp->lpage_disallowed);
> >                 if (is_tdp_mmu_page(sp)) {
> > -                       kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn,
> > -                               sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level));
> > +                       gfn_end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level);
> > +                       flush = kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, gfn_end,
> > +                                                         flush || !list_empty(&invalid_list));
> >                 } else {
> >                         kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> >                         WARN_ON_ONCE(sp->lpage_disallowed);
> >                 }
> >
> >                 if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> > -                       kvm_mmu_commit_zap_page(kvm, &invalid_list);
> > +                       kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
> 
> This pattern of waiting until a yield is needed or lock contention is
> detected has always been a little suspect to me because
> kvm_mmu_commit_zap_page does work proportional to the work done before
> the yield was needed. That seems like more work than we should like to
> be doing at that point.
> 
> The yield in kvm_tdp_mmu_zap_gfn_range makes that phenomenon even
> worse. Because we can satisfy the need to yield without clearing out
> the invalid list, we can potentially queue many more pages which will
> then all need to have their zaps committed at once. This is an
> admittedly contrived case which could only be hit in a high load
> nested scenario.
> 
> It could be fixed by forbidding kvm_tdp_mmu_zap_gfn_range from
> yielding. Since we should only need to zap one SPTE, the yield should
> not be needed within the kvm_tdp_mmu_zap_gfn_range call. To ensure
> that only one SPTE is zapped we would have to specify the root though.
> Otherwise we could end up zapping all the entries for the same GFN
> range under an unrelated root.

Hmm, I originally did exactly that, but changed my mind because this zaps far
more than 1 SPTE.  This is zapping a SP that could be huge, but is not, which
means it's guaranteed to have a non-zero number of child SPTEs.  The worst case
scenario is that SP is a PUD (potential 1gb page) and the leafs are 4k SPTEs.

But, I didn't consider the interplay between invalid_list and the TDP MMU
yielding.  Hrm.
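
For scale, the worst case range works out to roughly this, reusing the same
helper the patch uses (assuming PG_LEVEL_1G as the PUD level):

	/* A PUD-level SP spans 512 * 512 = 262,144 4KiB GFNs. */
	gfn_t gfn_end = sp->gfn + KVM_PAGES_PER_HPAGE(PG_LEVEL_1G);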


* Re: [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping
  2021-03-23  0:15     ` Sean Christopherson
@ 2021-03-23 16:26       ` Ben Gardon
  2021-03-23 18:58         ` Sean Christopherson
  0 siblings, 1 reply; 10+ messages in thread
From: Ben Gardon @ 2021-03-23 16:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On Mon, Mar 22, 2021 at 5:15 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Mar 22, 2021, Ben Gardon wrote:
> > On Fri, Mar 19, 2021 at 4:20 PM Sean Christopherson <seanjc@google.com> wrote:
> > > @@ -5960,19 +5963,21 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
> > >                                       lpage_disallowed_link);
> > >                 WARN_ON_ONCE(!sp->lpage_disallowed);
> > >                 if (is_tdp_mmu_page(sp)) {
> > > -                       kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn,
> > > -                               sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level));
> > > +                       gfn_end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level);
> > > +                       flush = kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, gfn_end,
> > > +                                                         flush || !list_empty(&invalid_list));
> > >                 } else {
> > >                         kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> > >                         WARN_ON_ONCE(sp->lpage_disallowed);
> > >                 }
> > >
> > >                 if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> > > -                       kvm_mmu_commit_zap_page(kvm, &invalid_list);
> > > +                       kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
> >
> > This pattern of waiting until a yield is needed or lock contention is
> > detected has always been a little suspect to me because
> > kvm_mmu_commit_zap_page does work proportional to the work done before
> > the yield was needed. That seems like more work than we should like to
> > be doing at that point.
> >
> > The yield in kvm_tdp_mmu_zap_gfn_range makes that phenomenon even
> > worse. Because we can satisfy the need to yield without clearing out
> > the invalid list, we can potentially queue many more pages which will
> > then all need to have their zaps committed at once. This is an
> > admittedly contrived case which could only be hit in a high load
> > nested scenario.
> >
> > It could be fixed by forbidding kvm_tdp_mmu_zap_gfn_range from
> > yielding. Since we should only need to zap one SPTE, the yield should
> > not be needed within the kvm_tdp_mmu_zap_gfn_range call. To ensure
> > that only one SPTE is zapped we would have to specify the root though.
> > Otherwise we could end up zapping all the entries for the same GFN
> > range under an unrelated root.
>
> Hmm, I originally did exactly that, but changed my mind because this zaps far
> more than 1 SPTE.  This is zapping a SP that could be huge, but is not, which
> means it's guaranteed to have a non-zero number of child SPTEs.  The worst case
> scenario is that SP is a PUD (potential 1gb page) and the leafs are 4k SPTEs.

It's true that there are potentially 512^2 child sptes, but the code
to clear those after the single PUD spte is cleared doesn't yield
anyway. If the TDP MMU is only  operating with one root (as we would
expect in most cases), there should only be one chance for it to
yield.

I've considered how we could allow the recursive changed spte handlers
to yield, but it gets complicated quite fast because the caller needs
to know if it yielded and reset the TDP iterator to the root, and
there are some cases (mmu notifiers + vCPU path) where yielding is not
desirable.

>
> But, I didn't consider the interplay between invalid_list and the TDP MMU
> yielding.  Hrm.


* Re: [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping
  2021-03-23 16:26       ` Ben Gardon
@ 2021-03-23 18:58         ` Sean Christopherson
  2021-03-23 20:34           ` Ben Gardon
  0 siblings, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2021-03-23 18:58 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On Tue, Mar 23, 2021, Ben Gardon wrote:
> On Mon, Mar 22, 2021 at 5:15 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Mar 22, 2021, Ben Gardon wrote:
> > > It could be fixed by forbidding kvm_tdp_mmu_zap_gfn_range from
> > > yielding. Since we should only need to zap one SPTE, the yield should
> > > not be needed within the kvm_tdp_mmu_zap_gfn_range call. To ensure
> > > that only one SPTE is zapped we would have to specify the root though.
> > > Otherwise we could end up zapping all the entries for the same GFN
> > > range under an unrelated root.
> >
> > Hmm, I originally did exactly that, but changed my mind because this zaps far
> > more than 1 SPTE.  This is zapping a SP that could be huge, but is not, which
> > means it's guaranteed to have a non-zero number of child SPTEs.  The worst case
> > scenario is that SP is a PUD (potential 1gb page) and the leafs are 4k SPTEs.
> 
> It's true that there are potentially 512^2 child sptes, but the code
> to clear those after the single PUD spte is cleared doesn't yield
> anyway. If the TDP MMU is only  operating with one root (as we would
> expect in most cases), there should only be one chance for it to
> yield.

Ah, right, I was thinking all the iterative flows yielded.  Disallowing
kvm_tdp_mmu_zap_gfn_range() from yielding in this case does seem like the best
fix.  Any objection to me sending v2 with that?

> I've considered how we could allow the recursive changed spte handlers
> to yield, but it gets complicated quite fast because the caller needs
> to know if it yielded and reset the TDP iterator to the root, and
> there are some cases (mmu notifiers + vCPU path) where yielding is not
> desirable.

Urgh, yeah, seems like we'd quickly end up with a mess resembling the legacy MMU
iterators.

> >
> > But, I didn't consider the interplay between invalid_list and the TDP MMU
> > yielding.  Hrm.


* Re: [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping
  2021-03-23 18:58         ` Sean Christopherson
@ 2021-03-23 20:34           ` Ben Gardon
  2021-03-25 19:15             ` Sean Christopherson
  0 siblings, 1 reply; 10+ messages in thread
From: Ben Gardon @ 2021-03-23 20:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On Tue, Mar 23, 2021 at 11:58 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Mar 23, 2021, Ben Gardon wrote:
> > On Mon, Mar 22, 2021 at 5:15 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Mon, Mar 22, 2021, Ben Gardon wrote:
> > > > It could be fixed by forbidding kvm_tdp_mmu_zap_gfn_range from
> > > > yielding. Since we should only need to zap one SPTE, the yield should
> > > > not be needed within the kvm_tdp_mmu_zap_gfn_range call. To ensure
> > > > that only one SPTE is zapped we would have to specify the root though.
> > > > Otherwise we could end up zapping all the entries for the same GFN
> > > > range under an unrelated root.
> > >
> > > Hmm, I originally did exactly that, but changed my mind because this zaps far
> > > more than 1 SPTE.  This is zapping a SP that could be huge, but is not, which
> > > means it's guaranteed to have a non-zero number of child SPTEs.  The worst case
> > > scenario is that SP is a PUD (potential 1gb page) and the leafs are 4k SPTEs.
> >
> > It's true that there are potentially 512^2 child sptes, but the code
> > to clear those after the single PUD spte is cleared doesn't yield
> > anyway. If the TDP MMU is only  operating with one root (as we would
> > expect in most cases), there should only be one chance for it to
> > yield.
>
> Ah, right, I was thinking all the iterative flows yielded.  Disallowing
> kvm_tdp_mmu_zap_gfn_range() from yielding in this case does seem like the best
> fix.  Any objection to me sending v2 with that?

That sounds good to me.

>
> > I've considered how we could allow the recursive changed spte handlers
> > to yield, but it gets complicated quite fast because the caller needs
> > to know if it yielded and reset the TDP iterator to the root, and
> > there are some cases (mmu notifiers + vCPU path) where yielding is not
> > desirable.
>
> Urgh, yeah, seems like we'd quickly end up with a mess resembling the legacy MMU
> iterators.
>
> > >
> > > But, I didn't consider the interplay between invalid_list and the TDP MMU
> > > yielding.  Hrm.


* Re: [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping
  2021-03-23 20:34           ` Ben Gardon
@ 2021-03-25 19:15             ` Sean Christopherson
  0 siblings, 0 replies; 10+ messages in thread
From: Sean Christopherson @ 2021-03-25 19:15 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On Tue, Mar 23, 2021, Ben Gardon wrote:
> On Tue, Mar 23, 2021 at 11:58 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Mar 23, 2021, Ben Gardon wrote:
> > > On Mon, Mar 22, 2021 at 5:15 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Mon, Mar 22, 2021, Ben Gardon wrote:
> > > > > It could be fixed by forbidding kvm_tdp_mmu_zap_gfn_range from
> > > > > yielding. Since we should only need to zap one SPTE, the yield should
> > > > > not be needed within the kvm_tdp_mmu_zap_gfn_range call. To ensure
> > > > > that only one SPTE is zapped we would have to specify the root though.
> > > > > Otherwise we could end up zapping all the entries for the same GFN
> > > > > range under an unrelated root.
> > > >
> > > > Hmm, I originally did exactly that, but changed my mind because this zaps far
> > > > more than 1 SPTE.  This is zapping a SP that could be huge, but is not, which
> > > > means it's guaranteed to have a non-zero number of child SPTEs.  The worst case
> > > > scenario is that SP is a PUD (potential 1gb page) and the leafs are 4k SPTEs.
> > >
> > > It's true that there are potentially 512^2 child sptes, but the code
> > > to clear those after the single PUD spte is cleared doesn't yield
> > > anyway. If the TDP MMU is only  operating with one root (as we would
> > > expect in most cases), there should only be one chance for it to
> > > yield.
> >
> > Ah, right, I was thinking all the iterative flows yielded.  Disallowing
> > kvm_tdp_mmu_zap_gfn_range() from yielding in this case does seem like the best
> > fix.  Any objection to me sending v2 with that?
> 
> That sounds good to me.

Ewww.  This analysis isn't 100% accurate.  It's actually impossible for
zap_gfn_range() to yield in this case.  Even though it may walk multiple roots
and levels, "yielded_gfn == next_last_level_gfn" will hold true until the iter
attempts to step sideways.  When the iter attempts to step sideways, it will
always do so at the level that matches the zapping level, and so will always
step past "end".  Thus, tdp_root_for_each_pte() will break without ever
yielding.

That being said, I'm still going to send a patch to explicitly prevent this path
from yielding.  Relying on the above is waaaay too subtle and fragile.
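
For reference, the forward-progress guard that the above hinges on looks
roughly like this (paraphrased sketch; the rcu and iterator-restart
bookkeeping in the real tdp_mmu_iter_cond_resched() is elided):

static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
					     struct tdp_iter *iter, bool flush)
{
	/* No forward progress since the last yield => don't yield again. */
	if (iter->next_last_level_gfn == iter->yielded_gfn)
		return false;

	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
		if (flush)
			kvm_flush_remote_tlbs(kvm);
		cond_resched_rwlock_write(&kvm->mmu_lock);
		/* ...restart the walk and record the new yielded_gfn... */
		return true;
	}

	return false;
}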

> > > I've considered how we could allow the recursive changed spte handlers
> > > to yield, but it gets complicated quite fast because the caller needs
> > > to know if it yielded and reset the TDP iterator to the root, and
> > > there are some cases (mmu notifiers + vCPU path) where yielding is not
> > > desirable.
> >
> > Urgh, yeah, seems like we'd quickly end up with a mess resembling the legacy MMU
> > iterators.
> >
> > > >
> > > > But, I didn't consider the interplay between invalid_list and the TDP MMU
> > > > yielding.  Hrm.

