* [PATCH v2 0/3] KVM: x86/mmu: Fix TLB flushing bugs in TDP MMU
@ 2021-03-25 20:01 Sean Christopherson
  2021-03-25 20:01 ` [PATCH v2 1/3] KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap Sean Christopherson
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Sean Christopherson @ 2021-03-25 20:01 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon

Two bug fixes and a cleanup involving the TDP MMU, found by inspection.

Patch 1 fixes a bug where KVM yields, e.g. due to lock contention, without
performing a pending TLB flush that was required from a previous root.

Patch 2 fixes a much more egregious bug where NX huge page recovery fails
to honor pending TDP MMU flushes.

Patch 3 explicitly disallows yielding in the TDP MMU when recovering NX
pages to prevent a bug similar to the one fixed in patch 1 from sneaking in.

v2:
 - Collect a review. [Ben]
 - Disallow yielding instead of feeding "flush" into the TDP MMU. [Ben]
 - Move the yielding logic to a separate patch since it's not strictly a
   bug fix and it's standalone anyway (the flush feedback loop was not).

v1:
 - https://lkml.kernel.org/r/20210319232006.3468382-1-seanjc@google.com

Sean Christopherson (3):
  KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range
    zap
  KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping
  KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages

 arch/x86/kvm/mmu/mmu.c     |  9 +++++----
 arch/x86/kvm/mmu/tdp_mmu.c | 26 ++++++++++++++------------
 arch/x86/kvm/mmu/tdp_mmu.h | 23 ++++++++++++++++++++++-
 3 files changed, 41 insertions(+), 17 deletions(-)

-- 
2.31.0.291.g576ba9dcdaf-goog



* [PATCH v2 1/3] KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap
  2021-03-25 20:01 [PATCH v2 0/3] KVM: x86/mmu: Fix TLB flushing bugs in TDP MMU Sean Christopherson
@ 2021-03-25 20:01 ` Sean Christopherson
  2021-03-25 20:01 ` [PATCH v2 2/3] KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping Sean Christopherson
  2021-03-25 20:01 ` [PATCH v2 3/3] KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages Sean Christopherson
  2 siblings, 0 replies; 11+ messages in thread
From: Sean Christopherson @ 2021-03-25 20:01 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon

When flushing a range of GFNs across multiple roots, ensure any pending
flush from a previous root is honored before yielding while walking the
tables of the current root.

Note, kvm_tdp_mmu_zap_gfn_range() now intentionally overwrites its local
"flush" with the result to avoid redundant flushes.  zap_gfn_range()
preserves and returns the incoming "flush", unless of course the flush was
performed prior to yielding and no new flush was triggered.
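
For reference, a simplified sketch of tdp_mmu_iter_cond_resched() as it
looks at the base of this series (approximate; the iterator-restart
details are elided):

  static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
                                               struct tdp_iter *iter,
                                               bool flush)
  {
          /* Yield only if the walk has made forward progress. */
          if (iter->next_last_level_gfn == iter->yielded_gfn)
                  return false;

          if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
                  rcu_read_unlock();

                  /* Honor a pending flush before dropping mmu_lock. */
                  if (flush)
                          kvm_flush_remote_tlbs(kvm);

                  cond_resched_rwlock_write(&kvm->mmu_lock);
                  rcu_read_lock();

                  /* ... restart the walk at iter->next_last_level_gfn ... */

                  return true;
          }

          return false;
  }

Feeding the caller's "flush" into the helper is what guarantees a pending
flush from a previous root is performed before mmu_lock is dropped.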

Fixes: 1af4a96025b3 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
Cc: stable@vger.kernel.org
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f0c99fa04ef2..6cf08c3c537f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -86,7 +86,7 @@ static inline struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 	list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link)
 
 static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
-			  gfn_t start, gfn_t end, bool can_yield);
+			  gfn_t start, gfn_t end, bool can_yield, bool flush);
 
 void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
 {
@@ -99,7 +99,7 @@ void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
 
 	list_del(&root->link);
 
-	zap_gfn_range(kvm, root, 0, max_gfn, false);
+	zap_gfn_range(kvm, root, 0, max_gfn, false, false);
 
 	free_page((unsigned long)root->spt);
 	kmem_cache_free(mmu_page_header_cache, root);
@@ -664,20 +664,21 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
  * scheduler needs the CPU or there is contention on the MMU lock. If this
  * function cannot yield, it will not release the MMU lock or reschedule and
  * the caller must ensure it does not supply too large a GFN range, or the
- * operation can cause a soft lockup.
+ * operation can cause a soft lockup.  Note, in some use cases a flush may be
+ * required by prior actions.  Ensure the pending flush is performed prior to
+ * yielding.
  */
 static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
-			  gfn_t start, gfn_t end, bool can_yield)
+			  gfn_t start, gfn_t end, bool can_yield, bool flush)
 {
 	struct tdp_iter iter;
-	bool flush_needed = false;
 
 	rcu_read_lock();
 
 	tdp_root_for_each_pte(iter, root, start, end) {
 		if (can_yield &&
-		    tdp_mmu_iter_cond_resched(kvm, &iter, flush_needed)) {
-			flush_needed = false;
+		    tdp_mmu_iter_cond_resched(kvm, &iter, flush)) {
+			flush = false;
 			continue;
 		}
 
@@ -695,11 +696,11 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			continue;
 
 		tdp_mmu_set_spte(kvm, &iter, 0);
-		flush_needed = true;
+		flush = true;
 	}
 
 	rcu_read_unlock();
-	return flush_needed;
+	return flush;
 }
 
 /*
@@ -714,7 +715,7 @@ bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end)
 	bool flush = false;
 
 	for_each_tdp_mmu_root_yield_safe(kvm, root)
-		flush |= zap_gfn_range(kvm, root, start, end, true);
+		flush = zap_gfn_range(kvm, root, start, end, true, flush);
 
 	return flush;
 }
@@ -931,7 +932,7 @@ static int zap_gfn_range_hva_wrapper(struct kvm *kvm,
 				     struct kvm_mmu_page *root, gfn_t start,
 				     gfn_t end, unsigned long unused)
 {
-	return zap_gfn_range(kvm, root, start, end, false);
+	return zap_gfn_range(kvm, root, start, end, false, false);
 }
 
 int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
-- 
2.31.0.291.g576ba9dcdaf-goog



* [PATCH v2 2/3] KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping
  2021-03-25 20:01 [PATCH v2 0/3] KVM: x86/mmu: Fix TLB flushing bugs in TDP MMU Sean Christopherson
  2021-03-25 20:01 ` [PATCH v2 1/3] KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap Sean Christopherson
@ 2021-03-25 20:01 ` Sean Christopherson
  2021-03-25 21:47   ` Ben Gardon
  2021-03-25 20:01 ` [PATCH v2 3/3] KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages Sean Christopherson
  2 siblings, 1 reply; 11+ messages in thread
From: Sean Christopherson @ 2021-03-25 20:01 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon

Honor the "flush needed" return from kvm_tdp_mmu_zap_gfn_range(), which
does the flush itself if and only if it yields (which it will never do in
this particular scenario), and otherwise expects the caller to do the
flush.  If pages are zapped from the TDP MMU but not the legacy MMU, then
no flush will occur.
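
For context, kvm_mmu_remote_flush_or_zap(), which the diff below uses in
place of the unconditional kvm_mmu_commit_zap_page(), is roughly:

  static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
                                          struct list_head *invalid_list,
                                          bool remote_flush)
  {
          if (!remote_flush && list_empty(invalid_list))
                  return false;

          if (!list_empty(invalid_list))
                  kvm_mmu_commit_zap_page(kvm, invalid_list);
          else
                  kvm_flush_remote_tlbs(kvm);
          return true;
  }

i.e. it flushes remote TLBs even when the legacy MMU has nothing to
commit, which is exactly the case when only TDP MMU pages were zapped.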

Fixes: 29cf0f5007a2 ("kvm: x86/mmu: NX largepage recovery for TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c6ed633594a2..5a53743b37bc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5939,6 +5939,8 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
 	struct kvm_mmu_page *sp;
 	unsigned int ratio;
 	LIST_HEAD(invalid_list);
+	bool flush = false;
+	gfn_t gfn_end;
 	ulong to_zap;
 
 	rcu_idx = srcu_read_lock(&kvm->srcu);
@@ -5960,19 +5962,20 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
 				      lpage_disallowed_link);
 		WARN_ON_ONCE(!sp->lpage_disallowed);
 		if (is_tdp_mmu_page(sp)) {
-			kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn,
-				sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level));
+			gfn_end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level);
+			flush = kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, gfn_end);
 		} else {
 			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
 			WARN_ON_ONCE(sp->lpage_disallowed);
 		}
 
 		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
-			kvm_mmu_commit_zap_page(kvm, &invalid_list);
+			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
 			cond_resched_rwlock_write(&kvm->mmu_lock);
+			flush = false;
 		}
 	}
-	kvm_mmu_commit_zap_page(kvm, &invalid_list);
+	kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
 
 	write_unlock(&kvm->mmu_lock);
 	srcu_read_unlock(&kvm->srcu, rcu_idx);
-- 
2.31.0.291.g576ba9dcdaf-goog



* [PATCH v2 3/3] KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages
  2021-03-25 20:01 [PATCH v2 0/3] KVM: x86/mmu: Fix TLB flushing bugs in TDP MMU Sean Christopherson
  2021-03-25 20:01 ` [PATCH v2 1/3] KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap Sean Christopherson
  2021-03-25 20:01 ` [PATCH v2 2/3] KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping Sean Christopherson
@ 2021-03-25 20:01 ` Sean Christopherson
  2021-03-25 21:46   ` Ben Gardon
  2 siblings, 1 reply; 11+ messages in thread
From: Sean Christopherson @ 2021-03-25 20:01 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon

Prevent the TDP MMU from yielding when zapping a gfn range during NX
page recovery.  If a flush is pending from a previous invocation of the
zapping helper, either in the TDP MMU or the legacy MMU, but the TDP MMU
has not accumulated a flush for the current invocation, then yielding
will release mmu_lock with stale TLB entries.

That being said, this isn't technically a bug fix in the current code, as
the TDP MMU will never yield in this case.  tdp_mmu_iter_cond_resched()
will yield if and only if it has made forward progress, as defined by the
current gfn vs. the last yielded (or starting) gfn.  Because zapping a
single shadow page is guaranteed to (a) find that page and (b) step
sideways at the level of the shadow page, the TDP iter will break its loop
before getting a chance to yield.
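
To illustrate the bounds (a rough sketch using only identifiers from the
patch below):

  /*
   * start = sp->gfn;
   * end   = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level);
   *
   * A "step sideways" at the shadow page's level advances the iterator
   * by KVM_PAGES_PER_HPAGE(sp->role.level) gfns, i.e. from "start" to
   * "end", so the walk terminates before tdp_mmu_iter_cond_resched()
   * ever sees an iteration that has made forward progress.
   */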

But that is all very, very subtle, and will break at the slightest sneeze,
e.g. zapping while holding mmu_lock for read would break as the TDP MMU
wouldn't be guaranteed to see the present shadow page, and thus could step
sideways at a lower level.

Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c     |  4 +---
 arch/x86/kvm/mmu/tdp_mmu.c |  5 +++--
 arch/x86/kvm/mmu/tdp_mmu.h | 23 ++++++++++++++++++++++-
 3 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5a53743b37bc..7a99e59c8c1c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5940,7 +5940,6 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
 	unsigned int ratio;
 	LIST_HEAD(invalid_list);
 	bool flush = false;
-	gfn_t gfn_end;
 	ulong to_zap;
 
 	rcu_idx = srcu_read_lock(&kvm->srcu);
@@ -5962,8 +5961,7 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
 				      lpage_disallowed_link);
 		WARN_ON_ONCE(!sp->lpage_disallowed);
 		if (is_tdp_mmu_page(sp)) {
-			gfn_end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level);
-			flush = kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, gfn_end);
+			flush = kvm_tdp_mmu_zap_sp(kvm, sp);
 		} else {
 			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
 			WARN_ON_ONCE(sp->lpage_disallowed);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6cf08c3c537f..08667e3cf091 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -709,13 +709,14 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
  * SPTEs have been cleared and a TLB flush is needed before releasing the
  * MMU lock.
  */
-bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end)
+bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end,
+				 bool can_yield)
 {
 	struct kvm_mmu_page *root;
 	bool flush = false;
 
 	for_each_tdp_mmu_root_yield_safe(kvm, root)
-		flush = zap_gfn_range(kvm, root, start, end, true, flush);
+		flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
 
 	return flush;
 }
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 3b761c111bff..715aa4e0196d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -8,7 +8,28 @@
 hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
 void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root);
 
-bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end);
+bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end,
+				 bool can_yield);
+static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start,
+					     gfn_t end)
+{
+	return __kvm_tdp_mmu_zap_gfn_range(kvm, start, end, true);
+}
+static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	gfn_t end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level);
+
+	/*
+	 * Don't allow yielding, as the caller may have a flush pending.  Note,
+	 * if mmu_lock is held for write, zapping will never yield in this case,
+	 * but explicitly disallow it for safety.  The TDP MMU does not yield
+	 * until it has made forward progress (steps sideways), and when zapping
+	 * a single shadow page that it's guaranteed to see (thus the mmu_lock
+	 * requirement), its "step sideways" will always step beyond the bounds
+	 * of the shadow page's gfn range and stop iterating before yielding.
+	 */
+	return __kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, end, false);
+}
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
-- 
2.31.0.291.g576ba9dcdaf-goog



* Re: [PATCH v2 3/3] KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages
  2021-03-25 20:01 ` [PATCH v2 3/3] KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages Sean Christopherson
@ 2021-03-25 21:46   ` Ben Gardon
  2021-03-25 22:25     ` Sean Christopherson
  0 siblings, 1 reply; 11+ messages in thread
From: Ben Gardon @ 2021-03-25 21:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On Thu, Mar 25, 2021 at 1:01 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Prevent the TDP MMU from yielding when zapping a gfn range during NX
> page recovery.  If a flush is pending from a previous invocation of the
> zapping helper, either in the TDP MMU or the legacy MMU, but the TDP MMU
> has not accumulated a flush for the current invocation, then yielding
> will release mmu_lock with stale TLB entriesr

Extra r here.

>
> That being said, this isn't technically a bug fix in the current code, as
> the TDP MMU will never yield in this case.  tdp_mmu_iter_cond_resched()
> will yield if and only if it has made forward progress, as defined by the
> current gfn vs. the last yielded (or starting) gfn.  Because zapping a
> single shadow page is guaranteed to (a) find that page and (b) step
> sideways at the level of the shadow page, the TDP iter will break its loop
> before getting a chance to yield.
>
> But that is all very, very subtle, and will break at the slightest sneeze,
> e.g. zapping while holding mmu_lock for read would break as the TDP MMU
> wouldn't be guaranteed to see the present shadow page, and thus could step
> sideways at a lower level.
>
> Cc: Ben Gardon <bgardon@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     |  4 +---
>  arch/x86/kvm/mmu/tdp_mmu.c |  5 +++--
>  arch/x86/kvm/mmu/tdp_mmu.h | 23 ++++++++++++++++++++++-
>  3 files changed, 26 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 5a53743b37bc..7a99e59c8c1c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5940,7 +5940,6 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
>         unsigned int ratio;
>         LIST_HEAD(invalid_list);
>         bool flush = false;
> -       gfn_t gfn_end;
>         ulong to_zap;
>
>         rcu_idx = srcu_read_lock(&kvm->srcu);
> @@ -5962,8 +5961,7 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
>                                       lpage_disallowed_link);
>                 WARN_ON_ONCE(!sp->lpage_disallowed);
>                 if (is_tdp_mmu_page(sp)) {
> -                       gfn_end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level);
> -                       flush = kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, gfn_end);
> +                       flush = kvm_tdp_mmu_zap_sp(kvm, sp);
>                 } else {
>                         kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
>                         WARN_ON_ONCE(sp->lpage_disallowed);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 6cf08c3c537f..08667e3cf091 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -709,13 +709,14 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>   * SPTEs have been cleared and a TLB flush is needed before releasing the
>   * MMU lock.
>   */
> -bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end,
> +                                bool can_yield)
>  {
>         struct kvm_mmu_page *root;
>         bool flush = false;
>
>         for_each_tdp_mmu_root_yield_safe(kvm, root)
> -               flush = zap_gfn_range(kvm, root, start, end, true, flush);
> +               flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
>
>         return flush;
>  }
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 3b761c111bff..715aa4e0196d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -8,7 +8,28 @@
>  hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
>  void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root);
>
> -bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end);
> +bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end,
> +                                bool can_yield);
> +static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start,
> +                                            gfn_t end)
> +{
> +       return __kvm_tdp_mmu_zap_gfn_range(kvm, start, end, true);
> +}
> +static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)

I'm a little leary of adding an interface which takes a non-root
struct kvm_mmu_page as an argument to the TDP MMU.
In the TDP MMU, the struct kvm_mmu_pages are protected rather subtly.
I agree this is safe because we hold the MMU lock in write mode here,
but if we ever wanted to convert to holding it in read mode things
could get complicated fast.
Maybe this is more of a concern if the function started to be used
elsewhere since NX recovery is already so dependent on the write lock.
Ideally though, NX reclaim could use MMU read lock +
tdp_mmu_pages_lock to protect the list and do reclaim in parallel with
everything else.
The nice thing about drawing the TDP MMU interface in terms of GFNs
and address space IDs instead of SPs is that it doesn't put
constraints on the implementation of the TDP MMU because those GFNs
are always going to be valid / don't require any shared memory.
This is kind of innocuous because it's immediately converted into that
gfn interface, so I don't know how much it really matters.

In any case this change looks correct and I don't want to hold up
progress with bikeshedding.
WDYT?

> +{
> +       gfn_t end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level);
> +
> +       /*
> +        * Don't allow yielding, as the caller may have a flush pending.  Note,
> +        * if mmu_lock is held for write, zapping will never yield in this case,
> +        * but explicitly disallow it for safety.  The TDP MMU does not yield
> +        * until it has made forward progress (steps sideways), and when zapping
> +        * a single shadow page that it's guaranteed to see (thus the mmu_lock
> +        * requirement), its "step sideways" will always step beyond the bounds
> +        * of the shadow page's gfn range and stop iterating before yielding.
> +        */
> +       return __kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, end, false);
> +}
>  void kvm_tdp_mmu_zap_all(struct kvm *kvm);
>
>  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> --
> 2.31.0.291.g576ba9dcdaf-goog
>


* Re: [PATCH v2 2/3] KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping
  2021-03-25 20:01 ` [PATCH v2 2/3] KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping Sean Christopherson
@ 2021-03-25 21:47   ` Ben Gardon
  0 siblings, 0 replies; 11+ messages in thread
From: Ben Gardon @ 2021-03-25 21:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On Thu, Mar 25, 2021 at 1:01 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Honor the "flush needed" return from kvm_tdp_mmu_zap_gfn_range(), which
> does the flush itself if and only if it yields (which it will never do in
> this particular scenario), and otherwise expects the caller to do the
> flush.  If pages are zapped from the TDP MMU but not the legacy MMU, then
> no flush will occur.
>
> Fixes: 29cf0f5007a2 ("kvm: x86/mmu: NX largepage recovery for TDP MMU")
> Cc: stable@vger.kernel.org
> Cc: Ben Gardon <bgardon@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Ben Gardon <bgardon@google.com>

> ---
>  arch/x86/kvm/mmu/mmu.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c6ed633594a2..5a53743b37bc 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5939,6 +5939,8 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
>         struct kvm_mmu_page *sp;
>         unsigned int ratio;
>         LIST_HEAD(invalid_list);
> +       bool flush = false;
> +       gfn_t gfn_end;
>         ulong to_zap;
>
>         rcu_idx = srcu_read_lock(&kvm->srcu);
> @@ -5960,19 +5962,20 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
>                                       lpage_disallowed_link);
>                 WARN_ON_ONCE(!sp->lpage_disallowed);
>                 if (is_tdp_mmu_page(sp)) {
> -                       kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn,
> -                               sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level));
> +                       gfn_end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level);
> +                       flush = kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, gfn_end);
>                 } else {
>                         kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
>                         WARN_ON_ONCE(sp->lpage_disallowed);
>                 }
>
>                 if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> -                       kvm_mmu_commit_zap_page(kvm, &invalid_list);
> +                       kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
>                         cond_resched_rwlock_write(&kvm->mmu_lock);
> +                       flush = false;
>                 }
>         }
> -       kvm_mmu_commit_zap_page(kvm, &invalid_list);
> +       kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
>
>         write_unlock(&kvm->mmu_lock);
>         srcu_read_unlock(&kvm->srcu, rcu_idx);
> --
> 2.31.0.291.g576ba9dcdaf-goog
>


* Re: [PATCH v2 3/3] KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages
  2021-03-25 21:46   ` Ben Gardon
@ 2021-03-25 22:25     ` Sean Christopherson
  2021-03-25 22:45       ` Ben Gardon
                         ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Sean Christopherson @ 2021-03-25 22:25 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On Thu, Mar 25, 2021, Ben Gardon wrote:
> On Thu, Mar 25, 2021 at 1:01 PM Sean Christopherson <seanjc@google.com> wrote:
> > +static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start,
> > +                                            gfn_t end)
> > +{
> > +       return __kvm_tdp_mmu_zap_gfn_range(kvm, start, end, true);
> > +}
> > +static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> 
> I'm a little leary of adding an interface which takes a non-root
> struct kvm_mmu_page as an argument to the TDP MMU.
> In the TDP MMU, the struct kvm_mmu_pages are protected rather subtly.
> I agree this is safe because we hold the MMU lock in write mode here,
> but if we ever wanted to convert to holding it in read mode things
> could get complicated fast.
> Maybe this is more of a concern if the function started to be used
> elsewhere since NX recovery is already so dependent on the write lock.

Agreed.  Even writing the comment below felt a bit awkward when thinking about
additional users holding mmu_lock for read.  Actually, I should remove that
specific blurb since zapping currently requires holding mmu_lock for write.

> Ideally though, NX reclaim could use MMU read lock +
> tdp_mmu_pages_lock to protect the list and do reclaim in parallel with
> everything else.

Yar, processing all legacy MMU pages, and then all TDP MMU pages to avoid some
of these dependencies crossed my mind.  But, it's hard to justify effectively
walking the list twice.  And maintaining two lists might lead to balancing
issues, e.g. the legacy MMU and thus nested VMs get zapped more often than the
TDP MMU, or vice versa.

> The nice thing about drawing the TDP MMU interface in terms of GFNs
> and address space IDs instead of SPs is that it doesn't put
> constraints on the implementation of the TDP MMU because those GFNs
> are always going to be valid / don't require any shared memory.
> This is kind of innocuous because it's immediately converted into that
> gfn interface, so I don't know how much it really matters.
> 
> In any case this change looks correct and I don't want to hold up
> progress with bikeshedding.
> WDYT?

I think we're kind of hosed either way.  Either we add a helper in the TDP MMU
that takes a SP, or we bleed a lot of information about the details of TDP MMU
into the common MMU.  E.g. the function could be open-coded verbatim, but the
whole comment below, and the motivation for not feeding in flush is very
dependent on the internal details of TDP MMU.

I don't have a super strong preference.  One thought would be to assert that
mmu_lock is held for write, and then it largely becomes a future person's problem :-)

> > +{
> > +       gfn_t end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level);
> > +
> > +       /*
> > +        * Don't allow yielding, as the caller may have a flush pending.  Note,
> > +        * if mmu_lock is held for write, zapping will never yield in this case,
> > +        * but explicitly disallow it for safety.  The TDP MMU does not yield
> > +        * until it has made forward progress (steps sideways), and when zapping
> > +        * a single shadow page that it's guaranteed to see (thus the mmu_lock
> > +        * requirement), its "step sideways" will always step beyond the bounds
> > +        * of the shadow page's gfn range and stop iterating before yielding.
> > +        */
> > +       return __kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, end, false);
> > +}
> >  void kvm_tdp_mmu_zap_all(struct kvm *kvm);
> >
> >  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> > --
> > 2.31.0.291.g576ba9dcdaf-goog
> >


* Re: [PATCH v2 3/3] KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages
  2021-03-25 22:25     ` Sean Christopherson
@ 2021-03-25 22:45       ` Ben Gardon
  2021-03-26 17:11         ` Paolo Bonzini
  2021-03-26 17:12       ` Paolo Bonzini
  2021-03-30 17:18       ` Paolo Bonzini
  2 siblings, 1 reply; 11+ messages in thread
From: Ben Gardon @ 2021-03-25 22:45 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On Thu, Mar 25, 2021 at 3:25 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Mar 25, 2021, Ben Gardon wrote:
> > On Thu, Mar 25, 2021 at 1:01 PM Sean Christopherson <seanjc@google.com> wrote:
> > > +static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start,
> > > +                                            gfn_t end)
> > > +{
> > > +       return __kvm_tdp_mmu_zap_gfn_range(kvm, start, end, true);
> > > +}
> > > +static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> >
> > I'm a little leary of adding an interface which takes a non-root
> > struct kvm_mmu_page as an argument to the TDP MMU.
> > In the TDP MMU, the struct kvm_mmu_pages are protected rather subtly.
> > I agree this is safe because we hold the MMU lock in write mode here,
> > but if we ever wanted to convert to holding it in read mode things
> > could get complicated fast.
> > Maybe this is more of a concern if the function started to be used
> > elsewhere since NX recovery is already so dependent on the write lock.
>
> Agreed.  Even writing the comment below felt a bit awkward when thinking about
> additional users holding mmu_lock for read.  Actually, I should remove that
> specific blurb since zapping currently requires holding mmu_lock for write.
>
> > Ideally though, NX reclaim could use MMU read lock +
> > tdp_mmu_pages_lock to protect the list and do reclaim in parallel with
> > everything else.
>
> Yar, processing all legacy MMU pages, and then all TDP MMU pages to avoid some
> of these dependencies crossed my mind.  But, it's hard to justify effectively
> walking the list twice.  And maintaining two lists might lead to balancing
> issues, e.g. the legacy MMU and thus nested VMs get zapped more often than the
> TDP MMU, or vice versa.

I think in an earlier version of the TDP that I sent out, NX reclaim
was a separate thread for the two MMUs, sidestepping the balance
issue.
I think the TDP MMU also had a separate NX reclaim list.
That would also make it easier to do something under the read lock.

>
> > The nice thing about drawing the TDP MMU interface in terms of GFNs
> > and address space IDs instead of SPs is that it doesn't put
> > constraints on the implementation of the TDP MMU because those GFNs
> > are always going to be valid / don't require any shared memory.
> > This is kind of innocuous because it's immediately converted into that
> > gfn interface, so I don't know how much it really matters.
> >
> > In any case this change looks correct and I don't want to hold up
> > progress with bikeshedding.
> > WDYT?
>
> I think we're kind of hosed either way.  Either we add a helper in the TDP MMU
> that takes a SP, or we bleed a lot of information about the details of TDP MMU
> into the common MMU.  E.g. the function could be open-coded verbatim, but the
> whole comment below, and the motivation for not feeding in flush is very
> dependent on the internal details of TDP MMU.
>
> I don't have a super strong preference.  One thought would be to assert that
> mmu_lock is held for write, and then it largely becomes a future person's problem :-)

Yeah, I agree and I'm happy to kick this proverbial can down the road
until we actually add an NX reclaim implementation that uses the MMU
read lock.

>
> > > +{
> > > +       gfn_t end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level);
> > > +
> > > +       /*
> > > +        * Don't allow yielding, as the caller may have a flush pending.  Note,
> > > +        * if mmu_lock is held for write, zapping will never yield in this case,
> > > +        * but explicitly disallow it for safety.  The TDP MMU does not yield
> > > +        * until it has made forward progress (steps sideways), and when zapping
> > > +        * a single shadow page that it's guaranteed to see (thus the mmu_lock
> > > +        * requirement), its "step sideways" will always step beyond the bounds
> > > +        * of the shadow page's gfn range and stop iterating before yielding.
> > > +        */
> > > +       return __kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, end, false);
> > > +}
> > >  void kvm_tdp_mmu_zap_all(struct kvm *kvm);
> > >
> > >  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> > > --
> > > 2.31.0.291.g576ba9dcdaf-goog
> > >


* Re: [PATCH v2 3/3] KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages
  2021-03-25 22:45       ` Ben Gardon
@ 2021-03-26 17:11         ` Paolo Bonzini
  0 siblings, 0 replies; 11+ messages in thread
From: Paolo Bonzini @ 2021-03-26 17:11 UTC (permalink / raw)
  To: Ben Gardon, Sean Christopherson
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, LKML

On 25/03/21 23:45, Ben Gardon wrote:
> I think in an earlier version of the TDP that I sent out, NX reclaim
> was a separate thread for the two MMUs, sidestepping the balance
> issue.
> I think the TDP MMU also had a separate NX reclaim list.
> That would also make it easier to do something under the read lock.

Yes, that was my suggestion actually; I preferred to keep things simple
because most of the time there would be only TDP MMU pages.

Paolo



* Re: [PATCH v2 3/3] KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages
  2021-03-25 22:25     ` Sean Christopherson
  2021-03-25 22:45       ` Ben Gardon
@ 2021-03-26 17:12       ` Paolo Bonzini
  2021-03-30 17:18       ` Paolo Bonzini
  2 siblings, 0 replies; 11+ messages in thread
From: Paolo Bonzini @ 2021-03-26 17:12 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, LKML

On 25/03/21 23:25, Sean Christopherson wrote:
> I don't have a super strong preference.  One thought would be to
> assert that mmu_lock is held for write, and then it largely becomes
> a future person's problem :-)

Well that is what I was going to suggest.  Let's keep things as simple 
as possible for the TDP MMU and build up slowly.

Paolo



* Re: [PATCH v2 3/3] KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages
  2021-03-25 22:25     ` Sean Christopherson
  2021-03-25 22:45       ` Ben Gardon
  2021-03-26 17:12       ` Paolo Bonzini
@ 2021-03-30 17:18       ` Paolo Bonzini
  2 siblings, 0 replies; 11+ messages in thread
From: Paolo Bonzini @ 2021-03-30 17:18 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, LKML

On 25/03/21 23:25, Sean Christopherson wrote:
> On Thu, Mar 25, 2021, Ben Gardon wrote:
>> On Thu, Mar 25, 2021 at 1:01 PM Sean Christopherson <seanjc@google.com> wrote:
>>> +static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start,
>>> +                                            gfn_t end)
>>> +{
>>> +       return __kvm_tdp_mmu_zap_gfn_range(kvm, start, end, true);
>>> +}
>>> +static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>>
>> I'm a little leary of adding an interface which takes a non-root
>> struct kvm_mmu_page as an argument to the TDP MMU.
>> In the TDP MMU, the struct kvm_mmu_pages are protected rather subtly.
>> I agree this is safe because we hold the MMU lock in write mode here,
>> but if we ever wanted to convert to holding it in read mode things
>> could get complicated fast.
>> Maybe this is more of a concern if the function started to be used
>> elsewhere since NX recovery is already so dependent on the write lock.
> 
> Agreed.  Even writing the comment below felt a bit awkward when thinking about
> additional users holding mmu_lock for read.  Actually, I should remove that
> specific blurb since zapping currently requires holding mmu_lock for write.
> 
>> Ideally though, NX reclaim could use MMU read lock +
>> tdp_mmu_pages_lock to protect the list and do reclaim in parallel with
>> everything else.
> 
> Yar, processing all legacy MMU pages, and then all TDP MMU pages to avoid some
> of these dependencies crossed my mind.  But, it's hard to justify effectively
> walking the list twice.  And maintaining two lists might lead to balancing
> issues, e.g. the legacy MMU and thus nested VMs get zapped more often than the
> TDP MMU, or vice versa.
> 
>> The nice thing about drawing the TDP MMU interface in terms of GFNs
>> and address space IDs instead of SPs is that it doesn't put
>> constraints on the implementation of the TDP MMU because those GFNs
>> are always going to be valid / don't require any shared memory.
>> This is kind of innocuous because it's immediately converted into that
>> gfn interface, so I don't know how much it really matters.
>>
>> In any case this change looks correct and I don't want to hold up
>> progress with bikeshedding.
>> WDYT?
> 
> I think we're kind of hosed either way.  Either we add a helper in the TDP MMU
> that takes a SP, or we bleed a lot of information about the details of TDP MMU
> into the common MMU.  E.g. the function could be open-coded verbatim, but the
> whole comment below, and the motivation for not feeding in flush is very
> dependent on the internal details of TDP MMU.
> 
> I don't have a super strong preference.  One thought would be to assert that
> mmu_lock is held for write, and then it largely come future person's problem :-)

Queued all three, with lockdep_assert_held_write here.

Paolo


