* [PATCH 0/2] optimize spte zapping in zap_gfn_range()
@ 2021-11-24 21:44 Mingwei Zhang
  2021-11-24 21:44 ` [PATCH 1/2] Revert "KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs" Mingwei Zhang
  2021-11-24 21:44 ` [PATCH 2/2] KVM: mmu/x86: optimize zapping by retaining non-leaf SPTEs and avoid rcu stall Mingwei Zhang
  0 siblings, 2 replies; 7+ messages in thread
From: Mingwei Zhang @ 2021-11-24 21:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon, David Matlack,
	Mingwei Zhang

The TDP MMU SPTE zapping process currently uses two levels of iteration.
The first level is the for loop within zap_gfn_range(), which walks the
target GFN range and narrows down exactly what to zap. The second level
starts in tdp_mmu_set_spte{,_atomic}(), which tears down the whole paging
structure (leaf and non-leaf SPTEs) within the range. The former iteration
is yield safe, while the latter is not.
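
Roughly, the two levels look like this (a condensed sketch of the existing
zap_gfn_range() code with flush handling elided; see the diffs below for
the real thing):

	/* Outer iteration: a yield-safe walk over the requested GFN range. */
	tdp_root_for_each_pte(iter, root, start, end) {
		if (can_yield &&
		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared))
			continue;
		...
		/*
		 * Inner teardown: zapping a present non-leaf SPTE here
		 * recursively frees the whole subtree beneath it and does
		 * not yield.
		 */
		tdp_mmu_set_spte(kvm, &iter, 0);
	}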

In many cases, the SPTE zapping process can be optimized, since the
non-leaf SPTEs can most likely be retained for the next allocation. On the
other hand, in large-scale zapping scenarios we may end up zapping so many
SPTEs that the excessive CPU time triggers an RCU stall warning.

The following selftest reproduces the warning:

        (env: kvm.tdp_mmu=Y)
        ./dirty_log_perf_test -v 64 -b 8G

This patch set reverts a previous optimization and adds a helper,
__zap_gfn_range(), to optimize the zapping process.

In particular, it does the following two things:
 - optimize zapping by retaining some non-leaf SPTEs.
 - avoid RCU stall warnings when zapping too many SPTEs.
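
The resulting call structure in zap_gfn_range() (condensed from patch 2):

	/* Phase 1: zap everything at PG_LEVEL_1G and below, yielding as needed. */
	flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush,
				shared, false);

	/* Phase 2, zap_all only: tear down the remaining upper-level structures. */
	if (zap_all)
		flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush,
					shared, true);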

Mingwei Zhang (2):
  Revert "KVM: x86/mmu: Don't step down in the TDP iterator when zapping
    all SPTEs"
  KVM: mmu/x86: optimize zapping by retaining non-leaf SPTEs and avoid
    rcu stall

 arch/x86/kvm/mmu/tdp_mmu.c | 66 +++++++++++++++++++++++---------------
 1 file changed, 41 insertions(+), 25 deletions(-)

--
2.34.0.rc2.393.gf8c9666880-goog



* [PATCH 1/2] Revert "KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs"
  2021-11-24 21:44 [PATCH 0/2] optimize spte zapping in zap_gfn_range() Mingwei Zhang
@ 2021-11-24 21:44 ` Mingwei Zhang
  2021-11-30  0:50   ` David Matlack
  2021-11-24 21:44 ` [PATCH 2/2] KVM: mmu/x86: optimize zapping by retaining non-leaf SPTEs and avoid rcu stall Mingwei Zhang
  1 sibling, 1 reply; 7+ messages in thread
From: Mingwei Zhang @ 2021-11-24 21:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon, David Matlack,
	Mingwei Zhang

Not stepping down in the TDP iterator in the `zap_all` case avoids
re-reading the non-leaf SPTEs and thus accelerates the zapping process. But
when the number of SPTEs is too large, we may run out of CPU time and
trigger RCU stall warnings in __handle_changed_pte() in the context of
zap_gfn_range().

Revert the commit to allow eliminating the RCU stall warnings using
two-phase zapping for the `zap_all` case.

This reverts commit 0103098fb4f13b447b26ed514bcd3140f6791047.

Cc: Sean Christopherson <seanjc@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7c5dd83e52de..89d16bb104de 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -706,12 +706,6 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	bool zap_all = (start == 0 && end >= max_gfn_host);
 	struct tdp_iter iter;
 
-	/*
-	 * No need to try to step down in the iterator when zapping all SPTEs,
-	 * zapping the top-level non-leaf SPTEs will recurse on their children.
-	 */
-	int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
-
 	/*
 	 * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
 	 * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
@@ -723,8 +717,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	rcu_read_lock();
 
-	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
-				   min_level, start, end) {
+	tdp_root_for_each_pte(iter, root, start, end) {
 retry:
 		if (can_yield &&
 		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
-- 
2.34.0.rc2.393.gf8c9666880-goog



* [PATCH 2/2] KVM: mmu/x86: optimize zapping by retaining non-leaf SPTEs and avoid rcu stall
  2021-11-24 21:44 [PATCH 0/2] optimize spte zapping in zap_gfn_range() Mingwei Zhang
  2021-11-24 21:44 ` [PATCH 1/2] Revert "KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs" Mingwei Zhang
@ 2021-11-24 21:44 ` Mingwei Zhang
  2021-11-30  0:48   ` David Matlack
  1 sibling, 1 reply; 7+ messages in thread
From: Mingwei Zhang @ 2021-11-24 21:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon, David Matlack,
	Mingwei Zhang

The TDP MMU SPTE zapping process currently uses two levels of iteration.
The first level is the for loop within zap_gfn_range(), which walks the
target GFN range and narrows down exactly what to zap. The second level
starts in tdp_mmu_set_spte{,_atomic}(), which tears down the whole paging
structure (leaf and non-leaf SPTEs) within the range. The former iteration
is yield safe, while the latter is not.

In many cases, the SPTE zapping process can be optimized, since the
non-leaf SPTEs can most likely be retained for the next allocation. On the
other hand, in large-scale zapping scenarios we may end up zapping so many
SPTEs that the excessive CPU time triggers an RCU stall warning.

The following selftest reproduces the warning:

	(env: kvm.tdp_mmu=Y)
	./dirty_log_perf_test -v 64 -b 8G

Optimize the zapping process by skipping all non-leaf SPTEs above a certain
level in the first iteration. This allows us to control the granularity of
the actual zapping and to invoke tdp_mmu_iter_cond_resched() in time. In
addition, we retain some of the non-leaf SPTEs to accelerate the next
allocation.

For the selection of the `certain level`, we choose PG_LEVEL_1G because it
is currently the largest page size supported and it naturally fits the
scenario of splitting large pages.

For the `zap_all` case (usually at VM teardown time), we use a two-phase
mechanism: the 1st phase zaps all SPTEs at or below PG_LEVEL_1G and the 2nd
phase zaps everything else. This is achieved by the helper function
__zap_gfn_range().

Cc: Sean Christopherson <seanjc@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 57 ++++++++++++++++++++++++++------------
 1 file changed, 40 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 89d16bb104de..3fadc51c004a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -697,24 +697,16 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
  * account for the possibility that other threads are modifying the paging
  * structures concurrently. If shared is false, this thread should hold the
  * MMU lock in write mode.
+ *
+ * If zap_all is true, eliminate all the paging structures that contains the
+ * SPTEs.
  */
-static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
-			  gfn_t start, gfn_t end, bool can_yield, bool flush,
-			  bool shared)
+static bool __zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
+			    gfn_t start, gfn_t end, bool can_yield, bool flush,
+			    bool shared, bool zap_all)
 {
-	gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
-	bool zap_all = (start == 0 && end >= max_gfn_host);
 	struct tdp_iter iter;
 
-	/*
-	 * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
-	 * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
-	 * and so KVM will never install a SPTE for such addresses.
-	 */
-	end = min(end, max_gfn_host);
-
-	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
-
 	rcu_read_lock();
 
 	tdp_root_for_each_pte(iter, root, start, end) {
@@ -725,17 +717,24 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			continue;
 		}
 
-		if (!is_shadow_present_pte(iter.old_spte))
+		/*
+		 * In zap_all case, ignore the checking of present since we have
+		 * to zap everything.
+		 */
+		if (!zap_all && !is_shadow_present_pte(iter.old_spte))
 			continue;
 
 		/*
 		 * If this is a non-last-level SPTE that covers a larger range
 		 * than should be zapped, continue, and zap the mappings at a
-		 * lower level, except when zapping all SPTEs.
+		 * lower level. Actual zapping started at proper granularity
+		 * that is not so large as to cause a soft lockup when handling
+		 * the changed pte (which does not yield).
 		 */
 		if (!zap_all &&
 		    (iter.gfn < start ||
-		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
+		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end ||
+		     iter.level > PG_LEVEL_1G) &&
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
 
@@ -756,6 +755,30 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	return flush;
 }
 
+static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
+			  gfn_t start, gfn_t end, bool can_yield, bool flush,
+			  bool shared)
+{
+	gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
+	bool zap_all = (start == 0 && end >= max_gfn_host);
+
+	/*
+	 * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
+	 * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
+	 * and so KVM will never install a SPTE for such addresses.
+	 */
+	end = min(end, max_gfn_host);
+
+	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+
+	flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush, shared,
+				false);
+	if (zap_all)
+		flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush,
+					shared, true);
+	return flush;
+}
+
 /*
  * Tears down the mappings for the range of gfns, [start, end), and frees the
  * non-root pages mapping GFNs strictly within that range. Returns true if
-- 
2.34.0.rc2.393.gf8c9666880-goog



* Re: [PATCH 2/2] KVM: mmu/x86: optimize zapping by retaining non-leaf SPTEs and avoid rcu stall
  2021-11-24 21:44 ` [PATCH 2/2] KVM: mmu/x86: optimize zapping by retaining non-leaf SPTEs and avoid rcu stall Mingwei Zhang
@ 2021-11-30  0:48   ` David Matlack
  2021-11-30  0:50     ` David Matlack
  2021-11-30 19:40     ` Sean Christopherson
  0 siblings, 2 replies; 7+ messages in thread
From: David Matlack @ 2021-11-30  0:48 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel, Ben Gardon

On Wed, Nov 24, 2021 at 09:44:21PM +0000, Mingwei Zhang wrote:
> TDP MMU SPTE zapping process currently uses two levels of iterations. The
> first level iteration happens at the for loop within the zap_gfn_range()
> with the purpose of calibrating the accurate range for zapping. The second
> level itreration start at tdp_mmu_set_spte{,_atomic}() that tears down the

iteration

> whole paging structures (leaf and non-leaf SPTEs) within the range. The
> former iteration is yield safe, while the second one is not.

I know what you mean but I'd suggest being more specific than "yield
safe". For example:

  Unlike the outer loop, the recursive zapping done under
  tdp_mmu_set_spte{,_atomic} does not yield. Since zapping is done with
  a pre-order traversal, zapping sufficiently large ranges can lead to
  RCU stall warnings.

I'd also clarify here that the TDP MMU iterator uses a pre-order traversal,
which causes KVM to end up doing the maximum amount of zapping under
tdp_mmu_set_spte{,_atomic} rather than in the outer for loop.

> 
> In many cases, zapping SPTE process could be optimized since the non-leaf
> SPTEs could most likely be retained for the next allocation. On the other
> hand, for large scale SPTE zapping scenarios, we may end up zapping too
> many SPTEs and use excessive CPU time that causes the RCU stall warning.
> 
> The follow selftest reproduces the warning:
> 
> 	(env: kvm.tdp_mmu=Y)
> 	./dirty_log_perf_test -v 64 -b 8G
> 
> Optimize the zapping process by skipping all SPTEs above a certain level in
> the first iteration. This allows us to control the granularity of the
> actual zapping and invoke tdp_mmu_iter_cond_resched() on time. In addition,
> we would retain some of the non-leaf SPTEs to accelerate next allocation.
> 
> For the selection of the `certain level`, we choose the PG_LEVEL_1G because
> it is currently the largest page size supported and it natually fits the
> scenario of splitting large pages.
> 
> For `zap_all` case (usually) at VM teardown time, we use a two-phase
> mechanism: the 1st phase zaps all SPTEs at PG_LEVEL_1G level and 2nd phase
> zaps everything else. This is achieved by the helper function
> __zap_gfn_range().
> 
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Ben Gardon <bgardon@google.com>
> Cc: David Matlack <dmatlack@google.com>
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 57 ++++++++++++++++++++++++++------------
>  1 file changed, 40 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 89d16bb104de..3fadc51c004a 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -697,24 +697,16 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
>   * account for the possibility that other threads are modifying the paging
>   * structures concurrently. If shared is false, this thread should hold the
>   * MMU lock in write mode.
> + *
> + * If zap_all is true, eliminate all the paging structures that contains the
> + * SPTEs.
>   */
> -static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> -			  gfn_t start, gfn_t end, bool can_yield, bool flush,
> -			  bool shared)
> +static bool __zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> +			    gfn_t start, gfn_t end, bool can_yield, bool flush,
> +			    bool shared, bool zap_all)
>  {
> -	gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> -	bool zap_all = (start == 0 && end >= max_gfn_host);
>  	struct tdp_iter iter;
>  
> -	/*
> -	 * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> -	 * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> -	 * and so KVM will never install a SPTE for such addresses.
> -	 */
> -	end = min(end, max_gfn_host);
> -
> -	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> -
>  	rcu_read_lock();
>  
>  	tdp_root_for_each_pte(iter, root, start, end) {
> @@ -725,17 +717,24 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  			continue;
>  		}
>  
> -		if (!is_shadow_present_pte(iter.old_spte))
> +		/*
> +		 * In zap_all case, ignore the checking of present since we have
> +		 * to zap everything.
> +		 */
> +		if (!zap_all && !is_shadow_present_pte(iter.old_spte))
>  			continue;

I don't believe there's any reason to attempt to zap a non-present spte,
even in the zap_all case. In any case, this change deserves its own
patch and a commit message that describes why the old logic is incorrect
and how this fixes it.

>  
>  		/*
>  		 * If this is a non-last-level SPTE that covers a larger range
>  		 * than should be zapped, continue, and zap the mappings at a
> -		 * lower level, except when zapping all SPTEs.
> +		 * lower level. Actual zapping started at proper granularity
> +		 * that is not so large as to cause a soft lockup when handling
> +		 * the changed pte (which does not yield).
>  		 */
>  		if (!zap_all &&
>  		    (iter.gfn < start ||
> -		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
> +		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end ||
> +		     iter.level > PG_LEVEL_1G) &&
>  		    !is_last_spte(iter.old_spte, iter.level))
>  			continue;

This if statement is getting a bit long. I'd suggest breaking out the
level check and also using KVM_MAX_HUGEPAGE_LEVEL.

e.g.

        /*
         * If not doing zap_all, only zap up to the huge page level to
         * avoid doing too much work in the recursive tdp_mmu_set_spte*
         * call below, since it does not yield.
         *
         * This will potentially leave behind some childless page tables
         * but that's ok because ...
         */
         if (!zap_all && iter.level > KVM_MAX_HUGEPAGE_LEVEL)
                continue;

And on that note, what is the reasoning for why it's ok to leave behind
childless page tables? I assume it's because most of the time we'll use
that page table again in the future, and at worst we leave the page
table allocated until the VM is cleaned up?

>  
> @@ -756,6 +755,30 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  	return flush;
>  }
>  
> +static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> +			  gfn_t start, gfn_t end, bool can_yield, bool flush,
> +			  bool shared)
> +{
> +	gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> +	bool zap_all = (start == 0 && end >= max_gfn_host);
> +
> +	/*
> +	 * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> +	 * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> +	 * and so KVM will never install a SPTE for such addresses.
> +	 */
> +	end = min(end, max_gfn_host);
> +
> +	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> +
> +	flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush, shared,
> +				false);
> +	if (zap_all)
> +		flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush,
> +					shared, true);
> +	return flush;
> +}
> +
>  /*
>   * Tears down the mappings for the range of gfns, [start, end), and frees the
>   * non-root pages mapping GFNs strictly within that range. Returns true if
> -- 
> 2.34.0.rc2.393.gf8c9666880-goog
> 


* Re: [PATCH 2/2] KVM: mmu/x86: optimize zapping by retaining non-leaf SPTEs and avoid rcu stall
  2021-11-30  0:48   ` David Matlack
@ 2021-11-30  0:50     ` David Matlack
  2021-11-30 19:40     ` Sean Christopherson
  1 sibling, 0 replies; 7+ messages in thread
From: David Matlack @ 2021-11-30  0:50 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel, Ben Gardon

On Tue, Nov 30, 2021 at 12:48:24AM +0000, David Matlack wrote:
> On Wed, Nov 24, 2021 at 09:44:21PM +0000, Mingwei Zhang wrote:
> > TDP MMU SPTE zapping process currently uses two levels of iterations. The
> > first level iteration happens at the for loop within the zap_gfn_range()
> > with the purpose of calibrating the accurate range for zapping. The second
> > level itreration start at tdp_mmu_set_spte{,_atomic}() that tears down the
> 
> iteration
> 
> > whole paging structures (leaf and non-leaf SPTEs) within the range. The
> > former iteration is yield safe, while the second one is not.
> 
> I know what you mean but I'd suggest being more specific than "yield
> safe". For example:
> 
>   Unlike the outer loop, the recursive zapping done under
>   tdp_mmu_set_spte{,_atomic} does not yield. Since zapping is done with
>   a pre-order traversal, zapping sufficiently large ranges can lead to
>   RCU stall warnings.
> 
> I'd also clarify here that the TDP MMU iterator uses a pre-order
> traversal which causes us KVM to end up doing the maximum amount of
> zapping under tdp_mmu_set_spte{,_atomic} and not the outer for loop.

(Ah, sorry for the redundant suggestion. I wrote this paragraph and then
reworked my suggested wording to mention the pre-order traversal.)

> 
> > 
> > In many cases, zapping SPTE process could be optimized since the non-leaf
> > SPTEs could most likely be retained for the next allocation. On the other
> > hand, for large scale SPTE zapping scenarios, we may end up zapping too
> > many SPTEs and use excessive CPU time that causes the RCU stall warning.
> > 
> > The follow selftest reproduces the warning:
> > 
> > 	(env: kvm.tdp_mmu=Y)
> > 	./dirty_log_perf_test -v 64 -b 8G
> > 
> > Optimize the zapping process by skipping all SPTEs above a certain level in
> > the first iteration. This allows us to control the granularity of the
> > actual zapping and invoke tdp_mmu_iter_cond_resched() on time. In addition,
> > we would retain some of the non-leaf SPTEs to accelerate next allocation.
> > 
> > For the selection of the `certain level`, we choose the PG_LEVEL_1G because
> > it is currently the largest page size supported and it natually fits the
> > scenario of splitting large pages.
> > 
> > For `zap_all` case (usually) at VM teardown time, we use a two-phase
> > mechanism: the 1st phase zaps all SPTEs at PG_LEVEL_1G level and 2nd phase
> > zaps everything else. This is achieved by the helper function
> > __zap_gfn_range().
> > 
> > Cc: Sean Christopherson <seanjc@google.com>
> > Cc: Ben Gardon <bgardon@google.com>
> > Cc: David Matlack <dmatlack@google.com>
> > 
> > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 57 ++++++++++++++++++++++++++------------
> >  1 file changed, 40 insertions(+), 17 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 89d16bb104de..3fadc51c004a 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -697,24 +697,16 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
> >   * account for the possibility that other threads are modifying the paging
> >   * structures concurrently. If shared is false, this thread should hold the
> >   * MMU lock in write mode.
> > + *
> > + * If zap_all is true, eliminate all the paging structures that contains the
> > + * SPTEs.
> >   */
> > -static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > -			  gfn_t start, gfn_t end, bool can_yield, bool flush,
> > -			  bool shared)
> > +static bool __zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > +			    gfn_t start, gfn_t end, bool can_yield, bool flush,
> > +			    bool shared, bool zap_all)
> >  {
> > -	gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> > -	bool zap_all = (start == 0 && end >= max_gfn_host);
> >  	struct tdp_iter iter;
> >  
> > -	/*
> > -	 * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> > -	 * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> > -	 * and so KVM will never install a SPTE for such addresses.
> > -	 */
> > -	end = min(end, max_gfn_host);
> > -
> > -	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> > -
> >  	rcu_read_lock();
> >  
> >  	tdp_root_for_each_pte(iter, root, start, end) {
> > @@ -725,17 +717,24 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> >  			continue;
> >  		}
> >  
> > -		if (!is_shadow_present_pte(iter.old_spte))
> > +		/*
> > +		 * In zap_all case, ignore the checking of present since we have
> > +		 * to zap everything.
> > +		 */
> > +		if (!zap_all && !is_shadow_present_pte(iter.old_spte))
> >  			continue;
> 
> I don't believe there's any reason to attempt to zap a non-present spte,
> even in the zap_all case. In any case, this change deserves its own
> patch and a commit message that describes why the old logic is incorrect
> and how this fixes it.
> 
> >  
> >  		/*
> >  		 * If this is a non-last-level SPTE that covers a larger range
> >  		 * than should be zapped, continue, and zap the mappings at a
> > -		 * lower level, except when zapping all SPTEs.
> > +		 * lower level. Actual zapping started at proper granularity
> > +		 * that is not so large as to cause a soft lockup when handling
> > +		 * the changed pte (which does not yield).
> >  		 */
> >  		if (!zap_all &&
> >  		    (iter.gfn < start ||
> > -		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
> > +		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end ||
> > +		     iter.level > PG_LEVEL_1G) &&
> >  		    !is_last_spte(iter.old_spte, iter.level))
> >  			continue;
> 
> This if statement is getting a bit long. I'd suggest breaking out the
> level check and also using KVM_MAX_HUGEPAGE_LEVEL.
> 
> e.g.
> 
>         /*
>          * If not doing zap_all, only zap up to the huge page level to
>          * avoid doing too much work in the recursive tdp_mmu_set_spte*
>          * call below, since it does not yield.
>          *
>          * This will potentially leave behind some childless page tables
>          * but that's ok because ...
>          */
>          if (!zap_all && iter.level > KVM_MAX_HUGEPAGE_LEVEL)
>                 continue;
> 
> And on that note, what is the reasoning for why it's ok to leave behind
> childless page tables? I assume it's because most of the time we'll use
> that page table again in the future, and at worst we leave the page
> table allocated until the VM is cleaned up?
> 
> >  
> > @@ -756,6 +755,30 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> >  	return flush;
> >  }
> >  
> > +static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > +			  gfn_t start, gfn_t end, bool can_yield, bool flush,
> > +			  bool shared)
> > +{
> > +	gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> > +	bool zap_all = (start == 0 && end >= max_gfn_host);
> > +
> > +	/*
> > +	 * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> > +	 * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> > +	 * and so KVM will never install a SPTE for such addresses.
> > +	 */
> > +	end = min(end, max_gfn_host);
> > +
> > +	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> > +
> > +	flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush, shared,
> > +				false);
> > +	if (zap_all)
> > +		flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush,
> > +					shared, true);
> > +	return flush;
> > +}
> > +
> >  /*
> >   * Tears down the mappings for the range of gfns, [start, end), and frees the
> >   * non-root pages mapping GFNs strictly within that range. Returns true if
> > -- 
> > 2.34.0.rc2.393.gf8c9666880-goog
> > 


* Re: [PATCH 1/2] Revert "KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs"
  2021-11-24 21:44 ` [PATCH 1/2] Revert "KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs" Mingwei Zhang
@ 2021-11-30  0:50   ` David Matlack
  0 siblings, 0 replies; 7+ messages in thread
From: David Matlack @ 2021-11-30  0:50 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel, Ben Gardon

On Wed, Nov 24, 2021 at 09:44:20PM +0000, Mingwei Zhang wrote:
> Not stepping down in TDP iterator in `zap_all` case avoids re-reading the
> non-leaf SPTEs, thus accelerates the zapping process . But when the number
> of SPTEs is too large, we may run out of CPU time and causes a RCU stall
> warnings in __handle_changed_pte() in the context of zap_gfn_range().
> 
> Revert this patch to allow eliminating RCU stall warning using a two-phase
> zapping for `zap_all` case.
> 
> This reverts commit 0103098fb4f13b447b26ed514bcd3140f6791047.
> 
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Ben Gardon <bgardon@google.com>
> Cc: David Matlack <dmatlack@google.com>
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>

Reviewed-by: David Matlack <dmatlack@google.com>

> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7c5dd83e52de..89d16bb104de 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -706,12 +706,6 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  	bool zap_all = (start == 0 && end >= max_gfn_host);
>  	struct tdp_iter iter;
>  
> -	/*
> -	 * No need to try to step down in the iterator when zapping all SPTEs,
> -	 * zapping the top-level non-leaf SPTEs will recurse on their children.
> -	 */
> -	int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
> -
>  	/*
>  	 * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
>  	 * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> @@ -723,8 +717,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  
>  	rcu_read_lock();
>  
> -	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
> -				   min_level, start, end) {
> +	tdp_root_for_each_pte(iter, root, start, end) {
>  retry:
>  		if (can_yield &&
>  		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
> -- 
> 2.34.0.rc2.393.gf8c9666880-goog
> 


* Re: [PATCH 2/2] KVM: mmu/x86: optimize zapping by retaining non-leaf SPTEs and avoid rcu stall
  2021-11-30  0:48   ` David Matlack
  2021-11-30  0:50     ` David Matlack
@ 2021-11-30 19:40     ` Sean Christopherson
  1 sibling, 0 replies; 7+ messages in thread
From: Sean Christopherson @ 2021-11-30 19:40 UTC (permalink / raw)
  To: David Matlack
  Cc: Mingwei Zhang, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel, Ben Gardon

On Tue, Nov 30, 2021, David Matlack wrote:
> > +		/*
> > +		 * In zap_all case, ignore the checking of present since we have
> > +		 * to zap everything.
> > +		 */
> > +		if (!zap_all && !is_shadow_present_pte(iter.old_spte))
> >  			continue;
> 
> I don't believe there's any reason to attempt to zap a non-present spte,
> even in the zap_all case. In any case, this change deserves its own
> patch and a commit message that describes why the old logic is incorrect
> and how this fixes it.

Yep, at best it's wasted cycles, at worst it will trigger WARN/BUG due to using
accessors that require the caller to check for a shadow present SPTE.

> >  		 * If this is a non-last-level SPTE that covers a larger range
> >  		 * than should be zapped, continue, and zap the mappings at a
> > -		 * lower level, except when zapping all SPTEs.
> > +		 * lower level. Actual zapping started at proper granularity
> > +		 * that is not so large as to cause a soft lockup when handling
> > +		 * the changed pte (which does not yield).
> >  		 */
> >  		if (!zap_all &&
> >  		    (iter.gfn < start ||
> > -		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
> > +		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end ||
> > +		     iter.level > PG_LEVEL_1G) &&
> >  		    !is_last_spte(iter.old_spte, iter.level))
> >  			continue;
> 
> This if statement is getting a bit long. I'd suggest breaking out the
> level check and also using KVM_MAX_HUGEPAGE_LEVEL.
> 
> e.g.
> 
>         /*
>          * If not doing zap_all, only zap up to the huge page level to
>          * avoid doing too much work in the recursive tdp_mmu_set_spte*
>          * call below, since it does not yield.
>          *
>          * This will potentially leave behind some childless page tables
>          * but that's ok because ...
>          */
>          if (!zap_all && iter.level > KVM_MAX_HUGEPAGE_LEVEL)
>                 continue;
> 
> And on that note, what is the reasoning for why it's ok to leave behind
> childless page tables? I assume it's because most of the time we'll use
> that page table again in the future, and at worst we leave the page
> table allocated until the VM is cleaned up?

Yes.  If zap_all=false, KVM is zapping a gfn range but not dropping or invalidating
the root.  The non-leaf paging structures are perfectly valid and can be reused.
There's certainly no guarantee that a structure will be reused, but keeping the
pages is ok because the memory consumed scales with the size of the VM, and the
number of shadow pages used for TDP is quite small, e.g. an order of magnitude less
than what's needed for shadow paging.

If we're ok waiting until my zap n' flush rework[*] lands, this is much easier to
handle, e.g. I think this will do it.

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 62cb357b1dff..8d3df03024b7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -765,7 +765,7 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
                             bool shared, bool root_is_unreachable)
 {
        struct tdp_iter iter;
-
+       bool leafs_only = true;
        gfn_t end = tdp_mmu_max_gfn_host();
        gfn_t start = 0;

@@ -773,12 +773,11 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,

        rcu_read_lock();

-       /*
-        * No need to try to step down in the iterator when zapping an entire
-        * root, zapping an upper-level SPTE will recurse on its children.
-        */
+again:
+       /* Add comment here. */
        for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
-                                  root->role.level, start, end) {
+                                  leafs_only ? PG_LEVEL_4K : root->role.level,
+                                  start, end) {
 retry:
                if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
                        continue;
@@ -786,6 +785,9 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
                if (!is_shadow_present_pte(iter.old_spte))
                        continue;

+               if (leafs_only && !is_last_spte(iter.old_spte, iter.level))
+                       continue;
+
                if (!shared) {
                        tdp_mmu_set_spte(kvm, &iter, 0);
                } else if (!tdp_mmu_set_spte_atomic(kvm, &iter, 0)) {
@@ -807,6 +809,11 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
                WARN_ON_ONCE(root_is_unreachable && root->role.invalid);
        }

+       if (leafs_only) {
+               leafs_only = false;
+               goto again;
+       }
+
        rcu_read_unlock();
 }

[*] https://lore.kernel.org/all/20211120045046.3940942-1-seanjc@google.com/


