* [PATCH v2 0/9] KVM: x86/MMU: Optimize disabling dirty logging
@ 2022-03-21 22:43 Ben Gardon
  2022-03-21 22:43 ` [PATCH v2 1/9] KVM: x86/mmu: Move implementation of make_spte to a helper Ben Gardon
                   ` (10 more replies)
  0 siblings, 11 replies; 33+ messages in thread
From: Ben Gardon @ 2022-03-21 22:43 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid, Ben Gardon

Currently disabling dirty logging with the TDP MMU is extremely slow.
On a 96 vCPU / 96G VM it takes ~256 seconds to disable dirty logging
with the TDP MMU, as opposed to ~4 seconds with the legacy MMU. This
series optimizes TLB flushes and introduces in-place large page
promotion, to bring the disable dirty log time down to ~3 seconds.

Testing:
Ran KVM selftests and kvm-unit-tests on an Intel Haswell. This
series introduced no new failures.

Performance:

Without this series, TDP MMU:
> ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
Test iterations: 2
Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
guest physical test memory offset: 0x3fe7c0000000
Populate memory time: 4.972184425s
Enabling dirty logging time: 0.001943807s

Iteration 1 dirty memory time: 0.061862112s
Iteration 1 get dirty log time: 0.001416413s
Iteration 1 clear dirty log time: 1.417428057s
Iteration 2 dirty memory time: 0.664103656s
Iteration 2 get dirty log time: 0.000676724s
Iteration 2 clear dirty log time: 1.149387201s
Disabling dirty logging time: 256.682188868s
Get dirty log over 2 iterations took 0.002093137s. (Avg 0.001046568s/iteration)
Clear dirty log over 2 iterations took 2.566815258s. (Avg 1.283407629s/iteration)

Without this series, Legacy MMU:
> ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
Test iterations: 2
Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
guest physical test memory offset: 0x3fe7c0000000
Populate memory time: 4.892940915s
Enabling dirty logging time: 0.001864603s

Iteration 1 dirty memory time: 0.060490391s
Iteration 1 get dirty log time: 0.001416277s
Iteration 1 clear dirty log time: 0.323548614s
Iteration 2 dirty memory time: 29.217064826s
Iteration 2 get dirty log time: 0.000696202s
Iteration 2 clear dirty log time: 0.907089084s
Disabling dirty logging time: 4.246216551s
Get dirty log over 2 iterations took 0.002112479s. (Avg 0.001056239s/iteration)
Clear dirty log over 2 iterations took 1.230637698s. (Avg 0.615318849s/iteration)

With this series, TDP MMU:
(Updated since RFC. Pulling out patches 1-4 could have a performance impact.)
> ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
Test iterations: 2
Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
guest physical test memory offset: 0x3fe7c0000000
Populate memory time: 4.878083336s
Enabling dirty logging time: 0.001874340s

Iteration 1 dirty memory time: 0.054867383s
Iteration 1 get dirty log time: 0.001368377s
Iteration 1 clear dirty log time: 1.406960856s
Iteration 2 dirty memory time: 0.679301083s
Iteration 2 get dirty log time: 0.000662905s
Iteration 2 clear dirty log time: 1.138263359s
Disabling dirty logging time: 3.169381810s
Get dirty log over 2 iterations took 0.002031282s. (Avg 0.001015641s/iteration)
Clear dirty log over 2 iterations took 2.545224215s. (Avg 1.272612107s/iteration)

Patch breakdown:
Patches 1-4 remove the need for a vCPU pointer to make_spte
Patches 5-8 are small refactors in preparation for in-place lpage promotion
Patch 9 implements in-place largepage promotion when disabling dirty logging

Changelog:
RFC -> v1:
	Dropped the first 4 patches from the series. Patch 1 was sent
	separately, patches 2-4 will be taken over by Sean Christopherson.
	Incorporated David Matlack's Reviewed-by.
v1 -> v2:
	Several patches were queued and dropped from this revision.
	Incorporated feedback from Peter Xu on the last patch in the series.
	Refreshed performance data
		Between versions 1 and 2 of this series, the baseline disable
		time with the TDP MMU (i.e. without this series) went from 45s
		to 256s, a major regression. I was testing on a Skylake before
		and a Haswell this time, but that does not explain the huge
		performance loss.

Ben Gardon (9):
  KVM: x86/mmu: Move implementation of make_spte to a helper
  KVM: x86/mmu: Factor mt_mask out of __make_spte
  KVM: x86/mmu: Factor shadow_zero_check out of __make_spte
  KVM: x86/mmu: Replace vcpu argument with kvm pointer in make_spte
  KVM: x86/mmu: Factor out the meat of reset_tdp_shadow_zero_bits_mask
  KVM: x86/mmu: Factor out part of vmx_get_mt_mask which does not depend
    on vcpu
  KVM: x86/mmu: Add try_get_mt_mask to x86_ops
  KVM: x86/mmu: Make kvm_is_mmio_pfn usable outside of spte.c
  KVM: x86/mmu: Promote pages in-place when disabling dirty logging

 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  2 +
 arch/x86/kvm/mmu/mmu.c             | 21 +++++----
 arch/x86/kvm/mmu/mmu_internal.h    |  6 +++
 arch/x86/kvm/mmu/spte.c            | 39 +++++++++++-----
 arch/x86/kvm/mmu/spte.h            |  6 +++
 arch/x86/kvm/mmu/tdp_mmu.c         | 73 +++++++++++++++++++++++++++++-
 arch/x86/kvm/svm/svm.c             |  9 ++++
 arch/x86/kvm/vmx/vmx.c             | 25 ++++++++--
 9 files changed, 155 insertions(+), 27 deletions(-)

-- 
2.35.1.894.gb6a874cedc-goog


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v2 1/9] KVM: x86/mmu: Move implementation of make_spte to a helper
  2022-03-21 22:43 [PATCH v2 0/9] KVM: x86/MMU: Optimize disabling dirty logging Ben Gardon
@ 2022-03-21 22:43 ` Ben Gardon
  2022-03-21 22:43 ` [PATCH v2 2/9] KVM: x86/mmu: Factor mt_mask out of __make_spte Ben Gardon
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: Ben Gardon @ 2022-03-21 22:43 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid, Ben Gardon

Move the implementation of make_spte to a helper function. This will
facilitate factoring out all uses of the vCPU pointer from the helper
in subsequent commits.

No functional change intended.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/spte.c | 20 +++++++++++++++-----
 arch/x86/kvm/mmu/spte.h |  4 ++++
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4739b53c9734..d3da0d3d41cb 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -90,11 +90,10 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 				     E820_TYPE_RAM);
 }
 
-bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
-	       const struct kvm_memory_slot *slot,
-	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
-	       u64 old_spte, bool prefetch, bool can_unsync,
-	       bool host_writable, u64 *new_spte)
+bool __make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+		 const struct kvm_memory_slot *slot, unsigned int pte_access,
+		 gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
+		 bool can_unsync, bool host_writable, u64 *new_spte)
 {
 	int level = sp->role.level;
 	u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -192,6 +191,17 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	return wrprot;
 }
 
+bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+	       const struct kvm_memory_slot *slot,
+	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
+	       u64 old_spte, bool prefetch, bool can_unsync,
+	       bool host_writable, u64 *new_spte)
+{
+	return __make_spte(vcpu, sp, slot, pte_access, gfn, pfn, old_spte,
+			   prefetch, can_unsync, host_writable, new_spte);
+
+}
+
 static u64 make_spte_executable(u64 spte)
 {
 	bool is_access_track = is_access_track_spte(spte);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 73f12615416f..3fae3c3124f7 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -410,6 +410,10 @@ static inline u64 get_mmio_spte_generation(u64 spte)
 	return gen;
 }
 
+bool __make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+		 const struct kvm_memory_slot *slot, unsigned int pte_access,
+		 gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
+		 bool can_unsync, bool host_writable, u64 *new_spte);
 bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       const struct kvm_memory_slot *slot,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
-- 
2.35.1.894.gb6a874cedc-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 2/9] KVM: x86/mmu: Factor mt_mask out of __make_spte
  2022-03-21 22:43 [PATCH v2 0/9] KVM: x86/MMU: Optimize disabling dirty logging Ben Gardon
  2022-03-21 22:43 ` [PATCH v2 1/9] KVM: x86/mmu: Move implementation of make_spte to a helper Ben Gardon
@ 2022-03-21 22:43 ` Ben Gardon
  2022-03-21 22:43 ` [PATCH v2 3/9] KVM: x86/mmu: Factor shadow_zero_check " Ben Gardon
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: Ben Gardon @ 2022-03-21 22:43 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid, Ben Gardon

In service of removing the vCPU pointer from __make_spte, factor the memory
type mask calculation out of __make_spte.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/spte.c | 12 ++++++++----
 arch/x86/kvm/mmu/spte.h |  3 ++-
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index d3da0d3d41cb..931cf93c3b7e 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -93,7 +93,8 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 bool __make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		 const struct kvm_memory_slot *slot, unsigned int pte_access,
 		 gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
-		 bool can_unsync, bool host_writable, u64 *new_spte)
+		 bool can_unsync, bool host_writable, u64 mt_mask,
+		 u64 *new_spte)
 {
 	int level = sp->role.level;
 	u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -130,8 +131,7 @@ bool __make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	if (level > PG_LEVEL_4K)
 		spte |= PT_PAGE_SIZE_MASK;
 	if (tdp_enabled)
-		spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
-			kvm_is_mmio_pfn(pfn));
+		spte |= mt_mask;
 
 	if (host_writable)
 		spte |= shadow_host_writable_mask;
@@ -197,8 +197,12 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       u64 old_spte, bool prefetch, bool can_unsync,
 	       bool host_writable, u64 *new_spte)
 {
+	u64 mt_mask = static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
+						       kvm_is_mmio_pfn(pfn));
+
 	return __make_spte(vcpu, sp, slot, pte_access, gfn, pfn, old_spte,
-			   prefetch, can_unsync, host_writable, new_spte);
+			   prefetch, can_unsync, host_writable, mt_mask,
+			   new_spte);
 
 }
 
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 3fae3c3124f7..d051f955699e 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -413,7 +413,8 @@ static inline u64 get_mmio_spte_generation(u64 spte)
 bool __make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		 const struct kvm_memory_slot *slot, unsigned int pte_access,
 		 gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
-		 bool can_unsync, bool host_writable, u64 *new_spte);
+		 bool can_unsync, bool host_writable, u64 mt_mask,
+		 u64 *new_spte);
 bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       const struct kvm_memory_slot *slot,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
-- 
2.35.1.894.gb6a874cedc-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 3/9] KVM: x86/mmu: Factor shadow_zero_check out of __make_spte
  2022-03-21 22:43 [PATCH v2 0/9] KVM: x86/MMU: Optimize disabling dirty logging Ben Gardon
  2022-03-21 22:43 ` [PATCH v2 1/9] KVM: x86/mmu: Move implementation of make_spte to a helper Ben Gardon
  2022-03-21 22:43 ` [PATCH v2 2/9] KVM: x86/mmu: Factor mt_mask out of __make_spte Ben Gardon
@ 2022-03-21 22:43 ` Ben Gardon
  2022-04-12 15:52   ` Sean Christopherson
  2022-03-21 22:43 ` [PATCH v2 4/9] KVM: x86/mmu: Replace vcpu argument with kvm pointer in make_spte Ben Gardon
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Ben Gardon @ 2022-03-21 22:43 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid, Ben Gardon

In the interest of developing a version of __make_spte that can function
without a vCPU pointer, factor out the shadow_zero_check to be an
additional argument to the function.

No functional change intended.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/spte.c | 10 ++++++----
 arch/x86/kvm/mmu/spte.h |  2 +-
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 931cf93c3b7e..ef2d85577abb 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -94,7 +94,7 @@ bool __make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		 const struct kvm_memory_slot *slot, unsigned int pte_access,
 		 gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
 		 bool can_unsync, bool host_writable, u64 mt_mask,
-		 u64 *new_spte)
+		 struct rsvd_bits_validate *shadow_zero_check, u64 *new_spte)
 {
 	int level = sp->role.level;
 	u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -177,9 +177,9 @@ bool __make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	if (prefetch)
 		spte = mark_spte_for_access_track(spte);
 
-	WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
+	WARN_ONCE(is_rsvd_spte(shadow_zero_check, spte, level),
 		  "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
-		  get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
+		  get_rsvd_bits(shadow_zero_check, spte, level));
 
 	if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
 		/* Enforced by kvm_mmu_hugepage_adjust. */
@@ -199,10 +199,12 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 {
 	u64 mt_mask = static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
 						       kvm_is_mmio_pfn(pfn));
+	struct rsvd_bits_validate *shadow_zero_check =
+			&vcpu->arch.mmu->shadow_zero_check;
 
 	return __make_spte(vcpu, sp, slot, pte_access, gfn, pfn, old_spte,
 			   prefetch, can_unsync, host_writable, mt_mask,
-			   new_spte);
+			   shadow_zero_check, new_spte);
 
 }
 
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index d051f955699e..e8a051188eb6 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -414,7 +414,7 @@ bool __make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		 const struct kvm_memory_slot *slot, unsigned int pte_access,
 		 gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
 		 bool can_unsync, bool host_writable, u64 mt_mask,
-		 u64 *new_spte);
+		 struct rsvd_bits_validate *shadow_zero_check, u64 *new_spte);
 bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       const struct kvm_memory_slot *slot,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
-- 
2.35.1.894.gb6a874cedc-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 4/9] KVM: x86/mmu: Replace vcpu argument with kvm pointer in make_spte
  2022-03-21 22:43 [PATCH v2 0/9] KVM: x86/MMU: Optimize disabling dirty logging Ben Gardon
                   ` (2 preceding siblings ...)
  2022-03-21 22:43 ` [PATCH v2 3/9] KVM: x86/mmu: Factor shadow_zero_check " Ben Gardon
@ 2022-03-21 22:43 ` Ben Gardon
  2022-03-21 22:43 ` [PATCH v2 5/9] KVM: x86/mmu: Factor out the meat of reset_tdp_shadow_zero_bits_mask Ben Gardon
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: Ben Gardon @ 2022-03-21 22:43 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid, Ben Gardon

Now that nothing in __make_spte actually needs the vCPU argument, just
pass in a pointer to the struct kvm. This allows the function to be used
in situations where there is no relevant vCPU.

No functional change intended.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/spte.c | 9 ++++-----
 arch/x86/kvm/mmu/spte.h | 2 +-
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index ef2d85577abb..45e9c0c3932e 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -90,7 +90,7 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 				     E820_TYPE_RAM);
 }
 
-bool __make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+bool __make_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
 		 const struct kvm_memory_slot *slot, unsigned int pte_access,
 		 gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
 		 bool can_unsync, bool host_writable, u64 mt_mask,
@@ -161,7 +161,7 @@ bool __make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		 * e.g. it's write-tracked (upper-level SPs) or has one or more
 		 * shadow pages and unsync'ing pages is not allowed.
 		 */
-		if (mmu_try_to_unsync_pages(vcpu->kvm, slot, gfn, can_unsync, prefetch)) {
+		if (mmu_try_to_unsync_pages(kvm, slot, gfn, can_unsync, prefetch)) {
 			pgprintk("%s: found shadow page for %llx, marking ro\n",
 				 __func__, gfn);
 			wrprot = true;
@@ -184,7 +184,7 @@ bool __make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
 		/* Enforced by kvm_mmu_hugepage_adjust. */
 		WARN_ON(level > PG_LEVEL_4K);
-		mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
+		mark_page_dirty_in_slot(kvm, slot, gfn);
 	}
 
 	*new_spte = spte;
@@ -202,10 +202,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	struct rsvd_bits_validate *shadow_zero_check =
 			&vcpu->arch.mmu->shadow_zero_check;
 
-	return __make_spte(vcpu, sp, slot, pte_access, gfn, pfn, old_spte,
+	return __make_spte(vcpu->kvm, sp, slot, pte_access, gfn, pfn, old_spte,
 			   prefetch, can_unsync, host_writable, mt_mask,
 			   shadow_zero_check, new_spte);
-
 }
 
 static u64 make_spte_executable(u64 spte)
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index e8a051188eb6..cee02fe63429 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -410,7 +410,7 @@ static inline u64 get_mmio_spte_generation(u64 spte)
 	return gen;
 }
 
-bool __make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+bool __make_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
 		 const struct kvm_memory_slot *slot, unsigned int pte_access,
 		 gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
 		 bool can_unsync, bool host_writable, u64 mt_mask,
-- 
2.35.1.894.gb6a874cedc-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 5/9] KVM: x86/mmu: Factor out the meat of reset_tdp_shadow_zero_bits_mask
  2022-03-21 22:43 [PATCH v2 0/9] KVM: x86/MMU: Optimize disabling dirty logging Ben Gardon
                   ` (3 preceding siblings ...)
  2022-03-21 22:43 ` [PATCH v2 4/9] KVM: x86/mmu: Replace vcpu argument with kvm pointer in make_spte Ben Gardon
@ 2022-03-21 22:43 ` Ben Gardon
  2022-04-12 15:46   ` Sean Christopherson
  2022-03-21 22:43 ` [PATCH v2 6/9] KVM: x86/mmu: Factor out part of vmx_get_mt_mask which does not depend on vcpu Ben Gardon
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Ben Gardon @ 2022-03-21 22:43 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid, Ben Gardon

Factor out the implementation of reset_tdp_shadow_zero_bits_mask to a
helper function which does not require a struct kvm_mmu context. The
only element of the context used by the function is the shadow root
level, so pass that in instead of the whole context.

No functional change intended.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3b8da8b0745e..6f98111f8f8b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4487,16 +4487,14 @@ static inline bool boot_cpu_is_amd(void)
  * possible, however, kvm currently does not do execution-protection.
  */
 static void
-reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
+build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
+				int shadow_root_level)
 {
-	struct rsvd_bits_validate *shadow_zero_check;
 	int i;
 
-	shadow_zero_check = &context->shadow_zero_check;
-
 	if (boot_cpu_is_amd())
 		__reset_rsvds_bits_mask(shadow_zero_check, reserved_hpa_bits(),
-					context->shadow_root_level, false,
+					shadow_root_level, false,
 					boot_cpu_has(X86_FEATURE_GBPAGES),
 					false, true);
 	else
@@ -4507,12 +4505,19 @@ reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
 	if (!shadow_me_mask)
 		return;
 
-	for (i = context->shadow_root_level; --i >= 0;) {
+	for (i = shadow_root_level; --i >= 0;) {
 		shadow_zero_check->rsvd_bits_mask[0][i] &= ~shadow_me_mask;
 		shadow_zero_check->rsvd_bits_mask[1][i] &= ~shadow_me_mask;
 	}
 }
 
+static void
+reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
+{
+	build_tdp_shadow_zero_bits_mask(&context->shadow_zero_check,
+					context->shadow_root_level);
+}
+
 /*
  * as the comments in reset_shadow_zero_bits_mask() except it
  * is the shadow page table for intel nested guest.
-- 
2.35.1.894.gb6a874cedc-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 6/9] KVM: x86/mmu: Factor out part of vmx_get_mt_mask which does not depend on vcpu
  2022-03-21 22:43 [PATCH v2 0/9] KVM: x86/MMU: Optimize disabling dirty logging Ben Gardon
                   ` (4 preceding siblings ...)
  2022-03-21 22:43 ` [PATCH v2 5/9] KVM: x86/mmu: Factor out the meat of reset_tdp_shadow_zero_bits_mask Ben Gardon
@ 2022-03-21 22:43 ` Ben Gardon
  2022-03-28 18:04   ` David Matlack
  2022-03-21 22:43 ` [PATCH v2 7/9] KVM: x86/mmu: Add try_get_mt_mask to x86_ops Ben Gardon
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Ben Gardon @ 2022-03-21 22:43 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid, Ben Gardon

Factor out the parts of vmx_get_mt_mask which do not depend on the vCPU
argument. The helper must also report whether it was able to generate
the MT mask without a vCPU, since in some cases it cannot. This
refactoring will allow the MT mask to be computed without a vCPU pointer
when noncoherent DMA is not enabled on the VM.

No functional change intended.


Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/vmx/vmx.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e8963f5af618..69c654567475 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7149,9 +7149,26 @@ static int __init vmx_check_processor_compat(void)
 	return 0;
 }
 
+static bool vmx_try_get_mt_mask(struct kvm *kvm, gfn_t gfn,
+				bool is_mmio, u64 *mask)
+{
+	if (is_mmio) {
+		*mask =  MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
+		return true;
+	}
+
+	if (!kvm_arch_has_noncoherent_dma(kvm)) {
+		*mask = (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+		return true;
+	}
+
+	return false;
+}
+
 static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 {
 	u8 cache;
+	u64 mask;
 
 	/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
 	 * memory aliases with conflicting memory types and sometimes MCEs.
@@ -7171,11 +7188,8 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 	 * EPT memory type is used to emulate guest CD/MTRR.
 	 */
 
-	if (is_mmio)
-		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
-
-	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
-		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+	if (vmx_try_get_mt_mask(vcpu->kvm, gfn, is_mmio, &mask))
+		return mask;
 
 	if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
 		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
-- 
2.35.1.894.gb6a874cedc-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 7/9] KVM: x86/mmu: Add try_get_mt_mask to x86_ops
  2022-03-21 22:43 [PATCH v2 0/9] KVM: x86/MMU: Optimize disabling dirty logging Ben Gardon
                   ` (5 preceding siblings ...)
  2022-03-21 22:43 ` [PATCH v2 6/9] KVM: x86/mmu: Factor out part of vmx_get_mt_mask which does not depend on vcpu Ben Gardon
@ 2022-03-21 22:43 ` Ben Gardon
  2022-04-11 23:00   ` Sean Christopherson
  2022-03-21 22:43 ` [PATCH v2 8/9] KVM: x86/mmu: Make kvm_is_mmio_pfn usable outside of spte.c Ben Gardon
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Ben Gardon @ 2022-03-21 22:43 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid, Ben Gardon

Add another function for getting the memory type mask to x86_ops.
This version of the function can fail, but it does not require a vCPU
pointer. It will be used in a subsequent commit for in-place large page
promotion when disabling dirty logging.

No functional change intended.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/include/asm/kvm-x86-ops.h | 1 +
 arch/x86/include/asm/kvm_host.h    | 2 ++
 arch/x86/kvm/svm/svm.c             | 9 +++++++++
 arch/x86/kvm/vmx/vmx.c             | 1 +
 4 files changed, 13 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 29affccb353c..29880363b5ed 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -88,6 +88,7 @@ KVM_X86_OP_OPTIONAL(sync_pir_to_irr)
 KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
 KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
 KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
+KVM_X86_OP(try_get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f72e80178ffc..a114e4782702 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1422,6 +1422,8 @@ struct kvm_x86_ops {
 	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
 	int (*set_identity_map_addr)(struct kvm *kvm, u64 ident_addr);
 	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+	bool (*try_get_mt_mask)(struct kvm *kvm, gfn_t gfn,
+				bool is_mmio, u64 *mask);
 
 	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			     int root_level);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b069493ad5c7..e73415dfcf52 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3944,6 +3944,13 @@ static bool svm_has_emulated_msr(struct kvm *kvm, u32 index)
 	return true;
 }
 
+static bool svm_try_get_mt_mask(struct kvm *kvm, gfn_t gfn,
+				bool is_mmio, u64 *mask)
+{
+	*mask = 0;
+	return true;
+}
+
 static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4600,6 +4607,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.check_apicv_inhibit_reasons = avic_check_apicv_inhibit_reasons,
 	.apicv_post_state_restore = avic_apicv_post_state_restore,
 
+	.try_get_mt_mask = svm_try_get_mt_mask,
+
 	.get_exit_info = svm_get_exit_info,
 
 	.vcpu_after_set_cpuid = svm_vcpu_after_set_cpuid,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 69c654567475..81e9805ed1d8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7813,6 +7813,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.set_tss_addr = vmx_set_tss_addr,
 	.set_identity_map_addr = vmx_set_identity_map_addr,
 	.get_mt_mask = vmx_get_mt_mask,
+	.try_get_mt_mask = vmx_try_get_mt_mask,
 
 	.get_exit_info = vmx_get_exit_info,
 
-- 
2.35.1.894.gb6a874cedc-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 8/9] KVM: x86/mmu: Make kvm_is_mmio_pfn usable outside of spte.c
  2022-03-21 22:43 [PATCH v2 0/9] KVM: x86/MMU: Optimize disabling dirty logging Ben Gardon
                   ` (6 preceding siblings ...)
  2022-03-21 22:43 ` [PATCH v2 7/9] KVM: x86/mmu: Add try_get_mt_mask to x86_ops Ben Gardon
@ 2022-03-21 22:43 ` Ben Gardon
  2022-04-12 19:39   ` Sean Christopherson
  2022-03-21 22:43 ` [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging Ben Gardon
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Ben Gardon @ 2022-03-21 22:43 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid, Ben Gardon

Export kvm_is_mmio_pfn from spte.c. It will be used in a subsequent
commit for in-place lpage promotion when disabling dirty logging.


Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/spte.c | 2 +-
 arch/x86/kvm/mmu/spte.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 45e9c0c3932e..8e9b827c4ed5 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -69,7 +69,7 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
 	return spte;
 }
 
-static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
+bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
 	if (pfn_valid(pfn))
 		return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index cee02fe63429..e058a85e6c66 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -443,4 +443,5 @@ u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn);
 
 void kvm_mmu_reset_all_pte_masks(void);
 
+bool kvm_is_mmio_pfn(kvm_pfn_t pfn);
 #endif
-- 
2.35.1.894.gb6a874cedc-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging
  2022-03-21 22:43 [PATCH v2 0/9] KVM: x86/MMU: Optimize disabling dirty logging Ben Gardon
                   ` (7 preceding siblings ...)
  2022-03-21 22:43 ` [PATCH v2 8/9] KVM: x86/mmu: Make kvm_is_mmio_pfn usable outside of spte.c Ben Gardon
@ 2022-03-21 22:43 ` Ben Gardon
  2022-03-28 17:45   ` David Matlack
                     ` (2 more replies)
  2022-03-25 12:00 ` [PATCH v2 0/9] KVM: x86/MMU: Optimize " Paolo Bonzini
  2022-03-28 17:49 ` David Matlack
  10 siblings, 3 replies; 33+ messages in thread
From: Ben Gardon @ 2022-03-21 22:43 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid, Ben Gardon

When disabling dirty logging, the TDP MMU currently zaps each leaf entry
mapping memory in the relevant memslot. This is very slow. Doing the zaps
under the mmu read lock requires a TLB flush for every zap and the
zapping causes a storm of EPT/NPT violations.

Instead of zapping, replace the split large pages with large page
mappings directly. While this sort of operation has historically only
been done in the vCPU page fault handler context, refactorings earlier
in this series and the relative simplicity of the TDP MMU make it
possible here as well.

Running the dirty_log_perf_test on an Intel Skylake with 96 vCPUs and 1G
of memory per vCPU, this reduces the time required to disable dirty
logging from over 45 seconds to just over 1 second. It also avoids
provoking page faults, improving vCPU performance while disabling
dirty logging.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c          |  4 +-
 arch/x86/kvm/mmu/mmu_internal.h |  6 +++
 arch/x86/kvm/mmu/tdp_mmu.c      | 73 ++++++++++++++++++++++++++++++++-
 3 files changed, 79 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6f98111f8f8b..a99c23ef90b6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -100,7 +100,7 @@ module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
  */
 bool tdp_enabled = false;
 
-static int max_huge_page_level __read_mostly;
+int max_huge_page_level;
 static int tdp_root_level __read_mostly;
 static int max_tdp_level __read_mostly;
 
@@ -4486,7 +4486,7 @@ static inline bool boot_cpu_is_amd(void)
  * the direct page table on host, use as much mmu features as
  * possible, however, kvm currently does not do execution-protection.
  */
-static void
+void
 build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
 				int shadow_root_level)
 {
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1bff453f7cbe..6c08a5731fcb 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -171,4 +171,10 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
+void
+build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
+				int shadow_root_level);
+
+extern int max_huge_page_level __read_mostly;
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index af60922906ef..eb8929e394ec 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1709,6 +1709,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 		clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
 }
 
+static bool try_promote_lpage(struct kvm *kvm,
+			      const struct kvm_memory_slot *slot,
+			      struct tdp_iter *iter)
+{
+	struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
+	struct rsvd_bits_validate shadow_zero_check;
+	bool map_writable;
+	kvm_pfn_t pfn;
+	u64 new_spte;
+	u64 mt_mask;
+
+	/*
+	 * If addresses are being invalidated, don't do in-place promotion to
+	 * avoid accidentally mapping an invalidated address.
+	 */
+	if (unlikely(kvm->mmu_notifier_count))
+		return false;
+
+	if (iter->level > max_huge_page_level || iter->gfn < slot->base_gfn ||
+	    iter->gfn >= slot->base_gfn + slot->npages)
+		return false;
+
+	pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
+				   &map_writable, NULL);
+	if (is_error_noslot_pfn(pfn))
+		return false;
+
+	/*
+	 * Can't reconstitute an lpage if the constituent pages can't be
+	 * mapped higher.
+	 */
+	if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn,
+						    pfn, PG_LEVEL_NUM))
+		return false;
+
+	build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level);
+
+	/*
+	 * In some cases, a vCPU pointer is required to get the MT mask,
+	 * In some cases a vCPU pointer is required to get the MT mask;
+	 * however, in most cases it can be generated without one. If a
+	 * vCPU pointer is needed, kvm_x86_try_get_mt_mask will fail.
+	 */
+	if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
+							   kvm_is_mmio_pfn(pfn),
+							   &mt_mask)))
+		return false;
+
+	__make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
+		  map_writable, mt_mask, &shadow_zero_check, &new_spte);
+
+	if (tdp_mmu_set_spte_atomic(kvm, iter, new_spte))
+		return true;
+
+	/* Re-read the SPTE as it must have been changed by another thread. */
+	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
+
+	return false;
+}
+
 /*
  * Clear leaf entries which could be replaced by large mappings, for
  * GFNs within the slot.
@@ -1729,8 +1789,17 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
 			continue;
 
-		if (!is_shadow_present_pte(iter.old_spte) ||
-		    !is_last_spte(iter.old_spte, iter.level))
+		if (iter.level > max_huge_page_level ||
+		    iter.gfn < slot->base_gfn ||
+		    iter.gfn >= slot->base_gfn + slot->npages)
+			continue;
+
+		if (!is_shadow_present_pte(iter.old_spte))
+			continue;
+
+		/* Try to promote the constituent pages to an lpage. */
+		if (!is_last_spte(iter.old_spte, iter.level) &&
+		    try_promote_lpage(kvm, slot, &iter))
 			continue;
 
 		pfn = spte_to_pfn(iter.old_spte);
-- 
2.35.1.894.gb6a874cedc-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 0/9] KVM: x86/MMU: Optimize disabling dirty logging
  2022-03-21 22:43 [PATCH v2 0/9] KVM: x86/MMU: Optimize disabling dirty logging Ben Gardon
                   ` (8 preceding siblings ...)
  2022-03-21 22:43 ` [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging Ben Gardon
@ 2022-03-25 12:00 ` Paolo Bonzini
  2022-07-12  1:37   ` Sean Christopherson
  2022-03-28 17:49 ` David Matlack
  10 siblings, 1 reply; 33+ messages in thread
From: Paolo Bonzini @ 2022-03-25 12:00 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, David Matlack, Jim Mattson,
	David Dunn, Jing Zhang, Junaid Shahid

On 3/21/22 23:43, Ben Gardon wrote:
> Currently disabling dirty logging with the TDP MMU is extremely slow.
> On a 96 vCPU / 96G VM it takes ~256 seconds to disable dirty logging
> with the TDP MMU, as opposed to ~4 seconds with the legacy MMU. This
> series optimizes TLB flushes and introduces in-place large page
> promotion, to bring the disable dirty log time down to ~3 seconds.
> 
> Testing:
> Ran KVM selftests and kvm-unit-tests on an Intel Haswell. This
> series introduced no new failures.

Thanks, looks good.  The one change I'd make is to just place the 
outcome of build_tdp_shadow_zero_bits_mask() in a global (say 
tdp_shadow_zero_check) at kvm_configure_mmu() time.  The 
tdp_max_root_level works as a conservative choice for the second 
argument of build_tdp_shadow_zero_bits_mask().
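
A minimal sketch of that idea (the global's name and the init helper
below are assumptions, not the final form):

	/* Hypothetical global, computed once when the TDP MMU is configured. */
	static struct rsvd_bits_validate tdp_shadow_zero_check;

	static void kvm_init_tdp_shadow_zero_check(int tdp_max_root_level)
	{
		/* tdp_max_root_level is a conservative upper bound. */
		build_tdp_shadow_zero_bits_mask(&tdp_shadow_zero_check,
						tdp_max_root_level);
	}

kvm_configure_mmu() could call the helper, and try_promote_lpage() would
then read tdp_shadow_zero_check directly instead of rebuilding it on
every call.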

No need to do anything though, I'll handle this later in 5.19 time (and 
first merge my changes that factor out the constant part of 
vcpu->arch.root_mmu initialization, since this is part of the same ideas).

Paolo


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging
  2022-03-21 22:43 ` [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging Ben Gardon
@ 2022-03-28 17:45   ` David Matlack
  2022-03-28 18:07     ` Ben Gardon
  2022-03-28 18:21   ` David Matlack
  2022-04-12 16:43   ` Sean Christopherson
  2 siblings, 1 reply; 33+ messages in thread
From: David Matlack @ 2022-03-28 17:45 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid

On Mon, Mar 21, 2022 at 03:43:58PM -0700, Ben Gardon wrote:
> When disabling dirty logging, the TDP MMU currently zaps each leaf entry
> mapping memory in the relevant memslot. This is very slow. Doing the zaps
> under the mmu read lock requires a TLB flush for every zap and the
> zapping causes a storm of ETP/NPT violations.
> 
> Instead of zapping, replace the split large pages with large page
> mappings directly. While this sort of operation has historically only
> been done in the vCPU page fault handler context, refactorings earlier
> in this series and the relative simplicity of the TDP MMU make it
> possible here as well.
> 
> Running the dirty_log_perf_test on an Intel Skylake with 96 vCPUs and 1G
> of memory per vCPU, this reduces the time required to disable dirty
> logging from over 45 seconds to just over 1 second. It also avoids
> provoking page faults, improving vCPU performance while disabling
> dirty logging.
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c          |  4 +-
>  arch/x86/kvm/mmu/mmu_internal.h |  6 +++
>  arch/x86/kvm/mmu/tdp_mmu.c      | 73 ++++++++++++++++++++++++++++++++-
>  3 files changed, 79 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f98111f8f8b..a99c23ef90b6 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -100,7 +100,7 @@ module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
>   */
>  bool tdp_enabled = false;
>  
> -static int max_huge_page_level __read_mostly;
> +int max_huge_page_level;
>  static int tdp_root_level __read_mostly;
>  static int max_tdp_level __read_mostly;
>  
> @@ -4486,7 +4486,7 @@ static inline bool boot_cpu_is_amd(void)
>   * the direct page table on host, use as much mmu features as
>   * possible, however, kvm currently does not do execution-protection.
>   */
> -static void
> +void
>  build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
>  				int shadow_root_level)
>  {
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 1bff453f7cbe..6c08a5731fcb 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -171,4 +171,10 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  
> +void
> +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> +				int shadow_root_level);
> +
> +extern int max_huge_page_level __read_mostly;
> +
>  #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index af60922906ef..eb8929e394ec 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1709,6 +1709,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
>  		clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
>  }
>  
> +static bool try_promote_lpage(struct kvm *kvm,
> +			      const struct kvm_memory_slot *slot,
> +			      struct tdp_iter *iter)

Use "huge_page" instead of "lpage" to be consistent with eager page
splitting and the rest of the Linux kernel. Some of the old KVM methods
still use "lpage" and "large page", but we're slowly moving away from
that.

> +{
> +	struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
> +	struct rsvd_bits_validate shadow_zero_check;
> +	bool map_writable;
> +	kvm_pfn_t pfn;
> +	u64 new_spte;
> +	u64 mt_mask;
> +
> +	/*
> +	 * If addresses are being invalidated, don't do in-place promotion to
> +	 * avoid accidentally mapping an invalidated address.
> +	 */
> +	if (unlikely(kvm->mmu_notifier_count))
> +		return false;

Why is this necessary? Seeing this makes me wonder if we need a similar
check for eager page splitting.

> +
> +	if (iter->level > max_huge_page_level || iter->gfn < slot->base_gfn ||
> +	    iter->gfn >= slot->base_gfn + slot->npages)
> +		return false;
> +
> +	pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
> +				   &map_writable, NULL);
> +	if (is_error_noslot_pfn(pfn))
> +		return false;
> +
> +	/*
> +	 * Can't reconstitute an lpage if the consituent pages can't be
> +	 * mapped higher.
> +	 */
> +	if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn,
> +						    pfn, PG_LEVEL_NUM))
> +		return false;
> +
> +	build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level);
> +
> +	/*
> +	 * In some cases, a vCPU pointer is required to get the MT mask,
> +	 * however in most cases it can be generated without one. If a
> +	 * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail.
> +	 * In that case, bail on in-place promotion.
> +	 */
> +	if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
> +							   kvm_is_mmio_pfn(pfn),
> +							   &mt_mask)))
> +		return false;
> +
> +	__make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
> +		  map_writable, mt_mask, &shadow_zero_check, &new_spte);
> +
> +	if (tdp_mmu_set_spte_atomic(kvm, iter, new_spte))
> +		return true;
> +
> +	/* Re-read the SPTE as it must have been changed by another thread. */
> +	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));

Huge page promotion could be retried in this case.

> +
> +	return false;
> +}
> +
>  /*
>   * Clear leaf entries which could be replaced by large mappings, for
>   * GFNs within the slot.
> @@ -1729,8 +1789,17 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
>  		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
>  			continue;
>  
> -		if (!is_shadow_present_pte(iter.old_spte) ||
> -		    !is_last_spte(iter.old_spte, iter.level))
> +		if (iter.level > max_huge_page_level ||
> +		    iter.gfn < slot->base_gfn ||
> +		    iter.gfn >= slot->base_gfn + slot->npages)

I feel like I've been seeing this "does slot contain gfn" calculation a
lot in recent commits. It's probably time to create a helper function.
No need to do this clean up as part of your series though, unless you
want to :).
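
Something like the below, for example (the name and placement are just a
suggestion):

	static inline bool kvm_slot_contains_gfn(const struct kvm_memory_slot *slot,
						 gfn_t gfn)
	{
		return gfn >= slot->base_gfn &&
		       gfn < slot->base_gfn + slot->npages;
	}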

> +			continue;
> +
> +		if (!is_shadow_present_pte(iter.old_spte))
> +			continue;
> +
> +		/* Try to promote the constitutent pages to an lpage. */
> +		if (!is_last_spte(iter.old_spte, iter.level) &&
> +		    try_promote_lpage(kvm, slot, &iter))
>  			continue;

If iter.old_spte is not a leaf, the old loop would always continue to
the next SPTE. Now we try to promote it and if that fails we run through
the rest of the loop. This seems broken. For example, in the next line
we end up grabbing the pfn of the non-leaf SPTE (which would be the PFN
of the TDP MMU page table?) and treat that as the PFN backing this GFN,
which is wrong.

In the worst case we end up zapping an SPTE that we didn't need to, but
we should still fix up this code.
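
One way to restructure it so that non-leaf SPTEs never reach the zap
path (just a sketch of the control flow, not tested):

		if (!is_shadow_present_pte(iter.old_spte))
			continue;

		if (!is_last_spte(iter.old_spte, iter.level)) {
			/*
			 * Whether or not promotion succeeds, never fall
			 * through and treat a non-leaf SPTE as a mapping
			 * to be zapped; on failure the iterator will
			 * descend into the child page table as before.
			 */
			try_promote_lpage(kvm, slot, &iter);
			continue;
		}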

>  
>  		pfn = spte_to_pfn(iter.old_spte);
> -- 
> 2.35.1.894.gb6a874cedc-goog
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 0/9] KVM: x86/MMU: Optimize disabling dirty logging
  2022-03-21 22:43 [PATCH v2 0/9] KVM: x86/MMU: Optimize disabling dirty logging Ben Gardon
                   ` (9 preceding siblings ...)
  2022-03-25 12:00 ` [PATCH v2 0/9] KVM: x86/MMU: Optimize " Paolo Bonzini
@ 2022-03-28 17:49 ` David Matlack
  10 siblings, 0 replies; 33+ messages in thread
From: David Matlack @ 2022-03-28 17:49 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid

On Mon, Mar 21, 2022 at 03:43:49PM -0700, Ben Gardon wrote:
> Currently disabling dirty logging with the TDP MMU is extremely slow.
> On a 96 vCPU / 96G VM it takes ~256 seconds to disable dirty logging
> with the TDP MMU, as opposed to ~4 seconds with the legacy MMU. This
> series optimizes TLB flushes and introduces in-place large page
> promotion, to bring the disable dirty log time down to ~3 seconds.
> 
> Testing:
> Ran KVM selftests and kvm-unit-tests on an Intel Haswell. This
> series introduced no new failures.
> 
> Performance:
> 
> Without this series, TDP MMU:
> > ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
> Test iterations: 2
> Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
> guest physical test memory offset: 0x3fe7c0000000
> Populate memory time: 4.972184425s
> Enabling dirty logging time: 0.001943807s
> 
> Iteration 1 dirty memory time: 0.061862112s
> Iteration 1 get dirty log time: 0.001416413s
> Iteration 1 clear dirty log time: 1.417428057s
> Iteration 2 dirty memory time: 0.664103656s
> Iteration 2 get dirty log time: 0.000676724s
> Iteration 2 clear dirty log time: 1.149387201s
> Disabling dirty logging time: 256.682188868s
> Get dirty log over 2 iterations took 0.002093137s. (Avg 0.001046568s/iteration)
> Clear dirty log over 2 iterations took 2.566815258s. (Avg 1.283407629s/iteration)
> 
> Without this series, Legacy MMU:
> > ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
> Test iterations: 2
> Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
> guest physical test memory offset: 0x3fe7c0000000
> Populate memory time: 4.892940915s
> Enabling dirty logging time: 0.001864603s
> 
> Iteration 1 dirty memory time: 0.060490391s
> Iteration 1 get dirty log time: 0.001416277s
> Iteration 1 clear dirty log time: 0.323548614s
> Iteration 2 dirty memory time: 29.217064826s
> Iteration 2 get dirty log time: 0.000696202s
> Iteration 2 clear dirty log time: 0.907089084s
> Disabling dirty logging time: 4.246216551s
> Get dirty log over 2 iterations took 0.002112479s. (Avg 0.001056239s/iteration)
> Clear dirty log over 2 iterations took 1.230637698s. (Avg 0.615318849s/iteration)
> 
> With this series, TDP MMU:
> (Updated since RFC. Pulling out patches 1-4 could have a performance impact.)
> > ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
> Test iterations: 2
> Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
> guest physical test memory offset: 0x3fe7c0000000
> Populate memory time: 4.878083336s
> Enabling dirty logging time: 0.001874340s
> 
> Iteration 1 dirty memory time: 0.054867383s
> Iteration 1 get dirty log time: 0.001368377s
> Iteration 1 clear dirty log time: 1.406960856s
> Iteration 2 dirty memory time: 0.679301083s
> Iteration 2 get dirty log time: 0.000662905s
> Iteration 2 clear dirty log time: 1.138263359s
> Disabling dirty logging time: 3.169381810s
> Get dirty log over 2 iterations took 0.002031282s. (Avg 0.001015641s/iteration)
> Clear dirty log over 2 iterations took 2.545224215s. (Avg 1.272612107s/iteration)
> 
> Patch breakdown:
> Patches 1-4 remove the need for a vCPU pointer to make_spte
> Patches 5-8 are small refactors in preparation for in-place lpage promotion
> Patch 9 implements in-place largepage promotion when disabling dirty logging
> 
> Changelog:
> RFC -> v1:
> 	Dropped the first 4 patches from the series. Patch 1 was sent
> 	separately, patches 2-4 will be taken over by Sean Christopherson.
> 	Incorporated David Matlack's Reviewed-by.
> v1 -> v2:
> 	Several patches were queued and dropped from this revision.
> 	Incorporated feedback from Peter Xu on the last patch in the series.
> 	Refreshed performance data
> 		Between versions 1 and 2 of this series, disable time without
> 		the TDP MMU went from 45s to 256, a major regression. I was
> 		testing on a skylake before and haswell this time, but that
> 		does not explain the huge performance loss.
> 
> Ben Gardon (9):
>   KVM: x86/mmu: Move implementation of make_spte to a helper
>   KVM: x86/mmu: Factor mt_mask out of __make_spte
>   KVM: x86/mmu: Factor shadow_zero_check out of __make_spte
>   KVM: x86/mmu: Replace vcpu argument with kvm pointer in make_spte
>   KVM: x86/mmu: Factor out the meat of reset_tdp_shadow_zero_bits_mask
>   KVM: x86/mmu: Factor out part of vmx_get_mt_mask which does not depend
>     on vcpu
>   KVM: x86/mmu: Add try_get_mt_mask to x86_ops
>   KVM: x86/mmu: Make kvm_is_mmio_pfn usable outside of spte.c
>   KVM: x86/mmu: Promote pages in-place when disabling dirty logging

Use () after function names to make it clear you are referring to a
function and not something else. e.g.

  KVM: x86/mmu: Move implementation of make_spte to a helper

becomes

  KVM: x86/mmu: Move implementation of make_spte() to a helper

This applies throughout the series, in commit messages and comments.

> 
>  arch/x86/include/asm/kvm-x86-ops.h |  1 +
>  arch/x86/include/asm/kvm_host.h    |  2 +
>  arch/x86/kvm/mmu/mmu.c             | 21 +++++----
>  arch/x86/kvm/mmu/mmu_internal.h    |  6 +++
>  arch/x86/kvm/mmu/spte.c            | 39 +++++++++++-----
>  arch/x86/kvm/mmu/spte.h            |  6 +++
>  arch/x86/kvm/mmu/tdp_mmu.c         | 73 +++++++++++++++++++++++++++++-
>  arch/x86/kvm/svm/svm.c             |  9 ++++
>  arch/x86/kvm/vmx/vmx.c             | 25 ++++++++--
>  9 files changed, 155 insertions(+), 27 deletions(-)
> 
> -- 
> 2.35.1.894.gb6a874cedc-goog
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 6/9] KVM: x86/mmu: Factor out part of vmx_get_mt_mask which does not depend on vcpu
  2022-03-21 22:43 ` [PATCH v2 6/9] KVM: x86/mmu: Factor out part of vmx_get_mt_mask which does not depend on vcpu Ben Gardon
@ 2022-03-28 18:04   ` David Matlack
  0 siblings, 0 replies; 33+ messages in thread
From: David Matlack @ 2022-03-28 18:04 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid

On Mon, Mar 21, 2022 at 03:43:55PM -0700, Ben Gardon wrote:
> Factor out the parts of vmx_get_mt_mask which do not depend on the vCPU
> argument. This also requires adding some error reporting to the helper
> function to say whether it was possible to generate the MT mask without
> a vCPU argument. This refactoring will allow the MT mask to be computed
> when noncoherent DMA is not enabled on a VM.

We could probably make vmx_get_mt_mask() entirely independent of
the kvm_vcpu, but it would take more work.

For MTRRs, the guest must update them on all CPUs at once (SDM 11.11.8)
so we could just cache vCPU 0's MTRRs at the VM level and use that here.
(From my experience, Intel CPUs implement MTRRs at the core level.
Properly emulating that would require a different EPT table for every
virtual core.)

For CR0.CD, I'm not exactly sure what the semantics are for MP systems
but I can't imagine it's valid for software to configure CR0.CD
differently on different cores. I would have to scour the SDM closely
to confirm, but we could probably do something like cache
max(CR0.CD for all vCPUs) at the VM level and use that to indicate if
caching is disabled.
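
As a sketch, the VM-wide CR0.CD state could be as simple as a counter
maintained from the CR0 update path (the field below is hypothetical,
not an existing member of struct kvm_arch):

	static bool kvm_any_vcpu_has_cache_disabled(struct kvm *kvm)
	{
		return atomic_read(&kvm->arch.nr_cache_disabled_vcpus) > 0;
	}

With that, vmx_try_get_mt_mask() could handle the CR0.CD case as well
without needing a vCPU.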

> 
> No functional change intended.
> 
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 24 +++++++++++++++++++-----
>  1 file changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index e8963f5af618..69c654567475 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7149,9 +7149,26 @@ static int __init vmx_check_processor_compat(void)
>  	return 0;
>  }
>  
> +static bool vmx_try_get_mt_mask(struct kvm *kvm, gfn_t gfn,
> +				bool is_mmio, u64 *mask)
> +{
> +	if (is_mmio) {
> +		*mask =  MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> +		return true;
> +	}
> +
> +	if (!kvm_arch_has_noncoherent_dma(kvm)) {
> +		*mask = (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
>  static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  {
>  	u8 cache;
> +	u64 mask;
>  
>  	/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
>  	 * memory aliases with conflicting memory types and sometimes MCEs.
> @@ -7171,11 +7188,8 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  	 * EPT memory type is used to emulate guest CD/MTRR.
>  	 */
>  
> -	if (is_mmio)
> -		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> -
> -	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> -		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> +	if (vmx_try_get_mt_mask(vcpu->kvm, gfn, is_mmio, &mask))
> +		return mask;
>  
>  	if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
>  		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> -- 
> 2.35.1.894.gb6a874cedc-goog
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging
  2022-03-28 17:45   ` David Matlack
@ 2022-03-28 18:07     ` Ben Gardon
  2022-03-28 18:20       ` David Matlack
  2022-07-12 23:21       ` Sean Christopherson
  0 siblings, 2 replies; 33+ messages in thread
From: Ben Gardon @ 2022-03-28 18:07 UTC (permalink / raw)
  To: David Matlack
  Cc: LKML, kvm, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid

On Mon, Mar 28, 2022 at 10:45 AM David Matlack <dmatlack@google.com> wrote:
>
> On Mon, Mar 21, 2022 at 03:43:58PM -0700, Ben Gardon wrote:
> > When disabling dirty logging, the TDP MMU currently zaps each leaf entry
> > mapping memory in the relevant memslot. This is very slow. Doing the zaps
> > under the mmu read lock requires a TLB flush for every zap and the
> > zapping causes a storm of ETP/NPT violations.
> >
> > Instead of zapping, replace the split large pages with large page
> > mappings directly. While this sort of operation has historically only
> > been done in the vCPU page fault handler context, refactorings earlier
> > in this series and the relative simplicity of the TDP MMU make it
> > possible here as well.
> >
> > Running the dirty_log_perf_test on an Intel Skylake with 96 vCPUs and 1G
> > of memory per vCPU, this reduces the time required to disable dirty
> > logging from over 45 seconds to just over 1 second. It also avoids
> > provoking page faults, improving vCPU performance while disabling
> > dirty logging.
> >
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c          |  4 +-
> >  arch/x86/kvm/mmu/mmu_internal.h |  6 +++
> >  arch/x86/kvm/mmu/tdp_mmu.c      | 73 ++++++++++++++++++++++++++++++++-
> >  3 files changed, 79 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6f98111f8f8b..a99c23ef90b6 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -100,7 +100,7 @@ module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
> >   */
> >  bool tdp_enabled = false;
> >
> > -static int max_huge_page_level __read_mostly;
> > +int max_huge_page_level;
> >  static int tdp_root_level __read_mostly;
> >  static int max_tdp_level __read_mostly;
> >
> > @@ -4486,7 +4486,7 @@ static inline bool boot_cpu_is_amd(void)
> >   * the direct page table on host, use as much mmu features as
> >   * possible, however, kvm currently does not do execution-protection.
> >   */
> > -static void
> > +void
> >  build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> >                               int shadow_root_level)
> >  {
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 1bff453f7cbe..6c08a5731fcb 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -171,4 +171,10 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> >  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >
> > +void
> > +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> > +                             int shadow_root_level);
> > +
> > +extern int max_huge_page_level __read_mostly;
> > +
> >  #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index af60922906ef..eb8929e394ec 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1709,6 +1709,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
> >               clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
> >  }
> >
> > +static bool try_promote_lpage(struct kvm *kvm,
> > +                           const struct kvm_memory_slot *slot,
> > +                           struct tdp_iter *iter)
>
> Use "huge_page" instead of "lpage" to be consistent with eager page
> splitting and the rest of the Linux kernel. Some of the old KVM methods
> still use "lpage" and "large page", but we're slowly moving away from
> that.

Ah good catch. Paolo, if you want me to send a v2 to address all these
comments, I can. Otherwise I'll just reply to the questions below.

>
> > +{
> > +     struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
> > +     struct rsvd_bits_validate shadow_zero_check;
> > +     bool map_writable;
> > +     kvm_pfn_t pfn;
> > +     u64 new_spte;
> > +     u64 mt_mask;
> > +
> > +     /*
> > +      * If addresses are being invalidated, don't do in-place promotion to
> > +      * avoid accidentally mapping an invalidated address.
> > +      */
> > +     if (unlikely(kvm->mmu_notifier_count))
> > +             return false;
>
> Why is this necessary? Seeing this makes me wonder if we need a similar
> check for eager page splitting.

This is needed here, but not in the page splitting case, because we
are potentially mapping new memory.
If a page is split for dirty logging, but then the backing transparent
huge page is split for some reason, we could race with the THP split.
Since we're mapping the entire huge page, this could wind up mapping
more memory than it should. Checking the MMU notifier count prevents
that. It's not needed in the splitting case because the memory in
question is already mapped. We're essentially trying to do what the
page fault handler does, since we know that's safe and it's what
replaces the zapped page with a huge page. The page fault handler
checks for MMU notifiers, so we need to as well.
>
> > +
> > +     if (iter->level > max_huge_page_level || iter->gfn < slot->base_gfn ||
> > +         iter->gfn >= slot->base_gfn + slot->npages)
> > +             return false;
> > +
> > +     pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
> > +                                &map_writable, NULL);
> > +     if (is_error_noslot_pfn(pfn))
> > +             return false;
> > +
> > +     /*
> > +      * Can't reconstitute an lpage if the consituent pages can't be
> > +      * mapped higher.
> > +      */
> > +     if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn,
> > +                                                 pfn, PG_LEVEL_NUM))
> > +             return false;
> > +
> > +     build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level);
> > +
> > +     /*
> > +      * In some cases, a vCPU pointer is required to get the MT mask,
> > +      * however in most cases it can be generated without one. If a
> > +      * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail.
> > +      * In that case, bail on in-place promotion.
> > +      */
> > +     if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
> > +                                                        kvm_is_mmio_pfn(pfn),
> > +                                                        &mt_mask)))
> > +             return false;
> > +
> > +     __make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
> > +               map_writable, mt_mask, &shadow_zero_check, &new_spte);
> > +
> > +     if (tdp_mmu_set_spte_atomic(kvm, iter, new_spte))
> > +             return true;
> > +
> > +     /* Re-read the SPTE as it must have been changed by another thread. */
> > +     iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
>
> Huge page promotion could be retried in this case.

That's true, but retries always get complicated since we need to
guarantee forward progress and then you get into counting retries and
it adds complexity. Given how rare this race should be, I'm inclined
to just let it fall back to zapping the spte.

>
> > +
> > +     return false;
> > +}
> > +
> >  /*
> >   * Clear leaf entries which could be replaced by large mappings, for
> >   * GFNs within the slot.
> > @@ -1729,8 +1789,17 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> >               if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
> >                       continue;
> >
> > -             if (!is_shadow_present_pte(iter.old_spte) ||
> > -                 !is_last_spte(iter.old_spte, iter.level))
> > +             if (iter.level > max_huge_page_level ||
> > +                 iter.gfn < slot->base_gfn ||
> > +                 iter.gfn >= slot->base_gfn + slot->npages)
>
> I feel like I've been seeing this "does slot contain gfn" calculation a
> lot in recent commits. It's probably time to create a helper function.
> No need to do this clean up as part of your series though, unless you
> want to :).
>
> > +                     continue;
> > +
> > +             if (!is_shadow_present_pte(iter.old_spte))
> > +                     continue;
> > +
> > +             /* Try to promote the constitutent pages to an lpage. */
> > +             if (!is_last_spte(iter.old_spte, iter.level) &&
> > +                 try_promote_lpage(kvm, slot, &iter))
> >                       continue;
>
> > If iter.old_spte is not a leaf, the old loop would always continue to
> the next SPTE. Now we try to promote it and if that fails we run through
> the rest of the loop. This seems broken. For example, in the next line
> we end up grabbing the pfn of the non-leaf SPTE (which would be the PFN
> of the TDP MMU page table?) and treat that as the PFN backing this GFN,
> which is wrong.
>
> In the worst case we end up zapping an SPTE that we didn't need to, but
> we should still fix up this code.
>
> >
> >               pfn = spte_to_pfn(iter.old_spte);
> > --
> > 2.35.1.894.gb6a874cedc-goog
> >

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging
  2022-03-28 18:07     ` Ben Gardon
@ 2022-03-28 18:20       ` David Matlack
  2022-07-12 23:21       ` Sean Christopherson
  1 sibling, 0 replies; 33+ messages in thread
From: David Matlack @ 2022-03-28 18:20 UTC (permalink / raw)
  To: Ben Gardon
  Cc: LKML, kvm, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid

On Mon, Mar 28, 2022 at 11:07 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Mon, Mar 28, 2022 at 10:45 AM David Matlack <dmatlack@google.com> wrote:
> >
> > On Mon, Mar 21, 2022 at 03:43:58PM -0700, Ben Gardon wrote:
> > > When disabling dirty logging, the TDP MMU currently zaps each leaf entry
> > > mapping memory in the relevant memslot. This is very slow. Doing the zaps
> > > under the mmu read lock requires a TLB flush for every zap and the
> > > zapping causes a storm of EPT/NPT violations.
> > >
> > > Instead of zapping, replace the split large pages with large page
> > > mappings directly. While this sort of operation has historically only
> > > been done in the vCPU page fault handler context, refactorings earlier
> > > in this series and the relative simplicity of the TDP MMU make it
> > > possible here as well.
> > >
> > > Running the dirty_log_perf_test on an Intel Skylake with 96 vCPUs and 1G
> > > of memory per vCPU, this reduces the time required to disable dirty
> > > logging from over 45 seconds to just over 1 second. It also avoids
> > > provoking page faults, improving vCPU performance while disabling
> > > dirty logging.
> > >
> > > Signed-off-by: Ben Gardon <bgardon@google.com>
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c          |  4 +-
> > >  arch/x86/kvm/mmu/mmu_internal.h |  6 +++
> > >  arch/x86/kvm/mmu/tdp_mmu.c      | 73 ++++++++++++++++++++++++++++++++-
> > >  3 files changed, 79 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 6f98111f8f8b..a99c23ef90b6 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -100,7 +100,7 @@ module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
> > >   */
> > >  bool tdp_enabled = false;
> > >
> > > -static int max_huge_page_level __read_mostly;
> > > +int max_huge_page_level;
> > >  static int tdp_root_level __read_mostly;
> > >  static int max_tdp_level __read_mostly;
> > >
> > > @@ -4486,7 +4486,7 @@ static inline bool boot_cpu_is_amd(void)
> > >   * the direct page table on host, use as much mmu features as
> > >   * possible, however, kvm currently does not do execution-protection.
> > >   */
> > > -static void
> > > +void
> > >  build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> > >                               int shadow_root_level)
> > >  {
> > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > > index 1bff453f7cbe..6c08a5731fcb 100644
> > > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > > @@ -171,4 +171,10 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > >  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > >  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > >
> > > +void
> > > +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> > > +                             int shadow_root_level);
> > > +
> > > +extern int max_huge_page_level __read_mostly;
> > > +
> > >  #endif /* __KVM_X86_MMU_INTERNAL_H */
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index af60922906ef..eb8929e394ec 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -1709,6 +1709,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
> > >               clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
> > >  }
> > >
> > > +static bool try_promote_lpage(struct kvm *kvm,
> > > +                           const struct kvm_memory_slot *slot,
> > > +                           struct tdp_iter *iter)
> >
> > Use "huge_page" instead of "lpage" to be consistent with eager page
> > splitting and the rest of the Linux kernel. Some of the old KVM methods
> > still use "lpage" and "large page", but we're slowly moving away from
> > that.
>
> Ah good catch. Paolo, if you want me to send a v2 to address all these
> comments, I can. Otherwise I'll just reply to the questions below.
>
> >
> > > +{
> > > +     struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
> > > +     struct rsvd_bits_validate shadow_zero_check;
> > > +     bool map_writable;
> > > +     kvm_pfn_t pfn;
> > > +     u64 new_spte;
> > > +     u64 mt_mask;
> > > +
> > > +     /*
> > > +      * If addresses are being invalidated, don't do in-place promotion to
> > > +      * avoid accidentally mapping an invalidated address.
> > > +      */
> > > +     if (unlikely(kvm->mmu_notifier_count))
> > > +             return false;
> >
> > Why is this necessary? Seeing this makes me wonder if we need a similar
> > check for eager page splitting.
>
> This is needed here, but not in the page splitting case, because we
> are potentially mapping new memory.
> If a page is split for dirty logging, but then the backing transparent
> huge page is split for some reason, we could race with the THP split.
> Since we're mapping the entire huge page, this could wind up mapping
> more memory than it should. Checking the MMU notifier count prevents
> that. It's not needed in the splitting case because the memory in
> question is already mapped. We're essentially trying to do what the
> page fault handler does, since we know that's safe and it's what
> replaces the zapped page with a huge page. The page fault handler
> checks for MMU notifiers, so we need to as well.
> >
> > > +
> > > +     if (iter->level > max_huge_page_level || iter->gfn < slot->base_gfn ||
> > > +         iter->gfn >= slot->base_gfn + slot->npages)
> > > +             return false;
> > > +
> > > +     pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
> > > +                                &map_writable, NULL);
> > > +     if (is_error_noslot_pfn(pfn))
> > > +             return false;
> > > +
> > > +     /*
> > > +      * Can't reconstitute an lpage if the consituent pages can't be
> > > +      * mapped higher.
> > > +      */
> > > +     if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn,
> > > +                                                 pfn, PG_LEVEL_NUM))
> > > +             return false;
> > > +
> > > +     build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level);
> > > +
> > > +     /*
> > > +      * In some cases, a vCPU pointer is required to get the MT mask,
> > > +      * however in most cases it can be generated without one. If a
> > > +      * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail.
> > > +      * In that case, bail on in-place promotion.
> > > +      */
> > > +     if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
> > > +                                                        kvm_is_mmio_pfn(pfn),
> > > +                                                        &mt_mask)))
> > > +             return false;
> > > +
> > > +     __make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
> > > +               map_writable, mt_mask, &shadow_zero_check, &new_spte);
> > > +
> > > +     if (tdp_mmu_set_spte_atomic(kvm, iter, new_spte))
> > > +             return true;
> > > +
> > > +     /* Re-read the SPTE as it must have been changed by another thread. */
> > > +     iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
> >
> > Huge page promotion could be retried in this case.
>
> That's true, but retries always get complicated since we need to
> guarantee forward progress and then you get into counting retries and
> it adds complexity.

There's plenty of unbounded retrying in tdp_mmu.c (search for "goto
retry"). I think that's fine though. I can't imagine a scenario where a
thread is blocked retrying more than a few times.

> Given how rare this race should be, I'm inclined
> to just let it fall back to zapping the spte.

I think that's fine too. Although it'd be pretty easy to plumb by
checking for -EBUSY from tdp_mmu_set_spte_atomic().

Maybe just leave a comment explaining why we don't care about going
through the effort of retrying.
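
Something like this, e.g. (sketch, assuming the 0/-EBUSY return from
tdp_mmu_set_spte_atomic()):

	/*
	 * Don't bother retrying on -EBUSY; the race with another writer is
	 * rare and the caller simply falls back to zapping the SPTE.
	 */
	if (tdp_mmu_set_spte_atomic(kvm, iter, new_spte))
		return false;

	return true;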

>
> >
> > > +
> > > +     return false;
> > > +}
> > > +
> > >  /*
> > >   * Clear leaf entries which could be replaced by large mappings, for
> > >   * GFNs within the slot.
> > > @@ -1729,8 +1789,17 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> > >               if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
> > >                       continue;
> > >
> > > -             if (!is_shadow_present_pte(iter.old_spte) ||
> > > -                 !is_last_spte(iter.old_spte, iter.level))
> > > +             if (iter.level > max_huge_page_level ||
> > > +                 iter.gfn < slot->base_gfn ||
> > > +                 iter.gfn >= slot->base_gfn + slot->npages)
> >
> > I feel like I've been seeing this "does slot contain gfn" calculation a
> > lot in recent commits. It's probably time to create a helper function.
> > No need to do this clean up as part of your series though, unless you
> > want to :).
> >
> > > +                     continue;
> > > +
> > > +             if (!is_shadow_present_pte(iter.old_spte))
> > > +                     continue;
> > > +
> > > +             /* Try to promote the constitutent pages to an lpage. */
> > > +             if (!is_last_spte(iter.old_spte, iter.level) &&
> > > +                 try_promote_lpage(kvm, slot, &iter))
> > >                       continue;
> >
> > If iter.old_spte is not a leaf, the old loop would always continue to
> > the next SPTE. Now we try to promote it and if that fails we run through
> > the rest of the loop. This seems broken. For example, in the next line
> > we end up grabbing the pfn of the non-leaf SPTE (which would be the PFN
> > of the TDP MMU page table?) and treat that as the PFN backing this GFN,
> > which is wrong.
> >
> > In the worst case we end up zapping an SPTE that we didn't need to, but
> > we should still fix up this code.
> >
> > >
> > >               pfn = spte_to_pfn(iter.old_spte);
> > > --
> > > 2.35.1.894.gb6a874cedc-goog
> > >

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging
  2022-03-21 22:43 ` [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging Ben Gardon
  2022-03-28 17:45   ` David Matlack
@ 2022-03-28 18:21   ` David Matlack
  2022-04-12 16:43   ` Sean Christopherson
  2 siblings, 0 replies; 33+ messages in thread
From: David Matlack @ 2022-03-28 18:21 UTC (permalink / raw)
  To: Ben Gardon
  Cc: LKML, kvm list, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid

On Mon, Mar 21, 2022 at 3:44 PM Ben Gardon <bgardon@google.com> wrote:
>
> When disabling dirty logging, the TDP MMU currently zaps each leaf entry
> mapping memory in the relevant memslot. This is very slow. Doing the zaps
> under the mmu read lock requires a TLB flush for every zap and the
> zapping causes a storm of EPT/NPT violations.
>
> Instead of zapping, replace the split large pages with large page
> mappings directly. While this sort of operation has historically only
> been done in the vCPU page fault handler context, refactorings earlier
> in this series and the relative simplicity of the TDP MMU make it
> possible here as well.
>
> Running the dirty_log_perf_test on an Intel Skylake with 96 vCPUs and 1G
> of memory per vCPU, this reduces the time required to disable dirty
> logging from over 45 seconds to just over 1 second. It also avoids
> provoking page faults, improving vCPU performance while disabling
> dirty logging.
>
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c          |  4 +-
>  arch/x86/kvm/mmu/mmu_internal.h |  6 +++
>  arch/x86/kvm/mmu/tdp_mmu.c      | 73 ++++++++++++++++++++++++++++++++-
>  3 files changed, 79 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f98111f8f8b..a99c23ef90b6 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -100,7 +100,7 @@ module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
>   */
>  bool tdp_enabled = false;
>
> -static int max_huge_page_level __read_mostly;
> +int max_huge_page_level;
>  static int tdp_root_level __read_mostly;
>  static int max_tdp_level __read_mostly;
>
> @@ -4486,7 +4486,7 @@ static inline bool boot_cpu_is_amd(void)
>   * the direct page table on host, use as much mmu features as
>   * possible, however, kvm currently does not do execution-protection.
>   */
> -static void
> +void
>  build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
>                                 int shadow_root_level)
>  {
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 1bff453f7cbe..6c08a5731fcb 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -171,4 +171,10 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> +void
> +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> +                               int shadow_root_level);
> +
> +extern int max_huge_page_level __read_mostly;
> +
>  #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index af60922906ef..eb8929e394ec 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1709,6 +1709,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
>                 clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
>  }
>
> +static bool try_promote_lpage(struct kvm *kvm,
> +                             const struct kvm_memory_slot *slot,
> +                             struct tdp_iter *iter)
> +{
> +       struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
> +       struct rsvd_bits_validate shadow_zero_check;
> +       bool map_writable;
> +       kvm_pfn_t pfn;
> +       u64 new_spte;
> +       u64 mt_mask;
> +
> +       /*
> +        * If addresses are being invalidated, don't do in-place promotion to
> +        * avoid accidentally mapping an invalidated address.
> +        */
> +       if (unlikely(kvm->mmu_notifier_count))
> +               return false;
> +
> +       if (iter->level > max_huge_page_level || iter->gfn < slot->base_gfn ||
> +           iter->gfn >= slot->base_gfn + slot->npages)
> +               return false;
> +
> +       pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
> +                                  &map_writable, NULL);
> +       if (is_error_noslot_pfn(pfn))
> +               return false;
> +
> +       /*
> +        * Can't reconstitute an lpage if the consituent pages can't be
> +        * mapped higher.
> +        */
> +       if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn,
> +                                                   pfn, PG_LEVEL_NUM))
> +               return false;
> +
> +       build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level);
> +
> +       /*
> +        * In some cases, a vCPU pointer is required to get the MT mask,
> +        * however in most cases it can be generated without one. If a
> +        * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail.
> +        * In that case, bail on in-place promotion.
> +        */
> +       if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
> +                                                          kvm_is_mmio_pfn(pfn),
> +                                                          &mt_mask)))
> +               return false;
> +
> +       __make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
> +                 map_writable, mt_mask, &shadow_zero_check, &new_spte);
> +
> +       if (tdp_mmu_set_spte_atomic(kvm, iter, new_spte))
> +               return true;

Ah shoot, tdp_mmu_set_spte_atomic() now returns 0/-EBUSY, so this
conditional needs to be flipped.

> +
> +       /* Re-read the SPTE as it must have been changed by another thread. */
> +       iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));

tdp_mmu_set_spte_atomic() does this for you now.
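
So with both of those addressed, and keeping the bool return for now, the
tail of the function could collapse to just (sketch):

	return tdp_mmu_set_spte_atomic(kvm, iter, new_spte) == 0;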

> +
> +       return false;
> +}
> +
>  /*
>   * Clear leaf entries which could be replaced by large mappings, for
>   * GFNs within the slot.
> @@ -1729,8 +1789,17 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
>                 if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
>                         continue;
>
> -               if (!is_shadow_present_pte(iter.old_spte) ||
> -                   !is_last_spte(iter.old_spte, iter.level))
> +               if (iter.level > max_huge_page_level ||
> +                   iter.gfn < slot->base_gfn ||
> +                   iter.gfn >= slot->base_gfn + slot->npages)
> +                       continue;
> +
> +               if (!is_shadow_present_pte(iter.old_spte))
> +                       continue;
> +
> +               /* Try to promote the constitutent pages to an lpage. */
> +               if (!is_last_spte(iter.old_spte, iter.level) &&
> +                   try_promote_lpage(kvm, slot, &iter))
>                         continue;
>
>                 pfn = spte_to_pfn(iter.old_spte);
> --
> 2.35.1.894.gb6a874cedc-goog
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 7/9] KVM: x86/mmu: Add try_get_mt_mask to x86_ops
  2022-03-21 22:43 ` [PATCH v2 7/9] KVM: x86/mmu: Add try_get_mt_mask to x86_ops Ben Gardon
@ 2022-04-11 23:00   ` Sean Christopherson
  2022-04-11 23:24     ` Ben Gardon
                       ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Sean Christopherson @ 2022-04-11 23:00 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid

On Mon, Mar 21, 2022, Ben Gardon wrote:
> Add another function for getting the memory type mask to x86_ops.
> This version of the function can fail, but it does not require a vCPU
> pointer. It will be used in a subsequent commit for in-place large page
> promotion when disabling dirty logging.
> 
> No functional change intended.
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/include/asm/kvm-x86-ops.h | 1 +
>  arch/x86/include/asm/kvm_host.h    | 2 ++
>  arch/x86/kvm/svm/svm.c             | 9 +++++++++
>  arch/x86/kvm/vmx/vmx.c             | 1 +
>  4 files changed, 13 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 29affccb353c..29880363b5ed 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -88,6 +88,7 @@ KVM_X86_OP_OPTIONAL(sync_pir_to_irr)
>  KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
>  KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
>  KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> +KVM_X86_OP(try_get_mt_mask)
>  KVM_X86_OP(load_mmu_pgd)
>  KVM_X86_OP(has_wbinvd_exit)
>  KVM_X86_OP(get_l2_tsc_offset)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f72e80178ffc..a114e4782702 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1422,6 +1422,8 @@ struct kvm_x86_ops {
>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>  	int (*set_identity_map_addr)(struct kvm *kvm, u64 ident_addr);
>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> +	bool (*try_get_mt_mask)(struct kvm *kvm, gfn_t gfn,
> +				bool is_mmio, u64 *mask);

There's an old saying in Tennessee - I know it's in Texas, probably in Tennessee -
that says, fool me once, shame on... shame on you. Fool me... you can't get fooled again.

Thou shalt not trick me again by using a bool for pass/fail!  Though this one
doesn't have the same potential for pain as the TDP MMU's atomic operations.

And as a bonus, if we use 0/-errno, then we can use KVM_X86_OP_OPTIONAL_RET0()
and SVM doesn't need to provide an implementation.

Tangentially related to the return type, what about naming it something like
get_vm_wide_mt_mask() to convey exactly what it's doing?  The @kvm param kinda
does that, but IMO it doesn't do a good job of capturing why the function can fail.
Adding "vm_wide" helps explain why it can, i.e. that there may not be a VM-wide
memtype established for the gfn.
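
I.e. something like (sketch):

	int (*get_vm_wide_mt_mask)(struct kvm *kvm, gfn_t gfn, bool is_mmio,
				   u64 *mask);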

As penance for your boolean sin, can you slot this in earlier in your series?
It's obviously not a hard dependency, but using a u64 for the mask here and then
undoing the whole thing is rather silly.  Compile tested only at this point, I'll
test on an actual system ASAP and let you know if I did something stupid.

From: Sean Christopherson <seanjc@google.com>
Date: Mon, 11 Apr 2022 15:12:16 -0700
Subject: [PATCH] KVM: x86: Restrict get_mt_mask() to a u8, use
 KVM_X86_OP_OPTIONAL_RET0

Restrict get_mt_mask() to a u8 and reintroduce using a RET0 static_call
for the SVM implementation.  EPT stores the memtype information in the
lower 8 bits (bits 6:3 to be precise), and even returns a shifted u8
without an explicit cast to a larger type; there's no need to return a
full u64.

Note, RET0 doesn't play nice with a u64 return on 32-bit kernels, see
commit bf07be36cd88 ("KVM: x86: do not use KVM_X86_OP_OPTIONAL_RET0 for
get_mt_mask").

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm-x86-ops.h | 2 +-
 arch/x86/include/asm/kvm_host.h    | 2 +-
 arch/x86/kvm/svm/svm.c             | 6 ------
 arch/x86/kvm/vmx/vmx.c             | 2 +-
 4 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 96e4e9842dfc..0d16f21a6203 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -87,7 +87,7 @@ KVM_X86_OP(deliver_interrupt)
 KVM_X86_OP_OPTIONAL(sync_pir_to_irr)
 KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
 KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
-KVM_X86_OP(get_mt_mask)
+KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2c20f715f009..dc4d34f1bcf9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1421,7 +1421,7 @@ struct kvm_x86_ops {
 	int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
 	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
 	int (*set_identity_map_addr)(struct kvm *kvm, u64 ident_addr);
-	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+	u8 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);

 	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			     int root_level);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index fc1725b7d05f..56f03eafe421 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4011,11 +4011,6 @@ static bool svm_has_emulated_msr(struct kvm *kvm, u32 index)
 	return true;
 }

-static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
-{
-	return 0;
-}
-
 static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4673,7 +4668,6 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.check_apicv_inhibit_reasons = avic_check_apicv_inhibit_reasons,
 	.apicv_post_state_restore = avic_apicv_post_state_restore,

-	.get_mt_mask = svm_get_mt_mask,
 	.get_exit_info = svm_get_exit_info,

 	.vcpu_after_set_cpuid = svm_vcpu_after_set_cpuid,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index cf8581978bce..646fa609aa0d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7142,7 +7142,7 @@ static int __init vmx_check_processor_compat(void)
 	return 0;
 }

-static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 {
 	u8 cache;


base-commit: 59d9e75d641565603e7c293f4cec182d86db8586
--







^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 7/9] KVM: x86/mmu: Add try_get_mt_mask to x86_ops
  2022-04-11 23:00   ` Sean Christopherson
@ 2022-04-11 23:24     ` Ben Gardon
  2022-04-11 23:33     ` Sean Christopherson
  2022-04-12 19:30     ` Sean Christopherson
  2 siblings, 0 replies; 33+ messages in thread
From: Ben Gardon @ 2022-04-11 23:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Paolo Bonzini, Peter Xu, David Matlack, Jim Mattson,
	David Dunn, Jing Zhang, Junaid Shahid

On Mon, Apr 11, 2022 at 4:00 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Mar 21, 2022, Ben Gardon wrote:
> > Add another function for getting the memory type mask to x86_ops.
> > This version of the function can fail, but it does not require a vCPU
> > pointer. It will be used in a subsequent commit for in-place large page
> > promotion when disabling dirty logging.
> >
> > No functional change intended.
> >
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> > ---
> >  arch/x86/include/asm/kvm-x86-ops.h | 1 +
> >  arch/x86/include/asm/kvm_host.h    | 2 ++
> >  arch/x86/kvm/svm/svm.c             | 9 +++++++++
> >  arch/x86/kvm/vmx/vmx.c             | 1 +
> >  4 files changed, 13 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > index 29affccb353c..29880363b5ed 100644
> > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > @@ -88,6 +88,7 @@ KVM_X86_OP_OPTIONAL(sync_pir_to_irr)
> >  KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
> >  KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> >  KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> > +KVM_X86_OP(try_get_mt_mask)
> >  KVM_X86_OP(load_mmu_pgd)
> >  KVM_X86_OP(has_wbinvd_exit)
> >  KVM_X86_OP(get_l2_tsc_offset)
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index f72e80178ffc..a114e4782702 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1422,6 +1422,8 @@ struct kvm_x86_ops {
> >       int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> >       int (*set_identity_map_addr)(struct kvm *kvm, u64 ident_addr);
> >       u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> > +     bool (*try_get_mt_mask)(struct kvm *kvm, gfn_t gfn,
> > +                             bool is_mmio, u64 *mask);
>
> There's an old saying in Tennessee - I know it's in Texas, probably in Tennessee -
> that says, fool me once, shame on... shame on you. Fool me... you can't get fooled again.

Haha shoot here I was saying it wrong all these years and getting
fooled too many times.

Paolo, did you already queue this series or should I send out a v3? I
thought I saw the "queued,thanks" come through at some point, but
maybe I'm mis-remembering.

>
> Thou shalt not trick me again by using a bool for pass/fail!  Though this one
> doesn't have the same potential for pain as the TDP MMU's atomic operations.
>
> And as a bonus, if we use 0/-errno, then we can use KVM_X86_OP_OPTIONAL_RET0()
> and SVM doesn't need to provide an implementation.
>
> Tangentially related to the return type, what about naming it something like
> get_vm_wide_mt_mask() to convey exactly what it's doing?  The @kvm param kinda
> does that, but IMO it doesn't do a good job of capturing why the function can fail.
> Adding "vm_wide" helps explain why it can, i.e. that there may not be a VM-wide
> memtype established for the gfn.
>
> As penance for your boolean sin, can you slot this in earlier in your series?
> It's obviously not a hard dependency, but using a u64 for the mask here and then
> undoing the whole thing is rather silly.  Compile tested only at this point, I'll
> test on an actual system ASAP and let you know if I did something stupid.
>
> From: Sean Christopherson <seanjc@google.com>
> Date: Mon, 11 Apr 2022 15:12:16 -0700
> Subject: [PATCH] KVM: x86: Restrict get_mt_mask() to a u8, use
>  KVM_X86_OP_OPTIONAL_RET0
>
> Restrict get_mt_mask() to a u8 and reintroduce using a RET0 static_call
> for the SVM implementation.  EPT stores the memtype information in the
> lower 8 bits (bits 6:3 to be precise), and even returns a shifted u8
> without an explicit cast to a larger type; there's no need to return a
> full u64.
>
> Note, RET0 doesn't play nice with a u64 return on 32-bit kernels, see
> commit bf07be36cd88 ("KVM: x86: do not use KVM_X86_OP_OPTIONAL_RET0 for
> get_mt_mask").
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/include/asm/kvm-x86-ops.h | 2 +-
>  arch/x86/include/asm/kvm_host.h    | 2 +-
>  arch/x86/kvm/svm/svm.c             | 6 ------
>  arch/x86/kvm/vmx/vmx.c             | 2 +-
>  4 files changed, 3 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 96e4e9842dfc..0d16f21a6203 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -87,7 +87,7 @@ KVM_X86_OP(deliver_interrupt)
>  KVM_X86_OP_OPTIONAL(sync_pir_to_irr)
>  KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
>  KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> -KVM_X86_OP(get_mt_mask)
> +KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
>  KVM_X86_OP(load_mmu_pgd)
>  KVM_X86_OP(has_wbinvd_exit)
>  KVM_X86_OP(get_l2_tsc_offset)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2c20f715f009..dc4d34f1bcf9 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1421,7 +1421,7 @@ struct kvm_x86_ops {
>         int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
>         int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>         int (*set_identity_map_addr)(struct kvm *kvm, u64 ident_addr);
> -       u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> +       u8 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
>
>         void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>                              int root_level);
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index fc1725b7d05f..56f03eafe421 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4011,11 +4011,6 @@ static bool svm_has_emulated_msr(struct kvm *kvm, u32 index)
>         return true;
>  }
>
> -static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> -{
> -       return 0;
> -}
> -
>  static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  {
>         struct vcpu_svm *svm = to_svm(vcpu);
> @@ -4673,7 +4668,6 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>         .check_apicv_inhibit_reasons = avic_check_apicv_inhibit_reasons,
>         .apicv_post_state_restore = avic_apicv_post_state_restore,
>
> -       .get_mt_mask = svm_get_mt_mask,
>         .get_exit_info = svm_get_exit_info,
>
>         .vcpu_after_set_cpuid = svm_vcpu_after_set_cpuid,
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index cf8581978bce..646fa609aa0d 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7142,7 +7142,7 @@ static int __init vmx_check_processor_compat(void)
>         return 0;
>  }
>
> -static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> +static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  {
>         u8 cache;
>
>
> base-commit: 59d9e75d641565603e7c293f4cec182d86db8586
> --
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 7/9] KVM: x86/mmu: Add try_get_mt_mask to x86_ops
  2022-04-11 23:00   ` Sean Christopherson
  2022-04-11 23:24     ` Ben Gardon
@ 2022-04-11 23:33     ` Sean Christopherson
  2022-04-12 19:30     ` Sean Christopherson
  2 siblings, 0 replies; 33+ messages in thread
From: Sean Christopherson @ 2022-04-11 23:33 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid

On Mon, Apr 11, 2022, Sean Christopherson wrote:
> And as a bonus, if we use 0/-errno, then we can use KVM_X86_OP_OPTIONAL_RET0()
> and SVM doesn't need to provide an implementation.

Gah, got ahead of myself.  If we go this route, the caller would need to ensure
it initializes mt_mask.  Probably not worth it, so scratch that idea.  I also
thought about overloading the return type, e.g.

	mt_mask = ...
	if (mt_mask < 0)
		return false;

But again, probably not a net positive.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 5/9] KVM: x86/mmu: Factor out the meat of reset_tdp_shadow_zero_bits_mask
  2022-03-21 22:43 ` [PATCH v2 5/9] KVM: x86/mmu: Factor out the meat of reset_tdp_shadow_zero_bits_mask Ben Gardon
@ 2022-04-12 15:46   ` Sean Christopherson
  2022-04-21 18:50     ` Ben Gardon
  0 siblings, 1 reply; 33+ messages in thread
From: Sean Christopherson @ 2022-04-12 15:46 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid

On Mon, Mar 21, 2022, Ben Gardon wrote:
> Factor out the implementation of reset_tdp_shadow_zero_bits_mask to a
> helper function which does not require a vCPU pointer. The only element
> of the struct kvm_mmu context used by the function is the shadow root
> level, so pass that in too instead of the mmu context.
> 
> No functional change intended.
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3b8da8b0745e..6f98111f8f8b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4487,16 +4487,14 @@ static inline bool boot_cpu_is_amd(void)
>   * possible, however, kvm currently does not do execution-protection.
>   */
>  static void

Strongly prefer the newline here get dropped (see below).

> -reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
> +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,

Kind of a nit, but KVM uses "calc" for this sort of thing.  There are no other
instances of "build_" to describe this behavior.

Am I alone in thinking that shadow_zero_check is an awful, awful name?  E.g. the EPT
memtype case has legal non-zero values.  Anyone object to opportunistically
renaming the function and the local shadow_zero_check to "rsvd_bits" to shorten
line lengths and move KVM one step closer to consistent naming?

> +				int shadow_root_level)
>  {
> -	struct rsvd_bits_validate *shadow_zero_check;
>  	int i;
>  
> -	shadow_zero_check = &context->shadow_zero_check;
> -
>  	if (boot_cpu_is_amd())
>  		__reset_rsvds_bits_mask(shadow_zero_check, reserved_hpa_bits(),
> -					context->shadow_root_level, false,
> +					shadow_root_level, false,
>  					boot_cpu_has(X86_FEATURE_GBPAGES),
>  					false, true);
>  	else
> @@ -4507,12 +4505,19 @@ reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
>  	if (!shadow_me_mask)
>  		return;
>  
> -	for (i = context->shadow_root_level; --i >= 0;) {
> +	for (i = shadow_root_level; --i >= 0;) {
>  		shadow_zero_check->rsvd_bits_mask[0][i] &= ~shadow_me_mask;
>  		shadow_zero_check->rsvd_bits_mask[1][i] &= ~shadow_me_mask;
>  	}
>  }
>  
> +static void
> +reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)

One line!  Aside from being against the One True Style[*], there is zero reason
for a newline here.

And I vote to drop the "mask", because (a) it's not a singular mask and (b) it's
not even a mask in all cases.

And while I'm on a naming consistency rant, s/context/mmu.

I.e. end up with:

static void calc_tdp_shadow_rsvd_bits(struct rsvd_bits_validate *rsvd_bits,
				      int shadow_root_level)

static void reset_tdp_shadow_rsvd_bits(struct kvm_mmu *mmu)

[*] https://lore.kernel.org/mm-commits/CAHk-=wjS-Jg7sGMwUPpDsjv392nDOOs0CtUtVkp=S6Q7JzFJRw@mail.gmail.com

> +{
> +	build_tdp_shadow_zero_bits_mask(&context->shadow_zero_check,
> +					context->shadow_root_level);
> +}
> +
>  /*
>   * as the comments in reset_shadow_zero_bits_mask() except it
>   * is the shadow page table for intel nested guest.
> -- 
> 2.35.1.894.gb6a874cedc-goog
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 3/9] KVM: x86/mmu: Factor shadow_zero_check out of __make_spte
  2022-03-21 22:43 ` [PATCH v2 3/9] KVM: x86/mmu: Factor shadow_zero_check " Ben Gardon
@ 2022-04-12 15:52   ` Sean Christopherson
  0 siblings, 0 replies; 33+ messages in thread
From: Sean Christopherson @ 2022-04-12 15:52 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid

On Mon, Mar 21, 2022, Ben Gardon wrote:
> In the interest of developing a version of __make_spte that can function
> without a vCPU pointer, factor out the shadow_zero_mask to be an
> additional argument to the function.
> 
> No functional change intended.
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/spte.c | 10 ++++++----
>  arch/x86/kvm/mmu/spte.h |  2 +-
>  2 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 931cf93c3b7e..ef2d85577abb 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -94,7 +94,7 @@ bool __make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  		 const struct kvm_memory_slot *slot, unsigned int pte_access,
>  		 gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
>  		 bool can_unsync, bool host_writable, u64 mt_mask,
> -		 u64 *new_spte)
> +		 struct rsvd_bits_validate *shadow_zero_check, u64 *new_spte)

Can we name the new param "rsvd_bits"?  As mentioned in the other patch, it's not
a pure "are these bits zero" check.

>  {
>  	int level = sp->role.level;
>  	u64 spte = SPTE_MMU_PRESENT_MASK;
> @@ -177,9 +177,9 @@ bool __make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	if (prefetch)
>  		spte = mark_spte_for_access_track(spte);
>  
> -	WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
> +	WARN_ONCE(is_rsvd_spte(shadow_zero_check, spte, level),
>  		  "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
> -		  get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
> +		  get_rsvd_bits(shadow_zero_check, spte, level));
>  
>  	if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
>  		/* Enforced by kvm_mmu_hugepage_adjust. */
> @@ -199,10 +199,12 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  {
>  	u64 mt_mask = static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
>  						       kvm_is_mmio_pfn(pfn));
> +	struct rsvd_bits_validate *shadow_zero_check =
> +			&vcpu->arch.mmu->shadow_zero_check;
>  
>  	return __make_spte(vcpu, sp, slot, pte_access, gfn, pfn, old_spte,
>  			   prefetch, can_unsync, host_writable, mt_mask,
> -			   new_spte);
> +			   shadow_zero_check, new_spte);

I don't see any reason to snapshot the reserved bits, IMO this is much more
readable overall:

	u64 mt_mask = static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
						       kvm_is_mmio_pfn(pfn));

	return __make_spte(vcpu->kvm, sp, slot, pte_access, gfn, pfn, old_spte,
			   prefetch, can_unsync, host_writable, mt_mask,
			   &vcpu->arch.mmu->shadow_zero_check, new_spte);

And it avoids propagating the shadow_zero_check naming.

> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index d051f955699e..e8a051188eb6 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -414,7 +414,7 @@ bool __make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  		 const struct kvm_memory_slot *slot, unsigned int pte_access,
>  		 gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
>  		 bool can_unsync, bool host_writable, u64 mt_mask,
> -		 u64 *new_spte);
> +		 struct rsvd_bits_validate *shadow_zero_check, u64 *new_spte);
>  bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	       const struct kvm_memory_slot *slot,
>  	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
> -- 
> 2.35.1.894.gb6a874cedc-goog
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging
  2022-03-21 22:43 ` [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging Ben Gardon
  2022-03-28 17:45   ` David Matlack
  2022-03-28 18:21   ` David Matlack
@ 2022-04-12 16:43   ` Sean Christopherson
  2022-04-25 18:09     ` Ben Gardon
  2 siblings, 1 reply; 33+ messages in thread
From: Sean Christopherson @ 2022-04-12 16:43 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid

On Mon, Mar 21, 2022, Ben Gardon wrote:
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 1bff453f7cbe..6c08a5731fcb 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -171,4 +171,10 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  
> +void
> +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> +				int shadow_root_level);

Same comments from the earlier patch.

> +extern int max_huge_page_level __read_mostly;

Can you put this at the top of the header?  x86.h somehow ended up with extern
variables being declared in the middle of the file and I find it very jarring,
e.g. global definitions are pretty much never buried in the middle of a .c file.

>  #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index af60922906ef..eb8929e394ec 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1709,6 +1709,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
>  		clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
>  }
>  
> +static bool try_promote_lpage(struct kvm *kvm,

I believe we've settled on huge_page instead of lpage.

And again, I strongly prefer a 0/-errno return instead of a boolean as seeing
-EBUSY or whatever makes it super obvious that the early returns are failure
paths.
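
E.g. the early outs would read something like (sketch, exact errnos TBD):

	if (unlikely(kvm->mmu_notifier_count))
		return -EBUSY;

	if (iter->level > max_huge_page_level || iter->gfn < slot->base_gfn ||
	    iter->gfn >= slot->base_gfn + slot->npages)
		return -EINVAL;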

> +			      const struct kvm_memory_slot *slot,
> +			      struct tdp_iter *iter)
> +{
> +	struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
> +	struct rsvd_bits_validate shadow_zero_check;
> +	bool map_writable;
> +	kvm_pfn_t pfn;
> +	u64 new_spte;
> +	u64 mt_mask;
> +
> +	/*
> +	 * If addresses are being invalidated, don't do in-place promotion to
> +	 * avoid accidentally mapping an invalidated address.
> +	 */
> +	if (unlikely(kvm->mmu_notifier_count))
> +		return false;
> +
> +	if (iter->level > max_huge_page_level || iter->gfn < slot->base_gfn ||
> +	    iter->gfn >= slot->base_gfn + slot->npages)
> +		return false;
> +
> +	pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
> +				   &map_writable, NULL);
> +	if (is_error_noslot_pfn(pfn))
> +		return false;
> +
> +	/*
> +	 * Can't reconstitute an lpage if the consituent pages can't be

"huge page", though honestly I'd just drop the comment, IMO this is more intuitive
than, say, the checks against the slot stuff above.

> +	 * mapped higher.
> +	 */
> +	if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn,
> +						    pfn, PG_LEVEL_NUM))
> +		return false;
> +
> +	build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level);
> +
> +	/*
> +	 * In some cases, a vCPU pointer is required to get the MT mask,
> +	 * however in most cases it can be generated without one. If a
> +	 * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail.
> +	 * In that case, bail on in-place promotion.
> +	 */
> +	if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,

I wouldn't bother with the "unlikely".  It's wrong for a VM with non-coherent DMA,
and it's very unlikely (heh) to actually be a meaningful optimization in any case.

> +							   kvm_is_mmio_pfn(pfn),
> +							   &mt_mask)))
> +		return false;
> +
> +	__make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,

A comment stating the return type is intentionally ignored would be helpful.  Not
strictly necessary because it's mostly obvious after looking at the details, but
it'd save someone from having to dig into said details.

> +		  map_writable, mt_mask, &shadow_zero_check, &new_spte);

Bad indentation.

> +
> +	if (tdp_mmu_set_spte_atomic(kvm, iter, new_spte))
> +		return true;

And by returning an int, and because the failure path rereads the SPTE for you,
this becomes:

	return tdp_mmu_set_spte_atomic(kvm, iter, new_spte);

> +
> +	/* Re-read the SPTE as it must have been changed by another thread. */
> +	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
> +
> +	return false;
> +}
> +
>  /*
>   * Clear leaf entries which could be replaced by large mappings, for
>   * GFNs within the slot.

This comment needs to be updated to include the huge page promotion behavior. And
maybe rename the function too?  E.g.

static void zap_or_promote_collapsible_sptes(struct kvm *kvm,
					     struct kvm_mmu_page *root,
					     const struct kvm_memory_slot *slot)

> @@ -1729,8 +1789,17 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
>  		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
>  			continue;
>  
> -		if (!is_shadow_present_pte(iter.old_spte) ||
> -		    !is_last_spte(iter.old_spte, iter.level))
> +		if (iter.level > max_huge_page_level ||
> +		    iter.gfn < slot->base_gfn ||
> +		    iter.gfn >= slot->base_gfn + slot->npages)

Isn't this exact check in try_promote_lpage()?  Ditto for the kvm_mmu_max_mapping_level()
check that's just out of sight.  That one in particular can be somewhat expensive,
especially when KVM is fixed to use a helper that disables IRQs so the host page tables
aren't freed while they're being walked.  Oh, and the huge page promotion path
doesn't incorporate the reserved pfn check.

In other words, shouldn't this be:


		if (!is_shadow_present_pte(iter.old_spte))
			continue;

		if (iter.level > max_huge_page_level ||
		    iter.gfn < slot->base_gfn ||
		    iter.gfn >= slot->base_gfn + slot->npages)
			continue;

		pfn = spte_to_pfn(iter.old_spte);
		if (kvm_is_reserved_pfn(pfn) ||
		    iter.level >= kvm_mmu_max_mapping_level(kvm, slot, iter.gfn,
							    pfn, PG_LEVEL_NUM))
			continue;

Followed by the promotion stuff.  And then unless I'm overlooking something, "pfn"
can be passed into try_promote_huge_page(); it just needs to be masked appropriately.
I.e. the promotion path can avoid the __gfn_to_pfn_memslot() lookup and also drop
its is_error_noslot_pfn() check since the pfn is pulled from the SPTE and KVM should
never install garbage into the SPTE (emulated/noslot MMIO pfns fail the shadow
present check).
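
I.e. the call would look something like this (the extra param is
hypothetical, and the masking is only meant to illustrate what "masked
appropriately" could mean):

		try_promote_huge_page(kvm, slot, &iter,
				      pfn & ~(KVM_PAGES_PER_HPAGE(iter.level) - 1));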

> +			continue;
> +
> +		if (!is_shadow_present_pte(iter.old_spte))
> +			continue;

I strongly prefer to keep the !is_shadow_present_pte() check first, it really
should be the first thing any of these flows check.

> +
> +		/* Try to promote the constitutent pages to an lpage. */
> +		if (!is_last_spte(iter.old_spte, iter.level) &&
> +		    try_promote_lpage(kvm, slot, &iter))

There is an undocumented functional change here, and I can't tell if it's intentional.
If the promotion fails, KVM continues on and zaps the non-leaf shadow page.  If that
is intentional behavior, it should be done in a follow-up patch, e.g. so that it can
be easily reverted if it turns out that zapping e.g. a PUD is bad for performance.

I.e. shouldn't this be:

		if (!is_last_spte(iter.old_spte, iter.level)) {
			try_promote_huge_page(...);
			continue;
		}

and then converted to the current variant in a follow-up?

>  			continue;
>  
>  		pfn = spte_to_pfn(iter.old_spte);
> -- 
> 2.35.1.894.gb6a874cedc-goog
> 


* Re: [PATCH v2 7/9] KVM: x86/mmu: Add try_get_mt_mask to x86_ops
  2022-04-11 23:00   ` Sean Christopherson
  2022-04-11 23:24     ` Ben Gardon
  2022-04-11 23:33     ` Sean Christopherson
@ 2022-04-12 19:30     ` Sean Christopherson
  2 siblings, 0 replies; 33+ messages in thread
From: Sean Christopherson @ 2022-04-12 19:30 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid

On Mon, Apr 11, 2022, Sean Christopherson wrote:
> It's obviously not a hard dependency, but using a u64 for the mask here and then
> undoing the whole thing is rather silly.  Compile tested only at this point, I'll
> test on an actual system ASAP and let you know if I did something stupid.
> 
> From: Sean Christopherson <seanjc@google.com>
> Date: Mon, 11 Apr 2022 15:12:16 -0700
> Subject: [PATCH] KVM: x86: Restrict get_mt_mask() to a u8, use
>  KVM_X86_OP_OPTIONAL_RET0

Verified this works as expected when patching in RET0 on a 32-bit kernel (returning
a u64 results in reserved bits in the upper half being set due to EDX not being cleared).
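
For reference, the gist of the change (paraphrased, not the exact diff; see the patch
referenced above):

  arch/x86/include/asm/kvm-x86-ops.h:
  -KVM_X86_OP(get_mt_mask)
  +KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)

  arch/x86/include/asm/kvm_host.h:
  -	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
  +	u8 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);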


* Re: [PATCH v2 8/9] KVM: x86/mmu: Make kvm_is_mmio_pfn usable outside of spte.c
  2022-03-21 22:43 ` [PATCH v2 8/9] KVM: x86/mmu: Make kvm_is_mmio_pfn usable outside of spte.c Ben Gardon
@ 2022-04-12 19:39   ` Sean Christopherson
  0 siblings, 0 replies; 33+ messages in thread
From: Sean Christopherson @ 2022-04-12 19:39 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid

On Mon, Mar 21, 2022, Ben Gardon wrote:
> Export kvm_is_mmio_pfn from spte.c. It will be used in a subsequent
> commit for in-place lpage promotion when disabling dirty logging.

Rather than force the promotion path to call kvm_is_mmio_pfn(), what about:

  a. Truly exporting the helper, i.e. EXPORT_SYMBOL_GPL
  b. Move this patch earlier in the series, before "KVM: x86/mmu: Factor out part of
     vmx_get_mt_mask which does not depend on vcpu"
  c. In the same patch, drop the "is_mmio" param from kvm_x86_ops.get_mt_mask()
     and have vmx_get_mt_mask() call it directly.

That way the call to kvm_is_mmio_pfn() is avoided when running on AMD hosts
(ignoring the shadow_me_mask thing, which I have a separate tweak for).  The
worst case scenario for a lookup is actually quite expensive, e.g. retpoline and
a spinlock.
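
For (a), the sketch is trivial, just tag the definition in spte.c so that vmx.c (which
can be built as a separate module) can call it:

	EXPORT_SYMBOL_GPL(kvm_is_mmio_pfn);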

> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/spte.c | 2 +-
>  arch/x86/kvm/mmu/spte.h | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 45e9c0c3932e..8e9b827c4ed5 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -69,7 +69,7 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
>  	return spte;
>  }
>  
> -static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
> +bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
>  {
>  	if (pfn_valid(pfn))
>  		return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index cee02fe63429..e058a85e6c66 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -443,4 +443,5 @@ u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn);
>  
>  void kvm_mmu_reset_all_pte_masks(void);
>  
> +bool kvm_is_mmio_pfn(kvm_pfn_t pfn);
>  #endif
> -- 
> 2.35.1.894.gb6a874cedc-goog
> 


* Re: [PATCH v2 5/9] KVM: x86/mmu: Factor out the meat of reset_tdp_shadow_zero_bits_mask
  2022-04-12 15:46   ` Sean Christopherson
@ 2022-04-21 18:50     ` Ben Gardon
  2022-04-21 19:09       ` Ben Gardon
  0 siblings, 1 reply; 33+ messages in thread
From: Ben Gardon @ 2022-04-21 18:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Paolo Bonzini, Peter Xu, David Matlack, Jim Mattson,
	David Dunn, Jing Zhang, Junaid Shahid

On Tue, Apr 12, 2022 at 8:46 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Mar 21, 2022, Ben Gardon wrote:
> > Factor out the implementation of reset_tdp_shadow_zero_bits_mask to a
> > helper function which does not require a vCPU pointer. The only element
> > of the struct kvm_mmu context used by the function is the shadow root
> > level, so pass that in too instead of the mmu context.
> >
> > No functional change intended.
> >
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 17 +++++++++++------
> >  1 file changed, 11 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 3b8da8b0745e..6f98111f8f8b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4487,16 +4487,14 @@ static inline bool boot_cpu_is_amd(void)
> >   * possible, however, kvm currently does not do execution-protection.
> >   */
> >  static void
>
> Strongly prefer the newline here get dropped (see below).
>
> > -reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
> > +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
>
> Kind of a nit, but KVM uses "calc" for this sort of thing.  There are no other
> instances of "build_" to describe this behavior.
>
> Am I alone in thinking that shadow_zero_check is an awful, awful name?  E.g. the EPT
> memtype case has legal non-zero values.  Anyone object to opportunistically
> renaming the function and the local shadow_zero_check to "rsvd_bits" to shorten
> line lengths and move KVM one step closer to consistent naming?

That makes sense to me. I'm happy to add a commit to this series to
standardize on rsvd_bits.

>
> > +                             int shadow_root_level)
> >  {
> > -     struct rsvd_bits_validate *shadow_zero_check;
> >       int i;
> >
> > -     shadow_zero_check = &context->shadow_zero_check;
> > -
> >       if (boot_cpu_is_amd())
> >               __reset_rsvds_bits_mask(shadow_zero_check, reserved_hpa_bits(),
> > -                                     context->shadow_root_level, false,
> > +                                     shadow_root_level, false,
> >                                       boot_cpu_has(X86_FEATURE_GBPAGES),
> >                                       false, true);
> >       else
> > @@ -4507,12 +4505,19 @@ reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
> >       if (!shadow_me_mask)
> >               return;
> >
> > -     for (i = context->shadow_root_level; --i >= 0;) {
> > +     for (i = shadow_root_level; --i >= 0;) {
> >               shadow_zero_check->rsvd_bits_mask[0][i] &= ~shadow_me_mask;
> >               shadow_zero_check->rsvd_bits_mask[1][i] &= ~shadow_me_mask;
> >       }
> >  }
> >
> > +static void
> > +reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
>
> One line!  Aside from being against the One True Style[*], there is zero reason
> for a newline here.
>
> And I vote to drop the "mask", because (a) it's not a singular mask and (b) it's
> not even a mask in all cases.
>
> And while I'm on a naming consistency rant, s/context/mmu.
>
> I.e. end up with:
>
> static void calc_tdp_shadow_rsvd_bits(struct rsvd_bits_validate *rsvd_bits,
>                                       int shadow_root_level)
>
> static void reset_tdp_shadow_rsvd_bits(struct kvm_mmu *mmu)
>
> [*] https://lore.kernel.org/mm-commits/CAHk-=wjS-Jg7sGMwUPpDsjv392nDOOs0CtUtVkp=S6Q7JzFJRw@mail.gmail.com
>
> > +{
> > +     build_tdp_shadow_zero_bits_mask(&context->shadow_zero_check,
> > +                                     context->shadow_root_level);
> > +}
> > +
> >  /*
> >   * as the comments in reset_shadow_zero_bits_mask() except it
> >   * is the shadow page table for intel nested guest.
> > --
> > 2.35.1.894.gb6a874cedc-goog
> >


* Re: [PATCH v2 5/9] KVM: x86/mmu: Factor out the meat of reset_tdp_shadow_zero_bits_mask
  2022-04-21 18:50     ` Ben Gardon
@ 2022-04-21 19:09       ` Ben Gardon
  0 siblings, 0 replies; 33+ messages in thread
From: Ben Gardon @ 2022-04-21 19:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Paolo Bonzini, Peter Xu, David Matlack, Jim Mattson,
	David Dunn, Jing Zhang, Junaid Shahid

On Thu, Apr 21, 2022 at 11:50 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Tue, Apr 12, 2022 at 8:46 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Mar 21, 2022, Ben Gardon wrote:
> > > Factor out the implementation of reset_tdp_shadow_zero_bits_mask to a
> > > helper function which does not require a vCPU pointer. The only element
> > > of the struct kvm_mmu context used by the function is the shadow root
> > > level, so pass that in too instead of the mmu context.
> > >
> > > No functional change intended.
> > >
> > > Signed-off-by: Ben Gardon <bgardon@google.com>
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c | 17 +++++++++++------
> > >  1 file changed, 11 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 3b8da8b0745e..6f98111f8f8b 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -4487,16 +4487,14 @@ static inline bool boot_cpu_is_amd(void)
> > >   * possible, however, kvm currently does not do execution-protection.
> > >   */
> > >  static void
> >
> > Strongly prefer the newline here get dropped (see below).
> >
> > > -reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
> > > +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> >
> > Kind of a nit, but KVM uses "calc" for this sort of thing.  There are no other
> > instances of "build_" to describe this behavior.
> >
> > Am I alone in thinking that shadow_zero_check is an awful, awful name?  E.g. the EPT
> > memtype case has legal non-zero values.  Anyone object to opportunistically
> > renaming the function and the local shadow_zero_check to "rsvd_bits" to shorten
> > line lengths and move KVM one step closer to consistent naming?
>
> That makes sense to me. I'm happy to add a commit to this series to
> standardize on rsvd_bits.

Actually, rsvd_bits is already a function name, so I'm going to
standardize on spte_rsvd_bits, if that works for everyone.
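
I.e. something like this (illustrative only, exact names still TBD):

	static void calc_tdp_spte_rsvd_bits(struct rsvd_bits_validate *spte_rsvd_bits,
					    int shadow_root_level)

	static void reset_tdp_spte_rsvd_bits(struct kvm_mmu *mmu)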

>
> >
> > > +                             int shadow_root_level)
> > >  {
> > > -     struct rsvd_bits_validate *shadow_zero_check;
> > >       int i;
> > >
> > > -     shadow_zero_check = &context->shadow_zero_check;
> > > -
> > >       if (boot_cpu_is_amd())
> > >               __reset_rsvds_bits_mask(shadow_zero_check, reserved_hpa_bits(),
> > > -                                     context->shadow_root_level, false,
> > > +                                     shadow_root_level, false,
> > >                                       boot_cpu_has(X86_FEATURE_GBPAGES),
> > >                                       false, true);
> > >       else
> > > @@ -4507,12 +4505,19 @@ reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
> > >       if (!shadow_me_mask)
> > >               return;
> > >
> > > -     for (i = context->shadow_root_level; --i >= 0;) {
> > > +     for (i = shadow_root_level; --i >= 0;) {
> > >               shadow_zero_check->rsvd_bits_mask[0][i] &= ~shadow_me_mask;
> > >               shadow_zero_check->rsvd_bits_mask[1][i] &= ~shadow_me_mask;
> > >       }
> > >  }
> > >
> > > +static void
> > > +reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
> >
> > One line!  Aside from being against the One True Style[*], there is zero reason
> > for a newline here.
> >
> > And I vote to drop the "mask", because (a) it's not a singular mask and (b) it's
> > not even a mask in all cases.
> >
> > And while I'm on a naming consistency rant, s/context/mmu.
> >
> > I.e. end up with:
> >
> > static void calc_tdp_shadow_rsvd_bits(struct rsvd_bits_validate *rsvd_bits,
> >                                       int shadow_root_level)
> >
> > static void reset_tdp_shadow_rsvd_bits(struct kvm_mmu *mmu)
> >
> > [*] https://lore.kernel.org/mm-commits/CAHk-=wjS-Jg7sGMwUPpDsjv392nDOOs0CtUtVkp=S6Q7JzFJRw@mail.gmail.com
> >
> > > +{
> > > +     build_tdp_shadow_zero_bits_mask(&context->shadow_zero_check,
> > > +                                     context->shadow_root_level);
> > > +}
> > > +
> > >  /*
> > >   * as the comments in reset_shadow_zero_bits_mask() except it
> > >   * is the shadow page table for intel nested guest.
> > > --
> > > 2.35.1.894.gb6a874cedc-goog
> > >


* Re: [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging
  2022-04-12 16:43   ` Sean Christopherson
@ 2022-04-25 18:09     ` Ben Gardon
  0 siblings, 0 replies; 33+ messages in thread
From: Ben Gardon @ 2022-04-25 18:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Paolo Bonzini, Peter Xu, David Matlack, Jim Mattson,
	David Dunn, Jing Zhang, Junaid Shahid

On Tue, Apr 12, 2022 at 9:43 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Mar 21, 2022, Ben Gardon wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 1bff453f7cbe..6c08a5731fcb 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -171,4 +171,10 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> >  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >
> > +void
> > +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> > +                             int shadow_root_level);
>
> Same comments from the earlier patch.
>
> > +extern int max_huge_page_level __read_mostly;
>
> Can you put this at the top of the header?  x86.h somehow ended up with extern
> variables being declared in the middle of the file and I find it very jarring,
> e.g. global definitions are pretty much never buried in the middle of a .c file.

Will do. I'm working on a v3 of this series now.

>
>
> >  #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index af60922906ef..eb8929e394ec 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1709,6 +1709,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
> >               clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
> >  }
> >
> > +static bool try_promote_lpage(struct kvm *kvm,
>
> I believe we've settled on huge_page instead of lpage.
>
> And again, I strongly prefer a 0/-errno return instead of a boolean as seeing
> -EBUSY or whatever makes it super obvious that the early returns are failure
> paths.

Will do. To your and David's comments about retries, this makes the
retry scheme really nice and clean.
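
Roughly this shape, I think (a sketch only; it assumes the -EBUSY from a lost cmpxchg
is simply propagated up from tdp_mmu_set_spte_atomic()):

	if (!is_last_spte(iter.old_spte, iter.level)) {
		r = try_promote_huge_page(kvm, slot, &iter);
		if (r == -EBUSY)
			goto retry;
		continue;
	}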

>
> > +                           const struct kvm_memory_slot *slot,
> > +                           struct tdp_iter *iter)
> > +{
> > +     struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
> > +     struct rsvd_bits_validate shadow_zero_check;
> > +     bool map_writable;
> > +     kvm_pfn_t pfn;
> > +     u64 new_spte;
> > +     u64 mt_mask;
> > +
> > +     /*
> > +      * If addresses are being invalidated, don't do in-place promotion to
> > +      * avoid accidentally mapping an invalidated address.
> > +      */
> > +     if (unlikely(kvm->mmu_notifier_count))
> > +             return false;
> > +
> > +     if (iter->level > max_huge_page_level || iter->gfn < slot->base_gfn ||
> > +         iter->gfn >= slot->base_gfn + slot->npages)
> > +             return false;
> > +
> > +     pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
> > +                                &map_writable, NULL);
> > +     if (is_error_noslot_pfn(pfn))
> > +             return false;
> > +
> > +     /*
> > +      * Can't reconstitute an lpage if the consituent pages can't be
>
> "huge page", though honestly I'd just drop the comment, IMO this is more intuitive
> then say the checks against the slot stuff above.
>
> > +      * mapped higher.
> > +      */
> > +     if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn,
> > +                                                 pfn, PG_LEVEL_NUM))
> > +             return false;
> > +
> > +     build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level);
> > +
> > +     /*
> > +      * In some cases, a vCPU pointer is required to get the MT mask,
> > +      * however in most cases it can be generated without one. If a
> > +      * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail.
> > +      * In that case, bail on in-place promotion.
> > +      */
> > +     if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
>
> I wouldn't bother with the "unlikely".  It's wrong for a VM with non-coherent DMA,
> and it's very unlikely (heh) to actually be a meaningful optimization in any case.
>
> > +                                                        kvm_is_mmio_pfn(pfn),
> > +                                                        &mt_mask)))
> > +             return false;
> > +
> > +     __make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
>
> A comment stating the return value is intentionally ignored would be helpful.  Not
> strictly necessary because it's mostly obvious after looking at the details, but
> it'd save someone from having to dig into said details.
>
> > +               map_writable, mt_mask, &shadow_zero_check, &new_spte);
>
> Bad indentation.
>
> > +
> > +     if (tdp_mmu_set_spte_atomic(kvm, iter, new_spte))
> > +             return true;
>
> And by returning an int, and because the failure path rereads the SPTE for you,
> this becomes:
>
>         return tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
>
> > +
> > +     /* Re-read the SPTE as it must have been changed by another thread. */
> > +     iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
> > +
> > +     return false;
> > +}
> > +
> >  /*
> >   * Clear leaf entries which could be replaced by large mappings, for
> >   * GFNs within the slot.
>
> This comment needs to be updated to include the huge page promotion behavior. And
> maybe rename the function too?  E.g.
>
> static void zap_or_promote_collapsible_sptes(struct kvm *kvm,
>                                              struct kvm_mmu_page *root,
>                                              const struct kvm_memory_slot *slot)
>
> > @@ -1729,8 +1789,17 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> >               if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
> >                       continue;
> >
> > -             if (!is_shadow_present_pte(iter.old_spte) ||
> > -                 !is_last_spte(iter.old_spte, iter.level))
> > +             if (iter.level > max_huge_page_level ||
> > +                 iter.gfn < slot->base_gfn ||
> > +                 iter.gfn >= slot->base_gfn + slot->npages)
>
> Isn't this exact check in try_promote_lpage()?  Ditto for the kvm_mmu_max_mapping_level()
> check that's just out of sight.  That one in particular can be somewhat expensive,
> especially when KVM is fixed to use a helper that disables IRQs so the host page tables
> aren't freed while they're being walked.  Oh, and the huge page promotion path
> doesn't incorporate the reserved pfn check.
>
> In other words, shouldn't this be:
>
>
>                 if (!is_shadow_present_pte(iter.old_spte))
>                         continue;
>
>                 if (iter.level > max_huge_page_level ||
>                     iter.gfn < slot->base_gfn ||
>                     iter.gfn >= slot->base_gfn + slot->npages)
>                         continue;
>
>                 pfn = spte_to_pfn(iter.old_spte);
>                 if (kvm_is_reserved_pfn(pfn) ||
>                     iter.level >= kvm_mmu_max_mapping_level(kvm, slot, iter.gfn,
>                                                             pfn, PG_LEVEL_NUM))
>                         continue;
>
> Followed by the promotion stuff.  And then unless I'm overlooking something, "pfn"
> can be passed into try_promote_huge_page(), it just needs to be masked appropriately.
> I.e. the promotion path can avoid the __gfn_to_pfn_memslot() lookup and also drop
> its is_error_noslot_pfn() check since the pfn is pulled from the SPTE and KVM should
> never install garbage into the SPTE (emulated/noslot MMIO pfns fail the shadow
> present check).

I'll work on deduplicating the checks. The big distinction is that in
the promotion function, the iterator is still at the non-leaf SPTE, so
if we get the PFN from the SPTE, it'll be the PFN of a page table, not
a PFN backing guest memory. I could use the same GFN to PFN memslot
conversion in both cases, but it seems more expensive than extracting
the PFN from the SPTE.

__gfn_to_pfn_memslot should never return a reserved PFN, right?

>
> > +                     continue;
> > +
> > +             if (!is_shadow_present_pte(iter.old_spte))
> > +                     continue;
>
> I strongly prefer to keep the !is_shadow_present_pte() check first, it really
> should be the first thing any of these flows check.
>
> > +
> > +             /* Try to promote the constitutent pages to an lpage. */
> > +             if (!is_last_spte(iter.old_spte, iter.level) &&
> > +                 try_promote_lpage(kvm, slot, &iter))
>
> There is an undocumented functional change here, and I can't tell if it's intentional.
> If the promotion fails, KVM continues on and zaps the non-leaf shadow page.  If that
> is intentional behavior, it should be done in a follow-up patch, e.g. so that it can
> be easily reverted if it turns out that zapping e.g. a PUD is bad for performance.
>
> I.e. shouldn't this be:
>
>                 if (!is_last_spte(iter.old_spte, iter.level)) {
>                         try_promote_huge_page(...);
>                         continue;
>                 }
>
> and then converted to the current variant in a follow-up?

Ah, good point.

>
> >                       continue;
> >
> >               pfn = spte_to_pfn(iter.old_spte);
> > --
> > 2.35.1.894.gb6a874cedc-goog
> >


* Re: [PATCH v2 0/9] KVM: x86/MMU: Optimize disabling dirty logging
  2022-03-25 12:00 ` [PATCH v2 0/9] KVM: x86/MMU: Optimize " Paolo Bonzini
@ 2022-07-12  1:37   ` Sean Christopherson
  2022-07-14  7:55     ` Paolo Bonzini
  0 siblings, 1 reply; 33+ messages in thread
From: Sean Christopherson @ 2022-07-12  1:37 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ben Gardon, linux-kernel, kvm, Peter Xu, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid

On Fri, Mar 25, 2022, Paolo Bonzini wrote:
> On 3/21/22 23:43, Ben Gardon wrote:
> > Currently disabling dirty logging with the TDP MMU is extremely slow.
> > On a 96 vCPU / 96G VM it takes ~256 seconds to disable dirty logging
> > with the TDP MMU, as opposed to ~4 seconds with the legacy MMU. This
> > series optimizes TLB flushes and introduces in-place large page
> > promotion, to bring the disable dirty log time down to ~3 seconds.
> > 
> > Testing:
> > Ran KVM selftests and kvm-unit-tests on an Intel Haswell. This
> > series introduced no new failures.
> 
> Thanks, looks good.  The one change I'd make is to just place the outcome of
> build_tdp_shadow_zero_bits_mask() in a global (say tdp_shadow_zero_check) at
> kvm_configure_mmu() time.  The tdp_max_root_level works as a conservative
> choice for the second argument of build_tdp_shadow_zero_bits_mask().
> 
> No need to do anything though, I'll handle this later in 5.19 time (and
> first merge my changes that factor out the constant part of
> vcpu->arch.root_mmu initialization, since this is part of the same ideas).

This fell through the cracks.  Ben is on a long vacation, I'll find my copy of
the Necronomicon and do a bit of resurrection, and address the feedback from v2
along the way.


* Re: [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging
  2022-03-28 18:07     ` Ben Gardon
  2022-03-28 18:20       ` David Matlack
@ 2022-07-12 23:21       ` Sean Christopherson
  2022-07-13 16:20         ` Sean Christopherson
  1 sibling, 1 reply; 33+ messages in thread
From: Sean Christopherson @ 2022-07-12 23:21 UTC (permalink / raw)
  To: Ben Gardon
  Cc: David Matlack, LKML, kvm, Paolo Bonzini, Peter Xu, Jim Mattson,
	David Dunn, Jing Zhang, Junaid Shahid

On Mon, Mar 28, 2022, Ben Gardon wrote:
> On Mon, Mar 28, 2022 at 10:45 AM David Matlack <dmatlack@google.com> wrote:
> >
> > On Mon, Mar 21, 2022 at 03:43:58PM -0700, Ben Gardon wrote:
> > > +{
> > > +     struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
> > > +     struct rsvd_bits_validate shadow_zero_check;
> > > +     bool map_writable;
> > > +     kvm_pfn_t pfn;
> > > +     u64 new_spte;
> > > +     u64 mt_mask;
> > > +
> > > +     /*
> > > +      * If addresses are being invalidated, don't do in-place promotion to
> > > +      * avoid accidentally mapping an invalidated address.
> > > +      */
> > > +     if (unlikely(kvm->mmu_notifier_count))
> > > +             return false;
> >
> > Why is this necessary? Seeing this makes me wonder if we need a similar
> > check for eager page splitting.
> 
> This is needed here, but not in the page splitting case, because we
> are potentially mapping new memory.

As written, it's required because KVM doesn't check that there's at least one
leaf SPTE for the range.  If KVM were to step down and find a leaf SPTE before
stepping back up to promote, then this check can be dropped because KVM zaps leaf
SPTEs during invalidate_range_start(), and the primary MMU must invalidate the
entire range covered by a huge page if it's splitting a huge page.

I'm inclined to go that route because it allows for a more unified path (with some
other prep work).  Having to find a leaf SPTE could increase the time to disable
dirty logging, but unless it's an order of magnitude or worse, I'm not sure we care
because walking SPTEs doesn't impact vCPUs (unlike actually zapping).

> > > +             /* Try to promote the constitutent pages to an lpage. */
> > > +             if (!is_last_spte(iter.old_spte, iter.level) &&
> > > +                 try_promote_lpage(kvm, slot, &iter))
> > >                       continue;
> >
> > If iter.old_spte is not a leaf, the old loop would always continue to
> > the next SPTE. Now we try to promote it and if that fails we run through
> > the rest of the loop. This seems broken. For example, in the next line
> > we end up grabbing the pfn of the non-leaf SPTE (which would be the PFN
> > of the TDP MMU page table?) and treat that as the PFN backing this GFN,
> > which is wrong.
> >
> > In the worst case we end up zapping an SPTE that we didn't need to, but
> > we should still fix up this code.

My thought to remedy this is to drop the @pfn argument to kvm_mmu_max_mapping_level().
It's used only for historical reasons, where KVM didn't walk the host page tables
to get the max mapping level and instead pulled THP information out of struct page.
I.e. KVM needed the pfn to get the page.
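
The prototype would then end up as something like (sketch):

	int kvm_mmu_max_mapping_level(struct kvm *kvm,
				      const struct kvm_memory_slot *slot,
				      gfn_t gfn, int max_level);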

That would also allow KVM to use huge pages for things that aren't backed by
struct page (I know of at least one potential use case).

I _think_ we can do the below.  It's compile tested only at this point, and I
want to split some of the changes into separate patches, e.g. the WARN on the step-up
not going out-of-bounds.  I'll put this on the backburner for now, it's too late
for 5.20, and too many people are OOO :-)

	tdp_root_for_each_pte(iter, root, start, end) {
		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
			continue;

		/*
		 * Step down until a PRESENT, leaf SPTE is found, even when
		 * promoting a parent shadow page.  Requiring a leaf SPTE
		 * ensures that KVM is not creating a new mapping while an MMU
		 * notifier invalidation is in-progress (KVM zaps only leaf
		 * SPTEs in response to MMU notifier invalidation events), and
		 * avoids doing work for shadow pages with no children.
		 */
		if (!is_shadow_present_pte(iter.old_spte) ||
		    !is_last_spte(iter.old_spte, iter.level))
			continue;

		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn,
							      PG_LEVEL_NUM);
		if (iter.level == max_mapping_level)
			continue;

		/*
		 * KVM zaps leaf SPTEs when handling MMU notifier invalidations,
		 * and the primary MMU is supposed to invalidate secondary MMUs
		 * _before_ zapping PTEs in the host page tables.  It should be
		 * impossible for a leaf SPTE to violate the host mapping level.
		 */
		if (WARN_ON_ONCE(max_mapping_level < iter.level))
			continue;

		/*
		 * The page can be remapped at a higher level, so step
		 * up to zap the parent SPTE.
		 */
		while (max_mapping_level > iter.level)
			tdp_iter_step_up(&iter);

		/*
		 * Stepping up should not cause the iter to go out of range of
		 * the memslot, the max mapping level is bounded by the memslot
		 * (among other things).
		 */
		if (WARN_ON_ONCE(iter.gfn < start || iter.gfn >= end))
			continue;

		/*
		 * Attempt to promote the non-leaf SPTE to a huge page.  If the
		 * promotion fails, zap the SPTE and let it be rebuilt on the
		 * next associated TDP page fault.
		 */
		if (!try_promote_to_huge_page(kvm, &rsvd_bits, slot, &iter))
			continue;

		/* Note, a successful atomic zap also does a remote TLB flush. */
		tdp_mmu_zap_spte_atomic(kvm, &iter);

		/*
		 * If the atomic zap fails, the iter will recurse back into
		 * the same subtree to retry.
		 */
	}

and then promotion shrinks a decent amount too as it's just getting the pfn,
memtype, and making the SPTE.

  static int try_promote_to_huge_page(struct kvm *kvm,
				      struct rsvd_bits_validate *rsvd_bits,
				      const struct kvm_memory_slot *slot,
				      struct tdp_iter *iter)
  {
	struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
	kvm_pfn_t pfn;
	u64 new_spte;
	u8 mt_mask;
	int r;

	/*
	 * Treat the lookup as a write "fault", in-place promotion is used only
	 * when disabling dirty logging, which requires a writable memslot.
	 */
	pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true, NULL, NULL);
	if (is_error_noslot_pfn(pfn))
		return -EINVAL;

	/*
	 * In some cases, a vCPU pointer is required to get the MT mask,
	 * however in most cases it can be generated without one. If a
	 * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail.
	 * In that case, bail on in-place promotion.
	 */
	r = static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
						 kvm_is_mmio_pfn(pfn), &mt_mask);
	if (r)
		return r;

	__make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
		    true, mt_mask, rsvd_bits, &new_spte);

	return tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
  }




* Re: [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging
  2022-07-12 23:21       ` Sean Christopherson
@ 2022-07-13 16:20         ` Sean Christopherson
  0 siblings, 0 replies; 33+ messages in thread
From: Sean Christopherson @ 2022-07-13 16:20 UTC (permalink / raw)
  To: Ben Gardon
  Cc: David Matlack, LKML, kvm, Paolo Bonzini, Peter Xu, Jim Mattson,
	David Dunn, Jing Zhang, Junaid Shahid

On Tue, Jul 12, 2022, Sean Christopherson wrote:
> On Mon, Mar 28, 2022, Ben Gardon wrote:
> > On Mon, Mar 28, 2022 at 10:45 AM David Matlack <dmatlack@google.com> wrote:
> > > If iter.old_spte is not a leaf, the old loop would always continue to
> > > the next SPTE. Now we try to promote it and if that fails we run through
> > > the rest of the loop. This seems broken. For example, in the next line
> > > we end up grabbing the pfn of the non-leaf SPTE (which would be the PFN
> > > of the TDP MMU page table?) and treat that as the PFN backing this GFN,
> > > which is wrong.
> > >
> > > In the worst case we end up zapping an SPTE that we didn't need to, but
> > > we should still fix up this code.
> 
> My thought to remedy this is to drop the @pfn argument to kvm_mmu_max_mapping_level().
> It's used only for historical reasons, where KVM didn't walk the host page tables
> to get the max mapping level and instead pulled THP information out of struct page.
> I.e. KVM needed the pfn to get the page.
> 
> That would also allow KVM to use huge pages for things that aren't backed by
> struct page (I know of at least one potential use case).
> 
> I _think_ we can do the below.  It's compile tested only at this point, and I
> want to split some of the changes into separate patches, e.g. the WARN on the step-up
> not going out-of-bounds.  I'll put this on the backburner for now, it's too late
> for 5.20, and too many people are OOO :-)

Heh, that was a bit of a lie.  _Now_ it's going on the backburner.

Thinking about the pfn coming from the old leaf SPTE made me realize all of the
information we need to use __make_spte() during promotion is available in the
existing leaf SPTE.

If KVM first retrieves a PRESENT leaf SPTE, then pfn _can't_ be different, because
that would mean KVM done messed up and didn't zap existing entries in response to
a MMU notifier invalidation, and holding mmu_lock prevents new invalidations.
And because the primary MMU must invalidate before splitting a huge page, having a
valid leaf SPTE means that host mapping level can't become stale until mmu_lock is
dropped.  In other words, KVM can compute the huge pfn by using the smaller pfn
and adjusting based on the target mapping level.

As for the EPT memtype, that can also come from the existing leaf SPTE.  KVM only
forces the memtype for host MMIO pfns, and if the primary MMU maps a huge page that
straddles host MMIO (UC) and regular memory (WB), then the kernel is already hosed.
If the VM doesn't have non-coherent DMA, then the EPT memtype will be WB regardless
of the page size.  That means KVM just needs to reject promotion if the VM has
non-coherent DMA and the target pfn is not host MMIO, else KVM can use the leaf's
memtype as-is.

Using the pfn avoids gup() (fast-only, but still), and using the memtype avoids
having to split vmx_get_mt_mask() and add another kvm_x86_ops hook.

And digging into all of that yielded another optimization.  kvm_tdp_page_fault()
needs to restrict the host mapping level if and only if it may consume the guest
MTRRs.  If KVM ignores the guest MTRRs, then the fact that they're inconsistent
across a TDP page is irrelevant because the _guest_ MTRRs are completely virtual
and are not consumed by either EPT or NPT.  I doubt this meaningfully affects
whether or not KVM can create huge pages for real world VMs, but it does avoid
having to walk the guest variable MTRRs when faulting in a huge page.

Compile tested only at this point, but I'm mostly certain my logic is sound.

int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
	/*
	 * If the guest's MTRRs may be used, restrict the mapping level to
	 * ensure KVM uses a consistent memtype across the entire mapping.
	 */
	if (kvm_may_need_guest_mtrrs(vcpu->kvm)) {
		for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
			int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
			gfn_t base = (fault->addr >> PAGE_SHIFT) & ~(page_num - 1);

			if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
				break;
		}
	}

	return direct_page_fault(vcpu, fault);
}

static int try_promote_to_huge_page(struct kvm *kvm,
				    struct rsvd_bits_validate *rsvd_bits,
				    const struct kvm_memory_slot *slot,
				    u64 leaf_spte, struct tdp_iter *iter)
{
	struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
	kvm_pfn_t pfn;
	u64 new_spte;
	u8 mt_mask;

	if (WARN_ON_ONCE(slot->flags & KVM_MEM_READONLY))
		return -EINVAL;

	pfn = spte_to_pfn(leaf_spte) & ~(KVM_PAGES_PER_HPAGE(iter->level) - 1);
	mt_mask = leaf_spte & shadow_memtype_mask;

	/*
	 * Bail if KVM needs guest MTRRs to compute the memtype and will not
	 * force the memtype (host MMIO).  There is no guarantee the guest uses
	 * a consistent MTRR memtype for the entire huge page, and MTRRs are
	 * tracked per vCPU, not per VM.
	 */
	if (kvm_may_need_guest_mtrrs(kvm) && !kvm_is_mmio_pfn(pfn))
		return -EIO;

	__make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
		    true, mt_mask, rsvd_bits, &new_spte);

	return tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
}


* Re: [PATCH v2 0/9] KVM: x86/MMU: Optimize disabling dirty logging
  2022-07-12  1:37   ` Sean Christopherson
@ 2022-07-14  7:55     ` Paolo Bonzini
  2022-07-14 15:27       ` Sean Christopherson
  0 siblings, 1 reply; 33+ messages in thread
From: Paolo Bonzini @ 2022-07-14  7:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ben Gardon, linux-kernel, kvm, Peter Xu, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid

On 7/12/22 03:37, Sean Christopherson wrote:
> This fell through the cracks.  Ben is on a long vacation, I'll find my copy of
> the Necronomicon and do a bit of resurrection, and address the feedback from v2
> along the way.

This was superseded by the simple patch to zap only the leaves, I think?

Paolo



* Re: [PATCH v2 0/9] KVM: x86/MMU: Optimize disabling dirty logging
  2022-07-14  7:55     ` Paolo Bonzini
@ 2022-07-14 15:27       ` Sean Christopherson
  0 siblings, 0 replies; 33+ messages in thread
From: Sean Christopherson @ 2022-07-14 15:27 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ben Gardon, linux-kernel, kvm, Peter Xu, David Matlack,
	Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid

On Thu, Jul 14, 2022, Paolo Bonzini wrote:
> On 7/12/22 03:37, Sean Christopherson wrote:
> > This fell through the cracks.  Ben is on a long vacation, I'll find my copy of
> > the Necronomicon and do a bit of resurrection, and address the feedback from v2
> > along the way.
> 
> This was superseded by the simple patch to zap only the leaves I think?

Ah, right you are, commit 5ba7c4c6d1c7 ("KVM: x86/MMU: Zap non-leaf SPTEs when
disabling dirty logging").  I got somewhat confused because there's a stale comment
above the inner helper:

	/*
	 * Clear leaf entries which could be replaced by large mappings, for
	 * GFNs within the slot.
	 */

If we drop the "only refcounted struct pages can be huge" requirement, then the
flow becomes much simpler as there's no need to recurse down to the leaves only to
step back up:

	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
retry:
		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
			continue;

		if (!is_shadow_present_pte(iter.old_spte))
			continue;

		/*
		 * Don't zap leaf SPTEs; if a leaf SPTE could be replaced with
		 * a large page, then its parent would have been zapped
		 * instead of stepping down.
		 */
		if (is_last_spte(iter.old_spte, iter.level))
			continue;

		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
							      iter.gfn, PG_LEVEL_NUM);
		if (max_mapping_level <= iter.level)
			continue;

		/* Note, a successful atomic zap also does a remote TLB flush. */
		if (tdp_mmu_zap_spte_atomic(kvm, &iter))
			goto retry;
	}

Thread overview: 33+ messages
2022-03-21 22:43 [PATCH v2 0/9] KVM: x86/MMU: Optimize disabling dirty logging Ben Gardon
2022-03-21 22:43 ` [PATCH v2 1/9] KVM: x86/mmu: Move implementation of make_spte to a helper Ben Gardon
2022-03-21 22:43 ` [PATCH v2 2/9] KVM: x86/mmu: Factor mt_mask out of __make_spte Ben Gardon
2022-03-21 22:43 ` [PATCH v2 3/9] KVM: x86/mmu: Factor shadow_zero_check " Ben Gardon
2022-04-12 15:52   ` Sean Christopherson
2022-03-21 22:43 ` [PATCH v2 4/9] KVM: x86/mmu: Replace vcpu argument with kvm pointer in make_spte Ben Gardon
2022-03-21 22:43 ` [PATCH v2 5/9] KVM: x86/mmu: Factor out the meat of reset_tdp_shadow_zero_bits_mask Ben Gardon
2022-04-12 15:46   ` Sean Christopherson
2022-04-21 18:50     ` Ben Gardon
2022-04-21 19:09       ` Ben Gardon
2022-03-21 22:43 ` [PATCH v2 6/9] KVM: x86/mmu: Factor out part of vmx_get_mt_mask which does not depend on vcpu Ben Gardon
2022-03-28 18:04   ` David Matlack
2022-03-21 22:43 ` [PATCH v2 7/9] KVM: x86/mmu: Add try_get_mt_mask to x86_ops Ben Gardon
2022-04-11 23:00   ` Sean Christopherson
2022-04-11 23:24     ` Ben Gardon
2022-04-11 23:33     ` Sean Christopherson
2022-04-12 19:30     ` Sean Christopherson
2022-03-21 22:43 ` [PATCH v2 8/9] KVM: x86/mmu: Make kvm_is_mmio_pfn usable outside of spte.c Ben Gardon
2022-04-12 19:39   ` Sean Christopherson
2022-03-21 22:43 ` [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging Ben Gardon
2022-03-28 17:45   ` David Matlack
2022-03-28 18:07     ` Ben Gardon
2022-03-28 18:20       ` David Matlack
2022-07-12 23:21       ` Sean Christopherson
2022-07-13 16:20         ` Sean Christopherson
2022-03-28 18:21   ` David Matlack
2022-04-12 16:43   ` Sean Christopherson
2022-04-25 18:09     ` Ben Gardon
2022-03-25 12:00 ` [PATCH v2 0/9] KVM: x86/MMU: Optimize " Paolo Bonzini
2022-07-12  1:37   ` Sean Christopherson
2022-07-14  7:55     ` Paolo Bonzini
2022-07-14 15:27       ` Sean Christopherson
2022-03-28 17:49 ` David Matlack
