linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes
@ 2022-02-09 17:00 Paolo Bonzini
  2022-02-09 17:00 ` [PATCH 01/12] KVM: x86: host-initiated EFER.LME write affects the MMU Paolo Bonzini
                   ` (12 more replies)
  0 siblings, 13 replies; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-09 17:00 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: vkuznets, mlevitsk, dmatlack, seanjc

The TDP MMU has a performance regression compared to the legacy MMU
when CR0 changes often.  This was reported for the grsecurity kernel,
which uses CR0.WP to implement kernel W^X.  In that case, each change to
CR0.WP unloads the MMU and causes a lot of unnecessary work.  When running
nested, this can even cause L1 to make hardly any progress, as the L0
hypervisor is overwhelmed by the amount of MMU work that is needed.

Initially, my plan for this was to pull kvm_mmu_unload from
kvm_mmu_reset_context into kvm_init_mmu.  Therefore I started by separating
the CPU setup (CR0/CR4/EFER, SMM, guest mode, etc.) from the shadow
page table format.  Right now the "MMU role" is a messy mix of the two
and, whenever something is different between the MMU and the CPU, it is
stored as an extra field in struct kvm_mmu; for extra bonus complication,
sometimes the same thing is stored in both the role and an extra field.
The aim was to keep the kvm_mmu_unload call only when the MMU role changed,
and to drop it when only the CPU role changed.

I even posted that cleanup, but it occurred to me later that even
a conditional kvm_mmu_unload in kvm_init_mmu would be overkill.
kvm_mmu_unload is only needed in the rare cases where a TLB flush is
needed (e.g. CR0.PG changing from 1 to 0) or where the guest page table
interpretation changes in a way not captured by the role (that is, CPUID
changes).  But the implementation of fast PGD switching is subtle
and requires a call to kvm_mmu_new_pgd (and therefore knowledge of the
new MMU role) before kvm_init_mmu, so kvm_mmu_reset_context
chickens out and drops all the roots.

Therefore, the meat of this series is a reorganization of fast PGD
switching; it makes it possible to call kvm_mmu_new_pgd *after*
the MMU has been set up, just using the MMU role instead of
kvm_mmu_calc_root_page_role.
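
As a rough sketch of the end state (illustrative only, using the helper
names as they stand after the series rather than the literal diffs, and
ignoring the nested_npt/nested_ept checks), a caller such as
nested_svm_load_cr3 ends up doing:

	vcpu->arch.cr3 = cr3;
	kvm_init_mmu(vcpu);          /* compute the new MMU role and callbacks */
	kvm_mmu_new_pgd(vcpu, cr3);  /* look up a cached root using mmu->mmu_role */

instead of having to compute the role up front just to feed it to
__kvm_mmu_new_pgd before kvm_init_mmu.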

Patches 1 to 3 are bugfixes found while working on the series.

Patches 4 and 5 add more sanity checks, which triggered frequently during
development.

Patches 6 and 7 are related cleanups.  In particular patch 7 makes
the cache lookup code a bit more pleasant.

Patches 8 and 9 rework the fast PGD switching.  Patches 10 and
11 are cleanups enabled by the rework, and the only survivors
of the CPU role patchset.

Finally, patch 12 optimizes kvm_mmu_reset_context.

Paolo


Paolo Bonzini (12):
  KVM: x86: host-initiated EFER.LME write affects the MMU
  KVM: MMU: move MMU role accessors to header
  KVM: x86: do not deliver asynchronous page faults if CR0.PG=0
  KVM: MMU: WARN if PAE roots linger after kvm_mmu_unload
  KVM: MMU: avoid NULL-pointer dereference on page freeing bugs
  KVM: MMU: rename kvm_mmu_reload
  KVM: x86: use struct kvm_mmu_root_info for mmu->root
  KVM: MMU: do not consult levels when freeing roots
  KVM: MMU: look for a cached PGD when going from 32-bit to 64-bit
  KVM: MMU: load new PGD after the shadow MMU is initialized
  KVM: MMU: remove kvm_mmu_calc_root_page_role
  KVM: x86: do not unload MMU roots on all role changes

 arch/x86/include/asm/kvm_host.h |   3 +-
 arch/x86/kvm/mmu.h              |  28 +++-
 arch/x86/kvm/mmu/mmu.c          | 253 ++++++++++++++++----------------
 arch/x86/kvm/mmu/mmu_audit.c    |   4 +-
 arch/x86/kvm/mmu/paging_tmpl.h  |   2 +-
 arch/x86/kvm/mmu/tdp_mmu.c      |   2 +-
 arch/x86/kvm/mmu/tdp_mmu.h      |   2 +-
 arch/x86/kvm/svm/nested.c       |   6 +-
 arch/x86/kvm/vmx/nested.c       |   8 +-
 arch/x86/kvm/vmx/vmx.c          |   2 +-
 arch/x86/kvm/x86.c              |  39 +++--
 11 files changed, 190 insertions(+), 159 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 01/12] KVM: x86: host-initiated EFER.LME write affects the MMU
  2022-02-09 17:00 [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes Paolo Bonzini
@ 2022-02-09 17:00 ` Paolo Bonzini
  2022-02-10 22:49   ` Sean Christopherson
  2022-02-09 17:00 ` [PATCH 02/12] KVM: MMU: move MMU role accessors to header Paolo Bonzini
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-09 17:00 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: vkuznets, mlevitsk, dmatlack, seanjc, stable

While the guest runs, EFER.LME cannot change unless CR0.PG is clear, and therefore
EFER.NX is the only bit that can affect the MMU role.  However, set_efer accepts
a host-initiated change to EFER.LME even with CR0.PG=1.  In that case, the
MMU has to be reset.

Fixes: 11988499e62b ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu.h | 1 +
 arch/x86/kvm/x86.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 51faa2c76ca5..a5a50cfeffff 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -48,6 +48,7 @@
 			       X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE)
 
 #define KVM_MMU_CR0_ROLE_BITS (X86_CR0_PG | X86_CR0_WP)
+#define KVM_MMU_EFER_ROLE_BITS (EFER_LME | EFER_NX)
 
 static __always_inline u64 rsvd_bits(int s, int e)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9a9006226501..5e1298aef9e2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1647,7 +1647,7 @@ static int set_efer(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	}
 
 	/* Update reserved bits */
-	if ((efer ^ old_efer) & EFER_NX)
+	if ((efer ^ old_efer) & KVM_MMU_EFER_ROLE_BITS)
 		kvm_mmu_reset_context(vcpu);
 
 	return 0;
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 02/12] KVM: MMU: move MMU role accessors to header
  2022-02-09 17:00 [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes Paolo Bonzini
  2022-02-09 17:00 ` [PATCH 01/12] KVM: x86: host-initiated EFER.LME write affects the MMU Paolo Bonzini
@ 2022-02-09 17:00 ` Paolo Bonzini
  2022-02-10 23:00   ` Sean Christopherson
  2022-02-09 17:00 ` [PATCH 03/12] KVM: x86: do not deliver asynchronous page faults if CR0.PG=0 Paolo Bonzini
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-09 17:00 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: vkuznets, mlevitsk, dmatlack, seanjc

We will use is_cr0_pg in the next patch to check whether an asynchronous
page fault can be delivered.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu.h     | 21 +++++++++++++++++++++
 arch/x86/kvm/mmu/mmu.c | 21 ---------------------
 2 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index a5a50cfeffff..b9d06a218b2c 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -65,6 +65,27 @@ static __always_inline u64 rsvd_bits(int s, int e)
 	return ((2ULL << (e - s)) - 1) << s;
 }
 
+/*
+ * The MMU itself (with a valid role) is the single source of truth for the
+ * MMU.  Do not use the regs used to build the MMU/role, nor the vCPU.  The
+ * regs don't account for dependencies, e.g. clearing CR4 bits if CR0.PG=1,
+ * and the vCPU may be incorrect/irrelevant.
+ */
+#define BUILD_MMU_ROLE_ACCESSOR(base_or_ext, reg, name)		\
+static inline bool __maybe_unused is_##reg##_##name(struct kvm_mmu *mmu)	\
+{								\
+	return !!(mmu->mmu_role. base_or_ext . reg##_##name);	\
+}
+BUILD_MMU_ROLE_ACCESSOR(ext,  cr0, pg);
+BUILD_MMU_ROLE_ACCESSOR(base, cr0, wp);
+BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, pse);
+BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, pae);
+BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, smep);
+BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, smap);
+BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, pke);
+BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, la57);
+BUILD_MMU_ROLE_ACCESSOR(base, efer, nx);
+
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
 void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 296f8723f9ae..e0c0f0bc2e8b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -223,27 +223,6 @@ BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, la57, X86_CR4_LA57);
 BUILD_MMU_ROLE_REGS_ACCESSOR(efer, nx, EFER_NX);
 BUILD_MMU_ROLE_REGS_ACCESSOR(efer, lma, EFER_LMA);
 
-/*
- * The MMU itself (with a valid role) is the single source of truth for the
- * MMU.  Do not use the regs used to build the MMU/role, nor the vCPU.  The
- * regs don't account for dependencies, e.g. clearing CR4 bits if CR0.PG=1,
- * and the vCPU may be incorrect/irrelevant.
- */
-#define BUILD_MMU_ROLE_ACCESSOR(base_or_ext, reg, name)		\
-static inline bool __maybe_unused is_##reg##_##name(struct kvm_mmu *mmu)	\
-{								\
-	return !!(mmu->mmu_role. base_or_ext . reg##_##name);	\
-}
-BUILD_MMU_ROLE_ACCESSOR(ext,  cr0, pg);
-BUILD_MMU_ROLE_ACCESSOR(base, cr0, wp);
-BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, pse);
-BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, pae);
-BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, smep);
-BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, smap);
-BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, pke);
-BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, la57);
-BUILD_MMU_ROLE_ACCESSOR(base, efer, nx);
-
 static struct kvm_mmu_role_regs vcpu_to_role_regs(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu_role_regs regs = {
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 03/12] KVM: x86: do not deliver asynchronous page faults if CR0.PG=0
  2022-02-09 17:00 [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes Paolo Bonzini
  2022-02-09 17:00 ` [PATCH 01/12] KVM: x86: host-initiated EFER.LME write affects the MMU Paolo Bonzini
  2022-02-09 17:00 ` [PATCH 02/12] KVM: MMU: move MMU role accessors to header Paolo Bonzini
@ 2022-02-09 17:00 ` Paolo Bonzini
  2022-02-10 23:10   ` Sean Christopherson
  2022-02-09 17:00 ` [PATCH 04/12] KVM: MMU: WARN if PAE roots linger after kvm_mmu_unload Paolo Bonzini
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-09 17:00 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: vkuznets, mlevitsk, dmatlack, seanjc

Enabling async page faults is nonsensical if paging is disabled, but
it is allowed because CR0.PG=0 does not clear the async page fault
MSR.  In that case, just ignore asynchronous page faults and only use the
artificial halt state, similar to what happens in guest mode if async #PF
vmexits are disabled.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/x86.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5e1298aef9e2..98aca0f2af12 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12272,7 +12272,9 @@ static inline bool apf_pageready_slot_free(struct kvm_vcpu *vcpu)
 
 static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
 {
-	if (!vcpu->arch.apf.delivery_as_pf_vmexit && is_guest_mode(vcpu))
+	if (is_guest_mode(vcpu)
+	    ? !vcpu->arch.apf.delivery_as_pf_vmexit
+	    : !is_cr0_pg(vcpu->arch.mmu))
 		return false;
 
 	if (!kvm_pv_async_pf_enabled(vcpu) ||
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 04/12] KVM: MMU: WARN if PAE roots linger after kvm_mmu_unload
  2022-02-09 17:00 [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes Paolo Bonzini
                   ` (2 preceding siblings ...)
  2022-02-09 17:00 ` [PATCH 03/12] KVM: x86: do not deliver asynchronous page faults if CR0.PG=0 Paolo Bonzini
@ 2022-02-09 17:00 ` Paolo Bonzini
  2022-02-10 23:20   ` Sean Christopherson
  2022-02-09 17:00 ` [PATCH 05/12] KVM: MMU: avoid NULL-pointer dereference on page freeing bugs Paolo Bonzini
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-09 17:00 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: vkuznets, mlevitsk, dmatlack, seanjc

kvm_mmu_unload is supposed to leave no roots behind.  After freeing the
roots, WARN if root_hpa is still valid or if any of the four pae_root
entries still holds a valid PAE root.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e0c0f0bc2e8b..7b5765ced928 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5065,12 +5065,21 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 	return r;
 }
 
+static void __kvm_mmu_unload(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
+{
+	int i;
+	kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOTS_ALL);
+	WARN_ON(VALID_PAGE(mmu->root_hpa));
+	if (mmu->pae_root) {
+		for (i = 0; i < 4; ++i)
+			WARN_ON(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
+	}
+}
+
 void kvm_mmu_unload(struct kvm_vcpu *vcpu)
 {
-	kvm_mmu_free_roots(vcpu, &vcpu->arch.root_mmu, KVM_MMU_ROOTS_ALL);
-	WARN_ON(VALID_PAGE(vcpu->arch.root_mmu.root_hpa));
-	kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL);
-	WARN_ON(VALID_PAGE(vcpu->arch.guest_mmu.root_hpa));
+	__kvm_mmu_unload(vcpu, &vcpu->arch.root_mmu);
+	__kvm_mmu_unload(vcpu, &vcpu->arch.guest_mmu);
 }
 
 static bool need_remote_flush(u64 old, u64 new)
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 05/12] KVM: MMU: avoid NULL-pointer dereference on page freeing bugs
  2022-02-09 17:00 [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes Paolo Bonzini
                   ` (3 preceding siblings ...)
  2022-02-09 17:00 ` [PATCH 04/12] KVM: MMU: WARN if PAE roots linger after kvm_mmu_unload Paolo Bonzini
@ 2022-02-09 17:00 ` Paolo Bonzini
  2022-02-11  0:24   ` Sean Christopherson
  2022-02-09 17:00 ` [PATCH 06/12] KVM: MMU: rename kvm_mmu_reload Paolo Bonzini
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-09 17:00 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: vkuznets, mlevitsk, dmatlack, seanjc

If kvm_mmu_free_roots encounters a PAE page table where a 64-bit page
table is expected, the result is a NULL pointer dereference.  Instead
just WARN and exit.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7b5765ced928..d0f2077bd798 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3201,6 +3201,8 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
 		return;
 
 	sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
+	if (WARN_ON(!sp))
+		return;
 
 	if (is_tdp_mmu_page(sp))
 		kvm_tdp_mmu_put_root(kvm, sp, false);
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 06/12] KVM: MMU: rename kvm_mmu_reload
  2022-02-09 17:00 [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes Paolo Bonzini
                   ` (4 preceding siblings ...)
  2022-02-09 17:00 ` [PATCH 05/12] KVM: MMU: avoid NULL-pointer dereference on page freeing bugs Paolo Bonzini
@ 2022-02-09 17:00 ` Paolo Bonzini
  2022-02-11  0:27   ` Sean Christopherson
  2022-02-09 17:00 ` [PATCH 07/12] KVM: x86: use struct kvm_mmu_root_info for mmu->root Paolo Bonzini
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-09 17:00 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: vkuznets, mlevitsk, dmatlack, seanjc

The name of kvm_mmu_reload is very confusing for two reasons:
first, KVM_REQ_MMU_RELOAD actually does not call it; second,
it only does anything if there is no valid root.

Rename it to kvm_mmu_ensure_valid_root, which matches the actual
behavior better.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu.h | 2 +-
 arch/x86/kvm/x86.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b9d06a218b2c..c9f1c2162ade 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -104,7 +104,7 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu);
 void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu);
 void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
 
-static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
+static inline int kvm_mmu_ensure_valid_root(struct kvm_vcpu *vcpu)
 {
 	if (likely(vcpu->arch.mmu->root_hpa != INVALID_PAGE))
 		return 0;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 98aca0f2af12..2685fb62807e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9976,7 +9976,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		}
 	}
 
-	r = kvm_mmu_reload(vcpu);
+	r = kvm_mmu_ensure_valid_root(vcpu);
 	if (unlikely(r)) {
 		goto cancel_injection;
 	}
@@ -12164,7 +12164,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 	      work->wakeup_all)
 		return;
 
-	r = kvm_mmu_reload(vcpu);
+	r = kvm_mmu_ensure_valid_root(vcpu);
 	if (unlikely(r))
 		return;
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 07/12] KVM: x86: use struct kvm_mmu_root_info for mmu->root
  2022-02-09 17:00 [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes Paolo Bonzini
                   ` (5 preceding siblings ...)
  2022-02-09 17:00 ` [PATCH 06/12] KVM: MMU: rename kvm_mmu_reload Paolo Bonzini
@ 2022-02-09 17:00 ` Paolo Bonzini
  2022-02-11 17:39   ` Sean Christopherson
  2022-02-09 17:00 ` [PATCH 08/12] KVM: MMU: do not consult levels when freeing roots Paolo Bonzini
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-09 17:00 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: vkuznets, mlevitsk, dmatlack, seanjc

The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info.
Use the struct to have more consistency between mmu->root and
mmu->prev_roots.

The patch is entirely search and replace except for cached_root_available,
which does not need a temporary struct kvm_mmu_root_info anymore.
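
For reference, struct kvm_mmu_root_info is just the {pgd, hpa} pair that
root_pgd and root_hpa used to be (quoting from memory of
arch/x86/include/asm/kvm_host.h, so double-check the exact definition there):

	struct kvm_mmu_root_info {
		gpa_t pgd;
		hpa_t hpa;
	};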

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  3 +-
 arch/x86/kvm/mmu.h              |  4 +-
 arch/x86/kvm/mmu/mmu.c          | 69 +++++++++++++++------------------
 arch/x86/kvm/mmu/mmu_audit.c    |  4 +-
 arch/x86/kvm/mmu/paging_tmpl.h  |  2 +-
 arch/x86/kvm/mmu/tdp_mmu.c      |  2 +-
 arch/x86/kvm/mmu/tdp_mmu.h      |  2 +-
 arch/x86/kvm/vmx/nested.c       |  2 +-
 arch/x86/kvm/vmx/vmx.c          |  2 +-
 arch/x86/kvm/x86.c              |  2 +-
 10 files changed, 42 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a0d2925b6651..6da9a460e584 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -432,8 +432,7 @@ struct kvm_mmu {
 	int (*sync_page)(struct kvm_vcpu *vcpu,
 			 struct kvm_mmu_page *sp);
 	void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa);
-	hpa_t root_hpa;
-	gpa_t root_pgd;
+	struct kvm_mmu_root_info root;
 	union kvm_mmu_role mmu_role;
 	u8 root_level;
 	u8 shadow_root_level;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index c9f1c2162ade..f896c438c8ee 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -106,7 +106,7 @@ void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
 
 static inline int kvm_mmu_ensure_valid_root(struct kvm_vcpu *vcpu)
 {
-	if (likely(vcpu->arch.mmu->root_hpa != INVALID_PAGE))
+	if (likely(vcpu->arch.mmu->root.hpa != INVALID_PAGE))
 		return 0;
 
 	return kvm_mmu_load(vcpu);
@@ -128,7 +128,7 @@ static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)
 
 static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
 {
-	u64 root_hpa = vcpu->arch.mmu->root_hpa;
+	u64 root_hpa = vcpu->arch.mmu->root.hpa;
 
 	if (!VALID_PAGE(root_hpa))
 		return;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d0f2077bd798..3c3f597ea00d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2141,7 +2141,7 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
 		 * prev_root is currently only used for 64-bit hosts. So only
 		 * the active root_hpa is valid here.
 		 */
-		BUG_ON(root != vcpu->arch.mmu->root_hpa);
+		BUG_ON(root != vcpu->arch.mmu->root.hpa);
 
 		iterator->shadow_addr
 			= vcpu->arch.mmu->pae_root[(addr >> 30) & 3];
@@ -2155,7 +2155,7 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
 static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
 			     struct kvm_vcpu *vcpu, u64 addr)
 {
-	shadow_walk_init_using_root(iterator, vcpu, vcpu->arch.mmu->root_hpa,
+	shadow_walk_init_using_root(iterator, vcpu, vcpu->arch.mmu->root.hpa,
 				    addr);
 }
 
@@ -3224,7 +3224,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG);
 
 	/* Before acquiring the MMU lock, see if we need to do any real work. */
-	if (!(free_active_root && VALID_PAGE(mmu->root_hpa))) {
+	if (!(free_active_root && VALID_PAGE(mmu->root.hpa))) {
 		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
 			if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
 			    VALID_PAGE(mmu->prev_roots[i].hpa))
@@ -3244,7 +3244,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	if (free_active_root) {
 		if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
 		    (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
-			mmu_free_root_page(kvm, &mmu->root_hpa, &invalid_list);
+			mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
 		} else if (mmu->pae_root) {
 			for (i = 0; i < 4; ++i) {
 				if (!IS_VALID_PAE_ROOT(mmu->pae_root[i]))
@@ -3255,8 +3255,8 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 				mmu->pae_root[i] = INVALID_PAE_ROOT;
 			}
 		}
-		mmu->root_hpa = INVALID_PAGE;
-		mmu->root_pgd = 0;
+		mmu->root.hpa = INVALID_PAGE;
+		mmu->root.pgd = 0;
 	}
 
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
@@ -3329,10 +3329,10 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 
 	if (is_tdp_mmu_enabled(vcpu->kvm)) {
 		root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
-		mmu->root_hpa = root;
+		mmu->root.hpa = root;
 	} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
 		root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
-		mmu->root_hpa = root;
+		mmu->root.hpa = root;
 	} else if (shadow_root_level == PT32E_ROOT_LEVEL) {
 		if (WARN_ON_ONCE(!mmu->pae_root)) {
 			r = -EIO;
@@ -3347,15 +3347,15 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 			mmu->pae_root[i] = root | PT_PRESENT_MASK |
 					   shadow_me_mask;
 		}
-		mmu->root_hpa = __pa(mmu->pae_root);
+		mmu->root.hpa = __pa(mmu->pae_root);
 	} else {
 		WARN_ONCE(1, "Bad TDP root level = %d\n", shadow_root_level);
 		r = -EIO;
 		goto out_unlock;
 	}
 
-	/* root_pgd is ignored for direct MMUs. */
-	mmu->root_pgd = 0;
+	/* root.pgd is ignored for direct MMUs. */
+	mmu->root.pgd = 0;
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
 	return r;
@@ -3468,7 +3468,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	if (mmu->root_level >= PT64_ROOT_4LEVEL) {
 		root = mmu_alloc_root(vcpu, root_gfn, 0,
 				      mmu->shadow_root_level, false);
-		mmu->root_hpa = root;
+		mmu->root.hpa = root;
 		goto set_root_pgd;
 	}
 
@@ -3518,14 +3518,14 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	}
 
 	if (mmu->shadow_root_level == PT64_ROOT_5LEVEL)
-		mmu->root_hpa = __pa(mmu->pml5_root);
+		mmu->root.hpa = __pa(mmu->pml5_root);
 	else if (mmu->shadow_root_level == PT64_ROOT_4LEVEL)
-		mmu->root_hpa = __pa(mmu->pml4_root);
+		mmu->root.hpa = __pa(mmu->pml4_root);
 	else
-		mmu->root_hpa = __pa(mmu->pae_root);
+		mmu->root.hpa = __pa(mmu->pae_root);
 
 set_root_pgd:
-	mmu->root_pgd = root_pgd;
+	mmu->root.pgd = root_pgd;
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
 
@@ -3638,13 +3638,13 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 	if (vcpu->arch.mmu->direct_map)
 		return;
 
-	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
+	if (!VALID_PAGE(vcpu->arch.mmu->root.hpa))
 		return;
 
 	vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
 
 	if (vcpu->arch.mmu->root_level >= PT64_ROOT_4LEVEL) {
-		hpa_t root = vcpu->arch.mmu->root_hpa;
+		hpa_t root = vcpu->arch.mmu->root.hpa;
 		sp = to_shadow_page(root);
 
 		if (!is_unsync_root(root))
@@ -3935,7 +3935,7 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
 				struct kvm_page_fault *fault, int mmu_seq)
 {
-	struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root_hpa);
+	struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);
 
 	/* Special roots, e.g. pae_root, are not backed by shadow pages. */
 	if (sp && is_obsolete_sp(vcpu->kvm, sp))
@@ -4092,34 +4092,27 @@ static inline bool is_root_usable(struct kvm_mmu_root_info *root, gpa_t pgd,
 /*
  * Find out if a previously cached root matching the new pgd/role is available.
  * The current root is also inserted into the cache.
- * If a matching root was found, it is assigned to kvm_mmu->root_hpa and true is
+ * If a matching root was found, it is assigned to kvm_mmu->root.hpa and true is
  * returned.
- * Otherwise, the LRU root from the cache is assigned to kvm_mmu->root_hpa and
+ * Otherwise, the LRU root from the cache is assigned to kvm_mmu->root.hpa and
  * false is returned. This root should now be freed by the caller.
  */
 static bool cached_root_available(struct kvm_vcpu *vcpu, gpa_t new_pgd,
 				  union kvm_mmu_page_role new_role)
 {
 	uint i;
-	struct kvm_mmu_root_info root;
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 
-	root.pgd = mmu->root_pgd;
-	root.hpa = mmu->root_hpa;
-
-	if (is_root_usable(&root, new_pgd, new_role))
+	if (is_root_usable(&mmu->root, new_pgd, new_role))
 		return true;
 
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
-		swap(root, mmu->prev_roots[i]);
+		swap(mmu->root, mmu->prev_roots[i]);
 
-		if (is_root_usable(&root, new_pgd, new_role))
+		if (is_root_usable(&mmu->root, new_pgd, new_role))
 			break;
 	}
 
-	mmu->root_hpa = root.hpa;
-	mmu->root_pgd = root.pgd;
-
 	return i < KVM_MMU_NUM_PREV_ROOTS;
 }
 
@@ -4175,7 +4168,7 @@ static void __kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd,
 	 */
 	if (!new_role.direct)
 		__clear_sp_write_flooding_count(
-				to_shadow_page(vcpu->arch.mmu->root_hpa));
+				to_shadow_page(vcpu->arch.mmu->root.hpa));
 }
 
 void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
@@ -5071,7 +5064,7 @@ static void __kvm_mmu_unload(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
 {
 	int i;
 	kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOTS_ALL);
-	WARN_ON(VALID_PAGE(mmu->root_hpa));
+	WARN_ON(VALID_PAGE(mmu->root.hpa));
 	if (mmu->pae_root) {
 		for (i = 0; i < 4; ++i)
 			WARN_ON(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
@@ -5266,7 +5259,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
 	int r, emulation_type = EMULTYPE_PF;
 	bool direct = vcpu->arch.mmu->direct_map;
 
-	if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
+	if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
 		return RET_PF_RETRY;
 
 	r = RET_PF_INVALID;
@@ -5338,7 +5331,7 @@ void kvm_mmu_invalidate_gva(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 		return;
 
 	if (root_hpa == INVALID_PAGE) {
-		mmu->invlpg(vcpu, gva, mmu->root_hpa);
+		mmu->invlpg(vcpu, gva, mmu->root.hpa);
 
 		/*
 		 * INVLPG is required to invalidate any global mappings for the VA,
@@ -5374,7 +5367,7 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
 	uint i;
 
 	if (pcid == kvm_get_active_pcid(vcpu)) {
-		mmu->invlpg(vcpu, gva, mmu->root_hpa);
+		mmu->invlpg(vcpu, gva, mmu->root.hpa);
 		tlb_flush = true;
 	}
 
@@ -5487,8 +5480,8 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
 	struct page *page;
 	int i;
 
-	mmu->root_hpa = INVALID_PAGE;
-	mmu->root_pgd = 0;
+	mmu->root.hpa = INVALID_PAGE;
+	mmu->root.pgd = 0;
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
 		mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
 
diff --git a/arch/x86/kvm/mmu/mmu_audit.c b/arch/x86/kvm/mmu/mmu_audit.c
index f31fdb874f1f..3e5d62a25350 100644
--- a/arch/x86/kvm/mmu/mmu_audit.c
+++ b/arch/x86/kvm/mmu/mmu_audit.c
@@ -56,11 +56,11 @@ static void mmu_spte_walk(struct kvm_vcpu *vcpu, inspect_spte_fn fn)
 	int i;
 	struct kvm_mmu_page *sp;
 
-	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
+	if (!VALID_PAGE(vcpu->arch.mmu->root.hpa))
 		return;
 
 	if (vcpu->arch.mmu->root_level >= PT64_ROOT_4LEVEL) {
-		hpa_t root = vcpu->arch.mmu->root_hpa;
+		hpa_t root = vcpu->arch.mmu->root.hpa;
 
 		sp = to_shadow_page(root);
 		__mmu_spte_walk(vcpu, sp, fn, vcpu->arch.mmu->root_level);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 5b5bdac97c7b..346f3bad3cb9 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -668,7 +668,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	if (FNAME(gpte_changed)(vcpu, gw, top_level))
 		goto out_gpte_changed;
 
-	if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
+	if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
 		goto out_gpte_changed;
 
 	for (shadow_walk_init(&it, vcpu, fault->addr);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 8def8f810cb0..debf08212f12 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -657,7 +657,7 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 		else
 
 #define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end)		\
-	for_each_tdp_pte(_iter, to_shadow_page(_mmu->root_hpa), _start, _end)
+	for_each_tdp_pte(_iter, to_shadow_page(_mmu->root.hpa), _start, _end)
 
 /*
  * Yield if the MMU lock is contended or this thread needs to return control
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 3f987785702a..57c73d8f76ce 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -95,7 +95,7 @@ static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu
 static inline bool is_tdp_mmu(struct kvm_mmu *mmu)
 {
 	struct kvm_mmu_page *sp;
-	hpa_t hpa = mmu->root_hpa;
+	hpa_t hpa = mmu->root.hpa;
 
 	if (WARN_ON(!VALID_PAGE(hpa)))
 		return false;
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index c73e4d938ddc..29289ecca223 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -5466,7 +5466,7 @@ static int handle_invept(struct kvm_vcpu *vcpu)
 				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
 
 		roots_to_free = 0;
-		if (nested_ept_root_matches(mmu->root_hpa, mmu->root_pgd,
+		if (nested_ept_root_matches(mmu->root.hpa, mmu->root.pgd,
 					    operand.eptp))
 			roots_to_free |= KVM_MMU_ROOT_CURRENT;
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 70e7f00362bc..5542a2b536e0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2957,7 +2957,7 @@ static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu)
 static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
-	u64 root_hpa = mmu->root_hpa;
+	u64 root_hpa = mmu->root.hpa;
 
 	/* No flush required if the current context is invalid. */
 	if (!VALID_PAGE(root_hpa))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2685fb62807e..0d3646535cc5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -762,7 +762,7 @@ bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
 	if ((fault->error_code & PFERR_PRESENT_MASK) &&
 	    !(fault->error_code & PFERR_RSVD_MASK))
 		kvm_mmu_invalidate_gva(vcpu, fault_mmu, fault->address,
-				       fault_mmu->root_hpa);
+				       fault_mmu->root.hpa);
 
 	fault_mmu->inject_page_fault(vcpu, fault);
 	return fault->nested_page_fault;
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 08/12] KVM: MMU: do not consult levels when freeing roots
  2022-02-09 17:00 [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes Paolo Bonzini
                   ` (6 preceding siblings ...)
  2022-02-09 17:00 ` [PATCH 07/12] KVM: x86: use struct kvm_mmu_root_info for mmu->root Paolo Bonzini
@ 2022-02-09 17:00 ` Paolo Bonzini
  2022-02-11  0:41   ` Sean Christopherson
  2022-02-09 17:00 ` [PATCH 09/12] KVM: MMU: look for a cached PGD when going from 32-bit to 64-bit Paolo Bonzini
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-09 17:00 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: vkuznets, mlevitsk, dmatlack, seanjc

Right now, PGD caching requires a complicated dance of first computing
the MMU role and passing it to __kvm_mmu_new_pgd, and then separately calling
kvm_init_mmu.

Part of this is due to kvm_mmu_free_roots using mmu->root_level and
mmu->shadow_root_level to distinguish whether the page table uses a single
root or 4 PAE roots.  Because kvm_init_mmu can overwrite mmu->root_level,
kvm_mmu_free_roots must be called before kvm_init_mmu.

However, even after kvm_init_mmu there is a way to detect whether the page table
has a single root or four, because the pae_root does not have an associated
struct kvm_mmu_page.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3c3f597ea00d..95d0fa0bb876 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3219,12 +3219,15 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	struct kvm *kvm = vcpu->kvm;
 	int i;
 	LIST_HEAD(invalid_list);
-	bool free_active_root = roots_to_free & KVM_MMU_ROOT_CURRENT;
+	bool free_active_root;
 
 	BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG);
 
 	/* Before acquiring the MMU lock, see if we need to do any real work. */
-	if (!(free_active_root && VALID_PAGE(mmu->root.hpa))) {
+	free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT)
+		&& VALID_PAGE(mmu->root.hpa);
+
+	if (!free_active_root) {
 		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
 			if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
 			    VALID_PAGE(mmu->prev_roots[i].hpa))
@@ -3242,8 +3245,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 					   &invalid_list);
 
 	if (free_active_root) {
-		if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
-		    (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
+		if (to_shadow_page(mmu->root.hpa)) {
 			mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
 		} else if (mmu->pae_root) {
 			for (i = 0; i < 4; ++i) {
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 09/12] KVM: MMU: look for a cached PGD when going from 32-bit to 64-bit
  2022-02-09 17:00 [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes Paolo Bonzini
                   ` (7 preceding siblings ...)
  2022-02-09 17:00 ` [PATCH 08/12] KVM: MMU: do not consult levels when freeing roots Paolo Bonzini
@ 2022-02-09 17:00 ` Paolo Bonzini
  2022-02-11  1:32   ` Sean Christopherson
  2022-02-09 17:00 ` [PATCH 10/12] KVM: MMU: load new PGD after the shadow MMU is initialized Paolo Bonzini
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-09 17:00 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: vkuznets, mlevitsk, dmatlack, seanjc

Right now, PGD caching avoids placing a PAE root in the cache by looking at
the old value of mmu->root_level and mmu->shadow_root_level: if the old root
is a PAE one, the cache is not searched at all and the root is freed with
kvm_mmu_free_roots.

Change the logic instead to free the uncacheable root early.
This way, __kvm_mmu_new_pgd is able to look up the cache when going from
32-bit to 64-bit (if there is a hit, the invalid root becomes the least
recently used).  An example of this is nested virtualization with shadow
paging, when a 64-bit L1 runs a 32-bit L2.

As a side effect (which is actually the reason why this patch was
written), PGD caching does not use the old value of mmu->root_level
and mmu->shadow_root_level anymore.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 71 ++++++++++++++++++++++++++++++++----------
 1 file changed, 54 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 95d0fa0bb876..f61208ccce43 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4087,20 +4087,20 @@ static inline bool is_root_usable(struct kvm_mmu_root_info *root, gpa_t pgd,
 				  union kvm_mmu_page_role role)
 {
 	return (role.direct || pgd == root->pgd) &&
-	       VALID_PAGE(root->hpa) && to_shadow_page(root->hpa) &&
+	       VALID_PAGE(root->hpa) &&
 	       role.word == to_shadow_page(root->hpa)->role.word;
 }
 
 /*
  * Find out if a previously cached root matching the new pgd/role is available.
- * The current root is also inserted into the cache.
- * If a matching root was found, it is assigned to kvm_mmu->root.hpa and true is
- * returned.
- * Otherwise, the LRU root from the cache is assigned to kvm_mmu->root.hpa and
- * false is returned. This root should now be freed by the caller.
+ * If a matching root is found, it is assigned to kvm_mmu->root and
+ * true is returned.
+ * If no match is found, the current root becomes the MRU of the cache
+ * if valid (thus evicting the LRU root), kvm_mmu->root is left invalid,
+ * and false is returned.
  */
-static bool cached_root_available(struct kvm_vcpu *vcpu, gpa_t new_pgd,
-				  union kvm_mmu_page_role new_role)
+static bool cached_root_find_and_promote(struct kvm_vcpu *vcpu, gpa_t new_pgd,
+					 union kvm_mmu_page_role new_role)
 {
 	uint i;
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
@@ -4109,13 +4109,48 @@ static bool cached_root_available(struct kvm_vcpu *vcpu, gpa_t new_pgd,
 		return true;
 
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+		/*
+		 * The swaps end up rotating the cache like this:
+		 *   C   0 1 2 3   (on entry to the function)
+		 *   0   C 1 2 3
+		 *   1   C 0 2 3
+		 *   2   C 0 1 3
+		 *   3   C 0 1 2   (on exit from the loop)
+		 */
 		swap(mmu->root, mmu->prev_roots[i]);
-
 		if (is_root_usable(&mmu->root, new_pgd, new_role))
-			break;
+			return true;
 	}
 
-	return i < KVM_MMU_NUM_PREV_ROOTS;
+	kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
+	return false;
+}
+
+/*
+ * Find out if a previously cached root matching the new pgd/role is available.
+ * If a matching root is found, it is assigned to kvm_mmu->root and true
+ * is returned.  The current, invalid root goes to the bottom of the cache.
+ * If no match is found, kvm_mmu->root is left invalid and false is returned.
+ */
+static bool cached_root_find_and_replace(struct kvm_vcpu *vcpu, gpa_t new_pgd,
+					 union kvm_mmu_page_role new_role)
+{
+	uint i;
+	struct kvm_mmu *mmu = vcpu->arch.mmu;
+
+	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+		if (is_root_usable(&mmu->prev_roots[i], new_pgd, new_role))
+			goto hit;
+
+	return false;
+
+hit:
+	swap(mmu->root, mmu->prev_roots[i]);
+	/* Bubble up the remaining roots.  */
+	for (; i < KVM_MMU_NUM_PREV_ROOTS - 1; i++)
+		mmu->prev_roots[i] = mmu->prev_roots[i + 1];
+	mmu->prev_roots[i].hpa = INVALID_PAGE;
+	return true;
 }
 
 static bool fast_pgd_switch(struct kvm_vcpu *vcpu, gpa_t new_pgd,
@@ -4124,22 +4159,24 @@ static bool fast_pgd_switch(struct kvm_vcpu *vcpu, gpa_t new_pgd,
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 
 	/*
-	 * For now, limit the fast switch to 64-bit hosts+VMs in order to avoid
+	 * For now, limit the caching to 64-bit hosts+VMs in order to avoid
 	 * having to deal with PDPTEs. We may add support for 32-bit hosts/VMs
 	 * later if necessary.
 	 */
-	if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
-	    mmu->root_level >= PT64_ROOT_4LEVEL)
-		return cached_root_available(vcpu, new_pgd, new_role);
+	if (VALID_PAGE(mmu->root.hpa) && !to_shadow_page(mmu->root.hpa))
+		kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
 
-	return false;
+	if (VALID_PAGE(mmu->root.hpa))
+		return cached_root_find_and_promote(vcpu, new_pgd, new_role);
+	else
+		return cached_root_find_and_replace(vcpu, new_pgd, new_role);
 }
 
 static void __kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd,
 			      union kvm_mmu_page_role new_role)
 {
 	if (!fast_pgd_switch(vcpu, new_pgd, new_role)) {
-		kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
+		/* kvm_mmu_ensure_valid_pgd will set up a new root.  */
 		return;
 	}
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 10/12] KVM: MMU: load new PGD after the shadow MMU is initialized
  2022-02-09 17:00 [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes Paolo Bonzini
                   ` (8 preceding siblings ...)
  2022-02-09 17:00 ` [PATCH 09/12] KVM: MMU: look for a cached PGD when going from 32-bit to 64-bit Paolo Bonzini
@ 2022-02-09 17:00 ` Paolo Bonzini
  2022-02-11 17:45   ` Sean Christopherson
  2022-02-09 17:00 ` [PATCH 11/12] KVM: MMU: remove kvm_mmu_calc_root_page_role Paolo Bonzini
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-09 17:00 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: vkuznets, mlevitsk, dmatlack, seanjc

Now that __kvm_mmu_new_pgd does not look at the MMU's root_level and
shadow_root_level anymore, pull the PGD load after the initialization of
the shadow MMUs.

Besides being more intuitive, this enables future simplifications
and optimizations because it's not necessary anymore to compute the
role outside kvm_init_mmu.  In particular, kvm_mmu_reset_context was not
even attempting to use a cached PGD, so as to avoid having to figure out
the new role.
It will soon be able to follow what nested_{vmx,svm}_load_cr3 are doing,
and avoid unloading all the cached roots.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c    | 37 +++++++++++++++++--------------------
 arch/x86/kvm/svm/nested.c |  6 +++---
 arch/x86/kvm/vmx/nested.c |  6 +++---
 3 files changed, 23 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f61208ccce43..df9e0a43513c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4882,9 +4882,8 @@ void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
 
 	new_role = kvm_calc_shadow_npt_root_page_role(vcpu, &regs);
 
-	__kvm_mmu_new_pgd(vcpu, nested_cr3, new_role.base);
-
 	shadow_mmu_init_context(vcpu, context, &regs, new_role);
+	__kvm_mmu_new_pgd(vcpu, nested_cr3, new_role.base);
 }
 EXPORT_SYMBOL_GPL(kvm_init_shadow_npt_mmu);
 
@@ -4922,27 +4921,25 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
 		kvm_calc_shadow_ept_root_page_role(vcpu, accessed_dirty,
 						   execonly, level);
 
-	__kvm_mmu_new_pgd(vcpu, new_eptp, new_role.base);
-
-	if (new_role.as_u64 == context->mmu_role.as_u64)
-		return;
-
-	context->mmu_role.as_u64 = new_role.as_u64;
+	if (new_role.as_u64 != context->mmu_role.as_u64) {
+		context->mmu_role.as_u64 = new_role.as_u64;
 
-	context->shadow_root_level = level;
+		context->shadow_root_level = level;
 
-	context->ept_ad = accessed_dirty;
-	context->page_fault = ept_page_fault;
-	context->gva_to_gpa = ept_gva_to_gpa;
-	context->sync_page = ept_sync_page;
-	context->invlpg = ept_invlpg;
-	context->root_level = level;
-	context->direct_map = false;
+		context->ept_ad = accessed_dirty;
+		context->page_fault = ept_page_fault;
+		context->gva_to_gpa = ept_gva_to_gpa;
+		context->sync_page = ept_sync_page;
+		context->invlpg = ept_invlpg;
+		context->root_level = level;
+		context->direct_map = false;
+		update_permission_bitmask(context, true);
+		context->pkru_mask = 0;
+		reset_rsvds_bits_mask_ept(vcpu, context, execonly, huge_page_level);
+		reset_ept_shadow_zero_bits_mask(context, execonly);
+	}
 
-	update_permission_bitmask(context, true);
-	context->pkru_mask = 0;
-	reset_rsvds_bits_mask_ept(vcpu, context, execonly, huge_page_level);
-	reset_ept_shadow_zero_bits_mask(context, execonly);
+	__kvm_mmu_new_pgd(vcpu, new_eptp, new_role.base);
 }
 EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu);
 
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index f284e61451c8..96bab464967f 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -492,14 +492,14 @@ static int nested_svm_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
 	    CC(!load_pdptrs(vcpu, cr3)))
 		return -EINVAL;
 
-	if (!nested_npt)
-		kvm_mmu_new_pgd(vcpu, cr3);
-
 	vcpu->arch.cr3 = cr3;
 
 	/* Re-initialize the MMU, e.g. to pick up CR4 MMU role changes. */
 	kvm_init_mmu(vcpu);
 
+	if (!nested_npt)
+		kvm_mmu_new_pgd(vcpu, cr3);
+
 	return 0;
 }
 
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 29289ecca223..abfcd71f787f 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1126,15 +1126,15 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
 		return -EINVAL;
 	}
 
-	if (!nested_ept)
-		kvm_mmu_new_pgd(vcpu, cr3);
-
 	vcpu->arch.cr3 = cr3;
 	kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
 
 	/* Re-initialize the MMU, e.g. to pick up CR4 MMU role changes. */
 	kvm_init_mmu(vcpu);
 
+	if (!nested_ept)
+		kvm_mmu_new_pgd(vcpu, cr3);
+
 	return 0;
 }
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 11/12] KVM: MMU: remove kvm_mmu_calc_root_page_role
  2022-02-09 17:00 [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes Paolo Bonzini
                   ` (9 preceding siblings ...)
  2022-02-09 17:00 ` [PATCH 10/12] KVM: MMU: load new PGD after the shadow MMU is initialized Paolo Bonzini
@ 2022-02-09 17:00 ` Paolo Bonzini
  2022-02-11 17:53   ` Sean Christopherson
  2022-02-09 17:00 ` [PATCH 12/12] KVM: x86: do not unload MMU roots on all role changes Paolo Bonzini
  2022-02-09 17:07 ` [PATCH 00/12] KVM: MMU: " Sean Christopherson
  12 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-09 17:00 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: vkuznets, mlevitsk, dmatlack, seanjc

Since the guest PGD is now loaded after the MMU has been set up
completely, the desired role for a cache hit is simply the current
mmu_role.  There is no need to compute it again, so __kvm_mmu_new_pgd
can be folded into kvm_mmu_new_pgd.

For the !tdp_enabled case, it would also have been possible to use
the role that is already in vcpu->arch.mmu.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 29 ++++-------------------------
 1 file changed, 4 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index df9e0a43513c..38b40ddcaad7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -190,8 +190,6 @@ struct kmem_cache *mmu_page_header_cache;
 static struct percpu_counter kvm_total_used_mmu_pages;
 
 static void mmu_spte_set(u64 *sptep, u64 spte);
-static union kvm_mmu_page_role
-kvm_mmu_calc_root_page_role(struct kvm_vcpu *vcpu);
 
 struct kvm_mmu_role_regs {
 	const unsigned long cr0;
@@ -4172,9 +4170,9 @@ static bool fast_pgd_switch(struct kvm_vcpu *vcpu, gpa_t new_pgd,
 		return cached_root_find_and_replace(vcpu, new_pgd, new_role);
 }
 
-static void __kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd,
-			      union kvm_mmu_page_role new_role)
+void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
 {
+	union kvm_mmu_page_role new_role = vcpu->arch.mmu->mmu_role.base;
 	if (!fast_pgd_switch(vcpu, new_pgd, new_role)) {
 		/* kvm_mmu_ensure_valid_pgd will set up a new root.  */
 		return;
@@ -4209,11 +4207,6 @@ static void __kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd,
 		__clear_sp_write_flooding_count(
 				to_shadow_page(vcpu->arch.mmu->root.hpa));
 }
-
-void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
-{
-	__kvm_mmu_new_pgd(vcpu, new_pgd, kvm_mmu_calc_root_page_role(vcpu));
-}
 EXPORT_SYMBOL_GPL(kvm_mmu_new_pgd);
 
 static unsigned long get_cr3(struct kvm_vcpu *vcpu)
@@ -4883,7 +4876,7 @@ void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
 	new_role = kvm_calc_shadow_npt_root_page_role(vcpu, &regs);
 
 	shadow_mmu_init_context(vcpu, context, &regs, new_role);
-	__kvm_mmu_new_pgd(vcpu, nested_cr3, new_role.base);
+	kvm_mmu_new_pgd(vcpu, nested_cr3);
 }
 EXPORT_SYMBOL_GPL(kvm_init_shadow_npt_mmu);
 
@@ -4939,7 +4932,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
 		reset_ept_shadow_zero_bits_mask(context, execonly);
 	}
 
-	__kvm_mmu_new_pgd(vcpu, new_eptp, new_role.base);
+	kvm_mmu_new_pgd(vcpu, new_eptp);
 }
 EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu);
 
@@ -5024,20 +5017,6 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_init_mmu);
 
-static union kvm_mmu_page_role
-kvm_mmu_calc_root_page_role(struct kvm_vcpu *vcpu)
-{
-	struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu);
-	union kvm_mmu_role role;
-
-	if (tdp_enabled)
-		role = kvm_calc_tdp_mmu_root_page_role(vcpu, &regs, true);
-	else
-		role = kvm_calc_shadow_mmu_root_page_role(vcpu, &regs, true);
-
-	return role.base;
-}
-
 void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
 {
 	/*
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 12/12] KVM: x86: do not unload MMU roots on all role changes
  2022-02-09 17:00 [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes Paolo Bonzini
                   ` (10 preceding siblings ...)
  2022-02-09 17:00 ` [PATCH 11/12] KVM: MMU: remove kvm_mmu_calc_root_page_role Paolo Bonzini
@ 2022-02-09 17:00 ` Paolo Bonzini
  2022-02-11  9:08   ` Nikunj A. Dadhania
  2022-02-11 18:48   ` Sean Christopherson
  2022-02-09 17:07 ` [PATCH 00/12] KVM: MMU: " Sean Christopherson
  12 siblings, 2 replies; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-09 17:00 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: vkuznets, mlevitsk, dmatlack, seanjc

kvm_mmu_reset_context is called on all role changes and right now it
calls kvm_mmu_unload.  With the legacy MMU this is a relatively cheap
operation; the previous PGDs remain in the hash table and are picked
up immediately on the next page fault.  With the TDP MMU, however, the
roots are thrown away for good and a full rebuild of the page tables is
necessary, which is many times more expensive.

Fortunately, throwing away the roots is not necessary except when
the manual says a TLB flush is required:

- changing CR0.PG from 1 to 0 (because it flushes the TLB according to
  the x86 architecture specification)

- changing CPUID (which changes the interpretation of page tables in
  ways not reflected by the role).

- changing CR4.SMEP from 0 to 1 (not doing so actually breaks access.c!)

Except for these cases, once the MMU has updated the CPU/MMU roles
and metadata it is enough to force-reload the current value of CR3.
KVM will search the cached roots for an entry with the right role and
PGD, and create a new root only if the cache misses.
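
In pseudo-code, the idea is (a summary only, not the literal hunks below):

	/* Role change that does not require zapping, e.g. CR0.WP toggling: */
	kvm_init_mmu(vcpu);                      /* recompute role/callbacks */
	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);   /* reuse a cached root if any */

	/* CR0.PG 1->0 or CPUID change: the roots really must go away: */
	kvm_mmu_unload(vcpu);
	kvm_mmu_reset_context(vcpu);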

Measuring with vmexit.flat from kvm-unit-tests shows the following
improvement:

             TDP         legacy       shadow
   before    46754       5096         5150
   after     4879        4875         5006

which is for very small page tables.  The impact is however much larger
when running as an L1 hypervisor, because the rebuilt page tables cause
extra shadowing work for L0.

Reported-by: Brad Spengler <spender@grsecurity.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c |  7 ++++---
 arch/x86/kvm/x86.c     | 27 ++++++++++++++++++---------
 2 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 38b40ddcaad7..dbd4e98ba426 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5020,8 +5020,8 @@ EXPORT_SYMBOL_GPL(kvm_init_mmu);
 void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
 {
 	/*
-	 * Invalidate all MMU roles to force them to reinitialize as CPUID
-	 * information is factored into reserved bit calculations.
+	 * Invalidate all MMU roles and roots to force them to reinitialize,
+	 * as CPUID information is factored into reserved bit calculations.
 	 *
 	 * Correctly handling multiple vCPU models with respect to paging and
 	 * physical address properties) in a single VM would require tracking
@@ -5034,6 +5034,7 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	vcpu->arch.root_mmu.mmu_role.ext.valid = 0;
 	vcpu->arch.guest_mmu.mmu_role.ext.valid = 0;
 	vcpu->arch.nested_mmu.mmu_role.ext.valid = 0;
+	kvm_mmu_unload(vcpu);
 	kvm_mmu_reset_context(vcpu);
 
 	/*
@@ -5045,8 +5046,8 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
 
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
 {
-	kvm_mmu_unload(vcpu);
 	kvm_init_mmu(vcpu);
+	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_reset_context);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0d3646535cc5..97c4f5fc291f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -873,8 +873,12 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
 		kvm_async_pf_hash_reset(vcpu);
 	}
 
-	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
+	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
+		/* Flush the TLB if CR0 is changed 1 -> 0.  */
+		if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
+			kvm_mmu_unload(vcpu);
 		kvm_mmu_reset_context(vcpu);
+	}
 
 	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
 	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
@@ -1067,15 +1071,18 @@ void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned lon
 	 * free them all.  KVM_REQ_MMU_RELOAD is fit for the both cases; it
 	 * is slow, but changing CR4.PCIDE is a rare case.
 	 *
-	 * If CR4.PGE is changed, the guest TLB must be flushed.
+	 * Setting SMEP also needs to flush the TLB, in addition to resetting
+	 * the MMU.
 	 *
-	 * Note: resetting MMU is a superset of KVM_REQ_MMU_RELOAD and
-	 * KVM_REQ_MMU_RELOAD is a superset of KVM_REQ_TLB_FLUSH_GUEST, hence
-	 * the usage of "else if".
+	 * If CR4.PGE is changed, the guest TLB must be flushed.  Because
+	 * the shadow MMU ignores global pages, this bit is not part of
+	 * KVM_MMU_CR4_ROLE_BITS.
 	 */
-	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
+	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS) {
 		kvm_mmu_reset_context(vcpu);
-	else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
+		if ((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP))
+			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
+	} else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
 		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
 	else if ((cr4 ^ old_cr4) & X86_CR4_PGE)
 		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
@@ -11329,8 +11336,10 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	 * paging related bits are ignored if paging is disabled, i.e. CR0.WP,
 	 * CR4, and EFER changes are all irrelevant if CR0.PG was '0'.
 	 */
-	if (old_cr0 & X86_CR0_PG)
-		kvm_mmu_reset_context(vcpu);
+	if (old_cr0 & X86_CR0_PG) {
+		kvm_mmu_unload(vcpu);
+		kvm_init_mmu(vcpu);
+	}
 
 	/*
 	 * Intel's SDM states that all TLB entries are flushed on INIT.  AMD's
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes
  2022-02-09 17:00 [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes Paolo Bonzini
                   ` (11 preceding siblings ...)
  2022-02-09 17:00 ` [PATCH 12/12] KVM: x86: do not unload MMU roots on all role changes Paolo Bonzini
@ 2022-02-09 17:07 ` Sean Christopherson
  2022-02-09 17:11   ` Paolo Bonzini
  12 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2022-02-09 17:07 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> The TDP MMU has a performance regression compared to the legacy MMU
> when CR0 changes often.  This was reported for the grsecurity kernel,
> which uses CR0.WP to implement kernel W^X.  In that case, each change to
> CR0.WP unloads the MMU and causes a lot of unnecessary work.  When running
> nested, this can even cause the L1 to hardly make progress, as the L0
> hypervisor it is overwhelmed by the amount of MMU work that is needed.

FWIW, my flushing/zapping series fixes this by doing the teardown in an async
worker.  There's even a selftest for this exact case :-)

https://lore.kernel.org/all/20211223222318.1039223-1-seanjc@google.com

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes
  2022-02-09 17:07 ` [PATCH 00/12] KVM: MMU: " Sean Christopherson
@ 2022-02-09 17:11   ` Paolo Bonzini
  2022-02-09 17:16     ` Sean Christopherson
  0 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-09 17:11 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On 2/9/22 18:07, Sean Christopherson wrote:
> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
>> The TDP MMU has a performance regression compared to the legacy MMU
>> when CR0 changes often.  This was reported for the grsecurity kernel,
>> which uses CR0.WP to implement kernel W^X.  In that case, each change to
>> CR0.WP unloads the MMU and causes a lot of unnecessary work.  When running
>> nested, this can even cause the L1 to hardly make progress, as the L0
>> hypervisor it is overwhelmed by the amount of MMU work that is needed.
> 
> FWIW, my flushing/zapping series fixes this by doing the teardown in an async
> worker.  There's even a selftest for this exact case :-)
> 
> https://lore.kernel.org/all/20211223222318.1039223-1-seanjc@google.com

I'll check it out (it's next on my list as soon as I finally push 
kvm/{master,next}, which in turn was blocked by this work).

But not zapping the roots is even better---especially when KVM is nested 
and the TDP MMU's page table rebuild is very heavy on L0.  I'm not sure 
if there are any (cumulative) stats that capture the optimization, but 
if not I'll add them.

Paolo


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes
  2022-02-09 17:11   ` Paolo Bonzini
@ 2022-02-09 17:16     ` Sean Christopherson
  0 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2022-02-09 17:16 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> On 2/9/22 18:07, Sean Christopherson wrote:
> > On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> > > The TDP MMU has a performance regression compared to the legacy MMU
> > > when CR0 changes often.  This was reported for the grsecurity kernel,
> > > which uses CR0.WP to implement kernel W^X.  In that case, each change to
> > > CR0.WP unloads the MMU and causes a lot of unnecessary work.  When running
> > > nested, this can even cause the L1 to hardly make progress, as the L0
> > > hypervisor it is overwhelmed by the amount of MMU work that is needed.
> > 
> > FWIW, my flushing/zapping series fixes this by doing the teardown in an async
> > worker.  There's even a selftest for this exact case :-)
> > 
> > https://lore.kernel.org/all/20211223222318.1039223-1-seanjc@google.com
> 
> I'll check it out (it's next on my list as soon as I finally push
> kvm/{master,next}, which in turn was blocked by this work).

No rush, I need to spin a new version (rebase, and hopefully drop unnecessarily
complex behavior).

> But not zapping the roots is even better

No argument there :-)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/12] KVM: x86: host-initiated EFER.LME write affects the MMU
  2022-02-09 17:00 ` [PATCH 01/12] KVM: x86: host-initiated EFER.LME write affects the MMU Paolo Bonzini
@ 2022-02-10 22:49   ` Sean Christopherson
  0 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2022-02-10 22:49 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack, stable

On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> While the guest runs, EFER.LME cannot change unless CR0.PG is clear, and therefore
> EFER.NX is the only bit that can affect the MMU role.  However, set_efer accepts
> a host-initiated change to EFER.LME even with CR0.PG=1.  In that case, the
> MMU has to be reset.
> 
> Fixes: 11988499e62b ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes")
> Cc: stable@vger.kernel.org
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---

Ugh, but KVM_SET_SREGS handles this... It's basically KVM's equivalent of VMX putting
EFER in the VMCS, but then also allowing EFER in the load/store lists.

Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 02/12] KVM: MMU: move MMU role accessors to header
  2022-02-09 17:00 ` [PATCH 02/12] KVM: MMU: move MMU role accessors to header Paolo Bonzini
@ 2022-02-10 23:00   ` Sean Christopherson
  0 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2022-02-10 23:00 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> We will use is_cr0_pg to check whether a page fault can be delivered.

I strongly prefer we keep these MMU-only, the terse names were deemed safe
specifically because they are encapsulated in mmu.c.  The async #PF case can use
is_paging(vcpu).

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 03/12] KVM: x86: do not deliver asynchronous page faults if CR0.PG=0
  2022-02-09 17:00 ` [PATCH 03/12] KVM: x86: do not deliver asynchronous page faults if CR0.PG=0 Paolo Bonzini
@ 2022-02-10 23:10   ` Sean Christopherson
  2022-02-10 23:14     ` Sean Christopherson
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2022-02-10 23:10 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> Enabling async page faults is nonsensical if paging is disabled, but
> it is allowed because CR0.PG=0 does not clear the async page fault
> MSR.  Just ignore them and only use the artificial halt state,
> similar to what happens in guest mode if async #PF vmexits are disabled.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/x86.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5e1298aef9e2..98aca0f2af12 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12272,7 +12272,9 @@ static inline bool apf_pageready_slot_free(struct kvm_vcpu *vcpu)
>  
>  static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
>  {
> -	if (!vcpu->arch.apf.delivery_as_pf_vmexit && is_guest_mode(vcpu))
> +	if (is_guest_mode(vcpu)
> +	    ? !vcpu->arch.apf.delivery_as_pf_vmexit
> +	    : !is_cr0_pg(vcpu->arch.mmu))

As suggested in the previous patch, is_paging(vcpu).

I find a more traditional if-elif marginally easier to understand the implication
that CR0.PG is L2's CR0 and thus irrelevant if is_guest_mode()==true.  Not a big
deal though.

	if (is_guest_mode(vcpu)) {
		if (!vcpu->arch.apf.delivery_as_pf_vmexit)
			return false;
	} else if (!is_paging(vcpu)) {
		return false;
	}

>  		return false;
>  
>  	if (!kvm_pv_async_pf_enabled(vcpu) ||
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 03/12] KVM: x86: do not deliver asynchronous page faults if CR0.PG=0
  2022-02-10 23:10   ` Sean Christopherson
@ 2022-02-10 23:14     ` Sean Christopherson
  2022-02-10 23:16       ` Sean Christopherson
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2022-02-10 23:14 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Thu, Feb 10, 2022, Sean Christopherson wrote:
> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> > Enabling async page faults is nonsensical if paging is disabled, but
> > it is allowed because CR0.PG=0 does not clear the async page fault
> > MSR.  Just ignore them and only use the artificial halt state,
> > similar to what happens in guest mode if async #PF vmexits are disabled.
> > 
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > ---
> >  arch/x86/kvm/x86.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 5e1298aef9e2..98aca0f2af12 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -12272,7 +12272,9 @@ static inline bool apf_pageready_slot_free(struct kvm_vcpu *vcpu)
> >  
> >  static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
> >  {
> > -	if (!vcpu->arch.apf.delivery_as_pf_vmexit && is_guest_mode(vcpu))
> > +	if (is_guest_mode(vcpu)
> > +	    ? !vcpu->arch.apf.delivery_as_pf_vmexit
> > +	    : !is_cr0_pg(vcpu->arch.mmu))
> 
> As suggested in the previous patch, is_paging(vcpu).
> 
> I find a more traditional if-elif marginally easier to understand the implication
> that CR0.PG is L2's CR0 and thus irrelevant if is_guest_mode()==true.  Not a big
> deal though.
> 
> 	if (is_guest_mode(vcpu)) {
> 		if (!vcpu->arch.apf.delivery_as_pf_vmexit)
> 			return false;
> 	} else if (!is_paging(vcpu)) {
> 		return false;
> 	}

Alternatively, what about reordering and refactoring to yield:

	if (!kvm_pv_async_pf_enabled(vcpu))
		return false;

	if (vcpu->arch.apf.send_user_only &&
	    static_call(kvm_x86_get_cpl)(vcpu) == 0)
		return false;

	/* L1 CR0.PG=1 is guaranteed if the vCPU is in guest mode (L2). */
	if (is_guest_mode(vcpu)
		return !vcpu->arch.apf.delivery_as_pf_vmexit;

	return is_cr0_pg(vcpu->arch.mmu);

There isn't any need to "batch" the if statements.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 03/12] KVM: x86: do not deliver asynchronous page faults if CR0.PG=0
  2022-02-10 23:14     ` Sean Christopherson
@ 2022-02-10 23:16       ` Sean Christopherson
  2022-02-11 11:16         ` Paolo Bonzini
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2022-02-10 23:16 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Thu, Feb 10, 2022, Sean Christopherson wrote:
> On Thu, Feb 10, 2022, Sean Christopherson wrote:
> > On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> > > Enabling async page faults is nonsensical if paging is disabled, but
> > > it is allowed because CR0.PG=0 does not clear the async page fault
> > > MSR.  Just ignore them and only use the artificial halt state,
> > > similar to what happens in guest mode if async #PF vmexits are disabled.
> > > 
> > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > ---
> > >  arch/x86/kvm/x86.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 5e1298aef9e2..98aca0f2af12 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -12272,7 +12272,9 @@ static inline bool apf_pageready_slot_free(struct kvm_vcpu *vcpu)
> > >  
> > >  static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
> > >  {
> > > -	if (!vcpu->arch.apf.delivery_as_pf_vmexit && is_guest_mode(vcpu))
> > > +	if (is_guest_mode(vcpu)
> > > +	    ? !vcpu->arch.apf.delivery_as_pf_vmexit
> > > +	    : !is_cr0_pg(vcpu->arch.mmu))
> > 
> > As suggested in the previous patch, is_paging(vcpu).
> > 
> > I find a more traditional if-elif marginally easier to understand the implication
> > that CR0.PG is L2's CR0 and thus irrelevant if is_guest_mode()==true.  Not a big
> > deal though.
> > 
> > 	if (is_guest_mode(vcpu)) {
> > 		if (!vcpu->arch.apf.delivery_as_pf_vmexit)
> > 			return false;
> > 	} else if (!is_paging(vcpu)) {
> > 		return false;
> > 	}
> 
> Alternatively, what about reordering and refactoring to yield:
> 
> 	if (!kvm_pv_async_pf_enabled(vcpu))
> 		return false;
> 
> 	if (vcpu->arch.apf.send_user_only &&
> 	    static_call(kvm_x86_get_cpl)(vcpu) == 0)
> 		return false;
> 
> 	/* L1 CR0.PG=1 is guaranteed if the vCPU is in guest mode (L2). */
> 	if (is_guest_mode(vcpu)
> 		return !vcpu->arch.apf.delivery_as_pf_vmexit;
> 
> 	return is_cr0_pg(vcpu->arch.mmu);
> 
> There isn't any need to "batch" the if statements.

Third time's a charm...

	if (!kvm_pv_async_pf_enabled(vcpu))
		return false;

	if (vcpu->arch.apf.send_user_only &&
	    static_call(kvm_x86_get_cpl)(vcpu) == 0)
		return false;

	/* L1 CR0.PG=1 is guaranteed if the vCPU is in guest mode (L2). */
	if (is_guest_mode(vcpu))
		return !vcpu->arch.apf.delivery_as_pf_vmexit;

	return is_paging(vcpu);



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 04/12] KVM: MMU: WARN if PAE roots linger after kvm_mmu_unload
  2022-02-09 17:00 ` [PATCH 04/12] KVM: MMU: WARN if PAE roots linger after kvm_mmu_unload Paolo Bonzini
@ 2022-02-10 23:20   ` Sean Christopherson
  2022-02-11 11:18     ` Paolo Bonzini
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2022-02-10 23:20 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e0c0f0bc2e8b..7b5765ced928 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5065,12 +5065,21 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
>  	return r;
>  }
>  
> +static void __kvm_mmu_unload(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> +{
> +	int i;
> +	kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOTS_ALL);
> +	WARN_ON(VALID_PAGE(mmu->root_hpa));
> +	if (mmu->pae_root) {
> +		for (i = 0; i < 4; ++i)
> +			WARN_ON(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
> +	}

I'm somewhat ambivalent, but if you're at all on the fence, I vote to drop this
one.  I've always viewed the WARN on root_hpa as gratuitous.

But, if it helped during development, then why not...

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 05/12] KVM: MMU: avoid NULL-pointer dereference on page freeing bugs
  2022-02-09 17:00 ` [PATCH 05/12] KVM: MMU: avoid NULL-pointer dereference on page freeing bugs Paolo Bonzini
@ 2022-02-11  0:24   ` Sean Christopherson
  2022-02-11 11:21     ` Paolo Bonzini
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2022-02-11  0:24 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> If kvm_mmu_free_roots encounters a PAE page table where a 64-bit page
> table is expected, the result is a NULL pointer dereference.  Instead
> just WARN and exit.

This confused the heck out of me, because we obviously free PAE page tables.  What
we don't do is back the root that gets shoved into CR3 with a shadow page.  It'd
be especially confusing without the context that this WARN was helpful during
related development, as it's not super obvious why mmu_free_root_page() is a special
snowflake and deserves a WARN.

Something like this?

  WARN and bail if KVM attempts to free a root that isn't backed by a shadow
  page.  KVM allocates a bare page for "special" roots, e.g. when using PAE
  paging or shadowing 2/3/4-level page tables with 4/5-level, and so root_hpa
  will be valid but won't be backed by a shadow page.  It's all too easy to
  blindly call mmu_free_root_page() on root_hpa, be nice and WARN instead of
  crashing KVM and possibly the kernel.

> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7b5765ced928..d0f2077bd798 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3201,6 +3201,8 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
>  		return;
>  
>  	sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
> +	if (WARN_ON(!sp))

Should this be KVM_BUG_ON()?  I.e. when you triggered these, would continuing on
potentially corrupt guest data, or was it truly benign-ish?

> +		return;
>  
>  	if (is_tdp_mmu_page(sp))
>  		kvm_tdp_mmu_put_root(kvm, sp, false);
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 06/12] KVM: MMU: rename kvm_mmu_reload
  2022-02-09 17:00 ` [PATCH 06/12] KVM: MMU: rename kvm_mmu_reload Paolo Bonzini
@ 2022-02-11  0:27   ` Sean Christopherson
  2022-02-11 10:07     ` Paolo Bonzini
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2022-02-11  0:27 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> The name of kvm_mmu_reload is very confusing for two reasons:
> first, KVM_REQ_MMU_RELOAD actually does not call it; second,
> it only does anything if there is no valid root.
> 
> Rename it to kvm_mmu_ensure_valid_root, which matches the actual
> behavior better.

100% agree that kvm_mmu_reload() is a terrible name, but kvm_mmu_ensure_valid_root()
isn't much better, e.g. it sounds like a sanity check and nothing more.

Maybe just be very literal?

  kvm_mmu_load_if_necessary()
  kvm_mmu_load_if_invalid()

Or follow cond_resched()?

  kvm_mmu_cond_load()

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 08/12] KVM: MMU: do not consult levels when freeing roots
  2022-02-09 17:00 ` [PATCH 08/12] KVM: MMU: do not consult levels when freeing roots Paolo Bonzini
@ 2022-02-11  0:41   ` Sean Christopherson
  2022-02-11  0:54     ` Sean Christopherson
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2022-02-11  0:41 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> Right now, PGD caching requires a complicated dance of first computing
> the MMU role and passing it to __kvm_mmu_new_pgd, and then separately calling

Nit, adding () after function names helps readers easily recognize when you're
talking about a specific function, e.g. as opposed to a concept or whatever.

> kvm_init_mmu.
> 
> Part of this is due to kvm_mmu_free_roots using mmu->root_level and
> mmu->shadow_root_level to distinguish whether the page table uses a single
> root or 4 PAE roots.  Because kvm_init_mmu can overwrite mmu->root_level,
> kvm_mmu_free_roots must be called before kvm_init_mmu.
> 
> However, even after kvm_init_mmu there is a way to detect whether the page table
> has a single root or four, because the pae_root does not have an associated
> struct kvm_mmu_page.

Suggest a reword on the final paragraph, because there's a discrepancy with the
code (which handles 0, 1, or 4 "roots", versus just "single or four").

  However, even after kvm_init_mmu() there is a way to detect whether the
  page table may hold PAE roots, as root.hpa isn't backed by a shadow when
  it points at PAE roots.

> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c3f597ea00d..95d0fa0bb876 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3219,12 +3219,15 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  	struct kvm *kvm = vcpu->kvm;
>  	int i;
>  	LIST_HEAD(invalid_list);
> -	bool free_active_root = roots_to_free & KVM_MMU_ROOT_CURRENT;
> +	bool free_active_root;
>  
>  	BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG);
>  
>  	/* Before acquiring the MMU lock, see if we need to do any real work. */
> -	if (!(free_active_root && VALID_PAGE(mmu->root.hpa))) {
> +	free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT)
> +		&& VALID_PAGE(mmu->root.hpa);

	free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT) &&
			   VALID_PAGE(mmu->root.hpa);

Isn't this a separate bug fix?  E.g. call kvm_mmu_unload() without a valid current
root, but with valid previous roots?  In which case we'd try to free garbage, no?
			   
> +
> +	if (!free_active_root) {
>  		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
>  			if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
>  			    VALID_PAGE(mmu->prev_roots[i].hpa))
> @@ -3242,8 +3245,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  					   &invalid_list);
>  
>  	if (free_active_root) {
> -		if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
> -		    (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
> +		if (to_shadow_page(mmu->root.hpa)) {
>  			mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
>  		} else if (mmu->pae_root) {
>  			for (i = 0; i < 4; ++i) {
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 08/12] KVM: MMU: do not consult levels when freeing roots
  2022-02-11  0:41   ` Sean Christopherson
@ 2022-02-11  0:54     ` Sean Christopherson
  2022-02-11  1:07       ` Paolo Bonzini
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2022-02-11  0:54 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Fri, Feb 11, 2022, Sean Christopherson wrote:
> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> > Right now, PGD caching requires a complicated dance of first computing
> > the MMU role and passing it to __kvm_mmu_new_pgd, and then separately calling
> 
> Nit, adding () after function names helps readers easily recognize when you're
> talking about a specific function, e.g. as opposed to a concept or whatever.
> 
> > kvm_init_mmu.
> > 
> > Part of this is due to kvm_mmu_free_roots using mmu->root_level and
> > mmu->shadow_root_level to distinguish whether the page table uses a single
> > root or 4 PAE roots.  Because kvm_init_mmu can overwrite mmu->root_level,
> > kvm_mmu_free_roots must be called before kvm_init_mmu.
> > 
> > However, even after kvm_init_mmu there is a way to detect whether the page table
> > has a single root or four, because the pae_root does not have an associated
> > struct kvm_mmu_page.
> 
> Suggest a reword on the final paragraph, because there's a discrepancy with the
> code (which handles 0, 1, or 4 "roots", versus just "single or four").
> 
>   However, even after kvm_init_mmu() there is a way to detect whether the
>   page table may hold PAE roots, as root.hpa isn't backed by a shadow when
>   it points at PAE roots.
> 
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 10 ++++++----
> >  1 file changed, 6 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 3c3f597ea00d..95d0fa0bb876 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3219,12 +3219,15 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> >  	struct kvm *kvm = vcpu->kvm;
> >  	int i;
> >  	LIST_HEAD(invalid_list);
> > -	bool free_active_root = roots_to_free & KVM_MMU_ROOT_CURRENT;
> > +	bool free_active_root;
> >  
> >  	BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG);
> >  
> >  	/* Before acquiring the MMU lock, see if we need to do any real work. */
> > -	if (!(free_active_root && VALID_PAGE(mmu->root.hpa))) {
> > +	free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT)
> > +		&& VALID_PAGE(mmu->root.hpa);
> 
> 	free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT) &&
> 			   VALID_PAGE(mmu->root.hpa);
> 
> Isn't this a separate bug fix?  E.g. call kvm_mmu_unload() without a valid current
> root, but with valid previous roots?  In which case we'd try to free garbage, no?
> 			   
> > +
> > +	if (!free_active_root) {
> >  		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> >  			if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
> >  			    VALID_PAGE(mmu->prev_roots[i].hpa))
> > @@ -3242,8 +3245,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> >  					   &invalid_list);
> >  
> >  	if (free_active_root) {
> > -		if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
> > -		    (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
> > +		if (to_shadow_page(mmu->root.hpa)) {
> >  			mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
> >  		} else if (mmu->pae_root) {

Gah, this is technically wrong.  It shouldn't truly matter, but it's wrong.  root.hpa
will not be backed by shadow page if the root is pml4_root or pml5_root, in which
case freeing the PAE root is wrong.  They should obviously be invalid already, but
it's a little confusing because KVM wanders down a path that may not be relevant
to the current mode.

For clarity, I think it's worth doing:

		} else if (mmu->root.hpa == __pa(mmu->pae_root)) {


> >  			for (i = 0; i < 4; ++i) {
> > -- 
> > 2.31.1
> > 
> > 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 08/12] KVM: MMU: do not consult levels when freeing roots
  2022-02-11  0:54     ` Sean Christopherson
@ 2022-02-11  1:07       ` Paolo Bonzini
  2022-02-11  1:35         ` Sean Christopherson
  0 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-11  1:07 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On 2/11/22 01:54, Sean Christopherson wrote:
> On Fri, Feb 11, 2022, Sean Christopherson wrote:
>> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
>>> Right now, PGD caching requires a complicated dance of first computing
>>> the MMU role and passing it to __kvm_mmu_new_pgd, and then separately calling
>>
>> Nit, adding () after function names helps readers easily recognize when you're
>> talking about a specific function, e.g. as opposed to a concept or whatever.
>>
>>> kvm_init_mmu.
>>>
>>> Part of this is due to kvm_mmu_free_roots using mmu->root_level and
>>> mmu->shadow_root_level to distinguish whether the page table uses a single
>>> root or 4 PAE roots.  Because kvm_init_mmu can overwrite mmu->root_level,
>>> kvm_mmu_free_roots must be called before kvm_init_mmu.
>>>
>>> However, even after kvm_init_mmu there is a way to detect whether the page table
>>> has a single root or four, because the pae_root does not have an associated
>>> struct kvm_mmu_page.
>>
>> Suggest a reword on the final paragraph, because there's a discrepancy with the
>> code (which handles 0, 1, or 4 "roots", versus just "single or four").
>>
>>    However, even after kvm_init_mmu() there is a way to detect whether the
>>    page table may hold PAE roots, as root.hpa isn't backed by a shadow when
>>    it points at PAE roots.
>>
>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>>> ---
>>>   arch/x86/kvm/mmu/mmu.c | 10 ++++++----
>>>   1 file changed, 6 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index 3c3f597ea00d..95d0fa0bb876 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -3219,12 +3219,15 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>>>   	struct kvm *kvm = vcpu->kvm;
>>>   	int i;
>>>   	LIST_HEAD(invalid_list);
>>> -	bool free_active_root = roots_to_free & KVM_MMU_ROOT_CURRENT;
>>> +	bool free_active_root;
>>>   
>>>   	BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG);
>>>   
>>>   	/* Before acquiring the MMU lock, see if we need to do any real work. */
>>> -	if (!(free_active_root && VALID_PAGE(mmu->root.hpa))) {
>>> +	free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT)
>>> +		&& VALID_PAGE(mmu->root.hpa);
>>
>> 	free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT) &&
>> 			   VALID_PAGE(mmu->root.hpa);
>>
>> Isn't this a separate bug fix?  E.g. call kvm_mmu_unload() without a valid current
>> root, but with valid previous roots?  In which case we'd try to free garbage, no?

mmu_free_root_page checks VALID_PAGE(*root_hpa).  If that's what you 
meant, then it wouldn't be a preexisting bug (and I think it'd be a 
fairly common case).

>>> +
>>> +	if (!free_active_root) {
>>>   		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
>>>   			if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
>>>   			    VALID_PAGE(mmu->prev_roots[i].hpa))
>>> @@ -3242,8 +3245,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>>>   					   &invalid_list);
>>>   
>>>   	if (free_active_root) {
>>> -		if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
>>> -		    (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
>>> +		if (to_shadow_page(mmu->root.hpa)) {
>>>   			mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
>>>   		} else if (mmu->pae_root) {
> 
> Gah, this is technically wrong.  It shouldn't truly matter, but it's wrong.  root.hpa
> will not be backed by shadow page if the root is pml4_root or pml5_root, in which
> case freeing the PAE root is wrong.  They should obviously be invalid already, but
> it's a little confusing because KVM wanders down a path that may not be relevant
> to the current mode.

pml4_root and pml5_root are dummy, and the first "real" level of page 
tables is stored in pae_root for that case too, so I think that should DTRT.

That's why I also disliked the shadow_root_level/root_level/direct 
check: even though there's half a dozen of cases involved, they all boil 
down to either 4 pae_roots or a single root with a backing kvm_mmu_page.

It's even more obscure to check shadow_root_level/root_level/direct in 
fast_pgd_switch, where it's pretty obvious that you cannot cache 4 
pae_roots in a single (hpa, pgd) pair...

Paolo


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 09/12] KVM: MMU: look for a cached PGD when going from 32-bit to 64-bit
  2022-02-09 17:00 ` [PATCH 09/12] KVM: MMU: look for a cached PGD when going from 32-bit to 64-bit Paolo Bonzini
@ 2022-02-11  1:32   ` Sean Christopherson
  2022-02-11  1:37     ` Sean Christopherson
  2022-02-11 11:45     ` Paolo Bonzini
  0 siblings, 2 replies; 50+ messages in thread
From: Sean Christopherson @ 2022-02-11  1:32 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> Right now, PGD caching avoids placing a PAE root in the cache by using the
> old value of mmu->root_level and mmu->shadow_root_level; it does not look
> for a cached PGD if the old root is a PAE one, and then frees it using
> kvm_mmu_free_roots.
> 
> Change the logic instead to free the uncacheable root early.
> This way, __kvm_new_mmu_pgd is able to look up the cache when going from
> 32-bit to 64-bit (if there is a hit, the invalid root becomes the least
> recently used).  An example of this is nested virtualization with shadow
> paging, when a 64-bit L1 runs a 32-bit L2.
> 
> As a side effect (which is actually the reason why this patch was
> written), PGD caching does not use the old value of mmu->root_level
> and mmu->shadow_root_level anymore.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 71 ++++++++++++++++++++++++++++++++----------
>  1 file changed, 54 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 95d0fa0bb876..f61208ccce43 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4087,20 +4087,20 @@ static inline bool is_root_usable(struct kvm_mmu_root_info *root, gpa_t pgd,
>  				  union kvm_mmu_page_role role)
>  {
>  	return (role.direct || pgd == root->pgd) &&
> -	       VALID_PAGE(root->hpa) && to_shadow_page(root->hpa) &&
> +	       VALID_PAGE(root->hpa) &&
>  	       role.word == to_shadow_page(root->hpa)->role.word;
>  }
>  
>  /*
>   * Find out if a previously cached root matching the new pgd/role is available.
> - * The current root is also inserted into the cache.
> - * If a matching root was found, it is assigned to kvm_mmu->root.hpa and true is
> - * returned.
> - * Otherwise, the LRU root from the cache is assigned to kvm_mmu->root.hpa and
> - * false is returned. This root should now be freed by the caller.
> + * If a matching root is found, it is assigned to kvm_mmu->root and
> + * true is returned.
> + * If no match is found, the current root becomes the MRU of the cache

This is misleading, the current root _always_ becomes the MRU of the cache, i.e.
the explanation about rotating the cache to do LRU/MRU updates happens regardless
of whether or not a root is found.

> + * if valid (thus evicting the LRU root), kvm_mmu->root is left invalid,

The "if valid" part is also misleading, this function is called iff the current
root is valid.

> + * and false is returned.
>   */
> -static bool cached_root_available(struct kvm_vcpu *vcpu, gpa_t new_pgd,
> -				  union kvm_mmu_page_role new_role)
> +static bool cached_root_find_and_promote(struct kvm_vcpu *vcpu, gpa_t new_pgd,
> +					 union kvm_mmu_page_role new_role)

I don't entirely understand what you mean by "promote", but regardless of what
is meant, I find it confusing.

If you mean the root that's found is promoted to the current root, then it's
confusing because cached_root_find_and_replace() also promotes the root it finds.

If you mean something else, then it's obviously confusing, because I don't know
what you mean :-)  If you're referring to the cached roots, that's arguably wrong
because any non-matching root, including the current root, is demoted, not promoted.

Maybe cached_root_find_and_rotate() or cached_root_find_and_age()?

>  {
>  	uint i;
>  	struct kvm_mmu *mmu = vcpu->arch.mmu;
> @@ -4109,13 +4109,48 @@ static bool cached_root_available(struct kvm_vcpu *vcpu, gpa_t new_pgd,
>  		return true;
>  
>  	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
> +		/*
> +		 * The swaps end up rotating the cache like this:
> +		 *   C   0 1 2 3   (on entry to the function)
> +		 *   0   C 1 2 3
> +		 *   1   C 0 2 3
> +		 *   2   C 0 1 3
> +		 *   3   C 0 1 2   (on exit from the loop)

Woot!  I've written this down on my whiteboard sooo many times.  This definitely
warrants a virtual gold star.  :-)

        .
       ,O,
      ,OOO,
'oooooOOOOOooooo'
  `OOOOOOOOOOO`
    `OOOOOOO`
    OOOO'OOOO
   OOO'   'OOO
  O'         'O

> +		 */
>  		swap(mmu->root, mmu->prev_roots[i]);
> -
>  		if (is_root_usable(&mmu->root, new_pgd, new_role))
> -			break;
> +			return true;
>  	}
>  
> -	return i < KVM_MMU_NUM_PREV_ROOTS;
> +	kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
> +	return false;
> +}
> +
> +/*
> + * Find out if a previously cached root matching the new pgd/role is available.
> + * If a matching root is found, it is assigned to kvm_mmu->root and true
> + * is returned.  The current, invalid root goes to the bottom of the cache.

Can you phrase this as LRU instead of "bottom of the cache"?  E.g. something like
"The current, invalid root becomes the LRU entry in the cache."

> + * If no match is found, kvm_mmu->root is left invalid and false is returned.
> + */
> +static bool cached_root_find_and_replace(struct kvm_vcpu *vcpu, gpa_t new_pgd,
> +					 union kvm_mmu_page_role new_role)
> +{
> +	uint i;
> +	struct kvm_mmu *mmu = vcpu->arch.mmu;

Hmm, while we're refactoring this, I'd really prefer we not grab vcpu->arch.mmu
way down in the helpers.  @vcpu is needed only for the request, so what about
doing this?

	if (!fast_pgd_switch(vcpu, new_pgd, new_role)) {
		/*
		 * <whatever kvm_mmu_reload() becomes> will set up a new root
		 * prior to the next VM-Enter.  Free the current root if it's
		 * valid, i.e. if a valid root was evicted from the cache.
		 */
		if (VALID_PAGE(vcpu->arch.mmu->root.hpa))
			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
		return;
	}

Then the low level helpers can take @mmu, not @vcpu.

> +
> +	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)

Needs curly braces.
> +		if (is_root_usable(&mmu->prev_roots[i], new_pgd, new_role))
> +			goto hit;
> +
> +	return false;
> +
> +hit:
> +	swap(mmu->root, mmu->prev_roots[i]);
> +	/* Bubble up the remaining roots.  */
> +	for (; i < KVM_MMU_NUM_PREV_ROOTS - 1; i++)
> +		mmu->prev_roots[i] = mmu->prev_roots[i + 1];
> +	mmu->prev_roots[i].hpa = INVALID_PAGE;

This looks wrong.

> +	return true;
>  }
>  
>  static bool fast_pgd_switch(struct kvm_vcpu *vcpu, gpa_t new_pgd,
> @@ -4124,22 +4159,24 @@ static bool fast_pgd_switch(struct kvm_vcpu *vcpu, gpa_t new_pgd,
>  	struct kvm_mmu *mmu = vcpu->arch.mmu;
>  
>  	/*
> -	 * For now, limit the fast switch to 64-bit hosts+VMs in order to avoid
> +	 * For now, limit the caching to 64-bit hosts+VMs in order to avoid
>  	 * having to deal with PDPTEs. We may add support for 32-bit hosts/VMs
>  	 * later if necessary.

Probably worth explicitly calling out that the current root needs to be nuked to
avoid shoving it into the cache, that's somewhat subtle.

>  	 */
> -	if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
> -	    mmu->root_level >= PT64_ROOT_4LEVEL)
> -		return cached_root_available(vcpu, new_pgd, new_role);
> +	if (VALID_PAGE(mmu->root.hpa) && !to_shadow_page(mmu->root.hpa))

Hmm, so it doesn't matter (yet) because shadowing 4-level NPT with 5-level NPT
is fubar anyways[1], but this will prevent caching roots when shadowing 4-level
NPT with 5-level NPT (have I mentioned how much I love NPT?).  This can be:

	if (VALID_PAGE(mmu->root.hpa) && mmu->root.hpa == __pa(mmu->pae_root))

This should also Just Work when all roots are backed by shadow pages[2], which I
really hope we can make happen (it's by far the easiest way to deal with nested NPT).

[1] https://lore.kernel.org/lkml/YbFY533IT3XSIqAK@google.com
[2] https://lore.kernel.org/all/20211210092508.7185-1-jiangshanlai@gmail.com

> +		kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
>  
> -	return false;
> +	if (VALID_PAGE(mmu->root.hpa))
> +		return cached_root_find_and_promote(vcpu, new_pgd, new_role);
> +	else
> +		return cached_root_find_and_replace(vcpu, new_pgd, new_role);
>  }
>  
>  static void __kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd,
>  			      union kvm_mmu_page_role new_role)
>  {
>  	if (!fast_pgd_switch(vcpu, new_pgd, new_role)) {
> -		kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
> +		/* kvm_mmu_ensure_valid_pgd will set up a new root.  */
>  		return;
>  	}
>  
> -- 
> 2.31.1
> 
> 
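
As a footnote on the rotation table: the swap-based walk really does produce
exactly those rows.  Here is a self-contained sketch (plain C, with a 4-entry
char array standing in for mmu->prev_roots -- not actual KVM code):

	#include <stdio.h>

	#define NR_CACHED 4

	static void swap_root(char *a, char *b)
	{
		char tmp = *a;
		*a = *b;
		*b = tmp;
	}

	int main(void)
	{
		/* 'C' is the current root, '0'..'3' are the cached previous roots. */
		char current = 'C';
		char prev[NR_CACHED] = { '0', '1', '2', '3' };
		int i, j;

		for (i = 0; i < NR_CACHED; i++) {
			/* Same swap as in the loop above, one row per iteration. */
			swap_root(&current, &prev[i]);
			printf("%c  ", current);
			for (j = 0; j < NR_CACHED; j++)
				printf(" %c", prev[j]);
			printf("\n");
		}
		return 0;
	}

It prints the four rows below the "on entry" line: the old current root lands
in prev[0] (i.e. it always becomes the MRU entry), and each root that is tried
and rejected slides down one slot.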

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 08/12] KVM: MMU: do not consult levels when freeing roots
  2022-02-11  1:07       ` Paolo Bonzini
@ 2022-02-11  1:35         ` Sean Christopherson
  2022-02-11  1:44           ` Sean Christopherson
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2022-02-11  1:35 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Fri, Feb 11, 2022, Paolo Bonzini wrote:
> On 2/11/22 01:54, Sean Christopherson wrote:
> > > 	free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT) &&
> > > 			   VALID_PAGE(mmu->root.hpa);
> > > 
> > > Isn't this a separate bug fix?  E.g. call kvm_mmu_unload() without a valid current
> > > root, but with valid previous roots?  In which case we'd try to free garbage, no?
> 
> mmu_free_root_page checks VALID_PAGE(*root_hpa).  If that's what you meant,
> then it wouldn't be a preexisting bug (and I think it'd be a fairly common
> case).

Ahh, yep.

> > > > +
> > > > +	if (!free_active_root) {
> > > >   		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> > > >   			if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
> > > >   			    VALID_PAGE(mmu->prev_roots[i].hpa))
> > > > @@ -3242,8 +3245,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > > >   					   &invalid_list);
> > > >   	if (free_active_root) {
> > > > -		if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
> > > > -		    (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
> > > > +		if (to_shadow_page(mmu->root.hpa)) {
> > > >   			mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
> > > >   		} else if (mmu->pae_root) {
> > 
> > Gah, this is technically wrong.  It shouldn't truly matter, but it's wrong.  root.hpa
> > will not be backed by shadow page if the root is pml4_root or pml5_root, in which
> > case freeing the PAE root is wrong.  They should obviously be invalid already, but
> > it's a little confusing because KVM wanders down a path that may not be relevant
> > to the current mode.
> 
> pml4_root and pml5_root are dummy, and the first "real" level of page tables
> is stored in pae_root for that case too, so I think that should DTRT.

Ugh, completely forgot that detail.  You're correct.  Probably worth a comment?

> That's why I also disliked the shadow_root_level/root_level/direct check:
> even though there's half a dozen of cases involved, they all boil down to
> either 4 pae_roots or a single root with a backing kvm_mmu_page.
> 
> It's even more obscure to check shadow_root_level/root_level/direct in
> fast_pgd_switch, where it's pretty obvious that you cannot cache 4 pae_roots
> in a single (hpa, pgd) pair...

Heh, apparently not obvious enough for me :-)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 09/12] KVM: MMU: look for a cached PGD when going from 32-bit to 64-bit
  2022-02-11  1:32   ` Sean Christopherson
@ 2022-02-11  1:37     ` Sean Christopherson
  2022-02-11 10:09       ` Paolo Bonzini
  2022-02-11 11:45     ` Paolo Bonzini
  1 sibling, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2022-02-11  1:37 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Fri, Feb 11, 2022, Sean Christopherson wrote:
> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> Hmm, so it doesn't matter (yet) because shadowing 4-level NPT with 5-level NPT
> is fubar anyways[1], but this will prevent caching roots when shadowing 4-level
> NPT with 5-level NPT (have I mentioned how much I love NPT?).  This can be:
> 
> 	if (VALID_PAGE(mmu->root.hpa) && mmu->root.hpa == __pa(mmu->pae_root))

Gah, that's wrong too, it will allow the pml4_root case.  *sigh*  I'm fine punting
this until the special roots are less special, pml5_root is completely broken
anyways.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 08/12] KVM: MMU: do not consult levels when freeing roots
  2022-02-11  1:35         ` Sean Christopherson
@ 2022-02-11  1:44           ` Sean Christopherson
  2022-02-11  2:20             ` Sean Christopherson
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2022-02-11  1:44 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Fri, Feb 11, 2022, Sean Christopherson wrote:
> On Fri, Feb 11, 2022, Paolo Bonzini wrote:
> > On 2/11/22 01:54, Sean Christopherson wrote:
> > > > > @@ -3242,8 +3245,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > > > >   					   &invalid_list);
> > > > >   	if (free_active_root) {
> > > > > -		if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
> > > > > -		    (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
> > > > > +		if (to_shadow_page(mmu->root.hpa)) {
> > > > >   			mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
> > > > >   		} else if (mmu->pae_root) {
> > > 
> > > Gah, this is technically wrong.  It shouldn't truly matter, but it's wrong.  root.hpa
> > > will not be backed by shadow page if the root is pml4_root or pml5_root, in which
> > > case freeing the PAE root is wrong.  They should obviously be invalid already, but
> > > it's a little confusing because KVM wanders down a path that may not be relevant
> > > to the current mode.
> > 
> > pml4_root and pml5_root are dummy, and the first "real" level of page tables
> > is stored in pae_root for that case too, so I think that should DTRT.
> 
> Ugh, completely forgot that detail.  You're correct.  Probably worth a comment?

Actually, can't this be

			if (to_shadow_page(mmu->root.hpa)) {
				...
			} else if (!WARN_ON(!mmu->pae_root)) {
				...
			}

now that it's wrapped with VALID_PAGE(root.hpa)?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 08/12] KVM: MMU: do not consult levels when freeing roots
  2022-02-11  1:44           ` Sean Christopherson
@ 2022-02-11  2:20             ` Sean Christopherson
  0 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2022-02-11  2:20 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Fri, Feb 11, 2022, Sean Christopherson wrote:
> On Fri, Feb 11, 2022, Sean Christopherson wrote:
> > On Fri, Feb 11, 2022, Paolo Bonzini wrote:
> > > On 2/11/22 01:54, Sean Christopherson wrote:
> > > > > > @@ -3242,8 +3245,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > > > > >   					   &invalid_list);
> > > > > >   	if (free_active_root) {
> > > > > > -		if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
> > > > > > -		    (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
> > > > > > +		if (to_shadow_page(mmu->root.hpa)) {
> > > > > >   			mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
> > > > > >   		} else if (mmu->pae_root) {
> > > > 
> > > > Gah, this is technically wrong.  It shouldn't truly matter, but it's wrong.  root.hpa
> > > > will not be backed by shadow page if the root is pml4_root or pml5_root, in which
> > > > case freeing the PAE root is wrong.  They should obviously be invalid already, but
> > > > it's a little confusing because KVM wanders down a path that may not be relevant
> > > > to the current mode.
> > > 
> > > pml4_root and pml5_root are dummy, and the first "real" level of page tables
> > > is stored in pae_root for that case too, so I think that should DTRT.
> > 
> > Ugh, completely forgot that detail.  You're correct.

Mostly correct.  The first "real" level will be PML4 in the hCR4.LA57=1, gCR4.LA57=0
nested NPT case.  Ditto for shadowing PAE NPT with 4/5-level NPT, though in that
case KVM still allocates pae_root entries, it just happens to be a "real" level.

And now I realize why I'm so confused, mmu_alloc_shadow_roots() is also broken
with respect to 5-level shadowing 4-level.  I believe the part that got fixed
was 5-level with a 32-bit guest.  Ugh.

For the stuff that actually works in KVM, this will do just fine.  5-level nNPT
can be punted to the future.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 12/12] KVM: x86: do not unload MMU roots on all role changes
  2022-02-09 17:00 ` [PATCH 12/12] KVM: x86: do not unload MMU roots on all role changes Paolo Bonzini
@ 2022-02-11  9:08   ` Nikunj A. Dadhania
  2022-02-11 18:48   ` Sean Christopherson
  1 sibling, 0 replies; 50+ messages in thread
From: Nikunj A. Dadhania @ 2022-02-11  9:08 UTC (permalink / raw)
  To: Paolo Bonzini, linux-kernel, kvm; +Cc: vkuznets, mlevitsk, dmatlack, seanjc

On 2/9/2022 10:30 PM, Paolo Bonzini wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0d3646535cc5..97c4f5fc291f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -873,8 +873,12 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
>  		kvm_async_pf_hash_reset(vcpu);
>  	}
>  
> -	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> +	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
> +		/* Flush the TLB if CR0 is changed 1 -> 0.  */

                                      ^^ CR0.PG here ?

> +		if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
> +			kvm_mmu_unload(vcpu);
>  		kvm_mmu_reset_context(vcpu);
> +	}

Regards
Nikunj


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 06/12] KVM: MMU: rename kvm_mmu_reload
  2022-02-11  0:27   ` Sean Christopherson
@ 2022-02-11 10:07     ` Paolo Bonzini
  2022-02-11 16:16       ` Sean Christopherson
  0 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-11 10:07 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On 2/11/22 01:27, Sean Christopherson wrote:
> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
>> The name of kvm_mmu_reload is very confusing for two reasons:
>> first, KVM_REQ_MMU_RELOAD actually does not call it; second,
>> it only does anything if there is no valid root.
>>
>> Rename it to kvm_mmu_ensure_valid_root, which matches the actual
>> behavior better.
> 
> 100% agree that kvm_mmu_reload() is a terrible name, but kvm_mmu_ensure_valid_root()
> isn't much better, e.g. it sounds like a sanity check and nothing more.

I would have thought that would be more of a check_valid_root().  There 
are other functions in the kernel following the idea that "ensure" means 
idempotency: skb_ensure_writable, perf_cgroup_ensure_storage, 
btf_ensure_modifiable and libbpf_ensure_mem in libbpf.  I'm not a native 
speaker but, at least in computing, "ensure" seems to mean not just "to 
make certain that (something) will be true", but also taking steps if 
that's not already the case.

I also thought of "establish_valid_root", but it has the opposite 
problem---it does not convey well, if at all, that the root could be 
valid already.

Paolo

> 
> Maybe just be very literal?
> 
>    kvm_mmu_load_if_necessary()
>    kvm_mmu_load_if_invalid()
> 
> Or follow cond_resched()?
> 
>    kvm_mmu_cond_load()
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 09/12] KVM: MMU: look for a cached PGD when going from 32-bit to 64-bit
  2022-02-11  1:37     ` Sean Christopherson
@ 2022-02-11 10:09       ` Paolo Bonzini
  0 siblings, 0 replies; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-11 10:09 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On 2/11/22 02:37, Sean Christopherson wrote:
>
>> 	if (VALID_PAGE(mmu->root.hpa) && mmu->root.hpa == __pa(mmu->pae_root))
> Gah, that's wrong too, it will allow the pml4_root case.  *sigh*  I'm fine punting
> this until the special roots are less special, pml5_root is completely broken
> anyways.

Ok, I'll look at your comments on Jiangshan's series.

Paolo


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 03/12] KVM: x86: do not deliver asynchronous page faults if CR0.PG=0
  2022-02-10 23:16       ` Sean Christopherson
@ 2022-02-11 11:16         ` Paolo Bonzini
  0 siblings, 0 replies; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-11 11:16 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On 2/11/22 00:16, Sean Christopherson wrote:
> Third time's a charm...
> 
> 	if (!kvm_pv_async_pf_enabled(vcpu))
> 		return false;
> 
> 	if (vcpu->arch.apf.send_user_only &&
> 	    static_call(kvm_x86_get_cpl)(vcpu) == 0)
> 		return false;
> 
> 	/* L1 CR0.PG=1 is guaranteed if the vCPU is in guest mode (L2). */
> 	if (is_guest_mode(vcpu))
> 		return !vcpu->arch.apf.delivery_as_pf_vmexit;
> 
> 	return is_paging(vcpu);
> 
> 

Went for this, but with slightly different final "if":

         if (is_guest_mode(vcpu)) {
                 /*
                  * L1 needs to opt into the special #PF vmexits that are
                  * used to deliver async page faults.
                  */
                 return vcpu->arch.apf.delivery_as_pf_vmexit;
         } else {
                 /*
                  * Play it safe in case the guest does a quick real mode
                  * foray.  The real mode IDT is unlikely to have a #PF
                  * exception setup.
                  */
                 return is_paging(vcpu);
         }

Paolo


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 04/12] KVM: MMU: WARN if PAE roots linger after kvm_mmu_unload
  2022-02-10 23:20   ` Sean Christopherson
@ 2022-02-11 11:18     ` Paolo Bonzini
  0 siblings, 0 replies; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-11 11:18 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On 2/11/22 00:20, Sean Christopherson wrote:
> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>   arch/x86/kvm/mmu/mmu.c | 17 +++++++++++++----
>>   1 file changed, 13 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index e0c0f0bc2e8b..7b5765ced928 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -5065,12 +5065,21 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
>>   	return r;
>>   }
>>   
>> +static void __kvm_mmu_unload(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
>> +{
>> +	int i;
>> +	kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOTS_ALL);
>> +	WARN_ON(VALID_PAGE(mmu->root_hpa));
>> +	if (mmu->pae_root) {
>> +		for (i = 0; i < 4; ++i)
>> +			WARN_ON(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
>> +	}
> 
> I'm somewhat ambivalent, but if you're at all on the fence, I vote to drop this
> one.  I've always viewed the WARN on root_hpa as gratuitous.
> 
> But, if it helped during development, then why not...

Well, it was not really helping in that the WARN triggered, but rather 
it was ruling out the more blatant violations of invariants.  The one in 
patch 5 triggered a lot, though.

Paolo


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 05/12] KVM: MMU: avoid NULL-pointer dereference on page freeing bugs
  2022-02-11  0:24   ` Sean Christopherson
@ 2022-02-11 11:21     ` Paolo Bonzini
  0 siblings, 0 replies; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-11 11:21 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On 2/11/22 01:24, Sean Christopherson wrote:
>>   
>>   	sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
>> +	if (WARN_ON(!sp))
>
> Should this be KVM_BUG_ON()?  I.e. when you triggered these, would continuing on
> potentially corrupt guest data, or was it truly benign-ish?

It only triggered on the mode_switch SVM unit test (with npt=0), i.e. in
a very small test which just hung after the bug.  The WARN however was 
the 10-minute difference between rmmod and reboot...

I didn't use KVM_BUG_ON because we're in a pretty deep call stack 
(kvm_mmu_new_pgd, itself called from nested vmentry) and all sort of 
stuff will happen before bailing out.  My mental model is to use 
KVM_BUG_ON in situations for which error propagation is possible and clean.

Paolo


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 09/12] KVM: MMU: look for a cached PGD when going from 32-bit to 64-bit
  2022-02-11  1:32   ` Sean Christopherson
  2022-02-11  1:37     ` Sean Christopherson
@ 2022-02-11 11:45     ` Paolo Bonzini
  2022-02-11 17:38       ` Sean Christopherson
  1 sibling, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-11 11:45 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On 2/11/22 02:32, Sean Christopherson wrote:
> Maybe cached_root_find_and_rotate() or cached_root_find_and_age()?

I'll go for cached_root_find_and_keep_current() and 
cached_root_find_without_current(), respectively.

> 
> Hmm, while we're refactoring this, I'd really prefer we not grab vcpu->arch.mmu
> way down in the helpers.  @vcpu is needed only for the request, so what about
> doing this?
> 
> 	if (!fast_pgd_switch(vcpu, new_pgd, new_role)) {
> 		/*
> 		 * <whatever kvm_mmu_reload() becomes> will set up a new root
> 		 * prior to the next VM-Enter.  Free the current root if it's
> 		 * valid, i.e. if a valid root was evicted from the cache.
> 		 */
> 		if (VALID_PAGE(vcpu->arch.mmu->root.hpa))
> 			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
> 		return;
> 	}

I tried, but it's much easier to describe the cache functions if their 
common postcondition is "vcpu->arch.mmu->root.hpa is never stale", which 
requires not a struct kvm_vcpu* but at least a struct kvm*, for the MMU 
lock.

I could change kvm_mmu_free_roots and cached_root_* to take a struct 
kvm* plus a struct kvm_mmu*.  Does that sound better?
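
Concretely, something along these lines -- just a sketch of the prototype,
with "ulong" spelled out so the snippet stands alone; the cached_root_*
helpers would take the same (kvm, mmu) pair plus the pgd/role arguments they
already have:

	struct kvm;
	struct kvm_mmu;

	void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
				unsigned long roots_to_free);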

Paolo


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 06/12] KVM: MMU: rename kvm_mmu_reload
  2022-02-11 10:07     ` Paolo Bonzini
@ 2022-02-11 16:16       ` Sean Christopherson
  2022-02-11 16:52         ` Paolo Bonzini
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2022-02-11 16:16 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Fri, Feb 11, 2022, Paolo Bonzini wrote:
> On 2/11/22 01:27, Sean Christopherson wrote:
> > On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> > > The name of kvm_mmu_reload is very confusing for two reasons:
> > > first, KVM_REQ_MMU_RELOAD actually does not call it; second,
> > > it only does anything if there is no valid root.
> > > 
> > > Rename it to kvm_mmu_ensure_valid_root, which matches the actual
> > > behavior better.
> > 
> > 100% agree that kvm_mmu_reload() is a terrible name, but kvm_mmu_ensure_valid_root()
> > isn't much better, e.g. it sounds like a sanity check and nothing more.
> 
> I would have thought that would be more of a check_valid_root().  There are
> other functions in the kernel following the idea that "ensure" means
> idempotency: skb_ensure_writable, perf_cgroup_ensure_storage,
> btf_ensure_modifiable and libbpf_ensure_mem in libbpf.  I'm not a native
> speaker but, at least in computing, "ensure" seems to mean not just "to make
> certain that (something) will be true", but also taking steps if that's not
> already the case.

There's no ambiguity about "make certain that <x> will be true"; it's the second
part about taking steps that's ambiguous.  Specifically, it doesn't convey any
information about _what_ steps will be taken, e.g. the below implementation is
also a possibility since it ensures the root is valid by preventing forward
progress if the root is invalid.

  static inline int kvm_mmu_ensure_valid_root(struct kvm_vcpu *vcpu)
  {
	if (unlikely(vcpu->arch.mmu->root.hpa == INVALID_PAGE))
		return -EFAULT;
	return 0;
  }

Existing examples of that interpretation are input_dev_ensure_poller() and
rtnl_ensure_unique_netns().

The other nuance that I want to avoid is the implication that KVM is checking for
a valid root because it doesn't trust what has happened before, i.e. that the call
is there as a safeguard.  That's misleading for the most common path, vcpu_enter_guest(),
because when the helper does do some work, it's usually because KVM deliberately
invalidated the root.


> I also thought of "establish_valid_root", but it has the opposite
> problem---it does not convey well, if at all, that the root could be valid
> already.

Heh, I agree that "establish" would imply the root is always invalid, but amusingly
"establish" is listed as a synonym for "ensure" on the few sites of checked. Yay English.  

I was going to suggest we just open code it in vcpu_enter_guest, but async #PF
uses it too :-/

Can we put this on the backburner for now?  IMO, KVM_REQ_MMU_RELOAD is far more
misleading than kvm_mmu_reload(), and I posted a series to remedy that (though I
need to check if it's still viable since you vetoed adding the check for a pending
request in the page fault handler).

https://lore.kernel.org/all/20211209060552.2956723-1-seanjc@google.com

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 06/12] KVM: MMU: rename kvm_mmu_reload
  2022-02-11 16:16       ` Sean Christopherson
@ 2022-02-11 16:52         ` Paolo Bonzini
  0 siblings, 0 replies; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-11 16:52 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On 2/11/22 17:16, Sean Christopherson wrote:
> The other nuance that I want to avoid is the implication that KVM is checking for
> a valid root because it doesn't trust what has happened before, i.e. that the call
> is there as a safeguard.  That's misleading for the most common path, vcpu_enter_guest(),
> because when the helper does do some work, it's usually because KVM deliberately
> invalidated the root.

Fair enough.

>> I also thought of "establish_valid_root", but it has the opposite
>> problem---it does not convey well, if at all, that the root could be valid
>> already.
> 
> Heh, I agree that "establish" would imply the root is always invalid, but amusingly
> "establish" is listed as a synonym for "ensure" on the few sites of checked. Yay English.

Well, synonyms rarely have a perfectly matching meaning.

> Can we put this on the backburner for now?

Yes, of course.

Paolo

> IMO, KVM_REQ_MMU_RELOAD is far more
> misleading than kvm_mmu_reload(), and I posted a series to remedy that (though I
> need to check if it's still viable since you vetoed adding the check for a pending
> request in the page fault handler).
> 
> https://lore.kernel.org/all/20211209060552.2956723-1-seanjc@google.com


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 09/12] KVM: MMU: look for a cached PGD when going from 32-bit to 64-bit
  2022-02-11 11:45     ` Paolo Bonzini
@ 2022-02-11 17:38       ` Sean Christopherson
  0 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2022-02-11 17:38 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Fri, Feb 11, 2022, Paolo Bonzini wrote:
> On 2/11/22 02:32, Sean Christopherson wrote:
> > Maybe cached_root_find_and_rotate() or cached_root_find_and_age()?
> 
> I'll go for cached_root_find_and_keep_current() and
> cached_root_find_without_current(), respectively.
> 
> > 
> > Hmm, while we're refactoring this, I'd really prefer we not grab vcpu->arch.mmu
> > way down in the helpers.  @vcpu is needed only for the request, so what about
> > doing this?
> > 
> > 	if (!fast_pgd_switch(vcpu, new_pgd, new_role)) {
> > 		/*
> > 		 * <whatever kvm_mmu_reload() becomes> will set up a new root
> > 		 * prior to the next VM-Enter.  Free the current root if it's
> > 		 * valid, i.e. if a valid root was evicted from the cache.
> > 		 */
> > 		if (VALID_PAGE(vcpu->arch.mmu->root.hpa))
> > 			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
> > 		return;
> > 	}
> 
> I tried, but it's much easier to describe the cache functions if their
> common postcondition is "vcpu->arch.mmu->root.hpa is never stale"; which
> requires not a struct kvm_vcpu* but at least a struct kvm*, for the MMU
> lock.
> 
> I could change kvm_mmu_free_roots and cached_root_* to take a struct kvm*
> plus a struct kvm_mmu*.  Does that sound better?

Ya, works for me, thanks!

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 07/12] KVM: x86: use struct kvm_mmu_root_info for mmu->root
  2022-02-09 17:00 ` [PATCH 07/12] KVM: x86: use struct kvm_mmu_root_info for mmu->root Paolo Bonzini
@ 2022-02-11 17:39   ` Sean Christopherson
  0 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2022-02-11 17:39 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info.
> Use the struct to have more consistency between mmu->root and
> mmu->prev_roots.
> 
> The patch is entirely search and replace except for cached_root_available,
> which does not need a temporary struct kvm_mmu_root_info anymore.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---

Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: MMU: load new PGD after the shadow MMU is initialized
  2022-02-09 17:00 ` [PATCH 10/12] KVM: MMU: load new PGD after the shadow MMU is initialized Paolo Bonzini
@ 2022-02-11 17:45   ` Sean Christopherson
  2022-02-11 17:47     ` Paolo Bonzini
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2022-02-11 17:45 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> Now that __kvm_mmu_new_pgd does not look at the MMU's root_level and
> shadow_root_level anymore, pull the PGD load after the initialization of
> the shadow MMUs.
> 
> Besides being more intuitive, this enables future simplifications
> and optimizations because it's not necessary anymore to compute the
> role outside kvm_init_mmu.  In particular, kvm_mmu_reset_context was not
> attempting to use a cached PGD to avoid having to figure out the new role.
> It will soon be able to follow what nested_{vmx,svm}_load_cr3 are doing,
> and avoid unloading all the cached roots.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---

If you add a sanity check as described in the other thread[*],

Reviewed-by: Sean Christopherson <seanjc@google.com>


[*] https://lore.kernel.org/all/1e8c38eb-d66a-60e7-9432-eb70e7ec1dd4@redhat.com

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: MMU: load new PGD after the shadow MMU is initialized
  2022-02-11 17:45   ` Sean Christopherson
@ 2022-02-11 17:47     ` Paolo Bonzini
  0 siblings, 0 replies; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-11 17:47 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On 2/11/22 18:45, Sean Christopherson wrote:
> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
>> Now that __kvm_mmu_new_pgd does not look at the MMU's root_level and
>> shadow_root_level anymore, pull the PGD load after the initialization of
>> the shadow MMUs.
>>
>> Besides being more intuitive, this enables future simplifications
>> and optimizations because it's not necessary anymore to compute the
>> role outside kvm_init_mmu.  In particular, kvm_mmu_reset_context was not
>> attempting to use a cached PGD to avoid having to figure out the new role.
>> It will soon be able to follow what nested_{vmx,svm}_load_cr3 are doing,
>> and avoid unloading all the cached roots.
>>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
> 
> If you add a sanity check as described in the other thread[*],
> 
> Reviewed-by: Sean Christopherson <seanjc@google.com>

It's not as easy as I thought, but it becomes almost trivial with the 
CPU/MMU role refactoring.  I'll get that posted as soon as I can push a 
final-ish version of this one to kvm/queue.

Paolo


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 11/12] KVM: MMU: remove kvm_mmu_calc_root_page_role
  2022-02-09 17:00 ` [PATCH 11/12] KVM: MMU: remove kvm_mmu_calc_root_page_role Paolo Bonzini
@ 2022-02-11 17:53   ` Sean Christopherson
  0 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2022-02-11 17:53 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

IMO, the shortlog is too literal and doesn't help understand the implications of
the change.  I prefer something like:

  KVM: x86/mmu: Always use current mmu's role when loading new PGD

On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> Since the guest PGD is now loaded after the MMU has been set up
> completely, the desired role for a cache hit is simply the current
> mmu_role.  There is no need to compute it again, so __kvm_mmu_new_pgd
> can be folded in kvm_mmu_new_pgd.
> 
> For the !tdp_enabled case, it would also have been possible to use
> the role that is already in vcpu->arch.mmu.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---

With a different shortlog and newline,

Reviewed-by: Sean Christopherson <seanjc@google.com>

>  arch/x86/kvm/mmu/mmu.c | 29 ++++-------------------------
>  1 file changed, 4 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index df9e0a43513c..38b40ddcaad7 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -190,8 +190,6 @@ struct kmem_cache *mmu_page_header_cache;
>  static struct percpu_counter kvm_total_used_mmu_pages;
>  
>  static void mmu_spte_set(u64 *sptep, u64 spte);
> -static union kvm_mmu_page_role
> -kvm_mmu_calc_root_page_role(struct kvm_vcpu *vcpu);
>  
>  struct kvm_mmu_role_regs {
>  	const unsigned long cr0;
> @@ -4172,9 +4170,9 @@ static bool fast_pgd_switch(struct kvm_vcpu *vcpu, gpa_t new_pgd,
>  		return cached_root_find_and_replace(vcpu, new_pgd, new_role);
>  }
>  
> -static void __kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd,
> -			      union kvm_mmu_page_role new_role)
> +void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
>  {
> +	union kvm_mmu_page_role new_role = vcpu->arch.mmu->mmu_role.base;

Newline needed.

>  	if (!fast_pgd_switch(vcpu, new_pgd, new_role)) {
>  		/* kvm_mmu_ensure_valid_pgd will set up a new root.  */
>  		return;

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 12/12] KVM: x86: do not unload MMU roots on all role changes
  2022-02-09 17:00 ` [PATCH 12/12] KVM: x86: do not unload MMU roots on all role changes Paolo Bonzini
  2022-02-11  9:08   ` Nikunj A. Dadhania
@ 2022-02-11 18:48   ` Sean Christopherson
  2022-02-14 16:34     ` Paolo Bonzini
  1 sibling, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2022-02-11 18:48 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> kvm_mmu_reset_context is called on all role changes and right now it
> calls kvm_mmu_unload.  With the legacy MMU this is a relatively cheap
> operation; the previous PGDs remain in the hash table and are picked
> up immediately on the next page fault.  With the TDP MMU, however, the
> roots are thrown away for good and a full rebuild of the page tables is
> necessary, which is many times more expensive.
> 
> Fortunately, throwing away the roots is not necessary except when
> the manual says a TLB flush is required:
> 
> - changing CR0.PG from 1 to 0 (because it flushes the TLB according to
>   the x86 architecture specification)
> 
> - changing CPUID (which changes the interpretation of page tables in
>   ways not reflected by the role).
> 
> - changing CR4.SMEP from 0 to 1 (not doing so actually breaks access.c!)
> 
> Except for these cases, once the MMU has updated the CPU/MMU roles
> and metadata it is enough to force-reload the current value of CR3.
> KVM will look up the cached roots for an entry with the right role and
> PGD, and only if the cache misses will a new root be created.
> 
> Measuring with vmexit.flat from kvm-unit-tests shows the following
> improvement:
> 
>              TDP         legacy       shadow
>    before    46754       5096         5150
>    after     4879        4875         5006
> 
> which is for very small page tables.  The impact is however much larger
> when running as an L1 hypervisor, because the new page tables cause
> extra work for L0 to shadow them.
> 
> Reported-by: Brad Spengler <spender@grsecurity.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c |  7 ++++---
>  arch/x86/kvm/x86.c     | 27 ++++++++++++++++++---------
>  2 files changed, 22 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 38b40ddcaad7..dbd4e98ba426 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5020,8 +5020,8 @@ EXPORT_SYMBOL_GPL(kvm_init_mmu);
>  void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  {
>  	/*
> -	 * Invalidate all MMU roles to force them to reinitialize as CPUID
> -	 * information is factored into reserved bit calculations.
> +	 * Invalidate all MMU roles and roots to force them to reinitialize,
> +	 * as CPUID information is factored into reserved bit calculations.
>  	 *
>  	 * Correctly handling multiple vCPU models with respect to paging and
>  	 * physical address properties) in a single VM would require tracking
> @@ -5034,6 +5034,7 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  	vcpu->arch.root_mmu.mmu_role.ext.valid = 0;
>  	vcpu->arch.guest_mmu.mmu_role.ext.valid = 0;
>  	vcpu->arch.nested_mmu.mmu_role.ext.valid = 0;
> +	kvm_mmu_unload(vcpu);
>  	kvm_mmu_reset_context(vcpu);
>  
>  	/*
> @@ -5045,8 +5046,8 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  
>  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
>  {
> -	kvm_mmu_unload(vcpu);
>  	kvm_init_mmu(vcpu);
> +	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);

This is too risky IMO, there are far more flows than just MOV CR0/CR4 that are
affected, e.g. SMM transitions, KVM_SET_SREGS, etc...

Given that kvm_post_set_cr{0,4}() and kvm_vcpu_reset() explicitly handle CR0.PG
and CR4.SMEP toggling, I highly doubt the other flows are correct in all instances.
The call to kvm_mmu_new_pgd() is also 

To minimize risk, we should leave kvm_mmu_reset_context() as is (rename it if
necessary) and instead add a new helper to handle kvm_post_set_cr{0,4}().  In
the future we can/should work on avoiding unload in all paths, but again, future
problem.

>  }
>  EXPORT_SYMBOL_GPL(kvm_mmu_reset_context);
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0d3646535cc5..97c4f5fc291f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -873,8 +873,12 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
>  		kvm_async_pf_hash_reset(vcpu);
>  	}
>  
> -	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> +	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
> +		/* Flush the TLB if CR0 is changed 1 -> 0.  */
> +		if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
> +			kvm_mmu_unload(vcpu);

Calling kvm_mmu_unload() instead of requesting a flush isn't coherent with respect
to the comment, or with SMEP handling.  And the SMEP handling isn't coherent with
respect to the changelog.  Please elaborate :-)

>  		kvm_mmu_reset_context(vcpu);
> +	}
>  
>  	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
>  	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
> @@ -1067,15 +1071,18 @@ void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned lon
>  	 * free them all.  KVM_REQ_MMU_RELOAD is fit for the both cases; it
>  	 * is slow, but changing CR4.PCIDE is a rare case.
>  	 *
> -	 * If CR4.PGE is changed, the guest TLB must be flushed.
> +	 * Setting SMEP also needs to flush the TLB, in addition to resetting
> +	 * the MMU.
>  	 *
> -	 * Note: resetting MMU is a superset of KVM_REQ_MMU_RELOAD and
> -	 * KVM_REQ_MMU_RELOAD is a superset of KVM_REQ_TLB_FLUSH_GUEST, hence
> -	 * the usage of "else if".
> +	 * If CR4.PGE is changed, the guest TLB must be flushed.  Because
> +	 * the shadow MMU ignores global pages, this bit is not part of
> +	 * KVM_MMU_CR4_ROLE_BITS.
>  	 */
> -	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
> +	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS) {
>  		kvm_mmu_reset_context(vcpu);
> -	else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
> +		if ((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP))
> +			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);

This mishandles CR4.PGE.  Per the comment above, the if-elif-elif sequence relies
on kvm_mmu_reset_context being a superset of KVM_REQ_TLB_FLUSH_GUEST.

For both CR0 and CR4, I think we should disassociate the TLB flush logic from the
MMU role logic, e.g. CR4.PGE _could_ be part of the role, but it's not because KVM
doesn't emulate global pages.

This is what I'm thinking, assuming CR0.PG 1=>0 really only needs a flush.


---
 arch/x86/kvm/mmu/mmu.c |  4 ++--
 arch/x86/kvm/x86.c     | 42 +++++++++++++++++++++++++++++-------------
 2 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e41834748d52..c477c519c784 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5041,8 +5041,8 @@ EXPORT_SYMBOL_GPL(kvm_init_mmu);
 void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
 {
 	/*
-	 * Invalidate all MMU roles to force them to reinitialize as CPUID
-	 * information is factored into reserved bit calculations.
+	 * Invalidate all MMU roles and roots to force them to reinitialize,
+	 * as CPUID information is factored into reserved bit calculations.
 	 *
 	 * Correctly handling multiple vCPU models with respect to paging and
 	 * physical address properties) in a single VM would require tracking
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 782dc9cd31d8..b8dad04301ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -863,15 +863,28 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3)
 }
 EXPORT_SYMBOL_GPL(load_pdptrs);

+static void kvm_post_set_cr_reinit_mmu(struct kvm_vcpu *vcpu)
+{
+	kvm_init_mmu(vcpu);
+	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
+}
+
 void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
 {
 	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
 		kvm_clear_async_pf_completion_queue(vcpu);
 		kvm_async_pf_hash_reset(vcpu);
+
+		/*
+		 * Clearing CR0.PG is architecturally defined as flushing the
+		 * TLB from the guest's perspective.
+		 */
+		if (!(cr0 & X86_CR0_PG))
+			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 	}

 	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
-		kvm_mmu_reset_context(vcpu);
+		kvm_post_set_cr_reinit_mmu(vcpu);

 	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
 	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
@@ -1055,26 +1068,29 @@ EXPORT_SYMBOL_GPL(kvm_is_valid_cr4);
 void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned long cr4)
 {
 	/*
-	 * If any role bit is changed, the MMU needs to be reset.
-	 *
 	 * If CR4.PCIDE is changed 1 -> 0, the guest TLB must be flushed.
 	 * If CR4.PCIDE is changed 0 -> 1, there is no need to flush the TLB
 	 * according to the SDM; however, stale prev_roots could be reused
 	 * incorrectly in the future after a MOV to CR3 with NOFLUSH=1, so we
 	 * free them all.  KVM_REQ_MMU_RELOAD is fit for the both cases; it
 	 * is slow, but changing CR4.PCIDE is a rare case.
-	 *
-	 * If CR4.PGE is changed, the guest TLB must be flushed.
-	 *
-	 * Note: resetting MMU is a superset of KVM_REQ_MMU_RELOAD and
-	 * KVM_REQ_MMU_RELOAD is a superset of KVM_REQ_TLB_FLUSH_GUEST, hence
-	 * the usage of "else if".
 	 */
-	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
-		kvm_mmu_reset_context(vcpu);
-	else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
+	if ((cr4 ^ old_cr4) & X86_CR4_PCIDE) {
 		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
-	else if ((cr4 ^ old_cr4) & X86_CR4_PGE)
+		return;
+	}
+
+	/* If any role bit is changed, the MMU needs to be reinitialized. */
+	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
+		kvm_post_set_cr_reinit_mmu(vcpu);
+
+	/*
+	 * Setting SMEP or toggling PGE is architecturally defined as flushing
+	 * the TLB from the guest's perspective.  Note, because the shadow MMU
+	 * ignores global pages, CR4.PGE is not part of KVM_MMU_CR4_ROLE_BITS.
+	 */
+	if (((cr4 ^ old_cr4) & X86_CR4_PGE) ||
+	    ((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP)))
 		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_post_set_cr4);

base-commit: a8c36d04d70d0b15e696561e1a2134fcbdd3a3bd
--


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 12/12] KVM: x86: do not unload MMU roots on all role changes
  2022-02-11 18:48   ` Sean Christopherson
@ 2022-02-14 16:34     ` Paolo Bonzini
  2022-02-14 19:24       ` Sean Christopherson
  0 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-14 16:34 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On 2/11/22 19:48, Sean Christopherson wrote:
> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
>> kvm_mmu_reset_context is called on all role changes and right now it
>> calls kvm_mmu_unload.  With the legacy MMU this is a relatively cheap
>> operation; the previous PGDs remains in the hash table and is picked
>> up immediately on the next page fault.  With the TDP MMU, however, the
>> roots are thrown away for good and a full rebuild of the page tables is
>> necessary, which is many times more expensive.
>>
>> Fortunately, throwing away the roots is not necessary except when
>> the manual says a TLB flush is required:
>>
>> - changing CR0.PG from 1 to 0 (because it flushes the TLB according to
>>    the x86 architecture specification)
>>
>> - changing CPUID (which changes the interpretation of page tables in
>>    ways not reflected by the role).
>>
>> - changing CR4.SMEP from 0 to 1 (not doing so actually breaks access.c!)
>>
>> Except for these cases, once the MMU has updated the CPU/MMU roles
>> and metadata it is enough to force-reload the current value of CR3.
>> KVM will look up the cached roots for an entry with the right role and
>> PGD, and only if the cache misses will a new root be created.
>>
>> Measuring with vmexit.flat from kvm-unit-tests shows the following
>> improvement:
>>
>>               TDP         legacy       shadow
>>     before    46754       5096         5150
>>     after     4879        4875         5006
>>
>> which is for very small page tables.  The impact is however much larger
>> when running as an L1 hypervisor, because the new page tables cause
>> extra work for L0 to shadow them.
>>
>> Reported-by: Brad Spengler <spender@grsecurity.com>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>   arch/x86/kvm/mmu/mmu.c |  7 ++++---
>>   arch/x86/kvm/x86.c     | 27 ++++++++++++++++++---------
>>   2 files changed, 22 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 38b40ddcaad7..dbd4e98ba426 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -5020,8 +5020,8 @@ EXPORT_SYMBOL_GPL(kvm_init_mmu);
>>   void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>>   {
>>   	/*
>> -	 * Invalidate all MMU roles to force them to reinitialize as CPUID
>> -	 * information is factored into reserved bit calculations.
>> +	 * Invalidate all MMU roles and roots to force them to reinitialize,
>> +	 * as CPUID information is factored into reserved bit calculations.
>>   	 *
>>   	 * Correctly handling multiple vCPU models with respect to paging and
>>   	 * physical address properties) in a single VM would require tracking
>> @@ -5034,6 +5034,7 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>>   	vcpu->arch.root_mmu.mmu_role.ext.valid = 0;
>>   	vcpu->arch.guest_mmu.mmu_role.ext.valid = 0;
>>   	vcpu->arch.nested_mmu.mmu_role.ext.valid = 0;
>> +	kvm_mmu_unload(vcpu);
>>   	kvm_mmu_reset_context(vcpu);
>>   
>>   	/*
>> @@ -5045,8 +5046,8 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>>   
>>   void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
>>   {
>> -	kvm_mmu_unload(vcpu);
>>   	kvm_init_mmu(vcpu);
>> +	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
> 
> This is too risky IMO, there are far more flows than just MOV CR0/CR4 that are
> affected, e.g. SMM transitions, KVM_SET_SREGS, etc...

SMM exit does flush the TLB because RSM clears CR0.PG (I did check this 
:)).  SMM re-entry then does not need to flush.  But I don't think SMM 
exit should flush the TLB *for non-SMM roles*.

For KVM_SET_SREGS I'm not sure if it should flush the TLB, but I agree 
it is certainly safer to keep it that way.

> Given that kvm_post_set_cr{0,4}() and kvm_vcpu_reset() explicitly handle CR0.PG
> and CR4.SMEP toggling, I highly doubt the other flows are correct in all instances.
> The call to kvm_mmu_new_pgd() is also

*white noise*

> To minimize risk, we should leave kvm_mmu_reset_context() as is (rename it if
> necessary) and instead add a new helper to handle kvm_post_set_cr{0,4}().  In
> the future we can/should work on avoiding unload in all paths, but again, future
> problem.

I disagree on this.  There aren't many calls to kvm_mmu_reset_context.

>>   
>> -	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
>> +	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
>> +		/* Flush the TLB if CR0 is changed 1 -> 0.  */
>> +		if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
>> +			kvm_mmu_unload(vcpu);
> 
> Calling kvm_mmu_unload() instead of requesting a flush isn't coherent with respect
> to the comment, or with SMEP handling.  And the SMEP handling isn't coherent with
> respect to the changelog.  Please elaborate :-)

Yep, will do (the CR0.PG=0 case is similar to the CR4.PCIDE=0 case
below).  Using kvm_mmu_unload() avoids loading a cached root just to 
throw it away immediately after, but I can change this to a new 
KVM_REQ_MMU_UPDATE_ROOT flag that does

	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
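
As a sketch, servicing such a request would be a one-liner next to the
other request checks (KVM_REQ_MMU_UPDATE_ROOT is only the name proposed
above, nothing of this exists yet):

	/* e.g. in vcpu_enter_guest(), with the other kvm_check_request()s */
	if (kvm_check_request(KVM_REQ_MMU_UPDATE_ROOT, vcpu))
		kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);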

By the way, I have a possibly stupid question.  In kvm_set_cr3 (called 
e.g. from emulator_set_cr()) there is

  	if (cr3 != kvm_read_cr3(vcpu))
		kvm_mmu_new_pgd(vcpu, cr3);

What makes this work if mmu_is_nested(vcpu)?  Should this also have an 
"if (... & !tdp_enabled)"?

>> -	else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
>> +		if ((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP))
>> +			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> 
> This mishandles CR4.PGE.  Per the comment above, the if-elif-elif sequence relies
> on kvm_mmu_reset_context being a superset of KVM_REQ_TLB_FLUSH_GUEST.
> 
> For both CR0 and CR4, I think we should disassociate the TLB flush logic from the
> MMU role logic, e.g. CR4.PGE _could_ be part of the role, but it's not because KVM
> doesn't emulate global pages.

Makes sense, yes.  It also needs to handle flushing the current PCID 
when changing CR4.PAE (previously done for "free" by 
kvm_mmu_reset_context), but I agree with the idea.

Paolo

> This is what I'm thinking, assuming CR0.PG 1=>0 really only needs a flush.
> 
> 
> ---
>   arch/x86/kvm/mmu/mmu.c |  4 ++--
>   arch/x86/kvm/x86.c     | 42 +++++++++++++++++++++++++++++-------------
>   2 files changed, 31 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e41834748d52..c477c519c784 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5041,8 +5041,8 @@ EXPORT_SYMBOL_GPL(kvm_init_mmu);
>   void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>   {
>   	/*
> -	 * Invalidate all MMU roles to force them to reinitialize as CPUID
> -	 * information is factored into reserved bit calculations.
> +	 * Invalidate all MMU roles and roots to force them to reinitialize,
> +	 * as CPUID information is factored into reserved bit calculations.
>   	 *
>   	 * Correctly handling multiple vCPU models with respect to paging and
>   	 * physical address properties) in a single VM would require tracking
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 782dc9cd31d8..b8dad04301ee 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -863,15 +863,28 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3)
>   }
>   EXPORT_SYMBOL_GPL(load_pdptrs);
> 
> +static void kvm_post_set_cr_reinit_mmu(struct kvm_vcpu *vcpu)
> +{
> +	kvm_init_mmu(vcpu);
> +	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
> +}
> +
>   void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
>   {
>   	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
>   		kvm_clear_async_pf_completion_queue(vcpu);
>   		kvm_async_pf_hash_reset(vcpu);
> +
> +		/*
> +		 * Clearing CR0.PG is architecturally defined as flushing the
> +		 * TLB from the guest's perspective.
> +		 */
> +		if (!(cr0 & X86_CR0_PG))
> +			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
>   	}
> 
>   	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> -		kvm_mmu_reset_context(vcpu);
> +		kvm_post_set_cr_reinit_mmu(vcpu);
> 
>   	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
>   	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
> @@ -1055,26 +1068,29 @@ EXPORT_SYMBOL_GPL(kvm_is_valid_cr4);
>   void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned long cr4)
>   {
>   	/*
> -	 * If any role bit is changed, the MMU needs to be reset.
> -	 *
>   	 * If CR4.PCIDE is changed 1 -> 0, the guest TLB must be flushed.
>   	 * If CR4.PCIDE is changed 0 -> 1, there is no need to flush the TLB
>   	 * according to the SDM; however, stale prev_roots could be reused
>   	 * incorrectly in the future after a MOV to CR3 with NOFLUSH=1, so we
>   	 * free them all.  KVM_REQ_MMU_RELOAD is fit for the both cases; it
>   	 * is slow, but changing CR4.PCIDE is a rare case.
> -	 *
> -	 * If CR4.PGE is changed, the guest TLB must be flushed.
> -	 *
> -	 * Note: resetting MMU is a superset of KVM_REQ_MMU_RELOAD and
> -	 * KVM_REQ_MMU_RELOAD is a superset of KVM_REQ_TLB_FLUSH_GUEST, hence
> -	 * the usage of "else if".
>   	 */
> -	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
> -		kvm_mmu_reset_context(vcpu);
> -	else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
> +	if ((cr4 ^ old_cr4) & X86_CR4_PCIDE) {
>   		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> -	else if ((cr4 ^ old_cr4) & X86_CR4_PGE)
> +		return;
> +	}
> +
> +	/* If any role bit is changed, the MMU needs to be reinitialized. */
> +	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
> +		kvm_post_set_cr_reinit_mmu(vcpu);
> +
> +	/*
> +	 * Setting SMEP or toggling PGE is architecturally defined as flushing
> +	 * the TLB from the guest's perspective.  Note, because the shadow MMU
> +	 * ignores global pages, CR4.PGE is not part of KVM_MMU_CR4_ROLE_BITS.
> +	 */
> +	if (((cr4 ^ old_cr4) & X86_CR4_PGE) ||
> +	    ((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP)))
>   		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
>   }
>   EXPORT_SYMBOL_GPL(kvm_post_set_cr4);
> 
> base-commit: a8c36d04d70d0b15e696561e1a2134fcbdd3a3bd
> --
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 12/12] KVM: x86: do not unload MMU roots on all role changes
  2022-02-14 16:34     ` Paolo Bonzini
@ 2022-02-14 19:24       ` Sean Christopherson
  2022-02-15  8:17         ` Paolo Bonzini
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2022-02-14 19:24 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On Mon, Feb 14, 2022, Paolo Bonzini wrote:
> On 2/11/22 19:48, Sean Christopherson wrote:
> > On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> > > @@ -5045,8 +5046,8 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > >   void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
> > >   {
> > > -	kvm_mmu_unload(vcpu);
> > >   	kvm_init_mmu(vcpu);
> > > +	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
> > 
> > This is too risky IMO, there are far more flows than just MOV CR0/CR4 that are
> > affected, e.g. SMM transitions, KVM_SET_SREGS, etc...
> 
> SMM exit does flush the TLB because RSM clears CR0.PG (I did check this :)).
> SMM re-entry then does not need to flush.  But I don't think SMM exit should
> flush the TLB *for non-SMM roles*.

I'm not concerned about the TLB flush aspects so much as the addition of
kvm_mmu_new_pgd() in new paths.

> For KVM_SET_SREGS I'm not sure if it should flush the TLB, but I agree it is
> certainly safer to keep it that way.
> 
> > Given that kvm_post_set_cr{0,4}() and kvm_vcpu_reset() explicitly handle CR0.PG
> > and CR4.SMEP toggling, I highly doubt the other flows are correct in all instances.
> > The call to kvm_mmu_new_pgd() is also
> 
> *white noise*
> 
> > To minimize risk, we should leave kvm_mmu_reset_context() as is (rename it if
> > necessary) and instead add a new helper to handle kvm_post_set_cr{0,4}().  In
> > the future we can/should work on avoiding unload in all paths, but again, future
> > problem.
> 
> I disagree on this.  There aren't many calls to kvm_mmu_reset_context.

All the more reason to do things incrementally.  I have no objection to allowing
all flows to reuse a cached (or current) root; I'm objecting to converting them
all in a single patch.  

> > > -	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> > > +	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
> > > +		/* Flush the TLB if CR0 is changed 1 -> 0.  */
> > > +		if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
> > > +			kvm_mmu_unload(vcpu);
> > 
> > Calling kvm_mmu_unload() instead of requesting a flush isn't coherent with respect
> > to the comment, or with SMEP handling.  And the SMEP handling isn't coherent with
> > respect to the changelog.  Please elaborate :-)
> 
> Yep, will do (the CR0.PG=0 case is similar to the CR4.PCIDE=0 case below).

Oh, you're freeing all roots to ensure a future MOV CR3 with NO_FLUSH and PCIDE=1
can't reuse a stale root.  That's necessary if and only if the MMU is shadowing
the guest; non-nested TDP MMUs just need to flush the guest's TLB.  The same is
true for the PCIDE case, i.e. we could optimize that too, though the main motivation
would be to clarify why all roots are unloaded.
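
In pseudo-code, the distinction would look something like the sketch below
(just to illustrate the point, not what the patch does today):

	/*
	 * CR0.PG 1->0: stale roots only matter when KVM is shadowing the
	 * guest's page tables; a non-nested TDP MMU only needs to flush
	 * the guest's TLB.
	 */
	if (tdp_enabled && !mmu_is_nested(vcpu))
		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
	else
		kvm_mmu_unload(vcpu);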

> Using kvm_mmu_unload() avoids loading a cached root just to throw it away
> immediately after,

The shadow paging case will throw it away, but not the non-nested TDP MMU case?

> but I can change this to a new KVM_REQ_MMU_UPDATE_ROOT flag that does
> 
> 	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);

I don't think that's necessary, I was just confused by the discrepancy.

> By the way, I have a possibly stupid question.  In kvm_set_cr3 (called e.g.
> from emulator_set_cr()) there is
> 
>  	if (cr3 != kvm_read_cr3(vcpu))
> 		kvm_mmu_new_pgd(vcpu, cr3);
> 
> What makes this work if mmu_is_nested(vcpu)?

Hmm, nothing.  VMX is "broken" anyways because it will kick out to userspace with
X86EMUL_UNHANDLEABLE due to the CR3 intercept check.  SVM is also broken in that
it doesn't check INTERCEPT_CR3_WRITE, e.g. will do the wrong thing even if L1 wants
to intercept CR3 accesses.

> Should this also have an "if (... && !tdp_enabled)"?

Yes?  That should avoid the nested mess.  This patch also needs to handle CR0 and
CR4 modifications if L2 is active, e.g. if L1 chooses not to intercept CR0/CR4.
kvm_post_set_cr_reinit_mmu() would be a lovely landing spot for that check :-D
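
Something along these lines, purely as an illustrative sketch of the
!tdp_enabled idea (not a tested change):

	if (cr3 != kvm_read_cr3(vcpu) && !tdp_enabled)
		kvm_mmu_new_pgd(vcpu, cr3);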

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 12/12] KVM: x86: do not unload MMU roots on all role changes
  2022-02-14 19:24       ` Sean Christopherson
@ 2022-02-15  8:17         ` Paolo Bonzini
  0 siblings, 0 replies; 50+ messages in thread
From: Paolo Bonzini @ 2022-02-15  8:17 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: linux-kernel, kvm, vkuznets, mlevitsk, dmatlack

On 2/14/22 20:24, Sean Christopherson wrote:
> On Mon, Feb 14, 2022, Paolo Bonzini wrote:
>> On 2/11/22 19:48, Sean Christopherson wrote:
>>> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
>>>> -	kvm_mmu_unload(vcpu);
>>>>    	kvm_init_mmu(vcpu);
>>>> +	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
>>>
>>> This is too risky IMO, there are far more flows than just MOV CR0/CR4 that are
>>> affected, e.g. SMM transitions, KVM_SET_SREGS, etc...
>
> I'm not concerned about the TLB flush aspects so much as the addition of
> kvm_mmu_new_pgd() in new paths.

Okay, yeah those are more complex and the existing ones are broken too.

>>>> -	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
>>>> +	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
>>>> +		/* Flush the TLB if CR0 is changed 1 -> 0.  */
>>>> +		if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
>>>> +			kvm_mmu_unload(vcpu);
>>>
>>> Calling kvm_mmu_unload() instead of requesting a flush isn't coherent with respect
>>> to the comment, or with SMEP handling.  And the SMEP handling isn't coherent with
>>> respect to the changelog.  Please elaborate :-)
>>
>> Yep, will do (the CR0.PG=0 case is similar to the CR4.PCIDE=0 case below).
> 
> Oh, you're freeing all roots to ensure a future MOV CR3 with NO_FLUSH and PCIDE=1
> can't reuse a stale root.  That's necessary if and only if the MMU is shadowing
> the guest; non-nested TDP MMUs just need to flush the guest's TLB.  The same is
> true for the PCIDE case, i.e. we could optimize that too, though the main motivation
> would be to clarify why all roots are unloaded.

Yes.  Clarifying all this should be done before the big change to 
kvm_mmu_reset_context().

>> Using kvm_mmu_unload() avoids loading a cached root just to throw it away
>> immediately after,
> 
> The shadow paging case will throw it away, but not the non-nested TDP MMU case?

Yes, the TDP case is okay since the role is the same.  kvm_init_mmu() is 
enough.

>> but I can change this to a new KVM_REQ_MMU_UPDATE_ROOT flag that does
>>
>> 	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
> 
> I don't think that's necessary, I was just confused by the discrepancy.

It may not be necessary but it is clearer IMO.  Let me post a new patch.

Paolo


^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2022-02-15  8:17 UTC | newest]

Thread overview: 50+ messages
2022-02-09 17:00 [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes Paolo Bonzini
2022-02-09 17:00 ` [PATCH 01/12] KVM: x86: host-initiated EFER.LME write affects the MMU Paolo Bonzini
2022-02-10 22:49   ` Sean Christopherson
2022-02-09 17:00 ` [PATCH 02/12] KVM: MMU: move MMU role accessors to header Paolo Bonzini
2022-02-10 23:00   ` Sean Christopherson
2022-02-09 17:00 ` [PATCH 03/12] KVM: x86: do not deliver asynchronous page faults if CR0.PG=0 Paolo Bonzini
2022-02-10 23:10   ` Sean Christopherson
2022-02-10 23:14     ` Sean Christopherson
2022-02-10 23:16       ` Sean Christopherson
2022-02-11 11:16         ` Paolo Bonzini
2022-02-09 17:00 ` [PATCH 04/12] KVM: MMU: WARN if PAE roots linger after kvm_mmu_unload Paolo Bonzini
2022-02-10 23:20   ` Sean Christopherson
2022-02-11 11:18     ` Paolo Bonzini
2022-02-09 17:00 ` [PATCH 05/12] KVM: MMU: avoid NULL-pointer dereference on page freeing bugs Paolo Bonzini
2022-02-11  0:24   ` Sean Christopherson
2022-02-11 11:21     ` Paolo Bonzini
2022-02-09 17:00 ` [PATCH 06/12] KVM: MMU: rename kvm_mmu_reload Paolo Bonzini
2022-02-11  0:27   ` Sean Christopherson
2022-02-11 10:07     ` Paolo Bonzini
2022-02-11 16:16       ` Sean Christopherson
2022-02-11 16:52         ` Paolo Bonzini
2022-02-09 17:00 ` [PATCH 07/12] KVM: x86: use struct kvm_mmu_root_info for mmu->root Paolo Bonzini
2022-02-11 17:39   ` Sean Christopherson
2022-02-09 17:00 ` [PATCH 08/12] KVM: MMU: do not consult levels when freeing roots Paolo Bonzini
2022-02-11  0:41   ` Sean Christopherson
2022-02-11  0:54     ` Sean Christopherson
2022-02-11  1:07       ` Paolo Bonzini
2022-02-11  1:35         ` Sean Christopherson
2022-02-11  1:44           ` Sean Christopherson
2022-02-11  2:20             ` Sean Christopherson
2022-02-09 17:00 ` [PATCH 09/12] KVM: MMU: look for a cached PGD when going from 32-bit to 64-bit Paolo Bonzini
2022-02-11  1:32   ` Sean Christopherson
2022-02-11  1:37     ` Sean Christopherson
2022-02-11 10:09       ` Paolo Bonzini
2022-02-11 11:45     ` Paolo Bonzini
2022-02-11 17:38       ` Sean Christopherson
2022-02-09 17:00 ` [PATCH 10/12] KVM: MMU: load new PGD after the shadow MMU is initialized Paolo Bonzini
2022-02-11 17:45   ` Sean Christopherson
2022-02-11 17:47     ` Paolo Bonzini
2022-02-09 17:00 ` [PATCH 11/12] KVM: MMU: remove kvm_mmu_calc_root_page_role Paolo Bonzini
2022-02-11 17:53   ` Sean Christopherson
2022-02-09 17:00 ` [PATCH 12/12] KVM: x86: do not unload MMU roots on all role changes Paolo Bonzini
2022-02-11  9:08   ` Nikunj A. Dadhania
2022-02-11 18:48   ` Sean Christopherson
2022-02-14 16:34     ` Paolo Bonzini
2022-02-14 19:24       ` Sean Christopherson
2022-02-15  8:17         ` Paolo Bonzini
2022-02-09 17:07 ` [PATCH 00/12] KVM: MMU: " Sean Christopherson
2022-02-09 17:11   ` Paolo Bonzini
2022-02-09 17:16     ` Sean Christopherson
