linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/15] KVM: X86: Fix and clean up for register caches
@ 2021-11-08 12:43 Lai Jiangshan
  2021-11-08 12:43 ` [PATCH 01/15] KVM: X86: Ensure the dirty PDPTEs to be loaded Lai Jiangshan
                   ` (17 more replies)
  0 siblings, 18 replies; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-08 12:43 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini; +Cc: Lai Jiangshan

From: Lai Jiangshan <laijs@linux.alibaba.com>

The patchset was started when I read the code of nested_svm_load_cr3()
and found that it marks CR3 available other than dirty when changing
vcpu->arch.cr3.  I thought its caller has ensured that vmcs.GUEST_CR3
will be or already be set to @cr3 so that it doesn't need to be marked
dirty.  And later I found that it is not true and it must be a bug in
a rare case before I realized that all the code just (ab)uses
vcpu->arch.regs_avail for VCPU_EXREG_CR3 and there is not such bug
of using regs_avail here.
(The above finding becomes a low meaning patch_15 rather than a fix)

The unhappyness of the reading code made me do some cleanup for
regs_avail and regs_dirty and kvm_register_xxx() functions in the hope
that the code become clearer with less misunderstanding.

Major focus was on VCPU_EXREG_CR3 and VCPU_EXREG_PDPTR.  They are
ensured to be marked the correct tags (available or dirty), and the
value is ensured to be synced to architecture before run if it is marked
dirty.

When cleaning VCPU_EXREG_PDPTR, I also checked if the corresponding
cr0/cr4 pdptr bits are all intercepted when !tdp_enabled, and I think
it is not clear enough, so X86_CR4_PDPTR_BITS is added as self-comments
in the code.

Lai Jiangshan (15):
  KVM: X86: Ensure the dirty PDPTEs to be loaded
  KVM: VMX: Mark VCPU_EXREG_PDPTR available in ept_save_pdptrs()
  KVM: SVM: Always clear available of VCPU_EXREG_PDPTR in svm_vcpu_run()
  KVM: VMX: Add and use X86_CR4_TLB_BITS when !enable_ept
  KVM: VMX: Add and use X86_CR4_PDPTR_BITS when !enable_ept
  KVM: X86: Move CR0 pdptr_bits into header file as X86_CR0_PDPTR_BITS
  KVM: SVM: Remove outdate comment in svm_load_mmu_pgd()
  KVM: SVM: Remove useless check in svm_load_mmu_pgd()
  KVM: SVM: Remove the unneeded code to mark available for CR3
  KVM: X86: Mark CR3 dirty when vcpu->arch.cr3 is changed
  KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty
  KVM: VMX: Reset the bits that are meaningful to be reset in
    vmx_register_cache_reset()
  KVM: SVM: Add and use svm_register_cache_reset()
  KVM: X86: Remove kvm_register_clear_available()
  KVM: nVMX: Always write vmcs.GUEST_CR3 during nested VM-Exit

 arch/x86/kvm/kvm_cache_regs.h | 13 ++++++------
 arch/x86/kvm/svm/nested.c     |  1 -
 arch/x86/kvm/svm/svm.c        | 17 ++++++++--------
 arch/x86/kvm/svm/svm.h        | 26 ++++++++++++++++++++++++
 arch/x86/kvm/vmx/nested.c     | 30 ++++++++++++++++++----------
 arch/x86/kvm/vmx/vmx.c        | 12 +++++++-----
 arch/x86/kvm/vmx/vmx.h        | 37 +++++++++++++++++++++++++----------
 arch/x86/kvm/x86.c            | 13 ++++++------
 8 files changed, 101 insertions(+), 48 deletions(-)

-- 
2.19.1.6.gb485710b


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 01/15] KVM: X86: Ensure the dirty PDPTEs to be loaded
  2021-11-08 12:43 [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
@ 2021-11-08 12:43 ` Lai Jiangshan
  2021-11-08 12:43 ` [PATCH 02/15] KVM: VMX: Mark VCPU_EXREG_PDPTR available in ept_save_pdptrs() Lai Jiangshan
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-08 12:43 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Xiao Guangrong

From: Lai Jiangshan <laijs@linux.alibaba.com>

For VMX, the dirty PDPTEs needs to be loaded before the coming VMENTER
via vmx_load_mmu_pgd() if EPT is enabled.

But not all paths that call load_pdptrs() will cause vmx_load_mmu_pgd()
to be invoked.  Normally, kvm_mmu_reset_context() and KVM_REQ_LOAD_MMU_PGD
are used to launch later vmx_load_mmu_pgd().

The commit d81135a57aa6("KVM: x86: do not reset mmu if CR0.CD and
CR0.NW are changed") skips kvm_mmu_reset_context() after load_pdptrs()
when changing CR0.CD and CR0.NW.

The commit 21823fbda552("KVM: x86: Invalidate all PGDs for the current
PCID on MOV CR3 w/ flush") skips KVM_REQ_LOAD_MMU_PGD after
load_pdptrs() when rewriting the CR3 with the same value.

The commit a91a7c709600("KVM: X86: Don't reset mmu context when
toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after
load_pdptrs() when changing CR4.PGE.

Fixes: d81135a57aa6("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
Fixes: 21823fbda552("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush")
Fixes: a91a7c709600("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/x86.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ac83d873d65b..034c233ea5a1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -830,6 +830,8 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
 
 	memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
 	kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
+	/* Ensure the dirty PDPTEs to be loaded. */
+	kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
 	vcpu->arch.pdptrs_from_userspace = false;
 
 	return 1;
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 02/15] KVM: VMX: Mark VCPU_EXREG_PDPTR available in ept_save_pdptrs()
  2021-11-08 12:43 [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
  2021-11-08 12:43 ` [PATCH 01/15] KVM: X86: Ensure the dirty PDPTEs to be loaded Lai Jiangshan
@ 2021-11-08 12:43 ` Lai Jiangshan
  2021-11-08 12:43 ` [PATCH 03/15] KVM: SVM: Always clear available of VCPU_EXREG_PDPTR in svm_vcpu_run() Lai Jiangshan
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-08 12:43 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

mmu->pdptrs[] and vmcs.GUEST_PDPTR[0-3] are synced, so mmu->pdptrs is
available and GUEST_PDPTR[0-3] is not dirty.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/vmx/vmx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 39db4f56bffd..79e5df5fbb32 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3001,7 +3001,7 @@ void ept_save_pdptrs(struct kvm_vcpu *vcpu)
 	mmu->pdptrs[2] = vmcs_read64(GUEST_PDPTR2);
 	mmu->pdptrs[3] = vmcs_read64(GUEST_PDPTR3);
 
-	kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
+	kvm_register_mark_available(vcpu, VCPU_EXREG_PDPTR);
 }
 
 #define CR3_EXITING_BITS (CPU_BASED_CR3_LOAD_EXITING | \
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 03/15] KVM: SVM: Always clear available of VCPU_EXREG_PDPTR in svm_vcpu_run()
  2021-11-08 12:43 [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
  2021-11-08 12:43 ` [PATCH 01/15] KVM: X86: Ensure the dirty PDPTEs to be loaded Lai Jiangshan
  2021-11-08 12:43 ` [PATCH 02/15] KVM: VMX: Mark VCPU_EXREG_PDPTR available in ept_save_pdptrs() Lai Jiangshan
@ 2021-11-08 12:43 ` Lai Jiangshan
  2021-11-08 12:43 ` [PATCH 04/15] KVM: VMX: Add and use X86_CR4_TLB_BITS when !enable_ept Lai Jiangshan
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-08 12:43 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

Make it the same logic to handle the availability of VCPU_EXREG_PDPTR
as VMX and also remove a branch in svm_vcpu_run().

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/svm/svm.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 88a730ad47a1..3e7043173668 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1583,10 +1583,16 @@ static void svm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 
 static void svm_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
 {
+	kvm_register_mark_available(vcpu, reg);
+
 	switch (reg) {
 	case VCPU_EXREG_PDPTR:
-		BUG_ON(!npt_enabled);
-		load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu));
+		/*
+		 * When !npt_enabled, mmu->pdptrs[] is already available since
+		 * it is always updated per SDM when moving to CRs.
+		 */
+		if (npt_enabled)
+			load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu));
 		break;
 	default:
 		KVM_BUG_ON(1, vcpu->kvm);
@@ -3964,8 +3970,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
 	svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
 	vmcb_mark_all_clean(svm->vmcb);
 
-	if (npt_enabled)
-		kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);
+	kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);
 
 	/*
 	 * We need to handle MC intercepts here before the vcpu has a chance to
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 04/15] KVM: VMX: Add and use X86_CR4_TLB_BITS when !enable_ept
  2021-11-08 12:43 [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
                   ` (2 preceding siblings ...)
  2021-11-08 12:43 ` [PATCH 03/15] KVM: SVM: Always clear available of VCPU_EXREG_PDPTR in svm_vcpu_run() Lai Jiangshan
@ 2021-11-08 12:43 ` Lai Jiangshan
  2021-11-18 15:18   ` Paolo Bonzini
  2021-11-08 12:43 ` [PATCH 05/15] KVM: VMX: Add and use X86_CR4_PDPTR_BITS " Lai Jiangshan
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-08 12:43 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

In set_cr4_guest_host_mask(), X86_CR4_PGE is set to be intercepted when
!enable_ept just because X86_CR4_PGE is the only bit that is
responsible for flushing TLB but listed in KVM_POSSIBLE_CR4_GUEST_BITS.

It is clearer and self-documented to use X86_CR4_TLB_BITS instead.

No functionality changed.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/kvm_cache_regs.h | 2 ++
 arch/x86/kvm/vmx/vmx.c        | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 90e1ffdc05b7..8fe036efa654 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -9,6 +9,8 @@
 	(X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR  \
 	 | X86_CR4_OSXMMEXCPT | X86_CR4_PGE | X86_CR4_TSD | X86_CR4_FSGSBASE)
 
+#define X86_CR4_TLB_BITS (X86_CR4_PGE | X86_CR4_PCIDE | X86_CR4_PAE | X86_CR4_SMEP)
+
 #define BUILD_KVM_GPR_ACCESSORS(lname, uname)				      \
 static __always_inline unsigned long kvm_##lname##_read(struct kvm_vcpu *vcpu)\
 {									      \
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 79e5df5fbb32..1795702dc6de 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4107,7 +4107,7 @@ void set_cr4_guest_host_mask(struct vcpu_vmx *vmx)
 	vcpu->arch.cr4_guest_owned_bits = KVM_POSSIBLE_CR4_GUEST_BITS &
 					  ~vcpu->arch.cr4_guest_rsvd_bits;
 	if (!enable_ept)
-		vcpu->arch.cr4_guest_owned_bits &= ~X86_CR4_PGE;
+		vcpu->arch.cr4_guest_owned_bits &= ~X86_CR4_TLB_BITS;
 	if (is_guest_mode(&vmx->vcpu))
 		vcpu->arch.cr4_guest_owned_bits &=
 			~get_vmcs12(vcpu)->cr4_guest_host_mask;
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 05/15] KVM: VMX: Add and use X86_CR4_PDPTR_BITS when !enable_ept
  2021-11-08 12:43 [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
                   ` (3 preceding siblings ...)
  2021-11-08 12:43 ` [PATCH 04/15] KVM: VMX: Add and use X86_CR4_TLB_BITS when !enable_ept Lai Jiangshan
@ 2021-11-08 12:43 ` Lai Jiangshan
  2021-11-08 12:43 ` [PATCH 06/15] KVM: X86: Move CR0 pdptr_bits into header file as X86_CR0_PDPTR_BITS Lai Jiangshan
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-08 12:43 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

In set_cr4_guest_host_mask(), all cr4 pdptr bits are already set to be
intercepted in an unclear way.

Add X86_CR4_PDPTR_BITS to make it clear and self-documented.

No functionality changed.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/kvm_cache_regs.h | 1 +
 arch/x86/kvm/vmx/vmx.c        | 4 +++-
 arch/x86/kvm/x86.c            | 4 +---
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 8fe036efa654..592f9eb9753b 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -10,6 +10,7 @@
 	 | X86_CR4_OSXMMEXCPT | X86_CR4_PGE | X86_CR4_TSD | X86_CR4_FSGSBASE)
 
 #define X86_CR4_TLB_BITS (X86_CR4_PGE | X86_CR4_PCIDE | X86_CR4_PAE | X86_CR4_SMEP)
+#define X86_CR4_PDPTR_BITS (X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_SMEP)
 
 #define BUILD_KVM_GPR_ACCESSORS(lname, uname)				      \
 static __always_inline unsigned long kvm_##lname##_read(struct kvm_vcpu *vcpu)\
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1795702dc6de..d94e51e9c08f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4106,8 +4106,10 @@ void set_cr4_guest_host_mask(struct vcpu_vmx *vmx)
 
 	vcpu->arch.cr4_guest_owned_bits = KVM_POSSIBLE_CR4_GUEST_BITS &
 					  ~vcpu->arch.cr4_guest_rsvd_bits;
-	if (!enable_ept)
+	if (!enable_ept) {
 		vcpu->arch.cr4_guest_owned_bits &= ~X86_CR4_TLB_BITS;
+		vcpu->arch.cr4_guest_owned_bits &= ~X86_CR4_PDPTR_BITS;
+	}
 	if (is_guest_mode(&vmx->vcpu))
 		vcpu->arch.cr4_guest_owned_bits &=
 			~get_vmcs12(vcpu)->cr4_guest_host_mask;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 034c233ea5a1..b92d4241b4d9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1052,8 +1052,6 @@ EXPORT_SYMBOL_GPL(kvm_post_set_cr4);
 int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	unsigned long old_cr4 = kvm_read_cr4(vcpu);
-	unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
-				   X86_CR4_SMEP;
 
 	if (!kvm_is_valid_cr4(vcpu, cr4))
 		return 1;
@@ -1064,7 +1062,7 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 		if ((cr4 ^ old_cr4) & X86_CR4_LA57)
 			return 1;
 	} else if (is_paging(vcpu) && (cr4 & X86_CR4_PAE)
-		   && ((cr4 ^ old_cr4) & pdptr_bits)
+		   && ((cr4 ^ old_cr4) & X86_CR4_PDPTR_BITS)
 		   && !load_pdptrs(vcpu, vcpu->arch.walk_mmu,
 				   kvm_read_cr3(vcpu)))
 		return 1;
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 06/15] KVM: X86: Move CR0 pdptr_bits into header file as X86_CR0_PDPTR_BITS
  2021-11-08 12:43 [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
                   ` (4 preceding siblings ...)
  2021-11-08 12:43 ` [PATCH 05/15] KVM: VMX: Add and use X86_CR4_PDPTR_BITS " Lai Jiangshan
@ 2021-11-08 12:43 ` Lai Jiangshan
  2021-11-08 12:43 ` [PATCH 07/15] KVM: SVM: Remove outdated comment in svm_load_mmu_pgd() Lai Jiangshan
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-08 12:43 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

Not functionality changed.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/kvm_cache_regs.h | 3 +++
 arch/x86/kvm/x86.c            | 3 +--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 592f9eb9753b..54a996adb18d 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -9,9 +9,12 @@
 	(X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR  \
 	 | X86_CR4_OSXMMEXCPT | X86_CR4_PGE | X86_CR4_TSD | X86_CR4_FSGSBASE)
 
+#define X86_CR0_PDPTR_BITS (X86_CR0_CD | X86_CR0_NW | X86_CR0_PG)
 #define X86_CR4_TLB_BITS (X86_CR4_PGE | X86_CR4_PCIDE | X86_CR4_PAE | X86_CR4_SMEP)
 #define X86_CR4_PDPTR_BITS (X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_SMEP)
 
+static_assert(!(KVM_POSSIBLE_CR0_GUEST_BITS & X86_CR0_PDPTR_BITS));
+
 #define BUILD_KVM_GPR_ACCESSORS(lname, uname)				      \
 static __always_inline unsigned long kvm_##lname##_read(struct kvm_vcpu *vcpu)\
 {									      \
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b92d4241b4d9..e5f5042d4842 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -858,7 +858,6 @@ EXPORT_SYMBOL_GPL(kvm_post_set_cr0);
 int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 {
 	unsigned long old_cr0 = kvm_read_cr0(vcpu);
-	unsigned long pdptr_bits = X86_CR0_CD | X86_CR0_NW | X86_CR0_PG;
 
 	cr0 |= X86_CR0_ET;
 
@@ -888,7 +887,7 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 	}
 #endif
 	if (!(vcpu->arch.efer & EFER_LME) && (cr0 & X86_CR0_PG) &&
-	    is_pae(vcpu) && ((cr0 ^ old_cr0) & pdptr_bits) &&
+	    is_pae(vcpu) && ((cr0 ^ old_cr0) & X86_CR0_PDPTR_BITS) &&
 	    !load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu)))
 		return 1;
 
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 07/15] KVM: SVM: Remove outdated comment in svm_load_mmu_pgd()
  2021-11-08 12:43 [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
                   ` (5 preceding siblings ...)
  2021-11-08 12:43 ` [PATCH 06/15] KVM: X86: Move CR0 pdptr_bits into header file as X86_CR0_PDPTR_BITS Lai Jiangshan
@ 2021-11-08 12:43 ` Lai Jiangshan
  2021-11-08 12:44 ` [PATCH 08/15] KVM: SVM: Remove useless check " Lai Jiangshan
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-08 12:43 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

The comment had been added in the commit 689f3bf21628 ("KVM: x86: unify
callbacks to load paging root") and its related code was removed later,
and it has nothing to do with the next line of code.

So the comment should be removed too.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/svm/svm.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 3e7043173668..e3607fa025d3 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4000,7 +4000,6 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 
 		hv_track_root_tdp(vcpu, root_hpa);
 
-		/* Loading L2's CR3 is handled by enter_svm_guest_mode.  */
 		if (!test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail))
 			return;
 		cr3 = vcpu->arch.cr3;
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 08/15] KVM: SVM: Remove useless check in svm_load_mmu_pgd()
  2021-11-08 12:43 [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
                   ` (6 preceding siblings ...)
  2021-11-08 12:43 ` [PATCH 07/15] KVM: SVM: Remove outdated comment in svm_load_mmu_pgd() Lai Jiangshan
@ 2021-11-08 12:44 ` Lai Jiangshan
  2021-11-08 12:44 ` [PATCH 09/15] KVM: SVM: Remove the unneeded code to mark available for CR3 Lai Jiangshan
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-08 12:44 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

VCPU_EXREG_CR3 is never cleared from vcpu->arch.regs_avail in SVM so
the if-branch is always false and useless, just remove it.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/svm/svm.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index e3607fa025d3..b7da66935e72 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4000,8 +4000,6 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 
 		hv_track_root_tdp(vcpu, root_hpa);
 
-		if (!test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail))
-			return;
 		cr3 = vcpu->arch.cr3;
 	} else if (vcpu->arch.mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
 		cr3 = __sme_set(root_hpa) | kvm_get_active_pcid(vcpu);
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 09/15] KVM: SVM: Remove the unneeded code to mark available for CR3
  2021-11-08 12:43 [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
                   ` (7 preceding siblings ...)
  2021-11-08 12:44 ` [PATCH 08/15] KVM: SVM: Remove useless check " Lai Jiangshan
@ 2021-11-08 12:44 ` Lai Jiangshan
  2021-11-18 15:17   ` Paolo Bonzini
  2021-11-08 12:44 ` [PATCH 10/15] KVM: X86: Mark CR3 dirty when vcpu->arch.cr3 is changed Lai Jiangshan
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-08 12:44 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

VCPU_EXREG_CR3 is never cleared from vcpu->arch.regs_avail in SVM so
marking available for CR3 is mere an NOP, just remove it.

And it is not required to mark it dirty since VCPU_EXREG_CR3 is neither
never cleared from vcpu->arch.regs_dirty and SVM doesn't use the dirty
information of VCPU_EXREG_CR3.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/svm/nested.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 13a58722e097..2d88ff584d61 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -444,7 +444,6 @@ static int nested_svm_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
 		kvm_mmu_new_pgd(vcpu, cr3);
 
 	vcpu->arch.cr3 = cr3;
-	kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
 
 	/* Re-initialize the MMU, e.g. to pick up CR4 MMU role changes. */
 	kvm_init_mmu(vcpu);
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 10/15] KVM: X86: Mark CR3 dirty when vcpu->arch.cr3 is changed
  2021-11-08 12:43 [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
                   ` (8 preceding siblings ...)
  2021-11-08 12:44 ` [PATCH 09/15] KVM: SVM: Remove the unneeded code to mark available for CR3 Lai Jiangshan
@ 2021-11-08 12:44 ` Lai Jiangshan
  2021-11-08 12:44 ` [PATCH 11/15] KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty Lai Jiangshan
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-08 12:44 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

When vcpu->arch.cr3 is changed, it should be marked dirty unless it
is being updated to the value of the architecture guest CR3 (i.e.
VMX.GUEST_CR3 or vmcb->save.cr3 when tdp is enabled).

This patch has no functionality changed because
kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3) is superset of
kvm_register_mark_available(vcpu, VCPU_EXREG_CR3) with additional
change to vcpu->arch.regs_dirty, but no code uses regs_dirty for
VCPU_EXREG_CR3.  (vmx_load_mmu_pgd() uses vcpu->arch.regs_avail instead
to test if VCPU_EXREG_CR3 dirty which means current code (ab)uses
regs_avail for VCPU_EXREG_CR3 dirty information.)

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/vmx/nested.c | 2 +-
 arch/x86/kvm/x86.c        | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index dc0e5f80715d..ee5a68c2ea3a 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1134,7 +1134,7 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
 		kvm_mmu_new_pgd(vcpu, cr3);
 
 	vcpu->arch.cr3 = cr3;
-	kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
+	kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
 
 	/* Re-initialize the MMU, e.g. to pick up CR4 MMU role changes. */
 	kvm_init_mmu(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e5f5042d4842..6ca19cac4aff 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1159,7 +1159,7 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 		kvm_mmu_new_pgd(vcpu, cr3);
 
 	vcpu->arch.cr3 = cr3;
-	kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
+	kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
 
 handle_tlb_flush:
 	/*
@@ -10591,7 +10591,7 @@ static int __set_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs,
 	vcpu->arch.cr2 = sregs->cr2;
 	*mmu_reset_needed |= kvm_read_cr3(vcpu) != sregs->cr3;
 	vcpu->arch.cr3 = sregs->cr3;
-	kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
+	kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
 
 	kvm_set_cr8(vcpu, sregs->cr8);
 
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 11/15] KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty
  2021-11-08 12:43 [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
                   ` (9 preceding siblings ...)
  2021-11-08 12:44 ` [PATCH 10/15] KVM: X86: Mark CR3 dirty when vcpu->arch.cr3 is changed Lai Jiangshan
@ 2021-11-08 12:44 ` Lai Jiangshan
  2021-12-15 15:47   ` Maxim Levitsky
  2021-11-08 12:44 ` [PATCH 12/15] KVM: VMX: Reset the bits that are meaningful to be reset in vmx_register_cache_reset() Lai Jiangshan
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-08 12:44 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

When vcpu->arch.cr3 is changed, it is marked dirty, so vmcs.GUEST_CR3
can be updated only when kvm_register_is_dirty(vcpu, VCPU_EXREG_CR3).

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/vmx/vmx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d94e51e9c08f..38b65b97fb7b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3126,9 +3126,9 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 
 		if (!enable_unrestricted_guest && !is_paging(vcpu))
 			guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
-		else if (test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail))
+		else if (kvm_register_is_dirty(vcpu, VCPU_EXREG_CR3))
 			guest_cr3 = vcpu->arch.cr3;
-		else /* vmcs01.GUEST_CR3 is already up-to-date. */
+		else /* vmcs.GUEST_CR3 is already up-to-date. */
 			update_guest_cr3 = false;
 		vmx_ept_load_pdptrs(vcpu);
 	} else {
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 12/15] KVM: VMX: Reset the bits that are meaningful to be reset in vmx_register_cache_reset()
  2021-11-08 12:43 [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
                   ` (10 preceding siblings ...)
  2021-11-08 12:44 ` [PATCH 11/15] KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty Lai Jiangshan
@ 2021-11-08 12:44 ` Lai Jiangshan
  2021-11-18 15:25   ` Paolo Bonzini
  2021-11-08 12:44 ` [PATCH 13/15] KVM: SVM: Add and use svm_register_cache_reset() Lai Jiangshan
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-08 12:44 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

Add meaningful bits as VMX_REGS_AVAIL_SET and VMX_REGS_DIRTY_SET.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/vmx/vmx.h | 37 +++++++++++++++++++++++++++----------
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index e7db42e3b0ce..465aa415c3cb 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -437,16 +437,33 @@ BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL)
 
 static inline void vmx_register_cache_reset(struct kvm_vcpu *vcpu)
 {
-	vcpu->arch.regs_avail = ~((1 << VCPU_REGS_RIP) | (1 << VCPU_REGS_RSP)
-				  | (1 << VCPU_EXREG_RFLAGS)
-				  | (1 << VCPU_EXREG_PDPTR)
-				  | (1 << VCPU_EXREG_SEGMENTS)
-				  | (1 << VCPU_EXREG_CR0)
-				  | (1 << VCPU_EXREG_CR3)
-				  | (1 << VCPU_EXREG_CR4)
-				  | (1 << VCPU_EXREG_EXIT_INFO_1)
-				  | (1 << VCPU_EXREG_EXIT_INFO_2));
-	vcpu->arch.regs_dirty = 0;
+/*
+ * VMX_REGS_AVAIL_SET - The set of registers that will be updated in cache on
+ *			demand.  Other registers not listed here are synced to
+ *			the cache immediately after VM-Exit.
+ *
+ * VMX_REGS_DIRTY_SET - The set of registers that might be outdated in
+ *			architecture. Other registers not listed here are synced
+ *			to the architecture immediately when modifying.
+ */
+#define VMX_REGS_AVAIL_SET	((1 << VCPU_REGS_RIP) |\
+				(1 << VCPU_REGS_RSP) |\
+				(1 << VCPU_EXREG_RFLAGS) |\
+				(1 << VCPU_EXREG_PDPTR) |\
+				(1 << VCPU_EXREG_SEGMENTS) |\
+				(1 << VCPU_EXREG_CR0) |\
+				(1 << VCPU_EXREG_CR3) |\
+				(1 << VCPU_EXREG_CR4) |\
+				(1 << VCPU_EXREG_EXIT_INFO_1) |\
+				(1 << VCPU_EXREG_EXIT_INFO_2))
+
+#define VMX_REGS_DIRTY_SET	((1 << VCPU_REGS_RIP) |\
+				(1 << VCPU_REGS_RSP) |\
+				(1 << VCPU_EXREG_PDPTR) |\
+				(1 << VCPU_EXREG_CR3))
+
+	vcpu->arch.regs_avail &= ~VMX_REGS_AVAIL_SET;
+	vcpu->arch.regs_dirty &= ~VMX_REGS_DIRTY_SET;
 }
 
 static inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 13/15] KVM: SVM: Add and use svm_register_cache_reset()
  2021-11-08 12:43 [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
                   ` (11 preceding siblings ...)
  2021-11-08 12:44 ` [PATCH 12/15] KVM: VMX: Reset the bits that are meaningful to be reset in vmx_register_cache_reset() Lai Jiangshan
@ 2021-11-08 12:44 ` Lai Jiangshan
  2021-11-18 15:37   ` Paolo Bonzini
  2021-11-08 12:44 ` [PATCH 14/15] KVM: X86: Remove kvm_register_clear_available() Lai Jiangshan
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-08 12:44 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

It resets all the appropriate bits like vmx.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/svm/svm.c |  3 +--
 arch/x86/kvm/svm/svm.h | 26 ++++++++++++++++++++++++++
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b7da66935e72..ba9cfddd2875 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3969,8 +3969,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
 
 	svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
 	vmcb_mark_all_clean(svm->vmcb);
-
-	kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);
+	svm_register_cache_reset(vcpu);
 
 	/*
 	 * We need to handle MC intercepts here before the vcpu has a chance to
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 0d7bbe548ac3..1cf5d5e2d0cd 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -274,6 +274,32 @@ static inline bool vmcb_is_dirty(struct vmcb *vmcb, int bit)
         return !test_bit(bit, (unsigned long *)&vmcb->control.clean);
 }
 
+static inline void svm_register_cache_reset(struct kvm_vcpu *vcpu)
+{
+/*
+ * SVM_REGS_AVAIL_SET - The set of registers that will be updated in cache on
+ *			demand.  Other registers not listed here are synced to
+ *			the cache immediately after VM-Exit.
+ *
+ * SVM_REGS_DIRTY_SET - The set of registers that might be outdated in
+ *			architecture. Other registers not listed here are synced
+ *			to the architecture immediately when modifying.
+ *
+ *			Special case: VCPU_EXREG_CR3 should be in this set due
+ *			to the fact.  But KVM_REQ_LOAD_MMU_PGD is always
+ *			requested when the cache vcpu->arch.cr3 is changed and
+ *			svm_load_mmu_pgd() always syncs the new CR3 value into
+ *			the architecture.  So the dirty information of
+ *			VCPU_EXREG_CR3 is not used which means VCPU_EXREG_CR3
+ *			isn't required to be put in this set.
+ */
+#define SVM_REGS_AVAIL_SET	(1 << VCPU_EXREG_PDPTR)
+#define SVM_REGS_DIRTY_SET	(0)
+
+	vcpu->arch.regs_avail &= ~SVM_REGS_AVAIL_SET;
+	vcpu->arch.regs_dirty &= ~SVM_REGS_DIRTY_SET;
+}
+
 static inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu)
 {
 	return container_of(vcpu, struct vcpu_svm, vcpu);
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 14/15] KVM: X86: Remove kvm_register_clear_available()
  2021-11-08 12:43 [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
                   ` (12 preceding siblings ...)
  2021-11-08 12:44 ` [PATCH 13/15] KVM: SVM: Add and use svm_register_cache_reset() Lai Jiangshan
@ 2021-11-08 12:44 ` Lai Jiangshan
  2021-11-08 12:44 ` [PATCH 15/15] KVM: nVMX: Always write vmcs.GUEST_CR3 during nested VM-Exit Lai Jiangshan
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-08 12:44 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

It has no user.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/kvm_cache_regs.h | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 54a996adb18d..0f8847b981e5 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -61,13 +61,6 @@ static inline void kvm_register_mark_available(struct kvm_vcpu *vcpu,
 	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
 }
 
-static inline void kvm_register_clear_available(struct kvm_vcpu *vcpu,
-					       enum kvm_reg reg)
-{
-	__clear_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
-	__clear_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
-}
-
 static inline void kvm_register_mark_dirty(struct kvm_vcpu *vcpu,
 					   enum kvm_reg reg)
 {
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 15/15] KVM: nVMX: Always write vmcs.GUEST_CR3 during nested VM-Exit
  2021-11-08 12:43 [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
                   ` (13 preceding siblings ...)
  2021-11-08 12:44 ` [PATCH 14/15] KVM: X86: Remove kvm_register_clear_available() Lai Jiangshan
@ 2021-11-08 12:44 ` Lai Jiangshan
  2021-11-18 15:52   ` Paolo Bonzini
  2021-11-11 14:45 ` [PATCH 16/15] KVM: X86: Update mmu->pdptrs only when it is changed Lai Jiangshan
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-08 12:44 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

For VM-Enter, vmcs.GUEST_CR3 and vcpu->arch.cr3 are synced and it is
better to mark VCPU_EXREG_CR3 available rather than dirty to reduce a
redundant vmwrite(GUEST_CR3) in vmx_load_mmu_pgd().

But nested_vmx_load_cr3() is also served for VM-Exit which doesn't
set vmcs.GUEST_CR3.

This patch moves writing to vmcs.GUEST_CR3 into nested_vmx_load_cr3()
for both nested VM-Eneter/Exit and use kvm_register_mark_available().

This patch doesn't cause any extra writing to vmcs.GUEST_CR3 and if
userspace is modifying CR3 with KVM_SET_SREGS later, the dirty info
for VCPU_EXREG_CR3 would be set for next writing to vmcs.GUEST_CR3
and no update will be lost.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/vmx/nested.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index ee5a68c2ea3a..4ddd4b1b0503 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1133,8 +1133,28 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
 	if (!nested_ept)
 		kvm_mmu_new_pgd(vcpu, cr3);
 
+	/*
+	 * Immediately write vmcs.GUEST_CR3 when changing vcpu->arch.cr3.
+	 *
+	 * VCPU_EXREG_CR3 is marked available rather than dirty because
+	 * vcpu->arch.cr3 and vmcs.GUEST_CR3 are synced when enable_ept and
+	 * vmcs.GUEST_CR3 is irrelevant to vcpu->arch.cr3 when !enable_ept.
+	 *
+	 * For VM-Enter case, it will be propagated to vmcs12 on nested
+	 * VM-Exit, which can occur without actually running L2 and thus
+	 * without hitting vmx_load_mmu_pgd(), e.g. if L1 is entering L2 with
+	 * vmcs12.GUEST_ACTIVITYSTATE=HLT, in which case KVM will intercept
+	 * the transition to HLT instead of running L2.
+	 *
+	 * For VM-Exit case, it is likely that vmcs.GUEST_CR3 == cr3 here, but
+	 * L1 may set HOST_CR3 to a value other than its CR3 before VM-Entry,
+	 * so we just update it unconditionally.
+	 */
+	if (enable_ept)
+		vmcs_writel(GUEST_CR3, cr3);
+
 	vcpu->arch.cr3 = cr3;
-	kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
+	kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
 
 	/* Re-initialize the MMU, e.g. to pick up CR4 MMU role changes. */
 	kvm_init_mmu(vcpu);
@@ -2600,16 +2620,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 				from_vmentry, entry_failure_code))
 		return -EINVAL;
 
-	/*
-	 * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
-	 * on nested VM-Exit, which can occur without actually running L2 and
-	 * thus without hitting vmx_load_mmu_pgd(), e.g. if L1 is entering L2 with
-	 * vmcs12.GUEST_ACTIVITYSTATE=HLT, in which case KVM will intercept the
-	 * transition to HLT instead of running L2.
-	 */
-	if (enable_ept)
-		vmcs_writel(GUEST_CR3, vmcs12->guest_cr3);
-
 	/* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */
 	if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) &&
 	    is_pae_paging(vcpu)) {
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 16/15] KVM: X86: Update mmu->pdptrs only when it is changed
  2021-11-08 12:43 [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
                   ` (14 preceding siblings ...)
  2021-11-08 12:44 ` [PATCH 15/15] KVM: nVMX: Always write vmcs.GUEST_CR3 during nested VM-Exit Lai Jiangshan
@ 2021-11-11 14:45 ` Lai Jiangshan
  2021-12-07 23:43   ` Sean Christopherson
  2021-11-11 14:46 ` [PATCH 17/15] KVM: X86: Ensure pae_root to be reconstructed for shadow paging if the guest PDPTEs " Lai Jiangshan
  2021-11-18  8:53 ` [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
  17 siblings, 1 reply; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-11 14:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, Lai Jiangshan, Paolo Bonzini, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

It is unchanged in most cases.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/x86.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6ca19cac4aff..0176eaa86a35 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -828,10 +828,13 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
 		}
 	}
 
-	memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
-	kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
-	/* Ensure the dirty PDPTEs to be loaded. */
-	kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
+	kvm_register_mark_available(vcpu, VCPU_EXREG_PDPTR);
+	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
+		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
+		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
+		/* Ensure the dirty PDPTEs to be loaded. */
+		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
+	}
 	vcpu->arch.pdptrs_from_userspace = false;
 
 	return 1;
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 17/15] KVM: X86: Ensure pae_root to be reconstructed for shadow paging if the guest PDPTEs is changed
  2021-11-08 12:43 [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
                   ` (15 preceding siblings ...)
  2021-11-11 14:45 ` [PATCH 16/15] KVM: X86: Update mmu->pdptrs only when it is changed Lai Jiangshan
@ 2021-11-11 14:46 ` Lai Jiangshan
  2021-11-23  9:34   ` Lai Jiangshan
  2021-12-08  0:15   ` Sean Christopherson
  2021-11-18  8:53 ` [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
  17 siblings, 2 replies; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-11 14:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, Lai Jiangshan, Paolo Bonzini, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Xiao Guangrong

From: Lai Jiangshan <laijs@linux.alibaba.com>

For shadow paging, the pae_root needs to be reconstructed before the
coming VMENTER if the guest PDPTEs is changed.

But not all paths that call load_pdptrs() will cause the pae_root to be
reconstructed. Normally, kvm_mmu_reset_context() and kvm_mmu_free_roots()
are used to launch later reconstruction.

The commit d81135a57aa6("KVM: x86: do not reset mmu if CR0.CD and
CR0.NW are changed") skips kvm_mmu_reset_context() after load_pdptrs()
when changing CR0.CD and CR0.NW.

The commit 21823fbda552("KVM: x86: Invalidate all PGDs for the current
PCID on MOV CR3 w/ flush") skips kvm_mmu_free_roots() after
load_pdptrs() when rewriting the CR3 with the same value.

The commit a91a7c709600("KVM: X86: Don't reset mmu context when
toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after
load_pdptrs() when changing CR4.PGE.

Normally, the guest doesn't change the PDPTEs before doing only the
above operation without touching other bits that can force pae_root to
be reconstructed.  Guests like linux would keep the PDPTEs unchaged
for every instance of pagetable.

Fixes: d81135a57aa6("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
Fixes: 21823fbda552("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush")
Fixes: a91a7c709600("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/x86.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0176eaa86a35..cfba337e46ab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -832,8 +832,14 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
 	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
 		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
 		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
-		/* Ensure the dirty PDPTEs to be loaded. */
-		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
+		/*
+		 * Ensure the dirty PDPTEs to be loaded for VMX with EPT
+		 * enabled or pae_root to be reconstructed for shadow paging.
+		 */
+		if (tdp_enabled)
+			kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
+		else
+			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
 	}
 	vcpu->arch.pdptrs_from_userspace = false;
 
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 00/15] KVM: X86: Fix and clean up for register caches
  2021-11-08 12:43 [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
                   ` (16 preceding siblings ...)
  2021-11-11 14:46 ` [PATCH 17/15] KVM: X86: Ensure pae_root to be reconstructed for shadow paging if the guest PDPTEs " Lai Jiangshan
@ 2021-11-18  8:53 ` Lai Jiangshan
  17 siblings, 0 replies; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-18  8:53 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, kvm, Paolo Bonzini

Hello,all

Ping

Thanks
Lai

On 2021/11/8 20:43, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> The patchset was started when I read the code of nested_svm_load_cr3()
> and found that it marks CR3 available other than dirty when changing
> vcpu->arch.cr3.  I thought its caller has ensured that vmcs.GUEST_CR3
> will be or already be set to @cr3 so that it doesn't need to be marked
> dirty.  And later I found that it is not true and it must be a bug in
> a rare case before I realized that all the code just (ab)uses
> vcpu->arch.regs_avail for VCPU_EXREG_CR3 and there is not such bug
> of using regs_avail here.
> (The above finding becomes a low meaning patch_15 rather than a fix)
> 
> The unhappyness of the reading code made me do some cleanup for
> regs_avail and regs_dirty and kvm_register_xxx() functions in the hope
> that the code become clearer with less misunderstanding.
> 
> Major focus was on VCPU_EXREG_CR3 and VCPU_EXREG_PDPTR.  They are
> ensured to be marked the correct tags (available or dirty), and the
> value is ensured to be synced to architecture before run if it is marked
> dirty.
> 
> When cleaning VCPU_EXREG_PDPTR, I also checked if the corresponding
> cr0/cr4 pdptr bits are all intercepted when !tdp_enabled, and I think
> it is not clear enough, so X86_CR4_PDPTR_BITS is added as self-comments
> in the code.
> 
> Lai Jiangshan (15):
>    KVM: X86: Ensure the dirty PDPTEs to be loaded
>    KVM: VMX: Mark VCPU_EXREG_PDPTR available in ept_save_pdptrs()
>    KVM: SVM: Always clear available of VCPU_EXREG_PDPTR in svm_vcpu_run()
>    KVM: VMX: Add and use X86_CR4_TLB_BITS when !enable_ept
>    KVM: VMX: Add and use X86_CR4_PDPTR_BITS when !enable_ept
>    KVM: X86: Move CR0 pdptr_bits into header file as X86_CR0_PDPTR_BITS
>    KVM: SVM: Remove outdate comment in svm_load_mmu_pgd()
>    KVM: SVM: Remove useless check in svm_load_mmu_pgd()
>    KVM: SVM: Remove the unneeded code to mark available for CR3
>    KVM: X86: Mark CR3 dirty when vcpu->arch.cr3 is changed
>    KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty
>    KVM: VMX: Reset the bits that are meaningful to be reset in
>      vmx_register_cache_reset()
>    KVM: SVM: Add and use svm_register_cache_reset()
>    KVM: X86: Remove kvm_register_clear_available()
>    KVM: nVMX: Always write vmcs.GUEST_CR3 during nested VM-Exit
> 
>   arch/x86/kvm/kvm_cache_regs.h | 13 ++++++------
>   arch/x86/kvm/svm/nested.c     |  1 -
>   arch/x86/kvm/svm/svm.c        | 17 ++++++++--------
>   arch/x86/kvm/svm/svm.h        | 26 ++++++++++++++++++++++++
>   arch/x86/kvm/vmx/nested.c     | 30 ++++++++++++++++++----------
>   arch/x86/kvm/vmx/vmx.c        | 12 +++++++-----
>   arch/x86/kvm/vmx/vmx.h        | 37 +++++++++++++++++++++++++----------
>   arch/x86/kvm/x86.c            | 13 ++++++------
>   8 files changed, 101 insertions(+), 48 deletions(-)
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 09/15] KVM: SVM: Remove the unneeded code to mark available for CR3
  2021-11-08 12:44 ` [PATCH 09/15] KVM: SVM: Remove the unneeded code to mark available for CR3 Lai Jiangshan
@ 2021-11-18 15:17   ` Paolo Bonzini
  0 siblings, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2021-11-18 15:17 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, kvm
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

On 11/8/21 13:44, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> VCPU_EXREG_CR3 is never cleared from vcpu->arch.regs_avail in SVM so
> marking available for CR3 is mere an NOP, just remove it.
> 
> And it is not required to mark it dirty since VCPU_EXREG_CR3 is neither
> never cleared from vcpu->arch.regs_dirty and SVM doesn't use the dirty
> information of VCPU_EXREG_CR3.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>   arch/x86/kvm/svm/nested.c | 1 -
>   1 file changed, 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 13a58722e097..2d88ff584d61 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -444,7 +444,6 @@ static int nested_svm_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
>   		kvm_mmu_new_pgd(vcpu, cr3);
>   
>   	vcpu->arch.cr3 = cr3;
> -	kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
>   
>   	/* Re-initialize the MMU, e.g. to pick up CR4 MMU role changes. */
>   	kvm_init_mmu(vcpu);
> 

These two patches can be merged, I think, with a commit message like

     VCPU_EXREG_CR3 is never cleared from vcpu->arch.regs_avail or
     vcpu->arch.regs_dirty in SVM; therefore, marking CR3 as available is
     merely a NOP, and testing it will likewise always succeed.

Paolo


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 04/15] KVM: VMX: Add and use X86_CR4_TLB_BITS when !enable_ept
  2021-11-08 12:43 ` [PATCH 04/15] KVM: VMX: Add and use X86_CR4_TLB_BITS when !enable_ept Lai Jiangshan
@ 2021-11-18 15:18   ` Paolo Bonzini
  0 siblings, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2021-11-18 15:18 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, kvm
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

On 11/8/21 13:43, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> In set_cr4_guest_host_mask(), X86_CR4_PGE is set to be intercepted when
> !enable_ept just because X86_CR4_PGE is the only bit that is
> responsible for flushing TLB but listed in KVM_POSSIBLE_CR4_GUEST_BITS.
> 
> It is clearer and self-documented to use X86_CR4_TLB_BITS instead.

Very good idea, but I'd go with a slightly clearer X86_CR4_TLBFLUSH_BITS.

Paolo

> No functionality changed.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>   arch/x86/kvm/kvm_cache_regs.h | 2 ++
>   arch/x86/kvm/vmx/vmx.c        | 2 +-
>   2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
> index 90e1ffdc05b7..8fe036efa654 100644
> --- a/arch/x86/kvm/kvm_cache_regs.h
> +++ b/arch/x86/kvm/kvm_cache_regs.h
> @@ -9,6 +9,8 @@
>   	(X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR  \
>   	 | X86_CR4_OSXMMEXCPT | X86_CR4_PGE | X86_CR4_TSD | X86_CR4_FSGSBASE)
>   
> +#define X86_CR4_TLB_BITS (X86_CR4_PGE | X86_CR4_PCIDE | X86_CR4_PAE | X86_CR4_SMEP)
> +
>   #define BUILD_KVM_GPR_ACCESSORS(lname, uname)				      \
>   static __always_inline unsigned long kvm_##lname##_read(struct kvm_vcpu *vcpu)\
>   {									      \
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 79e5df5fbb32..1795702dc6de 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4107,7 +4107,7 @@ void set_cr4_guest_host_mask(struct vcpu_vmx *vmx)
>   	vcpu->arch.cr4_guest_owned_bits = KVM_POSSIBLE_CR4_GUEST_BITS &
>   					  ~vcpu->arch.cr4_guest_rsvd_bits;
>   	if (!enable_ept)
> -		vcpu->arch.cr4_guest_owned_bits &= ~X86_CR4_PGE;
> +		vcpu->arch.cr4_guest_owned_bits &= ~X86_CR4_TLB_BITS;
>   	if (is_guest_mode(&vmx->vcpu))
>   		vcpu->arch.cr4_guest_owned_bits &=
>   			~get_vmcs12(vcpu)->cr4_guest_host_mask;
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 12/15] KVM: VMX: Reset the bits that are meaningful to be reset in vmx_register_cache_reset()
  2021-11-08 12:44 ` [PATCH 12/15] KVM: VMX: Reset the bits that are meaningful to be reset in vmx_register_cache_reset() Lai Jiangshan
@ 2021-11-18 15:25   ` Paolo Bonzini
  0 siblings, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2021-11-18 15:25 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, kvm
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

On 11/8/21 13:44, Lai Jiangshan wrote:
> +/*
> + * VMX_REGS_AVAIL_SET - The set of registers that will be updated in cache on
> + *			demand.  Other registers not listed here are synced to
> + *			the cache immediately after VM-Exit.
> + *
> + * VMX_REGS_DIRTY_SET - The set of registers that might be outdated in
> + *			architecture. Other registers not listed here are synced
> + *			to the architecture immediately when modifying.

Slightly more expressive:

/*
  * VMX_REGS_LAZY_LOAD_SET - The set of registers that will be updated in the
  * cache on demand.  Other registers not listed here are synced to
  * the cache immediately after VM-Exit.
  */
...

/*
  * VMX_REGS_LAZY_UPDATE_SET - The set of registers that might be outdated in
  * VMCS. Other registers not listed here are synced to the VMCS
  * immediately when modified.
  */
...

	BUILD_BUG_ON(VMX_REGS_LAZY_UPDATE_SET & ~VMX_REGS_LAZY_LOAD_SET);
         vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
         vcpu->arch.regs_dirty &= ~VMX_REGS_LAZY_UPDATE_SET;

That is lazily loaded registers become unavailable, and lazily updated registers
become unavailable and dirty.

Paolo


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 13/15] KVM: SVM: Add and use svm_register_cache_reset()
  2021-11-08 12:44 ` [PATCH 13/15] KVM: SVM: Add and use svm_register_cache_reset() Lai Jiangshan
@ 2021-11-18 15:37   ` Paolo Bonzini
  2021-11-18 16:28     ` Lai Jiangshan
  0 siblings, 1 reply; 48+ messages in thread
From: Paolo Bonzini @ 2021-11-18 15:37 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, kvm
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

On 11/8/21 13:44, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> It resets all the appropriate bits like vmx.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>   arch/x86/kvm/svm/svm.c |  3 +--
>   arch/x86/kvm/svm/svm.h | 26 ++++++++++++++++++++++++++
>   2 files changed, 27 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index b7da66935e72..ba9cfddd2875 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3969,8 +3969,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
>   
>   	svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
>   	vmcb_mark_all_clean(svm->vmcb);
> -
> -	kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);
> +	svm_register_cache_reset(vcpu);
>   
>   	/*
>   	 * We need to handle MC intercepts here before the vcpu has a chance to
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 0d7bbe548ac3..1cf5d5e2d0cd 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -274,6 +274,32 @@ static inline bool vmcb_is_dirty(struct vmcb *vmcb, int bit)
>           return !test_bit(bit, (unsigned long *)&vmcb->control.clean);
>   }
>   
> +static inline void svm_register_cache_reset(struct kvm_vcpu *vcpu)
> +{
> +/*
> + * SVM_REGS_AVAIL_SET - The set of registers that will be updated in cache on
> + *			demand.  Other registers not listed here are synced to
> + *			the cache immediately after VM-Exit.
> + *
> + * SVM_REGS_DIRTY_SET - The set of registers that might be outdated in
> + *			architecture. Other registers not listed here are synced
> + *			to the architecture immediately when modifying.
> + *
> + *			Special case: VCPU_EXREG_CR3 should be in this set due
> + *			to the fact.  But KVM_REQ_LOAD_MMU_PGD is always
> + *			requested when the cache vcpu->arch.cr3 is changed and
> + *			svm_load_mmu_pgd() always syncs the new CR3 value into
> + *			the architecture.  So the dirty information of
> + *			VCPU_EXREG_CR3 is not used which means VCPU_EXREG_CR3
> + *			isn't required to be put in this set.
> + */
> +#define SVM_REGS_AVAIL_SET	(1 << VCPU_EXREG_PDPTR)
> +#define SVM_REGS_DIRTY_SET	(0)
> +
> +	vcpu->arch.regs_avail &= ~SVM_REGS_AVAIL_SET;
> +	vcpu->arch.regs_dirty &= ~SVM_REGS_DIRTY_SET;
> +}

I think touching regs_dirty is confusing here, so I'd go with this:

         vcpu->arch.regs_avail &= ~SVM_REGS_LAZY_LOAD_SET;

         /*
          * SVM does not use vcpu->arch.regs_dirty.  The only register that
          * might be out of date in the VMCB is CR3, but KVM_REQ_LOAD_MMU_PGD
          * is always requested when the cache vcpu->arch.cr3 is changed and
          * svm_load_mmu_pgd() always syncs the new CR3 value into the VMCB.
          */

(VMX instead needs VCPU_EXREG_CR3 mostly because it does not want to
update it unconditionally on exit).

Paolo


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 15/15] KVM: nVMX: Always write vmcs.GUEST_CR3 during nested VM-Exit
  2021-11-08 12:44 ` [PATCH 15/15] KVM: nVMX: Always write vmcs.GUEST_CR3 during nested VM-Exit Lai Jiangshan
@ 2021-11-18 15:52   ` Paolo Bonzini
  0 siblings, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2021-11-18 15:52 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, kvm
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

On 11/8/21 13:44, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> For VM-Enter, vmcs.GUEST_CR3 and vcpu->arch.cr3 are synced and it is
> better to mark VCPU_EXREG_CR3 available rather than dirty to reduce a
> redundant vmwrite(GUEST_CR3) in vmx_load_mmu_pgd().
> 
> But nested_vmx_load_cr3() is also served for VM-Exit which doesn't
> set vmcs.GUEST_CR3.
> 
> This patch moves writing to vmcs.GUEST_CR3 into nested_vmx_load_cr3()
> for both nested VM-Eneter/Exit and use kvm_register_mark_available().
> 
> This patch doesn't cause any extra writing to vmcs.GUEST_CR3 and if
> userspace is modifying CR3 with KVM_SET_SREGS later, the dirty info
> for VCPU_EXREG_CR3 would be set for next writing to vmcs.GUEST_CR3
> and no update will be lost.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>   arch/x86/kvm/vmx/nested.c | 32 +++++++++++++++++++++-----------
>   1 file changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index ee5a68c2ea3a..4ddd4b1b0503 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -1133,8 +1133,28 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
>   	if (!nested_ept)
>   		kvm_mmu_new_pgd(vcpu, cr3);
>   
> +	/*
> +	 * Immediately write vmcs.GUEST_CR3 when changing vcpu->arch.cr3.
> +	 *
> +	 * VCPU_EXREG_CR3 is marked available rather than dirty because
> +	 * vcpu->arch.cr3 and vmcs.GUEST_CR3 are synced when enable_ept and
> +	 * vmcs.GUEST_CR3 is irrelevant to vcpu->arch.cr3 when !enable_ept.
> +	 *
> +	 * For VM-Enter case, it will be propagated to vmcs12 on nested
> +	 * VM-Exit, which can occur without actually running L2 and thus
> +	 * without hitting vmx_load_mmu_pgd(), e.g. if L1 is entering L2 with
> +	 * vmcs12.GUEST_ACTIVITYSTATE=HLT, in which case KVM will intercept
> +	 * the transition to HLT instead of running L2.
> +	 *
> +	 * For VM-Exit case, it is likely that vmcs.GUEST_CR3 == cr3 here, but
> +	 * L1 may set HOST_CR3 to a value other than its CR3 before VM-Entry,
> +	 * so we just update it unconditionally.
> +	 */
> +	if (enable_ept)
> +		vmcs_writel(GUEST_CR3, cr3);
> +
>   	vcpu->arch.cr3 = cr3;
> -	kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
> +	kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
>   
>   	/* Re-initialize the MMU, e.g. to pick up CR4 MMU role changes. */
>   	kvm_init_mmu(vcpu);
> @@ -2600,16 +2620,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>   				from_vmentry, entry_failure_code))
>   		return -EINVAL;
>   
> -	/*
> -	 * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
> -	 * on nested VM-Exit, which can occur without actually running L2 and
> -	 * thus without hitting vmx_load_mmu_pgd(), e.g. if L1 is entering L2 with
> -	 * vmcs12.GUEST_ACTIVITYSTATE=HLT, in which case KVM will intercept the
> -	 * transition to HLT instead of running L2.
> -	 */
> -	if (enable_ept)
> -		vmcs_writel(GUEST_CR3, vmcs12->guest_cr3);
> -
>   	/* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */
>   	if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) &&
>   	    is_pae_paging(vcpu)) {
> 

I have to think more about this one and patch 17.  I queued the rest.

Paolo


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 13/15] KVM: SVM: Add and use svm_register_cache_reset()
  2021-11-18 15:37   ` Paolo Bonzini
@ 2021-11-18 16:28     ` Lai Jiangshan
  2021-11-18 17:54       ` Paolo Bonzini
  0 siblings, 1 reply; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-18 16:28 UTC (permalink / raw)
  To: Paolo Bonzini, Lai Jiangshan, linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin



On 2021/11/18 23:37, Paolo Bonzini wrote:
> On 11/8/21 13:44, Lai Jiangshan wrote:
>> From: Lai Jiangshan <laijs@linux.alibaba.com>
>>
>> It resets all the appropriate bits like vmx.
>>
>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>> ---
>>   arch/x86/kvm/svm/svm.c |  3 +--
>>   arch/x86/kvm/svm/svm.h | 26 ++++++++++++++++++++++++++
>>   2 files changed, 27 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index b7da66935e72..ba9cfddd2875 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -3969,8 +3969,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
>>       svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
>>       vmcb_mark_all_clean(svm->vmcb);
>> -
>> -    kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);
>> +    svm_register_cache_reset(vcpu);
>>       /*
>>        * We need to handle MC intercepts here before the vcpu has a chance to
>> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
>> index 0d7bbe548ac3..1cf5d5e2d0cd 100644
>> --- a/arch/x86/kvm/svm/svm.h
>> +++ b/arch/x86/kvm/svm/svm.h
>> @@ -274,6 +274,32 @@ static inline bool vmcb_is_dirty(struct vmcb *vmcb, int bit)
>>           return !test_bit(bit, (unsigned long *)&vmcb->control.clean);
>>   }
>> +static inline void svm_register_cache_reset(struct kvm_vcpu *vcpu)
>> +{
>> +/*
>> + * SVM_REGS_AVAIL_SET - The set of registers that will be updated in cache on
>> + *            demand.  Other registers not listed here are synced to
>> + *            the cache immediately after VM-Exit.
>> + *
>> + * SVM_REGS_DIRTY_SET - The set of registers that might be outdated in
>> + *            architecture. Other registers not listed here are synced
>> + *            to the architecture immediately when modifying.
>> + *
>> + *            Special case: VCPU_EXREG_CR3 should be in this set due
>> + *            to the fact.  But KVM_REQ_LOAD_MMU_PGD is always
>> + *            requested when the cache vcpu->arch.cr3 is changed and
>> + *            svm_load_mmu_pgd() always syncs the new CR3 value into
>> + *            the architecture.  So the dirty information of
>> + *            VCPU_EXREG_CR3 is not used which means VCPU_EXREG_CR3
>> + *            isn't required to be put in this set.
>> + */
>> +#define SVM_REGS_AVAIL_SET    (1 << VCPU_EXREG_PDPTR)
>> +#define SVM_REGS_DIRTY_SET    (0)
>> +
>> +    vcpu->arch.regs_avail &= ~SVM_REGS_AVAIL_SET;
>> +    vcpu->arch.regs_dirty &= ~SVM_REGS_DIRTY_SET;
>> +}
> 
> I think touching regs_dirty is confusing here, so I'd go with this:

It makes the code the same as vmx by clearing all the SVM_REGS_DIRTY_SET
bits.  And the compiler will remove this line of code since
SVM_REGS_DIRTY_SET is (0), and regs_dirty is not touched.

Using VMX_REGS_DIRTY_SET and SVM_REGS_DIRTY_SET and making the code
similar is my intent for patch12,13.  If it causes confusing, I would
like to make a second thought.  SVM_REGS_DIRTY_SET does be special
in svm where VCPU_EXREG_CR3 is in it by definition, but it is not
added into SVM_REGS_DIRTY_SET in the patch just for optimization to allow
the compiler optimizes the line of code out.

I'm Ok to not queue patch12,13,14.

> 
>          vcpu->arch.regs_avail &= ~SVM_REGS_LAZY_LOAD_SET;
> 
>          /*
>           * SVM does not use vcpu->arch.regs_dirty.  The only register that
>           * might be out of date in the VMCB is CR3, but KVM_REQ_LOAD_MMU_PGD
>           * is always requested when the cache vcpu->arch.cr3 is changed and
>           * svm_load_mmu_pgd() always syncs the new CR3 value into the VMCB.
>           */
> 
> (VMX instead needs VCPU_EXREG_CR3 mostly because it does not want to
> update it unconditionally on exit).
> 
> Paolo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 13/15] KVM: SVM: Add and use svm_register_cache_reset()
  2021-11-18 16:28     ` Lai Jiangshan
@ 2021-11-18 17:54       ` Paolo Bonzini
  2021-11-19  0:49         ` Lai Jiangshan
  0 siblings, 1 reply; 48+ messages in thread
From: Paolo Bonzini @ 2021-11-18 17:54 UTC (permalink / raw)
  To: Lai Jiangshan, Lai Jiangshan, linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin

On 11/18/21 17:28, Lai Jiangshan wrote:
> Using VMX_REGS_DIRTY_SET and SVM_REGS_DIRTY_SET and making the code
> similar is my intent for patch12,13.  If it causes confusing, I would
> like to make a second thought.  SVM_REGS_DIRTY_SET does be special
> in svm where VCPU_EXREG_CR3 is in it by definition, but it is not
> added into SVM_REGS_DIRTY_SET in the patch just for optimization to allow
> the compiler optimizes the line of code out.

I think this is where we disagree.  In my opinion it is enough to
document that CR3 _can_ be out of date, but it doesn't have to be marked
dirty because its dirty bit is effectively KVM_REQ_LOAD_MMU_PGD.

For VMX, it is important to clear VCPU_EXREG_CR3 because the combination
"avail=0, dirty=1" is nonsensical:

	av d
	0  0    in VMCS
	0  1    *INVALID*
	1  0	in vcpu->arch
	1  1	in vcpu->arch, needs store

But on SVM, VCPU_EXREG_CR3 is always available.

Thinking more about it, it makes more sense for VMX to reset _all_
bits of dirty to 0, just like it was before your change, but doing
so even earlier in vmx_vcpu_run.

I appreciate that VMX_REGS_LAZY_UPDATE_SET is useful for documentation,
but it's also important that the values in avail/dirty make sense as
a pair.

So here is what I would do:

diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 6e6d0d01f18d..ac3d3bd662f4 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -43,6 +43,13 @@ BUILD_KVM_GPR_ACCESSORS(r14, R14)
  BUILD_KVM_GPR_ACCESSORS(r15, R15)
  #endif
  
+/*
+ * avail  dirty
+ * 0	  0	  register in VMCS/VMCB
+ * 0	  1	  *INVALID*
+ * 1	  0	  register in vcpu->arch
+ * 1	  1	  register in vcpu->arch, needs to be stored back
+ */
  static inline bool kvm_register_is_available(struct kvm_vcpu *vcpu,
  					     enum kvm_reg reg)
  {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6fce61fc98e3..72ae67e214b5 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6635,6 +6635,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
  		vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
  	if (kvm_register_is_dirty(vcpu, VCPU_REGS_RIP))
  		vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
+	vcpu->arch.regs_dirty = 0;
  
  	cr3 = __get_current_cr3_fast();
  	if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) {
@@ -6729,7 +6730,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
  	loadsegment(es, __USER_DS);
  #endif
  
-	vmx_register_cache_reset(vcpu);
+	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
  
  	pt_guest_exit(vmx);
  
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 4df2ac24ffc1..f978699480e3 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -473,19 +473,21 @@ BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL)
  BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL)
  BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL)
  
-static inline void vmx_register_cache_reset(struct kvm_vcpu *vcpu)
-{
-	vcpu->arch.regs_avail = ~((1 << VCPU_REGS_RIP) | (1 << VCPU_REGS_RSP)
-				  | (1 << VCPU_EXREG_RFLAGS)
-				  | (1 << VCPU_EXREG_PDPTR)
-				  | (1 << VCPU_EXREG_SEGMENTS)
-				  | (1 << VCPU_EXREG_CR0)
-				  | (1 << VCPU_EXREG_CR3)
-				  | (1 << VCPU_EXREG_CR4)
-				  | (1 << VCPU_EXREG_EXIT_INFO_1)
-				  | (1 << VCPU_EXREG_EXIT_INFO_2));
-	vcpu->arch.regs_dirty = 0;
-}
+/*
+ * VMX_REGS_LAZY_LOAD_SET - The set of registers that will be updated in the
+ * cache on demand.  Other registers not listed here are synced to
+ * the cache immediately after VM-Exit.
+ */
+#define VMX_REGS_LAZY_LOAD_SET	((1 << VCPU_REGS_RIP) |         \
+				(1 << VCPU_REGS_RSP) |          \
+				(1 << VCPU_EXREG_RFLAGS) |      \
+				(1 << VCPU_EXREG_PDPTR) |       \
+				(1 << VCPU_EXREG_SEGMENTS) |    \
+				(1 << VCPU_EXREG_CR0) |         \
+				(1 << VCPU_EXREG_CR3) |         \
+				(1 << VCPU_EXREG_CR4) |         \
+				(1 << VCPU_EXREG_EXIT_INFO_1) | \
+				(1 << VCPU_EXREG_EXIT_INFO_2))
  
  static inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
  {

and likewise for SVM:

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index eb2a2609cae8..4b22aa7d55d0 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3944,6 +3944,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
  		vcpu->arch.regs[VCPU_REGS_RSP] = svm->vmcb->save.rsp;
  		vcpu->arch.regs[VCPU_REGS_RIP] = svm->vmcb->save.rip;
  	}
+	vcpu->arch.regs_dirty = 0;
  
  	if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
  		kvm_before_interrupt(vcpu);
@@ -3978,7 +3978,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
  		vcpu->arch.apf.host_apf_flags =
  			kvm_read_and_reset_apf_flags();
  
-	kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);
+	vcpu->arch.regs_avail &= ~SVM_REGS_LAZY_LOAD_SET;
  
  	/*
  	 * We need to handle MC intercepts here before the vcpu has a chance to
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 32769d227860..b3c3c3098216 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -321,6 +321,16 @@ static inline bool vmcb_is_dirty(struct vmcb *vmcb, int bit)
          return !test_bit(bit, (unsigned long *)&vmcb->control.clean);
  }
  
+/*
+ * Only the PDPTRs are loaded on demand into the shadow MMU.  All other
+ * fields are synchronized in handle_exit, because accessing the VMCB is cheap.
+ *
+ * CR3 might be out of date in the VMCB but it is not marked dirty; instead,
+ * KVM_REQ_LOAD_MMU_PGD is always requested when the cached vcpu->arch.cr3
+ * is changed.  svm_load_mmu_pgd() then syncs the new CR3 value into the VMCB.
+ */
+#define SVM_REGS_LAZY_LOAD_SET	(1 << VCPU_EXREG_PDPTR)
+
  static inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu)
  {
  	return container_of(vcpu, struct vcpu_svm, vcpu);



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 13/15] KVM: SVM: Add and use svm_register_cache_reset()
  2021-11-18 17:54       ` Paolo Bonzini
@ 2021-11-19  0:49         ` Lai Jiangshan
  0 siblings, 0 replies; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-19  0:49 UTC (permalink / raw)
  To: Paolo Bonzini, Lai Jiangshan, linux-kernel, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin



On 2021/11/19 01:54, Paolo Bonzini wrote:
> On 11/18/21 17:28, Lai Jiangshan wrote:
>> Using VMX_REGS_DIRTY_SET and SVM_REGS_DIRTY_SET and making the code
>> similar is my intent for patch12,13.  If it causes confusing, I would
>> like to make a second thought.  SVM_REGS_DIRTY_SET does be special
>> in svm where VCPU_EXREG_CR3 is in it by definition, but it is not
>> added into SVM_REGS_DIRTY_SET in the patch just for optimization to allow
>> the compiler optimizes the line of code out.
> 
> I think this is where we disagree.  In my opinion it is enough to
> document that CR3 _can_ be out of date, but it doesn't have to be marked
> dirty because its dirty bit is effectively KVM_REQ_LOAD_MMU_PGD.
> 
> For VMX, it is important to clear VCPU_EXREG_CR3 because the combination
> "avail=0, dirty=1" is nonsensical:
> 
>      av d
>      0  0    in VMCS
>      0  1    *INVALID*
>      1  0    in vcpu->arch
>      1  1    in vcpu->arch, needs store
> 
> But on SVM, VCPU_EXREG_CR3 is always available.
> 
> Thinking more about it, it makes more sense for VMX to reset _all_
> bits of dirty to 0, just like it was before your change, but doing
> so even earlier in vmx_vcpu_run.
> 
> I appreciate that VMX_REGS_LAZY_UPDATE_SET is useful for documentation,
> but it's also important that the values in avail/dirty make sense as
> a pair.
> 
> So here is what I would do:

Reviewed-by: Lai Jiangshan <laijs@linux.alibaba.com>


> 
> diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
> index 6e6d0d01f18d..ac3d3bd662f4 100644
> --- a/arch/x86/kvm/kvm_cache_regs.h
> +++ b/arch/x86/kvm/kvm_cache_regs.h
> @@ -43,6 +43,13 @@ BUILD_KVM_GPR_ACCESSORS(r14, R14)
>   BUILD_KVM_GPR_ACCESSORS(r15, R15)
>   #endif
> 
> +/*
> + * avail  dirty
> + * 0      0      register in VMCS/VMCB
> + * 0      1      *INVALID*
> + * 1      0      register in vcpu->arch
> + * 1      1      register in vcpu->arch, needs to be stored back
> + */
>   static inline bool kvm_register_is_available(struct kvm_vcpu *vcpu,
>                            enum kvm_reg reg)
>   {
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 6fce61fc98e3..72ae67e214b5 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6635,6 +6635,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>           vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
>       if (kvm_register_is_dirty(vcpu, VCPU_REGS_RIP))
>           vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
> +    vcpu->arch.regs_dirty = 0;
> 
>       cr3 = __get_current_cr3_fast();
>       if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) {
> @@ -6729,7 +6730,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>       loadsegment(es, __USER_DS);
>   #endif
> 
> -    vmx_register_cache_reset(vcpu);
> +    vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> 
>       pt_guest_exit(vmx);
> 
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 4df2ac24ffc1..f978699480e3 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -473,19 +473,21 @@ BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL)
>   BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL)
>   BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL)
> 
> -static inline void vmx_register_cache_reset(struct kvm_vcpu *vcpu)
> -{
> -    vcpu->arch.regs_avail = ~((1 << VCPU_REGS_RIP) | (1 << VCPU_REGS_RSP)
> -                  | (1 << VCPU_EXREG_RFLAGS)
> -                  | (1 << VCPU_EXREG_PDPTR)
> -                  | (1 << VCPU_EXREG_SEGMENTS)
> -                  | (1 << VCPU_EXREG_CR0)
> -                  | (1 << VCPU_EXREG_CR3)
> -                  | (1 << VCPU_EXREG_CR4)
> -                  | (1 << VCPU_EXREG_EXIT_INFO_1)
> -                  | (1 << VCPU_EXREG_EXIT_INFO_2));
> -    vcpu->arch.regs_dirty = 0;
> -}
> +/*
> + * VMX_REGS_LAZY_LOAD_SET - The set of registers that will be updated in the
> + * cache on demand.  Other registers not listed here are synced to
> + * the cache immediately after VM-Exit.
> + */
> +#define VMX_REGS_LAZY_LOAD_SET    ((1 << VCPU_REGS_RIP) |         \
> +                (1 << VCPU_REGS_RSP) |          \
> +                (1 << VCPU_EXREG_RFLAGS) |      \
> +                (1 << VCPU_EXREG_PDPTR) |       \
> +                (1 << VCPU_EXREG_SEGMENTS) |    \
> +                (1 << VCPU_EXREG_CR0) |         \
> +                (1 << VCPU_EXREG_CR3) |         \
> +                (1 << VCPU_EXREG_CR4) |         \
> +                (1 << VCPU_EXREG_EXIT_INFO_1) | \
> +                (1 << VCPU_EXREG_EXIT_INFO_2))
> 
>   static inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
>   {
> 
> and likewise for SVM:
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index eb2a2609cae8..4b22aa7d55d0 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3944,6 +3944,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
>           vcpu->arch.regs[VCPU_REGS_RSP] = svm->vmcb->save.rsp;
>           vcpu->arch.regs[VCPU_REGS_RIP] = svm->vmcb->save.rip;
>       }
> +    vcpu->arch.regs_dirty = 0;
> 
>       if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
>           kvm_before_interrupt(vcpu);
> @@ -3978,7 +3978,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
>           vcpu->arch.apf.host_apf_flags =
>               kvm_read_and_reset_apf_flags();
> 
> -    kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);
> +    vcpu->arch.regs_avail &= ~SVM_REGS_LAZY_LOAD_SET;
> 
>       /*
>        * We need to handle MC intercepts here before the vcpu has a chance to
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 32769d227860..b3c3c3098216 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -321,6 +321,16 @@ static inline bool vmcb_is_dirty(struct vmcb *vmcb, int bit)
>           return !test_bit(bit, (unsigned long *)&vmcb->control.clean);
>   }
> 
> +/*
> + * Only the PDPTRs are loaded on demand into the shadow MMU.  All other
> + * fields are synchronized in handle_exit, because accessing the VMCB is cheap.
> + *
> + * CR3 might be out of date in the VMCB but it is not marked dirty; instead,
> + * KVM_REQ_LOAD_MMU_PGD is always requested when the cached vcpu->arch.cr3
> + * is changed.  svm_load_mmu_pgd() then syncs the new CR3 value into the VMCB.
> + */
> +#define SVM_REGS_LAZY_LOAD_SET    (1 << VCPU_EXREG_PDPTR)
> +
>   static inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu)
>   {
>       return container_of(vcpu, struct vcpu_svm, vcpu);
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 17/15] KVM: X86: Ensure pae_root to be reconstructed for shadow paging if the guest PDPTEs is changed
  2021-11-11 14:46 ` [PATCH 17/15] KVM: X86: Ensure pae_root to be reconstructed for shadow paging if the guest PDPTEs " Lai Jiangshan
@ 2021-11-23  9:34   ` Lai Jiangshan
  2021-12-08  0:15   ` Sean Christopherson
  1 sibling, 0 replies; 48+ messages in thread
From: Lai Jiangshan @ 2021-11-23  9:34 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel
  Cc: kvm, Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin


Hello, Paolo

any thought/concern about this one?

Thanks
Lai


On 2021/11/11 22:46, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> For shadow paging, the pae_root needs to be reconstructed before the
> coming VMENTER if the guest PDPTEs is changed.
> 
> But not all paths that call load_pdptrs() will cause the pae_root to be
> reconstructed. Normally, kvm_mmu_reset_context() and kvm_mmu_free_roots()
> are used to launch later reconstruction.
> 
> The commit d81135a57aa6("KVM: x86: do not reset mmu if CR0.CD and
> CR0.NW are changed") skips kvm_mmu_reset_context() after load_pdptrs()
> when changing CR0.CD and CR0.NW.
> 
> The commit 21823fbda552("KVM: x86: Invalidate all PGDs for the current
> PCID on MOV CR3 w/ flush") skips kvm_mmu_free_roots() after
> load_pdptrs() when rewriting the CR3 with the same value.
> 
> The commit a91a7c709600("KVM: X86: Don't reset mmu context when
> toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after
> load_pdptrs() when changing CR4.PGE.
> 
> Normally, the guest doesn't change the PDPTEs before doing only the
> above operation without touching other bits that can force pae_root to
> be reconstructed.  Guests like linux would keep the PDPTEs unchaged
> for every instance of pagetable.
> 
> Fixes: d81135a57aa6("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
> Fixes: 21823fbda552("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush")
> Fixes: a91a7c709600("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE")
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>   arch/x86/kvm/x86.c | 10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0176eaa86a35..cfba337e46ab 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -832,8 +832,14 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
>   	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
>   		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
>   		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> -		/* Ensure the dirty PDPTEs to be loaded. */
> -		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> +		/*
> +		 * Ensure the dirty PDPTEs to be loaded for VMX with EPT
> +		 * enabled or pae_root to be reconstructed for shadow paging.
> +		 */
> +		if (tdp_enabled)
> +			kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> +		else
> +			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
>   	}
>   	vcpu->arch.pdptrs_from_userspace = false;
>   
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 16/15] KVM: X86: Update mmu->pdptrs only when it is changed
  2021-11-11 14:45 ` [PATCH 16/15] KVM: X86: Update mmu->pdptrs only when it is changed Lai Jiangshan
@ 2021-12-07 23:43   ` Sean Christopherson
  2021-12-08  3:29     ` Lai Jiangshan
  2021-12-08  9:09     ` Paolo Bonzini
  0 siblings, 2 replies; 48+ messages in thread
From: Sean Christopherson @ 2021-12-07 23:43 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, kvm, Lai Jiangshan, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin

On Thu, Nov 11, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> It is unchanged in most cases.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/kvm/x86.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6ca19cac4aff..0176eaa86a35 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -828,10 +828,13 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
>  		}
>  	}
>  
> -	memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
> -	kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> -	/* Ensure the dirty PDPTEs to be loaded. */
> -	kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> +	kvm_register_mark_available(vcpu, VCPU_EXREG_PDPTR);
> +	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
> +		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
> +		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> +		/* Ensure the dirty PDPTEs to be loaded. */
> +		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> +	}

Can this be unqueued until there's sufficient justification that (a) this is
correct and (b) actually provides a meaningful performance optimization?  There
have been far too many PDPTR caching bugs to make this change without an analysis
of why it's safe, e.g. what guarantees the that PDPTRs in the VMCS are sync'd
with mmu->pdptrs?  I'm not saying they aren't, I just want the changelog to prove
that they are.

The next patch does add a fairly heavy unload of the current root for !TDP, but
that's a bug fix and should be ordered before any optimizations anyways.

>  	vcpu->arch.pdptrs_from_userspace = false;
>  
>  	return 1;
> -- 
> 2.19.1.6.gb485710b
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 17/15] KVM: X86: Ensure pae_root to be reconstructed for shadow paging if the guest PDPTEs is changed
  2021-11-11 14:46 ` [PATCH 17/15] KVM: X86: Ensure pae_root to be reconstructed for shadow paging if the guest PDPTEs " Lai Jiangshan
  2021-11-23  9:34   ` Lai Jiangshan
@ 2021-12-08  0:15   ` Sean Christopherson
  2021-12-08  4:00     ` Lai Jiangshan
  2021-12-09 22:46     ` Paolo Bonzini
  1 sibling, 2 replies; 48+ messages in thread
From: Sean Christopherson @ 2021-12-08  0:15 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, kvm, Lai Jiangshan, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Xiao Guangrong

On Thu, Nov 11, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> For shadow paging, the pae_root needs to be reconstructed before the
> coming VMENTER if the guest PDPTEs is changed.
> 
> But not all paths that call load_pdptrs() will cause the pae_root to be
> reconstructed. Normally, kvm_mmu_reset_context() and kvm_mmu_free_roots()
> are used to launch later reconstruction.
> 
> The commit d81135a57aa6("KVM: x86: do not reset mmu if CR0.CD and
> CR0.NW are changed") skips kvm_mmu_reset_context() after load_pdptrs()
> when changing CR0.CD and CR0.NW.
> 
> The commit 21823fbda552("KVM: x86: Invalidate all PGDs for the current
> PCID on MOV CR3 w/ flush") skips kvm_mmu_free_roots() after
> load_pdptrs() when rewriting the CR3 with the same value.

This isn't accurate, prior to that commit KVM wasn't guaranteed to do
kvm_mmu_free_roots() if it got a hit on the current CR3 or if a previous CR3 in
the cache matched the new CR3 (the "cache" has done some odd things in the past).

So I think this particular flavor would be:

  Fixes: 7c390d350f8b ("kvm: x86: Add fast CR3 switch code path")

> The commit a91a7c709600("KVM: X86: Don't reset mmu context when
> toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after
> load_pdptrs() when changing CR4.PGE.
> 
> Normally, the guest doesn't change the PDPTEs before doing only the
> above operation without touching other bits that can force pae_root to
> be reconstructed.  Guests like linux would keep the PDPTEs unchaged
> for every instance of pagetable.
> 
> Fixes: d81135a57aa6("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
> Fixes: 21823fbda552("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush")
> Fixes: a91a7c709600("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE")
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/kvm/x86.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0176eaa86a35..cfba337e46ab 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -832,8 +832,14 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
>  	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
>  		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
>  		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> -		/* Ensure the dirty PDPTEs to be loaded. */
> -		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> +		/*
> +		 * Ensure the dirty PDPTEs to be loaded for VMX with EPT
> +		 * enabled or pae_root to be reconstructed for shadow paging.
> +		 */
> +		if (tdp_enabled)
> +			kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> +		else
> +			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);

Shouldn't matter since it's legacy shadow paging, but @mmu should be used instead
of vcpu->arch.mmuvcpu->arch.mmu.

To avoid a dependency on the previous patch, I think it makes sense to have this be:

	if (!tdp_enabled && memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs)))
		kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOT_CURRENT);

before the memcpy().

Then we can decide independently if skipping the KVM_REQ_LOAD_MMU_PGD if the
PDPTRs are unchanged with respect to the MMU is safe.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 16/15] KVM: X86: Update mmu->pdptrs only when it is changed
  2021-12-07 23:43   ` Sean Christopherson
@ 2021-12-08  3:29     ` Lai Jiangshan
  2021-12-08  9:09     ` Paolo Bonzini
  1 sibling, 0 replies; 48+ messages in thread
From: Lai Jiangshan @ 2021-12-08  3:29 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: linux-kernel, kvm, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin



On 2021/12/8 07:43, Sean Christopherson wrote:
> On Thu, Nov 11, 2021, Lai Jiangshan wrote:
>> From: Lai Jiangshan <laijs@linux.alibaba.com>
>>
>> It is unchanged in most cases.
>>
>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>> ---
>>   arch/x86/kvm/x86.c | 11 +++++++----
>>   1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 6ca19cac4aff..0176eaa86a35 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -828,10 +828,13 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
>>   		}
>>   	}
>>   
>> -	memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
>> -	kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
>> -	/* Ensure the dirty PDPTEs to be loaded. */
>> -	kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
>> +	kvm_register_mark_available(vcpu, VCPU_EXREG_PDPTR);
>> +	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
>> +		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
>> +		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
>> +		/* Ensure the dirty PDPTEs to be loaded. */
>> +		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
>> +	}
> 
> Can this be unqueued until there's sufficient justification that (a) this is
> correct and (b) actually provides a meaningful performance optimization?  There
> have been far too many PDPTR caching bugs to make this change without an analysis
> of why it's safe, e.g. what guarantees the that PDPTRs in the VMCS are sync'd
> with mmu->pdptrs?  I'm not saying they aren't, I just want the changelog to prove
> that they are.

I have zero intent to improve performance for 32bit guests.

For correctness, the patch needs to be revaluated, I agree it to be unqueued.


> 
> The next patch does add a fairly heavy unload of the current root for !TDP, but
> that's a bug fix and should be ordered before any optimizations anyways.

I don't have strong option on the patch ordering and I have a preference which is
already applied to other patchset.

This patch is a preparation patch for next patch.  It has very limited or even
negative value without next patch and it was not in the original 15-patch patchset
until I found the defect fixed by the next patch that I appended both as "16/15"
and "17/15".

But kvm has many small defects for so many years, I has fixed several in this
circle and there are some pending.

And the kvm works for so many years even with these small defects and they only
hit when the guest deliberately do something, and even the guest does these things,
it only reveals that the kvm doesn't fully conform the SDM, it can't hurt host any
degree.

For important fix or for bug that can hurt host, I would definitely make the bug
fix first and other patches are ordered later.

For defect like this, I prefer "clean" patches policy: several cleanup or
preparation patches first to give the code a better basic and then fix the defects.

It is OK for me if fixes like this (normal guest doesn't hit it and it can't hurt
host) go into or doesn't go into the stable tree.

For example, I used my patch ordering policy in "permission_fault() for SMAP"
patchset where the fix went in patch3 which fixes implicit supervisor access when
the CPL < 3.  For next spin, I would be a new preparation patch first to add
"implicit access" in pfec. The fix itself can't go as the first patch.

This is a reply not specified to this patch.  And for this patch and next patch I
will take your suggestion or another possible approach.

Thanks
Lai

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 17/15] KVM: X86: Ensure pae_root to be reconstructed for shadow paging if the guest PDPTEs is changed
  2021-12-08  0:15   ` Sean Christopherson
@ 2021-12-08  4:00     ` Lai Jiangshan
  2021-12-08 15:29       ` Sean Christopherson
  2021-12-09 22:46     ` Paolo Bonzini
  1 sibling, 1 reply; 48+ messages in thread
From: Lai Jiangshan @ 2021-12-08  4:00 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: linux-kernel, kvm, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Xiao Guangrong



On 2021/12/8 08:15, Sean Christopherson wrote:
>>
>> The commit 21823fbda552("KVM: x86: Invalidate all PGDs for the current
>> PCID on MOV CR3 w/ flush") skips kvm_mmu_free_roots() after
>> load_pdptrs() when rewriting the CR3 with the same value.
> 
> This isn't accurate, prior to that commit KVM wasn't guaranteed to do
> kvm_mmu_free_roots() if it got a hit on the current CR3 or if a previous CR3 in
> the cache matched the new CR3 (the "cache" has done some odd things in the past).
> 
> So I think this particular flavor would be:
> 
>    Fixes: 7c390d350f8b ("kvm: x86: Add fast CR3 switch code path")

If guest is 32bit, fast_cr3_switch() always return false, and
kvm_mmu_free_roots() is always called, and no cr3 goes in prev_root.

And from 21823fbda552, fast_cr3_switch() and kvm_mmu_free_roots() are
both skipped when cr3 is unchanged.

> 
>> The commit a91a7c709600("KVM: X86: Don't reset mmu context when
>> toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after
>> load_pdptrs() when changing CR4.PGE.
>>
>> Normally, the guest doesn't change the PDPTEs before doing only the
>> above operation without touching other bits that can force pae_root to
>> be reconstructed.  Guests like linux would keep the PDPTEs unchaged
>> for every instance of pagetable.
>>
>> Fixes: d81135a57aa6("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
>> Fixes: 21823fbda552("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush")
>> Fixes: a91a7c709600("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE")
>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>> ---
>>   arch/x86/kvm/x86.c | 10 ++++++++--
>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 0176eaa86a35..cfba337e46ab 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -832,8 +832,14 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
>>   	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
>>   		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
>>   		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
>> -		/* Ensure the dirty PDPTEs to be loaded. */
>> -		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
>> +		/*
>> +		 * Ensure the dirty PDPTEs to be loaded for VMX with EPT
>> +		 * enabled or pae_root to be reconstructed for shadow paging.
>> +		 */
>> +		if (tdp_enabled)
>> +			kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
>> +		else
>> +			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
> 
> Shouldn't matter since it's legacy shadow paging, but @mmu should be used instead
> of vcpu->arch.mmuvcpu->arch.mmu.

@mmu is the "guest mmu" (vcpu->arch.walk_mmu), which is used to walk
including loading pdptr.

vcpu->arch.mmu is for host constructing mmu for shadowed or tdp mmu
which is used in host side management including kvm_mmu_free_roots().

Even they are the same pointer now for !tdp, the meaning is different.  I prefer
to distinguish them even before kvm_mmu is split different for guest mmu
(vcpu->arch.walk_mmu) and host constructing mmu (vcpu->arch.mmu).

(I once searched all the usage of undistinguished usage of kvm_mmu *mmu, and
found a bug, see "Use vcpu->arch.walk_mmu for kvm_mmu_invlpg()")

I think Paolo is doing the splitting, unless I would take the job because
I have some patches pending depended them.

> 
> To avoid a dependency on the previous patch, I think it makes sense to have this be:
> 
> 	if (!tdp_enabled && memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs)))
> 		kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOT_CURRENT);
> 

Yes, it is a good idea to add this first.

Thanks for review and suggestion.
Lai

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 16/15] KVM: X86: Update mmu->pdptrs only when it is changed
  2021-12-07 23:43   ` Sean Christopherson
  2021-12-08  3:29     ` Lai Jiangshan
@ 2021-12-08  9:09     ` Paolo Bonzini
  2021-12-08  9:34       ` Lai Jiangshan
  1 sibling, 1 reply; 48+ messages in thread
From: Paolo Bonzini @ 2021-12-08  9:09 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: linux-kernel, kvm, Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

On 12/8/21 00:43, Sean Christopherson wrote:
> what guarantees the that PDPTRs in the VMCS are sync'd with
> mmu->pdptrs?  I'm not saying they aren't, I just want the changelog
> to prove that they are.

If they aren't synced you should *already* have dirty VCPU_EXREG_PDPTR 
and pending KVM_REQ_LOAD_MMU_PGD, shouldn't you?  As long as the caching 
invariants are respected, this patch is fairly safe, and if they aren't 
there are plenty of preexisting bugs anyway.

Paolo

> The next patch does add a fairly heavy unload of the current root for
> !TDP, but that's a bug fix and should be ordered before any
> optimizations anyways.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 16/15] KVM: X86: Update mmu->pdptrs only when it is changed
  2021-12-08  9:09     ` Paolo Bonzini
@ 2021-12-08  9:34       ` Lai Jiangshan
  0 siblings, 0 replies; 48+ messages in thread
From: Lai Jiangshan @ 2021-12-08  9:34 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Lai Jiangshan
  Cc: linux-kernel, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin



On 2021/12/8 17:09, Paolo Bonzini wrote:
> On 12/8/21 00:43, Sean Christopherson wrote:
>> what guarantees the that PDPTRs in the VMCS are sync'd with
>> mmu->pdptrs?  I'm not saying they aren't, I just want the changelog
>> to prove that they are.
> 
> If they aren't synced you should *already* have dirty VCPU_EXREG_PDPTR and pending KVM_REQ_LOAD_MMU_PGD, shouldn't you?  
> As long as the caching invariants are respected, this patch is fairly safe, and if they aren't there are plenty of 
> preexisting bugs anyway.
> 


They can be not synced in other side: not available.

If (!kvm_register_is_available(vcpu, VCPU_EXREG_PDPTR))
it will make no sense to compare mmu->pdptrs when EPT is enabled.

Because vmcs might have different copy, it is better to just mark it
dirty in load_pdptrs().

(SVM is OK even with NPT enabled, since vmcb doesn't have a copy)

I haven't investigated enough then and today.  It is quit complicated.

Thanks
Lai

> 
>> The next patch does add a fairly heavy unload of the current root for
>> !TDP, but that's a bug fix and should be ordered before any
>> optimizations anyways.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 17/15] KVM: X86: Ensure pae_root to be reconstructed for shadow paging if the guest PDPTEs is changed
  2021-12-08  4:00     ` Lai Jiangshan
@ 2021-12-08 15:29       ` Sean Christopherson
  0 siblings, 0 replies; 48+ messages in thread
From: Sean Christopherson @ 2021-12-08 15:29 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, linux-kernel, kvm, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Xiao Guangrong

On Wed, Dec 08, 2021, Lai Jiangshan wrote:
> 
> 
> On 2021/12/8 08:15, Sean Christopherson wrote:
> > > 
> > > The commit 21823fbda552("KVM: x86: Invalidate all PGDs for the current
> > > PCID on MOV CR3 w/ flush") skips kvm_mmu_free_roots() after
> > > load_pdptrs() when rewriting the CR3 with the same value.
> > 
> > This isn't accurate, prior to that commit KVM wasn't guaranteed to do
> > kvm_mmu_free_roots() if it got a hit on the current CR3 or if a previous CR3 in
> > the cache matched the new CR3 (the "cache" has done some odd things in the past).
> > 
> > So I think this particular flavor would be:
> > 
> >    Fixes: 7c390d350f8b ("kvm: x86: Add fast CR3 switch code path")
> 
> If guest is 32bit, fast_cr3_switch() always return false, and
> kvm_mmu_free_roots() is always called, and no cr3 goes in prev_root.

Oh, duh, PAE paging.

> > > +		/*
> > > +		 * Ensure the dirty PDPTEs to be loaded for VMX with EPT
> > > +		 * enabled or pae_root to be reconstructed for shadow paging.
> > > +		 */
> > > +		if (tdp_enabled)
> > > +			kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> > > +		else
> > > +			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
> > 
> > Shouldn't matter since it's legacy shadow paging, but @mmu should be used instead
> > of vcpu->arch.mmuvcpu->arch.mmu.
> 
> @mmu is the "guest mmu" (vcpu->arch.walk_mmu), which is used to walk
> including loading pdptr.
> 
> vcpu->arch.mmu is for host constructing mmu for shadowed or tdp mmu
> which is used in host side management including kvm_mmu_free_roots().
> 
> Even they are the same pointer now for !tdp, the meaning is different.  I prefer
> to distinguish them even before kvm_mmu is split different for guest mmu
> (vcpu->arch.walk_mmu) and host constructing mmu (vcpu->arch.mmu).

I see where you're coming from.  I was viewing it from the perspective of,
"they're the same thing for shadow paging, why reread mmu?".

I'm ok with explicitly referencing arch.mmu, but maybe add a comment?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 17/15] KVM: X86: Ensure pae_root to be reconstructed for shadow paging if the guest PDPTEs is changed
  2021-12-08  0:15   ` Sean Christopherson
  2021-12-08  4:00     ` Lai Jiangshan
@ 2021-12-09 22:46     ` Paolo Bonzini
  2021-12-10 21:07       ` Sean Christopherson
  1 sibling, 1 reply; 48+ messages in thread
From: Paolo Bonzini @ 2021-12-09 22:46 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: linux-kernel, kvm, Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Xiao Guangrong

On 12/8/21 01:15, Sean Christopherson wrote:
>> @@ -832,8 +832,14 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
>>   	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
>>   		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
>>   		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
>> -		/* Ensure the dirty PDPTEs to be loaded. */
>> -		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
>> +		/*
>> +		 * Ensure the dirty PDPTEs to be loaded for VMX with EPT
>> +		 * enabled or pae_root to be reconstructed for shadow paging.
>> +		 */
>> +		if (tdp_enabled)
>> +			kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
>> +		else
>> +			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
> Shouldn't matter since it's legacy shadow paging, but @mmu should be used instead
> of vcpu->arch.mmuvcpu->arch.mmu.

In kvm/next actually there's no mmu parameter to load_pdptrs, so it's 
okay to keep vcpu->arch.mmu.

> To avoid a dependency on the previous patch, I think it makes sense to have this be:
> 
> 	if (!tdp_enabled && memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs)))
> 		kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOT_CURRENT);
> 
> before the memcpy().
> 
> Then we can decide independently if skipping the KVM_REQ_LOAD_MMU_PGD if the
> PDPTRs are unchanged with respect to the MMU is safe.

Do you disagree that there's already an invariant that the PDPTRs can 
only be dirty if KVM_REQ_LOAD_MMU_PGD---and therefore a previous change 
to the PDPTRs would have triggered KVM_REQ_LOAD_MMU_PGD?  This is 
opposed to the guest TLB flush due to MOV CR3; that one has to be done 
unconditionally for PAE paging, and it is handled separately within 
kvm_set_cr3.

Paolo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 17/15] KVM: X86: Ensure pae_root to be reconstructed for shadow paging if the guest PDPTEs is changed
  2021-12-09 22:46     ` Paolo Bonzini
@ 2021-12-10 21:07       ` Sean Christopherson
  2021-12-10 21:08         ` Sean Christopherson
  2021-12-11  6:56         ` Maxim Levitsky
  0 siblings, 2 replies; 48+ messages in thread
From: Sean Christopherson @ 2021-12-10 21:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Lai Jiangshan, linux-kernel, kvm, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Xiao Guangrong

On Thu, Dec 09, 2021, Paolo Bonzini wrote:
> On 12/8/21 01:15, Sean Christopherson wrote:
> > > @@ -832,8 +832,14 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
> > >   	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
> > >   		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
> > >   		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> > > -		/* Ensure the dirty PDPTEs to be loaded. */
> > > -		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> > > +		/*
> > > +		 * Ensure the dirty PDPTEs to be loaded for VMX with EPT
> > > +		 * enabled or pae_root to be reconstructed for shadow paging.
> > > +		 */
> > > +		if (tdp_enabled)
> > > +			kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> > > +		else
> > > +			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
> > Shouldn't matter since it's legacy shadow paging, but @mmu should be used instead
> > of vcpu->arch.mmuvcpu->arch.mmu.
> 
> In kvm/next actually there's no mmu parameter to load_pdptrs, so it's okay
> to keep vcpu->arch.mmu.
> 
> > To avoid a dependency on the previous patch, I think it makes sense to have this be:
> > 
> > 	if (!tdp_enabled && memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs)))
> > 		kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOT_CURRENT);
> > 
> > before the memcpy().
> > 
> > Then we can decide independently if skipping the KVM_REQ_LOAD_MMU_PGD if the
> > PDPTRs are unchanged with respect to the MMU is safe.
> 
> Do you disagree that there's already an invariant that the PDPTRs can only
> be dirty if KVM_REQ_LOAD_MMU_PGD---and therefore a previous change to the
> PDPTRs would have triggered KVM_REQ_LOAD_MMU_PGD?

What I think is moot, because commit 24cd19a28cb7 ("KVM: X86: Update mmu->pdptrs
only when it is changed") breaks nested VMs with EPT in L0 and PAE shadow paging
in L2.  Reproducing is trivial, just disable EPT in L1 and run a VM.  I haven't
investigating how it breaks things, because why it's broken is secondary for me.

My primary concern is that we would even consider optimizing the PDPTR logic without
a mountain of evidence that any patch is correct for all scenarios.  We had to add
an entire ioctl() just to get PDPTRs functional.  This apparently wasn't validated
against a simple use case, let alone against things like migration with nested VMs,
multliple L2s, etc...

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 17/15] KVM: X86: Ensure pae_root to be reconstructed for shadow paging if the guest PDPTEs is changed
  2021-12-10 21:07       ` Sean Christopherson
@ 2021-12-10 21:08         ` Sean Christopherson
  2021-12-11  6:56         ` Maxim Levitsky
  1 sibling, 0 replies; 48+ messages in thread
From: Sean Christopherson @ 2021-12-10 21:08 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Lai Jiangshan, linux-kernel, kvm, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Xiao Guangrong

On Fri, Dec 10, 2021, Sean Christopherson wrote:
> On Thu, Dec 09, 2021, Paolo Bonzini wrote:
> > On 12/8/21 01:15, Sean Christopherson wrote:
> > > > @@ -832,8 +832,14 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
> > > >   	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
> > > >   		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
> > > >   		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> > > > -		/* Ensure the dirty PDPTEs to be loaded. */
> > > > -		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> > > > +		/*
> > > > +		 * Ensure the dirty PDPTEs to be loaded for VMX with EPT
> > > > +		 * enabled or pae_root to be reconstructed for shadow paging.
> > > > +		 */
> > > > +		if (tdp_enabled)
> > > > +			kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> > > > +		else
> > > > +			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
> > > Shouldn't matter since it's legacy shadow paging, but @mmu should be used instead
> > > of vcpu->arch.mmuvcpu->arch.mmu.
> > 
> > In kvm/next actually there's no mmu parameter to load_pdptrs, so it's okay
> > to keep vcpu->arch.mmu.
> > 
> > > To avoid a dependency on the previous patch, I think it makes sense to have this be:
> > > 
> > > 	if (!tdp_enabled && memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs)))
> > > 		kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOT_CURRENT);
> > > 
> > > before the memcpy().
> > > 
> > > Then we can decide independently if skipping the KVM_REQ_LOAD_MMU_PGD if the
> > > PDPTRs are unchanged with respect to the MMU is safe.
> > 
> > Do you disagree that there's already an invariant that the PDPTRs can only
> > be dirty if KVM_REQ_LOAD_MMU_PGD---and therefore a previous change to the
> > PDPTRs would have triggered KVM_REQ_LOAD_MMU_PGD?
> 
> What I think is moot, because commit 24cd19a28cb7 ("KVM: X86: Update mmu->pdptrs
> only when it is changed") breaks nested VMs with EPT in L0 and PAE shadow paging
> in L2.  Reproducing is trivial, just disable EPT in L1 and run a VM.  I haven't

Doh, s/L2/L1

> investigating how it breaks things, because why it's broken is secondary for me.
> 
> My primary concern is that we would even consider optimizing the PDPTR logic without
> a mountain of evidence that any patch is correct for all scenarios.  We had to add
> an entire ioctl() just to get PDPTRs functional.  This apparently wasn't validated
> against a simple use case, let alone against things like migration with nested VMs,
> multliple L2s, etc...

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 17/15] KVM: X86: Ensure pae_root to be reconstructed for shadow paging if the guest PDPTEs is changed
  2021-12-10 21:07       ` Sean Christopherson
  2021-12-10 21:08         ` Sean Christopherson
@ 2021-12-11  6:56         ` Maxim Levitsky
  2021-12-11  8:22           ` Paolo Bonzini
  1 sibling, 1 reply; 48+ messages in thread
From: Maxim Levitsky @ 2021-12-11  6:56 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Lai Jiangshan, linux-kernel, kvm, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Xiao Guangrong

On Fri, 2021-12-10 at 21:07 +0000, Sean Christopherson wrote:
> On Thu, Dec 09, 2021, Paolo Bonzini wrote:
> > On 12/8/21 01:15, Sean Christopherson wrote:
> > > > @@ -832,8 +832,14 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
> > > >   	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
> > > >   		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
> > > >   		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> > > > -		/* Ensure the dirty PDPTEs to be loaded. */
> > > > -		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> > > > +		/*
> > > > +		 * Ensure the dirty PDPTEs to be loaded for VMX with EPT
> > > > +		 * enabled or pae_root to be reconstructed for shadow paging.
> > > > +		 */
> > > > +		if (tdp_enabled)
> > > > +			kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> > > > +		else
> > > > +			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
> > > Shouldn't matter since it's legacy shadow paging, but @mmu should be used instead
> > > of vcpu->arch.mmuvcpu->arch.mmu.
> > 
> > In kvm/next actually there's no mmu parameter to load_pdptrs, so it's okay
> > to keep vcpu->arch.mmu.
> > 
> > > To avoid a dependency on the previous patch, I think it makes sense to have this be:
> > > 
> > > 	if (!tdp_enabled && memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs)))
> > > 		kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOT_CURRENT);
> > > 
> > > before the memcpy().
> > > 
> > > Then we can decide independently if skipping the KVM_REQ_LOAD_MMU_PGD if the
> > > PDPTRs are unchanged with respect to the MMU is safe.
> > 
> > Do you disagree that there's already an invariant that the PDPTRs can only
> > be dirty if KVM_REQ_LOAD_MMU_PGD---and therefore a previous change to the
> > PDPTRs would have triggered KVM_REQ_LOAD_MMU_PGD?
> 
> What I think is moot, because commit 24cd19a28cb7 ("KVM: X86: Update mmu->pdptrs
> only when it is changed") breaks nested VMs with EPT in L0 and PAE shadow paging
> in L2.  Reproducing is trivial, just disable EPT in L1 and run a VM.  I haven't
> investigating how it breaks things, because why it's broken is secondary for me.
> 
> My primary concern is that we would even consider optimizing the PDPTR logic without
> a mountain of evidence that any patch is correct for all scenarios.  We had to add
> an entire ioctl() just to get PDPTRs functional.  This apparently wasn't validated
> against a simple use case, let alone against things like migration with nested VMs,
> multliple L2s, etc...

I did validate the *SREGS2* against all the cases I could (like migration, EPT/NPT disabled/etc.
I even started testing SMM to see how it affects PDPTRs, and patched seabios to use PAE paging.
I still could have missed something.

But note that qemu still doesn't use that ioctl (patch stuck in review).

Best regards,
	Maxim Levitsky


> 



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 17/15] KVM: X86: Ensure pae_root to be reconstructed for shadow paging if the guest PDPTEs is changed
  2021-12-11  6:56         ` Maxim Levitsky
@ 2021-12-11  8:22           ` Paolo Bonzini
  2021-12-13 16:54             ` Sean Christopherson
  0 siblings, 1 reply; 48+ messages in thread
From: Paolo Bonzini @ 2021-12-11  8:22 UTC (permalink / raw)
  To: Maxim Levitsky, Sean Christopherson
  Cc: Lai Jiangshan, linux-kernel, kvm, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Xiao Guangrong

On 12/11/21 07:56, Maxim Levitsky wrote:
>> This apparently wasn't validated against a simple use case, let
>> alone against things like migration with nested VMs, multliple L2s,
>> etc...
> 
> I did validate the *SREGS2* against all the cases I could (like
> migration, EPT/NPT disabled/etc. I even started testing SMM to see
> how it affects PDPTRs, and patched seabios to use PAE paging. I still
> could have missed something.

Don't worry, I think Sean was talking about patch 16 and specifically
digging at me (who deserved it completely).

Paolo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 17/15] KVM: X86: Ensure pae_root to be reconstructed for shadow paging if the guest PDPTEs is changed
  2021-12-11  8:22           ` Paolo Bonzini
@ 2021-12-13 16:54             ` Sean Christopherson
  0 siblings, 0 replies; 48+ messages in thread
From: Sean Christopherson @ 2021-12-13 16:54 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Maxim Levitsky, Lai Jiangshan, linux-kernel, kvm, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Xiao Guangrong

On Sat, Dec 11, 2021, Paolo Bonzini wrote:
> On 12/11/21 07:56, Maxim Levitsky wrote:
> > > This apparently wasn't validated against a simple use case, let
> > > alone against things like migration with nested VMs, multliple L2s,
> > > etc...
> > 
> > I did validate the *SREGS2* against all the cases I could (like
> > migration, EPT/NPT disabled/etc. I even started testing SMM to see
> > how it affects PDPTRs, and patched seabios to use PAE paging. I still
> > could have missed something.
> 
> Don't worry, I think Sean was talking about patch 16 and specifically
> digging at me (who deserved it completely).

Yes, patch 16.  My goal wasn't to dig at anyone, I just wanted to dramatically
emphasize how ridiculousy fragile and complex the PDPTR crud is due to the number
of edge cases.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 11/15] KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty
  2021-11-08 12:44 ` [PATCH 11/15] KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty Lai Jiangshan
@ 2021-12-15 15:47   ` Maxim Levitsky
  2021-12-15 16:31     ` Lai Jiangshan
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Levitsky @ 2021-12-15 15:47 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

On Mon, 2021-11-08 at 20:44 +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> When vcpu->arch.cr3 is changed, it is marked dirty, so vmcs.GUEST_CR3
> can be updated only when kvm_register_is_dirty(vcpu, VCPU_EXREG_CR3).
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d94e51e9c08f..38b65b97fb7b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3126,9 +3126,9 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>  
>  		if (!enable_unrestricted_guest && !is_paging(vcpu))
>  			guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
> -		else if (test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail))
> +		else if (kvm_register_is_dirty(vcpu, VCPU_EXREG_CR3))
>  			guest_cr3 = vcpu->arch.cr3;
> -		else /* vmcs01.GUEST_CR3 is already up-to-date. */
> +		else /* vmcs.GUEST_CR3 is already up-to-date. */
>  			update_guest_cr3 = false;
>  		vmx_ept_load_pdptrs(vcpu);
>  	} else {


I just bisected this patch to break booting a VM with ept=1 but unrestricted_guest=0
(I needed to re-test unrestricted_guest=0 bug related to SMM, but didn't want
to boot without EPT. With ept=0,the VM boots with this patch applied).

Reverting this patch on top of kvm/queue also works.

Best regards,
	Maxim levitsky


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 11/15] KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty
  2021-12-15 15:47   ` Maxim Levitsky
@ 2021-12-15 16:31     ` Lai Jiangshan
  2021-12-15 16:43       ` Lai Jiangshan
                         ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Lai Jiangshan @ 2021-12-15 16:31 UTC (permalink / raw)
  To: Maxim Levitsky, Lai Jiangshan, linux-kernel, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin



On 2021/12/15 23:47, Maxim Levitsky wrote:
> On Mon, 2021-11-08 at 20:44 +0800, Lai Jiangshan wrote:
>> From: Lai Jiangshan <laijs@linux.alibaba.com>
>>
>> When vcpu->arch.cr3 is changed, it is marked dirty, so vmcs.GUEST_CR3
>> can be updated only when kvm_register_is_dirty(vcpu, VCPU_EXREG_CR3).
>>
>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>> ---
>>   arch/x86/kvm/vmx/vmx.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index d94e51e9c08f..38b65b97fb7b 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -3126,9 +3126,9 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>>   
>>   		if (!enable_unrestricted_guest && !is_paging(vcpu))
>>   			guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
>> -		else if (test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail))
>> +		else if (kvm_register_is_dirty(vcpu, VCPU_EXREG_CR3))
>>   			guest_cr3 = vcpu->arch.cr3;
>> -		else /* vmcs01.GUEST_CR3 is already up-to-date. */
>> +		else /* vmcs.GUEST_CR3 is already up-to-date. */
>>   			update_guest_cr3 = false;
>>   		vmx_ept_load_pdptrs(vcpu);
>>   	} else {
> 
> 
> I just bisected this patch to break booting a VM with ept=1 but unrestricted_guest=0
> (I needed to re-test unrestricted_guest=0 bug related to SMM, but didn't want
> to boot without EPT. With ept=0,the VM boots with this patch applied).
> 


Thanks for reporting.

Sorry, I never tested it with unrestricted_guest=0. I can't reproduce it now shortly
with unrestricted_guest=0.  Maybe it can be reproduced easily if I try more guests or
I write a piece of guest code to deliberate hit it if the following analyses is correct.

All the paths changing %cr3 are followed with kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3)
and GUEST_CR3 will be expected to be updated.

What I missed is the case of "if (!enable_unrestricted_guest && !is_paging(vcpu))"
in vmx_load_mmu_pgd() which doesn't load GUEST_CR3 but clears dirty of VCPU_EXREG_CR3
(when after next run).

So when CR0 !PG -> PG, VCPU_EXREG_CR3 dirty bit should be set.

Maybe adding the following patch on top of the original patch can work.

Thanks
Lai

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 85127b3e3690..55b45005ebb9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -858,6 +858,7 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
  	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
  		kvm_clear_async_pf_completion_queue(vcpu);
  		kvm_async_pf_hash_reset(vcpu);
+		kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
  	}

  	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 11/15] KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty
  2021-12-15 16:31     ` Lai Jiangshan
@ 2021-12-15 16:43       ` Lai Jiangshan
  2021-12-15 16:45       ` Sean Christopherson
  2021-12-15 20:20       ` Maxim Levitsky
  2 siblings, 0 replies; 48+ messages in thread
From: Lai Jiangshan @ 2021-12-15 16:43 UTC (permalink / raw)
  To: Maxim Levitsky, Lai Jiangshan, linux-kernel, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin



On 2021/12/16 00:31, Lai Jiangshan wrote:

> 
> What I missed is the case of "if (!enable_unrestricted_guest && !is_paging(vcpu))"
> in vmx_load_mmu_pgd() which doesn't load GUEST_CR3 but clears dirty of VCPU_EXREG_CR3
> (when after next run).

Oops.

What I missed is the case of "if (!enable_unrestricted_guest && !is_paging(vcpu))"
in vmx_load_mmu_pgd() which doesn't load GUEST_CR3 with the guest cr3 and
VCPU_EXREG_CR3 dirty bit is cleared after VMEXIT.  When !PG -> PG, GUEST_CR3 is
still the ept_identity_map_addr, and VCPU_EXREG_CR3 dirty bit is not set, so
vmx_load_mmu_pgd() doesn't update GUEST_CR3.


> So when CR0 !PG -> PG, VCPU_EXREG_CR3 dirty bit should be set.
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 11/15] KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty
  2021-12-15 16:31     ` Lai Jiangshan
  2021-12-15 16:43       ` Lai Jiangshan
@ 2021-12-15 16:45       ` Sean Christopherson
  2021-12-15 17:10         ` Paolo Bonzini
  2021-12-15 20:21         ` Maxim Levitsky
  2021-12-15 20:20       ` Maxim Levitsky
  2 siblings, 2 replies; 48+ messages in thread
From: Sean Christopherson @ 2021-12-15 16:45 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Maxim Levitsky, Lai Jiangshan, linux-kernel, kvm, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin

On Thu, Dec 16, 2021, Lai Jiangshan wrote:
> 
> 
> On 2021/12/15 23:47, Maxim Levitsky wrote:
> > On Mon, 2021-11-08 at 20:44 +0800, Lai Jiangshan wrote:
> > > From: Lai Jiangshan <laijs@linux.alibaba.com>
> > > 
> > > When vcpu->arch.cr3 is changed, it is marked dirty, so vmcs.GUEST_CR3
> > > can be updated only when kvm_register_is_dirty(vcpu, VCPU_EXREG_CR3).
> > > 
> > > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > > ---
> > >   arch/x86/kvm/vmx/vmx.c | 4 ++--
> > >   1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index d94e51e9c08f..38b65b97fb7b 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -3126,9 +3126,9 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> > >   		if (!enable_unrestricted_guest && !is_paging(vcpu))
> > >   			guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
> > > -		else if (test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail))
> > > +		else if (kvm_register_is_dirty(vcpu, VCPU_EXREG_CR3))
> > >   			guest_cr3 = vcpu->arch.cr3;
> > > -		else /* vmcs01.GUEST_CR3 is already up-to-date. */
> > > +		else /* vmcs.GUEST_CR3 is already up-to-date. */
> > >   			update_guest_cr3 = false;
> > >   		vmx_ept_load_pdptrs(vcpu);
> > >   	} else {
> > 
> > 
> > I just bisected this patch to break booting a VM with ept=1 but unrestricted_guest=0
> > (I needed to re-test unrestricted_guest=0 bug related to SMM, but didn't want
> > to boot without EPT. With ept=0,the VM boots with this patch applied).
> > 
> 
> 
> Thanks for reporting.
> 
> Sorry, I never tested it with unrestricted_guest=0. I can't reproduce it now shortly
> with unrestricted_guest=0.  Maybe it can be reproduced easily if I try more guests or
> I write a piece of guest code to deliberate hit it if the following analyses is correct.
> 
> All the paths changing %cr3 are followed with kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3)
> and GUEST_CR3 will be expected to be updated.
> 
> What I missed is the case of "if (!enable_unrestricted_guest && !is_paging(vcpu))"
> in vmx_load_mmu_pgd() which doesn't load GUEST_CR3 but clears dirty of VCPU_EXREG_CR3
> (when after next run).
> 
> So when CR0 !PG -> PG, VCPU_EXREG_CR3 dirty bit should be set.
> 
> Maybe adding the following patch on top of the original patch can work.
> 
> Thanks
> Lai
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 85127b3e3690..55b45005ebb9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -858,6 +858,7 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
>  	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
>  		kvm_clear_async_pf_completion_queue(vcpu);
>  		kvm_async_pf_hash_reset(vcpu);
> +		kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);

The better place for this is the path in vmx_set_cr0() that handles EPT + !URG.
I believe nested_vmx_restore_host_state(), which is used to restore host state if
hardware detects a VM-Fail that KVM misses on nested VM-Enter, is also missing a
call to mark CR3 dirty.

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 2f6f465e575f..b0c3eb80f5d5 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4426,7 +4426,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)

        nested_ept_uninit_mmu_context(vcpu);
        vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
-       kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
+       kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);

        /*
         * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 640f4719612c..1701cee96678 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3066,8 +3066,10 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
                }

                /* Note, vmx_set_cr4() consumes the new vcpu->arch.cr0. */
-               if ((old_cr0_pg ^ cr0) & X86_CR0_PG)
+               if ((old_cr0_pg ^ cr0) & X86_CR0_PG) {
                        vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
+                       kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
+               }
        }

        /* depends on vcpu->arch.cr0 to be set to a new value */


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 11/15] KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty
  2021-12-15 16:45       ` Sean Christopherson
@ 2021-12-15 17:10         ` Paolo Bonzini
  2021-12-15 20:21         ` Maxim Levitsky
  1 sibling, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2021-12-15 17:10 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: Maxim Levitsky, Lai Jiangshan, linux-kernel, kvm,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin

On 12/15/21 17:45, Sean Christopherson wrote:
> The better place for this is the path in vmx_set_cr0() that handles EPT + !URG.
> I believe nested_vmx_restore_host_state(), which is used to restore host state if
> hardware detects a VM-Fail that KVM misses on nested VM-Enter, is also missing a
> call to mark CR3 dirty.
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 2f6f465e575f..b0c3eb80f5d5 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4426,7 +4426,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
> 
>          nested_ept_uninit_mmu_context(vcpu);
>          vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
> -       kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
> +       kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
> 
>          /*
>           * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs

Yes, I agree.  Jiangshan, please send all your fixes tomorrow, or tell 
me if you prefer that I push the reverts to kvm/next.

Paolo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 11/15] KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty
  2021-12-15 16:31     ` Lai Jiangshan
  2021-12-15 16:43       ` Lai Jiangshan
  2021-12-15 16:45       ` Sean Christopherson
@ 2021-12-15 20:20       ` Maxim Levitsky
  2 siblings, 0 replies; 48+ messages in thread
From: Maxim Levitsky @ 2021-12-15 20:20 UTC (permalink / raw)
  To: Lai Jiangshan, Lai Jiangshan, linux-kernel, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin

On Thu, 2021-12-16 at 00:31 +0800, Lai Jiangshan wrote:
> 
> On 2021/12/15 23:47, Maxim Levitsky wrote:
> > On Mon, 2021-11-08 at 20:44 +0800, Lai Jiangshan wrote:
> > > From: Lai Jiangshan <laijs@linux.alibaba.com>
> > > 
> > > When vcpu->arch.cr3 is changed, it is marked dirty, so vmcs.GUEST_CR3
> > > can be updated only when kvm_register_is_dirty(vcpu, VCPU_EXREG_CR3).
> > > 
> > > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > > ---
> > >   arch/x86/kvm/vmx/vmx.c | 4 ++--
> > >   1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index d94e51e9c08f..38b65b97fb7b 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -3126,9 +3126,9 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> > >   
> > >   		if (!enable_unrestricted_guest && !is_paging(vcpu))
> > >   			guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
> > > -		else if (test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail))
> > > +		else if (kvm_register_is_dirty(vcpu, VCPU_EXREG_CR3))
> > >   			guest_cr3 = vcpu->arch.cr3;
> > > -		else /* vmcs01.GUEST_CR3 is already up-to-date. */
> > > +		else /* vmcs.GUEST_CR3 is already up-to-date. */
> > >   			update_guest_cr3 = false;
> > >   		vmx_ept_load_pdptrs(vcpu);
> > >   	} else {
> > 
> > I just bisected this patch to break booting a VM with ept=1 but unrestricted_guest=0
> > (I needed to re-test unrestricted_guest=0 bug related to SMM, but didn't want
> > to boot without EPT. With ept=0,the VM boots with this patch applied).
> > 
> 
> Thanks for reporting.
> 
> Sorry, I never tested it with unrestricted_guest=0. I can't reproduce it now shortly
> with unrestricted_guest=0.  Maybe it can be reproduced easily if I try more guests or
> I write a piece of guest code to deliberate hit it if the following analyses is correct.
> 
> All the paths changing %cr3 are followed with kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3)
> and GUEST_CR3 will be expected to be updated.
> 
> What I missed is the case of "if (!enable_unrestricted_guest && !is_paging(vcpu))"
> in vmx_load_mmu_pgd() which doesn't load GUEST_CR3 but clears dirty of VCPU_EXREG_CR3
> (when after next run).
> 
> So when CR0 !PG -> PG, VCPU_EXREG_CR3 dirty bit should be set.
> 
> Maybe adding the following patch on top of the original patch can work.
> 
> Thanks
> Lai
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 85127b3e3690..55b45005ebb9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -858,6 +858,7 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
>   	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
>   		kvm_clear_async_pf_completion_queue(vcpu);
>   		kvm_async_pf_hash_reset(vcpu);
> +		kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
>   	}
> 
>   	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> 

Tested this patch and my guests boot. I didn't test more stuff like migration or so,
will do tomorrow.

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 11/15] KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty
  2021-12-15 16:45       ` Sean Christopherson
  2021-12-15 17:10         ` Paolo Bonzini
@ 2021-12-15 20:21         ` Maxim Levitsky
  1 sibling, 0 replies; 48+ messages in thread
From: Maxim Levitsky @ 2021-12-15 20:21 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: Lai Jiangshan, linux-kernel, kvm, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin

On Wed, 2021-12-15 at 16:45 +0000, Sean Christopherson wrote:
> On Thu, Dec 16, 2021, Lai Jiangshan wrote:
> > 
> > On 2021/12/15 23:47, Maxim Levitsky wrote:
> > > On Mon, 2021-11-08 at 20:44 +0800, Lai Jiangshan wrote:
> > > > From: Lai Jiangshan <laijs@linux.alibaba.com>
> > > > 
> > > > When vcpu->arch.cr3 is changed, it is marked dirty, so vmcs.GUEST_CR3
> > > > can be updated only when kvm_register_is_dirty(vcpu, VCPU_EXREG_CR3).
> > > > 
> > > > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > > > ---
> > > >   arch/x86/kvm/vmx/vmx.c | 4 ++--
> > > >   1 file changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > > index d94e51e9c08f..38b65b97fb7b 100644
> > > > --- a/arch/x86/kvm/vmx/vmx.c
> > > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > > @@ -3126,9 +3126,9 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> > > >   		if (!enable_unrestricted_guest && !is_paging(vcpu))
> > > >   			guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
> > > > -		else if (test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail))
> > > > +		else if (kvm_register_is_dirty(vcpu, VCPU_EXREG_CR3))
> > > >   			guest_cr3 = vcpu->arch.cr3;
> > > > -		else /* vmcs01.GUEST_CR3 is already up-to-date. */
> > > > +		else /* vmcs.GUEST_CR3 is already up-to-date. */
> > > >   			update_guest_cr3 = false;
> > > >   		vmx_ept_load_pdptrs(vcpu);
> > > >   	} else {
> > > 
> > > I just bisected this patch to break booting a VM with ept=1 but unrestricted_guest=0
> > > (I needed to re-test unrestricted_guest=0 bug related to SMM, but didn't want
> > > to boot without EPT. With ept=0,the VM boots with this patch applied).
> > > 
> > 
> > Thanks for reporting.
> > 
> > Sorry, I never tested it with unrestricted_guest=0. I can't reproduce it now shortly
> > with unrestricted_guest=0.  Maybe it can be reproduced easily if I try more guests or
> > I write a piece of guest code to deliberate hit it if the following analyses is correct.
> > 
> > All the paths changing %cr3 are followed with kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3)
> > and GUEST_CR3 will be expected to be updated.
> > 
> > What I missed is the case of "if (!enable_unrestricted_guest && !is_paging(vcpu))"
> > in vmx_load_mmu_pgd() which doesn't load GUEST_CR3 but clears dirty of VCPU_EXREG_CR3
> > (when after next run).
> > 
> > So when CR0 !PG -> PG, VCPU_EXREG_CR3 dirty bit should be set.
> > 
> > Maybe adding the following patch on top of the original patch can work.
> > 
> > Thanks
> > Lai
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 85127b3e3690..55b45005ebb9 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -858,6 +858,7 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
> >  	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
> >  		kvm_clear_async_pf_completion_queue(vcpu);
> >  		kvm_async_pf_hash_reset(vcpu);
> > +		kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
> 
> The better place for this is the path in vmx_set_cr0() that handles EPT + !URG.
> I believe nested_vmx_restore_host_state(), which is used to restore host state if
> hardware detects a VM-Fail that KVM misses on nested VM-Enter, is also missing a
> call to mark CR3 dirty.
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 2f6f465e575f..b0c3eb80f5d5 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4426,7 +4426,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
> 
>         nested_ept_uninit_mmu_context(vcpu);
>         vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
> -       kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
> +       kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
> 
>         /*
>          * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 640f4719612c..1701cee96678 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3066,8 +3066,10 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
>                 }
> 
>                 /* Note, vmx_set_cr4() consumes the new vcpu->arch.cr0. */
> -               if ((old_cr0_pg ^ cr0) & X86_CR0_PG)
> +               if ((old_cr0_pg ^ cr0) & X86_CR0_PG) {
>                         vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
> +                       kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
> +               }
>         }
> 
>         /* depends on vcpu->arch.cr0 to be set to a new value */
> 

I tested this patch and my guests boot as well. I didn't test more stuff like migration or so,
will do tomorrow.

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2021-12-15 20:21 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-08 12:43 [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan
2021-11-08 12:43 ` [PATCH 01/15] KVM: X86: Ensure the dirty PDPTEs to be loaded Lai Jiangshan
2021-11-08 12:43 ` [PATCH 02/15] KVM: VMX: Mark VCPU_EXREG_PDPTR available in ept_save_pdptrs() Lai Jiangshan
2021-11-08 12:43 ` [PATCH 03/15] KVM: SVM: Always clear available of VCPU_EXREG_PDPTR in svm_vcpu_run() Lai Jiangshan
2021-11-08 12:43 ` [PATCH 04/15] KVM: VMX: Add and use X86_CR4_TLB_BITS when !enable_ept Lai Jiangshan
2021-11-18 15:18   ` Paolo Bonzini
2021-11-08 12:43 ` [PATCH 05/15] KVM: VMX: Add and use X86_CR4_PDPTR_BITS " Lai Jiangshan
2021-11-08 12:43 ` [PATCH 06/15] KVM: X86: Move CR0 pdptr_bits into header file as X86_CR0_PDPTR_BITS Lai Jiangshan
2021-11-08 12:43 ` [PATCH 07/15] KVM: SVM: Remove outdated comment in svm_load_mmu_pgd() Lai Jiangshan
2021-11-08 12:44 ` [PATCH 08/15] KVM: SVM: Remove useless check " Lai Jiangshan
2021-11-08 12:44 ` [PATCH 09/15] KVM: SVM: Remove the unneeded code to mark available for CR3 Lai Jiangshan
2021-11-18 15:17   ` Paolo Bonzini
2021-11-08 12:44 ` [PATCH 10/15] KVM: X86: Mark CR3 dirty when vcpu->arch.cr3 is changed Lai Jiangshan
2021-11-08 12:44 ` [PATCH 11/15] KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty Lai Jiangshan
2021-12-15 15:47   ` Maxim Levitsky
2021-12-15 16:31     ` Lai Jiangshan
2021-12-15 16:43       ` Lai Jiangshan
2021-12-15 16:45       ` Sean Christopherson
2021-12-15 17:10         ` Paolo Bonzini
2021-12-15 20:21         ` Maxim Levitsky
2021-12-15 20:20       ` Maxim Levitsky
2021-11-08 12:44 ` [PATCH 12/15] KVM: VMX: Reset the bits that are meaningful to be reset in vmx_register_cache_reset() Lai Jiangshan
2021-11-18 15:25   ` Paolo Bonzini
2021-11-08 12:44 ` [PATCH 13/15] KVM: SVM: Add and use svm_register_cache_reset() Lai Jiangshan
2021-11-18 15:37   ` Paolo Bonzini
2021-11-18 16:28     ` Lai Jiangshan
2021-11-18 17:54       ` Paolo Bonzini
2021-11-19  0:49         ` Lai Jiangshan
2021-11-08 12:44 ` [PATCH 14/15] KVM: X86: Remove kvm_register_clear_available() Lai Jiangshan
2021-11-08 12:44 ` [PATCH 15/15] KVM: nVMX: Always write vmcs.GUEST_CR3 during nested VM-Exit Lai Jiangshan
2021-11-18 15:52   ` Paolo Bonzini
2021-11-11 14:45 ` [PATCH 16/15] KVM: X86: Update mmu->pdptrs only when it is changed Lai Jiangshan
2021-12-07 23:43   ` Sean Christopherson
2021-12-08  3:29     ` Lai Jiangshan
2021-12-08  9:09     ` Paolo Bonzini
2021-12-08  9:34       ` Lai Jiangshan
2021-11-11 14:46 ` [PATCH 17/15] KVM: X86: Ensure pae_root to be reconstructed for shadow paging if the guest PDPTEs " Lai Jiangshan
2021-11-23  9:34   ` Lai Jiangshan
2021-12-08  0:15   ` Sean Christopherson
2021-12-08  4:00     ` Lai Jiangshan
2021-12-08 15:29       ` Sean Christopherson
2021-12-09 22:46     ` Paolo Bonzini
2021-12-10 21:07       ` Sean Christopherson
2021-12-10 21:08         ` Sean Christopherson
2021-12-11  6:56         ` Maxim Levitsky
2021-12-11  8:22           ` Paolo Bonzini
2021-12-13 16:54             ` Sean Christopherson
2021-11-18  8:53 ` [PATCH 00/15] KVM: X86: Fix and clean up for register caches Lai Jiangshan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).