* [PATCH 0/4] KVM: X86: Improve guest TLB flushing
@ 2021-10-19 11:01 Lai Jiangshan
  2021-10-19 11:01 ` [PATCH 1/4] KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid() Lai Jiangshan
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Lai Jiangshan @ 2021-10-19 11:01 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini; +Cc: Lai Jiangshan

From: Lai Jiangshan <laijs@linux.alibaba.com>

This patchset focuses on guest TLB flushing related to prev_roots.
It keeps CR3 values cached in prev_roots and improves some comments.

Lai Jiangshan (4):
  KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid()
  KVM: X86: Cache CR3 in prev_roots when PCID is disabled
  KVM: X86: Use smp_rmb() to pair with smp_wmb() in
    mmu_try_to_unsync_pages()
  KVM: X86: Don't unload MMU in kvm_vcpu_flush_tlb_guest()

 arch/x86/kvm/mmu.h     |  1 +
 arch/x86/kvm/mmu/mmu.c | 63 ++++++++++++++++++++++++++++++++----------
 arch/x86/kvm/x86.c     | 53 +++++++++++++++++++++++++++++------
 3 files changed, 95 insertions(+), 22 deletions(-)

-- 
2.19.1.6.gb485710b



* [PATCH 1/4] KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid()
  2021-10-19 11:01 [PATCH 0/4] KVM: X86: Improve guest TLB flushing Lai Jiangshan
@ 2021-10-19 11:01 ` Lai Jiangshan
  2021-10-19 15:25   ` Sean Christopherson
  2021-10-19 11:01 ` [PATCH 2/4] KVM: X86: Cache CR3 in prev_roots when PCID is disabled Lai Jiangshan
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Lai Jiangshan @ 2021-10-19 11:01 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

KVM doesn't know whether any TLB entries for a specific PCID are cached
in the CPU when TDP is enabled, so it is better to flush all the guest
TLB entries when invalidating any single PCID context.

The case is rare or even impossible, since KVM doesn't intercept CR3
writes or INVPCID instructions when TDP is enabled.  The fix is just
for the sake of robustness, in case emulation can reach here or the
interception policy changes.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/x86.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c59b63c56af9..06169ed08db0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1073,6 +1073,16 @@ static void kvm_invalidate_pcid(struct kvm_vcpu *vcpu, unsigned long pcid)
 	unsigned long roots_to_free = 0;
 	int i;
 
+	/*
 +	 * It is very unlikely to reach here when tdp_enabled.  But if it is
 +	 * the case, KVM doesn't know whether any TLB entries for @pcid are
 +	 * cached in the CPU.  So just flush the guest TLB instead.
+	 */
+	if (unlikely(tdp_enabled)) {
+		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
+		return;
+	}
+
 	/*
 	 * If neither the current CR3 nor any of the prev_roots use the given
 	 * PCID, then nothing needs to be done here because a resync will
-- 
2.19.1.6.gb485710b



* [PATCH 2/4] KVM: X86: Cache CR3 in prev_roots when PCID is disabled
  2021-10-19 11:01 [PATCH 0/4] KVM: X86: Improve guest TLB flushing Lai Jiangshan
  2021-10-19 11:01 ` [PATCH 1/4] KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid() Lai Jiangshan
@ 2021-10-19 11:01 ` Lai Jiangshan
  2021-10-21 17:43   ` Paolo Bonzini
  2021-10-19 11:01 ` [PATCH 3/4] KVM: X86: Use smp_rmb() to pair with smp_wmb() in mmu_try_to_unsync_pages() Lai Jiangshan
  2021-10-19 11:01 ` [PATCH 4/4] KVM: X86: Don't unload MMU in kvm_vcpu_flush_tlb_guest() Lai Jiangshan
  3 siblings, 1 reply; 17+ messages in thread
From: Lai Jiangshan @ 2021-10-19 11:01 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

Commit 21823fbda5522 ("KVM: x86: Invalidate all PGDs for the current
PCID on MOV CR3 w/ flush") invalidates all PGDs for the specific PCID;
when PCID is disabled, that covers all PGDs in prev_roots, so the
commit made prev_roots totally unused in this case.

Not using prev_roots fixes a problem, present before the said commit,
that occurs when CR4.PCIDE is changed 0 -> 1:
	(CR4.PCIDE=0, CR3=cr3_a, the page for the guest
	 kernel is global, cr3_b is cached in prev_roots)

	modify the user part of cr3_b
		the shadow root of cr3_b is unsync in kvm
	INVPCID single context
		the guest expects the TLB to be clean for PCID=0
	change CR4.PCIDE 0 -> 1
	switch to cr3_b with PCID=0,NOFLUSH=1
		No sync in kvm, cr3_b is still unsync in kvm
	return to the user part (of cr3_b)
		the user accesses the wrong page
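
For reference, a sketch of how the guest could compose the final MOV-to-CR3
value in the scenario above (illustrative only; the NOFLUSH and PCID bit
positions are x86-architectural, and cr3_b_pa is a hypothetical page-table
base address):

	#define CR3_PCID_NOFLUSH	(1ULL << 63)	/* bit 63: keep TLB entries */
	#define CR3_PCID_MASK		0xFFFULL	/* bits 0-11: the PCID */

	/* PCID=0, NOFLUSH=1: the CPU keeps existing TLB entries for PCID 0 */
	static u64 make_cr3_noflush(u64 cr3_b_pa)
	{
		return (cr3_b_pa & ~CR3_PCID_MASK) | CR3_PCID_NOFLUSH;
	}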

It is a very unlikely case, but it shows that keeping prev_roots in the
virtualized guest TLB is not safe in this case, and the said commit did
fix the problem.

But the said commit also disabled caching CR3 in prev_roots when PCID
is disabled, and NOT all CPUs have PCID; in particular, PCID support
on AMD CPUs is relatively recent.  To restore the original
optimization, we have to re-enable caching CR3 without re-introducing
the problem.

In short, the said commit just ensures that prev_roots is not part of
the virtualized TLB.  So this change caches CR3 in prev_roots, and
keeps prev_roots out of the virtualized TLB by always flushing the
virtualized TLB when CR3 is switched from prev_roots to current (which
is already the current behavior) and by freeing prev_roots when
CR4.PCIDE is changed 0 -> 1.

Anyway:
PCID enabled: vTLB includes root_hpa, prev_roots and hardware TLB.
PCID disabled: vTLB includes root_hpa and hardware TLB, no prev_roots.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/x86.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 06169ed08db0..13df3ca88e09 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1022,10 +1022,29 @@ EXPORT_SYMBOL_GPL(kvm_is_valid_cr4);
 
 void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned long cr4)
 {
+	/*
+	 * If any role bit is changed, the MMU needs to be reset.
+	 *
 +	 * If CR4.PCIDE is changed 0 -> 1, there is no need to flush the guest
 +	 * TLB per the SDM, but the virtualized TLB doesn't include prev_roots
 +	 * when CR4.PCIDE is 0, so prev_roots has to be freed to avoid it being
 +	 * reused later without an explicit flush.
 +	 * If CR4.PCIDE is changed 1 -> 0, the guest TLB must be flushed, and
 +	 * KVM_REQ_MMU_RELOAD covers both cases.  Although KVM_REQ_MMU_RELOAD
 +	 * is slow, changing CR4.PCIDE is a rare case.
 +	 *
 +	 * If CR4.PGE is changed, only the guest TLB needs to be flushed.
 +	 *
 +	 * Note: resetting the MMU covers KVM_REQ_MMU_RELOAD, and
 +	 * KVM_REQ_MMU_RELOAD covers KVM_REQ_TLB_FLUSH_GUEST, so "else if" is
 +	 * used here and the checks for later cases are skipped if the check
 +	 * for a preceding case matches.
+	 */
 	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
 		kvm_mmu_reset_context(vcpu);
-	else if (((cr4 ^ old_cr4) & X86_CR4_PGE) ||
-		 (!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE)))
+	else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
+		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
+	else if ((cr4 ^ old_cr4) & X86_CR4_PGE)
 		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_post_set_cr4);
@@ -1093,6 +1112,15 @@ static void kvm_invalidate_pcid(struct kvm_vcpu *vcpu, unsigned long pcid)
 		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
 	}
 
+	/*
 +	 * If PCID is disabled, there is no need to free prev_roots even if the
 +	 * PCIDs for them are also 0.  The prev_roots are just not included in
 +	 * the "clean" virtualized TLB, and a resync will happen anyway before
 +	 * switching to any other CR3.
+	 */
+	if (!kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE))
+		return;
+
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
 		if (kvm_get_pcid(vcpu, mmu->prev_roots[i].pgd) == pcid)
 			roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
-- 
2.19.1.6.gb485710b



* [PATCH 3/4] KVM: X86: Use smp_rmb() to pair with smp_wmb() in mmu_try_to_unsync_pages()
  2021-10-19 11:01 [PATCH 0/4] KVM: X86: Improve guest TLB flushing Lai Jiangshan
  2021-10-19 11:01 ` [PATCH 1/4] KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid() Lai Jiangshan
  2021-10-19 11:01 ` [PATCH 2/4] KVM: X86: Cache CR3 in prev_roots when PCID is disabled Lai Jiangshan
@ 2021-10-19 11:01 ` Lai Jiangshan
  2021-10-21  2:32   ` Lai Jiangshan
  2021-10-21 17:44   ` Paolo Bonzini
  2021-10-19 11:01 ` [PATCH 4/4] KVM: X86: Don't unload MMU in kvm_vcpu_flush_tlb_guest() Lai Jiangshan
  3 siblings, 2 replies; 17+ messages in thread
From: Lai Jiangshan @ 2021-10-19 11:01 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Junaid Shahid, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

Commit 578e1c4db2213 ("kvm: x86: Avoid taking MMU lock in
kvm_mmu_sync_roots if no sync is needed") added smp_wmb() in
mmu_try_to_unsync_pages(), but the corresponding smp_load_acquire()
isn't used on the load of SPTE.W, which is impossible anyway since the
load of SPTE.W is performed by the CPU's page-table walk.

This patch changes the code to use smp_rmb() instead.  It fixes
nothing but the comments, since smp_rmb() is a NOP on x86 and a
compiler barrier() is not required because the load of SPTE.W happens
before VMEXIT.

Cc: Junaid Shahid <junaids@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/mmu/mmu.c | 47 +++++++++++++++++++++++++++++-------------
 1 file changed, 33 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c6ddb042b281..900c7a157c99 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2665,8 +2665,9 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	 *     (sp->unsync = true)
 	 *
 	 * The write barrier below ensures that 1.1 happens before 1.2 and thus
-	 * the situation in 2.4 does not arise. The implicit barrier in 2.2
-	 * pairs with this write barrier.
+	 * the situation in 2.4 does not arise.  The implicit read barrier
+	 * between 2.1's load of SPTE.W and 2.3 (as in is_unsync_root()) pairs
+	 * with this write barrier.
 	 */
 	smp_wmb();
 
@@ -3629,6 +3630,35 @@ static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
 #endif
 }
 
+static bool is_unsync_root(hpa_t root)
+{
+	struct kvm_mmu_page *sp;
+
+	/*
+	 * Even if another CPU was marking the SP as unsync-ed simultaneously,
+	 * any guest page table changes are not guaranteed to be visible anyway
+	 * until this VCPU issues a TLB flush strictly after those changes are
+	 * made.  We only need to ensure that the other CPU sets these flags
+	 * before any actual changes to the page tables are made.  The comments
+	 * in mmu_try_to_unsync_pages() describe what could go wrong if this
+	 * requirement isn't satisfied.
+	 *
 +	 * To pair with the smp_wmb() in mmu_try_to_unsync_pages() between the
 +	 * write to sp->unsync[_children] and the write to SPTE.W, a read
 +	 * barrier is needed after the CPU reads SPTE.W (or the read itself is
 +	 * an acquire operation) during the page-table walk and before the
 +	 * checks of sp->unsync[_children] here.  The CPU has already provided
 +	 * the needed semantics, but a NOP smp_rmb() here can provide symmetric
 +	 * pairing and richer information.
+	 */
+	smp_rmb();
+	sp = to_shadow_page(root);
+	if (sp->unsync || sp->unsync_children)
+		return true;
+
+	return false;
+}
+
 void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 {
 	int i;
@@ -3646,18 +3676,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 		hpa_t root = vcpu->arch.mmu->root_hpa;
 		sp = to_shadow_page(root);
 
-		/*
-		 * Even if another CPU was marking the SP as unsync-ed
-		 * simultaneously, any guest page table changes are not
-		 * guaranteed to be visible anyway until this VCPU issues a TLB
-		 * flush strictly after those changes are made. We only need to
-		 * ensure that the other CPU sets these flags before any actual
-		 * changes to the page tables are made. The comments in
-		 * mmu_try_to_unsync_pages() describe what could go wrong if
-		 * this requirement isn't satisfied.
-		 */
-		if (!smp_load_acquire(&sp->unsync) &&
-		    !smp_load_acquire(&sp->unsync_children))
+		if (!is_unsync_root(root))
 			return;
 
 		write_lock(&vcpu->kvm->mmu_lock);
-- 
2.19.1.6.gb485710b



* [PATCH 4/4] KVM: X86: Don't unload MMU in kvm_vcpu_flush_tlb_guest()
  2021-10-19 11:01 [PATCH 0/4] KVM: X86: Improve guest TLB flushing Lai Jiangshan
                   ` (2 preceding siblings ...)
  2021-10-19 11:01 ` [PATCH 3/4] KVM: X86: Use smp_rmb() to pair with smp_wmb() in mmu_try_to_unsync_pages() Lai Jiangshan
@ 2021-10-19 11:01 ` Lai Jiangshan
  3 siblings, 0 replies; 17+ messages in thread
From: Lai Jiangshan @ 2021-10-19 11:01 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

kvm_mmu_unload() destroys all the PGD caches.  Use the lighter
kvm_mmu_sync_roots() and kvm_mmu_sync_prev_roots() instead.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/mmu.h     |  1 +
 arch/x86/kvm/mmu/mmu.c | 16 ++++++++++++++++
 arch/x86/kvm/x86.c     | 11 +++++------
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 1ae70efedcf4..8e9dd63b68a9 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -79,6 +79,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 int kvm_mmu_load(struct kvm_vcpu *vcpu);
 void kvm_mmu_unload(struct kvm_vcpu *vcpu);
 void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu);
+void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
 
 static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 900c7a157c99..fb45eeb8dd22 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3634,6 +3634,9 @@ static bool is_unsync_root(hpa_t root)
 {
 	struct kvm_mmu_page *sp;
 
+	if (!VALID_PAGE(root))
+		return false;
+
 	/*
 	 * Even if another CPU was marking the SP as unsync-ed simultaneously,
 	 * any guest page table changes are not guaranteed to be visible anyway
@@ -3706,6 +3709,19 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 	write_unlock(&vcpu->kvm->mmu_lock);
 }
 
+void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu)
+{
+	unsigned long roots_to_free = 0;
+	int i;
+
+	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+		if (is_unsync_root(vcpu->arch.mmu->prev_roots[i].hpa))
+			roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
+
+	/* sync prev_roots by simply freeing them */
+	kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, roots_to_free);
+}
+
 static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gpa_t vaddr,
 				  u32 access, struct x86_exception *exception)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 13df3ca88e09..1771cd4bb449 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3251,15 +3251,14 @@ static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
 	++vcpu->stat.tlb_flush;
 
 	if (!tdp_enabled) {
-               /*
+		/*
 		 * A TLB flush on behalf of the guest is equivalent to
 		 * INVPCID(all), toggling CR4.PGE, etc., which requires
-		 * a forced sync of the shadow page tables.  Unload the
-		 * entire MMU here and the subsequent load will sync the
-		 * shadow page tables, and also flush the TLB.
+		 * a forced sync of the shadow page tables.  Ensure all the
+		 * roots are synced and the guest TLB in hardware is clean.
 		 */
-		kvm_mmu_unload(vcpu);
-		return;
+		kvm_mmu_sync_roots(vcpu);
+		kvm_mmu_sync_prev_roots(vcpu);
 	}
 
 	static_call(kvm_x86_tlb_flush_guest)(vcpu);
-- 
2.19.1.6.gb485710b



* Re: [PATCH 1/4] KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid()
  2021-10-19 11:01 ` [PATCH 1/4] KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid() Lai Jiangshan
@ 2021-10-19 15:25   ` Sean Christopherson
  2021-10-20  9:54     ` Lai Jiangshan
  0 siblings, 1 reply; 17+ messages in thread
From: Sean Christopherson @ 2021-10-19 15:25 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, kvm, Paolo Bonzini, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin

On Tue, Oct 19, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> The KVM doesn't know whether any TLB for a specific pcid is cached in
> the CPU when tdp is enabled.  So it is better to flush all the guest
> TLB when invalidating any single PCID context.
> 
> The case is rare or even impossible since KVM doesn't intercept CR3
> write or INVPCID instructions when tdp is enabled.  The fix is just
> for the sake of robustness in case emulation can reach here or the
> interception policy is changed.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/kvm/x86.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c59b63c56af9..06169ed08db0 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1073,6 +1073,16 @@ static void kvm_invalidate_pcid(struct kvm_vcpu *vcpu, unsigned long pcid)
>  	unsigned long roots_to_free = 0;
>  	int i;
>  
> +	/*
> +	 * It is very unlikely to reach here when tdp_enabled.  But if it is
> +	 * the case, the kvm doesn't know whether any TLB for the @pcid is
> +	 * cached in the CPU.  So just flush the guest instead.
> +	 */
> +	if (unlikely(tdp_enabled)) {

This is reachable on VMX if EPT=1, unrestricted_guest=0, and CR0.PG=0.  In that
case, KVM is running the guest with KVM-defined identity-mapped CR3 / page
tables and intercepts MOV CR3 so that the guest can't overwrite the "real" CR3,
and so that the guest sees its last written CR3 on read.

This is also reachable from the emulator if the guest manipulates a vCPU's code
stream so that KVM sees a MOV CR3 after a legitimate emulation trigger.

However, in both cases the KVM_REQ_TLB_FLUSH_GUEST is unnecessary.  In the first
case, paging is disabled so there are no TLB entries from the guest's perspective.
In the second, the guest is malicious/broken and gets to keep the pieces.

That said, I agree a sanity check is worthwhile, though with a reworded comment
to call out the known scenarios and that the TDP page tables are not affected by
the invalidation.  Maybe this?

	/*
	 * MOV CR3 and INVPCID are usually not intercepted when using TDP, but
	 * this is reachable when running with EPT=1 and unrestricted_guest=0, and
	 * also via the emulator.  KVM's TDP page tables are not in the scope of
	 * the invalidation, but the guest's TLB entries need to be flushed as
	 * the CPU may have cached entries in its TLB for the target PCID.
	 */

> +		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> +		return;
> +	}
> +
>  	/*
>  	 * If neither the current CR3 nor any of the prev_roots use the given
>  	 * PCID, then nothing needs to be done here because a resync will
> -- 
> 2.19.1.6.gb485710b
> 


* Re: [PATCH 1/4] KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid()
  2021-10-19 15:25   ` Sean Christopherson
@ 2021-10-20  9:54     ` Lai Jiangshan
  2021-10-20 18:26       ` Sean Christopherson
  0 siblings, 1 reply; 17+ messages in thread
From: Lai Jiangshan @ 2021-10-20  9:54 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: linux-kernel, kvm, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin



On 2021/10/19 23:25, Sean Christopherson wrote:

> 
> 	/*
> 	 * MOV CR3 and INVPCID are usually not intercepted when using TDP, but
> 	 * this is reachable when running EPT=1 and unrestricted_guest=0,  and
> 	 * also via the emulator.  KVM's TDP page tables are not in the scope of
> 	 * the invalidation, but the guest's TLB entries need to be flushed as
> 	 * the CPU may have cached entries in its TLB for the target PCID.
> 	 */

Thanks! It is a better description.

I just read some of the interception policy in vmx.c: if EPT=1 but
vmx_need_pf_intercept() returns true for some reason/config, #PF is
intercepted.  But CR3 writes are not intercepted, which means there will be an
EPT fault _after_ (IIUC) the CR3 write if the GPA of the new CR3 exceeds the
guest maxphyaddr limit.  And KVM queues a fault to the guest which is also
_after_ the CR3 write, but the guest expects the fault before the write.

IIUC, it can be fixed by intercepting CR3 writes or by reverting the CR3 write
in the EPT violation handler.

Thanks
Lai.


* Re: [PATCH 1/4] KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid()
  2021-10-20  9:54     ` Lai Jiangshan
@ 2021-10-20 18:26       ` Sean Christopherson
  2021-10-21  1:27         ` Lai Jiangshan
  0 siblings, 1 reply; 17+ messages in thread
From: Sean Christopherson @ 2021-10-20 18:26 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, linux-kernel, kvm, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin

On Wed, Oct 20, 2021, Lai Jiangshan wrote:
> On 2021/10/19 23:25, Sean Christopherson wrote:
> I just read some interception policy in vmx.c, if EPT=1 but vmx_need_pf_intercept()
> return true for some reasons/configs, #PF is intercepted.  But CR3 write is not
> intercepted, which means there will be an EPT fault _after_ (IIUC) the CR3 write if
> the GPA of the new CR3 exceeds the guest maxphyaddr limit.  And kvm queues a fault to
> the guest which is also _after_ the CR3 write, but the guest expects the fault before
> the write.
> 
> IIUC, it can be fixed by intercepting CR3 write or reversing the CR3 write in EPT
> violation handler.

KVM implicitly does the latter by emulating the faulting instruction.

  static int handle_ept_violation(struct kvm_vcpu *vcpu)
  {
	...

	/*
	 * Check that the GPA doesn't exceed physical memory limits, as that is
	 * a guest page fault.  We have to emulate the instruction here, because
	 * if the illegal address is that of a paging structure, then
	 * EPT_VIOLATION_ACC_WRITE bit is set.  Alternatively, if supported we
	 * would also use advanced VM-exit information for EPT violations to
	 * reconstruct the page fault error code.
	 */
	if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
		return kvm_emulate_instruction(vcpu, 0);

	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
  }

and injecting a #GP when kvm_set_cr3() fails.

  static int em_cr_write(struct x86_emulate_ctxt *ctxt)
  {
	if (ctxt->ops->set_cr(ctxt, ctxt->modrm_reg, ctxt->src.val))
		return emulate_gp(ctxt, 0);

	/* Disable writeback. */
	ctxt->dst.type = OP_NONE;
	return X86EMUL_CONTINUE;
  }


* Re: [PATCH 1/4] KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid()
  2021-10-20 18:26       ` Sean Christopherson
@ 2021-10-21  1:27         ` Lai Jiangshan
  2021-10-21 14:52           ` Sean Christopherson
  0 siblings, 1 reply; 17+ messages in thread
From: Lai Jiangshan @ 2021-10-21  1:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Lai Jiangshan, linux-kernel, kvm, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin



On 2021/10/21 02:26, Sean Christopherson wrote:
> On Wed, Oct 20, 2021, Lai Jiangshan wrote:
>> On 2021/10/19 23:25, Sean Christopherson wrote:
>> I just read some interception policy in vmx.c, if EPT=1 but vmx_need_pf_intercept()
>> return true for some reasons/configs, #PF is intercepted.  But CR3 write is not
>> intercepted, which means there will be an EPT fault _after_ (IIUC) the CR3 write if
>> the GPA of the new CR3 exceeds the guest maxphyaddr limit.  And kvm queues a fault to
>> the guest which is also _after_ the CR3 write, but the guest expects the fault before
>> the write.
>>
>> IIUC, it can be fixed by intercepting CR3 write or reversing the CR3 write in EPT
>> violation handler.
> 
> KVM implicitly does the latter by emulating the faulting instruction.
> 
>    static int handle_ept_violation(struct kvm_vcpu *vcpu)
>    {
> 	...
> 
> 	/*
> 	 * Check that the GPA doesn't exceed physical memory limits, as that is
> 	 * a guest page fault.  We have to emulate the instruction here, because
> 	 * if the illegal address is that of a paging structure, then
> 	 * EPT_VIOLATION_ACC_WRITE bit is set.  Alternatively, if supported we
> 	 * would also use advanced VM-exit information for EPT violations to
> 	 * reconstruct the page fault error code.
> 	 */
> 	if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
> 		return kvm_emulate_instruction(vcpu, 0);
> 
> 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>    }
> 
> and injecting a #GP when kvm_set_cr3() fails.

I think the EPT violation happens *after* the CR3 write, so the instruction to
be emulated is not the CR3 write.  The emulation will queue a fault into the
guest though, and a recursive EPT violation happens since the CR3 exceeds the
maxphyaddr limit.

In this case, the guest is malicious/broken and gets to keep the pieces too.

> 
>    static int em_cr_write(struct x86_emulate_ctxt *ctxt)
>    {
> 	if (ctxt->ops->set_cr(ctxt, ctxt->modrm_reg, ctxt->src.val))
> 		return emulate_gp(ctxt, 0);
> 
> 	/* Disable writeback. */
> 	ctxt->dst.type = OP_NONE;
> 	return X86EMUL_CONTINUE;
>    }
> 


* Re: [PATCH 3/4] KVM: X86: Use smp_rmb() to pair with smp_wmb() in mmu_try_to_unsync_pages()
  2021-10-19 11:01 ` [PATCH 3/4] KVM: X86: Use smp_rmb() to pair with smp_wmb() in mmu_try_to_unsync_pages() Lai Jiangshan
@ 2021-10-21  2:32   ` Lai Jiangshan
  2021-10-21 17:44   ` Paolo Bonzini
  1 sibling, 0 replies; 17+ messages in thread
From: Lai Jiangshan @ 2021-10-21  2:32 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, kvm, Paolo Bonzini, Junaid Shahid
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin

Hello Junaid Shahid

Any comments on the patch?  I may have misunderstood many things, as I often
have.

thanks
Lai

On 2021/10/19 19:01, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> The commit 578e1c4db2213 ("kvm: x86: Avoid taking MMU lock in
> kvm_mmu_sync_roots if no sync is needed") added smp_wmb() in
> mmu_try_to_unsync_pages(), but the corresponding smp_load_acquire()
> isn't used on the load of SPTE.W which is impossible since the load of
> SPTE.W is performed in the CPU's pagetable walking.
> 
> This patch changes to use smp_rmb() instead.  This patch fixes nothing
> but just comments since smp_rmb() is NOP and compiler barrier() is not
> required since the load of SPTE.W is before VMEXIT.
> 
> Cc: Junaid Shahid <junaids@google.com>
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>


* Re: [PATCH 1/4] KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid()
  2021-10-21  1:27         ` Lai Jiangshan
@ 2021-10-21 14:52           ` Sean Christopherson
  2021-10-21 17:13             ` Paolo Bonzini
  2021-10-22  0:22             ` Lai Jiangshan
  0 siblings, 2 replies; 17+ messages in thread
From: Sean Christopherson @ 2021-10-21 14:52 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, linux-kernel, kvm, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin

On Thu, Oct 21, 2021, Lai Jiangshan wrote:
> 
> 
> On 2021/10/21 02:26, Sean Christopherson wrote:
> > On Wed, Oct 20, 2021, Lai Jiangshan wrote:
> > > On 2021/10/19 23:25, Sean Christopherson wrote:
> > > I just read some interception policy in vmx.c, if EPT=1 but vmx_need_pf_intercept()
> > > return true for some reasons/configs, #PF is intercepted.  But CR3 write is not
> > > intercepted, which means there will be an EPT fault _after_ (IIUC) the CR3 write if
> > > the GPA of the new CR3 exceeds the guest maxphyaddr limit.  And kvm queues a fault to
> > > the guest which is also _after_ the CR3 write, but the guest expects the fault before
> > > the write.
> > > 
> > > IIUC, it can be fixed by intercepting CR3 write or reversing the CR3 write in EPT
> > > violation handler.
> > 
> > KVM implicitly does the latter by emulating the faulting instruction.
> > 
> >    static int handle_ept_violation(struct kvm_vcpu *vcpu)
> >    {
> > 	...
> > 
> > 	/*
> > 	 * Check that the GPA doesn't exceed physical memory limits, as that is
> > 	 * a guest page fault.  We have to emulate the instruction here, because
> > 	 * if the illegal address is that of a paging structure, then
> > 	 * EPT_VIOLATION_ACC_WRITE bit is set.  Alternatively, if supported we
> > 	 * would also use advanced VM-exit information for EPT violations to
> > 	 * reconstruct the page fault error code.
> > 	 */
> > 	if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
> > 		return kvm_emulate_instruction(vcpu, 0);
> > 
> > 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> >    }
> > 
> > and injecting a #GP when kvm_set_cr3() fails.
> 
> I think the EPT violation happens *after* the cr3 write.  So the instruction to be
> emulated is not "cr3 write".  The emulation will queue fault into guest though,
> recursive EPT violation happens since the cr3 exceeds maxphyaddr limit.

Doh, you're correct.  I think my mind wandered into thinking about what would
happen with PDPTRs and forgot to get back to normal MOV CR3.

So yeah, the only way to correctly handle this would be to intercept CR3 loads.
I'm guessing that would have a noticeable impact on guest performance.
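
For context, a hypothetical sketch of what forcing that interception could
look like on VMX.  CPU_BASED_CR3_LOAD_EXITING is the real execution control
(KVM already sets it for the EPT=1, unrestricted_guest=0, CR0.PG=0 case
discussed earlier); the helper name is made up:

	static void vmx_force_cr3_load_exiting(struct vcpu_vmx *vmx)
	{
		exec_controls_setbit(vmx, CPU_BASED_CR3_LOAD_EXITING);
	}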

Paolo, I'll leave this one for you to decide, we have pretty much written off
allow_smaller_maxphyaddr :-)


* Re: [PATCH 1/4] KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid()
  2021-10-21 14:52           ` Sean Christopherson
@ 2021-10-21 17:13             ` Paolo Bonzini
  2021-10-21 17:32               ` Jim Mattson
  2021-10-22  0:22             ` Lai Jiangshan
  1 sibling, 1 reply; 17+ messages in thread
From: Paolo Bonzini @ 2021-10-21 17:13 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: Lai Jiangshan, linux-kernel, kvm, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

On 21/10/21 16:52, Sean Christopherson wrote:
>> I think the EPT violation happens*after*  the cr3 write.  So the instruction to be
>> emulated is not "cr3 write".  The emulation will queue fault into guest though,
>> recursive EPT violation happens since the cr3 exceeds maxphyaddr limit.
> Doh, you're correct.  I think my mind wandered into thinking about what would
> happen with PDPTRs and forgot to get back to normal MOV CR3.
> 
> So yeah, the only way to correctly handle this would be to intercept CR3 loads.
> I'm guessing that would have a noticeable impact on guest performance.

Ouch... yeah, allow_smaller_maxphyaddr already has bad performance, but 
intercepting CR3 loads would be another kind of slow.

Paolo

> Paolo, I'll leave this one for you to decide, we have pretty much written off
> allow_smaller_maxphyaddr:-)



* Re: [PATCH 1/4] KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid()
  2021-10-21 17:13             ` Paolo Bonzini
@ 2021-10-21 17:32               ` Jim Mattson
  0 siblings, 0 replies; 17+ messages in thread
From: Jim Mattson @ 2021-10-21 17:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Lai Jiangshan, Lai Jiangshan, linux-kernel,
	kvm, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin

On Thu, Oct 21, 2021 at 10:13 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 21/10/21 16:52, Sean Christopherson wrote:
> >> I think the EPT violation happens*after*  the cr3 write.  So the instruction to be
> >> emulated is not "cr3 write".  The emulation will queue fault into guest though,
> >> recursive EPT violation happens since the cr3 exceeds maxphyaddr limit.
> > Doh, you're correct.  I think my mind wandered into thinking about what would
> > happen with PDPTRs and forgot to get back to normal MOV CR3.
> >
> > So yeah, the only way to correctly handle this would be to intercept CR3 loads.
> > I'm guessing that would have a noticeable impact on guest performance.
>
> Ouch... yeah, allow_smaller_maxphyaddr already has bad performance, but
> intercepting CR3 loads would be another kind of slow.

Can we kill it? It's only half-baked as it is. Or are we committed to it now?


* Re: [PATCH 2/4] KVM: X86: Cache CR3 in prev_roots when PCID is disabled
  2021-10-19 11:01 ` [PATCH 2/4] KVM: X86: Cache CR3 in prev_roots when PCID is disabled Lai Jiangshan
@ 2021-10-21 17:43   ` Paolo Bonzini
  2021-10-22  2:11     ` Lai Jiangshan
  0 siblings, 1 reply; 17+ messages in thread
From: Paolo Bonzini @ 2021-10-21 17:43 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, kvm
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

On 19/10/21 13:01, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> The commit 21823fbda5522 ("KVM: x86: Invalidate all PGDs for the
> current PCID on MOV CR3 w/ flush") invalidates all PGDs for the specific
> PCID and in the case of PCID is disabled, it includes all PGDs in the
> prev_roots and the commit made prev_roots totally unused in this case.
> 
> Not using prev_roots fixes a problem when CR4.PCIDE is changed 0 -> 1
> before the said commit:
> 	(CR4.PCIDE=0, CR3=cr3_a, the page for the guest
> 	 kernel is global, cr3_b is cached in prev_roots)
> 
> 	modify the user part of cr3_b
> 		the shadow root of cr3_b is unsync in kvm
> 	INVPCID single context
> 		the guest expects the TLB is clean for PCID=0
> 	change CR4.PCIDE 0 -> 1
> 	switch to cr3_b with PCID=0,NOFLUSH=1
> 		No sync in kvm, cr3_b is still unsync in kvm
> 	return to the user part (of cr3_b)
> 		the user accesses to wrong page
> 
> It is a very unlikely case, but it shows that virtualizing guest TLB in
> prev_roots is not safe in this case and the said commit did fix the
> problem.
> 
> But the said commit also disabled caching CR3 in prev_roots when PCID
> is disabled and NOT all CPUs have PCID, especially the PCID support
> for AMD CPUs is kind of recent.  To restore the original optimization,
> we have to enable caching CR3 without re-introducing problems.
> 
> Actually, in short, the said commit just ensures prev_roots not part of
> the virtualized TLB.  So this change caches CR3 in prev_roots, and
> ensures prev_roots not part of the virtualized TLB by always flushing
> the virtualized TLB when CR3 is switched from prev_roots to current
> (it is already the current behavior) and by freeing prev_roots when
> CR4.PCIDE is changed 0 -> 1.
> 
> Anyway:
> PCID enabled: vTLB includes root_hpa, prev_roots and hardware TLB.
> PCID disabled: vTLB includes root_hpa and hardware TLB, no prev_roots.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>   arch/x86/kvm/x86.c | 32 ++++++++++++++++++++++++++++++--
>   1 file changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 06169ed08db0..13df3ca88e09 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1022,10 +1022,29 @@ EXPORT_SYMBOL_GPL(kvm_is_valid_cr4);
>   
>   void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned long cr4)
>   {
> +	/*
> +	 * If any role bit is changed, the MMU needs to be reset.
> +	 *
> +	 * If CR4.PCIDE is changed 0 -> 1, there is no need to flush the guest
> +	 * TLB per SDM, but the virtualized TLB doesn't include prev_roots when
> +	 * CR4.PCIDE is 0, so the prev_roots has to be freed to avoid later
> +	 * resuing without explicit flushing.
> +	 * If CR4.PCIDE is changed 1 -> 0, there is required to flush the guest
> +	 * TLB and KVM_REQ_MMU_RELOAD is fit for the both cases.  Although
> +	 * KVM_REQ_MMU_RELOAD is slow, changing CR4.PCIDE is a rare case.

          * If CR4.PCIDE is changed 1 -> 0, the guest TLB must be flushed.
          * If CR4.PCIDE is changed 0 -> 1, there is no need to flush the TLB
          * according to the SDM; however, stale prev_roots could be reused
          * reused incorrectly by MOV to CR3 with NOFLUSH=1, so we free them
          * all.  KVM_REQ_MMU_RELOAD is fit for the both cases; it
          * is slow, but changing CR4.PCIDE is a rare case.

> +	 * If CR4.PGE is changed, there is required to just flush the guest TLB.
> +	 *
> +	 * Note: reseting MMU covers KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_RELOAD
> +	 * covers KVM_REQ_TLB_FLUSH_GUEST, so "else if" is used here and the
> +	 * check for later cases are skipped if the check for the preceding
> +	 * case is matched.

          * Note: resetting MMU is a superset of KVM_REQ_MMU_RELOAD and
          * KVM_REQ_MMU_RELOAD is a superset of KVM_REQ_TLB_FLUSH_GUEST, hence
          * the usage of "else if".

> +	 */
>   	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
>   		kvm_mmu_reset_context(vcpu);
> -	else if (((cr4 ^ old_cr4) & X86_CR4_PGE) ||
> -		 (!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE)))
> +	else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
> +		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> +	else if ((cr4 ^ old_cr4) & X86_CR4_PGE)
>   		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
>   }
>   EXPORT_SYMBOL_GPL(kvm_post_set_cr4);
> @@ -1093,6 +1112,15 @@ static void kvm_invalidate_pcid(struct kvm_vcpu *vcpu, unsigned long pcid)
>   		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
>   	}
>   
> +	/*
> +	 * If PCID is disabled, there is no need to free prev_roots even the
> +	 * PCIDs for them are also 0.  The prev_roots are just not included
> +	 * in the "clean" virtualized TLB and a resync will happen anyway
> +	 * before switching to any other CR3.
> +	 */

         /*
          * If PCID is disabled, there is no need to free prev_roots even if the
          * PCIDs for them are also 0, because all moves to CR3 flush the TLB
          * with PCID=0.
          */

> +	if (!kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE))
> +		return;
> +
>   	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
>   		if (kvm_get_pcid(vcpu, mmu->prev_roots[i].pgd) == pcid)
>   			roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
> 


Can you confirm the above comments are accurate?

Paolo



* Re: [PATCH 3/4] KVM: X86: Use smp_rmb() to pair with smp_wmb() in mmu_try_to_unsync_pages()
  2021-10-19 11:01 ` [PATCH 3/4] KVM: X86: Use smp_rmb() to pair with smp_wmb() in mmu_try_to_unsync_pages() Lai Jiangshan
  2021-10-21  2:32   ` Lai Jiangshan
@ 2021-10-21 17:44   ` Paolo Bonzini
  1 sibling, 0 replies; 17+ messages in thread
From: Paolo Bonzini @ 2021-10-21 17:44 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, kvm
  Cc: Lai Jiangshan, Junaid Shahid, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin

On 19/10/21 13:01, Lai Jiangshan wrote:
> From: Lai Jiangshan<laijs@linux.alibaba.com>
> 
> The commit 578e1c4db2213 ("kvm: x86: Avoid taking MMU lock in
> kvm_mmu_sync_roots if no sync is needed") added smp_wmb() in
> mmu_try_to_unsync_pages(), but the corresponding smp_load_acquire()
> isn't used on the load of SPTE.W which is impossible since the load of
> SPTE.W is performed in the CPU's pagetable walking.
> 
> This patch changes to use smp_rmb() instead.  This patch fixes nothing
> but just comments since smp_rmb() is NOP and compiler barrier() is not
> required since the load of SPTE.W is before VMEXIT.

I think that even implicit loads during pagetable walking obey read-read 
ordering on x86, but this is clearer and it is necessary for patch 4.

Paolo



* Re: [PATCH 1/4] KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid()
  2021-10-21 14:52           ` Sean Christopherson
  2021-10-21 17:13             ` Paolo Bonzini
@ 2021-10-22  0:22             ` Lai Jiangshan
  1 sibling, 0 replies; 17+ messages in thread
From: Lai Jiangshan @ 2021-10-22  0:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Lai Jiangshan, linux-kernel, kvm, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin



On 2021/10/21 22:52, Sean Christopherson wrote:
> On Thu, Oct 21, 2021, Lai Jiangshan wrote:
>>
>>
>> On 2021/10/21 02:26, Sean Christopherson wrote:
>>> On Wed, Oct 20, 2021, Lai Jiangshan wrote:
>>>> On 2021/10/19 23:25, Sean Christopherson wrote:
>>>> I just read some interception policy in vmx.c, if EPT=1 but vmx_need_pf_intercept()
>>>> return true for some reasons/configs, #PF is intercepted.  But CR3 write is not
>>>> intercepted, which means there will be an EPT fault _after_ (IIUC) the CR3 write if
>>>> the GPA of the new CR3 exceeds the guest maxphyaddr limit.  And kvm queues a fault to
>>>> the guest which is also _after_ the CR3 write, but the guest expects the fault before
>>>> the write.
>>>>
>>>> IIUC, it can be fixed by intercepting CR3 write or reversing the CR3 write in EPT
>>>> violation handler.
>>>
>>> KVM implicitly does the latter by emulating the faulting instruction.
>>>
>>>     static int handle_ept_violation(struct kvm_vcpu *vcpu)
>>>     {
>>> 	...
>>>
>>> 	/*
>>> 	 * Check that the GPA doesn't exceed physical memory limits, as that is
>>> 	 * a guest page fault.  We have to emulate the instruction here, because
>>> 	 * if the illegal address is that of a paging structure, then
>>> 	 * EPT_VIOLATION_ACC_WRITE bit is set.  Alternatively, if supported we
>>> 	 * would also use advanced VM-exit information for EPT violations to
>>> 	 * reconstruct the page fault error code.
>>> 	 */
>>> 	if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
>>> 		return kvm_emulate_instruction(vcpu, 0);
>>>
>>> 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>>>     }
>>>
>>> and injecting a #GP when kvm_set_cr3() fails.
>>
>> I think the EPT violation happens *after* the cr3 write.  So the instruction to be
>> emulated is not "cr3 write".  The emulation will queue fault into guest though,
>> recursive EPT violation happens since the cr3 exceeds maxphyaddr limit.
> 
> Doh, you're correct.  I think my mind wandered into thinking about what would
> happen with PDPTRs and forgot to get back to normal MOV CR3.
> 
> So yeah, the only way to correctly handle this would be to intercept CR3 loads.
> I'm guessing that would have a noticeable impact on guest performance.

I think we can detect it in handle_ept_violation() by checking the CR3 value,
and make it a triple fault if that is the case, so that the VMM can exit.  I
don't think any OS would rely on the reserved bits in CR3 and the
corresponding #GP.
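
A hypothetical sketch of that check, mirroring the existing illegal-GPA check
in handle_ept_violation(); the pending triple fault surfaces to the VMM as
KVM_EXIT_SHUTDOWN:

	if (unlikely(allow_smaller_maxphyaddr &&
		     kvm_vcpu_is_illegal_gpa(vcpu, kvm_read_cr3(vcpu)))) {
		kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
		return 1;
	}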

> 
> Paolo, I'll leave this one for you to decide, we have pretty much written off
> allow_smaller_maxphyaddr :-)
> 


* Re: [PATCH 2/4] KVM: X86: Cache CR3 in prev_roots when PCID is disabled
  2021-10-21 17:43   ` Paolo Bonzini
@ 2021-10-22  2:11     ` Lai Jiangshan
  0 siblings, 0 replies; 17+ messages in thread
From: Lai Jiangshan @ 2021-10-22  2:11 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LKML, kvm, Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, X86 ML, H. Peter Anvin

On Fri, Oct 22, 2021 at 1:43 AM Paolo Bonzini <pbonzini@redhat.com> wrote:

>
>           * If CR4.PCIDE is changed 1 -> 0, the guest TLB must be flushed.
>           * If CR4.PCIDE is changed 0 -> 1, there is no need to flush the TLB
>           * according to the SDM; however, stale prev_roots could be reused
>           * reused incorrectly by MOV to CR3 with NOFLUSH=1, so we free them
>           * all.  KVM_REQ_MMU_RELOAD is fit for the both cases; it
>           * is slow, but changing CR4.PCIDE is a rare case.
>

There is a doubled "reused" split across the line break.

>
>
> Can you confirm the above comments are accurate?
>

Yes, they are better and consistent with what I meant; there is just the one
redundant "reused" in the comments.

thanks
Lai

