linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/7] KVM: X86: MMU: misc fixes and cleanups
@ 2021-08-24  7:55 Lai Jiangshan
  2021-08-24  7:55 ` [PATCH 1/7] KVM: X86: Fix missed remote tlb flush in rmap_write_protect() Lai Jiangshan
                   ` (7 more replies)
  0 siblings, 8 replies; 30+ messages in thread
From: Lai Jiangshan @ 2021-08-24  7:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: Paolo Bonzini, Lai Jiangshan

From: Lai Jiangshan <laijs@linux.alibaba.com>

The first two patches fix two long-standing potential defects.
The others are just cleanups.

Lai Jiangshan (7):
  KVM: X86: Fix missed remote tlb flush in rmap_write_protect()
  KVM: X86: Synchronize the shadow pagetable before link it
  KVM: X86: Zap the invalid list after remote tlb flushing
  KVM: X86: Remove FNAME(update_pte)
  KVM: X86: Don't unsync pagetables when speculative
  KVM: X86: Don't check unsync if the original spte is writible
  KVM: X86: Also prefetch the last range in __direct_pte_prefetch().

 arch/x86/kvm/mmu/mmu.c          | 50 +++++++++++++++++++++++++--------
 arch/x86/kvm/mmu/mmu_internal.h |  3 +-
 arch/x86/kvm/mmu/paging_tmpl.h  | 38 +++++++++++++++++--------
 arch/x86/kvm/mmu/spte.c         |  6 ++--
 4 files changed, 70 insertions(+), 27 deletions(-)

-- 
2.19.1.6.gb485710b



* [PATCH 1/7] KVM: X86: Fix missed remote tlb flush in rmap_write_protect()
  2021-08-24  7:55 [PATCH 0/7] KVM: X86: MMU: misc fixes and cleanups Lai Jiangshan
@ 2021-08-24  7:55 ` Lai Jiangshan
  2021-09-02 21:38   ` Sean Christopherson
  2021-09-13  9:57   ` Maxim Levitsky
  2021-08-24  7:55 ` [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it Lai Jiangshan
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 30+ messages in thread
From: Lai Jiangshan @ 2021-08-24  7:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paolo Bonzini, Lai Jiangshan, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Marcelo Tosatti, Avi Kivity, kvm

From: Lai Jiangshan <laijs@linux.alibaba.com>

When kvm->tlbs_dirty > 0, some rmaps might have been deleted
without a remote TLB flush after kvm_sync_page().  If @gfn was
writable before and its rmaps were deleted in kvm_sync_page(),
the TLB must be flushed even if __rmap_write_protect() doesn't
request it.
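
A rough sketch of the race (pseudo-code, abridged; the deferred flush
comes from FNAME(sync_page), the write protection from
kvm_mmu_slot_gfn_write_protect()):

	/* vCPU A, syncing an unsync shadow page: */
	drop_spte(kvm, sptep);		/* rmap for gfn X is removed...   */
	kvm->tlbs_dirty++;		/* ...but the flush is deferred   */

	/* vCPU B, write-protecting gfn X: */
	write_protected = __rmap_write_protect(kvm, rmap_head, true);
	/*
	 * No writable spte is found, so write_protected stays false, yet
	 * a stale writable TLB entry for gfn X may still exist on remote
	 * pCPUs; the caller must treat kvm->tlbs_dirty != 0 as "flush".
	 */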

Fixes: 4731d4c7a077 ("KVM: MMU: out of sync shadow core")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/mmu/mmu.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4853c033e6ce..313918df1a10 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1420,6 +1420,14 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 			rmap_head = gfn_to_rmap(gfn, i, slot);
 			write_protected |= __rmap_write_protect(kvm, rmap_head, true);
 		}
+		/*
+		 * When kvm->tlbs_dirty > 0, some rmaps might have been deleted
+		 * without flushing tlb remotely after kvm_sync_page().  If @gfn
+		 * was writable before and it's rmaps was deleted in kvm_sync_page(),
+		 * we need to flush tlb too.
+		 */
+		if (min_level == PG_LEVEL_4K && kvm->tlbs_dirty)
+			write_protected = true;
 	}
 
 	if (is_tdp_mmu_enabled(kvm))
@@ -5733,6 +5741,14 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect,
 					  start_level, KVM_MAX_HUGEPAGE_LEVEL,
 					  false);
+		/*
+		 * When kvm->tlbs_dirty > 0, some rmaps might have been deleted
+		 * without flushing tlb remotely after kvm_sync_page().  If @gfn
+		 * was writable before and it's rmaps was deleted in kvm_sync_page(),
+		 * we need to flush tlb too.
+		 */
+		if (start_level == PG_LEVEL_4K && kvm->tlbs_dirty)
+			flush = true;
 		write_unlock(&kvm->mmu_lock);
 	}
 
-- 
2.19.1.6.gb485710b



* [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it
  2021-08-24  7:55 [PATCH 0/7] KVM: X86: MMU: misc fixes and cleanups Lai Jiangshan
  2021-08-24  7:55 ` [PATCH 1/7] KVM: X86: Fix missed remote tlb flush in rmap_write_protect() Lai Jiangshan
@ 2021-08-24  7:55 ` Lai Jiangshan
  2021-09-02 23:40   ` Sean Christopherson
  2021-08-24  7:55 ` [PATCH 3/7] KVM: X86: Zap the invalid list after remote tlb flushing Lai Jiangshan
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Lai Jiangshan @ 2021-08-24  7:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paolo Bonzini, Lai Jiangshan, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Marcelo Tosatti, Avi Kivity, kvm

From: Lai Jiangshan <laijs@linux.alibaba.com>

If a gpte is changed from non-present to present, the guest doesn't
need to flush the TLB per the SDM.  So the host must synchronize the
sp before linking it.  Otherwise the guest might use a stale mapping.

For example: the guest first changes a level-1 pagetable, and then
links its parent to a new location where the original gpte is
non-present.  Finally the guest can access the remapped area without
flushing the TLB.  This guest behavior is allowed per the SDM, but the
host KVM MMU gets it wrong.
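
A rough sketch of that sequence (hypothetical guest pseudo-steps; the
names PT_D and PD_E are only for illustration):

	/* PT_D is a level-1 guest pagetable, already shadowed and unsync */
	PT_D[i] = new_gpte;			/* rewrite an entry           */
	PD_E[j] = pa_of(PT_D) | PRESENT;	/* this gpte was non-present, */
						/* so no guest TLB flush      */
	/*
	 * Any vCPU walking VA -> ... -> PD_E -> PT_D must now observe
	 * new_gpte, but KVM links the stale, unsynced shadow of PT_D.
	 */
	read(VA);				/* sees the old mapping       */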

Fixes: 4731d4c7a077 ("KVM: MMU: out of sync shadow core")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/mmu/mmu.c         | 21 ++++++++++++++-------
 arch/x86/kvm/mmu/paging_tmpl.h | 28 +++++++++++++++++++++++++---
 2 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 313918df1a10..987953a901d2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2032,8 +2032,9 @@ static void mmu_pages_clear_parents(struct mmu_page_path *parents)
 	} while (!sp->unsync_children);
 }
 
-static void mmu_sync_children(struct kvm_vcpu *vcpu,
-			      struct kvm_mmu_page *parent)
+static bool mmu_sync_children(struct kvm_vcpu *vcpu,
+			      struct kvm_mmu_page *parent,
+			      bool root)
 {
 	int i;
 	struct kvm_mmu_page *sp;
@@ -2061,11 +2062,20 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 		if (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock)) {
 			kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
 			cond_resched_rwlock_write(&vcpu->kvm->mmu_lock);
+			/*
+			 * If @parent is not root, the caller doesn't have
+			 * any reference to it.  And we couldn't access to
+			 * @parent and continue synchronizing after the
+			 * mmu_lock was once released.
+			 */
+			if (!root)
+				return false;
 			flush = false;
 		}
 	}
 
 	kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
+	return true;
 }
 
 static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
@@ -2151,9 +2161,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
 		}
 
-		if (sp->unsync_children)
-			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
-
 		__clear_sp_write_flooding_count(sp);
 
 trace_get_page:
@@ -3650,7 +3657,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 		write_lock(&vcpu->kvm->mmu_lock);
 		kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
 
-		mmu_sync_children(vcpu, sp);
+		mmu_sync_children(vcpu, sp, true);
 
 		kvm_mmu_audit(vcpu, AUDIT_POST_SYNC);
 		write_unlock(&vcpu->kvm->mmu_lock);
@@ -3666,7 +3673,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 		if (IS_VALID_PAE_ROOT(root)) {
 			root &= PT64_BASE_ADDR_MASK;
 			sp = to_shadow_page(root);
-			mmu_sync_children(vcpu, sp);
+			mmu_sync_children(vcpu, sp, true);
 		}
 	}
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 50ade6450ace..48c7fe1b2d50 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -664,7 +664,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
  * emulate this operation, return 1 to indicate this case.
  */
 static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
-			 struct guest_walker *gw)
+			 struct guest_walker *gw, unsigned long mmu_seq)
 {
 	struct kvm_mmu_page *sp = NULL;
 	struct kvm_shadow_walk_iterator it;
@@ -678,6 +678,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	top_level = vcpu->arch.mmu->root_level;
 	if (top_level == PT32E_ROOT_LEVEL)
 		top_level = PT32_ROOT_LEVEL;
+
+again:
 	/*
 	 * Verify that the top-level gpte is still there.  Since the page
 	 * is a root page, it is either write protected (and cannot be
@@ -713,8 +715,28 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		if (FNAME(gpte_changed)(vcpu, gw, it.level - 1))
 			goto out_gpte_changed;
 
-		if (sp)
+		if (sp) {
+			/*
+			 * We must synchronize the pagetable before link it
+			 * because the guest doens't need to flush tlb when
+			 * gpte is changed from non-present to present.
+			 * Otherwise, the guest may use the wrong mapping.
+			 *
+			 * For PG_LEVEL_4K, kvm_mmu_get_page() has already
+			 * synchronized it transiently via kvm_sync_page().
+			 *
+			 * For higher level pagetable, we synchronize it
+			 * via slower mmu_sync_children().  If it once
+			 * released the mmu_lock, we need to restart from
+			 * the root since we don't have reference to @sp.
+			 */
+			if (sp->unsync_children && !mmu_sync_children(vcpu, sp, false)) {
+				if (mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva))
+					goto out_gpte_changed;
+				goto again;
+			}
 			link_shadow_page(vcpu, it.sptep, sp);
+		}
 	}
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -905,7 +927,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	r = make_mmu_pages_available(vcpu);
 	if (r)
 		goto out_unlock;
-	r = FNAME(fetch)(vcpu, fault, &walker);
+	r = FNAME(fetch)(vcpu, fault, &walker, mmu_seq);
 	kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
 
 out_unlock:
-- 
2.19.1.6.gb485710b



* [PATCH 3/7] KVM: X86: Zap the invalid list after remote tlb flushing
  2021-08-24  7:55 [PATCH 0/7] KVM: X86: MMU: misc fixes and cleanups Lai Jiangshan
  2021-08-24  7:55 ` [PATCH 1/7] KVM: X86: Fix missed remote tlb flush in rmap_write_protect() Lai Jiangshan
  2021-08-24  7:55 ` [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it Lai Jiangshan
@ 2021-08-24  7:55 ` Lai Jiangshan
  2021-09-02 21:54   ` Sean Christopherson
  2021-08-24  7:55 ` [PATCH 4/7] KVM: X86: Remove FNAME(update_pte) Lai Jiangshan
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Lai Jiangshan @ 2021-08-24  7:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paolo Bonzini, Lai Jiangshan, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, kvm

From: Lai Jiangshan <laijs@linux.alibaba.com>

In mmu_sync_children(), the invalid list can be zapped at the point of
the remote TLB flush.  Emptying the invalid list as soon as possible
might save a remote TLB flush in some cases.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/mmu/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 987953a901d2..a165eb8713bc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2050,7 +2050,7 @@ static bool mmu_sync_children(struct kvm_vcpu *vcpu,
 			protected |= rmap_write_protect(vcpu, sp->gfn);
 
 		if (protected) {
-			kvm_flush_remote_tlbs(vcpu->kvm);
+			kvm_mmu_flush_or_zap(vcpu, &invalid_list, true, flush);
 			flush = false;
 		}
 
-- 
2.19.1.6.gb485710b



* [PATCH 4/7] KVM: X86: Remove FNAME(update_pte)
  2021-08-24  7:55 [PATCH 0/7] KVM: X86: MMU: misc fixes and cleanups Lai Jiangshan
                   ` (2 preceding siblings ...)
  2021-08-24  7:55 ` [PATCH 3/7] KVM: X86: Zap the invalid list after remote tlb flushing Lai Jiangshan
@ 2021-08-24  7:55 ` Lai Jiangshan
  2021-09-13  9:49   ` Maxim Levitsky
  2021-08-24  7:55 ` [PATCH 5/7] KVM: X86: Don't unsync pagetables when speculative Lai Jiangshan
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Lai Jiangshan @ 2021-08-24  7:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paolo Bonzini, Lai Jiangshan, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, kvm

From: Lai Jiangshan <laijs@linux.alibaba.com>

Its sole caller is changed to use FNAME(prefetch_gpte) directly.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/mmu/paging_tmpl.h | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 48c7fe1b2d50..6b2e248f2f4c 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -589,14 +589,6 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	return true;
 }
 
-static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
-			      u64 *spte, const void *pte)
-{
-	pt_element_t gpte = *(const pt_element_t *)pte;
-
-	FNAME(prefetch_gpte)(vcpu, sp, spte, gpte, false);
-}
-
 static bool FNAME(gpte_changed)(struct kvm_vcpu *vcpu,
 				struct guest_walker *gw, int level)
 {
@@ -998,7 +990,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa)
 						       sizeof(pt_element_t)))
 				break;
 
-			FNAME(update_pte)(vcpu, sp, sptep, &gpte);
+			FNAME(prefetch_gpte)(vcpu, sp, sptep, gpte, false);
 		}
 
 		if (!is_shadow_present_pte(*sptep) || !sp->unsync_children)
-- 
2.19.1.6.gb485710b



* [PATCH 5/7] KVM: X86: Don't unsync pagetables when speculative
  2021-08-24  7:55 [PATCH 0/7] KVM: X86: MMU: misc fixes and cleanups Lai Jiangshan
                   ` (3 preceding siblings ...)
  2021-08-24  7:55 ` [PATCH 4/7] KVM: X86: Remove FNAME(update_pte) Lai Jiangshan
@ 2021-08-24  7:55 ` Lai Jiangshan
  2021-09-13 11:02   ` Maxim Levitsky
  2021-08-24  7:55 ` [PATCH 6/7] KVM: X86: Don't check unsync if the original spte is writible Lai Jiangshan
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Lai Jiangshan @ 2021-08-24  7:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paolo Bonzini, Lai Jiangshan, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, kvm

From: Lai Jiangshan <laijs@linux.alibaba.com>

Only unsync a pagetable when there has actually been a write fault on a
level-1 pagetable, i.e. not for speculative SPTE updates.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/mmu/mmu.c          | 6 +++++-
 arch/x86/kvm/mmu/mmu_internal.h | 3 ++-
 arch/x86/kvm/mmu/spte.c         | 2 +-
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a165eb8713bc..e5932af6f11c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2600,7 +2600,8 @@ static void kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
  * were marked unsync (or if there is no shadow page), -EPERM if the SPTE must
  * be write-protected.
  */
-int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync)
+int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync,
+			    bool speculative)
 {
 	struct kvm_mmu_page *sp;
 	bool locked = false;
@@ -2626,6 +2627,9 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync)
 		if (sp->unsync)
 			continue;
 
+		if (speculative)
+			return -EEXIST;
+
 		/*
 		 * TDP MMU page faults require an additional spinlock as they
 		 * run with mmu_lock held for read, not write, and the unsync
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 658d8d228d43..f5d8be787993 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -116,7 +116,8 @@ static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
 	       kvm_x86_ops.cpu_dirty_log_size;
 }
 
-int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync);
+int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync,
+			    bool speculative);
 
 void kvm_mmu_gfn_disallow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
 void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 3e97cdb13eb7..b68a580f3510 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -159,7 +159,7 @@ int make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
 		 * e.g. it's write-tracked (upper-level SPs) or has one or more
 		 * shadow pages and unsync'ing pages is not allowed.
 		 */
-		if (mmu_try_to_unsync_pages(vcpu, gfn, can_unsync)) {
+		if (mmu_try_to_unsync_pages(vcpu, gfn, can_unsync, speculative)) {
 			pgprintk("%s: found shadow page for %llx, marking ro\n",
 				 __func__, gfn);
 			ret |= SET_SPTE_WRITE_PROTECTED_PT;
-- 
2.19.1.6.gb485710b



* [PATCH 6/7] KVM: X86: Don't check unsync if the original spte is writible
  2021-08-24  7:55 [PATCH 0/7] KVM: X86: MMU: misc fixes and cleanups Lai Jiangshan
                   ` (4 preceding siblings ...)
  2021-08-24  7:55 ` [PATCH 5/7] KVM: X86: Don't unsync pagetables when speculative Lai Jiangshan
@ 2021-08-24  7:55 ` Lai Jiangshan
  2021-08-24  7:55 ` [PATCH 7/7] KVM: X86: Also prefetch the last range in __direct_pte_prefetch() Lai Jiangshan
  2021-08-31 18:02 ` [PATCH 0/7] KVM: X86: MMU: misc fixes and cleanups Lai Jiangshan
  7 siblings, 0 replies; 30+ messages in thread
From: Lai Jiangshan @ 2021-08-24  7:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paolo Bonzini, Lai Jiangshan, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, kvm

From: Lai Jiangshan <laijs@linux.alibaba.com>

If the original spte is writable, the target gfn cannot be the gfn of a
synchronized shadow page, so the spte can continue to be writable.

When !can_unsync, speculative must be false.  So when the "!can_unsync"
check is removed, the "out" label needs to be moved up.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/mmu/spte.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index b68a580f3510..a33c581aabd6 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -150,7 +150,7 @@ int make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
 		 * is responsibility of kvm_mmu_get_page / kvm_mmu_sync_roots.
 		 * Same reasoning can be applied to dirty page accounting.
 		 */
-		if (!can_unsync && is_writable_pte(old_spte))
+		if (is_writable_pte(old_spte))
 			goto out;
 
 		/*
@@ -171,10 +171,10 @@ int make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
 	if (pte_access & ACC_WRITE_MASK)
 		spte |= spte_shadow_dirty_mask(spte);
 
+out:
 	if (speculative)
 		spte = mark_spte_for_access_track(spte);
 
-out:
 	WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
 		  "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
 		  get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
-- 
2.19.1.6.gb485710b



* [PATCH 7/7] KVM: X86: Also prefetch the last range in __direct_pte_prefetch().
  2021-08-24  7:55 [PATCH 0/7] KVM: X86: MMU: misc fixes and cleanups Lai Jiangshan
                   ` (5 preceding siblings ...)
  2021-08-24  7:55 ` [PATCH 6/7] KVM: X86: Don't check unsync if the original spte is writible Lai Jiangshan
@ 2021-08-24  7:55 ` Lai Jiangshan
  2021-08-25 15:18   ` Sean Christopherson
  2021-08-31 18:02 ` [PATCH 0/7] KVM: X86: MMU: misc fixes and cleanups Lai Jiangshan
  7 siblings, 1 reply; 30+ messages in thread
From: Lai Jiangshan @ 2021-08-24  7:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paolo Bonzini, Lai Jiangshan, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, kvm

From: Lai Jiangshan <laijs@linux.alibaba.com>

__direct_pte_prefetch() skips prefetching the last range.

The last range is often the whole range after the faulted spte when the
guest is accessing huge-page-mapped (in the guest's view) memory
sequentially forward, which means prefetching it can reduce page faults.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/mmu/mmu.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e5932af6f11c..ac260e01e9d8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2847,8 +2847,9 @@ static void __direct_pte_prefetch(struct kvm_vcpu *vcpu,
 	i = (sptep - sp->spt) & ~(PTE_PREFETCH_NUM - 1);
 	spte = sp->spt + i;
 
-	for (i = 0; i < PTE_PREFETCH_NUM; i++, spte++) {
-		if (is_shadow_present_pte(*spte) || spte == sptep) {
+	for (i = 0; i <= PTE_PREFETCH_NUM; i++, spte++) {
+		if (i == PTE_PREFETCH_NUM ||
+		    is_shadow_present_pte(*spte) || spte == sptep) {
 			if (!start)
 				continue;
 			if (direct_pte_prefetch_many(vcpu, sp, start, spte) < 0)
-- 
2.19.1.6.gb485710b



* Re: [PATCH 7/7] KVM: X86: Also prefetch the last range in __direct_pte_prefetch().
  2021-08-24  7:55 ` [PATCH 7/7] KVM: X86: Also prefetch the last range in __direct_pte_prefetch() Lai Jiangshan
@ 2021-08-25 15:18   ` Sean Christopherson
  2021-08-25 22:58     ` Lai Jiangshan
  0 siblings, 1 reply; 30+ messages in thread
From: Sean Christopherson @ 2021-08-25 15:18 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Paolo Bonzini, Lai Jiangshan, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin, kvm

On Tue, Aug 24, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> __direct_pte_prefetch() skips prefetching the last range.
> 
> The last range are often the whole range after the faulted spte when
> guest is touching huge-page-mapped(in guest view) memory forwardly
> which means prefetching them can reduce pagefault.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e5932af6f11c..ac260e01e9d8 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2847,8 +2847,9 @@ static void __direct_pte_prefetch(struct kvm_vcpu *vcpu,
>  	i = (sptep - sp->spt) & ~(PTE_PREFETCH_NUM - 1);
>  	spte = sp->spt + i;
>  
> -	for (i = 0; i < PTE_PREFETCH_NUM; i++, spte++) {
> -		if (is_shadow_present_pte(*spte) || spte == sptep) {
> +	for (i = 0; i <= PTE_PREFETCH_NUM; i++, spte++) {
> +		if (i == PTE_PREFETCH_NUM ||
> +		    is_shadow_present_pte(*spte) || spte == sptep) {

Heh, I posted a fix just a few days ago.  I prefer having a separate call after
the loop.  The "<= PTE_PREFETCH_NUM" is subtle, and a check at the end avoids
a CMP+Jcc in the loop, though I highly doubt that actually affects performance.

https://lkml.kernel.org/r/20210818235615.2047588-1-seanjc@google.com
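
(For reference, the "separate call after the loop" shape is roughly the
sketch below; the actual patch is in the link above.)

	for (i = 0; i < PTE_PREFETCH_NUM; i++, spte++) {
		if (is_shadow_present_pte(*spte) || spte == sptep) {
			if (!start)
				continue;
			if (direct_pte_prefetch_many(vcpu, sp, start, spte) < 0)
				return;
			start = NULL;
		} else if (!start)
			start = spte;
	}
	/* Prefetch the trailing range, which the loop never flushes. */
	if (start)
		direct_pte_prefetch_many(vcpu, sp, start, spte);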

>  			if (!start)
>  				continue;
>  			if (direct_pte_prefetch_many(vcpu, sp, start, spte) < 0)
> -- 
> 2.19.1.6.gb485710b
> 


* Re: [PATCH 7/7] KVM: X86: Also prefetch the last range in __direct_pte_prefetch().
  2021-08-25 15:18   ` Sean Christopherson
@ 2021-08-25 22:58     ` Lai Jiangshan
  0 siblings, 0 replies; 30+ messages in thread
From: Lai Jiangshan @ 2021-08-25 22:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, Paolo Bonzini, Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, X86 ML, H. Peter Anvin, kvm

On Wed, Aug 25, 2021 at 11:18 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Aug 24, 2021, Lai Jiangshan wrote:
> > From: Lai Jiangshan <laijs@linux.alibaba.com>
> >
> > __direct_pte_prefetch() skips prefetching the last range.
> >
> > The last range are often the whole range after the faulted spte when
> > guest is touching huge-page-mapped(in guest view) memory forwardly
> > which means prefetching them can reduce pagefault.
> >
> > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e5932af6f11c..ac260e01e9d8 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -2847,8 +2847,9 @@ static void __direct_pte_prefetch(struct kvm_vcpu *vcpu,
> >       i = (sptep - sp->spt) & ~(PTE_PREFETCH_NUM - 1);
> >       spte = sp->spt + i;
> >
> > -     for (i = 0; i < PTE_PREFETCH_NUM; i++, spte++) {
> > -             if (is_shadow_present_pte(*spte) || spte == sptep) {
> > +     for (i = 0; i <= PTE_PREFETCH_NUM; i++, spte++) {
> > +             if (i == PTE_PREFETCH_NUM ||
> > +                 is_shadow_present_pte(*spte) || spte == sptep) {
>
> Heh, I posted a fix just a few days ago.  I prefer having a separate call after
> the loop.  The "<= PTE_PREFETCH_NUM" is subtle, and a check at the ends avoids
> a CMP+Jcc in the loop, though I highly doubt that actually affects performance.
>
> https://lkml.kernel.org/r/20210818235615.2047588-1-seanjc@google.com

Thanks!

>
> >                       if (!start)
> >                               continue;
> >                       if (direct_pte_prefetch_many(vcpu, sp, start, spte) < 0)
> > --
> > 2.19.1.6.gb485710b
> >


* Re: [PATCH 0/7] KVM: X86: MMU: misc fixes and cleanups
  2021-08-24  7:55 [PATCH 0/7] KVM: X86: MMU: misc fixes and cleanups Lai Jiangshan
                   ` (6 preceding siblings ...)
  2021-08-24  7:55 ` [PATCH 7/7] KVM: X86: Also prefetch the last range in __direct_pte_prefetch() Lai Jiangshan
@ 2021-08-31 18:02 ` Lai Jiangshan
  2021-08-31 21:57   ` Sean Christopherson
  7 siblings, 1 reply; 30+ messages in thread
From: Lai Jiangshan @ 2021-08-31 18:02 UTC (permalink / raw)
  To: LKML, Sean Christopherson, Vitaly Kuznetsov; +Cc: Paolo Bonzini, Lai Jiangshan

On Wed, Aug 25, 2021 at 1:40 AM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
>
> From: Lai Jiangshan <laijs@linux.alibaba.com>
>
> The first two patches fix two old possible defects.
> And the others are just cleanups.
>
> Lai Jiangshan (7):
>   KVM: X86: Fix missed remote tlb flush in rmap_write_protect()
>   KVM: X86: Synchronize the shadow pagetable before link it

Hello, KVM people

Ping.  Aside from patch 7, which has been addressed by Sean,
the patchset has bug fixes and useful cleanups.

Could you give it a review or an ack?

Thanks,
Lai

>   KVM: X86: Zap the invalid list after remote tlb flushing
>   KVM: X86: Remove FNAME(update_pte)
>   KVM: X86: Don't unsync pagetables when speculative
>   KVM: X86: Don't check unsync if the original spte is writible
>   KVM: X86: Also prefetch the last range in __direct_pte_prefetch().
>
>  arch/x86/kvm/mmu/mmu.c          | 50 +++++++++++++++++++++++++--------
>  arch/x86/kvm/mmu/mmu_internal.h |  3 +-
>  arch/x86/kvm/mmu/paging_tmpl.h  | 38 +++++++++++++++++--------
>  arch/x86/kvm/mmu/spte.c         |  6 ++--
>  4 files changed, 70 insertions(+), 27 deletions(-)
>
> --
> 2.19.1.6.gb485710b
>


* Re: [PATCH 0/7] KVM: X86: MMU: misc fixes and cleanups
  2021-08-31 18:02 ` [PATCH 0/7] KVM: X86: MMU: misc fixes and cleanups Lai Jiangshan
@ 2021-08-31 21:57   ` Sean Christopherson
  0 siblings, 0 replies; 30+ messages in thread
From: Sean Christopherson @ 2021-08-31 21:57 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: LKML, Vitaly Kuznetsov, Paolo Bonzini, Lai Jiangshan

On Wed, Sep 01, 2021, Lai Jiangshan wrote:
> On Wed, Aug 25, 2021 at 1:40 AM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
> >
> > From: Lai Jiangshan <laijs@linux.alibaba.com>
> >
> > The first two patches fix two old possible defects.
> > And the others are just cleanups.
> >
> > Lai Jiangshan (7):
> >   KVM: X86: Fix missed remote tlb flush in rmap_write_protect()
> >   KVM: X86: Synchronize the shadow pagetable before link it
> 
> Hello, KVM people
> 
> Ping.  Aside from patch 7 which has been addressed by Sean,
> the patchset has bug fixes and useful cleanups.
> 
> Could you have a review or an ack?

I'll take a look this week or next; I'm aiming to start chewing through my review
backlog this week.


* Re: [PATCH 1/7] KVM: X86: Fix missed remote tlb flush in rmap_write_protect()
  2021-08-24  7:55 ` [PATCH 1/7] KVM: X86: Fix missed remote tlb flush in rmap_write_protect() Lai Jiangshan
@ 2021-09-02 21:38   ` Sean Christopherson
  2021-09-13  9:57   ` Maxim Levitsky
  1 sibling, 0 replies; 30+ messages in thread
From: Sean Christopherson @ 2021-09-02 21:38 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Paolo Bonzini, Lai Jiangshan, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Marcelo Tosatti, Avi Kivity, kvm

On Tue, Aug 24, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> When kvm->tlbs_dirty > 0, some rmaps might have been deleted
> without flushing tlb remotely after kvm_sync_page().  If @gfn
> was writable before and it's rmaps was deleted in kvm_sync_page(),
> we need to flush tlb too even if __rmap_write_protect() doesn't
> request it.
> 
> Fixes: 4731d4c7a077 ("KVM: MMU: out of sync shadow core")

Should be

Fixes: a4ee1ca4a36e ("KVM: MMU: delay flush all tlbs on sync_page path")

> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4853c033e6ce..313918df1a10 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1420,6 +1420,14 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
>  			rmap_head = gfn_to_rmap(gfn, i, slot);
>  			write_protected |= __rmap_write_protect(kvm, rmap_head, true);
>  		}
> +		/*
> +		 * When kvm->tlbs_dirty > 0, some rmaps might have been deleted
> +		 * without flushing tlb remotely after kvm_sync_page().  If @gfn
> +		 * was writable before and it's rmaps was deleted in kvm_sync_page(),
> +		 * we need to flush tlb too.
> +		 */
> +		if (min_level == PG_LEVEL_4K && kvm->tlbs_dirty)
> +			write_protected = true;
>  	}
>  
>  	if (is_tdp_mmu_enabled(kvm))
> @@ -5733,6 +5741,14 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>  		flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect,
>  					  start_level, KVM_MAX_HUGEPAGE_LEVEL,
>  					  false);
> +		/*
> +		 * When kvm->tlbs_dirty > 0, some rmaps might have been deleted
> +		 * without flushing tlb remotely after kvm_sync_page().  If @gfn
> +		 * was writable before and it's rmaps was deleted in kvm_sync_page(),
> +		 * we need to flush tlb too.
> +		 */
> +		if (start_level == PG_LEVEL_4K && kvm->tlbs_dirty)
> +			flush = true;
>  		write_unlock(&kvm->mmu_lock);
>  	}

My vote is to do a revert of a4ee1ca4a36e with slightly less awful batching, and
then improve the batching even further if there's a noticeable loss of performance
(or just tell people to stop using shadow paging :-D).  Zapping SPTEs but not
flushing is just asking for these types of whack-a-mole bugs.

E.g. instead of a straight revert, do this for sync_page():

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 50ade6450ace..1fca27a08c00 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1095,13 +1095,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
                        return 0;

                if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) {
-                       /*
-                        * Update spte before increasing tlbs_dirty to make
-                        * sure no tlb flush is lost after spte is zapped; see
-                        * the comments in kvm_flush_remote_tlbs().
-                        */
-                       smp_wmb();
-                       vcpu->kvm->tlbs_dirty++;
+                       set_spte_ret |= SET_SPTE_NEED_REMOTE_TLB_FLUSH;
                        continue;
                }

@@ -1116,12 +1110,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)

                if (gfn != sp->gfns[i]) {
                        drop_spte(vcpu->kvm, &sp->spt[i]);
-                       /*
-                        * The same as above where we are doing
-                        * prefetch_invalid_gpte().
-                        */
-                       smp_wmb();
-                       vcpu->kvm->tlbs_dirty++;
+                       set_spte_ret |= SET_SPTE_NEED_REMOTE_TLB_FLUSH;
                        continue;
                }




* Re: [PATCH 3/7] KVM: X86: Zap the invalid list after remote tlb flushing
  2021-08-24  7:55 ` [PATCH 3/7] KVM: X86: Zap the invalid list after remote tlb flushing Lai Jiangshan
@ 2021-09-02 21:54   ` Sean Christopherson
  0 siblings, 0 replies; 30+ messages in thread
From: Sean Christopherson @ 2021-09-02 21:54 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Paolo Bonzini, Lai Jiangshan, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin, kvm

On Tue, Aug 24, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> In mmu_sync_children(), it can zap the invalid list after remote tlb flushing.
> Emptifying the invalid list ASAP might help reduce a remote tlb flushing
> in some cases.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 987953a901d2..a165eb8713bc 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2050,7 +2050,7 @@ static bool mmu_sync_children(struct kvm_vcpu *vcpu,
>  			protected |= rmap_write_protect(vcpu, sp->gfn);
>  
>  		if (protected) {
> -			kvm_flush_remote_tlbs(vcpu->kvm);
> +			kvm_mmu_flush_or_zap(vcpu, &invalid_list, true, flush);

This can just be 

			kvm_mmu_remote_flush_or_zap(vcpu, &invalid_list, true);

since a remote flush always does a local flush too.

Related to the tlbs_dirty revert, to avoid overzealous flushing, kvm_sync_page()
can pass back a "remote_flush" flag instead of doing the flush itself.  Something
like the below, or maybe multiplex an 'int' return.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ac260e01e9d8..f61de53de55a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2041,7 +2041,7 @@ static bool mmu_sync_children(struct kvm_vcpu *vcpu,
        struct mmu_page_path parents;
        struct kvm_mmu_pages pages;
        LIST_HEAD(invalid_list);
-       bool flush = false;
+       bool flush = false, remote_flush = false;

        while (mmu_unsync_walk(parent, &pages)) {
                bool protected = false;
@@ -2050,17 +2050,17 @@ static bool mmu_sync_children(struct kvm_vcpu *vcpu,
                        protected |= rmap_write_protect(vcpu, sp->gfn);

                if (protected) {
-                       kvm_mmu_flush_or_zap(vcpu, &invalid_list, true, flush);
-                       flush = false;
+                       kvm_mmu_remote_flush_or_zap(vcpu, &invalid_list, true);
+                       remote_flush = flush = false;
                }

                for_each_sp(pages, sp, parents, i) {
                        kvm_unlink_unsync_page(vcpu->kvm, sp);
-                       flush |= kvm_sync_page(vcpu, sp, &invalid_list);
+                       flush |= kvm_sync_page(vcpu, sp, &invalid_list, &remote_flush);
                        mmu_pages_clear_parents(&parents);
                }
                if (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock)) {
-                       kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
+                       kvm_mmu_flush_or_zap(vcpu, &invalid_list, remote_flush, flush);
                        cond_resched_rwlock_write(&vcpu->kvm->mmu_lock);
                        /*
                         * If @parent is not root, the caller doesn't have
@@ -2074,7 +2074,7 @@ static bool mmu_sync_children(struct kvm_vcpu *vcpu,
                }
        }

-       kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
+       kvm_mmu_flush_or_zap(vcpu, &invalid_list, remote_flush, flush);
        return true;
 }

>  			flush = false;
>  		}
>  
> -- 
> 2.19.1.6.gb485710b
> 


* Re: [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it
  2021-08-24  7:55 ` [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it Lai Jiangshan
@ 2021-09-02 23:40   ` Sean Christopherson
  2021-09-02 23:54     ` Sean Christopherson
                       ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Sean Christopherson @ 2021-09-02 23:40 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Paolo Bonzini, Lai Jiangshan, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Marcelo Tosatti, Avi Kivity, kvm

On Tue, Aug 24, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> If gpte is changed from non-present to present, the guest doesn't need
> to flush tlb per SDM.  So the host must synchronze sp before
> link it.  Otherwise the guest might use a wrong mapping.
> 
> For example: the guest first changes a level-1 pagetable, and then
> links its parent to a new place where the original gpte is non-present.
> Finally the guest can access the remapped area without flushing
> the tlb.  The guest's behavior should be allowed per SDM, but the host
> kvm mmu makes it wrong.

Ah, are you saying, given:

VA_x = PML4_A -> PDP_B -> PD_C -> PT_D

the guest can modify PT_D, then link it with

VA_y = PML4_A -> PDP_B -> PD_E -> PT_D

and access it via VA_y without flushing, and so KVM must sync PT_D.  Is that
correct?

> Fixes: 4731d4c7a077 ("KVM: MMU: out of sync shadow core")
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---

...

> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 50ade6450ace..48c7fe1b2d50 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -664,7 +664,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
>   * emulate this operation, return 1 to indicate this case.
>   */
>  static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> -			 struct guest_walker *gw)
> +			 struct guest_walker *gw, unsigned long mmu_seq)
>  {
>  	struct kvm_mmu_page *sp = NULL;
>  	struct kvm_shadow_walk_iterator it;
> @@ -678,6 +678,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  	top_level = vcpu->arch.mmu->root_level;
>  	if (top_level == PT32E_ROOT_LEVEL)
>  		top_level = PT32_ROOT_LEVEL;
> +
> +again:
>  	/*
>  	 * Verify that the top-level gpte is still there.  Since the page
>  	 * is a root page, it is either write protected (and cannot be
> @@ -713,8 +715,28 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  		if (FNAME(gpte_changed)(vcpu, gw, it.level - 1))
>  			goto out_gpte_changed;
>  
> -		if (sp)
> +		if (sp) {
> +			/*
> +			 * We must synchronize the pagetable before link it
> +			 * because the guest doens't need to flush tlb when
> +			 * gpte is changed from non-present to present.
> +			 * Otherwise, the guest may use the wrong mapping.
> +			 *
> +			 * For PG_LEVEL_4K, kvm_mmu_get_page() has already
> +			 * synchronized it transiently via kvm_sync_page().
> +			 *
> +			 * For higher level pagetable, we synchronize it
> +			 * via slower mmu_sync_children().  If it once
> +			 * released the mmu_lock, we need to restart from
> +			 * the root since we don't have reference to @sp.
> +			 */
> +			if (sp->unsync_children && !mmu_sync_children(vcpu, sp, false)) {

I don't like dropping mmu_lock in the page fault path.  (a) I agree that it's not
all that different than grabbing various things in kvm_mmu_do_page_fault() long
before acquiring mmu_lock, but I'm not 100% convinced we don't have a latent
bug hiding somewhere in there :-), and (b) there's a possibility, however small,
that there's something in FNAME(fetch) that we're missing.  Case in point, this
technically needs to do make_mmu_pages_available().

And I believe kvm_mmu_get_page() already tries to handle this case by requesting
KVM_REQ_MMU_SYNC if it uses a sp with unsync_children; it just doesn't handle SMP
interaction, e.g. it can link a sp that's immediately available to other vCPUs
before the sync.

Rather than force the sync here, what about kicking all vCPUs and retrying the
page fault?  The only gross part is that kvm_mmu_get_page() can now fail :-(

---
 arch/x86/include/asm/kvm_host.h | 3 ++-
 arch/x86/kvm/mmu/mmu.c          | 9 +++++++--
 arch/x86/kvm/mmu/paging_tmpl.h  | 4 ++++
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09b256db394a..332b9fb3454c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -57,7 +57,8 @@
 #define KVM_REQ_MIGRATE_TIMER		KVM_ARCH_REQ(0)
 #define KVM_REQ_REPORT_TPR_ACCESS	KVM_ARCH_REQ(1)
 #define KVM_REQ_TRIPLE_FAULT		KVM_ARCH_REQ(2)
-#define KVM_REQ_MMU_SYNC		KVM_ARCH_REQ(3)
+#define KVM_REQ_MMU_SYNC \
+	KVM_ARCH_REQ_FLAGS(3, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_CLOCK_UPDATE		KVM_ARCH_REQ(4)
 #define KVM_REQ_LOAD_MMU_PGD		KVM_ARCH_REQ(5)
 #define KVM_REQ_EVENT			KVM_ARCH_REQ(6)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4853c033e6ce..03293cd3c7ae 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2143,8 +2143,10 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
 		}

-		if (sp->unsync_children)
-			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
+		if (sp->unsync_children) {
+			kvm_make_all_cpus_request(KVM_REQ_MMU_SYNC, vcpu);
+			return NULL;
+		}

 		__clear_sp_write_flooding_count(sp);

@@ -2999,6 +3001,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)

 		sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
 				      it.level - 1, true, ACC_ALL);
+		BUG_ON(!sp);

 		link_shadow_page(vcpu, it.sptep, sp);
 		if (fault->is_tdp && fault->huge_page_disallowed &&
@@ -3383,6 +3386,8 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
 	struct kvm_mmu_page *sp;

 	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
+	BUG_ON(!sp);
+
 	++sp->root_count;

 	return __pa(sp->spt);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 50ade6450ace..f573d45e2c6f 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -704,6 +704,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 			access = gw->pt_access[it.level - 2];
 			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
 					      it.level-1, false, access);
+			if (!sp)
+				return RET_PF_RETRY;
 		}

 		/*
@@ -742,6 +744,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		if (!is_shadow_present_pte(*it.sptep)) {
 			sp = kvm_mmu_get_page(vcpu, base_gfn, fault->addr,
 					      it.level - 1, true, direct_access);
+			BUG_ON(!sp);
+
 			link_shadow_page(vcpu, it.sptep, sp);
 			if (fault->huge_page_disallowed &&
 			    fault->req_level >= it.level)
--


* Re: [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it
  2021-09-02 23:40   ` Sean Christopherson
@ 2021-09-02 23:54     ` Sean Christopherson
  2021-09-03  0:44       ` Lai Jiangshan
  2021-09-03  0:51     ` Lai Jiangshan
  2021-09-13 11:30     ` Maxim Levitsky
  2 siblings, 1 reply; 30+ messages in thread
From: Sean Christopherson @ 2021-09-02 23:54 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Paolo Bonzini, Lai Jiangshan, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Marcelo Tosatti, Avi Kivity, kvm

On Thu, Sep 02, 2021, Sean Christopherson wrote:
> On Tue, Aug 24, 2021, Lai Jiangshan wrote:
> Rather than force the sync here, what about kicking all vCPUs and retrying the
> page fault?  The only gross part is that kvm_mmu_get_page() can now fail :-(
> 
> ---
>  arch/x86/include/asm/kvm_host.h | 3 ++-
>  arch/x86/kvm/mmu/mmu.c          | 9 +++++++--
>  arch/x86/kvm/mmu/paging_tmpl.h  | 4 ++++
>  3 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 09b256db394a..332b9fb3454c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -57,7 +57,8 @@
>  #define KVM_REQ_MIGRATE_TIMER		KVM_ARCH_REQ(0)
>  #define KVM_REQ_REPORT_TPR_ACCESS	KVM_ARCH_REQ(1)
>  #define KVM_REQ_TRIPLE_FAULT		KVM_ARCH_REQ(2)
> -#define KVM_REQ_MMU_SYNC		KVM_ARCH_REQ(3)
> +#define KVM_REQ_MMU_SYNC \
> +	KVM_ARCH_REQ_FLAGS(3, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_CLOCK_UPDATE		KVM_ARCH_REQ(4)
>  #define KVM_REQ_LOAD_MMU_PGD		KVM_ARCH_REQ(5)
>  #define KVM_REQ_EVENT			KVM_ARCH_REQ(6)
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4853c033e6ce..03293cd3c7ae 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2143,8 +2143,10 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  			kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
>  		}
> 
> -		if (sp->unsync_children)
> -			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> +		if (sp->unsync_children) {
> +			kvm_make_all_cpus_request(KVM_REQ_MMU_SYNC, vcpu);
> +			return NULL;
> +		}
> 
>  		__clear_sp_write_flooding_count(sp);
> 
> @@ -2999,6 +3001,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> 
>  		sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
>  				      it.level - 1, true, ACC_ALL);
> +		BUG_ON(!sp);
> 
>  		link_shadow_page(vcpu, it.sptep, sp);
>  		if (fault->is_tdp && fault->huge_page_disallowed &&
> @@ -3383,6 +3386,8 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
>  	struct kvm_mmu_page *sp;
> 
>  	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
> +	BUG_ON(!sp);

Gah, this is obviously wrong when allocating an indirect root.  On the happy side,
it points out a cleaner approach.  I think this is what we want?

---
 arch/x86/kvm/mmu/mmu.c         | 3 ---
 arch/x86/kvm/mmu/paging_tmpl.h | 4 ++++
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4853c033e6ce..f24e8088192c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2143,9 +2143,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
 		}

-		if (sp->unsync_children)
-			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
-
 		__clear_sp_write_flooding_count(sp);

 trace_get_page:
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 50ade6450ace..5b13918a55c2 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -704,6 +704,10 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 			access = gw->pt_access[it.level - 2];
 			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
 					      it.level-1, false, access);
+			if (sp->unsync_children) {
+				kvm_make_all_cpus_request(KVM_REQ_MMU_SYNC, vcpu);
+				return RET_PF_RETRY;
+			}
 		}

 		/*
--


* Re: [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it
  2021-09-02 23:54     ` Sean Christopherson
@ 2021-09-03  0:44       ` Lai Jiangshan
  2021-09-03 16:06         ` Sean Christopherson
  0 siblings, 1 reply; 30+ messages in thread
From: Lai Jiangshan @ 2021-09-03  0:44 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: linux-kernel, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Marcelo Tosatti,
	Avi Kivity, kvm



On 2021/9/3 07:54, Sean Christopherson wrote:
> 
>   trace_get_page:
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 50ade6450ace..5b13918a55c2 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -704,6 +704,10 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>   			access = gw->pt_access[it.level - 2];
>   			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
>   					      it.level-1, false, access);
> +			if (sp->unsync_children) {
> +				kvm_make_all_cpus_request(KVM_REQ_MMU_SYNC, vcpu);
> +				return RET_PF_RETRY;

Making KVM_REQ_MMU_SYNC work remotely is a good idea.
But if the sp is not linked, the @sp might not be synced even if we
retry many times.  So we should continue to link it.

But if we continue to link it, KVM_REQ_MMU_SYNC would have to be extended
to sync all roots (the current root and prev_roots).  And maybe a
KVM_REQ_MMU_SYNC_CURRENT should be added for syncing only the current root.

It is not going to be simple.  I have a new way to sync pages that also
fixes the problem, but it includes several non-fix patches.

We need to fix this problem in the simplest way.  In my patch,
mmu_sync_children() has a @root argument.  I think we can disallow
releasing the lock when @root is false, roughly as sketched below.  Is that OK?
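
Something along these lines (an untested sketch on top of patch 2, just to
illustrate the idea):

	if (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock)) {
		kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
		flush = false;
		/*
		 * Only yield when syncing a root; a non-root @parent is
		 * not referenced by the caller, so keep mmu_lock held and
		 * keep walking instead of restarting from the root.
		 */
		if (root)
			cond_resched_rwlock_write(&vcpu->kvm->mmu_lock);
	}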



> +			}
>   		}
> 
>   		/*
> --
> 


* Re: [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it
  2021-09-02 23:40   ` Sean Christopherson
  2021-09-02 23:54     ` Sean Christopherson
@ 2021-09-03  0:51     ` Lai Jiangshan
  2021-09-13 11:30     ` Maxim Levitsky
  2 siblings, 0 replies; 30+ messages in thread
From: Lai Jiangshan @ 2021-09-03  0:51 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: linux-kernel, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Marcelo Tosatti,
	Avi Kivity, kvm



On 2021/9/3 07:40, Sean Christopherson wrote:
> On Tue, Aug 24, 2021, Lai Jiangshan wrote:
>> From: Lai Jiangshan <laijs@linux.alibaba.com>
>>
>> If gpte is changed from non-present to present, the guest doesn't need
>> to flush tlb per SDM.  So the host must synchronze sp before
>> link it.  Otherwise the guest might use a wrong mapping.
>>
>> For example: the guest first changes a level-1 pagetable, and then
>> links its parent to a new place where the original gpte is non-present.
>> Finally the guest can access the remapped area without flushing
>> the tlb.  The guest's behavior should be allowed per SDM, but the host
>> kvm mmu makes it wrong.
> 
> Ah, are you saying, given:
> 
> VA_x = PML4_A -> PDP_B -> PD_C -> PT_D
> 
> the guest can modify PT_D, then link it with
> 
> VA_y = PML4_A -> PDP_B -> PD_E -> PT_D
> 
> and access it via VA_y without flushing, and so KVM must sync PT_D.  Is that
> correct?

Yes.  And another vCPU can access it via VA_y without flushing.

> 
>> Fixes: 4731d4c7a077 ("KVM: MMU: out of sync shadow core")
>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>> ---
> 
> ...
> 
>> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
>> index 50ade6450ace..48c7fe1b2d50 100644
>> --- a/arch/x86/kvm/mmu/paging_tmpl.h
>> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
>> @@ -664,7 +664,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
>>    * emulate this operation, return 1 to indicate this case.
>>    */
>>   static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>> -			 struct guest_walker *gw)
>> +			 struct guest_walker *gw, unsigned long mmu_seq)
>>   {
>>   	struct kvm_mmu_page *sp = NULL;
>>   	struct kvm_shadow_walk_iterator it;
>> @@ -678,6 +678,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>>   	top_level = vcpu->arch.mmu->root_level;
>>   	if (top_level == PT32E_ROOT_LEVEL)
>>   		top_level = PT32_ROOT_LEVEL;
>> +
>> +again:
>>   	/*
>>   	 * Verify that the top-level gpte is still there.  Since the page
>>   	 * is a root page, it is either write protected (and cannot be
>> @@ -713,8 +715,28 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>>   		if (FNAME(gpte_changed)(vcpu, gw, it.level - 1))
>>   			goto out_gpte_changed;
>>   
>> -		if (sp)
>> +		if (sp) {
>> +			/*
>> +			 * We must synchronize the pagetable before link it
>> +			 * because the guest doens't need to flush tlb when
>> +			 * gpte is changed from non-present to present.
>> +			 * Otherwise, the guest may use the wrong mapping.
>> +			 *
>> +			 * For PG_LEVEL_4K, kvm_mmu_get_page() has already
>> +			 * synchronized it transiently via kvm_sync_page().
>> +			 *
>> +			 * For higher level pagetable, we synchronize it
>> +			 * via slower mmu_sync_children().  If it once
>> +			 * released the mmu_lock, we need to restart from
>> +			 * the root since we don't have reference to @sp.
>> +			 */
>> +			if (sp->unsync_children && !mmu_sync_children(vcpu, sp, false)) {
> 
> I don't like dropping mmu_lock in the page fault path.  I agree that it's not
> all that different than grabbing various things in kvm_mmu_do_page_fault() long
> before acquiring mmu_lock, but I'm not 100% convinced we don't have a latent
> bug hiding somehwere in there :-), and (b) there's a possibility, however small,
> that something in FNAME(fetch) that we're missing.  Case in point, this technically
> needs to do make_mmu_pages_available().
> 
> And I believe kvm_mmu_get_page() already tries to handle this case by requesting
> KVM_REQ_MMU_SYNC if it uses a sp with unsync_children, it just doesn't handle SMP
> interaction, e.g. can link a sp that's immediately available to other vCPUs before
> the sync.
> 
> Rather than force the sync here, what about kicking all vCPUs and retrying the
> page fault?  The only gross part is that kvm_mmu_get_page() can now fail :-(
> 
> ---
>   arch/x86/include/asm/kvm_host.h | 3 ++-
>   arch/x86/kvm/mmu/mmu.c          | 9 +++++++--
>   arch/x86/kvm/mmu/paging_tmpl.h  | 4 ++++
>   3 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 09b256db394a..332b9fb3454c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -57,7 +57,8 @@
>   #define KVM_REQ_MIGRATE_TIMER		KVM_ARCH_REQ(0)
>   #define KVM_REQ_REPORT_TPR_ACCESS	KVM_ARCH_REQ(1)
>   #define KVM_REQ_TRIPLE_FAULT		KVM_ARCH_REQ(2)
> -#define KVM_REQ_MMU_SYNC		KVM_ARCH_REQ(3)
> +#define KVM_REQ_MMU_SYNC \
> +	KVM_ARCH_REQ_FLAGS(3, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>   #define KVM_REQ_CLOCK_UPDATE		KVM_ARCH_REQ(4)
>   #define KVM_REQ_LOAD_MMU_PGD		KVM_ARCH_REQ(5)
>   #define KVM_REQ_EVENT			KVM_ARCH_REQ(6)
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4853c033e6ce..03293cd3c7ae 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2143,8 +2143,10 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>   			kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
>   		}
> 
> -		if (sp->unsync_children)
> -			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> +		if (sp->unsync_children) {
> +			kvm_make_all_cpus_request(KVM_REQ_MMU_SYNC, vcpu);
> +			return NULL;
> +		}
> 
>   		__clear_sp_write_flooding_count(sp);
> 
> @@ -2999,6 +3001,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> 
>   		sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
>   				      it.level - 1, true, ACC_ALL);
> +		BUG_ON(!sp);
> 
>   		link_shadow_page(vcpu, it.sptep, sp);
>   		if (fault->is_tdp && fault->huge_page_disallowed &&
> @@ -3383,6 +3386,8 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
>   	struct kvm_mmu_page *sp;
> 
>   	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
> +	BUG_ON(!sp);
> +
>   	++sp->root_count;
> 
>   	return __pa(sp->spt);
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 50ade6450ace..f573d45e2c6f 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -704,6 +704,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>   			access = gw->pt_access[it.level - 2];
>   			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
>   					      it.level-1, false, access);
> +			if (!sp)
> +				return RET_PF_RETRY;
>   		}
> 
>   		/*
> @@ -742,6 +744,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>   		if (!is_shadow_present_pte(*it.sptep)) {
>   			sp = kvm_mmu_get_page(vcpu, base_gfn, fault->addr,
>   					      it.level - 1, true, direct_access);
> +			BUG_ON(!sp);
> +
>   			link_shadow_page(vcpu, it.sptep, sp);
>   			if (fault->huge_page_disallowed &&
>   			    fault->req_level >= it.level)
> --
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it
  2021-09-03  0:44       ` Lai Jiangshan
@ 2021-09-03 16:06         ` Sean Christopherson
  2021-09-03 16:25           ` Lai Jiangshan
  2021-09-03 16:33           ` Lai Jiangshan
  0 siblings, 2 replies; 30+ messages in thread
From: Sean Christopherson @ 2021-09-03 16:06 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, linux-kernel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Marcelo Tosatti, Avi Kivity, kvm

On Fri, Sep 03, 2021, Lai Jiangshan wrote:
> 
> On 2021/9/3 07:54, Sean Christopherson wrote:
> > 
> >   trace_get_page:
> > diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> > index 50ade6450ace..5b13918a55c2 100644
> > --- a/arch/x86/kvm/mmu/paging_tmpl.h
> > +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> > @@ -704,6 +704,10 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >   			access = gw->pt_access[it.level - 2];
> >   			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
> >   					      it.level-1, false, access);
> > +			if (sp->unsync_children) {
> > +				kvm_make_all_cpus_request(KVM_REQ_MMU_SYNC, vcpu);
> > +				return RET_PF_RETRY;
> 
> Making KVM_REQ_MMU_SYNC work remotely is a good idea.
> But if the sp is not linked, @sp might not be synced even if we
> retry many times. So we should continue to link it.

Hrm, yeah.  The sp has to be linked in at least one mmu, but it may not be linked
in the current mmu, so KVM would have to sync all roots across all current and
previous mmus in order to guarantee the target page is linked.  Eww.

> But if we continue to link it, KVM_REQ_MMU_SYNC should be extended to
> sync all roots (current root and prev_roots).  And maybe add a
> KVM_REQ_MMU_SYNC_CURRENT for current root syncing.
> 
> It is not going to be simple.  I have a new way to sync pages
> and also fix the problem, but that includes several non-fix patches.
> 
> We need to fix this problem in the simplest way.  In my patch
> mmu_sync_children() has a @root argument.  I think we can disallow
> releasing the lock when @root is false. Is it OK?

With a caveat, it should work.  I was exploring that option before the remote
sync idea.

The danger is inducing a stall in the host (RCU, etc...) if sp is an upper level
entry, e.g. with 5-level paging it can even be a PML4.  My thought for that is to
skip the yield if there are less than N unsync children remaining, and then bail
out if the caller doesn't allow yielding.  If mmu_sync_children() fails, restart
the guest and go through the entire page fault path.  Worst case scenario, it will
take a "few" rounds for the vCPU to finally resolve the page fault.

Regarding params, please use "can_yield" instead of "root" to match similar logic
in the TDP MMU, and return an int instead of a bool.

Thanks!

---
 arch/x86/kvm/mmu/mmu.c         | 18 ++++++++++++------
 arch/x86/kvm/mmu/paging_tmpl.h |  3 +++
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4853c033e6ce..5be990cdb2be 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2024,8 +2024,8 @@ static void mmu_pages_clear_parents(struct mmu_page_path *parents)
 	} while (!sp->unsync_children);
 }

-static void mmu_sync_children(struct kvm_vcpu *vcpu,
-			      struct kvm_mmu_page *parent)
+static int mmu_sync_children(struct kvm_vcpu *vcpu,
+			     struct kvm_mmu_page *parent, bool can_yield)
 {
 	int i;
 	struct kvm_mmu_page *sp;
@@ -2050,7 +2050,15 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 			flush |= kvm_sync_page(vcpu, sp, &invalid_list);
 			mmu_pages_clear_parents(&parents);
 		}
-		if (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock)) {
+		/*
+		 * Don't yield if there are fewer than <N> unsync children
+		 * remaining, just finish up and get out.
+		 */
+		if (parent->unsync_children > SOME_ARBITRARY_THRESHOLD &&
+		    (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock))) {
+			if (!can_yield)
+				return -EINTR;
+
 			kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
 			cond_resched_rwlock_write(&vcpu->kvm->mmu_lock);
 			flush = false;
@@ -2058,6 +2066,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 	}

 	kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
+	return 0;
 }

 static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
@@ -2143,9 +2152,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
 		}

-		if (sp->unsync_children)
-			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
-
 		__clear_sp_write_flooding_count(sp);

 trace_get_page:
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 50ade6450ace..2ff123ec0d64 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -704,6 +704,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 			access = gw->pt_access[it.level - 2];
 			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
 					      it.level-1, false, access);
+			if (sp->unsync_children &&
+			    mmu_sync_children(vcpu, sp, false))
+				return RET_PF_RETRY;
 		}

 		/*
--

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it
  2021-09-03 16:06         ` Sean Christopherson
@ 2021-09-03 16:25           ` Lai Jiangshan
  2021-09-03 16:40             ` Sean Christopherson
  2021-09-03 16:33           ` Lai Jiangshan
  1 sibling, 1 reply; 30+ messages in thread
From: Lai Jiangshan @ 2021-09-03 16:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Lai Jiangshan, linux-kernel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Marcelo Tosatti, Avi Kivity, kvm



On 2021/9/4 00:06, Sean Christopherson wrote:

> 
>   trace_get_page:
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 50ade6450ace..2ff123ec0d64 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -704,6 +704,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>   			access = gw->pt_access[it.level - 2];
>   			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
>   					      it.level-1, false, access);
> +			if (sp->unsync_children &&
> +			    mmu_sync_children(vcpu, sp, false))
> +				return RET_PF_RETRY;

This was like my first (unsent) fix: just return RET_PF_RETRY on break.

And then I thought that it'd be better to retry the fetch directly, rather than
retry the guest, when the conditions are still valid/unchanged, to avoid redoing
the whole guest page walk and GUP().  Although the code does not check all
conditions, such as a pending interrupt event. (We can add that too.)

I think it is a good design to allow breaking mmu_lock when the mmu is handling
heavy work.
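
For illustration only, the shape of what I have in mind is something like this
(rough and untested, and as said above it doesn't yet revalidate everything it
should, e.g. pending interrupt events):

		if (sp->unsync_children &&
		    !mmu_sync_children(vcpu, sp, false)) {
			/*
			 * mmu_lock was dropped, so @sp may have been zapped;
			 * don't touch it.  Revalidate what we can and restart
			 * the walk from the "again:" label instead of exiting
			 * to the caller.
			 */
			if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
				return RET_PF_RETRY;
			goto again;
		}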


>   		}
> 
>   		/*
> --
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it
  2021-09-03 16:06         ` Sean Christopherson
  2021-09-03 16:25           ` Lai Jiangshan
@ 2021-09-03 16:33           ` Lai Jiangshan
  1 sibling, 0 replies; 30+ messages in thread
From: Lai Jiangshan @ 2021-09-03 16:33 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Lai Jiangshan, linux-kernel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Marcelo Tosatti, Avi Kivity, kvm



On 2021/9/4 00:06, Sean Christopherson wrote:

> -static void mmu_sync_children(struct kvm_vcpu *vcpu,
> -			      struct kvm_mmu_page *parent)
> +static int mmu_sync_children(struct kvm_vcpu *vcpu,
> +			     struct kvm_mmu_page *parent, bool can_yield)
>   {
>   	int i;
>   	struct kvm_mmu_page *sp;
> @@ -2050,7 +2050,15 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
>   			flush |= kvm_sync_page(vcpu, sp, &invalid_list);
>   			mmu_pages_clear_parents(&parents);
>   		}
> -		if (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock)) {
> +		/*
> +		 * Don't yield if there are fewer than <N> unsync children
> +		 * remaining, just finish up and get out.
> +		 */
> +		if (parent->unsync_children > SOME_ARBITRARY_THRESHOLD &&
> +		    (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock))) {
> +			if (!can_yield)
> +				return -EINTR;
> +


Another thought about this function.

It is a courtesy to break when rwlock_needbreak(), but the price is quite high,
since the remote flushing interrupts several pCPUs.  I think we should only break
when need_resched().

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it
  2021-09-03 16:25           ` Lai Jiangshan
@ 2021-09-03 16:40             ` Sean Christopherson
  2021-09-03 17:00               ` Lai Jiangshan
  0 siblings, 1 reply; 30+ messages in thread
From: Sean Christopherson @ 2021-09-03 16:40 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, linux-kernel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Marcelo Tosatti, Avi Kivity, kvm

On Sat, Sep 04, 2021, Lai Jiangshan wrote:
> 
> On 2021/9/4 00:06, Sean Christopherson wrote:
> 
> > diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> > index 50ade6450ace..2ff123ec0d64 100644
> > --- a/arch/x86/kvm/mmu/paging_tmpl.h
> > +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> > @@ -704,6 +704,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >   			access = gw->pt_access[it.level - 2];
> >   			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
> >   					      it.level-1, false, access);
> > +			if (sp->unsync_children &&
> > +			    mmu_sync_children(vcpu, sp, false))
> > +				return RET_PF_RETRY;
> 
> This was like my first (unsent) fix: just return RET_PF_RETRY on break.
> 
> And then I thought that it'd be better to retry the fetch directly, rather than
> retry the guest, when the conditions are still valid/unchanged, to avoid redoing
> the whole guest page walk and GUP().  Although the code does not check all
> conditions, such as a pending interrupt event. (We can add that too.)

But not in a bug fix that needs to go to stable branches.  
 
> I think it is a good design to allow breaking mmu_lock when the mmu is handling
> heavy work.

I don't disagree in principle, but I question the relevance/need.  I doubt this
code is relevant to nested TDP performance as hypervisors generally don't do the
type of PTE manipulations that would lead to linking an existing unsync sp.  And
for legacy shadow paging, my preference would be to put it into maintenance-only
mode as much as possible.  I'm not dead set against new features/functionality
for shadow paging, but for something like dropping mmu_lock in the page fault path,
IMO there need to be performance numbers to justify such a change.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it
  2021-09-03 16:40             ` Sean Christopherson
@ 2021-09-03 17:00               ` Lai Jiangshan
  0 siblings, 0 replies; 30+ messages in thread
From: Lai Jiangshan @ 2021-09-03 17:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Lai Jiangshan, linux-kernel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Marcelo Tosatti, Avi Kivity, kvm



On 2021/9/4 00:40, Sean Christopherson wrote:
> On Sat, Sep 04, 2021, Lai Jiangshan wrote:
>>
>> On 2021/9/4 00:06, Sean Christopherson wrote:
>>
>>> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
>>> index 50ade6450ace..2ff123ec0d64 100644
>>> --- a/arch/x86/kvm/mmu/paging_tmpl.h
>>> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
>>> @@ -704,6 +704,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>>>    			access = gw->pt_access[it.level - 2];
>>>    			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
>>>    					      it.level-1, false, access);
>>> +			if (sp->unsync_children &&
>>> +			    mmu_sync_children(vcpu, sp, false))
>>> +				return RET_PF_RETRY;
>>
>> This was like my first (unsent) fix: just return RET_PF_RETRY on break.
>>
>> And then I thought that it'd be better to retry the fetch directly, rather than
>> retry the guest, when the conditions are still valid/unchanged, to avoid redoing
>> the whole guest page walk and GUP().  Although the code does not check all
>> conditions, such as a pending interrupt event. (We can add that too.)
> 
> But not in a bug fix that needs to go to stable branches.

Good point, it is too complicated for a fix; I accept just "return RET_PF_RETRY"
(and we don't need "SOME_ARBITRARY_THRESHOLD").

Is that OK? I will update the patch accordingly.
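
I.e., something like this on top of your mmu_sync_children() change (rough and
untested):

		if (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock)) {
			/* Flush/zap whatever has been synced so far before bailing. */
			kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
			if (!can_yield)
				return -EINTR;

			cond_resched_rwlock_write(&vcpu->kvm->mmu_lock);
			flush = false;
		}

with FNAME(fetch) returning RET_PF_RETRY when mmu_sync_children() fails, as in
your diff.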

>   
>> I think it is a good design to allow breaking mmu_lock when the mmu is handling
>> heavy work.
> 
> I don't disagree in principle, but I question the relevance/need.  I doubt this
> code is relevant to nested TDP performance as hypervisors generally don't do the
> type of PTE manipulations that would lead to linking an existing unsync sp.  And
> for legacy shadow paging, my preference would be to put it into maintenance-only
> mode as much as possible.  I'm not dead set against new features/functionality
> for shadow paging, but for something like dropping mmu_lock in the page fault path,
> IMO there need to be performance numbers to justify such a change.
> 

I understood the concern and the relevance/need.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/7] KVM: X86: Remove FNAME(update_pte)
  2021-08-24  7:55 ` [PATCH 4/7] KVM: X86: Remove FNAME(update_pte) Lai Jiangshan
@ 2021-09-13  9:49   ` Maxim Levitsky
  0 siblings, 0 replies; 30+ messages in thread
From: Maxim Levitsky @ 2021-09-13  9:49 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel
  Cc: Paolo Bonzini, Lai Jiangshan, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, kvm

On Tue, 2021-08-24 at 15:55 +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> Its sole caller is changed to use FNAME(prefetch_gpte) directly.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/kvm/mmu/paging_tmpl.h | 10 +---------
>  1 file changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 48c7fe1b2d50..6b2e248f2f4c 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -589,14 +589,6 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	return true;
>  }
>  
> -static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> -			      u64 *spte, const void *pte)
> -{
> -	pt_element_t gpte = *(const pt_element_t *)pte;
> -
> -	FNAME(prefetch_gpte)(vcpu, sp, spte, gpte, false);
> -}
> -
>  static bool FNAME(gpte_changed)(struct kvm_vcpu *vcpu,
>  				struct guest_walker *gw, int level)
>  {
> @@ -998,7 +990,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa)
>  						       sizeof(pt_element_t)))
>  				break;
>  
> -			FNAME(update_pte)(vcpu, sp, sptep, &gpte);
> +			FNAME(prefetch_gpte)(vcpu, sp, sptep, gpte, false);
>  		}
>  
>  		if (!is_shadow_present_pte(*sptep) || !sp->unsync_children)

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/7] KVM: X86: Fix missed remote tlb flush in rmap_write_protect()
  2021-08-24  7:55 ` [PATCH 1/7] KVM: X86: Fix missed remote tlb flush in rmap_write_protect() Lai Jiangshan
  2021-09-02 21:38   ` Sean Christopherson
@ 2021-09-13  9:57   ` Maxim Levitsky
  1 sibling, 0 replies; 30+ messages in thread
From: Maxim Levitsky @ 2021-09-13  9:57 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel
  Cc: Paolo Bonzini, Lai Jiangshan, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Marcelo Tosatti, Avi Kivity, kvm

On Tue, 2021-08-24 at 15:55 +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> When kvm->tlbs_dirty > 0, some rmaps might have been deleted
> without flushing tlb remotely after kvm_sync_page().  If @gfn
> was writable before and it's rmaps was deleted in kvm_sync_page(),
> we need to flush tlb too even if __rmap_write_protect() doesn't
> request it.
> 
> Fixes: 4731d4c7a077 ("KVM: MMU: out of sync shadow core")
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4853c033e6ce..313918df1a10 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1420,6 +1420,14 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
>  			rmap_head = gfn_to_rmap(gfn, i, slot);
>  			write_protected |= __rmap_write_protect(kvm, rmap_head, true);
>  		}
> +		/*
> +		 * When kvm->tlbs_dirty > 0, some rmaps might have been deleted
> +		 * without flushing tlb remotely after kvm_sync_page().  If @gfn
> +		 * was writable before and it's rmaps was deleted in kvm_sync_page(),
> +		 * we need to flush tlb too.
> +		 */
> +		if (min_level == PG_LEVEL_4K && kvm->tlbs_dirty)
> +			write_protected = true;

This is a bit misleading, as the gfn is not write protected in this case, but we
just want a tlb flush. Maybe add a new output parameter to the function, telling
the caller whether a tlb flush is needed?
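
Something like this, perhaps (untested; "flush_needed" is just an illustrative
name for a new bool * out-parameter on kvm_mmu_slot_gfn_write_protect()):

		/*
		 * Report "a TLB flush is needed" separately instead of
		 * overloading the write_protected return value.
		 */
		if (min_level == PG_LEVEL_4K && kvm->tlbs_dirty)
			*flush_needed = true;

and then the callers would OR that into their local flush decision.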

>  	}
>  
>  	if (is_tdp_mmu_enabled(kvm))
> @@ -5733,6 +5741,14 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>  		flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect,
>  					  start_level, KVM_MAX_HUGEPAGE_LEVEL,
>  					  false);
> +		/*
> +		 * When kvm->tlbs_dirty > 0, some rmaps might have been deleted
> +		 * without flushing tlb remotely after kvm_sync_page().  If @gfn
> +		 * was writable before and it's rmaps was deleted in kvm_sync_page(),
> +		 * we need to flush tlb too.
> +		 */
> +		if (start_level == PG_LEVEL_4K && kvm->tlbs_dirty)
> +			flush = true;
>  		write_unlock(&kvm->mmu_lock);
>  	}
>  


Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/7] KVM: X86: Don't unsync pagetables when speculative
  2021-08-24  7:55 ` [PATCH 5/7] KVM: X86: Don't unsync pagetables when speculative Lai Jiangshan
@ 2021-09-13 11:02   ` Maxim Levitsky
  2021-09-18  3:06     ` Lai Jiangshan
  0 siblings, 1 reply; 30+ messages in thread
From: Maxim Levitsky @ 2021-09-13 11:02 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel
  Cc: Paolo Bonzini, Lai Jiangshan, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, kvm

On Tue, 2021-08-24 at 15:55 +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> We'd better only unsync the pagetable when there has just been a real
> write fault on a level-1 pagetable.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 6 +++++-
>  arch/x86/kvm/mmu/mmu_internal.h | 3 ++-
>  arch/x86/kvm/mmu/spte.c         | 2 +-
>  3 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a165eb8713bc..e5932af6f11c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2600,7 +2600,8 @@ static void kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>   * were marked unsync (or if there is no shadow page), -EPERM if the SPTE must
>   * be write-protected.
>   */
> -int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync)
> +int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync,
> +			    bool speculative)
>  {
>  	struct kvm_mmu_page *sp;
>  	bool locked = false;
> @@ -2626,6 +2627,9 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync)
>  		if (sp->unsync)
>  			continue;
>  
> +		if (speculative)
> +			return -EEXIST;

Wouldn't it be better to ensure that callers set can_unsync = false when speculating?

Also, if I understand correctly, this is not fixing a bug, but an optimization?

Best regards,
	Maxim Levitsky


> +
>  		/*
>  		 * TDP MMU page faults require an additional spinlock as they
>  		 * run with mmu_lock held for read, not write, and the unsync
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 658d8d228d43..f5d8be787993 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -116,7 +116,8 @@ static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
>  	       kvm_x86_ops.cpu_dirty_log_size;
>  }
>  
> -int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync);
> +int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync,
> +			    bool speculative);
>  
>  void kvm_mmu_gfn_disallow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
>  void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 3e97cdb13eb7..b68a580f3510 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -159,7 +159,7 @@ int make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
>  		 * e.g. it's write-tracked (upper-level SPs) or has one or more
>  		 * shadow pages and unsync'ing pages is not allowed.
>  		 */
> -		if (mmu_try_to_unsync_pages(vcpu, gfn, can_unsync)) {
> +		if (mmu_try_to_unsync_pages(vcpu, gfn, can_unsync, speculative)) {
>  			pgprintk("%s: found shadow page for %llx, marking ro\n",
>  				 __func__, gfn);
>  			ret |= SET_SPTE_WRITE_PROTECTED_PT;



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it
  2021-09-02 23:40   ` Sean Christopherson
  2021-09-02 23:54     ` Sean Christopherson
  2021-09-03  0:51     ` Lai Jiangshan
@ 2021-09-13 11:30     ` Maxim Levitsky
  2021-09-13 20:49       ` Sean Christopherson
  2 siblings, 1 reply; 30+ messages in thread
From: Maxim Levitsky @ 2021-09-13 11:30 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: linux-kernel, Paolo Bonzini, Lai Jiangshan, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Marcelo Tosatti, Avi Kivity, kvm

On Thu, 2021-09-02 at 23:40 +0000, Sean Christopherson wrote:
> On Tue, Aug 24, 2021, Lai Jiangshan wrote:
> > From: Lai Jiangshan <laijs@linux.alibaba.com>
> > 
> > If gpte is changed from non-present to present, the guest doesn't need
> > to flush tlb per SDM.  So the host must synchronize the sp before
> > linking it.  Otherwise the guest might use a wrong mapping.
> > 
> > For example: the guest first changes a level-1 pagetable, and then
> > links its parent to a new place where the original gpte is non-present.
> > Finally the guest can access the remapped area without flushing
> > the tlb.  The guest's behavior should be allowed per SDM, but the host
> > kvm mmu makes it wrong.
> 
> Ah, are you saying, given:
> 
> VA_x = PML4_A -> PDP_B -> PD_C -> PT_D
> 
> the guest can modify PT_D, then link it with
> 
> VA_y = PML4_A -> PDP_B -> PD_E -> PT_D
> 
> and access it via VA_y without flushing, and so KVM must sync PT_D.  Is that
> correct?

Looks like this. However 


> 
> > Fixes: 4731d4c7a077 ("KVM: MMU: out of sync shadow core")
> > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > ---
> 
> ...
> 
> > diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> > index 50ade6450ace..48c7fe1b2d50 100644
> > --- a/arch/x86/kvm/mmu/paging_tmpl.h
> > +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> > @@ -664,7 +664,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
> >   * emulate this operation, return 1 to indicate this case.
> >   */
> >  static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > -			 struct guest_walker *gw)
> > +			 struct guest_walker *gw, unsigned long mmu_seq)
> >  {
> >  	struct kvm_mmu_page *sp = NULL;
> >  	struct kvm_shadow_walk_iterator it;
> > @@ -678,6 +678,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >  	top_level = vcpu->arch.mmu->root_level;
> >  	if (top_level == PT32E_ROOT_LEVEL)
> >  		top_level = PT32_ROOT_LEVEL;
> > +
> > +again:
> >  	/*
> >  	 * Verify that the top-level gpte is still there.  Since the page
> >  	 * is a root page, it is either write protected (and cannot be
> > @@ -713,8 +715,28 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >  		if (FNAME(gpte_changed)(vcpu, gw, it.level - 1))
> >  			goto out_gpte_changed;
> >  
> > -		if (sp)
> > +		if (sp) {
> > +			/*
> > +			 * We must synchronize the pagetable before linking it
> > +			 * because the guest doesn't need to flush tlb when
> > +			 * gpte is changed from non-present to present.
> > +			 * Otherwise, the guest may use the wrong mapping.
> > +			 *
> > +			 * For PG_LEVEL_4K, kvm_mmu_get_page() has already
> > +			 * synchronized it transiently via kvm_sync_page().
> > +			 *
> > +			 * For higher level pagetable, we synchronize it
> > +			 * via slower mmu_sync_children().  If it once
> > +			 * released the mmu_lock, we need to restart from
> > +			 * the root since we don't have reference to @sp.
> > +			 */
> > +			if (sp->unsync_children && !mmu_sync_children(vcpu, sp, false)) {
> 
> I don't like dropping mmu_lock in the page fault path.  I agree that it's not
> all that different than grabbing various things in kvm_mmu_do_page_fault() long
> before acquiring mmu_lock, but (a) I'm not 100% convinced we don't have a latent
> bug hiding somewhere in there :-), and (b) there's a possibility, however small,
> that there's something in FNAME(fetch) that we're missing.  Case in point, this
> technically needs to do make_mmu_pages_available().
> 
> And I believe kvm_mmu_get_page() already tries to handle this case by requesting
> KVM_REQ_MMU_SYNC if it uses a sp with unsync_children, it just doesn't handle SMP
> interaction, e.g. can link a sp that's immediately available to other vCPUs before
> the sync.
> 
> Rather than force the sync here, what about kicking all vCPUs and retrying the
> page fault?  The only gross part is that kvm_mmu_get_page() can now fail :-(
> 
> ---
>  arch/x86/include/asm/kvm_host.h | 3 ++-
>  arch/x86/kvm/mmu/mmu.c          | 9 +++++++--
>  arch/x86/kvm/mmu/paging_tmpl.h  | 4 ++++
>  3 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 09b256db394a..332b9fb3454c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -57,7 +57,8 @@
>  #define KVM_REQ_MIGRATE_TIMER		KVM_ARCH_REQ(0)
>  #define KVM_REQ_REPORT_TPR_ACCESS	KVM_ARCH_REQ(1)
>  #define KVM_REQ_TRIPLE_FAULT		KVM_ARCH_REQ(2)
> -#define KVM_REQ_MMU_SYNC		KVM_ARCH_REQ(3)
> +#define KVM_REQ_MMU_SYNC \
> +	KVM_ARCH_REQ_FLAGS(3, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_CLOCK_UPDATE		KVM_ARCH_REQ(4)
>  #define KVM_REQ_LOAD_MMU_PGD		KVM_ARCH_REQ(5)
>  #define KVM_REQ_EVENT			KVM_ARCH_REQ(6)
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4853c033e6ce..03293cd3c7ae 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2143,8 +2143,10 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  			kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
>  		}
> 
> -		if (sp->unsync_children)
> -			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> +		if (sp->unsync_children) {
> +			kvm_make_all_cpus_request(KVM_REQ_MMU_SYNC, vcpu);

I don't know the KVM mmu well, so I'm most likely missing something here,
but why switch to kvm_make_all_cpus_request?

The MMU is shared by all vCPUs, and the process of syncing it should also do the
remote TLB flushes when needed, shouldn't it?


Another thing I don't fully understand is why this patch is needed. If we link an SP which has unsync
children, we raise KVM_REQ_MMU_SYNC, which I think means that *this* vCPU will sync the whole MMU on next
guest entry, including these unsync child SPs. Could you explain this?

I am just curious, and I would like to understand KVM's mmu better.

Best regards,
	Maxim Levitsky


> +			return NULL;
> +		}
> 
>  		__clear_sp_write_flooding_count(sp);
> 
> @@ -2999,6 +3001,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> 
>  		sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
>  				      it.level - 1, true, ACC_ALL);
> +		BUG_ON(!sp);
> 
>  		link_shadow_page(vcpu, it.sptep, sp);
>  		if (fault->is_tdp && fault->huge_page_disallowed &&
> @@ -3383,6 +3386,8 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
>  	struct kvm_mmu_page *sp;
> 
>  	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
> +	BUG_ON(!sp);
> +
>  	++sp->root_count;
> 
>  	return __pa(sp->spt);
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 50ade6450ace..f573d45e2c6f 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -704,6 +704,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  			access = gw->pt_access[it.level - 2];
>  			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
>  					      it.level-1, false, access);
> +			if (!sp)
> +				return RET_PF_RETRY;
>  		}
> 
>  		/*
> @@ -742,6 +744,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  		if (!is_shadow_present_pte(*it.sptep)) {
>  			sp = kvm_mmu_get_page(vcpu, base_gfn, fault->addr,
>  					      it.level - 1, true, direct_access);
> +			BUG_ON(!sp);
> +
>  			link_shadow_page(vcpu, it.sptep, sp);
>  			if (fault->huge_page_disallowed &&
>  			    fault->req_level >= it.level)
> --
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it
  2021-09-13 11:30     ` Maxim Levitsky
@ 2021-09-13 20:49       ` Sean Christopherson
  2021-09-13 22:31         ` Maxim Levitsky
  0 siblings, 1 reply; 30+ messages in thread
From: Sean Christopherson @ 2021-09-13 20:49 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Lai Jiangshan, linux-kernel, Paolo Bonzini, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Marcelo Tosatti, Avi Kivity, kvm

On Mon, Sep 13, 2021, Maxim Levitsky wrote:
> On Thu, 2021-09-02 at 23:40 +0000, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 4853c033e6ce..03293cd3c7ae 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -2143,8 +2143,10 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >  			kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> >  		}
> > 
> > -		if (sp->unsync_children)
> > -			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> > +		if (sp->unsync_children) {
> > +			kvm_make_all_cpus_request(KVM_REQ_MMU_SYNC, vcpu);
> 
> I don't know the KVM mmu well, so I'm most likely missing something here,
> but why switch to kvm_make_all_cpus_request?
> 
> The MMU is shared by all vCPUs, and the process of syncing it should also do
> the remote TLB flushes when needed, shouldn't it?
> 
> Another thing I don't fully understand is why this patch is needed. If we
> link an SP which has unsync children, we raise KVM_REQ_MMU_SYNC, which I
> think means that *this* vCPU will sync the whole MMU on next guest entry,
> including these unsync child SPs. Could you explain this?

Answering all three questions at once, the problem is that KVM links in a new SP
that points at unsync'd SPs _before_ servicing KVM_REQ_MMU_SYNC.  While the vCPU
is guaranteed to service KVM_REQ_MMU_SYNC before entering the guest, that doesn't
hold true for other vCPUs.  As a result, there's a window where a different vCPU
can consume the stale, unsync SP via the new SP.
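
Illustrative interleaving (not from any patch, just spelling out the above):

  vCPU0                                     vCPU1
  -----                                     -----
  kvm_mmu_get_page() returns a sp with
    unsync_children, KVM_REQ_MMU_SYNC is
    pending on vCPU0 only
  link_shadow_page() publishes the sp
                                            walks through the new link and
                                            uses a stale SPTE in the
                                            still-unsync child
  services KVM_REQ_MMU_SYNC (too late
    for vCPU1)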

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it
  2021-09-13 20:49       ` Sean Christopherson
@ 2021-09-13 22:31         ` Maxim Levitsky
  0 siblings, 0 replies; 30+ messages in thread
From: Maxim Levitsky @ 2021-09-13 22:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Lai Jiangshan, linux-kernel, Paolo Bonzini, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Marcelo Tosatti, Avi Kivity, kvm

On Mon, 2021-09-13 at 20:49 +0000, Sean Christopherson wrote:
> On Mon, Sep 13, 2021, Maxim Levitsky wrote:
> > On Thu, 2021-09-02 at 23:40 +0000, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 4853c033e6ce..03293cd3c7ae 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -2143,8 +2143,10 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >  			kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> > >  		}
> > > 
> > > -		if (sp->unsync_children)
> > > -			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> > > +		if (sp->unsync_children) {
> > > +			kvm_make_all_cpus_request(KVM_REQ_MMU_SYNC, vcpu);
> > 
> > I don't know the KVM mmu well, so I'm most likely missing something here,
> > but why switch to kvm_make_all_cpus_request?
> > 
> > The MMU is shared by all vCPUs, and the process of syncing it should also do
> > the remote TLB flushes when needed, shouldn't it?
> > 
> > Another thing I don't fully understand is why this patch is needed. If we
> > link an SP which has unsync children, we raise KVM_REQ_MMU_SYNC, which I
> > think means that *this* vCPU will sync the whole MMU on next guest entry,
> > including these unsync child SPs. Could you explain this?
> 
> Answering all three questions at once, the problem is that KVM links in a new SP
> that points at unsync'd SPs _before_ servicing KVM_REQ_MMU_SYNC.  While the vCPU
> is guaranteed to service KVM_REQ_MMU_SYNC before entering the guest, that doesn't
> hold true for other vCPUs.  As a result, there's a window where a different vCPU
> can consume the stale, unsync SP via the new SP.
> 
Thank you, now I understand!

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/7] KVM: X86: Don't unsync pagetables when speculative
  2021-09-13 11:02   ` Maxim Levitsky
@ 2021-09-18  3:06     ` Lai Jiangshan
  0 siblings, 0 replies; 30+ messages in thread
From: Lai Jiangshan @ 2021-09-18  3:06 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: LKML, Paolo Bonzini, Lai Jiangshan, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, X86 ML,
	H. Peter Anvin, kvm

It is weird that I did not receive this email.

On Mon, Sep 13, 2021 at 7:02 PM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> On Tue, 2021-08-24 at 15:55 +0800, Lai Jiangshan wrote:
> > From: Lai Jiangshan <laijs@linux.alibaba.com>
> >
> > We'd better only unsync the pagetable when there has just been a real
> > write fault on a level-1 pagetable.
> >
> > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c          | 6 +++++-
> >  arch/x86/kvm/mmu/mmu_internal.h | 3 ++-
> >  arch/x86/kvm/mmu/spte.c         | 2 +-
> >  3 files changed, 8 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index a165eb8713bc..e5932af6f11c 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -2600,7 +2600,8 @@ static void kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> >   * were marked unsync (or if there is no shadow page), -EPERM if the SPTE must
> >   * be write-protected.
> >   */
> > -int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync)
> > +int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync,
> > +                         bool speculative)
> >  {
> >       struct kvm_mmu_page *sp;
> >       bool locked = false;
> > @@ -2626,6 +2627,9 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync)
> >               if (sp->unsync)
> >                       continue;
> >
> > +             if (speculative)
> > +                     return -EEXIST;
>
> Wouldn't it be better to ensure that callers set can_unsync = false when speculating?

I don't want to change the current behavior of "can_unsync"

For a gfn:
  case1: All sps for the gfn are synced
  case2: Some sps for the gfn are synced and the others are not
  case3: All sps for the gfn are not synced

"!can_unsync" causes the function to return non-zero for all cases.
"speculative" causes the function to return non-zero for case1,case2.

I don't think it would be a bug if the behavior of the old "!can_unsync" were
changed to the behavior of this new "speculative".  But the meaning of
"!can_unsync" would have to be changed.

!can_unsync: all sps for @gfn can't be unsync.  (derived from current code)
==>
!can_unsync: it should not do any unsync operation.

I have sent the patch in V2 without any change.  If the new meaning
is preferred, I will respin the patch, or I will send it separately
if no other patches in V2 need to be updated.

>
> Also, if I understand correctly, this is not fixing a bug, but an optimization?
>

It is not fixing any bugs.  But it is weird to do the unsync operation on sps when
speculative, which would cause future overhead for no reason.

> Best regards,
>         Maxim Levitsky
>
>
> > +
> >               /*
> >                * TDP MMU page faults require an additional spinlock as they
> >                * run with mmu_lock held for read, not write, and the unsync
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 658d8d228d43..f5d8be787993 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -116,7 +116,8 @@ static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
> >              kvm_x86_ops.cpu_dirty_log_size;
> >  }
> >
> > -int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync);
> > +int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync,
> > +                         bool speculative);
> >
> >  void kvm_mmu_gfn_disallow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
> >  void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
> > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> > index 3e97cdb13eb7..b68a580f3510 100644
> > --- a/arch/x86/kvm/mmu/spte.c
> > +++ b/arch/x86/kvm/mmu/spte.c
> > @@ -159,7 +159,7 @@ int make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
> >                * e.g. it's write-tracked (upper-level SPs) or has one or more
> >                * shadow pages and unsync'ing pages is not allowed.
> >                */
> > -             if (mmu_try_to_unsync_pages(vcpu, gfn, can_unsync)) {
> > +             if (mmu_try_to_unsync_pages(vcpu, gfn, can_unsync, speculative)) {
> >                       pgprintk("%s: found shadow page for %llx, marking ro\n",
> >                                __func__, gfn);
> >                       ret |= SET_SPTE_WRITE_PROTECTED_PT;
>
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2021-09-18  3:06 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-24  7:55 [PATCH 0/7] KVM: X86: MMU: misc fixes and cleanups Lai Jiangshan
2021-08-24  7:55 ` [PATCH 1/7] KVM: X86: Fix missed remote tlb flush in rmap_write_protect() Lai Jiangshan
2021-09-02 21:38   ` Sean Christopherson
2021-09-13  9:57   ` Maxim Levitsky
2021-08-24  7:55 ` [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it Lai Jiangshan
2021-09-02 23:40   ` Sean Christopherson
2021-09-02 23:54     ` Sean Christopherson
2021-09-03  0:44       ` Lai Jiangshan
2021-09-03 16:06         ` Sean Christopherson
2021-09-03 16:25           ` Lai Jiangshan
2021-09-03 16:40             ` Sean Christopherson
2021-09-03 17:00               ` Lai Jiangshan
2021-09-03 16:33           ` Lai Jiangshan
2021-09-03  0:51     ` Lai Jiangshan
2021-09-13 11:30     ` Maxim Levitsky
2021-09-13 20:49       ` Sean Christopherson
2021-09-13 22:31         ` Maxim Levitsky
2021-08-24  7:55 ` [PATCH 3/7] KVM: X86: Zap the invalid list after remote tlb flushing Lai Jiangshan
2021-09-02 21:54   ` Sean Christopherson
2021-08-24  7:55 ` [PATCH 4/7] KVM: X86: Remove FNAME(update_pte) Lai Jiangshan
2021-09-13  9:49   ` Maxim Levitsky
2021-08-24  7:55 ` [PATCH 5/7] KVM: X86: Don't unsync pagetables when speculative Lai Jiangshan
2021-09-13 11:02   ` Maxim Levitsky
2021-09-18  3:06     ` Lai Jiangshan
2021-08-24  7:55 ` [PATCH 6/7] KVM: X86: Don't check unsync if the original spte is writible Lai Jiangshan
2021-08-24  7:55 ` [PATCH 7/7] KVM: X86: Also prefetch the last range in __direct_pte_prefetch() Lai Jiangshan
2021-08-25 15:18   ` Sean Christopherson
2021-08-25 22:58     ` Lai Jiangshan
2021-08-31 18:02 ` [PATCH 0/7] KVM: X86: MMU: misc fixes and cleanups Lai Jiangshan
2021-08-31 21:57   ` Sean Christopherson
