* [PATCH v6 0/7] KVM: MMU: fast zap all shadow pages
@ 2013-05-16 21:12 Xiao Guangrong
  2013-05-16 21:12 ` [PATCH v6 1/7] KVM: MMU: drop unnecessary kvm_reload_remote_mmus Xiao Guangrong
                   ` (7 more replies)
  0 siblings, 8 replies; 30+ messages in thread
From: Xiao Guangrong @ 2013-05-16 21:12 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

The benchmark and the result can be found at:
http://www.spinics.net/lists/kvm/msg91391.html

Changelog:
V6:
  1): walk active_list in reverse to skip the newly created pages, based
      on the comments from Gleb and Paolo.

  2): completely replace kvm_mmu_zap_all with kvm_mmu_invalidate_all_pages
      based on Gleb's comments.

  3): improve the parameters of kvm_mmu_invalidate_all_pages based on
      Gleb's comments.
 
  4): rename kvm_mmu_invalidate_memslot_pages to kvm_mmu_invalidate_all_pages
  5): rename zap_invalid_pages to kvm_zap_obsolete_pages

V5:
  1): rename is_valid_sp to is_obsolete_sp
  2): use the lock-break technique to zap all old pages instead of only the
      pages linked on the invalid slot's rmap, as suggested by Marcelo.
  3): trace invalid pages and kvm_mmu_invalidate_memslot_pages()
  4): rename kvm_mmu_invalid_memslot_pages to kvm_mmu_invalidate_memslot_pages
      according to Takuya's comments.

V4:
  1): drop unmapping the invalid rmap outside of mmu-lock and use the
      lock-break technique instead. Thanks to Gleb's comments.

  2): no need to handle invalid-gen pages specially, because the page table
      is always switched by KVM_REQ_MMU_RELOAD. Thanks to Marcelo's comments.

V3:
  completely redesign the algorithm; please see below.

V2:
  - do not reset n_requested_mmu_pages and n_max_mmu_pages
  - batch free root shadow pages to reduce vcpu notification and mmu-lock
    contention
  - remove the first patch that introduces kvm->arch.mmu_cache since we only
    'memset zero' the hashtable rather than all mmu cache members in this
    version
  - remove unnecessary kvm_reload_remote_mmus after kvm_mmu_zap_all

* Issue
The current kvm_mmu_zap_all is really slow: it holds mmu-lock while it
walks and zaps all shadow pages one by one, and it also needs to zap every
guest page's rmap and every shadow page's parent spte list. Things get
worse as the guest uses more memory or more vcpus, so it does not scale
well.

* Idea
KVM maintains a global mmu generation number, which is stored in
kvm->arch.mmu_valid_gen, and every shadow page stores the current global
generation number into sp->mmu_valid_gen when it is created.

When KVM needs to zap all shadow pages' sptes, it simply increments the
global generation number and then reloads the root shadow pages on all
vcpus. Each vcpu then creates a new shadow page table according to kvm's
current generation number, which ensures the old pages are not used any
more.

Then the invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
are zapped using the lock-break technique.
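
The idea above can be illustrated with a small userspace model. This is only a sketch, not the kernel code: `toy_vm`, `toy_page` and every `toy_*` function are hypothetical stand-ins invented for illustration, and locking, vcpu reload and the actual zapping are left out. It only shows why bumping one counter is enough to mark every existing page obsolete without touching any of them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct kvm and struct kvm_mmu_page. */
struct toy_vm {
	unsigned long valid_gen;	/* plays the role of kvm->arch.mmu_valid_gen */
};

struct toy_page {
	unsigned long valid_gen;	/* plays the role of sp->mmu_valid_gen */
};

/* A page is stamped with the VM's current generation when it is created. */
static void toy_page_init(const struct toy_vm *vm, struct toy_page *sp)
{
	sp->valid_gen = vm->valid_gen;
}

/* Same comparison as is_obsolete_sp() in the series. */
static bool toy_is_obsolete(const struct toy_vm *vm, const struct toy_page *sp)
{
	return sp->valid_gen != vm->valid_gen;
}

/* O(1) "invalidate everything": bump the generation, touch no page. */
static void toy_invalidate_all(struct toy_vm *vm)
{
	vm->valid_gen++;
}

/* Count the pages that are still waiting for the deferred zap. */
static size_t toy_count_obsolete(const struct toy_vm *vm,
				 const struct toy_page *pages, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (toy_is_obsolete(vm, &pages[i]))
			count++;
	return count;
}
```

In this model, pages created before `toy_invalidate_all()` all become obsolete at once, while a page created afterwards carries the new generation and is not. The real series additionally has to make the vcpus drop their old roots, which the model does not attempt.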

Xiao Guangrong (7):
  KVM: MMU: drop unnecessary kvm_reload_remote_mmus
  KVM: MMU: delete shadow page from hash list in
    kvm_mmu_prepare_zap_page
  KVM: MMU: fast invalidate all pages
  KVM: MMU: zap pages in batch
  KVM: x86: use the fast way to invalidate all pages
  KVM: MMU: show mmu_valid_gen in shadow page related tracepoints
  KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages

 arch/x86/include/asm/kvm_host.h |    2 +
 arch/x86/kvm/mmu.c              |  115 +++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu.h              |    1 +
 arch/x86/kvm/mmutrace.h         |   45 ++++++++++++----
 arch/x86/kvm/x86.c              |   11 ++---
 5 files changed, 151 insertions(+), 23 deletions(-)

-- 
1.7.7.6



* [PATCH v6 1/7] KVM: MMU: drop unnecessary kvm_reload_remote_mmus
  2013-05-16 21:12 [PATCH v6 0/7] KVM: MMU: fast zap all shadow pages Xiao Guangrong
@ 2013-05-16 21:12 ` Xiao Guangrong
  2013-05-16 21:12 ` [PATCH v6 2/7] KVM: MMU: delete shadow page from hash list in kvm_mmu_prepare_zap_page Xiao Guangrong
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Xiao Guangrong @ 2013-05-16 21:12 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

Keeping the mmu and tlbs consistent is the responsibility of
kvm_mmu_zap_all. The reload is also unnecessary after zapping all mmio
sptes, since no mmio spte exists on a root shadow page and an mmio spte
cannot be cached in the tlb.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/x86.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8d28810..d885418 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7067,16 +7067,13 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 	 * If memory slot is created, or moved, we need to clear all
 	 * mmio sptes.
 	 */
-	if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
+	if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE))
 		kvm_mmu_zap_mmio_sptes(kvm);
-		kvm_reload_remote_mmus(kvm);
-	}
 }
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
 	kvm_mmu_zap_all(kvm);
-	kvm_reload_remote_mmus(kvm);
 }
 
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
-- 
1.7.7.6



* [PATCH v6 2/7] KVM: MMU: delete shadow page from hash list in kvm_mmu_prepare_zap_page
  2013-05-16 21:12 [PATCH v6 0/7] KVM: MMU: fast zap all shadow pages Xiao Guangrong
  2013-05-16 21:12 ` [PATCH v6 1/7] KVM: MMU: drop unnecessary kvm_reload_remote_mmus Xiao Guangrong
@ 2013-05-16 21:12 ` Xiao Guangrong
  2013-05-19 10:47   ` Gleb Natapov
  2013-05-16 21:12 ` [PATCH v6 3/7] KVM: MMU: fast invalidate all pages Xiao Guangrong
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Xiao Guangrong @ 2013-05-16 21:12 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

Move the deletion of a shadow page from the hash list out of
kvm_mmu_commit_zap_page and into kvm_mmu_prepare_zap_page, so that we can
call kvm_mmu_commit_zap_page once for multiple kvm_mmu_prepare_zap_page
calls. This helps us avoid unnecessary TLB flushes.
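
A rough userspace sketch of the prepare/commit split described above, under loud assumptions: `toy_mmu`, `toy_page`, `toy_prepare_zap` and `toy_commit_zap` are hypothetical names, the "flush" is just a counter, and nothing here is the kernel implementation. The point it tries to show is that once a page is unhashed at prepare time, any number of pages can be queued and then covered by a single flush at commit time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a shadow page queued for freeing. */
struct toy_page {
	struct toy_page *next_invalid;	/* link on the pending list */
	int hashed;			/* 1 while lookups can still find the page */
};

struct toy_mmu {
	struct toy_page *invalid_list;	/* pages prepared but not yet committed */
	unsigned int tlb_flushes;	/* how many (modelled) flushes were issued */
	unsigned int freed;		/* how many pages were released */
};

/* Prepare: unhash immediately so lookups stop finding the page, then queue it. */
static void toy_prepare_zap(struct toy_mmu *mmu, struct toy_page *sp)
{
	sp->hashed = 0;
	sp->next_invalid = mmu->invalid_list;
	mmu->invalid_list = sp;
}

/* Commit: one flush covers every queued page, then all of them are released. */
static void toy_commit_zap(struct toy_mmu *mmu)
{
	if (!mmu->invalid_list)
		return;

	mmu->tlb_flushes++;
	while (mmu->invalid_list) {
		struct toy_page *sp = mmu->invalid_list;

		mmu->invalid_list = sp->next_invalid;
		sp->next_invalid = NULL;
		mmu->freed++;
	}
}
```

Preparing four pages and committing once yields one modelled flush and four freed pages; committing again with an empty list does nothing.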

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 40d7b2d..682ecb4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
 static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 {
 	ASSERT(is_empty_shadow_page(sp->spt));
-	hlist_del(&sp->hash_link);
+
 	list_del(&sp->link);
 	free_page((unsigned long)sp->spt);
 	if (!sp->role.direct)
@@ -1655,7 +1655,8 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 
 #define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn)			\
 	for_each_gfn_sp(_kvm, _sp, _gfn)				\
-		if ((_sp)->role.direct || (_sp)->role.invalid) {} else
+		if ((_sp)->role.direct ||				\
+		      ((_sp)->role.invalid && WARN_ON(1))) {} else
 
 /* @sp->gfn should be write-protected at the call site */
 static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
@@ -2074,6 +2075,9 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 		unaccount_shadowed(kvm, sp->gfn);
 	if (sp->unsync)
 		kvm_unlink_unsync_page(kvm, sp);
+
+	hlist_del_init(&sp->hash_link);
+
 	if (!sp->root_count) {
 		/* Count self */
 		ret++;
-- 
1.7.7.6



* [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
  2013-05-16 21:12 [PATCH v6 0/7] KVM: MMU: fast zap all shadow pages Xiao Guangrong
  2013-05-16 21:12 ` [PATCH v6 1/7] KVM: MMU: drop unnecessary kvm_reload_remote_mmus Xiao Guangrong
  2013-05-16 21:12 ` [PATCH v6 2/7] KVM: MMU: delete shadow page from hash list in kvm_mmu_prepare_zap_page Xiao Guangrong
@ 2013-05-16 21:12 ` Xiao Guangrong
  2013-05-19 10:04   ` Gleb Natapov
  2013-05-20 19:46   ` Marcelo Tosatti
  2013-05-16 21:12 ` [PATCH v6 4/7] KVM: MMU: zap pages in batch Xiao Guangrong
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 30+ messages in thread
From: Xiao Guangrong @ 2013-05-16 21:12 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

The current kvm_mmu_zap_all is really slow: it holds mmu-lock while it
walks and zaps all shadow pages one by one, and it also needs to zap every
guest page's rmap and every shadow page's parent spte list. Things get
worse as the guest uses more memory or more vcpus, so it does not scale
well.

In this patch, we introduce a faster way to invalidate all shadow pages.
KVM maintains a global mmu generation number, which is stored in
kvm->arch.mmu_valid_gen, and every shadow page stores the current global
generation number into sp->mmu_valid_gen when it is created.

When KVM needs to zap all shadow pages' sptes, it simply increments the
global generation number and then reloads the root shadow pages on all
vcpus. Each vcpu then creates a new shadow page table according to kvm's
current generation number, which ensures the old pages are not used any
more. Then the invalid-gen pages
(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen) are zapped using the
lock-break technique.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/include/asm/kvm_host.h |    2 +
 arch/x86/kvm/mmu.c              |  103 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu.h              |    1 +
 3 files changed, 106 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3741c65..bff7d46 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -222,6 +222,7 @@ struct kvm_mmu_page {
 	int root_count;          /* Currently serving as active root */
 	unsigned int unsync_children;
 	unsigned long parent_ptes;	/* Reverse mapping for parent_pte */
+	unsigned long mmu_valid_gen;
 	DECLARE_BITMAP(unsync_child_bitmap, 512);
 
 #ifdef CONFIG_X86_32
@@ -529,6 +530,7 @@ struct kvm_arch {
 	unsigned int n_requested_mmu_pages;
 	unsigned int n_max_mmu_pages;
 	unsigned int indirect_shadow_pages;
+	unsigned long mmu_valid_gen;
 	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
 	/*
 	 * Hash table of struct kvm_mmu_page.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 682ecb4..891ad2c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1839,6 +1839,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sp);
 }
 
+static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
+}
+
 static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     gfn_t gfn,
 					     gva_t gaddr,
@@ -1865,6 +1870,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		role.quadrant = quadrant;
 	}
 	for_each_gfn_sp(vcpu->kvm, sp, gfn) {
+		if (is_obsolete_sp(vcpu->kvm, sp))
+			continue;
+
 		if (!need_sync && sp->unsync)
 			need_sync = true;
 
@@ -1901,6 +1909,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 
 		account_shadowed(vcpu->kvm, gfn);
 	}
+	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
 	init_shadow_page_table(sp);
 	trace_kvm_mmu_get_page(sp, true);
 	return sp;
@@ -2071,8 +2080,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 	ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
 	kvm_mmu_page_unlink_children(kvm, sp);
 	kvm_mmu_unlink_parents(kvm, sp);
+
 	if (!sp->role.invalid && !sp->role.direct)
 		unaccount_shadowed(kvm, sp->gfn);
+
 	if (sp->unsync)
 		kvm_unlink_unsync_page(kvm, sp);
 
@@ -4196,6 +4207,98 @@ restart:
 	spin_unlock(&kvm->mmu_lock);
 }
 
+static void kvm_zap_obsolete_pages(struct kvm *kvm)
+{
+	struct kvm_mmu_page *sp, *node;
+	LIST_HEAD(invalid_list);
+
+restart:
+	list_for_each_entry_safe_reverse(sp, node,
+	      &kvm->arch.active_mmu_pages, link) {
+		/*
+		 * No obsolete page exists before new created page since
+		 * active_mmu_pages is the FIFO list.
+		 */
+		if (!is_obsolete_sp(kvm, sp))
+			break;
+
+		/*
+		 * Do not repeatedly zap a root page to avoid unnecessary
+		 * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
+		 * progress:
+		 *    vcpu 0                        vcpu 1
+		 *                         call vcpu_enter_guest():
+		 *                            1): handle KVM_REQ_MMU_RELOAD
+		 *                                and require mmu-lock to
+		 *                                load mmu
+		 * repeat:
+		 *    1): zap root page and
+		 *        send KVM_REQ_MMU_RELOAD
+		 *
+		 *    2): if (cond_resched_lock(mmu-lock))
+		 *
+		 *                            2): hold mmu-lock and load mmu
+		 *
+		 *                            3): see KVM_REQ_MMU_RELOAD bit
+		 *                                on vcpu->requests is set
+		 *                                then return 1 to call
+		 *                                vcpu_enter_guest() again.
+		 *            goto repeat;
+		 *
+		 */
+		if (sp->role.invalid)
+			continue;
+		/*
+		 * Need not flush tlb since we only zap the sp with invalid
+		 * generation number.
+		 */
+		if (cond_resched_lock(&kvm->mmu_lock))
+			goto restart;
+
+		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
+			goto restart;
+	}
+
+	/*
+	 * Should flush tlb before free page tables since lockless-walking
+	 * may use the pages.
+	 */
+	kvm_mmu_commit_zap_page(kvm, &invalid_list);
+}
+
+/*
+ * Fast invalidate all shadow pages.
+ *
+ * @zap_obsolete_pages indicates whether all the obsolete pages should
+ * be zapped. This is required when memslot is being deleted or VM is
+ * being destroyed, in these cases, we should ensure that KVM MMU does
+ * not use any resource of the being-deleted slot or all slots after
+ * calling the function.
+ *
+ * @zap_obsolete_pages == false means the caller just wants to flush all
+ * shadow page tables.
+ */
+void kvm_mmu_invalidate_all_pages(struct kvm *kvm, bool zap_obsolete_pages)
+{
+	spin_lock(&kvm->mmu_lock);
+	kvm->arch.mmu_valid_gen++;
+
+	/*
+	 * Notify all vcpus to reload its shadow page table
+	 * and flush TLB. Then all vcpus will switch to new
+	 * shadow page table with the new mmu_valid_gen.
+	 *
+	 * Note: we should do this under the protection of
+	 * mmu-lock, otherwise, vcpu would purge shadow page
+	 * but miss tlb flush.
+	 */
+	kvm_reload_remote_mmus(kvm);
+
+	if (zap_obsolete_pages)
+		kvm_zap_obsolete_pages(kvm);
+	spin_unlock(&kvm->mmu_lock);
+}
+
 void kvm_mmu_zap_mmio_sptes(struct kvm *kvm)
 {
 	struct kvm_mmu_page *sp, *node;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2adcbc2..123d703 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -97,4 +97,5 @@ static inline bool permission_fault(struct kvm_mmu *mmu, unsigned pte_access,
 	return (mmu->permissions[pfec >> 1] >> pte_access) & 1;
 }
 
+void kvm_mmu_invalidate_all_pages(struct kvm *kvm, bool zap_obsolete_pages);
 #endif
-- 
1.7.7.6



* [PATCH v6 4/7] KVM: MMU: zap pages in batch
  2013-05-16 21:12 [PATCH v6 0/7] KVM: MMU: fast zap all shadow pages Xiao Guangrong
                   ` (2 preceding siblings ...)
  2013-05-16 21:12 ` [PATCH v6 3/7] KVM: MMU: fast invalidate all pages Xiao Guangrong
@ 2013-05-16 21:12 ` Xiao Guangrong
  2013-05-16 21:13 ` [PATCH v6 5/7] KVM: x86: use the fast way to invalidate all pages Xiao Guangrong
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Xiao Guangrong @ 2013-05-16 21:12 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

Zap at least 10 pages before releasing mmu-lock to reduce the overhead
caused by acquiring the lock.

[ It improves kernel building by 0.6% ~ 1% ]
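
To make the batching effect concrete, here is a toy userspace model, not the kernel loop. `toy_state`, `toy_yield_lock` and `toy_zap_all` are hypothetical; `toy_yield_lock` stands in for cond_resched_lock() and pessimistically assumes a waiter is always present, and each step zaps exactly one page, whereas the patch adds the return value of kvm_mmu_prepare_zap_page to the batch counter.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_BATCH 10	/* same value as BATCH_ZAP_PAGES in the patch */

struct toy_state {
	int remaining;		/* obsolete pages still to zap */
	int lock_breaks;	/* times the lock was dropped and retaken */
};

/* Hypothetical stand-in for cond_resched_lock(): a waiter is always present. */
static bool toy_yield_lock(struct toy_state *s)
{
	s->lock_breaks++;
	return true;
}

/* Zap everything, considering a lock break only after a full batch. */
static void toy_zap_all(struct toy_state *s, int batch_size)
{
	int batch = 0;

	assert(batch_size >= 1);	/* a zero batch would never make progress here */

	while (s->remaining > 0) {
		if (batch >= batch_size && toy_yield_lock(s)) {
			batch = 0;
			continue;	/* corresponds to "goto restart" */
		}
		s->remaining--;		/* zap one page */
		batch++;
	}
}
```

With 100 pages under these worst-case assumptions, a batch of `TOY_BATCH` drops the lock 9 times, while a batch of 1 drops it 99 times. That is the kind of lock traffic the patch is trying to cut, although the real saving depends on how often cond_resched_lock() actually has to yield.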

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c |   14 ++++++++++++--
 1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 891ad2c..7ad0e50 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4207,14 +4207,18 @@ restart:
 	spin_unlock(&kvm->mmu_lock);
 }
 
+#define BATCH_ZAP_PAGES	10
 static void kvm_zap_obsolete_pages(struct kvm *kvm)
 {
 	struct kvm_mmu_page *sp, *node;
 	LIST_HEAD(invalid_list);
+	int batch = 0;
 
 restart:
 	list_for_each_entry_safe_reverse(sp, node,
 	      &kvm->arch.active_mmu_pages, link) {
+		int ret;
+
 		/*
 		 * No obsolete page exists before new created page since
 		 * active_mmu_pages is the FIFO list.
@@ -4252,10 +4256,16 @@ restart:
 		 * Need not flush tlb since we only zap the sp with invalid
 		 * generation number.
 		 */
-		if (cond_resched_lock(&kvm->mmu_lock))
+		if ((batch >= BATCH_ZAP_PAGES) &&
+		      cond_resched_lock(&kvm->mmu_lock)) {
+			batch = 0;
 			goto restart;
+		}
 
-		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
+		ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+		batch += ret;
+
+		if (ret)
 			goto restart;
 	}
 
-- 
1.7.7.6



* [PATCH v6 5/7] KVM: x86: use the fast way to invalidate all pages
  2013-05-16 21:12 [PATCH v6 0/7] KVM: MMU: fast zap all shadow pages Xiao Guangrong
                   ` (3 preceding siblings ...)
  2013-05-16 21:12 ` [PATCH v6 4/7] KVM: MMU: zap pages in batch Xiao Guangrong
@ 2013-05-16 21:13 ` Xiao Guangrong
  2013-05-16 21:13 ` [PATCH v6 6/7] KVM: MMU: show mmu_valid_gen in shadow page related tracepoints Xiao Guangrong
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Xiao Guangrong @ 2013-05-16 21:13 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

Replace kvm_mmu_zap_all with kvm_mmu_invalidate_all_pages.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c |   15 ---------------
 arch/x86/kvm/x86.c |    6 +++---
 2 files changed, 3 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7ad0e50..89b51dc 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4192,21 +4192,6 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
 	spin_unlock(&kvm->mmu_lock);
 }
 
-void kvm_mmu_zap_all(struct kvm *kvm)
-{
-	struct kvm_mmu_page *sp, *node;
-	LIST_HEAD(invalid_list);
-
-	spin_lock(&kvm->mmu_lock);
-restart:
-	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link)
-		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
-			goto restart;
-
-	kvm_mmu_commit_zap_page(kvm, &invalid_list);
-	spin_unlock(&kvm->mmu_lock);
-}
-
 #define BATCH_ZAP_PAGES	10
 static void kvm_zap_obsolete_pages(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d885418..30a990c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5528,7 +5528,7 @@ static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt)
 	 * to ensure that the updated hypercall appears atomically across all
 	 * VCPUs.
 	 */
-	kvm_mmu_zap_all(vcpu->kvm);
+	kvm_mmu_invalidate_all_pages(vcpu->kvm, false);
 
 	kvm_x86_ops->patch_hypercall(vcpu, instruction);
 
@@ -7073,13 +7073,13 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
-	kvm_mmu_zap_all(kvm);
+	kvm_mmu_invalidate_all_pages(kvm, true);
 }
 
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 				   struct kvm_memory_slot *slot)
 {
-	kvm_arch_flush_shadow_all(kvm);
+	kvm_mmu_invalidate_all_pages(kvm, true);
 }
 
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
-- 
1.7.7.6



* [PATCH v6 6/7] KVM: MMU: show mmu_valid_gen in shadow page related tracepoints
  2013-05-16 21:12 [PATCH v6 0/7] KVM: MMU: fast zap all shadow pages Xiao Guangrong
                   ` (4 preceding siblings ...)
  2013-05-16 21:13 ` [PATCH v6 5/7] KVM: x86: use the fast way to invalidate all pages Xiao Guangrong
@ 2013-05-16 21:13 ` Xiao Guangrong
  2013-05-16 21:13 ` [PATCH v6 7/7] KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages Xiao Guangrong
  2013-05-19 10:49 ` [PATCH v6 0/7] KVM: MMU: fast zap all shadow pages Gleb Natapov
  7 siblings, 0 replies; 30+ messages in thread
From: Xiao Guangrong @ 2013-05-16 21:13 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

Show sp->mmu_valid_gen

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmutrace.h |   22 ++++++++++++----------
 1 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index b8f6172..697f466 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -7,16 +7,18 @@
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM kvmmmu
 
-#define KVM_MMU_PAGE_FIELDS \
-	__field(__u64, gfn) \
-	__field(__u32, role) \
-	__field(__u32, root_count) \
+#define KVM_MMU_PAGE_FIELDS			\
+	__field(unsigned long, mmu_valid_gen)	\
+	__field(__u64, gfn)			\
+	__field(__u32, role)			\
+	__field(__u32, root_count)		\
 	__field(bool, unsync)
 
-#define KVM_MMU_PAGE_ASSIGN(sp)			     \
-	__entry->gfn = sp->gfn;			     \
-	__entry->role = sp->role.word;		     \
-	__entry->root_count = sp->root_count;        \
+#define KVM_MMU_PAGE_ASSIGN(sp)				\
+	__entry->mmu_valid_gen = sp->mmu_valid_gen;	\
+	__entry->gfn = sp->gfn;				\
+	__entry->role = sp->role.word;			\
+	__entry->root_count = sp->root_count;		\
 	__entry->unsync = sp->unsync;
 
 #define KVM_MMU_PAGE_PRINTK() ({				        \
@@ -28,8 +30,8 @@
 								        \
 	role.word = __entry->role;					\
 									\
-	trace_seq_printf(p, "sp gfn %llx %u%s q%u%s %s%s"		\
-			 " %snxe root %u %s%c",				\
+	trace_seq_printf(p, "sp gen %lx gfn %llx %u%s q%u%s %s%s"	\
+			 " %snxe root %u %s%c",	__entry->mmu_valid_gen,	\
 			 __entry->gfn, role.level,			\
 			 role.cr4_pae ? " pae" : "",			\
 			 role.quadrant,					\
-- 
1.7.7.6



* [PATCH v6 7/7] KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages
  2013-05-16 21:12 [PATCH v6 0/7] KVM: MMU: fast zap all shadow pages Xiao Guangrong
                   ` (5 preceding siblings ...)
  2013-05-16 21:13 ` [PATCH v6 6/7] KVM: MMU: show mmu_valid_gen in shadow page related tracepoints Xiao Guangrong
@ 2013-05-16 21:13 ` Xiao Guangrong
  2013-05-19 10:49 ` [PATCH v6 0/7] KVM: MMU: fast zap all shadow pages Gleb Natapov
  7 siblings, 0 replies; 30+ messages in thread
From: Xiao Guangrong @ 2013-05-16 21:13 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

It is useful for debugging and development.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c      |    1 +
 arch/x86/kvm/mmutrace.h |   23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 89b51dc..2c512e8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4276,6 +4276,7 @@ restart:
 void kvm_mmu_invalidate_all_pages(struct kvm *kvm, bool zap_obsolete_pages)
 {
 	spin_lock(&kvm->mmu_lock);
+	trace_kvm_mmu_invalidate_all_pages(kvm, zap_obsolete_pages);
 	kvm->arch.mmu_valid_gen++;
 
 	/*
diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index 697f466..e13d253 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -276,6 +276,29 @@ TRACE_EVENT(
 		  __spte_satisfied(old_spte), __spte_satisfied(new_spte)
 	)
 );
+
+TRACE_EVENT(
+	kvm_mmu_invalidate_all_pages,
+	TP_PROTO(struct kvm *kvm, bool zap_obsolete_pages),
+	TP_ARGS(kvm, zap_obsolete_pages),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, mmu_valid_gen)
+		__field(unsigned int, mmu_used_pages)
+		__field(bool, zap_obsolete_pages)
+	),
+
+	TP_fast_assign(
+		__entry->mmu_valid_gen = kvm->arch.mmu_valid_gen;
+		__entry->mmu_used_pages = kvm->arch.n_used_mmu_pages;
+		__entry->zap_obsolete_pages = zap_obsolete_pages;
+	),
+
+	TP_printk("kvm-mmu-valid-gen %lx zap_obsolete_pages %d "
+		  "used_pages %x", __entry->mmu_valid_gen,
+		  __entry->zap_obsolete_pages, __entry->mmu_used_pages
+	)
+);
 #endif /* _TRACE_KVMMMU_H */
 
 #undef TRACE_INCLUDE_PATH
-- 
1.7.7.6



* Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
  2013-05-16 21:12 ` [PATCH v6 3/7] KVM: MMU: fast invalidate all pages Xiao Guangrong
@ 2013-05-19 10:04   ` Gleb Natapov
  2013-05-20  9:12     ` Xiao Guangrong
  2013-05-20 19:46   ` Marcelo Tosatti
  1 sibling, 1 reply; 30+ messages in thread
From: Gleb Natapov @ 2013-05-19 10:04 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm

On Fri, May 17, 2013 at 05:12:58AM +0800, Xiao Guangrong wrote:
> The current kvm_mmu_zap_all is really slow: it holds mmu-lock while it
> walks and zaps all shadow pages one by one, and it also needs to zap every
> guest page's rmap and every shadow page's parent spte list. Things get
> worse as the guest uses more memory or more vcpus, so it does not scale
> well.
> 
> In this patch, we introduce a faster way to invalidate all shadow pages.
> KVM maintains a global mmu generation number, which is stored in
> kvm->arch.mmu_valid_gen, and every shadow page stores the current global
> generation number into sp->mmu_valid_gen when it is created.
> 
> When KVM needs to zap all shadow pages' sptes, it simply increments the
> global generation number and then reloads the root shadow pages on all
> vcpus. Each vcpu then creates a new shadow page table according to kvm's
> current generation number, which ensures the old pages are not used any
> more. Then the invalid-gen pages
> (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen) are zapped using the
> lock-break technique.
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    2 +
>  arch/x86/kvm/mmu.c              |  103 +++++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/mmu.h              |    1 +
>  3 files changed, 106 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 3741c65..bff7d46 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -222,6 +222,7 @@ struct kvm_mmu_page {
>  	int root_count;          /* Currently serving as active root */
>  	unsigned int unsync_children;
>  	unsigned long parent_ptes;	/* Reverse mapping for parent_pte */
> +	unsigned long mmu_valid_gen;
>  	DECLARE_BITMAP(unsync_child_bitmap, 512);
>  
>  #ifdef CONFIG_X86_32
> @@ -529,6 +530,7 @@ struct kvm_arch {
>  	unsigned int n_requested_mmu_pages;
>  	unsigned int n_max_mmu_pages;
>  	unsigned int indirect_shadow_pages;
> +	unsigned long mmu_valid_gen;
>  	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
>  	/*
>  	 * Hash table of struct kvm_mmu_page.
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 682ecb4..891ad2c 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1839,6 +1839,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
>  	__clear_sp_write_flooding_count(sp);
>  }
>  
> +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> +	return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> +}
> +
>  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  					     gfn_t gfn,
>  					     gva_t gaddr,
> @@ -1865,6 +1870,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  		role.quadrant = quadrant;
>  	}
>  	for_each_gfn_sp(vcpu->kvm, sp, gfn) {
> +		if (is_obsolete_sp(vcpu->kvm, sp))
> +			continue;
> +
>  		if (!need_sync && sp->unsync)
>  			need_sync = true;
>  
> @@ -1901,6 +1909,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  
>  		account_shadowed(vcpu->kvm, gfn);
>  	}
> +	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
>  	init_shadow_page_table(sp);
>  	trace_kvm_mmu_get_page(sp, true);
>  	return sp;
> @@ -2071,8 +2080,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>  	ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
>  	kvm_mmu_page_unlink_children(kvm, sp);
>  	kvm_mmu_unlink_parents(kvm, sp);
> +
>  	if (!sp->role.invalid && !sp->role.direct)
>  		unaccount_shadowed(kvm, sp->gfn);
> +
>  	if (sp->unsync)
>  		kvm_unlink_unsync_page(kvm, sp);
>  
> @@ -4196,6 +4207,98 @@ restart:
>  	spin_unlock(&kvm->mmu_lock);
>  }
>  
> +static void kvm_zap_obsolete_pages(struct kvm *kvm)
> +{
> +	struct kvm_mmu_page *sp, *node;
> +	LIST_HEAD(invalid_list);
> +
> +restart:
> +	list_for_each_entry_safe_reverse(sp, node,
> +	      &kvm->arch.active_mmu_pages, link) {
> +		/*
> +		 * No obsolete page exists before new created page since
> +		 * active_mmu_pages is the FIFO list.
> +		 */
> +		if (!is_obsolete_sp(kvm, sp))
> +			break;
> +
> +		/*
> +		 * Do not repeatedly zap a root page to avoid unnecessary
> +		 * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
> +		 * progress:
> +		 *    vcpu 0                        vcpu 1
> +		 *                         call vcpu_enter_guest():
> +		 *                            1): handle KVM_REQ_MMU_RELOAD
> +		 *                                and require mmu-lock to
> +		 *                                load mmu
> +		 * repeat:
> +		 *    1): zap root page and
> +		 *        send KVM_REQ_MMU_RELOAD
> +		 *
> +		 *    2): if (cond_resched_lock(mmu-lock))
> +		 *
> +		 *                            2): hold mmu-lock and load mmu
> +		 *
> +		 *                            3): see KVM_REQ_MMU_RELOAD bit
> +		 *                                on vcpu->requests is set
> +		 *                                then return 1 to call
> +		 *                                vcpu_enter_guest() again.
> +		 *            goto repeat;
> +		 *
> +		 */
I am not sure why the above scenario will prevent us from progressing.
There is a finite number of root pages with an invalid generation number,
so eventually we will zap them all and vcpu1 will stop seeing the
KVM_REQ_MMU_RELOAD request.

This check here prevents unnecessary KVM_REQ_MMU_RELOAD, as you say, but
this raises the question: why don't we check for sp->role.invalid in
kvm_mmu_prepare_zap_page before calling kvm_reload_remote_mmus()?
Something like this:

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 40d7b2d..d2ae3a4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2081,7 +2081,8 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 		kvm_mod_used_mmu_pages(kvm, -1);
 	} else {
 		list_move(&sp->link, &kvm->arch.active_mmu_pages);
-		kvm_reload_remote_mmus(kvm);
+		if (!sp->role.invalid)
+			kvm_reload_remote_mmus(kvm);
 	}
 
 	sp->role.invalid = 1;

Actually, we can add a check for is_obsolete_sp() there too, since
kvm_mmu_invalidate_all_pages() already calls kvm_reload_remote_mmus()
after incrementing mmu_valid_gen.

Or am I missing something?

> +		if (sp->role.invalid)
> +			continue;
> +		/*
> +		 * Need not flush tlb since we only zap the sp with invalid
> +		 * generation number.
> +		 */
> +		if (cond_resched_lock(&kvm->mmu_lock))
> +			goto restart;
> +
> +		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> +			goto restart;
> +	}
> +
> +	/*
> +	 * Should flush tlb before free page tables since lockless-walking
> +	 * may use the pages.
> +	 */
> +	kvm_mmu_commit_zap_page(kvm, &invalid_list);
> +}
> +
> +/*
> + * Fast invalidate all shadow pages.
> + *
> + * @zap_obsolete_pages indicates whether all the obsolete pages should
> + * be zapped. This is required when memslot is being deleted or VM is
> + * being destroyed, in these cases, we should ensure that KVM MMU does
> + * not use any resource of the being-deleted slot or all slots after
> + * calling the function.
> + *
> + * @zap_obsolete_pages == false means the caller just wants to flush all
> + * shadow page tables.
> + */
> +void kvm_mmu_invalidate_all_pages(struct kvm *kvm, bool zap_obsolete_pages)
> +{
> +	spin_lock(&kvm->mmu_lock);
> +	kvm->arch.mmu_valid_gen++;
> +
> +	/*
> +	 * Notify all vcpus to reload its shadow page table
> +	 * and flush TLB. Then all vcpus will switch to new
> +	 * shadow page table with the new mmu_valid_gen.
> +	 *
> +	 * Note: we should do this under the protection of
> +	 * mmu-lock, otherwise, vcpu would purge shadow page
> +	 * but miss tlb flush.
> +	 */
> +	kvm_reload_remote_mmus(kvm);
> +
> +	if (zap_obsolete_pages)
> +		kvm_zap_obsolete_pages(kvm);
> +	spin_unlock(&kvm->mmu_lock);
> +}
> +
>  void kvm_mmu_zap_mmio_sptes(struct kvm *kvm)
>  {
>  	struct kvm_mmu_page *sp, *node;
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 2adcbc2..123d703 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -97,4 +97,5 @@ static inline bool permission_fault(struct kvm_mmu *mmu, unsigned pte_access,
>  	return (mmu->permissions[pfec >> 1] >> pte_access) & 1;
>  }
>  
> +void kvm_mmu_invalidate_all_pages(struct kvm *kvm, bool zap_obsolete_pages);
>  #endif
> -- 
> 1.7.7.6

--
			Gleb.


* Re: [PATCH v6 2/7] KVM: MMU: delete shadow page from hash list in kvm_mmu_prepare_zap_page
  2013-05-16 21:12 ` [PATCH v6 2/7] KVM: MMU: delete shadow page from hash list in kvm_mmu_prepare_zap_page Xiao Guangrong
@ 2013-05-19 10:47   ` Gleb Natapov
  2013-05-20  9:19     ` Xiao Guangrong
  0 siblings, 1 reply; 30+ messages in thread
From: Gleb Natapov @ 2013-05-19 10:47 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm

On Fri, May 17, 2013 at 05:12:57AM +0800, Xiao Guangrong wrote:
> Move the deletion of a shadow page from the hash list from
> kvm_mmu_commit_zap_page to kvm_mmu_prepare_zap_page so that we can call
> kvm_mmu_commit_zap_page once for multiple kvm_mmu_prepare_zap_page calls,
> which helps us avoid unnecessary TLB flushes.
> 
Don't we call kvm_mmu_commit_zap_page() once for multiple
kvm_mmu_prepare_zap_page() now when possible? kvm_mmu_commit_zap_page()
gets a list as a parameter. I am not against the change, but wish to
understand it better.

> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/mmu.c |    8 ++++++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 40d7b2d..682ecb4 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
>  static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>  {
>  	ASSERT(is_empty_shadow_page(sp->spt));
> -	hlist_del(&sp->hash_link);
> +
>  	list_del(&sp->link);
>  	free_page((unsigned long)sp->spt);
>  	if (!sp->role.direct)
> @@ -1655,7 +1655,8 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>  
>  #define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn)			\
>  	for_each_gfn_sp(_kvm, _sp, _gfn)				\
> -		if ((_sp)->role.direct || (_sp)->role.invalid) {} else
> +		if ((_sp)->role.direct ||				\
> +		      ((_sp)->role.invalid && WARN_ON(1))) {} else
>  
>  /* @sp->gfn should be write-protected at the call site */
>  static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> @@ -2074,6 +2075,9 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>  		unaccount_shadowed(kvm, sp->gfn);
>  	if (sp->unsync)
>  		kvm_unlink_unsync_page(kvm, sp);
> +
> +	hlist_del_init(&sp->hash_link);
> +
What about moving this inside the if() below and making it hlist_del()?
That leaves the page on the hash list if root_count is non-zero.

>  	if (!sp->root_count) {
>  		/* Count self */
>  		ret++;
> -- 
> 1.7.7.6

--
			Gleb.


* Re: [PATCH v6 0/7] KVM: MMU: fast zap all shadow pages
  2013-05-16 21:12 [PATCH v6 0/7] KVM: MMU: fast zap all shadow pages Xiao Guangrong
                   ` (6 preceding siblings ...)
  2013-05-16 21:13 ` [PATCH v6 7/7] KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages Xiao Guangrong
@ 2013-05-19 10:49 ` Gleb Natapov
  7 siblings, 0 replies; 30+ messages in thread
From: Gleb Natapov @ 2013-05-19 10:49 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm

On Fri, May 17, 2013 at 05:12:55AM +0800, Xiao Guangrong wrote:
> The benchmark and the result can be found at:
> http://www.spinics.net/lists/kvm/msg91391.html
> 
I asked a couple of questions on some patches, but overall this looks
good to me. Marcelo, can you look at this too?

> Changlog:
> V6:
>   1): reversely walk active_list to skip the new created pages based
>       on the comments from Gleb and Paolo.
> 
>   2): completely replace kvm_mmu_zap_all by kvm_mmu_invalidate_all_pages
>       based on Gleb's comments.
> 
>   3): improve the parameters of kvm_mmu_invalidate_all_pages based on
>       Gleb's comments.
>  
>   4): rename kvm_mmu_invalidate_memslot_pages to kvm_mmu_invalidate_all_pages
>   5): rename zap_invalid_pages to kvm_zap_obsolete_pages
> 
> V5:
>   1): rename is_valid_sp to is_obsolete_sp
>   2): use lock-break technique to zap all old pages instead of only pages
>       linked on invalid slot's rmap suggested by Marcelo.
>   3): trace invalid pages and kvm_mmu_invalidate_memslot_pages()
>   4): rename kvm_mmu_invalid_memslot_pages to kvm_mmu_invalidate_memslot_pages
>       according to Takuya's comments.
> 
> V4:
>   1): drop unmapping invalid rmap out of mmu-lock and use lock-break technique
>       instead. Thanks to Gleb's comments.
> 
>   2): needn't handle invalid-gen pages specially due to page table always
>       switched by KVM_REQ_MMU_RELOAD. Thanks to Marcelo's comments.
> 
> V3:
>   completely redesign the algorithm, please see below.
> 
> V2:
>   - do not reset n_requested_mmu_pages and n_max_mmu_pages
>   - batch free root shadow pages to reduce vcpu notification and mmu-lock
>     contention
>   - remove the first patch that introduce kvm->arch.mmu_cache since we only
>     'memset zero' on hashtable rather than all mmu cache members in this
>     version
>   - remove unnecessary kvm_reload_remote_mmus after kvm_mmu_zap_all
> 
> * Issue
> The current kvm_mmu_zap_all is really slow: it holds mmu-lock while it
> walks and zaps all shadow pages one by one, and it also needs to zap every
> guest page's rmap and every shadow page's parent spte list. Things get
> worse as the guest uses more memory or vcpus, so it does not scale
> well.
> 
> * Idea
> KVM maintains a global mmu-invalidation generation number, stored in
> kvm->arch.mmu_valid_gen, and every shadow page stores the current global
> generation number in sp->mmu_valid_gen when it is created.
> 
> When KVM needs to zap all shadow page sptes, it simply increases the
> global generation number and then reloads the root shadow pages on all vcpus.
> Each vcpu will create a new shadow page table according to kvm's current
> generation number. This ensures the old pages are not used any more.
> 
> Then the invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
> are zapped using the lock-break technique.
> 
> Xiao Guangrong (7):
>   KVM: MMU: drop unnecessary kvm_reload_remote_mmus
>   KVM: MMU: delete shadow page from hash list in
>     kvm_mmu_prepare_zap_page
>   KVM: MMU: fast invalidate all pages
>   KVM: MMU: zap pages in batch
>   KVM: x86: use the fast way to invalidate all pages
>   KVM: MMU: show mmu_valid_gen in shadow page related tracepoints
>   KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages
> 
>  arch/x86/include/asm/kvm_host.h |    2 +
>  arch/x86/kvm/mmu.c              |  115 +++++++++++++++++++++++++++++++++++++--
>  arch/x86/kvm/mmu.h              |    1 +
>  arch/x86/kvm/mmutrace.h         |   45 ++++++++++++----
>  arch/x86/kvm/x86.c              |   11 ++---
>  5 files changed, 151 insertions(+), 23 deletions(-)
> 
> -- 
> 1.7.7.6

--
			Gleb.


* Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
  2013-05-19 10:04   ` Gleb Natapov
@ 2013-05-20  9:12     ` Xiao Guangrong
  0 siblings, 0 replies; 30+ messages in thread
From: Xiao Guangrong @ 2013-05-20  9:12 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm

On 05/19/2013 06:04 PM, Gleb Natapov wrote:

>> +		/*
>> +		 * Do not repeatedly zap a root page to avoid unnecessary
>> +		 * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
>> +		 * progress:
>> +		 *    vcpu 0                        vcpu 1
>> +		 *                         call vcpu_enter_guest():
>> +		 *                            1): handle KVM_REQ_MMU_RELOAD
>> +		 *                                and require mmu-lock to
>> +		 *                                load mmu
>> +		 * repeat:
>> +		 *    1): zap root page and
>> +		 *        send KVM_REQ_MMU_RELOAD
>> +		 *
>> +		 *    2): if (cond_resched_lock(mmu-lock))
>> +		 *
>> +		 *                            2): hold mmu-lock and load mmu
>> +		 *
>> +		 *                            3): see KVM_REQ_MMU_RELOAD bit
>> +		 *                                on vcpu->requests is set
>> +		 *                                then return 1 to call
>> +		 *                                vcpu_enter_guest() again.
>> +		 *            goto repeat;
>> +		 *
>> +		 */
> I am not sure why the above scenario would prevent us from progressing.
> There is a finite number of root pages with an invalid generation number, so
> eventually we will zap them all and vcpu1 will stop seeing the
> KVM_REQ_MMU_RELOAD request.

This patch does not include "zap pages in batch", so kvm_zap_obsolete_pages() can
end up only zapping invalid root pages and then lock-breaking, because of lock
contention on the path that handles KVM_REQ_MMU_RELOAD.

Yes, after "zap pages in batch" this issue no longer exists. I should
update this in that patch.

> 
> The check here prevents unnecessary KVM_REQ_MMU_RELOAD requests, as you say, but
> this raises the question: why don't we check for sp->role.invalid in
> kvm_mmu_prepare_zap_page() before calling kvm_reload_remote_mmus()?
> Something like this:
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 40d7b2d..d2ae3a4 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2081,7 +2081,8 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>  		kvm_mod_used_mmu_pages(kvm, -1);
>  	} else {
>  		list_move(&sp->link, &kvm->arch.active_mmu_pages);
> -		kvm_reload_remote_mmus(kvm);
> +		if (!sp->role.invalid)
> +			kvm_reload_remote_mmus(kvm);
>  	}
> 
>  	sp->role.invalid = 1;

Yes, it is better.

> 
> Actually, we can add a check for is_obsolete_sp() there too, since
> kvm_mmu_invalidate_all_pages() already calls kvm_reload_remote_mmus()
> after incrementing mmu_valid_gen.

Yes, I agree.

> 
> Or am I missing something?

No, you are right. ;)



* Re: [PATCH v6 2/7] KVM: MMU: delete shadow page from hash list in kvm_mmu_prepare_zap_page
  2013-05-19 10:47   ` Gleb Natapov
@ 2013-05-20  9:19     ` Xiao Guangrong
  2013-05-20  9:42       ` Gleb Natapov
  0 siblings, 1 reply; 30+ messages in thread
From: Xiao Guangrong @ 2013-05-20  9:19 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm

On 05/19/2013 06:47 PM, Gleb Natapov wrote:
> On Fri, May 17, 2013 at 05:12:57AM +0800, Xiao Guangrong wrote:
>> Move the deletion of a shadow page from the hash list from
>> kvm_mmu_commit_zap_page to kvm_mmu_prepare_zap_page so that we can call
>> kvm_mmu_commit_zap_page once for multiple kvm_mmu_prepare_zap_page calls,
>> which helps us avoid unnecessary TLB flushes.
>>
> Don't we call kvm_mmu_commit_zap_page() once for multiple
> kvm_mmu_prepare_zap_page() now when possible? kvm_mmu_commit_zap_page()
> gets a list as a parameter. I am not against the change, but wish to
> understand it better.

The changelog is not clear enough. I mean that we can "call
kvm_mmu_commit_zap_page once for multiple kvm_mmu_prepare_zap_page" when
we use the lock-break technique. If we do not do this, pages can still be found
in the hash table while they are already linked on another thread's invalid_list.
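
The ordering can be illustrated with a minimal user-space sketch. This is not the kernel code: every toy_* name below is hypothetical, the hash table is reduced to a single chain, and locking is left out. It only shows that a page unhashed at "prepare" time cannot be found by a lookup that runs before the batched "commit".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy page: one hash chain link and one pending-zap link. */
struct toy_page {
	unsigned long gfn;
	struct toy_page *hash_next;	/* link in the lookup chain */
	struct toy_page *zap_next;	/* link in the caller's invalid list */
};

struct toy_mmu {
	struct toy_page *hash_head;
};

static void toy_hash_add(struct toy_mmu *mmu, struct toy_page *sp)
{
	sp->zap_next = NULL;
	sp->hash_next = mmu->hash_head;
	mmu->hash_head = sp;
}

static struct toy_page *toy_lookup(const struct toy_mmu *mmu, unsigned long gfn)
{
	struct toy_page *sp;

	for (sp = mmu->hash_head; sp; sp = sp->hash_next)
		if (sp->gfn == gfn)
			return sp;
	return NULL;
}

/* "Prepare": unhash immediately and queue the page for a later commit. */
static void toy_prepare_zap(struct toy_mmu *mmu, struct toy_page *sp,
			    struct toy_page **invalid_list)
{
	struct toy_page **pp;

	assert(sp != NULL);
	for (pp = &mmu->hash_head; *pp; pp = &(*pp)->hash_next) {
		if (*pp == sp) {
			*pp = sp->hash_next;
			break;
		}
	}
	sp->hash_next = NULL;
	sp->zap_next = *invalid_list;
	*invalid_list = sp;
}

/* "Commit": drain the private list; returns how many pages were released. */
static size_t toy_commit_zap(struct toy_page **invalid_list)
{
	size_t n = 0;

	while (*invalid_list) {
		struct toy_page *sp = *invalid_list;

		*invalid_list = sp->zap_next;
		sp->zap_next = NULL;
		n++;
	}
	return n;
}
```

In this model, any number of toy_prepare_zap() calls can be followed by a single toy_commit_zap(), and a concurrent toy_lookup() in between never returns a page that already sits on someone's invalid list.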

> 
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>> ---
>>  arch/x86/kvm/mmu.c |    8 ++++++--
>>  1 files changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 40d7b2d..682ecb4 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
>>  static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>>  {
>>  	ASSERT(is_empty_shadow_page(sp->spt));
>> -	hlist_del(&sp->hash_link);
>> +
>>  	list_del(&sp->link);
>>  	free_page((unsigned long)sp->spt);
>>  	if (!sp->role.direct)
>> @@ -1655,7 +1655,8 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>>  
>>  #define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn)			\
>>  	for_each_gfn_sp(_kvm, _sp, _gfn)				\
>> -		if ((_sp)->role.direct || (_sp)->role.invalid) {} else
>> +		if ((_sp)->role.direct ||				\
>> +		      ((_sp)->role.invalid && WARN_ON(1))) {} else
>>  
>>  /* @sp->gfn should be write-protected at the call site */
>>  static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>> @@ -2074,6 +2075,9 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>>  		unaccount_shadowed(kvm, sp->gfn);
>>  	if (sp->unsync)
>>  		kvm_unlink_unsync_page(kvm, sp);
>> +
>> +	hlist_del_init(&sp->hash_link);
>> +
> What about moving this inside the if() below and making it hlist_del()?
> That leaves the page on the hash list if root_count is non-zero.
> 

It's a good idea. Will update.




* Re: [PATCH v6 2/7] KVM: MMU: delete shadow page from hash list in kvm_mmu_prepare_zap_page
  2013-05-20  9:19     ` Xiao Guangrong
@ 2013-05-20  9:42       ` Gleb Natapov
  0 siblings, 0 replies; 30+ messages in thread
From: Gleb Natapov @ 2013-05-20  9:42 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm

On Mon, May 20, 2013 at 05:19:26PM +0800, Xiao Guangrong wrote:
> On 05/19/2013 06:47 PM, Gleb Natapov wrote:
> > On Fri, May 17, 2013 at 05:12:57AM +0800, Xiao Guangrong wrote:
> >> Move the deletion of a shadow page from the hash list from
> >> kvm_mmu_commit_zap_page to kvm_mmu_prepare_zap_page so that we can call
> >> kvm_mmu_commit_zap_page once for multiple kvm_mmu_prepare_zap_page calls,
> >> which helps us avoid unnecessary TLB flushes.
> >>
> > Don't we call kvm_mmu_commit_zap_page() once for multiple
> > kvm_mmu_prepare_zap_page() now when possible? kvm_mmu_commit_zap_page()
> > gets a list as a parameter. I am not against the change, but wish to
> > understand it better.
> 
> The changelog is not clear enough. I mean that we can "call
> kvm_mmu_commit_zap_page once for multiple kvm_mmu_prepare_zap_page" when
> we use the lock-break technique. If we do not do this, pages can still be found
> in the hash table while they are already linked on another thread's invalid_list.
> 
Got it. Makes sense.

> > 
> >> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> >> ---
> >>  arch/x86/kvm/mmu.c |    8 ++++++--
> >>  1 files changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index 40d7b2d..682ecb4 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
> >>  static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
> >>  {
> >>  	ASSERT(is_empty_shadow_page(sp->spt));
> >> -	hlist_del(&sp->hash_link);
> >> +
> >>  	list_del(&sp->link);
> >>  	free_page((unsigned long)sp->spt);
> >>  	if (!sp->role.direct)
> >> @@ -1655,7 +1655,8 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
> >>  
> >>  #define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn)			\
> >>  	for_each_gfn_sp(_kvm, _sp, _gfn)				\
> >> -		if ((_sp)->role.direct || (_sp)->role.invalid) {} else
> >> +		if ((_sp)->role.direct ||				\
> >> +		      ((_sp)->role.invalid && WARN_ON(1))) {} else
> >>  
> >>  /* @sp->gfn should be write-protected at the call site */
> >>  static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >> @@ -2074,6 +2075,9 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> >>  		unaccount_shadowed(kvm, sp->gfn);
> >>  	if (sp->unsync)
> >>  		kvm_unlink_unsync_page(kvm, sp);
> >> +
> >> +	hlist_del_init(&sp->hash_link);
> >> +
> > What about moving this inside the if() below and making it hlist_del()?
> > That leaves the page on the hash list if root_count is non-zero.
> > 
> 
> It's a good idea. Will update.
> 

--
			Gleb.


* Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
  2013-05-16 21:12 ` [PATCH v6 3/7] KVM: MMU: fast invalidate all pages Xiao Guangrong
  2013-05-19 10:04   ` Gleb Natapov
@ 2013-05-20 19:46   ` Marcelo Tosatti
  2013-05-20 20:15     ` Gleb Natapov
  1 sibling, 1 reply; 30+ messages in thread
From: Marcelo Tosatti @ 2013-05-20 19:46 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: gleb, avi.kivity, pbonzini, linux-kernel, kvm

On Fri, May 17, 2013 at 05:12:58AM +0800, Xiao Guangrong wrote:
> The current kvm_mmu_zap_all is really slow: it holds mmu-lock while it
> walks and zaps all shadow pages one by one, and it also needs to zap every
> guest page's rmap and every shadow page's parent spte list. Things get
> worse as the guest uses more memory or vcpus, so it does not scale
> well.
> 
> In this patch, we introduce a faster way to invalidate all shadow pages.
> KVM maintains a global mmu-invalidation generation number, stored in
> kvm->arch.mmu_valid_gen, and every shadow page stores the current global
> generation number in sp->mmu_valid_gen when it is created.
> 
> When KVM needs to zap all shadow page sptes, it simply increases the
> global generation number and then reloads the root shadow pages on all vcpus.
> Each vcpu will create a new shadow page table according to kvm's current
> generation number. This ensures the old pages are not used any more.
> Then the invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
> are zapped using the lock-break technique.
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    2 +
>  arch/x86/kvm/mmu.c              |  103 +++++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/mmu.h              |    1 +
>  3 files changed, 106 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 3741c65..bff7d46 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -222,6 +222,7 @@ struct kvm_mmu_page {
>  	int root_count;          /* Currently serving as active root */
>  	unsigned int unsync_children;
>  	unsigned long parent_ptes;	/* Reverse mapping for parent_pte */
> +	unsigned long mmu_valid_gen;
>  	DECLARE_BITMAP(unsync_child_bitmap, 512);
>  
>  #ifdef CONFIG_X86_32
> @@ -529,6 +530,7 @@ struct kvm_arch {
>  	unsigned int n_requested_mmu_pages;
>  	unsigned int n_max_mmu_pages;
>  	unsigned int indirect_shadow_pages;
> +	unsigned long mmu_valid_gen;
>  	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
>  	/*
>  	 * Hash table of struct kvm_mmu_page.
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 682ecb4..891ad2c 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1839,6 +1839,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
>  	__clear_sp_write_flooding_count(sp);
>  }
>  
> +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> +	return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> +}
> +
>  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  					     gfn_t gfn,
>  					     gva_t gaddr,
> @@ -1865,6 +1870,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  		role.quadrant = quadrant;
>  	}
>  	for_each_gfn_sp(vcpu->kvm, sp, gfn) {
> +		if (is_obsolete_sp(vcpu->kvm, sp))
> +			continue;
> +

What's the purpose of not using pages which are considered "obsolete"?

The only purpose of the generation number should be to guarantee forward
progress of kvm_mmu_zap_all in the face of allocators (kvm_mmu_get_page).

That is, on allocation you store the generation number on the shadow page.
The only purpose of that number in the shadow page is so that
kvm_mmu_zap_all can skip deleting shadow pages which have been created
after kvm_mmu_zap_all began operation.
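
To make that narrower role concrete, here is a minimal user-space sketch, under the assumption that the generation number is used only to tell old pages from new ones. The toy_* types and helpers are hypothetical stand-ins, not kernel code, and there is no locking; only the comparison mirrors is_obsolete_sp() from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for kvm->arch: just the global generation number. */
struct toy_mmu {
	unsigned long valid_gen;
};

/* Toy stand-in for struct kvm_mmu_page. */
struct toy_page {
	unsigned long valid_gen;	/* generation stamped at allocation */
	bool zapped;
};

/* Allocation path: stamp the page with the current generation. */
static void toy_page_init(const struct toy_mmu *mmu, struct toy_page *sp)
{
	sp->valid_gen = mmu->valid_gen;
	sp->zapped = false;
}

/* Same comparison as is_obsolete_sp() in the patch. */
static bool toy_is_obsolete(const struct toy_mmu *mmu, const struct toy_page *sp)
{
	return sp->valid_gen != mmu->valid_gen;
}

/* Start of a zap-all: everything allocated so far becomes obsolete. */
static void toy_zap_begin(struct toy_mmu *mmu)
{
	mmu->valid_gen++;
}

/*
 * Zap only pages that predate the current generation; pages allocated
 * after toy_zap_begin() are skipped. Returns the number of pages zapped.
 */
static size_t toy_zap_obsolete(const struct toy_mmu *mmu,
			       struct toy_page *pages, size_t n)
{
	size_t i, zapped = 0;

	assert(pages != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (pages[i].zapped || !toy_is_obsolete(mmu, &pages[i]))
			continue;
		pages[i].zapped = true;
		zapped++;
	}
	return zapped;
}
```

Because new allocations carry the bumped generation, the zap loop in this model has a bounded amount of work no matter how many pages are allocated while it runs.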

>  		if (!need_sync && sp->unsync)
>  			need_sync = true;
>  
> @@ -1901,6 +1909,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  
>  		account_shadowed(vcpu->kvm, gfn);
>  	}
> +	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
>  	init_shadow_page_table(sp);
>  	trace_kvm_mmu_get_page(sp, true);
>  	return sp;
> @@ -2071,8 +2080,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>  	ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
>  	kvm_mmu_page_unlink_children(kvm, sp);
>  	kvm_mmu_unlink_parents(kvm, sp);
> +
>  	if (!sp->role.invalid && !sp->role.direct)
>  		unaccount_shadowed(kvm, sp->gfn);
> +
>  	if (sp->unsync)
>  		kvm_unlink_unsync_page(kvm, sp);
>  
> @@ -4196,6 +4207,98 @@ restart:
>  	spin_unlock(&kvm->mmu_lock);
>  }
>  
> +static void kvm_zap_obsolete_pages(struct kvm *kvm)
> +{
> +	struct kvm_mmu_page *sp, *node;
> +	LIST_HEAD(invalid_list);
> +
> +restart:
> +	list_for_each_entry_safe_reverse(sp, node,
> +	      &kvm->arch.active_mmu_pages, link) {
> +		/*
> +		 * No obsolete page exists before new created page since
> +		 * active_mmu_pages is the FIFO list.
> +		 */
> +		if (!is_obsolete_sp(kvm, sp))
> +			break;
> +
> +		/*
> +		 * Do not repeatedly zap a root page to avoid unnecessary
> +		 * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
> +		 * progress:
> +		 *    vcpu 0                        vcpu 1
> +		 *                         call vcpu_enter_guest():
> +		 *                            1): handle KVM_REQ_MMU_RELOAD
> +		 *                                and require mmu-lock to
> +		 *                                load mmu
> +		 * repeat:
> +		 *    1): zap root page and
> +		 *        send KVM_REQ_MMU_RELOAD
> +		 *
> +		 *    2): if (cond_resched_lock(mmu-lock))
> +		 *
> +		 *                            2): hold mmu-lock and load mmu
> +		 *
> +		 *                            3): see KVM_REQ_MMU_RELOAD bit
> +		 *                                on vcpu->requests is set
> +		 *                                then return 1 to call
> +		 *                                vcpu_enter_guest() again.
> +		 *            goto repeat;
> +		 *
> +		 */
> +		if (sp->role.invalid)
> +			continue;
> +		/*
> +		 * Need not flush tlb since we only zap the sp with invalid
> +		 * generation number.
> +		 */
> +		if (cond_resched_lock(&kvm->mmu_lock))
> +			goto restart;

Please don't do this; just release all pages and flush the TLB when
dropping mmu_lock instead. That way other mmu_lock users always see
a consistent state.

> +		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> +			goto restart;
> +	}
> +
> +	/*
> +	 * Should flush tlb before free page tables since lockless-walking
> +	 * may use the pages.
> +	 */
> +	kvm_mmu_commit_zap_page(kvm, &invalid_list);
> +}
> +
> +/*
> + * Fast invalidate all shadow pages.
> + *
> + * @zap_obsolete_pages indicates whether all the obsolete pages should
> + * be zapped. This is required when memslot is being deleted or VM is
> + * being destroyed, in these cases, we should ensure that KVM MMU does
> + * not use any resource of the being-deleted slot or all slots after
> + * calling the function.
> + *
> + * @zap_obsolete_pages == false means the caller just wants to flush all
> + * shadow page tables.
> + */
> +void kvm_mmu_invalidate_all_pages(struct kvm *kvm, bool zap_obsolete_pages)
> +{
> +	spin_lock(&kvm->mmu_lock);
> +	kvm->arch.mmu_valid_gen++;
> +
> +	/*
> +	 * Notify all vcpus to reload its shadow page table
> +	 * and flush TLB. Then all vcpus will switch to new
> +	 * shadow page table with the new mmu_valid_gen.

Only if you zap the roots, which we agreed would be a second step, once
it is understood to be necessary.

> +	 *
> +	 * Note: we should do this under the protection of
> +	 * mmu-lock, otherwise, vcpu would purge shadow page
> +	 * but miss tlb flush.
> +	 */
> +	kvm_reload_remote_mmus(kvm);
> +
> +	if (zap_obsolete_pages)
> +		kvm_zap_obsolete_pages(kvm);

Please don't condition behaviour on parameters like this; it's confusing.
Can probably just use

spin_lock(&kvm->mmu_lock);
kvm->arch.mmu_valid_gen++;
kvm_reload_remote_mmus(kvm);
spin_unlock(&kvm->mmu_lock);

Directly when needed.



* Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
  2013-05-20 19:46   ` Marcelo Tosatti
@ 2013-05-20 20:15     ` Gleb Natapov
  2013-05-20 20:40       ` Marcelo Tosatti
  0 siblings, 1 reply; 30+ messages in thread
From: Gleb Natapov @ 2013-05-20 20:15 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Xiao Guangrong, avi.kivity, pbonzini, linux-kernel, kvm

On Mon, May 20, 2013 at 04:46:24PM -0300, Marcelo Tosatti wrote:
> On Fri, May 17, 2013 at 05:12:58AM +0800, Xiao Guangrong wrote:
> > The current kvm_mmu_zap_all is really slow: it holds mmu-lock while it
> > walks and zaps all shadow pages one by one, and it also needs to zap every
> > guest page's rmap and every shadow page's parent spte list. Things get
> > worse as the guest uses more memory or vcpus, so it does not scale
> > well.
> > 
> > In this patch, we introduce a faster way to invalidate all shadow pages.
> > KVM maintains a global mmu-invalidation generation number, stored in
> > kvm->arch.mmu_valid_gen, and every shadow page stores the current global
> > generation number in sp->mmu_valid_gen when it is created.
> > 
> > When KVM needs to zap all shadow page sptes, it simply increases the
> > global generation number and then reloads the root shadow pages on all vcpus.
> > Each vcpu will create a new shadow page table according to kvm's current
> > generation number. This ensures the old pages are not used any more.
> > Then the invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
> > are zapped using the lock-break technique.
> > 
> > Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |    2 +
> >  arch/x86/kvm/mmu.c              |  103 +++++++++++++++++++++++++++++++++++++++
> >  arch/x86/kvm/mmu.h              |    1 +
> >  3 files changed, 106 insertions(+), 0 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 3741c65..bff7d46 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -222,6 +222,7 @@ struct kvm_mmu_page {
> >  	int root_count;          /* Currently serving as active root */
> >  	unsigned int unsync_children;
> >  	unsigned long parent_ptes;	/* Reverse mapping for parent_pte */
> > +	unsigned long mmu_valid_gen;
> >  	DECLARE_BITMAP(unsync_child_bitmap, 512);
> >  
> >  #ifdef CONFIG_X86_32
> > @@ -529,6 +530,7 @@ struct kvm_arch {
> >  	unsigned int n_requested_mmu_pages;
> >  	unsigned int n_max_mmu_pages;
> >  	unsigned int indirect_shadow_pages;
> > +	unsigned long mmu_valid_gen;
> >  	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> >  	/*
> >  	 * Hash table of struct kvm_mmu_page.
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 682ecb4..891ad2c 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -1839,6 +1839,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
> >  	__clear_sp_write_flooding_count(sp);
> >  }
> >  
> > +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > +{
> > +	return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> > +}
> > +
> >  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >  					     gfn_t gfn,
> >  					     gva_t gaddr,
> > @@ -1865,6 +1870,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >  		role.quadrant = quadrant;
> >  	}
> >  	for_each_gfn_sp(vcpu->kvm, sp, gfn) {
> > +		if (is_obsolete_sp(vcpu->kvm, sp))
> > +			continue;
> > +
> 
> What's the purpose of not using pages which are considered "obsolete"?
> 
The same as for not using a page that is invalid: to avoid reusing stale
information. The page may contain ptes that point to an invalid slot.

> The only purpose of the generation number should be to guarantee forward
> progress of kvm_mmu_zap_all in the face of allocators (kvm_mmu_get_page).
> 
> That is, on allocation you store the generation number on the shadow page.
> The only purpose of that number in the shadow page is so that
> kvm_mmu_zap_all can skip deleting shadow pages which have been created
> after kvm_mmu_zap_all began operation.
> 
> >  		if (!need_sync && sp->unsync)
> >  			need_sync = true;
> >  
> > @@ -1901,6 +1909,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >  
> >  		account_shadowed(vcpu->kvm, gfn);
> >  	}
> > +	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> >  	init_shadow_page_table(sp);
> >  	trace_kvm_mmu_get_page(sp, true);
> >  	return sp;
> > @@ -2071,8 +2080,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> >  	ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
> >  	kvm_mmu_page_unlink_children(kvm, sp);
> >  	kvm_mmu_unlink_parents(kvm, sp);
> > +
> >  	if (!sp->role.invalid && !sp->role.direct)
> >  		unaccount_shadowed(kvm, sp->gfn);
> > +
> >  	if (sp->unsync)
> >  		kvm_unlink_unsync_page(kvm, sp);
> >  
> > @@ -4196,6 +4207,98 @@ restart:
> >  	spin_unlock(&kvm->mmu_lock);
> >  }
> >  
> > +static void kvm_zap_obsolete_pages(struct kvm *kvm)
> > +{
> > +	struct kvm_mmu_page *sp, *node;
> > +	LIST_HEAD(invalid_list);
> > +
> > +restart:
> > +	list_for_each_entry_safe_reverse(sp, node,
> > +	      &kvm->arch.active_mmu_pages, link) {
> > +		/*
> > +		 * No obsolete page exists before new created page since
> > +		 * active_mmu_pages is the FIFO list.
> > +		 */
> > +		if (!is_obsolete_sp(kvm, sp))
> > +			break;
> > +
> > +		/*
> > +		 * Do not repeatedly zap a root page to avoid unnecessary
> > +		 * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
> > +		 * progress:
> > +		 *    vcpu 0                        vcpu 1
> > +		 *                         call vcpu_enter_guest():
> > +		 *                            1): handle KVM_REQ_MMU_RELOAD
> > +		 *                                and require mmu-lock to
> > +		 *                                load mmu
> > +		 * repeat:
> > +		 *    1): zap root page and
> > +		 *        send KVM_REQ_MMU_RELOAD
> > +		 *
> > +		 *    2): if (cond_resched_lock(mmu-lock))
> > +		 *
> > +		 *                            2): hold mmu-lock and load mmu
> > +		 *
> > +		 *                            3): see KVM_REQ_MMU_RELOAD bit
> > +		 *                                on vcpu->requests is set
> > +		 *                                then return 1 to call
> > +		 *                                vcpu_enter_guest() again.
> > +		 *            goto repeat;
> > +		 *
> > +		 */
> > +		if (sp->role.invalid)
> > +			continue;
> > +		/*
> > +		 * Need not flush tlb since we only zap the sp with invalid
> > +		 * generation number.
> > +		 */
> > +		if (cond_resched_lock(&kvm->mmu_lock))
> > +			goto restart;
> 
> Please don't do this, just release all pages and flush the TLB when
> dropping mmu_lock instead. This means for other mmu_lock users is that
> there is a consistent state.
You mean calling kvm_mmu_commit_zap_page() on the current invalid_list before restart?
If yes, I agree.

> 
> > +		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> > +			goto restart;
> > +	}
> > +
> > +	/*
> > +	 * Should flush tlb before free page tables since lockless-walking
> > +	 * may use the pages.
> > +	 */
> > +	kvm_mmu_commit_zap_page(kvm, &invalid_list);
> > +}
> > +
> > +/*
> > + * Fast invalidate all shadow pages.
> > + *
> > + * @zap_obsolete_pages indicates whether all the obsolete pages should
> > + * be zapped. This is required when memslot is being deleted or VM is
> > + * being destroyed, in these cases, we should ensure that KVM MMU does
> > + * not use any resource of the being-deleted slot or all slots after
> > + * calling the function.
> > + *
> > + * @zap_obsolete_pages == false means the caller just wants to flush all
> > + * shadow page tables.
> > + */
> > +void kvm_mmu_invalidate_all_pages(struct kvm *kvm, bool zap_obsolete_pages)
> > +{
> > +	spin_lock(&kvm->mmu_lock);
> > +	kvm->arch.mmu_valid_gen++;
> > +
> > +	/*
> > +	 * Notify all vcpus to reload its shadow page table
> > +	 * and flush TLB. Then all vcpus will switch to new
> > +	 * shadow page table with the new mmu_valid_gen.
> 
> Only if you zap the roots, which we agreed would be a second step, after
> being understood its necessary.
> 
I've lost you here. The patch implements what was agreed upon.

> > +	 *
> > +	 * Note: we should do this under the protection of
> > +	 * mmu-lock, otherwise, vcpu would purge shadow page
> > +	 * but miss tlb flush.
> > +	 */
> > +	kvm_reload_remote_mmus(kvm);
> > +
> > +	if (zap_obsolete_pages)
> > +		kvm_zap_obsolete_pages(kvm);
> 
> Please don't condition behaviour on parameters like this, its confusing. 
> Can probably just use
> 
> spin_lock(&kvm->mmu_lock);
> kvm->arch.mmu_valid_gen++;
> kvm_reload_remote_mmus(kvm);
> spin_unlock(&kvm->mmu_lock);
> 
> Directly when needed.

Please, no. It is better to have one place where mmu_valid_gen is
handled. If the parameter is confusing, it is better to have two functions,
kvm_mmu_invalidate_all_pages() and kvm_mmu_invalidate_zap_all_pages(),
but to minimize code duplication they will call the same function with
a parameter anyway, so I am not sure this is a win.

--
			Gleb.


* Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
  2013-05-20 20:15     ` Gleb Natapov
@ 2013-05-20 20:40       ` Marcelo Tosatti
  2013-05-21  3:36         ` Xiao Guangrong
  2013-05-21  8:39         ` Gleb Natapov
  0 siblings, 2 replies; 30+ messages in thread
From: Marcelo Tosatti @ 2013-05-20 20:40 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Xiao Guangrong, avi.kivity, pbonzini, linux-kernel, kvm

On Mon, May 20, 2013 at 11:15:45PM +0300, Gleb Natapov wrote:
> On Mon, May 20, 2013 at 04:46:24PM -0300, Marcelo Tosatti wrote:
> > On Fri, May 17, 2013 at 05:12:58AM +0800, Xiao Guangrong wrote:
> > > The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
> > > walk and zap all shadow pages one by one, also it need to zap all guest
> > > page's rmap and all shadow page's parent spte list. Particularly, things
> > > become worse if guest uses more memory or vcpus. It is not good for
> > > scalability
> > > 
> > > In this patch, we introduce a faster way to invalidate all shadow pages.
> > > KVM maintains a global mmu invalid generation-number which is stored in
> > > kvm->arch.mmu_valid_gen and every shadow page stores the current global
> > > generation-number into sp->mmu_valid_gen when it is created
> > > 
> > > When KVM need zap all shadow pages sptes, it just simply increase the
> > > global generation-number then reload root shadow pages on all vcpus.
> > > Vcpu will create a new shadow page table according to current kvm's
> > > generation-number. It ensures the old pages are not used any more.
> > > Then the invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
> > > are zapped by using lock-break technique
> > > 
> > > Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> > > ---
> > >  arch/x86/include/asm/kvm_host.h |    2 +
> > >  arch/x86/kvm/mmu.c              |  103 +++++++++++++++++++++++++++++++++++++++
> > >  arch/x86/kvm/mmu.h              |    1 +
> > >  3 files changed, 106 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 3741c65..bff7d46 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -222,6 +222,7 @@ struct kvm_mmu_page {
> > >  	int root_count;          /* Currently serving as active root */
> > >  	unsigned int unsync_children;
> > >  	unsigned long parent_ptes;	/* Reverse mapping for parent_pte */
> > > +	unsigned long mmu_valid_gen;
> > >  	DECLARE_BITMAP(unsync_child_bitmap, 512);
> > >  
> > >  #ifdef CONFIG_X86_32
> > > @@ -529,6 +530,7 @@ struct kvm_arch {
> > >  	unsigned int n_requested_mmu_pages;
> > >  	unsigned int n_max_mmu_pages;
> > >  	unsigned int indirect_shadow_pages;
> > > +	unsigned long mmu_valid_gen;
> > >  	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> > >  	/*
> > >  	 * Hash table of struct kvm_mmu_page.
> > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > index 682ecb4..891ad2c 100644
> > > --- a/arch/x86/kvm/mmu.c
> > > +++ b/arch/x86/kvm/mmu.c
> > > @@ -1839,6 +1839,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
> > >  	__clear_sp_write_flooding_count(sp);
> > >  }
> > >  
> > > +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > > +{
> > > +	return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> > > +}
> > > +
> > >  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >  					     gfn_t gfn,
> > >  					     gva_t gaddr,
> > > @@ -1865,6 +1870,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >  		role.quadrant = quadrant;
> > >  	}
> > >  	for_each_gfn_sp(vcpu->kvm, sp, gfn) {
> > > +		if (is_obsolete_sp(vcpu->kvm, sp))
> > > +			continue;
> > > +
> > 
> > Whats the purpose of not using pages which are considered "obsolete" ?
> > 
> The same as not using page that is invalid, to not reuse stale
> information. The page may contain ptes that point to invalid slot.

Any pages with stale information will be zapped by kvm_mmu_zap_all().
When that happens, page faults will take place which will automatically 
use the new generation number.

So it is still not clear why this is necessary.

> > The only purpose of the generation number should be to guarantee forward
> > progress of kvm_mmu_zap_all in face of allocators (kvm_mmu_get_page).
> > 
> > That is, on allocation you store the generation number on the shadow page.
> > The only purpose of that number in the shadow page is so that
> > kvm_mmu_zap_all can skip deleting shadow pages which have been created
> > after kvm_mmu_zap_all began operation.
> > 
> > >  		if (!need_sync && sp->unsync)
> > >  			need_sync = true;
> > >  
> > > @@ -1901,6 +1909,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >  
> > >  		account_shadowed(vcpu->kvm, gfn);
> > >  	}
> > > +	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> > >  	init_shadow_page_table(sp);
> > >  	trace_kvm_mmu_get_page(sp, true);
> > >  	return sp;
> > > @@ -2071,8 +2080,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> > >  	ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
> > >  	kvm_mmu_page_unlink_children(kvm, sp);
> > >  	kvm_mmu_unlink_parents(kvm, sp);
> > > +
> > >  	if (!sp->role.invalid && !sp->role.direct)
> > >  		unaccount_shadowed(kvm, sp->gfn);
> > > +
> > >  	if (sp->unsync)
> > >  		kvm_unlink_unsync_page(kvm, sp);
> > >  
> > > @@ -4196,6 +4207,98 @@ restart:
> > >  	spin_unlock(&kvm->mmu_lock);
> > >  }
> > >  
> > > +static void kvm_zap_obsolete_pages(struct kvm *kvm)
> > > +{
> > > +	struct kvm_mmu_page *sp, *node;
> > > +	LIST_HEAD(invalid_list);
> > > +
> > > +restart:
> > > +	list_for_each_entry_safe_reverse(sp, node,
> > > +	      &kvm->arch.active_mmu_pages, link) {
> > > +		/*
> > > +		 * No obsolete page exists before new created page since
> > > +		 * active_mmu_pages is the FIFO list.
> > > +		 */
> > > +		if (!is_obsolete_sp(kvm, sp))
> > > +			break;
> > > +
> > > +		/*
> > > +		 * Do not repeatedly zap a root page to avoid unnecessary
> > > +		 * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
> > > +		 * progress:
> > > +		 *    vcpu 0                        vcpu 1
> > > +		 *                         call vcpu_enter_guest():
> > > +		 *                            1): handle KVM_REQ_MMU_RELOAD
> > > +		 *                                and require mmu-lock to
> > > +		 *                                load mmu
> > > +		 * repeat:
> > > +		 *    1): zap root page and
> > > +		 *        send KVM_REQ_MMU_RELOAD
> > > +		 *
> > > +		 *    2): if (cond_resched_lock(mmu-lock))
> > > +		 *
> > > +		 *                            2): hold mmu-lock and load mmu
> > > +		 *
> > > +		 *                            3): see KVM_REQ_MMU_RELOAD bit
> > > +		 *                                on vcpu->requests is set
> > > +		 *                                then return 1 to call
> > > +		 *                                vcpu_enter_guest() again.
> > > +		 *            goto repeat;
> > > +		 *
> > > +		 */
> > > +		if (sp->role.invalid)
> > > +			continue;
> > > +		/*
> > > +		 * Need not flush tlb since we only zap the sp with invalid
> > > +		 * generation number.
> > > +		 */
> > > +		if (cond_resched_lock(&kvm->mmu_lock))
> > > +			goto restart;
> > 
> > Please don't do this, just release all pages and flush the TLB when
> > dropping mmu_lock instead. This means for other mmu_lock users is that
> > there is a consistent state.
> You mean calling kvm_mmu_commit_zap_page() on the current invalid_list before restart?
> If yes, I agree.

I mean: avoid leaving pages which have been "partially deleted" when releasing
mmu_lock. Say for example upon the need to delete a given page you do:

spin_lock(mmu_lock)
shadow page = lookup page on hash list
if (shadow page is not found)
	proceed assuming the page has been properly deleted (*)
spin_unlock(mmu_lock)

At (*) we assume it has been 1) deleted and 2) TLB flushed.

So it's better to just

if (need_resched()) {
	kvm_mmu_complete_zap_page(&list);
	cond_resched_lock(&kvm->mmu_lock);
}

If you want to collapse TLB flushes, please do it in a later patch.
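
The rule described above (commit the pending list before the lock is dropped) can be sketched as a small userspace model. This is only an illustration under stated assumptions, not kernel code: struct zap_model and the zap_model_* functions are hypothetical names, the "lock" is imaginary, and a counter stands in for the TLB flush that the commit step performs.

```c
#include <stddef.h>

/* Hypothetical userspace model; none of these names are KVM APIs. */
struct zap_model {
	int active;       /* pages still reachable by lookups */
	int prepared;     /* pages unlinked but not yet freed (the invalid list) */
	int freed;        /* pages fully deleted after a simulated TLB flush */
	int flushes;      /* number of simulated TLB flushes */
	int dirty_yields; /* times the lock was dropped with prepared != 0 */
};

/* Model of the commit step: flush once, then free everything prepared. */
void zap_model_commit(struct zap_model *m)
{
	if (m->prepared == 0)
		return;
	m->flushes++;
	m->freed += m->prepared;
	m->prepared = 0;
}

/* Model of dropping the lock: record whether half-deleted pages exist. */
void zap_model_yield_lock(struct zap_model *m)
{
	if (m->prepared != 0)
		m->dirty_yields++;
}

/*
 * Zap every active page. After each batch of resched_every pages, pretend
 * a reschedule is needed: commit first, and only then yield the lock, so
 * other lock holders never observe a partially deleted page.
 */
void zap_model_zap_all(struct zap_model *m, int resched_every)
{
	int since_yield = 0;

	while (m->active > 0) {
		if (resched_every > 0 && since_yield == resched_every) {
			zap_model_commit(m);
			zap_model_yield_lock(m);
			since_yield = 0;
		}
		m->active--;
		m->prepared++;
		since_yield++;
	}
	zap_model_commit(m);
}
```

In this model the cost of the rule is visible as extra flushes (one per yield), which is the part that could be collapsed in a later patch.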

> > > +		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> > > +			goto restart;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Should flush tlb before free page tables since lockless-walking
> > > +	 * may use the pages.
> > > +	 */
> > > +	kvm_mmu_commit_zap_page(kvm, &invalid_list);
> > > +}
> > > +
> > > +/*
> > > + * Fast invalidate all shadow pages.
> > > + *
> > > + * @zap_obsolete_pages indicates whether all the obsolete pages should
> > > + * be zapped. This is required when memslot is being deleted or VM is
> > > + * being destroyed, in these cases, we should ensure that KVM MMU does
> > > + * not use any resource of the being-deleted slot or all slots after
> > > + * calling the function.
> > > + *
> > > + * @zap_obsolete_pages == false means the caller just wants to flush all
> > > + * shadow page tables.
> > > + */
> > > +void kvm_mmu_invalidate_all_pages(struct kvm *kvm, bool zap_obsolete_pages)
> > > +{
> > > +	spin_lock(&kvm->mmu_lock);
> > > +	kvm->arch.mmu_valid_gen++;
> > > +
> > > +	/*
> > > +	 * Notify all vcpus to reload its shadow page table
> > > +	 * and flush TLB. Then all vcpus will switch to new
> > > +	 * shadow page table with the new mmu_valid_gen.
> > 
> > Only if you zap the roots, which we agreed would be a second step, after
> > being understood its necessary.
> > 
> I've lost you here. The patch implements what was agreed upon.

"
+ /*
+  * Notify all vcpus to reload its shadow page table
+  * and flush TLB. Then all vcpus will switch to new
+  * shadow page table with the new mmu_valid_gen.
"

What was suggested was... go to phrase which starts with "The only purpose
of the generation number should be to".

The comment quoted here does not match that description.

> > > +	 *
> > > +	 * Note: we should do this under the protection of
> > > +	 * mmu-lock, otherwise, vcpu would purge shadow page
> > > +	 * but miss tlb flush.
> > > +	 */
> > > +	kvm_reload_remote_mmus(kvm);
> > > +
> > > +	if (zap_obsolete_pages)
> > > +		kvm_zap_obsolete_pages(kvm);
> > 
> > Please don't condition behaviour on parameters like this, its confusing. 
> > Can probably just use
> > 
> > spin_lock(&kvm->mmu_lock);
> > kvm->arch.mmu_valid_gen++;
> > kvm_reload_remote_mmus(kvm);
> > spin_unlock(&kvm->mmu_lock);
> > 
> > Directly when needed.
> 
> Please, no. It is better to have one place where mmu_valid_gen is
> handled. If the parameter is confusing, it is better to have two functions,
> kvm_mmu_invalidate_all_pages() and kvm_mmu_invalidate_zap_all_pages(),
> but to minimize code duplication they will call the same function with
> a parameter anyway, so I am not sure this is a win.

OK.



* Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
  2013-05-20 20:40       ` Marcelo Tosatti
@ 2013-05-21  3:36         ` Xiao Guangrong
  2013-05-21  8:45           ` Gleb Natapov
  2013-05-22  1:41           ` Marcelo Tosatti
  2013-05-21  8:39         ` Gleb Natapov
  1 sibling, 2 replies; 30+ messages in thread
From: Xiao Guangrong @ 2013-05-21  3:36 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Gleb Natapov, avi.kivity, pbonzini, linux-kernel, kvm

On 05/21/2013 04:40 AM, Marcelo Tosatti wrote:
> On Mon, May 20, 2013 at 11:15:45PM +0300, Gleb Natapov wrote:
>> On Mon, May 20, 2013 at 04:46:24PM -0300, Marcelo Tosatti wrote:
>>> On Fri, May 17, 2013 at 05:12:58AM +0800, Xiao Guangrong wrote:
>>>> The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
>>>> walk and zap all shadow pages one by one, also it need to zap all guest
>>>> page's rmap and all shadow page's parent spte list. Particularly, things
>>>> become worse if guest uses more memory or vcpus. It is not good for
>>>> scalability
>>>>
>>>> In this patch, we introduce a faster way to invalidate all shadow pages.
>>>> KVM maintains a global mmu invalid generation-number which is stored in
>>>> kvm->arch.mmu_valid_gen and every shadow page stores the current global
>>>> generation-number into sp->mmu_valid_gen when it is created
>>>>
>>>> When KVM need zap all shadow pages sptes, it just simply increase the
>>>> global generation-number then reload root shadow pages on all vcpus.
>>>> Vcpu will create a new shadow page table according to current kvm's
>>>> generation-number. It ensures the old pages are not used any more.
>>>> Then the invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
>>>> are zapped by using lock-break technique
>>>>
>>>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>>>> ---
>>>>  arch/x86/include/asm/kvm_host.h |    2 +
>>>>  arch/x86/kvm/mmu.c              |  103 +++++++++++++++++++++++++++++++++++++++
>>>>  arch/x86/kvm/mmu.h              |    1 +
>>>>  3 files changed, 106 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>>> index 3741c65..bff7d46 100644
>>>> --- a/arch/x86/include/asm/kvm_host.h
>>>> +++ b/arch/x86/include/asm/kvm_host.h
>>>> @@ -222,6 +222,7 @@ struct kvm_mmu_page {
>>>>  	int root_count;          /* Currently serving as active root */
>>>>  	unsigned int unsync_children;
>>>>  	unsigned long parent_ptes;	/* Reverse mapping for parent_pte */
>>>> +	unsigned long mmu_valid_gen;
>>>>  	DECLARE_BITMAP(unsync_child_bitmap, 512);
>>>>  
>>>>  #ifdef CONFIG_X86_32
>>>> @@ -529,6 +530,7 @@ struct kvm_arch {
>>>>  	unsigned int n_requested_mmu_pages;
>>>>  	unsigned int n_max_mmu_pages;
>>>>  	unsigned int indirect_shadow_pages;
>>>> +	unsigned long mmu_valid_gen;
>>>>  	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
>>>>  	/*
>>>>  	 * Hash table of struct kvm_mmu_page.
>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>>> index 682ecb4..891ad2c 100644
>>>> --- a/arch/x86/kvm/mmu.c
>>>> +++ b/arch/x86/kvm/mmu.c
>>>> @@ -1839,6 +1839,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
>>>>  	__clear_sp_write_flooding_count(sp);
>>>>  }
>>>>  
>>>> +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>>>> +{
>>>> +	return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
>>>> +}
>>>> +
>>>>  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>>>>  					     gfn_t gfn,
>>>>  					     gva_t gaddr,
>>>> @@ -1865,6 +1870,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>>>>  		role.quadrant = quadrant;
>>>>  	}
>>>>  	for_each_gfn_sp(vcpu->kvm, sp, gfn) {
>>>> +		if (is_obsolete_sp(vcpu->kvm, sp))
>>>> +			continue;
>>>> +
>>>
>>> Whats the purpose of not using pages which are considered "obsolete" ?
>>>
>> The same as not using page that is invalid, to not reuse stale
>> information. The page may contain ptes that point to invalid slot.
> 
> Any pages with stale information will be zapped by kvm_mmu_zap_all().
> When that happens, page faults will take place which will automatically 
> use the new generation number.

kvm_mmu_zap_all() uses the lock-break technique to zap pages, so before it has
zapped all obsolete pages other vcpus can acquire mmu-lock and call
kvm_mmu_get_page() to install a new page. In that case, the obsolete pages still
live in the hash table and can be found by kvm_mmu_get_page().
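
To illustrate the lookup behaviour described above, here is a minimal userspace sketch, assuming a flat array in place of the real hash table. It is not the kernel implementation: struct gen_vm, struct gen_page and the gen_* functions are hypothetical names chosen for this example, and only the generation comparison mirrors is_obsolete_sp() from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define GEN_MAX_PAGES 8

/* Hypothetical stand-ins for the shadow page and the per-VM state. */
struct gen_page {
	unsigned long gfn;
	unsigned long valid_gen; /* generation stored at creation time */
};

struct gen_vm {
	unsigned long valid_gen; /* current global generation */
	struct gen_page pages[GEN_MAX_PAGES];
	size_t nr_pages;
};

/* Same comparison as is_obsolete_sp() in the patch. */
bool gen_is_obsolete(const struct gen_vm *vm, const struct gen_page *sp)
{
	return sp->valid_gen != vm->valid_gen;
}

/* Look up a page for gfn, skipping pages from an older generation. */
struct gen_page *gen_lookup(struct gen_vm *vm, unsigned long gfn)
{
	for (size_t i = 0; i < vm->nr_pages; i++) {
		struct gen_page *sp = &vm->pages[i];

		if (sp->gfn != gfn)
			continue;
		if (gen_is_obsolete(vm, sp))
			continue; /* still in the table, but must not be reused */
		return sp;
	}
	return NULL;
}

/* Return a current-generation page for gfn, creating one if needed. */
struct gen_page *gen_get_page(struct gen_vm *vm, unsigned long gfn)
{
	struct gen_page *sp = gen_lookup(vm, gfn);

	if (sp)
		return sp;
	if (vm->nr_pages == GEN_MAX_PAGES)
		return NULL; /* model limitation: fixed capacity */

	sp = &vm->pages[vm->nr_pages++];
	sp->gfn = gfn;
	sp->valid_gen = vm->valid_gen;
	return sp;
}
```

After the global generation is bumped, a second gen_get_page() call for the same gfn returns a fresh page even though the old one has not been removed yet, which is the window the lock-break zapping leaves open.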

> 
> So it is still not clear why this is necessary.
> 
>>> The only purpose of the generation number should be to guarantee forward
>>> progress of kvm_mmu_zap_all in face of allocators (kvm_mmu_get_page).
>>>
>>> That is, on allocation you store the generation number on the shadow page.
>>> The only purpose of that number in the shadow page is so that
>>> kvm_mmu_zap_all can skip deleting shadow pages which have been created
>>> after kvm_mmu_zap_all began operation.
>>>
>>>>  		if (!need_sync && sp->unsync)
>>>>  			need_sync = true;
>>>>  
>>>> @@ -1901,6 +1909,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>>>>  
>>>>  		account_shadowed(vcpu->kvm, gfn);
>>>>  	}
>>>> +	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
>>>>  	init_shadow_page_table(sp);
>>>>  	trace_kvm_mmu_get_page(sp, true);
>>>>  	return sp;
>>>> @@ -2071,8 +2080,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>>>>  	ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
>>>>  	kvm_mmu_page_unlink_children(kvm, sp);
>>>>  	kvm_mmu_unlink_parents(kvm, sp);
>>>> +
>>>>  	if (!sp->role.invalid && !sp->role.direct)
>>>>  		unaccount_shadowed(kvm, sp->gfn);
>>>> +
>>>>  	if (sp->unsync)
>>>>  		kvm_unlink_unsync_page(kvm, sp);
>>>>  
>>>> @@ -4196,6 +4207,98 @@ restart:
>>>>  	spin_unlock(&kvm->mmu_lock);
>>>>  }
>>>>  
>>>> +static void kvm_zap_obsolete_pages(struct kvm *kvm)
>>>> +{
>>>> +	struct kvm_mmu_page *sp, *node;
>>>> +	LIST_HEAD(invalid_list);
>>>> +
>>>> +restart:
>>>> +	list_for_each_entry_safe_reverse(sp, node,
>>>> +	      &kvm->arch.active_mmu_pages, link) {
>>>> +		/*
>>>> +		 * No obsolete page exists before new created page since
>>>> +		 * active_mmu_pages is the FIFO list.
>>>> +		 */
>>>> +		if (!is_obsolete_sp(kvm, sp))
>>>> +			break;
>>>> +
>>>> +		/*
>>>> +		 * Do not repeatedly zap a root page to avoid unnecessary
>>>> +		 * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
>>>> +		 * progress:
>>>> +		 *    vcpu 0                        vcpu 1
>>>> +		 *                         call vcpu_enter_guest():
>>>> +		 *                            1): handle KVM_REQ_MMU_RELOAD
>>>> +		 *                                and require mmu-lock to
>>>> +		 *                                load mmu
>>>> +		 * repeat:
>>>> +		 *    1): zap root page and
>>>> +		 *        send KVM_REQ_MMU_RELOAD
>>>> +		 *
>>>> +		 *    2): if (cond_resched_lock(mmu-lock))
>>>> +		 *
>>>> +		 *                            2): hold mmu-lock and load mmu
>>>> +		 *
>>>> +		 *                            3): see KVM_REQ_MMU_RELOAD bit
>>>> +		 *                                on vcpu->requests is set
>>>> +		 *                                then return 1 to call
>>>> +		 *                                vcpu_enter_guest() again.
>>>> +		 *            goto repeat;
>>>> +		 *
>>>> +		 */
>>>> +		if (sp->role.invalid)
>>>> +			continue;
>>>> +		/*
>>>> +		 * Need not flush tlb since we only zap the sp with invalid
>>>> +		 * generation number.
>>>> +		 */
>>>> +		if (cond_resched_lock(&kvm->mmu_lock))
>>>> +			goto restart;
>>>
>>> Please don't do this, just release all pages and flush the TLB when
>>> dropping mmu_lock instead. This means for other mmu_lock users is that
>>> there is a consistent state.
>> You mean calling kvm_mmu_commit_zap_page() on the current invalid_list before restart?
>> If yes, I agree.
> 
> I mean: avoid leaving pages which have been "partially deleted" when releasing
> mmu_lock. Say for example upon the need to delete a given page you do:
> 
> spin_lock(mmu_lock)
> shadow page = lookup page on hash list
> if (shadow page is not found)
> 	proceed assuming the page has been properly deleted (*)
> spin_unlock(mmu_lock)
> 
> At (*) we assume it has been 1) deleted and 2) TLB flushed.
> 
> So it's better to just
> 
> if (need_resched()) {
> 	kvm_mmu_complete_zap_page(&list);

Do you mean kvm_mmu_commit_zap_page()?

> 	cond_resched_lock(&kvm->mmu_lock);
> }
> 

Isn't that what Gleb said?

> If you want to collapse TLB flushes, please do it in a later patch.

Sounds good to me.

> 
>>>> +		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
>>>> +			goto restart;
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * Should flush tlb before free page tables since lockless-walking
>>>> +	 * may use the pages.
>>>> +	 */
>>>> +	kvm_mmu_commit_zap_page(kvm, &invalid_list);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Fast invalidate all shadow pages.
>>>> + *
>>>> + * @zap_obsolete_pages indicates whether all the obsolete pages should
>>>> + * be zapped. This is required when memslot is being deleted or VM is
>>>> + * being destroyed, in these cases, we should ensure that KVM MMU does
>>>> + * not use any resource of the being-deleted slot or all slots after
>>>> + * calling the function.
>>>> + *
>>>> + * @zap_obsolete_pages == false means the caller just wants to flush all
>>>> + * shadow page tables.
>>>> + */
>>>> +void kvm_mmu_invalidate_all_pages(struct kvm *kvm, bool zap_obsolete_pages)
>>>> +{
>>>> +	spin_lock(&kvm->mmu_lock);
>>>> +	kvm->arch.mmu_valid_gen++;
>>>> +
>>>> +	/*
>>>> +	 * Notify all vcpus to reload its shadow page table
>>>> +	 * and flush TLB. Then all vcpus will switch to new
>>>> +	 * shadow page table with the new mmu_valid_gen.
>>>
>>> Only if you zap the roots, which we agreed would be a second step, after
>>> being understood its necessary.
>>>
>> I've lost you here. The patch implements what was agreed upon.
> 
> "
> + /*
> +  * Notify all vcpus to reload its shadow page table
> +  * and flush TLB. Then all vcpus will switch to new
> +  * shadow page table with the new mmu_valid_gen.
> "
> 
> What was suggested was... go to phrase which starts with "The only purpose
> of the generation number should be to".
> 
> The comment quoted here does not match that description.

So, is this what you want?

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2c512e8..2fd4c04 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4275,10 +4275,19 @@ restart:
  */
 void kvm_mmu_invalidate_all_pages(struct kvm *kvm, bool zap_obsolete_pages)
 {
+       bool zap_root = false;
+       struct kvm_mmu_page *sp;
+
        spin_lock(&kvm->mmu_lock);
        trace_kvm_mmu_invalidate_all_pages(kvm, zap_obsolete_pages);
        kvm->arch.mmu_valid_gen++;

+       list_for_each_entry(sp, &kvm->arch.active_mmu_pages, link)
+               if (sp->root_count && !sp->role.invalid) {
+                       zap_root = true;
+                       break;
+               }
+
        /*
         * Notify all vcpus to reload its shadow page table
         * and flush TLB. Then all vcpus will switch to new
@@ -4288,7 +4297,8 @@ void kvm_mmu_invalidate_all_pages(struct kvm *kvm, bool zap_obsolete_pages)
         * mmu-lock, otherwise, vcpu would purge shadow page
         * but miss tlb flush.
         */
-       kvm_reload_remote_mmus(kvm);
+       if (zap_root)
+               kvm_reload_remote_mmus(kvm);

        if (zap_obsolete_pages)
                kvm_zap_obsolete_pages(kvm);

> 
>>>> +	 *
>>>> +	 * Note: we should do this under the protection of
>>>> +	 * mmu-lock, otherwise, vcpu would purge shadow page
>>>> +	 * but miss tlb flush.
>>>> +	 */
>>>> +	kvm_reload_remote_mmus(kvm);
>>>> +
>>>> +	if (zap_obsolete_pages)
>>>> +		kvm_zap_obsolete_pages(kvm);
>>>
>>> Please don't condition behaviour on parameters like this, its confusing. 
>>> Can probably just use
>>>
>>> spin_lock(&kvm->mmu_lock);
>>> kvm->arch.mmu_valid_gen++;
>>> kvm_reload_remote_mmus(kvm);
>>> spin_unlock(&kvm->mmu_lock);
>>>
>>> Directly when needed.
>>
>> Please, no. It is better to have one place where mmu_valid_gen is
>> handled. If the parameter is confusing, it is better to have two functions,
>> kvm_mmu_invalidate_all_pages() and kvm_mmu_invalidate_zap_all_pages(),
>> but to minimize code duplication they will call the same function with
>> a parameter anyway, so I am not sure this is a win.
> 
> OK.

Will introduce these two functions.
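
A rough sketch of how the two entry points could share one helper, so that the generation number is still handled in a single place. This is a userspace model under assumptions, not the eventual patch: struct inval_state and the inval_* names are hypothetical, locking is omitted, and counters stand in for the reload request and the zap pass.

```c
#include <stdbool.h>

/* Hypothetical model of the state touched by an invalidation. */
struct inval_state {
	unsigned long valid_gen; /* global generation number */
	int reload_requests;     /* stands in for kvm_reload_remote_mmus() */
	int zap_passes;          /* stands in for kvm_zap_obsolete_pages() */
};

/* The only place where the generation number is modified. */
void inval_common(struct inval_state *s, bool zap_obsolete_pages)
{
	/* The real code would hold mmu-lock across this whole sequence. */
	s->valid_gen++;
	s->reload_requests++;

	if (zap_obsolete_pages)
		s->zap_passes++;
}

/* Caller only wants every vcpu to drop its current shadow page tables. */
void inval_all_pages(struct inval_state *s)
{
	inval_common(s, false);
}

/* Caller also needs the obsolete pages gone (slot deletion, VM teardown). */
void inval_zap_all_pages(struct inval_state *s)
{
	inval_common(s, true);
}
```

Call sites then read without a bare boolean argument, while the helper keeps the duplicated logic to one function.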

Thank you all. ;)




* Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
  2013-05-20 20:40       ` Marcelo Tosatti
  2013-05-21  3:36         ` Xiao Guangrong
@ 2013-05-21  8:39         ` Gleb Natapov
  2013-05-22  1:33           ` Marcelo Tosatti
  1 sibling, 1 reply; 30+ messages in thread
From: Gleb Natapov @ 2013-05-21  8:39 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Xiao Guangrong, avi.kivity, pbonzini, linux-kernel, kvm

On Mon, May 20, 2013 at 05:40:47PM -0300, Marcelo Tosatti wrote:
> On Mon, May 20, 2013 at 11:15:45PM +0300, Gleb Natapov wrote:
> > On Mon, May 20, 2013 at 04:46:24PM -0300, Marcelo Tosatti wrote:
> > > On Fri, May 17, 2013 at 05:12:58AM +0800, Xiao Guangrong wrote:
> > > > The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
> > > > walk and zap all shadow pages one by one, also it need to zap all guest
> > > > page's rmap and all shadow page's parent spte list. Particularly, things
> > > > become worse if guest uses more memory or vcpus. It is not good for
> > > > scalability
> > > > 
> > > > In this patch, we introduce a faster way to invalidate all shadow pages.
> > > > KVM maintains a global mmu invalid generation-number which is stored in
> > > > kvm->arch.mmu_valid_gen and every shadow page stores the current global
> > > > generation-number into sp->mmu_valid_gen when it is created
> > > > 
> > > > When KVM need zap all shadow pages sptes, it just simply increase the
> > > > global generation-number then reload root shadow pages on all vcpus.
> > > > Vcpu will create a new shadow page table according to current kvm's
> > > > generation-number. It ensures the old pages are not used any more.
> > > > Then the invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
> > > > are zapped by using lock-break technique
> > > > 
> > > > Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> > > > ---
> > > >  arch/x86/include/asm/kvm_host.h |    2 +
> > > >  arch/x86/kvm/mmu.c              |  103 +++++++++++++++++++++++++++++++++++++++
> > > >  arch/x86/kvm/mmu.h              |    1 +
> > > >  3 files changed, 106 insertions(+), 0 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 3741c65..bff7d46 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -222,6 +222,7 @@ struct kvm_mmu_page {
> > > >  	int root_count;          /* Currently serving as active root */
> > > >  	unsigned int unsync_children;
> > > >  	unsigned long parent_ptes;	/* Reverse mapping for parent_pte */
> > > > +	unsigned long mmu_valid_gen;
> > > >  	DECLARE_BITMAP(unsync_child_bitmap, 512);
> > > >  
> > > >  #ifdef CONFIG_X86_32
> > > > @@ -529,6 +530,7 @@ struct kvm_arch {
> > > >  	unsigned int n_requested_mmu_pages;
> > > >  	unsigned int n_max_mmu_pages;
> > > >  	unsigned int indirect_shadow_pages;
> > > > +	unsigned long mmu_valid_gen;
> > > >  	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> > > >  	/*
> > > >  	 * Hash table of struct kvm_mmu_page.
> > > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > > index 682ecb4..891ad2c 100644
> > > > --- a/arch/x86/kvm/mmu.c
> > > > +++ b/arch/x86/kvm/mmu.c
> > > > @@ -1839,6 +1839,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
> > > >  	__clear_sp_write_flooding_count(sp);
> > > >  }
> > > >  
> > > > +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > > > +{
> > > > +	return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> > > > +}
> > > > +
> > > >  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > >  					     gfn_t gfn,
> > > >  					     gva_t gaddr,
> > > > @@ -1865,6 +1870,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > >  		role.quadrant = quadrant;
> > > >  	}
> > > >  	for_each_gfn_sp(vcpu->kvm, sp, gfn) {
> > > > +		if (is_obsolete_sp(vcpu->kvm, sp))
> > > > +			continue;
> > > > +
> > > 
> > > Whats the purpose of not using pages which are considered "obsolete" ?
> > > 
> > The same as not using page that is invalid, to not reuse stale
> > information. The page may contain ptes that point to invalid slot.
> 
> Any pages with stale information will be zapped by kvm_mmu_zap_all().
> When that happens, page faults will take place which will automatically 
> use the new generation number.
> 
> So it is still not clear why this is necessary.
> 
Strictly speaking, this is not necessary, but it is the sane thing to do.
You cannot update a page's generation number to prevent it from being
destroyed, since after kvm_mmu_zap_all() completes, stale ptes in the
shadow page may point to a now-deleted memslot. So why build a shadow page
table with a page that is in the process of being destroyed?

> > > The only purpose of the generation number should be to guarantee forward
> > > progress of kvm_mmu_zap_all in face of allocators (kvm_mmu_get_page).
> > > 
> > > That is, on allocation you store the generation number on the shadow page.
> > > The only purpose of that number in the shadow page is so that
> > > kvm_mmu_zap_all can skip deleting shadow pages which have been created
> > > after kvm_mmu_zap_all began operation.
> > > 
> > > >  		if (!need_sync && sp->unsync)
> > > >  			need_sync = true;
> > > >  
> > > > @@ -1901,6 +1909,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > >  
> > > >  		account_shadowed(vcpu->kvm, gfn);
> > > >  	}
> > > > +	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> > > >  	init_shadow_page_table(sp);
> > > >  	trace_kvm_mmu_get_page(sp, true);
> > > >  	return sp;
> > > > @@ -2071,8 +2080,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> > > >  	ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
> > > >  	kvm_mmu_page_unlink_children(kvm, sp);
> > > >  	kvm_mmu_unlink_parents(kvm, sp);
> > > > +
> > > >  	if (!sp->role.invalid && !sp->role.direct)
> > > >  		unaccount_shadowed(kvm, sp->gfn);
> > > > +
> > > >  	if (sp->unsync)
> > > >  		kvm_unlink_unsync_page(kvm, sp);
> > > >  
> > > > @@ -4196,6 +4207,98 @@ restart:
> > > >  	spin_unlock(&kvm->mmu_lock);
> > > >  }
> > > >  
> > > > +static void kvm_zap_obsolete_pages(struct kvm *kvm)
> > > > +{
> > > > +	struct kvm_mmu_page *sp, *node;
> > > > +	LIST_HEAD(invalid_list);
> > > > +
> > > > +restart:
> > > > +	list_for_each_entry_safe_reverse(sp, node,
> > > > +	      &kvm->arch.active_mmu_pages, link) {
> > > > +		/*
> > > > +		 * No obsolete page exists before new created page since
> > > > +		 * active_mmu_pages is the FIFO list.
> > > > +		 */
> > > > +		if (!is_obsolete_sp(kvm, sp))
> > > > +			break;
> > > > +
> > > > +		/*
> > > > +		 * Do not repeatedly zap a root page to avoid unnecessary
> > > > +		 * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
> > > > +		 * progress:
> > > > +		 *    vcpu 0                        vcpu 1
> > > > +		 *                         call vcpu_enter_guest():
> > > > +		 *                            1): handle KVM_REQ_MMU_RELOAD
> > > > +		 *                                and require mmu-lock to
> > > > +		 *                                load mmu
> > > > +		 * repeat:
> > > > +		 *    1): zap root page and
> > > > +		 *        send KVM_REQ_MMU_RELOAD
> > > > +		 *
> > > > +		 *    2): if (cond_resched_lock(mmu-lock))
> > > > +		 *
> > > > +		 *                            2): hold mmu-lock and load mmu
> > > > +		 *
> > > > +		 *                            3): see KVM_REQ_MMU_RELOAD bit
> > > > +		 *                                on vcpu->requests is set
> > > > +		 *                                then return 1 to call
> > > > +		 *                                vcpu_enter_guest() again.
> > > > +		 *            goto repeat;
> > > > +		 *
> > > > +		 */
> > > > +		if (sp->role.invalid)
> > > > +			continue;
> > > > +		/*
> > > > +		 * Need not flush tlb since we only zap the sp with invalid
> > > > +		 * generation number.
> > > > +		 */
> > > > +		if (cond_resched_lock(&kvm->mmu_lock))
> > > > +			goto restart;
> > > 
> > > Please don't do this, just release all pages and flush the TLB when
> > > dropping mmu_lock instead. What this means for other mmu_lock users is that
> > > there is a consistent state.
> > You mean call kvm_mmu_commit_zap_page() on the current invalid_list before restart?
> > If yes, I agree.
> 
> I mean leaving pages which have been "partially deleted" while releasing
> mmu_lock. Say for example upon the need to delete a given page you do:
> 
> spin_lock(mmu_lock)
> shadow page = lookup page on hash list
> if (shadow page is not found)
> 	proceed assuming the page has been properly deleted (*)
> spin_unlock(mmu_lock)
> 
> At (*) we assume it has been 1) deleted and 2) TLB flushed.
> 
> So it's better to just
> 
> if (need_resched()) {
> 	kvm_mmu_complete_zap_page(&list);
> 	cond_resched_lock(&kvm->mmu_lock);
> }
> 
OK, this is what I said above.
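
A rough user-space model of that ordering (commit the invalid list, then
yield) is below. It is a sketch, not the kernel code: struct zap_model and
all function names are hypothetical, there is no real lock, the 'batch'
parameter is an assumed stand-in for need_resched(), and the walk neither
goes in reverse nor restarts after the lock break as the patch does. It
only shows that no partially deleted page is visible whenever the lock is
dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES 8

/* Hypothetical model of the state protected by mmu_lock. */
struct zap_model {
	unsigned long cur_gen;			/* global generation */
	unsigned long page_gen[NR_PAGES];	/* per-page generation */
	bool live[NR_PAGES];		/* still reachable via lookup */
	size_t pending;		/* unlinked but not yet flushed/freed */
	size_t freed;		/* fully released pages */
	size_t lock_breaks;	/* times the lock was dropped */
	size_t dirty_breaks;	/* lock dropped with pending != 0 */
};

/* Stand-in for kvm_mmu_prepare_zap_page(): unlink, defer the free. */
static void prepare_zap(struct zap_model *m, size_t i)
{
	m->live[i] = false;
	m->pending++;
}

/* Stand-in for kvm_mmu_commit_zap_page(): flush TLB, then free. */
static void commit_zap(struct zap_model *m)
{
	m->freed += m->pending;
	m->pending = 0;
}

/* Stand-in for cond_resched_lock(): record what others would see. */
static void break_lock(struct zap_model *m)
{
	m->lock_breaks++;
	if (m->pending != 0)
		m->dirty_breaks++;
}

/* Zap obsolete pages, yielding after every 'batch' prepared pages. */
static void zap_obsolete(struct zap_model *m, size_t batch)
{
	size_t since_break = 0;

	assert(batch > 0);
	for (size_t i = 0; i < NR_PAGES; i++) {
		if (!m->live[i] || m->page_gen[i] == m->cur_gen)
			continue;

		if (since_break == batch) {	/* models need_resched() */
			commit_zap(m);		/* finish before yielding */
			break_lock(m);
			since_break = 0;
		}
		prepare_zap(m, i);
		since_break++;
	}
	commit_zap(m);
}

/*
 * Scenario: every page is obsolete. Returns how often the lock was
 * dropped; asserts that it was never dropped with pages half-deleted.
 */
size_t demo_lock_breaks(size_t batch)
{
	struct zap_model m = { .cur_gen = 1 };

	for (size_t i = 0; i < NR_PAGES; i++)
		m.live[i] = true;	/* page_gen[i] stays 0: obsolete */

	zap_obsolete(&m, batch);
	assert(m.freed == NR_PAGES);
	assert(m.dirty_breaks == 0);
	return m.lock_breaks;
}
```

If commit_zap() were moved after break_lock(), dirty_breaks would become
non-zero, which is the inconsistent state other mmu_lock users should
never observe.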

> If you want to collapse TLB flushes, please do it in a later patch.
> 
Not sure what you mean again. We flush TLB once before entering this function.
kvm_reload_remote_mmus() does this for us, no?

> > > > +		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> > > > +			goto restart;
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * Should flush tlb before free page tables since lockless-walking
> > > > +	 * may use the pages.
> > > > +	 */
> > > > +	kvm_mmu_commit_zap_page(kvm, &invalid_list);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Fast invalidate all shadow pages.
> > > > + *
> > > > + * @zap_obsolete_pages indicates whether all the obsolete pages should
> > > > + * be zapped. This is required when memslot is being deleted or VM is
> > > > + * being destroyed, in these cases, we should ensure that KVM MMU does
> > > > + * not use any resource of the being-deleted slot or all slots after
> > > > + * calling the function.
> > > > + *
> > > > + * @zap_obsolete_pages == false means the caller just wants to flush all
> > > > + * shadow page tables.
> > > > + */
> > > > +void kvm_mmu_invalidate_all_pages(struct kvm *kvm, bool zap_obsolete_pages)
> > > > +{
> > > > +	spin_lock(&kvm->mmu_lock);
> > > > +	kvm->arch.mmu_valid_gen++;
> > > > +
> > > > +	/*
> > > > +	 * Notify all vcpus to reload its shadow page table
> > > > +	 * and flush TLB. Then all vcpus will switch to new
> > > > +	 * shadow page table with the new mmu_valid_gen.
> > > 
> > > Only if you zap the roots, which we agreed would be a second step, after
> > > being understood its necessary.
> > > 
> > I've lost you here. The patch implement what was agreed upon.
> 
> "
> + /*
> +  * Notify all vcpus to reload its shadow page table
> +  * and flush TLB. Then all vcpus will switch to new
> +  * shadow page table with the new mmu_valid_gen.
> "
> 
> What was suggested was... go to phrase which starts with "The only purpose
> of the generation number should be to".
> 
> The comment quoted here does not match that description.
> 
The comment describes what code does and in this it is correct.

You propose to not reload roots right away and do it only when root sp
is encountered, right? So my question is what's the point? There are,
obviously, root sps with invalid generation number at this point, so
reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not
do it here right away and avoid it in kvm_mmu_prepare_zap_page() for
invalid and obsolete sps as I proposed in one of my emails?

--
			Gleb.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
  2013-05-21  3:36         ` Xiao Guangrong
@ 2013-05-21  8:45           ` Gleb Natapov
  2013-05-22  1:41           ` Marcelo Tosatti
  1 sibling, 0 replies; 30+ messages in thread
From: Gleb Natapov @ 2013-05-21  8:45 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, avi.kivity, pbonzini, linux-kernel, kvm

On Tue, May 21, 2013 at 11:36:57AM +0800, Xiao Guangrong wrote:
> > So it's better to just
> > 
> > if (need_resched()) {
> > 	kvm_mmu_complete_zap_page(&list);
> 
> Is that kvm_mmu_commit_zap_page()?
> 
Also we need to check that someone waits on mmu_lock before entering
here.

> > 	cond_resched_lock(&kvm->mmu_lock);
> > }
> > 
> 
> Isn't it what Gleb said?
> 
It is.

> > If you want to collapse TLB flushes, please do it in a later patch.
> 
> Good to me.
> 
> > 
> >>>> +		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> >>>> +			goto restart;
> >>>> +	}
> >>>> +
> >>>> +	/*
> >>>> +	 * Should flush tlb before free page tables since lockless-walking
> >>>> +	 * may use the pages.
> >>>> +	 */
> >>>> +	kvm_mmu_commit_zap_page(kvm, &invalid_list);
> >>>> +}
> >>>> +
> >>>> +/*
> >>>> + * Fast invalidate all shadow pages.
> >>>> + *
> >>>> + * @zap_obsolete_pages indicates whether all the obsolete pages should
> >>>> + * be zapped. This is required when memslot is being deleted or VM is
> >>>> + * being destroyed, in these cases, we should ensure that KVM MMU does
> >>>> + * not use any resource of the being-deleted slot or all slots after
> >>>> + * calling the function.
> >>>> + *
> >>>> + * @zap_obsolete_pages == false means the caller just wants to flush all
> >>>> + * shadow page tables.
> >>>> + */
> >>>> +void kvm_mmu_invalidate_all_pages(struct kvm *kvm, bool zap_obsolete_pages)
> >>>> +{
> >>>> +	spin_lock(&kvm->mmu_lock);
> >>>> +	kvm->arch.mmu_valid_gen++;
> >>>> +
> >>>> +	/*
> >>>> +	 * Notify all vcpus to reload its shadow page table
> >>>> +	 * and flush TLB. Then all vcpus will switch to new
> >>>> +	 * shadow page table with the new mmu_valid_gen.
> >>>
> >>> Only if you zap the roots, which we agreed would be a second step, after
> >>> being understood its necessary.
> >>>
> >> I've lost you here. The patch implement what was agreed upon.
> > 
> > "
> > + /*
> > +  * Notify all vcpus to reload its shadow page table
> > +  * and flush TLB. Then all vcpus will switch to new
> > +  * shadow page table with the new mmu_valid_gen.
> > "
> > 
> > What was suggested was... go to phrase which starts with "The only purpose
> > of the generation number should be to".
> > 
> > The comment quoted here does not match that description.
> 
> So, is this what you want?
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 2c512e8..2fd4c04 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4275,10 +4275,19 @@ restart:
>   */
>  void kvm_mmu_invalidate_all_pages(struct kvm *kvm, bool zap_obsolete_pages)
>  {
> +       bool zap_root = false;
> +       struct kvm_mmu_page *sp;
> +
>         spin_lock(&kvm->mmu_lock);
>         trace_kvm_mmu_invalidate_all_pages(kvm, zap_obsolete_pages);
>         kvm->arch.mmu_valid_gen++;
> 
> +       list_for_each_entry(sp, &kvm->arch.active_mmu_pages, link)
> +               if (sp->root_count && !sp->role.invalid) {
> +                       zap_root = true;
> +                       break;
> +               }
> +
That's the part I do not understand in what Marcelo suggests: why would
zap_root ever be false after this loop?

>         /*
>          * Notify all vcpus to reload its shadow page table
>          * and flush TLB. Then all vcpus will switch to new
> @@ -4288,7 +4297,8 @@ void kvm_mmu_invalidate_all_pages(struct kvm *kvm, bool zap_obsolete_pages)
>          * mmu-lock, otherwise, vcpu would purge shadow page
>          * but miss tlb flush.
>          */
> -       kvm_reload_remote_mmus(kvm);
> +       if (zap_root)
> +               kvm_reload_remote_mmus(kvm);
> 
>         if (zap_obsolete_pages)
>                 kvm_zap_obsolete_pages(kvm);
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
  2013-05-21  8:39         ` Gleb Natapov
@ 2013-05-22  1:33           ` Marcelo Tosatti
  2013-05-22  6:34             ` Gleb Natapov
  0 siblings, 1 reply; 30+ messages in thread
From: Marcelo Tosatti @ 2013-05-22  1:33 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Xiao Guangrong, avi.kivity, pbonzini, linux-kernel, kvm

On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:
> > Any pages with stale information will be zapped by kvm_mmu_zap_all().
> > When that happens, page faults will take place which will automatically 
> > use the new generation number.
> > 
> > So still not clear why is this necessary.
> > 
> This is not, strictly speaking, necessary, but it is the sane thing to do.
> You cannot update a page's generation number to prevent it from being
> destroyed, since after kvm_mmu_zap_all() completes, stale ptes in the
> shadow page may point to a now-deleted memslot. So why build a shadow page
> table with a page that is in the process of being destroyed?

OK, can this be introduced separately, in a later patch, with separate
justification, then?

Xiao please have the first patches of the patchset focus on the problem
at hand: fix long mmu_lock hold times.

> Not sure what you mean again. We flush TLB once before entering this function.
> kvm_reload_remote_mmus() does this for us, no?

kvm_reload_remote_mmus() is used as an optimization; it is separate from the
solution to the problem.

> > 
> > What was suggested was... go to phrase which starts with "The only purpose
> > of the generation number should be to".
> > 
> > The comment quoted here does not match that description.
> > 
> The comment describes what code does and in this it is correct.
> 
> You propose to not reload roots right away and do it only when root sp
> is encountered, right? So my question is what's the point? There are,
> obviously, root sps with invalid generation number at this point, so
> reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not
> do it here right away and avoid it in kvm_mmu_prepare_zap_page() for
> invalid and obsolete sps as I proposed in one of my emails?

Sure. But Xiao please introduce that TLB collapsing optimization as a
later patch, so we can reason about it in a more organized fashion.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
  2013-05-21  3:36         ` Xiao Guangrong
  2013-05-21  8:45           ` Gleb Natapov
@ 2013-05-22  1:41           ` Marcelo Tosatti
  1 sibling, 0 replies; 30+ messages in thread
From: Marcelo Tosatti @ 2013-05-22  1:41 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Gleb Natapov, avi.kivity, pbonzini, linux-kernel, kvm

On Tue, May 21, 2013 at 11:36:57AM +0800, Xiao Guangrong wrote:
> On 05/21/2013 04:40 AM, Marcelo Tosatti wrote:
> > On Mon, May 20, 2013 at 11:15:45PM +0300, Gleb Natapov wrote:
> >> On Mon, May 20, 2013 at 04:46:24PM -0300, Marcelo Tosatti wrote:
> >>> On Fri, May 17, 2013 at 05:12:58AM +0800, Xiao Guangrong wrote:
> >>>> The current kvm_mmu_zap_all is really slow: it holds mmu-lock to walk
> >>>> and zap all shadow pages one by one, and it also needs to zap every guest
> >>>> page's rmap and every shadow page's parent spte list. Things become
> >>>> worse as the guest uses more memory or vcpus, which is not good for
> >>>> scalability.
> >>>>
> >>>> In this patch, we introduce a faster way to invalidate all shadow pages.
> >>>> KVM maintains a global mmu invalid generation-number which is stored in
> >>>> kvm->arch.mmu_valid_gen and every shadow page stores the current global
> >>>> generation-number into sp->mmu_valid_gen when it is created
> >>>>
> >>>> When KVM needs to zap all shadow page sptes, it simply increases the
> >>>> global generation-number and then reloads root shadow pages on all vcpus.
> >>>> Each vcpu will create a new shadow page table according to kvm's current
> >>>> generation-number, which ensures the old pages are not used any more.
> >>>> Then the invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
> >>>> are zapped using the lock-break technique.
> >>>>
> >>>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> >>>> ---
> >>>>  arch/x86/include/asm/kvm_host.h |    2 +
> >>>>  arch/x86/kvm/mmu.c              |  103 +++++++++++++++++++++++++++++++++++++++
> >>>>  arch/x86/kvm/mmu.h              |    1 +
> >>>>  3 files changed, 106 insertions(+), 0 deletions(-)
> >>>>
> >>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >>>> index 3741c65..bff7d46 100644
> >>>> --- a/arch/x86/include/asm/kvm_host.h
> >>>> +++ b/arch/x86/include/asm/kvm_host.h
> >>>> @@ -222,6 +222,7 @@ struct kvm_mmu_page {
> >>>>  	int root_count;          /* Currently serving as active root */
> >>>>  	unsigned int unsync_children;
> >>>>  	unsigned long parent_ptes;	/* Reverse mapping for parent_pte */
> >>>> +	unsigned long mmu_valid_gen;
> >>>>  	DECLARE_BITMAP(unsync_child_bitmap, 512);
> >>>>  
> >>>>  #ifdef CONFIG_X86_32
> >>>> @@ -529,6 +530,7 @@ struct kvm_arch {
> >>>>  	unsigned int n_requested_mmu_pages;
> >>>>  	unsigned int n_max_mmu_pages;
> >>>>  	unsigned int indirect_shadow_pages;
> >>>> +	unsigned long mmu_valid_gen;
> >>>>  	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> >>>>  	/*
> >>>>  	 * Hash table of struct kvm_mmu_page.
> >>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >>>> index 682ecb4..891ad2c 100644
> >>>> --- a/arch/x86/kvm/mmu.c
> >>>> +++ b/arch/x86/kvm/mmu.c
> >>>> @@ -1839,6 +1839,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
> >>>>  	__clear_sp_write_flooding_count(sp);
> >>>>  }
> >>>>  
> >>>> +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> >>>> +{
> >>>> +	return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> >>>> +}
> >>>> +
> >>>>  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >>>>  					     gfn_t gfn,
> >>>>  					     gva_t gaddr,
> >>>> @@ -1865,6 +1870,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >>>>  		role.quadrant = quadrant;
> >>>>  	}
> >>>>  	for_each_gfn_sp(vcpu->kvm, sp, gfn) {
> >>>> +		if (is_obsolete_sp(vcpu->kvm, sp))
> >>>> +			continue;
> >>>> +
> >>>
> >>> What's the purpose of not using pages which are considered "obsolete"?
> >>>
> >> The same as not using page that is invalid, to not reuse stale
> >> information. The page may contain ptes that point to invalid slot.
> > 
> > Any pages with stale information will be zapped by kvm_mmu_zap_all().
> > When that happens, page faults will take place which will automatically 
> > use the new generation number.
> 
> kvm_mmu_zap_all() uses the lock-break technique to zap pages. Before it has
> zapped all obsolete pages, other vcpus can acquire mmu-lock and call
> kvm_mmu_get_page() to install a new page. In this case, an obsolete page still
> lives in the hash table and can be found by kvm_mmu_get_page().

There is an assumption in that modification that it's worthwhile to
"prefault" pages as soon as kvm_mmu_zap_all() is detected. So it's a
heuristic.

Please move it to a separate patch, as an optimization.

(*) marker

> > spin_lock(mmu_lock)
> > shadow page = lookup page on hash list
> > if (shadow page is not found)
> > 	proceed assuming the page has been properly deleted (*)
> > spin_unlock(mmu_lock)
> > 
> > At (*) we assume it has been 1) deleted and 2) TLB flushed.
> > 
> > So it's better to just
> > 
> > if (need_resched()) {
> > 	kvm_mmu_complete_zap_page(&list);
> 
> Is that kvm_mmu_commit_zap_page()?

Yeah.

> 
> > 	cond_resched_lock(&kvm->mmu_lock);
> > }
> > 
> 
> Isn't it what Gleb said?
> 
> > If you want to collapse TLB flushes, please do it in a later patch.
> 
> Good to me.

OK thanks.

> > + /*
> > +  * Notify all vcpus to reload its shadow page table
> > +  * and flush TLB. Then all vcpus will switch to new
> > +  * shadow page table with the new mmu_valid_gen.
> > "
> > 
> > What was suggested was... go to phrase which starts with "The only purpose
> > of the generation number should be to".
> > 
> > The comment quoted here does not match that description.
> 
> So, is this what you want?

This 

"
* Notify all vcpus to reload its shadow page table
* and flush TLB. Then all vcpus will switch to new
* shadow page table with the new mmu_valid_gen.
"

Should be in a later patch, where prefault behaviour (see the first
comment) is introduced, see the "(*) marker" comment.

Not sure about the below (perhaps I can understand it separately).

> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 2c512e8..2fd4c04 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4275,10 +4275,19 @@ restart:
>   */
>  void kvm_mmu_invalidate_all_pages(struct kvm *kvm, bool zap_obsolete_pages)
>  {
> +       bool zap_root = false;
> +       struct kvm_mmu_page *sp;
> +
>         spin_lock(&kvm->mmu_lock);
>         trace_kvm_mmu_invalidate_all_pages(kvm, zap_obsolete_pages);
>         kvm->arch.mmu_valid_gen++;
> 
> +       list_for_each_entry(sp, &kvm->arch.active_mmu_pages, link)
> +               if (sp->root_count && !sp->role.invalid) {
> +                       zap_root = true;
> +                       break;
> +               }
> +
>         /*
>          * Notify all vcpus to reload its shadow page table
>          * and flush TLB. Then all vcpus will switch to new
> @@ -4288,7 +4297,8 @@ void kvm_mmu_invalidate_all_pages(struct kvm *kvm, bool zap_obsolete_pages)
>          * mmu-lock, otherwise, vcpu would purge shadow page
>          * but miss tlb flush.
>          */
> -       kvm_reload_remote_mmus(kvm);
> +       if (zap_root)
> +               kvm_reload_remote_mmus(kvm);
> 
>         if (zap_obsolete_pages)
>                 kvm_zap_obsolete_pages(kvm);
> 
> > 
> >>>> +	 *
> >>>> +	 * Note: we should do this under the protection of
> >>>> +	 * mmu-lock, otherwise, vcpu would purge shadow page
> >>>> +	 * but miss tlb flush.
> >>>> +	 */
> >>>> +	kvm_reload_remote_mmus(kvm);
> >>>> +
> >>>> +	if (zap_obsolete_pages)
> >>>> +		kvm_zap_obsolete_pages(kvm);
> >>>
> >>> Please don't condition behaviour on parameters like this, its confusing. 
> >>> Can probably just use
> >>>
> >>> spin_lock(&kvm->mmu_lock);
> >>> kvm->arch.mmu_valid_gen++;
> >>> kvm_reload_remote_mmus(kvm);
> >>> spin_unlock(&kvm->mmu_lock);
> >>>
> >>> Directly when needed.
> >>
> >> Please, no. Better have one place where mmu_valid_gen is
> >> handled. If parameter is confusing better have two functions:
> >> kvm_mmu_invalidate_all_pages() and kvm_mmu_invalidate_zap_all_pages(),
> >> but to minimize code duplication they will call the same function with
> >> a parameter anyway, so I am not sure this is a win.
> > 
> > OK.
> 
> Will introduce these two functions.
> 
> Thank you all. ;)
> 
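
The two-function variant could be sketched in user space like this. All
names here (struct inval_model, invalidate_common() and the two wrappers)
are hypothetical stand-ins chosen for illustration; the counters merely
record where kvm_reload_remote_mmus() and kvm_zap_obsolete_pages() would
be called, and the real code would hold mmu_lock around the helper's body.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state touched by an invalidation. */
struct inval_model {
	unsigned long mmu_valid_gen;	/* global generation number */
	unsigned int reload_requests;	/* models kvm_reload_remote_mmus() */
	unsigned int zap_passes;	/* models kvm_zap_obsolete_pages() */
};

/* The single place where the generation number is handled. */
static void invalidate_common(struct inval_model *m, bool zap_obsolete)
{
	/* In the kernel this whole body would run under mmu_lock. */
	m->mmu_valid_gen++;
	m->reload_requests++;
	if (zap_obsolete)
		m->zap_passes++;
}

/* Caller only wants all shadow page tables flushed. */
void invalidate_all_pages(struct inval_model *m)
{
	invalidate_common(m, false);
}

/* Caller also needs obsolete pages gone (slot deletion, VM teardown). */
void invalidate_zap_all_pages(struct inval_model *m)
{
	invalidate_common(m, true);
}

/* Scenario: one call of each kind; returns the final generation. */
unsigned long demo_wrappers(void)
{
	struct inval_model m = { 0 };

	invalidate_all_pages(&m);
	invalidate_zap_all_pages(&m);

	assert(m.reload_requests == 2);
	assert(m.zap_passes == 1);
	return m.mmu_valid_gen;
}
```

Callers then never pass a bare true/false, while the generation bump still
lives in exactly one function.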

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
  2013-05-22  1:33           ` Marcelo Tosatti
@ 2013-05-22  6:34             ` Gleb Natapov
  2013-05-22  8:46               ` Xiao Guangrong
  2013-05-22 15:06               ` Marcelo Tosatti
  0 siblings, 2 replies; 30+ messages in thread
From: Gleb Natapov @ 2013-05-22  6:34 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Xiao Guangrong, avi.kivity, pbonzini, linux-kernel, kvm

On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
> On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:
> > > Any pages with stale information will be zapped by kvm_mmu_zap_all().
> > > When that happens, page faults will take place which will automatically 
> > > use the new generation number.
> > > 
> > > So still not clear why is this necessary.
> > > 
> > This is not, strictly speaking, necessary, but it is the sane thing to do.
> > You cannot update a page's generation number to prevent it from being
> > destroyed, since after kvm_mmu_zap_all() completes, stale ptes in the
> > shadow page may point to a now-deleted memslot. So why build a shadow page
> > table with a page that is in the process of being destroyed?
> 
> OK, can this be introduced separately, in a later patch, with separate
> justification, then?
> 
> Xiao please have the first patches of the patchset focus on the problem
> at hand: fix long mmu_lock hold times.
> 
> > Not sure what you mean again. We flush TLB once before entering this function.
> > kvm_reload_remote_mmus() does this for us, no?
> 
> kvm_reload_remote_mmus() is used as an optimization; it is separate from the
> solution to the problem.
> 
> > > 
> > > What was suggested was... go to phrase which starts with "The only purpose
> > > of the generation number should be to".
> > > 
> > > The comment quoted here does not match that description.
> > > 
> > The comment describes what code does and in this it is correct.
> > 
> > You propose to not reload roots right away and do it only when root sp
> > is encountered, right? So my question is what's the point? There are,
> > obviously, root sps with invalid generation number at this point, so
> > reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not
> > do it here right away and avoid it in kvm_mmu_prepare_zap_page() for
> > invalid and obsolete sps as I proposed in one of my emails?
> 
> Sure. But Xiao please introduce that TLB collapsing optimization as a
> later patch, so we can reason about it in a more organized fashion.

So, if I understand correctly, you are asking to move is_obsolete_sp()
check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from
kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if
we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() the
call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will
become a nop. But I question the need to zap all shadow page tables there
in the first place: why is kvm_flush_remote_tlbs() not enough?

--
			Gleb.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
  2013-05-22  6:34             ` Gleb Natapov
@ 2013-05-22  8:46               ` Xiao Guangrong
  2013-05-22  8:54                 ` Gleb Natapov
  2013-05-22 15:06               ` Marcelo Tosatti
  1 sibling, 1 reply; 30+ messages in thread
From: Xiao Guangrong @ 2013-05-22  8:46 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Marcelo Tosatti, avi.kivity, pbonzini, linux-kernel, kvm

On 05/22/2013 02:34 PM, Gleb Natapov wrote:
> On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
>> On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:
>>>> Any pages with stale information will be zapped by kvm_mmu_zap_all().
>>>> When that happens, page faults will take place which will automatically 
>>>> use the new generation number.
>>>>
>>>> So still not clear why is this necessary.
>>>>
>>> This is not, strictly speaking, necessary, but it is the sane thing to do.
>>> You cannot update a page's generation number to prevent it from being
>>> destroyed, since after kvm_mmu_zap_all() completes, stale ptes in the
>>> shadow page may point to a now-deleted memslot. So why build a shadow page
>>> table with a page that is in the process of being destroyed?
>>
>> OK, can this be introduced separately, in a later patch, with separate
>> justification, then?
>>
>> Xiao please have the first patches of the patchset focus on the problem
>> at hand: fix long mmu_lock hold times.
>>
>>> Not sure what you mean again. We flush TLB once before entering this function.
>>> kvm_reload_remote_mmus() does this for us, no?
>>
>> kvm_reload_remote_mmus() is used as an optimization; it is separate from the
>> solution to the problem.
>>
>>>>
>>>> What was suggested was... go to phrase which starts with "The only purpose
>>>> of the generation number should be to".
>>>>
>>>> The comment quoted here does not match that description.
>>>>
>>> The comment describes what code does and in this it is correct.
>>>
>>> You propose to not reload roots right away and do it only when root sp
>>> is encountered, right? So my question is what's the point? There are,
>>> obviously, root sps with invalid generation number at this point, so
>>> reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not
>>> do it here right away and avoid it in kvm_mmu_prepare_zap_page() for
>>> invalid and obsolete sps as I proposed in one of my emails?
>>
>> Sure. But Xiao please introduce that TLB collapsing optimization as a
>> later patch, so we can reason about it in a more organized fashion.
> 
> So, if I understand correctly, you are asking to move is_obsolete_sp()
> check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from
> kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if
> we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() the
> call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will
> become a nop. But I question the need to zap all shadow page tables there
> in the first place: why is kvm_flush_remote_tlbs() not enough?

I do not know either... I do not even know why kvm_flush_remote_tlbs()
is needed. :(




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
  2013-05-22  8:46               ` Xiao Guangrong
@ 2013-05-22  8:54                 ` Gleb Natapov
  2013-05-22  9:41                   ` Xiao Guangrong
  0 siblings, 1 reply; 30+ messages in thread
From: Gleb Natapov @ 2013-05-22  8:54 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, avi.kivity, pbonzini, linux-kernel, kvm

On Wed, May 22, 2013 at 04:46:04PM +0800, Xiao Guangrong wrote:
> On 05/22/2013 02:34 PM, Gleb Natapov wrote:
> > On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
> >> On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:
> >>>> Any pages with stale information will be zapped by kvm_mmu_zap_all().
> >>>> When that happens, page faults will take place which will automatically 
> >>>> use the new generation number.
> >>>>
> >>>> So still not clear why is this necessary.
> >>>>
> >>> This is not, strictly speaking, necessary, but it is the sane thing to do.
> >>> You cannot update a page's generation number to prevent it from being
> >>> destroyed, since after kvm_mmu_zap_all() completes, stale ptes in the
> >>> shadow page may point to a now-deleted memslot. So why build a shadow page
> >>> table with a page that is in the process of being destroyed?
> >>
> >> OK, can this be introduced separately, in a later patch, with separate
> >> justification, then?
> >>
> >> Xiao please have the first patches of the patchset focus on the problem
> >> at hand: fix long mmu_lock hold times.
> >>
> >>> Not sure what you mean again. We flush TLB once before entering this function.
> >>> kvm_reload_remote_mmus() does this for us, no?
> >>
> >> kvm_reload_remote_mmus() is used as an optimization; it is separate from the
> >> solution to the problem.
> >>
> >>>>
> >>>> What was suggested was... go to phrase which starts with "The only purpose
> >>>> of the generation number should be to".
> >>>>
> >>>> The comment quoted here does not match that description.
> >>>>
> >>> The comment describes what code does and in this it is correct.
> >>>
> >>> You propose to not reload roots right away and do it only when root sp
> >>> is encountered, right? So my question is what's the point? There are,
> >>> obviously, root sps with invalid generation number at this point, so
> >>> reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not
> >>> do it here right away and avoid it in kvm_mmu_prepare_zap_page() for
> >>> invalid and obsolete sps as I proposed in one of my emails?
> >>
> >> Sure. But Xiao please introduce that TLB collapsing optimization as a
> >> later patch, so we can reason about it in a more organized fashion.
> > 
> > So, if I understand correctly, you are asking to move is_obsolete_sp()
> > check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from
> > kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if
> > we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() the
> > call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will
> > become a nop. But I question the need to zap all shadow page tables there
> > in the first place: why is kvm_flush_remote_tlbs() not enough?
> 
> I do not know either... I do not even know why kvm_flush_remote_tlbs()
> is needed. :(
We changed the content of an executable page, so we need to flush the
instruction cache of all vcpus to avoid using stale data. My suggestion to
call kvm_flush_remote_tlbs() is obviously incorrect, since that flushes the
TLB, not the instruction cache. But why would kvm_reload_remote_mmus() flush
the instruction cache?

--
			Gleb.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
  2013-05-22  8:54                 ` Gleb Natapov
@ 2013-05-22  9:41                   ` Xiao Guangrong
  2013-05-22 13:17                     ` Gleb Natapov
  0 siblings, 1 reply; 30+ messages in thread
From: Xiao Guangrong @ 2013-05-22  9:41 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Marcelo Tosatti, avi.kivity, pbonzini, linux-kernel, kvm,
	Anthony Liguori

On 05/22/2013 04:54 PM, Gleb Natapov wrote:
> On Wed, May 22, 2013 at 04:46:04PM +0800, Xiao Guangrong wrote:
>> On 05/22/2013 02:34 PM, Gleb Natapov wrote:
>>> On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
>>>> On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:
>>>>>> Any pages with stale information will be zapped by kvm_mmu_zap_all().
>>>>>> When that happens, page faults will take place which will automatically 
>>>>>> use the new generation number.
>>>>>>
>>>>>> So still not clear why is this necessary.
>>>>>>
>>>>> This is not, strictly speaking, necessary, but it is the sane thing to do.
>>>>> You cannot update a page's generation number to prevent it from being
>>>>> destroyed, since after kvm_mmu_zap_all() completes, stale ptes in the
>>>>> shadow page may point to a now-deleted memslot. So why build a shadow page
>>>>> table with a page that is in the process of being destroyed?
>>>>
>>>> OK, can this be introduced separately, in a later patch, with separate
>>>> justification, then?
>>>>
>>>> Xiao please have the first patches of the patchset focus on the problem
>>>> at hand: fix long mmu_lock hold times.
>>>>
>>>>> Not sure what you mean again. We flush TLB once before entering this function.
>>>>> kvm_reload_remote_mmus() does this for us, no?
>>>>
>>>> kvm_reload_remote_mmus() is used as an optimization, its separate from the
>>>> problem solution.
>>>>
>>>>>>
>>>>>> What was suggested was... go to phrase which starts with "The only purpose
>>>>>> of the generation number should be to".
>>>>>>
>>>>>> The comment quoted here does not match that description.
>>>>>>
>>>>> The comment describes what code does and in this it is correct.
>>>>>
>>>>> You propose to not reload roots right away and do it only when root sp
>>>>> is encountered, right? So my question is what's the point? There are,
>>>>> obviously, root sps with invalid generation number at this point, so
>>>>> reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not
>>>>> do it here right away and avoid it in kvm_mmu_prepare_zap_page() for
>>>>> invalid and obsolete sps as I proposed in one of my email?
>>>>
>>>> Sure. But Xiao please introduce that TLB collapsing optimization as a
>>>> later patch, so we can reason about it in a more organized fashion.
>>>
>>> So, if I understand correctly, you are asking to move is_obsolete_sp()
>>> check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from
>>> kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if
>>> we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() the
>>> call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will
>>> become nop. But I question the need to zap all shadow pages tables there
>>> in the first place, why kvm_flush_remote_tlbs() is not enough?
>>
>> I do not know too... I even do no know why kvm_flush_remote_tlbs
>> is needed. :(
> We changed the content of an executable page, we need to flush instruction
> cache of all vcpus to not use stale data, so my suggestion to call

I thought the reason was the icache too, but the icache is flushed
automatically on x86; we only need to invalidate the prefetched
instructions by executing a serializing instruction.

See the SDM, section 8.1.3, "Handling Self- and Cross-Modifying Code".

> kvm_flush_remote_tlbs() is obviously incorrect since this flushes tlb,
> not instruction cache, but why kvm_reload_remote_mmus() would flush
> instruction cache?

I do not think kvm_reload_remote_mmus() helps here either.

I found that this change was introduced by commit 7aa81cc0,
and I have added Anthony to the CC.

I also found some discussion related to calling
kvm_reload_remote_mmus():

>
> But if the instruction is architecture dependent, and you run on the
> wrong architecture, now you have to patch many locations at fault time,
> introducing some nasty runtime code / data cache overlap performance
> problems.  Granted, they go away eventually.
>

We're addressing that by blowing away the shadow cache and holding the
big kvm lock to ensure SMP safety.  Not a great thing to do from a
performance perspective but the whole point of patching is that the cost
is amortized.

(http://kerneltrap.org/mailarchive/linux-kernel/2007/9/14/260288)

But I cannot understand it...




* Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
  2013-05-22  9:41                   ` Xiao Guangrong
@ 2013-05-22 13:17                     ` Gleb Natapov
  2013-05-22 15:25                       ` Xiao Guangrong
  0 siblings, 1 reply; 30+ messages in thread
From: Gleb Natapov @ 2013-05-22 13:17 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Marcelo Tosatti, avi.kivity, pbonzini, linux-kernel, kvm,
	Anthony Liguori

On Wed, May 22, 2013 at 05:41:10PM +0800, Xiao Guangrong wrote:
> On 05/22/2013 04:54 PM, Gleb Natapov wrote:
> > On Wed, May 22, 2013 at 04:46:04PM +0800, Xiao Guangrong wrote:
> >> On 05/22/2013 02:34 PM, Gleb Natapov wrote:
> >>> On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
> >>>> On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:
> >>>>>> Any pages with stale information will be zapped by kvm_mmu_zap_all().
> >>>>>> When that happens, page faults will take place which will automatically 
> >>>>>> use the new generation number.
> >>>>>>
> >>>>>> So still not clear why is this necessary.
> >>>>>>
> >>>>> This is not, strictly speaking, necessary, but it is the sane thing to do.
> >>>>> You cannot update page's generation number to prevent it from been
> >>>>> destroyed since after kvm_mmu_zap_all() completes stale ptes in the
> >>>>> shadow page may point to now deleted memslot. So why build shadow page
> >>>>> table with a page that is in a process of been destroyed?
> >>>>
> >>>> OK, can this be introduced separately, in a later patch, with separate
> >>>> justification, then?
> >>>>
> >>>> Xiao please have the first patches of the patchset focus on the problem
> >>>> at hand: fix long mmu_lock hold times.
> >>>>
> >>>>> Not sure what you mean again. We flush TLB once before entering this function.
> >>>>> kvm_reload_remote_mmus() does this for us, no?
> >>>>
> >>>> kvm_reload_remote_mmus() is used as an optimization, its separate from the
> >>>> problem solution.
> >>>>
> >>>>>>
> >>>>>> What was suggested was... go to phrase which starts with "The only purpose
> >>>>>> of the generation number should be to".
> >>>>>>
> >>>>>> The comment quoted here does not match that description.
> >>>>>>
> >>>>> The comment describes what code does and in this it is correct.
> >>>>>
> >>>>> You propose to not reload roots right away and do it only when root sp
> >>>>> is encountered, right? So my question is what's the point? There are,
> >>>>> obviously, root sps with invalid generation number at this point, so
> >>>>> reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not
> >>>>> do it here right away and avoid it in kvm_mmu_prepare_zap_page() for
> >>>>> invalid and obsolete sps as I proposed in one of my email?
> >>>>
> >>>> Sure. But Xiao please introduce that TLB collapsing optimization as a
> >>>> later patch, so we can reason about it in a more organized fashion.
> >>>
> >>> So, if I understand correctly, you are asking to move is_obsolete_sp()
> >>> check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from
> >>> kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if
> >>> we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() the
> >>> call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will
> >>> become nop. But I question the need to zap all shadow pages tables there
> >>> in the first place, why kvm_flush_remote_tlbs() is not enough?
> >>
> >> I do not know too... I even do no know why kvm_flush_remote_tlbs
> >> is needed. :(
> > We changed the content of an executable page, we need to flush instruction
> > cache of all vcpus to not use stale data, so my suggestion to call
> 
> I thought the reason is about icache too but icache is automatically
> flushed on x86, we only need to invalidate the prefetched instructions by
> executing a serializing operation.
> 
> See the SDM in the chapter of
> "8.1.3 Handling Self- and Cross-Modifying Code"
> 
Right, so we do cross-modifying code here and we need to make sure no
vcpu is running in guest mode while this happens, but
kvm_mmu_zap_all() does not provide this guarantee, since vcpus will
continue running after reloading roots!
 
> > kvm_flush_remote_tlbs() is obviously incorrect since this flushes tlb,
> > not instruction cache, but why kvm_reload_remote_mmus() would flush
> > instruction cache?
> 
> kvm_reload_remote_mmus do not have any help i think.
> 
> I find that this change is introduced by commit: 7aa81cc0
> and I have added Anthony in the CC.
> 
> I also find some discussions related to calling
> kvm_reload_remote_mmus():
> 
> >
> > But if the instruction is architecture dependent, and you run on the
> > wrong architecture, now you have to patch many locations at fault time,
> > introducing some nasty runtime code / data cache overlap performance
> > problems.  Granted, they go away eventually.
> >
> 
> We're addressing that by blowing away the shadow cache and holding the
> big kvm lock to ensure SMP safety.  Not a great thing to do from a
> performance perspective but the whole point of patching is that the cost
> is amortized.
> 
> (http://kerneltrap.org/mailarchive/linux-kernel/2007/9/14/260288)
> 
> But i can not understand...
Back then kvm->lock protected memslot access, so code like:

mutex_lock(&vcpu->kvm->lock);
kvm_mmu_zap_all(vcpu->kvm);
mutex_unlock(&vcpu->kvm->lock);

which is what 7aa81cc0 does, was enough to guarantee that no vcpu would
run while the code was being patched. This is no longer the case:
mutex_lock(&vcpu->kvm->lock); disappeared from that code path a long time
ago, so now kvm_mmu_zap_all() there is useless and the code is incorrect.

Let's drop kvm_mmu_zap_all() there (in a separate patch) and fix the
patching properly later.

--
			Gleb.


* Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
  2013-05-22  6:34             ` Gleb Natapov
  2013-05-22  8:46               ` Xiao Guangrong
@ 2013-05-22 15:06               ` Marcelo Tosatti
  1 sibling, 0 replies; 30+ messages in thread
From: Marcelo Tosatti @ 2013-05-22 15:06 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Xiao Guangrong, avi.kivity, pbonzini, linux-kernel, kvm

On Wed, May 22, 2013 at 09:34:13AM +0300, Gleb Natapov wrote:
> On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
> > On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:
> > > > Any pages with stale information will be zapped by kvm_mmu_zap_all().
> > > > When that happens, page faults will take place which will automatically 
> > > > use the new generation number.
> > > > 
> > > > So still not clear why is this necessary.
> > > > 
> > > This is not, strictly speaking, necessary, but it is the sane thing to do.
> > > You cannot update page's generation number to prevent it from been
> > > destroyed since after kvm_mmu_zap_all() completes stale ptes in the
> > > shadow page may point to now deleted memslot. So why build shadow page
> > > table with a page that is in a process of been destroyed?
> > 
> > OK, can this be introduced separately, in a later patch, with separate
> > justification, then?
> > 
> > Xiao please have the first patches of the patchset focus on the problem
> > at hand: fix long mmu_lock hold times.
> > 
> > > Not sure what you mean again. We flush TLB once before entering this function.
> > > kvm_reload_remote_mmus() does this for us, no?
> > 
> > kvm_reload_remote_mmus() is used as an optimization, its separate from the
> > problem solution.
> > 
> > > > 
> > > > What was suggested was... go to phrase which starts with "The only purpose
> > > > of the generation number should be to".
> > > > 
> > > > The comment quoted here does not match that description.
> > > > 
> > > The comment describes what code does and in this it is correct.
> > > 
> > > You propose to not reload roots right away and do it only when root sp
> > > is encountered, right? So my question is what's the point? There are,
> > > obviously, root sps with invalid generation number at this point, so
> > > reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not
> > > do it here right away and avoid it in kvm_mmu_prepare_zap_page() for
> > > invalid and obsolete sps as I proposed in one of my email?
> > 
> > Sure. But Xiao please introduce that TLB collapsing optimization as a
> > later patch, so we can reason about it in a more organized fashion.
> 
> So, if I understand correctly, you are asking to move is_obsolete_sp()
> check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from
> kvm_mmu_invalidate_all_pages() to a separate patch.

- Move the is_obsolete_sp() check from kvm_mmu_get_page() to a separate
optimization patch.
- Move the kvm_reload_remote_mmus() optimization to a separate optimization
patch.
- Leave the is_obsolete_sp() check in kvm_mmu_invalidate_all_pages(), as
it is necessary to guarantee forward progress, in the initial patches.
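
To make the forward-progress point concrete, here is a minimal userspace
sketch of a generation-number check. The struct layouts, the fixed-size
array and the helpers get_page(), invalidate_all_pages() and
zap_obsolete_pages() are simplified assumptions for illustration only;
they are not the kernel definitions from the patch. Only the idea
(compare a page's generation with the current one, stop at the first
page that is not obsolete) is taken from the discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PAGES 8

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct shadow_page {
	unsigned long mmu_valid_gen;	/* generation the page was created in */
	bool zapped;
};

struct mmu_model {
	unsigned long mmu_valid_gen;	/* current generation */
	struct shadow_page pages[MAX_PAGES];	/* index 0 is the oldest page */
	size_t nr_pages;
};

/* A page is obsolete if it was created in an older generation. */
static bool is_obsolete_sp(const struct mmu_model *mmu,
			   const struct shadow_page *sp)
{
	return sp->mmu_valid_gen != mmu->mmu_valid_gen;
}

/* New pages always carry the current generation. */
static struct shadow_page *get_page(struct mmu_model *mmu)
{
	struct shadow_page *sp;

	if (mmu->nr_pages == MAX_PAGES)
		return NULL;
	sp = &mmu->pages[mmu->nr_pages++];
	sp->mmu_valid_gen = mmu->mmu_valid_gen;
	sp->zapped = false;
	return sp;
}

/* Invalidation only bumps the generation; nothing is walked here. */
static void invalidate_all_pages(struct mmu_model *mmu)
{
	mmu->mmu_valid_gen++;
}

/*
 * Walk from the oldest page and stop at the first page of the current
 * generation. Pages created while zapping are never obsolete, so the
 * walk terminates. Returns the number of pages zapped by this call.
 */
static size_t zap_obsolete_pages(struct mmu_model *mmu)
{
	size_t zapped = 0;

	for (size_t i = 0; i < mmu->nr_pages; i++) {
		struct shadow_page *sp = &mmu->pages[i];

		if (!is_obsolete_sp(mmu, sp))
			break;
		if (!sp->zapped) {
			sp->zapped = true;
			zapped++;
		}
	}
	return zapped;
}
```

In this model a page allocated after invalidate_all_pages() is left
alone by zap_obsolete_pages(), which is what keeps the zapping loop from
chasing pages that vcpus create while it runs.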

> Fine by me, but if
> we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() the
> call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will
> become nop. But I question the need to zap all shadow pages tables there
> in the first place, why kvm_flush_remote_tlbs() is not enough?

Good question.



* Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
  2013-05-22 13:17                     ` Gleb Natapov
@ 2013-05-22 15:25                       ` Xiao Guangrong
  2013-05-22 15:42                         ` Gleb Natapov
  0 siblings, 1 reply; 30+ messages in thread
From: Xiao Guangrong @ 2013-05-22 15:25 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Marcelo Tosatti, avi.kivity, pbonzini, linux-kernel, kvm,
	Anthony Liguori

On 05/22/2013 09:17 PM, Gleb Natapov wrote:
> On Wed, May 22, 2013 at 05:41:10PM +0800, Xiao Guangrong wrote:
>> On 05/22/2013 04:54 PM, Gleb Natapov wrote:
>>> On Wed, May 22, 2013 at 04:46:04PM +0800, Xiao Guangrong wrote:
>>>> On 05/22/2013 02:34 PM, Gleb Natapov wrote:
>>>>> On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
>>>>>> On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:
>>>>>>>> Any pages with stale information will be zapped by kvm_mmu_zap_all().
>>>>>>>> When that happens, page faults will take place which will automatically 
>>>>>>>> use the new generation number.
>>>>>>>>
>>>>>>>> So still not clear why is this necessary.
>>>>>>>>
>>>>>>> This is not, strictly speaking, necessary, but it is the sane thing to do.
>>>>>>> You cannot update page's generation number to prevent it from been
>>>>>>> destroyed since after kvm_mmu_zap_all() completes stale ptes in the
>>>>>>> shadow page may point to now deleted memslot. So why build shadow page
>>>>>>> table with a page that is in a process of been destroyed?
>>>>>>
>>>>>> OK, can this be introduced separately, in a later patch, with separate
>>>>>> justification, then?
>>>>>>
>>>>>> Xiao please have the first patches of the patchset focus on the problem
>>>>>> at hand: fix long mmu_lock hold times.
>>>>>>
>>>>>>> Not sure what you mean again. We flush TLB once before entering this function.
>>>>>>> kvm_reload_remote_mmus() does this for us, no?
>>>>>>
>>>>>> kvm_reload_remote_mmus() is used as an optimization, its separate from the
>>>>>> problem solution.
>>>>>>
>>>>>>>>
>>>>>>>> What was suggested was... go to phrase which starts with "The only purpose
>>>>>>>> of the generation number should be to".
>>>>>>>>
>>>>>>>> The comment quoted here does not match that description.
>>>>>>>>
>>>>>>> The comment describes what code does and in this it is correct.
>>>>>>>
>>>>>>> You propose to not reload roots right away and do it only when root sp
>>>>>>> is encountered, right? So my question is what's the point? There are,
>>>>>>> obviously, root sps with invalid generation number at this point, so
>>>>>>> reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not
>>>>>>> do it here right away and avoid it in kvm_mmu_prepare_zap_page() for
>>>>>>> invalid and obsolete sps as I proposed in one of my email?
>>>>>>
>>>>>> Sure. But Xiao please introduce that TLB collapsing optimization as a
>>>>>> later patch, so we can reason about it in a more organized fashion.
>>>>>
>>>>> So, if I understand correctly, you are asking to move is_obsolete_sp()
>>>>> check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from
>>>>> kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if
>>>>> we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() the
>>>>> call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will
>>>>> become nop. But I question the need to zap all shadow pages tables there
>>>>> in the first place, why kvm_flush_remote_tlbs() is not enough?
>>>>
>>>> I do not know too... I even do no know why kvm_flush_remote_tlbs
>>>> is needed. :(
>>> We changed the content of an executable page, we need to flush instruction
>>> cache of all vcpus to not use stale data, so my suggestion to call
>>
>> I thought the reason is about icache too but icache is automatically
>> flushed on x86, we only need to invalidate the prefetched instructions by
>> executing a serializing operation.
>>
>> See the SDM in the chapter of
>> "8.1.3 Handling Self- and Cross-Modifying Code"
>>
> Right, so we do cross-modifying code here and we need to make sure no
> vcpu is running in a guest mode while this happens, but
> kvm_mmu_zap_all() does not provide this guaranty since vcpus will
> continue running after reloading roots!

Maybe we can introduce a function that writes to the gpa atomically. Then
the guest either 1) sees the old value, in which case it can be intercepted,
or 2) sees the new value, in which case it can continue to execute.
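
A rough userspace sketch of what such an atomic write could look like,
assuming the 3-byte hypercall instruction lies entirely inside one
naturally aligned 32-bit word. The helper name patch_insn_atomic() and
that alignment assumption are hypothetical; a real guest instruction may
straddle a word boundary. It only shows the "old or new, never a mix"
property of the store itself and says nothing about what the SDM
requires of the processors executing the code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Opcode bytes of the two hypercall instructions. */
const uint8_t VMCALL_INSN[3]  = { 0x0f, 0x01, 0xc1 };
const uint8_t VMMCALL_INSN[3] = { 0x0f, 0x01, 0xd9 };

/*
 * Replace the 3 bytes at byte offset 'off' (0 or 1) inside an aligned
 * 32-bit word using a single compare-and-exchange, so that a concurrent
 * reader of the word observes either the old or the new instruction.
 */
bool patch_insn_atomic(_Atomic uint32_t *word, unsigned int off,
		       const uint8_t new_insn[3])
{
	uint32_t old = atomic_load(word);
	uint32_t val;
	uint8_t bytes[4];

	if (off > 1)
		return false;	/* would not fit in the word */

	do {
		/* Rebuild the word from the latest observed value. */
		memcpy(bytes, &old, sizeof(bytes));
		memcpy(bytes + off, new_insn, 3);
		memcpy(&val, bytes, sizeof(val));
	} while (!atomic_compare_exchange_weak(word, &old, val));

	return true;
}
```

The byte neighbouring the instruction in the same word is preserved,
because the new word is always rebuilt from the value the
compare-and-exchange last observed.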

> 
>>> kvm_flush_remote_tlbs() is obviously incorrect since this flushes tlb,
>>> not instruction cache, but why kvm_reload_remote_mmus() would flush
>>> instruction cache?
>>
>> kvm_reload_remote_mmus do not have any help i think.
>>
>> I find that this change is introduced by commit: 7aa81cc0
>> and I have added Anthony in the CC.
>>
>> I also find some discussions related to calling
>> kvm_reload_remote_mmus():
>>
>>>
>>> But if the instruction is architecture dependent, and you run on the
>>> wrong architecture, now you have to patch many locations at fault time,
>>> introducing some nasty runtime code / data cache overlap performance
>>> problems.  Granted, they go away eventually.
>>>
>>
>> We're addressing that by blowing away the shadow cache and holding the
>> big kvm lock to ensure SMP safety.  Not a great thing to do from a
>> performance perspective but the whole point of patching is that the cost
>> is amortized.
>>
>> (http://kerneltrap.org/mailarchive/linux-kernel/2007/9/14/260288)
>>
>> But i can not understand...
> Back then kvm->lock protected memslot access so code like:
> 
> mutex_lock(&vcpu->kvm->lock);
> kvm_mmu_zap_all(vcpu->kvm);
> mutex_unlock(&vcpu->kvm->lock);
> 
> which is what 7aa81cc0 does was enough to guaranty that no vcpu will
> run while code is patched. 

So, at that time, kvm->lock was also held while a #PF was being fixed?

> This is no longer the case and
> mutex_lock(&vcpu->kvm->lock); is gone from that code path long time ago,
> so now kvm_mmu_zap_all() there is useless and the code is incorrect.
> 
> Lets drop kvm_mmu_zap_all() there (in separate patch) and fix the
> patching properly later.

Will do.




* Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
  2013-05-22 15:25                       ` Xiao Guangrong
@ 2013-05-22 15:42                         ` Gleb Natapov
  0 siblings, 0 replies; 30+ messages in thread
From: Gleb Natapov @ 2013-05-22 15:42 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Marcelo Tosatti, avi.kivity, pbonzini, linux-kernel, kvm,
	Anthony Liguori

On Wed, May 22, 2013 at 11:25:13PM +0800, Xiao Guangrong wrote:
> On 05/22/2013 09:17 PM, Gleb Natapov wrote:
> > On Wed, May 22, 2013 at 05:41:10PM +0800, Xiao Guangrong wrote:
> >> On 05/22/2013 04:54 PM, Gleb Natapov wrote:
> >>> On Wed, May 22, 2013 at 04:46:04PM +0800, Xiao Guangrong wrote:
> >>>> On 05/22/2013 02:34 PM, Gleb Natapov wrote:
> >>>>> On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
> >>>>>> On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:
> >>>>>>>> Any pages with stale information will be zapped by kvm_mmu_zap_all().
> >>>>>>>> When that happens, page faults will take place which will automatically 
> >>>>>>>> use the new generation number.
> >>>>>>>>
> >>>>>>>> So still not clear why is this necessary.
> >>>>>>>>
> >>>>>>> This is not, strictly speaking, necessary, but it is the sane thing to do.
> >>>>>>> You cannot update page's generation number to prevent it from been
> >>>>>>> destroyed since after kvm_mmu_zap_all() completes stale ptes in the
> >>>>>>> shadow page may point to now deleted memslot. So why build shadow page
> >>>>>>> table with a page that is in a process of been destroyed?
> >>>>>>
> >>>>>> OK, can this be introduced separately, in a later patch, with separate
> >>>>>> justification, then?
> >>>>>>
> >>>>>> Xiao please have the first patches of the patchset focus on the problem
> >>>>>> at hand: fix long mmu_lock hold times.
> >>>>>>
> >>>>>>> Not sure what you mean again. We flush TLB once before entering this function.
> >>>>>>> kvm_reload_remote_mmus() does this for us, no?
> >>>>>>
> >>>>>> kvm_reload_remote_mmus() is used as an optimization, its separate from the
> >>>>>> problem solution.
> >>>>>>
> >>>>>>>>
> >>>>>>>> What was suggested was... go to phrase which starts with "The only purpose
> >>>>>>>> of the generation number should be to".
> >>>>>>>>
> >>>>>>>> The comment quoted here does not match that description.
> >>>>>>>>
> >>>>>>> The comment describes what code does and in this it is correct.
> >>>>>>>
> >>>>>>> You propose to not reload roots right away and do it only when root sp
> >>>>>>> is encountered, right? So my question is what's the point? There are,
> >>>>>>> obviously, root sps with invalid generation number at this point, so
> >>>>>>> reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not
> >>>>>>> do it here right away and avoid it in kvm_mmu_prepare_zap_page() for
> >>>>>>> invalid and obsolete sps as I proposed in one of my email?
> >>>>>>
> >>>>>> Sure. But Xiao please introduce that TLB collapsing optimization as a
> >>>>>> later patch, so we can reason about it in a more organized fashion.
> >>>>>
> >>>>> So, if I understand correctly, you are asking to move is_obsolete_sp()
> >>>>> check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from
> >>>>> kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if
> >>>>> we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() the
> >>>>> call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will
> >>>>> become nop. But I question the need to zap all shadow pages tables there
> >>>>> in the first place, why kvm_flush_remote_tlbs() is not enough?
> >>>>
> >>>> I do not know too... I even do no know why kvm_flush_remote_tlbs
> >>>> is needed. :(
> >>> We changed the content of an executable page, we need to flush instruction
> >>> cache of all vcpus to not use stale data, so my suggestion to call
> >>
> >> I thought the reason is about icache too but icache is automatically
> >> flushed on x86, we only need to invalidate the prefetched instructions by
> >> executing a serializing operation.
> >>
> >> See the SDM in the chapter of
> >> "8.1.3 Handling Self- and Cross-Modifying Code"
> >>
> > Right, so we do cross-modifying code here and we need to make sure no
> > vcpu is running in a guest mode while this happens, but
> > kvm_mmu_zap_all() does not provide this guaranty since vcpus will
> > continue running after reloading roots!
> 
> May be we can introduce a function to atomic write gpa, then the guest
> either 1) see the old value, in that case, it can be intercepted or
> 2) see the the new value in that case, it can continue to execute.
> 
The SDM says an atomic write is not enough. All vcpus must be guaranteed
not to execute code in the vicinity of the modified code. This is easy to
achieve though:

vcpu0:                            
lock(x);
make_all_cpus_request(EXIT);
unlock(x);

vcpuX:
if (kvm_check_request(EXIT)) { 
    lock(x);
    unlock(x);
}
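
The handshake above can be modelled in userspace with pthreads. This is
only a sketch under stated assumptions: the in_guest[] flags and the
busy-wait in patch_code() stand in for the request helper kicking vcpus
out of guest mode and waiting for them, the two-word insn[] array stands
in for the patched code, and the patch is assumed to be written while x
is held. None of these names are KVM APIs.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_VCPUS 4

static pthread_mutex_t x = PTHREAD_MUTEX_INITIALIZER;	/* lock "x" above */
static atomic_bool exit_request[NR_VCPUS];	/* models the EXIT request */
static atomic_bool in_guest[NR_VCPUS];		/* vcpu is "in guest mode" */
static atomic_bool stop;
static atomic_int torn_reads;
static atomic_int insn[2];	/* hypothetical two-word patched code */

static void *vcpu_thread(void *arg)
{
	int id = (int)(intptr_t)arg;

	while (!atomic_load(&stop)) {
		atomic_store(&in_guest[id], true);
		if (atomic_exchange(&exit_request[id], false)) {
			/* EXIT requested: leave guest mode, wait for patcher. */
			atomic_store(&in_guest[id], false);
			pthread_mutex_lock(&x);
			pthread_mutex_unlock(&x);
			continue;
		}
		/* "Guest mode": both words must come from the same version. */
		int a = atomic_load_explicit(&insn[0], memory_order_relaxed);
		int b = atomic_load_explicit(&insn[1], memory_order_relaxed);
		if (a != b)
			atomic_fetch_add(&torn_reads, 1);
		atomic_store(&in_guest[id], false);
	}
	return NULL;
}

static void patch_code(int val)
{
	pthread_mutex_lock(&x);
	for (int i = 0; i < NR_VCPUS; i++)
		atomic_store(&exit_request[i], true);
	/* Assumption: wait until every vcpu has left guest mode. */
	for (int i = 0; i < NR_VCPUS; i++)
		while (atomic_load(&in_guest[i]))
			sched_yield();
	/* Non-atomic, two-step patch; safe because nobody is in guest mode. */
	atomic_store_explicit(&insn[0], val, memory_order_relaxed);
	atomic_store_explicit(&insn[1], val, memory_order_relaxed);
	pthread_mutex_unlock(&x);
}

/* Runs nr_patches patches against NR_VCPUS threads; returns torn reads. */
int run_patch_demo(int nr_patches)
{
	pthread_t tid[NR_VCPUS];
	int n = 0;

	atomic_store(&stop, false);
	atomic_store(&torn_reads, 0);
	while (n < NR_VCPUS &&
	       !pthread_create(&tid[n], NULL, vcpu_thread, (void *)(intptr_t)n))
		n++;
	if (n == NR_VCPUS)
		for (int v = 1; v <= nr_patches; v++)
			patch_code(v);
	atomic_store(&stop, true);
	for (int i = 0; i < n; i++)
		pthread_join(tid[i], NULL);
	return n == NR_VCPUS ? atomic_load(&torn_reads) : -1;
}
```

In this model a vcpu that sees the request blocks on x until the patch
is complete, and a vcpu that missed the request is still inside its
in_guest window, so the patcher waits for it before touching insn[].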

> >>> kvm_flush_remote_tlbs() is obviously incorrect since this flushes tlb,
> >>> not instruction cache, but why kvm_reload_remote_mmus() would flush
> >>> instruction cache?
> >>
> >> kvm_reload_remote_mmus do not have any help i think.
> >>
> >> I find that this change is introduced by commit: 7aa81cc0
> >> and I have added Anthony in the CC.
> >>
> >> I also find some discussions related to calling
> >> kvm_reload_remote_mmus():
> >>
> >>>
> >>> But if the instruction is architecture dependent, and you run on the
> >>> wrong architecture, now you have to patch many locations at fault time,
> >>> introducing some nasty runtime code / data cache overlap performance
> >>> problems.  Granted, they go away eventually.
> >>>
> >>
> >> We're addressing that by blowing away the shadow cache and holding the
> >> big kvm lock to ensure SMP safety.  Not a great thing to do from a
> >> performance perspective but the whole point of patching is that the cost
> >> is amortized.
> >>
> >> (http://kerneltrap.org/mailarchive/linux-kernel/2007/9/14/260288)
> >>
> >> But i can not understand...
> > Back then kvm->lock protected memslot access so code like:
> > 
> > mutex_lock(&vcpu->kvm->lock);
> > kvm_mmu_zap_all(vcpu->kvm);
> > mutex_unlock(&vcpu->kvm->lock);
> > 
> > which is what 7aa81cc0 does was enough to guaranty that no vcpu will
> > run while code is patched. 
> 
> So, at that time, kvm->lock is also held when #PF is being fixed?
> 
It was, and also during kvm_mmu_load() which is called during vcpu entry
after roots are zapped.

> > This is no longer the case and
> > mutex_lock(&vcpu->kvm->lock); is gone from that code path long time ago,
> > so now kvm_mmu_zap_all() there is useless and the code is incorrect.
> > 
> > Lets drop kvm_mmu_zap_all() there (in separate patch) and fix the
> > patching properly later.
> 
> Will do.
> 

--
			Gleb.


Thread overview: 30+ messages
2013-05-16 21:12 [PATCH v6 0/7] KVM: MMU: fast zap all shadow pages Xiao Guangrong
2013-05-16 21:12 ` [PATCH v6 1/7] KVM: MMU: drop unnecessary kvm_reload_remote_mmus Xiao Guangrong
2013-05-16 21:12 ` [PATCH v6 2/7] KVM: MMU: delete shadow page from hash list in kvm_mmu_prepare_zap_page Xiao Guangrong
2013-05-19 10:47   ` Gleb Natapov
2013-05-20  9:19     ` Xiao Guangrong
2013-05-20  9:42       ` Gleb Natapov
2013-05-16 21:12 ` [PATCH v6 3/7] KVM: MMU: fast invalidate all pages Xiao Guangrong
2013-05-19 10:04   ` Gleb Natapov
2013-05-20  9:12     ` Xiao Guangrong
2013-05-20 19:46   ` Marcelo Tosatti
2013-05-20 20:15     ` Gleb Natapov
2013-05-20 20:40       ` Marcelo Tosatti
2013-05-21  3:36         ` Xiao Guangrong
2013-05-21  8:45           ` Gleb Natapov
2013-05-22  1:41           ` Marcelo Tosatti
2013-05-21  8:39         ` Gleb Natapov
2013-05-22  1:33           ` Marcelo Tosatti
2013-05-22  6:34             ` Gleb Natapov
2013-05-22  8:46               ` Xiao Guangrong
2013-05-22  8:54                 ` Gleb Natapov
2013-05-22  9:41                   ` Xiao Guangrong
2013-05-22 13:17                     ` Gleb Natapov
2013-05-22 15:25                       ` Xiao Guangrong
2013-05-22 15:42                         ` Gleb Natapov
2013-05-22 15:06               ` Marcelo Tosatti
2013-05-16 21:12 ` [PATCH v6 4/7] KVM: MMU: zap pages in batch Xiao Guangrong
2013-05-16 21:13 ` [PATCH v6 5/7] KVM: x86: use the fast way to invalidate all pages Xiao Guangrong
2013-05-16 21:13 ` [PATCH v6 6/7] KVM: MMU: show mmu_valid_gen in shadow page related tracepoints Xiao Guangrong
2013-05-16 21:13 ` [PATCH v6 7/7] KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages Xiao Guangrong
2013-05-19 10:49 ` [PATCH v6 0/7] KVM: MMU: fast zap all shadow pages Gleb Natapov
