* [PATCH 0/4] KVM: MMU: fast cleanup D bit based on fast write protect
@ 2019-01-17 13:55 ` Zhuangyanying
  0 siblings, 0 replies; 24+ messages in thread
From: Zhuangyanying @ 2019-01-17 13:55 UTC (permalink / raw)
  To: xiaoguangrong, pbonzini, arei.gonglei
  Cc: liu.jinsong, wangxinxin.wang, Zhuang Yanying, qemu-devel, kvm

From: Zhuang Yanying <ann.zhuangyanying@huawei.com>

Recently I tested live migration with large-memory guests and found that vCPUs may hang for a long time while migration is starting, e.g. about 9s for a 2048G guest (linux-5.0.0-rc2 + qemu-3.1.0).
The reason is that memory_global_dirty_log_start() takes too long while the vCPUs are waiting for the BQL; the page-by-page D bit clearing is the main time consumer.
The idea of "KVM: MMU: fast write protect" by Xiao Guangrong, especially the function kvm_mmu_write_protect_all_pages(), is very helpful here.
With a small modification on top of his patches it solves this problem, bringing the hang down from 9s to 0.5s.

At the beginning of live migration, write protection is applied only to the top-level SPTEs. A write from the VM then triggers an EPT violation, and with for_each_shadow_entry the write protection is pushed down level by level in __direct_map().
Finally the Dirty bit of the target page (in the level-1 page table) is cleared and dirty page tracking is started. Of course, the page the GPA belongs to is marked dirty in mmu_set_spte().
Xen has a similar implementation, using EMT (EPT memory type) invalidation instead of write protection.
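
To make the push-down easier to follow, here is a minimal user-space sketch of the idea, assuming just two levels and four entries per level; all names are invented for illustration and this is not the patch code (the real work happens in mmu_load_shadow_page()/handle_readonly_upper_spte() added by patch 3):

#include <stdint.h>
#include <stdio.h>

#define ENTRIES		4		/* 512 in a real page table */
#define PRESENT		(1ull << 0)
#define WRITABLE	(1ull << 1)

static uint64_t top[ENTRIES];			/* top-level table */
static uint64_t leaf[ENTRIES][ENTRIES];		/* level-1 tables  */

/* all that starting dirty logging has to pay: one small table */
static void write_protect_all_top_level(void)
{
	int i;

	for (i = 0; i < ENTRIES; i++)
		top[i] &= ~WRITABLE;
}

/* a guest write under a read-only upper entry pushes the protection down */
static void handle_write_fault(int t, int l)
{
	int i;

	if (!(top[t] & WRITABLE)) {
		for (i = 0; i < ENTRIES; i++)
			leaf[t][i] &= ~WRITABLE;	/* child first ...     */
		top[t] |= WRITABLE;			/* ... then the parent */
	}
	leaf[t][l] |= WRITABLE;		/* faulting page becomes writable */
	printf("gfn (%d,%d) marked dirty\n", t, l);
}

int main(void)
{
	int i, j;

	for (i = 0; i < ENTRIES; i++) {
		top[i] = PRESENT | WRITABLE;
		for (j = 0; j < ENTRIES; j++)
			leaf[i][j] = PRESENT | WRITABLE;
	}

	write_protect_all_top_level();
	handle_write_fault(2, 3);
	return 0;
}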

What do you think about this solution?

Xiao Guangrong (3):
  KVM: MMU: correct the behavior of mmu_spte_update_no_track
  KVM: MMU: introduce possible_writable_spte_bitmap
  KVM: MMU: introduce kvm_mmu_write_protect_all_pages

Zhuang Yanying (1):
  KVM: MMU: fast cleanup D bit based on fast write protect

 arch/x86/include/asm/kvm_host.h |  25 +++-
 arch/x86/kvm/mmu.c              | 259 ++++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu.h              |   1 +
 arch/x86/kvm/paging_tmpl.h      |  13 +-
 arch/x86/kvm/vmx/vmx.c          |   3 +-
 5 files changed, 286 insertions(+), 15 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/4] KVM: MMU: correct the behavior of mmu_spte_update_no_track
  2019-01-17 13:55 ` [Qemu-devel] " Zhuangyanying
@ 2019-01-17 13:55   ` Zhuangyanying
  -1 siblings, 0 replies; 24+ messages in thread
From: Zhuangyanying @ 2019-01-17 13:55 UTC (permalink / raw)
  To: xiaoguangrong, pbonzini, arei.gonglei
  Cc: liu.jinsong, wangxinxin.wang, qemu-devel, kvm

From: Xiao Guangrong <xiaoguangrong@tencent.com>

Current behavior of mmu_spte_update_no_track() does not match
the name of _no_track() as actually the A/D bits are tracked
and returned to the caller

This patch introduces the real _no_track() function to update
the spte regardless of A/D bits and rename the original function
to _track()

The _no_track() function will be used by later patches to update
upper spte which need not care of A/D bits indeed

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 arch/x86/kvm/mmu.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ce770b4..eeb3bac 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -731,10 +731,29 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte)
 }
 
 /*
- * Update the SPTE (excluding the PFN), but do not track changes in its
+ * Update the SPTE (excluding the PFN) regardless of accessed/dirty
+ * status which is used to update the upper level spte.
+ */
+static void mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
+{
+	u64 old_spte = *sptep;
+
+	WARN_ON(!is_shadow_present_pte(new_spte));
+
+	if (!is_shadow_present_pte(old_spte)) {
+		mmu_spte_set(sptep, new_spte);
+		return;
+	}
+
+	__update_clear_spte_fast(sptep, new_spte);
+}
+
+/*
+ * Update the SPTE (excluding the PFN), the original value is
+ * returned, based on it, the caller can track changes of its
  * accessed/dirty status.
  */
-static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
+static u64 mmu_spte_update_track(u64 *sptep, u64 new_spte)
 {
 	u64 old_spte = *sptep;
 
@@ -769,7 +788,7 @@ static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
 static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 {
 	bool flush = false;
-	u64 old_spte = mmu_spte_update_no_track(sptep, new_spte);
+	u64 old_spte = mmu_spte_update_track(sptep, new_spte);
 
 	if (!is_shadow_present_pte(old_spte))
 		return false;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/4] KVM: MMU: introduce possible_writable_spte_bitmap
  2019-01-17 13:55 ` [Qemu-devel] " Zhuangyanying
@ 2019-01-17 13:55   ` Zhuangyanying
  -1 siblings, 0 replies; 24+ messages in thread
From: Zhuangyanying @ 2019-01-17 13:55 UTC (permalink / raw)
  To: xiaoguangrong, pbonzini, arei.gonglei
  Cc: liu.jinsong, wangxinxin.wang, qemu-devel, kvm

From: Xiao Guangrong <xiaoguangrong@tencent.com>

The bitmap tracks the possible writable sptes within a shadow page:
a bit is set to 1 for each spte that is either already writable or
can be locklessly updated to writable on the fast_page_fault path.
A counter of the number of possible writable sptes is introduced as
well, to speed up walking the bitmap.

A later patch uses this bitmap and counter to quickly find the
writable sptes and write protect them.
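
A rough sketch of the intended use (a simplified form of what the next
patch does in mmu_load_shadow_page(); the bitmap is shrunk to a single
u64 and all other names here are invented for illustration):

#include <stdint.h>
#include <stdio.h>

#define SP_ENTRY_NR	64	/* 512 (KVM_MMU_SP_ENTRY_NR) in the patch */
#define TOY_W_BIT	(1ull << 1)

struct toy_sp {
	uint64_t spt[SP_ENTRY_NR];		/* pretend sptes       */
	uint64_t possible_writable_bitmap;	/* one bit per spte    */
	unsigned int possible_writable_sptes;	/* number of bits set  */
};

/* visit only the candidate sptes and stop once they are all handled */
static void toy_write_protect_page(struct toy_sp *sp)
{
	unsigned int left = sp->possible_writable_sptes;
	int i;

	for (i = 0; i < SP_ENTRY_NR && left; i++) {
		if (!(sp->possible_writable_bitmap & (1ull << i)))
			continue;
		sp->spt[i] &= ~TOY_W_BIT;
		left--;
	}
}

int main(void)
{
	struct toy_sp sp = { .possible_writable_sptes = 2 };

	sp.spt[3] = sp.spt[40] = TOY_W_BIT;
	sp.possible_writable_bitmap = (1ull << 3) | (1ull << 40);

	toy_write_protect_page(&sp);
	printf("spte 3 is now %#llx\n", (unsigned long long)sp.spt[3]);
	return 0;
}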

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 arch/x86/include/asm/kvm_host.h |  6 ++++-
 arch/x86/kvm/mmu.c              | 53 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4660ce9..5c30aa0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -128,6 +128,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
 #define KVM_MIN_ALLOC_MMU_PAGES 64
 #define KVM_MMU_HASH_SHIFT 12
 #define KVM_NUM_MMU_PAGES (1 << KVM_MMU_HASH_SHIFT)
+#define KVM_MMU_SP_ENTRY_NR 512
 #define KVM_MIN_FREE_MMU_PAGES 5
 #define KVM_REFILL_PAGES 25
 #define KVM_MAX_CPUID_ENTRIES 80
@@ -331,12 +332,15 @@ struct kvm_mmu_page {
 	gfn_t *gfns;
 	int root_count;          /* Currently serving as active root */
 	unsigned int unsync_children;
+	unsigned int possiable_writable_sptes;
 	struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
 
 	/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen.  */
 	unsigned long mmu_valid_gen;
 
-	DECLARE_BITMAP(unsync_child_bitmap, 512);
+	DECLARE_BITMAP(unsync_child_bitmap, KVM_MMU_SP_ENTRY_NR);
+
+	DECLARE_BITMAP(possible_writable_spte_bitmap, KVM_MMU_SP_ENTRY_NR);
 
 #ifdef CONFIG_X86_32
 	/*
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index eeb3bac..9daab00 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -718,6 +718,49 @@ static bool is_dirty_spte(u64 spte)
 	return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
 }
 
+static bool is_possible_writable_spte(u64 spte)
+{
+	if (!is_shadow_present_pte(spte))
+		return false;
+
+	if (is_writable_pte(spte))
+		return true;
+
+	if (spte_can_locklessly_be_made_writable(spte))
+		return true;
+
+	/*
+	 * although is_access_track_spte() sptes can be updated out of
+	 * mmu-lock, we need not take them into account as access_track
+	 * drops writable bit for them
+	 */
+	return false;
+}
+
+static void
+mmu_log_possible_writable_spte(u64 *sptep, u64 old_spte, u64 new_spte)
+{
+	struct kvm_mmu_page *sp = page_header(__pa(sptep));
+	bool old_state, new_state;
+
+	old_state = is_possible_writable_spte(old_spte);
+	new_state = is_possible_writable_spte(new_spte);
+
+	if (old_state == new_state)
+		return;
+
+	/* a possible writable spte is dropped */
+	if (old_state) {
+		sp->possiable_writable_sptes--;
+		__clear_bit(sptep - sp->spt, sp->possible_writable_spte_bitmap);
+		return;
+	}
+
+	/* a new possible writable spte is set */
+	sp->possiable_writable_sptes++;
+	__set_bit(sptep - sp->spt, sp->possible_writable_spte_bitmap);
+}
+
 /* Rules for using mmu_spte_set:
  * Set the sptep from nonpresent to present.
  * Note: the sptep being assigned *must* be either not present
@@ -728,6 +771,7 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte)
 {
 	WARN_ON(is_shadow_present_pte(*sptep));
 	__set_spte(sptep, new_spte);
+	mmu_log_possible_writable_spte(sptep, 0ull, new_spte);
 }
 
 /*
@@ -746,6 +790,7 @@ static void mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
 	}
 
 	__update_clear_spte_fast(sptep, new_spte);
+	mmu_log_possible_writable_spte(sptep, old_spte, new_spte);
 }
 
 /*
@@ -771,6 +816,7 @@ static u64 mmu_spte_update_track(u64 *sptep, u64 new_spte)
 
 	WARN_ON(spte_to_pfn(old_spte) != spte_to_pfn(new_spte));
 
+	mmu_log_possible_writable_spte(sptep, old_spte, new_spte);
 	return old_spte;
 }
 
@@ -836,6 +882,8 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
 	else
 		old_spte = __update_clear_spte_slow(sptep, 0ull);
 
+	mmu_log_possible_writable_spte(sptep, old_spte, 0ull);
+
 	if (!is_shadow_present_pte(old_spte))
 		return 0;
 
@@ -864,7 +912,10 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
  */
 static void mmu_spte_clear_no_track(u64 *sptep)
 {
+	u64 old_spte = *sptep;
+
 	__update_clear_spte_fast(sptep, 0ull);
+	mmu_log_possible_writable_spte(sptep, old_spte, 0ull);
 }
 
 static u64 mmu_spte_get_lockless(u64 *sptep)
@@ -2159,7 +2210,7 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
 {
 	int i, ret, nr_unsync_leaf = 0;
 
-	for_each_set_bit(i, sp->unsync_child_bitmap, 512) {
+	for_each_set_bit(i, sp->unsync_child_bitmap, KVM_MMU_SP_ENTRY_NR) {
 		struct kvm_mmu_page *child;
 		u64 ent = sp->spt[i];
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/4] KVM: MMU: introduce kvm_mmu_write_protect_all_pages
  2019-01-17 13:55 ` [Qemu-devel] " Zhuangyanying
@ 2019-01-17 13:55   ` Zhuangyanying
  -1 siblings, 0 replies; 24+ messages in thread
From: Zhuangyanying @ 2019-01-17 13:55 UTC (permalink / raw)
  To: xiaoguangrong, pbonzini, arei.gonglei
  Cc: liu.jinsong, wangxinxin.wang, qemu-devel, kvm

From: Xiao Guangrong <xiaoguangrong@tencent.com>

The original idea is from Avi. kvm_mmu_write_protect_all_pages() write
protects all guest memory extremely fast. Compared with the ordinary
algorithm, which write protects the last-level sptes one by one based
on the rmap, it simply updates a generation number to ask all vCPUs to
reload their root page tables. In particular, this can be done outside
of mmu-lock, so it does not hurt the vMMU's parallelism. It is an O(1)
algorithm that depends on neither the size of guest memory nor the
number of guest vCPUs.

When a vCPU reloads its root page table, it compares the root page
table's generation number with the current global number; if they do
not match, it makes all entries in the root page read-only and enters
the guest directly. Read accesses therefore keep going smoothly
without KVM's involvement, while a write access triggers a page fault.
KVM then moves the write protection from the upper level to the lower
level page, by first making all entries in the lower page read-only
and then making the upper-level entry writable. This operation is
repeated until the last spte is reached.

In order to speed up making all entries read-only, we use
possible_writable_spte_bitmap, which indicates the writable sptes, and
the counter possiable_writable_sptes, which gives their number. This
works very efficiently: in the worst case usually only one entry in
the PML4 (for a guest < 512G), a few entries in the PDPT (each entry
maps 1G of memory), plus the affected PDEs and PTEs need to be write
protected. Note that the number of page faults and TLB flushes is the
same as with the ordinary algorithm. In our tests, for a VM with 3G
memory and 12 vCPUs, we benchmarked pure memory writes after write
protection and saw only a 3% drop. We also benchmarked running the
pure memory-write test in a freshly booted VM (i.e. it triggers #PF to
map memory) while live migration is going on, and the dirty page ratio
increased by ~50%, which means memory performance is hugely improved
during live migration.
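
The enable flag and the generation number are packed into a single u64
so they are always read and written consistently. A standalone
illustration of the encoding and of the check done when a shadow page
is (re)linked, mirroring the WP_ALL_* helpers added below (names are
shortened here):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ENABLE_BIT	63
#define ENABLE_MASK	(1ull << ENABLE_BIT)
#define GEN_MASK	(~0ull & ~ENABLE_MASK)

static uint64_t pack_indicator(bool enable, uint64_t gen)
{
	return ((uint64_t)!!enable << ENABLE_BIT) | (gen & GEN_MASK);
}

int main(void)
{
	uint64_t indicator = pack_indicator(true, 5);	/* global state     */
	uint64_t sp_gen = 4;				/* cached in the sp */

	if ((indicator & ENABLE_MASK) && sp_gen != (indicator & GEN_MASK))
		printf("generation mismatch: write protect this page\n");
	return 0;
}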

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 arch/x86/include/asm/kvm_host.h |  19 +++++
 arch/x86/kvm/mmu.c              | 179 ++++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu.h              |   1 +
 arch/x86/kvm/paging_tmpl.h      |  13 ++-
 4 files changed, 204 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5c30aa0..a581ff4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -338,6 +338,13 @@ struct kvm_mmu_page {
 	/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen.  */
 	unsigned long mmu_valid_gen;
 
+	/*
+	 * The generation number of write protection for all guest memory
+	 * which is synced with kvm_arch.mmu_write_protect_all_indicator
+	 * whenever it is linked into upper entry.
+	 */
+	u64 mmu_write_protect_all_gen;
+
 	DECLARE_BITMAP(unsync_child_bitmap, KVM_MMU_SP_ENTRY_NR);
 
 	DECLARE_BITMAP(possible_writable_spte_bitmap, KVM_MMU_SP_ENTRY_NR);
@@ -851,6 +858,18 @@ struct kvm_arch {
 	unsigned int n_max_mmu_pages;
 	unsigned int indirect_shadow_pages;
 	unsigned long mmu_valid_gen;
+
+	/*
+	 * The indicator of write protection for all guest memory.
+	 *
+	 * The top bit indicates if the write-protect is enabled,
+	 * remaining bits are used as a generation number which is
+	 * increased whenever write-protect is enabled.
+	 *
+	 * The enable bit and generation number are squeezed into
+	 * a single u64 so that it can be accessed atomically.
+	 */
+	atomic64_t mmu_write_protect_all_indicator;
 	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
 	/*
 	 * Hash table of struct kvm_mmu_page.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9daab00..047b897 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -489,6 +489,34 @@ static void kvm_mmu_reset_all_pte_masks(void)
 	shadow_nonpresent_or_rsvd_lower_gfn_mask =
 		GENMASK_ULL(low_phys_bits - 1, PAGE_SHIFT);
 }
+/* see the comments in struct kvm_arch. */
+#define WP_ALL_ENABLE_BIT	(63)
+#define WP_ALL_ENABLE_MASK	(1ull << WP_ALL_ENABLE_BIT)
+#define WP_ALL_GEN_MASK		(~0ull & ~WP_ALL_ENABLE_MASK)
+
+static bool is_write_protect_all_enabled(u64 indicator)
+{
+	return !!(indicator & WP_ALL_ENABLE_MASK);
+}
+
+static u64 get_write_protect_all_gen(u64 indicator)
+{
+	return indicator & WP_ALL_GEN_MASK;
+}
+
+static u64 get_write_protect_all_indicator(struct kvm *kvm)
+{
+	return atomic64_read(&kvm->arch.mmu_write_protect_all_indicator);
+}
+
+static void
+set_write_protect_all_indicator(struct kvm *kvm, bool enable, u64 generation)
+{
+	u64 value = (u64)(!!enable) << WP_ALL_ENABLE_BIT;
+
+	value |= generation & WP_ALL_GEN_MASK;
+	atomic64_set(&kvm->arch.mmu_write_protect_all_indicator, value);
+}
 
 static int is_cpuid_PSE36(void)
 {
@@ -2479,6 +2507,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     int direct,
 					     unsigned access)
 {
+	u64 write_protect_indicator;
 	union kvm_mmu_page_role role;
 	unsigned quadrant;
 	struct kvm_mmu_page *sp;
@@ -2553,6 +2582,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			flush |= kvm_sync_pages(vcpu, gfn, &invalid_list);
 	}
 	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
+	write_protect_indicator = get_write_protect_all_indicator(vcpu->kvm);
+	sp->mmu_write_protect_all_gen =
+			get_write_protect_all_gen(write_protect_indicator);
 	clear_page(sp->spt);
 	trace_kvm_mmu_get_page(sp, true);
 
@@ -3201,6 +3233,70 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
 	__direct_pte_prefetch(vcpu, sp, sptep);
 }
 
+static bool mmu_load_shadow_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	unsigned int offset;
+	u64 wp_all_indicator = get_write_protect_all_indicator(kvm);
+	u64 kvm_wp_all_gen = get_write_protect_all_gen(wp_all_indicator);
+	bool flush = false;
+
+	if (!is_write_protect_all_enabled(wp_all_indicator))
+		return false;
+
+	if (sp->mmu_write_protect_all_gen == kvm_wp_all_gen)
+		return false;
+
+	if (!sp->possiable_writable_sptes)
+		return false;
+
+	for_each_set_bit(offset, sp->possible_writable_spte_bitmap,
+	      KVM_MMU_SP_ENTRY_NR) {
+		u64 *sptep = sp->spt + offset, spte = *sptep;
+
+		if (!sp->possiable_writable_sptes)
+			break;
+
+		if (is_last_spte(spte, sp->role.level)) {
+			flush |= spte_write_protect(sptep, false);
+			continue;
+		}
+
+		mmu_spte_update_no_track(sptep, spte & ~PT_WRITABLE_MASK);
+		flush = true;
+	}
+
+	sp->mmu_write_protect_all_gen = kvm_wp_all_gen;
+	return flush;
+}
+
+static bool
+handle_readonly_upper_spte(struct kvm *kvm, u64 *sptep, int write_fault)
+{
+	u64 spte = *sptep;
+	struct kvm_mmu_page *child = page_header(spte & PT64_BASE_ADDR_MASK);
+	bool flush;
+
+	/*
+	 * delay the spte update to the point when write permission is
+	 * really needed.
+	 */
+	if (!write_fault)
+		return false;
+
+	/*
+	 * if it is already writable, that means the write-protection has
+	 * been moved to lower level.
+	 */
+	if (is_writable_pte(spte))
+		return false;
+
+	flush = mmu_load_shadow_page(kvm, child);
+
+	/* needn't flush tlb if the spte is changed from RO to RW. */
+	mmu_spte_update_no_track(sptep, spte | PT_WRITABLE_MASK);
+	return flush;
+}
+
 static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable,
 			int level, gfn_t gfn, kvm_pfn_t pfn, bool prefault)
 {
@@ -3208,6 +3304,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable,
 	struct kvm_mmu_page *sp;
 	int emulate = 0;
 	gfn_t pseudo_gfn;
+	bool flush = false;
 
 	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
 		return 0;
@@ -3230,10 +3327,19 @@ static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable,
 			pseudo_gfn = base_addr >> PAGE_SHIFT;
 			sp = kvm_mmu_get_page(vcpu, pseudo_gfn, iterator.addr,
 					      iterator.level - 1, 1, ACC_ALL);
+			if (write)
+				flush |= mmu_load_shadow_page(vcpu->kvm, sp);
 
 			link_shadow_page(vcpu, iterator.sptep, sp);
+			continue;
 		}
+
+		flush |= handle_readonly_upper_spte(vcpu->kvm, iterator.sptep,
+						    write);
 	}
+
+	if (flush)
+		kvm_flush_remote_tlbs(vcpu->kvm);
 	return emulate;
 }
 
@@ -3426,10 +3532,18 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 	do {
 		u64 new_spte;
 
-		for_each_shadow_entry_lockless(vcpu, gva, iterator, spte)
+		for_each_shadow_entry_lockless(vcpu, gva, iterator, spte) {
 			if (!is_shadow_present_pte(spte) ||
 			    iterator.level < level)
 				break;
+			/*
+			 * the fast path can not fix the upper spte which
+			 * is readonly.
+			 */
+			if ((error_code & PFERR_WRITE_MASK) &&
+			      !is_writable_pte(spte))
+				break;
+		}
 
 		sp = page_header(__pa(iterator.sptep));
 		if (!is_last_spte(spte, sp->role.level))
@@ -3657,26 +3771,37 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 		}
 		sp = kvm_mmu_get_page(vcpu, 0, 0,
 				vcpu->arch.mmu->shadow_root_level, 1, ACC_ALL);
+		if (mmu_load_shadow_page(vcpu->kvm, sp))
+			kvm_flush_remote_tlbs(vcpu->kvm);
+
 		++sp->root_count;
 		spin_unlock(&vcpu->kvm->mmu_lock);
 		vcpu->arch.mmu->root_hpa = __pa(sp->spt);
 	} else if (vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL) {
+		bool flush = false;
+
+		spin_lock(&vcpu->kvm->mmu_lock);
 		for (i = 0; i < 4; ++i) {
 			hpa_t root = vcpu->arch.mmu->pae_root[i];
 
 			MMU_WARN_ON(VALID_PAGE(root));
-			spin_lock(&vcpu->kvm->mmu_lock);
 			if (make_mmu_pages_available(vcpu) < 0) {
+				if (flush)
+					kvm_flush_remote_tlbs(vcpu->kvm);
 				spin_unlock(&vcpu->kvm->mmu_lock);
 				return -ENOSPC;
 			}
 			sp = kvm_mmu_get_page(vcpu, i << (30 - PAGE_SHIFT),
 					i << 30, PT32_ROOT_LEVEL, 1, ACC_ALL);
+			flush |= mmu_load_shadow_page(vcpu->kvm, sp);
 			root = __pa(sp->spt);
 			++sp->root_count;
-			spin_unlock(&vcpu->kvm->mmu_lock);
 			vcpu->arch.mmu->pae_root[i] = root | PT_PRESENT_MASK;
 		}
+
+		if (flush)
+			kvm_flush_remote_tlbs(vcpu->kvm);
+		spin_unlock(&vcpu->kvm->mmu_lock);
 		vcpu->arch.mmu->root_hpa = __pa(vcpu->arch.mmu->pae_root);
 	} else
 		BUG();
@@ -3690,6 +3815,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	u64 pdptr, pm_mask;
 	gfn_t root_gfn;
 	int i;
+	bool flush = false;
 
 	root_gfn = vcpu->arch.mmu->get_cr3(vcpu) >> PAGE_SHIFT;
 
@@ -3712,6 +3838,9 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 		}
 		sp = kvm_mmu_get_page(vcpu, root_gfn, 0,
 				vcpu->arch.mmu->shadow_root_level, 0, ACC_ALL);
+		if (mmu_load_shadow_page(vcpu->kvm, sp))
+			kvm_flush_remote_tlbs(vcpu->kvm);
+
 		root = __pa(sp->spt);
 		++sp->root_count;
 		spin_unlock(&vcpu->kvm->mmu_lock);
@@ -3728,6 +3857,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	if (vcpu->arch.mmu->shadow_root_level == PT64_ROOT_4LEVEL)
 		pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
 
+	spin_lock(&vcpu->kvm->mmu_lock);
 	for (i = 0; i < 4; ++i) {
 		hpa_t root = vcpu->arch.mmu->pae_root[i];
 
@@ -3739,22 +3869,30 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 				continue;
 			}
 			root_gfn = pdptr >> PAGE_SHIFT;
-			if (mmu_check_root(vcpu, root_gfn))
+			if (mmu_check_root(vcpu, root_gfn)) {
+				if (flush)
+					kvm_flush_remote_tlbs(vcpu->kvm);
+				spin_unlock(&vcpu->kvm->mmu_lock);
 				return 1;
+			}
 		}
-		spin_lock(&vcpu->kvm->mmu_lock);
 		if (make_mmu_pages_available(vcpu) < 0) {
+			if (flush)
+				kvm_flush_remote_tlbs(vcpu->kvm);
 			spin_unlock(&vcpu->kvm->mmu_lock);
 			return -ENOSPC;
 		}
 		sp = kvm_mmu_get_page(vcpu, root_gfn, i << 30, PT32_ROOT_LEVEL,
 				      0, ACC_ALL);
+		flush |= mmu_load_shadow_page(vcpu->kvm, sp);
 		root = __pa(sp->spt);
 		++sp->root_count;
-		spin_unlock(&vcpu->kvm->mmu_lock);
-
 		vcpu->arch.mmu->pae_root[i] = root | pm_mask;
 	}
+
+	if (flush)
+		kvm_flush_remote_tlbs(vcpu->kvm);
+	spin_unlock(&vcpu->kvm->mmu_lock);
 	vcpu->arch.mmu->root_hpa = __pa(vcpu->arch.mmu->pae_root);
 
 	/*
@@ -5972,6 +6110,33 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, struct kvm_memslots *slots)
 	}
 }
 
+void kvm_mmu_write_protect_all_pages(struct kvm *kvm, bool write_protect)
+{
+	u64 wp_all_indicator, kvm_wp_all_gen;
+
+	mutex_lock(&kvm->slots_lock);
+	wp_all_indicator = get_write_protect_all_indicator(kvm);
+	kvm_wp_all_gen = get_write_protect_all_gen(wp_all_indicator);
+
+	/*
+	 * whenever it is enabled, we increase the generation to
+	 * update shadow pages.
+	 */
+	if (write_protect)
+		kvm_wp_all_gen++;
+
+	set_write_protect_all_indicator(kvm, write_protect, kvm_wp_all_gen);
+
+	/*
+	 * if it is enabled, we need to sync the root page tables
+	 * immediately, otherwise, the write protection is dropped
+	 * on demand, i.e, when page fault is triggered.
+	 */
+	if (write_protect)
+		kvm_reload_remote_mmus(kvm);
+	mutex_unlock(&kvm->slots_lock);
+}
+
 static unsigned long
 mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index c7b3331..d5f9adbd 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -210,5 +210,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 				    struct kvm_memory_slot *slot, u64 gfn);
+void kvm_mmu_write_protect_all_pages(struct kvm *kvm, bool write_protect);
 int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 #endif
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 6bdca39..27166d7 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -602,6 +602,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 	struct kvm_shadow_walk_iterator it;
 	unsigned direct_access, access = gw->pt_access;
 	int top_level, ret;
+	bool flush = false;
 
 	direct_access = gw->pte_access;
 
@@ -633,6 +634,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			table_gfn = gw->table_gfn[it.level - 2];
 			sp = kvm_mmu_get_page(vcpu, table_gfn, addr, it.level-1,
 					      false, access);
+			if (write_fault)
+				flush |= mmu_load_shadow_page(vcpu->kvm, sp);
 		}
 
 		/*
@@ -644,6 +647,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 
 		if (sp)
 			link_shadow_page(vcpu, it.sptep, sp);
+		else
+			flush |= handle_readonly_upper_spte(vcpu->kvm, it.sptep,
+							    write_fault);
 	}
 
 	for (;
@@ -656,13 +662,18 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 
 		drop_large_spte(vcpu, it.sptep);
 
-		if (is_shadow_present_pte(*it.sptep))
+		if (is_shadow_present_pte(*it.sptep)) {
+			flush |= handle_readonly_upper_spte(vcpu->kvm,
+							it.sptep, write_fault);
 			continue;
+		}
 
 		direct_gfn = gw->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
 
 		sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
 				      true, direct_access);
+		if (write_fault)
+			flush |= mmu_load_shadow_page(vcpu->kvm, sp);
 		link_shadow_page(vcpu, it.sptep, sp);
 	}
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/4] KVM: MMU: fast cleanup D bit based on fast write protect
  2019-01-17 13:55 ` [Qemu-devel] " Zhuangyanying
@ 2019-01-17 13:55   ` Zhuangyanying
  -1 siblings, 0 replies; 24+ messages in thread
From: Zhuangyanying @ 2019-01-17 13:55 UTC (permalink / raw)
  To: xiaoguangrong, pbonzini, arei.gonglei
  Cc: liu.jinsong, wangxinxin.wang, Zhuang Yanying, qemu-devel, kvm

From: Zhuang Yanying <ann.zhuangyanying@huawei.com>

When live-migrating large-memory guests, vCPUs may hang for a long time
while migration is starting, e.g. about 9s for a 2T guest
(linux-5.0.0-rc2 + qemu-3.1.0).
The reason is that memory_global_dirty_log_start() takes too long while
the vCPUs are waiting for the BQL; the page-by-page D bit clearing is the
main time consumer. The idea of "KVM: MMU: fast write protect" by
Xiao Guangrong, especially the function kvm_mmu_write_protect_all_pages(),
is very helpful here. With a small modification on top of his patches it
solves this problem, bringing the hang down from 9s to 0.5s.

At the beginning of live migration, write protection is applied only to
the top-level SPTEs. A write from the VM then triggers an EPT violation,
and with for_each_shadow_entry the write protection is pushed down level
by level in __direct_map().
Finally the Dirty bit of the target page (in the level-1 page table) is
cleared and dirty page tracking is started. Of course, the page the GPA
belongs to is marked dirty in mmu_set_spte().
Xen has a similar implementation, using EMT (EPT memory type)
invalidation instead of write protection.

Signed-off-by: Zhuang Yanying <ann.zhuangyanying@huawei.com>
---
 arch/x86/kvm/mmu.c     | 8 +++++---
 arch/x86/kvm/vmx/vmx.c | 3 +--
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 047b897..a18bcc0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3257,7 +3257,10 @@ static bool mmu_load_shadow_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 			break;
 
 		if (is_last_spte(spte, sp->role.level)) {
-			flush |= spte_write_protect(sptep, false);
+			if (sp->role.level == PT_PAGE_TABLE_LEVEL)
+				flush |= spte_clear_dirty(sptep);
+			else
+				flush |= spte_write_protect(sptep, false);
 			continue;
 		}
 
@@ -6114,7 +6117,6 @@ void kvm_mmu_write_protect_all_pages(struct kvm *kvm, bool write_protect)
 {
 	u64 wp_all_indicator, kvm_wp_all_gen;
 
-	mutex_lock(&kvm->slots_lock);
 	wp_all_indicator = get_write_protect_all_indicator(kvm);
 	kvm_wp_all_gen = get_write_protect_all_gen(wp_all_indicator);
 
@@ -6134,8 +6136,8 @@ void kvm_mmu_write_protect_all_pages(struct kvm *kvm, bool write_protect)
 	 */
 	if (write_protect)
 		kvm_reload_remote_mmus(kvm);
-	mutex_unlock(&kvm->slots_lock);
 }
+EXPORT_SYMBOL_GPL(kvm_mmu_write_protect_all_pages);
 
 static unsigned long
 mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f6915f1..5236a07 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7180,8 +7180,7 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 static void vmx_slot_enable_log_dirty(struct kvm *kvm,
 				     struct kvm_memory_slot *slot)
 {
-	kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
-	kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
+	kvm_mmu_write_protect_all_pages(kvm, true);
 }
 
 static void vmx_slot_disable_log_dirty(struct kvm *kvm,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] KVM: MMU: correct the behavior of mmu_spte_update_no_track
  2019-01-17 13:55   ` [Qemu-devel] " Zhuangyanying
@ 2019-01-17 15:44     ` Sean Christopherson
  -1 siblings, 0 replies; 24+ messages in thread
From: Sean Christopherson @ 2019-01-17 15:44 UTC (permalink / raw)
  To: Zhuangyanying
  Cc: liu.jinsong, kvm, wangxinxin.wang, xiaoguangrong, qemu-devel,
	arei.gonglei, pbonzini

On Thu, Jan 17, 2019 at 01:55:28PM +0000, Zhuangyanying wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> Current behavior of mmu_spte_update_no_track() does not match
> the name of _no_track() as actually the A/D bits are tracked
> and returned to the caller

Sentences should be terminated with periods.

> This patch introduces the real _no_track() function to update

"This patch" is redundant, e.g. simply state "Introduce ...".

> the spte regardless of A/D bits and rename the original function
> to _track()

The function also avoids __update_clear_spte_slow(), i.e. AFAICT it
doesn't guarantee volatile bits will be preserved.  I assume this is
intentional, but it'd be nice to explain why this is ok.
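
(For reference, the "fast" vs. "slow" distinction being discussed boils down
to roughly the following on 64-bit hosts -- a sketch from memory, the 32-bit
PAE variants are more involved:

static void __update_clear_spte_fast(u64 *sptep, u64 spte)
{
	/* plain store: A/D bits the CPU sets concurrently can be lost */
	WRITE_ONCE(*sptep, spte);
}

static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
{
	/* atomic exchange: the old value, volatile bits included, comes back */
	return xchg(sptep, spte);
}

so skipping the slow path is only safe if no volatile bit in the old spte
matters to the caller.)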

> The _no_track() function will be used by later patches to update
> upper spte which need not care of A/D bits indeed

The _no_track() variant is already used (by mmu_spte_age()), so I don't
see any point in having this blurb in the changelog, e.g. it led me
to incorrectly think an unused function was being introduced.

> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> ---
>  arch/x86/kvm/mmu.c | 25 ++++++++++++++++++++++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index ce770b4..eeb3bac 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -731,10 +731,29 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte)
>  }
>  
>  /*
> - * Update the SPTE (excluding the PFN), but do not track changes in its
> + * Update the SPTE (excluding the PFN) regardless of accessed/dirty
> + * status which is used to update the upper level spte.
> + */
> +static void mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
> +{
> +	u64 old_spte = *sptep;

No need to snapshot the old spte since it's not being returned.

> +	WARN_ON(!is_shadow_present_pte(new_spte));
> +
> +	if (!is_shadow_present_pte(old_spte)) {
> +		mmu_spte_set(sptep, new_spte);
> +		return;

Similarly, this is more complex than it needs to be, e.g. the function
can be simplified to:

static void mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
{
	WARN_ON(!is_shadow_present_pte(new_spte));

	if (!is_shadow_present_pte(*sptep))
		mmu_spte_set(sptep, new_spte);
	else
		__update_clear_spte_fast(sptep, new_spte);
}

> +	}
> +
> +	__update_clear_spte_fast(sptep, new_spte);
> +}
> +
> +/*
> + * Update the SPTE (excluding the PFN), the original value is
> + * returned, based on it, the caller can track changes of its
>   * accessed/dirty status.
>   */
> -static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
> +static u64 mmu_spte_update_track(u64 *sptep, u64 new_spte)
>  {
>  	u64 old_spte = *sptep;
>  
> @@ -769,7 +788,7 @@ static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
>  static bool mmu_spte_update(u64 *sptep, u64 new_spte)
>  {
>  	bool flush = false;
> -	u64 old_spte = mmu_spte_update_no_track(sptep, new_spte);
> +	u64 old_spte = mmu_spte_update_track(sptep, new_spte);
>  
>  	if (!is_shadow_present_pte(old_spte))
>  		return false;
> -- 
> 1.8.3.1
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] KVM: MMU: introduce possible_writable_spte_bitmap
  2019-01-17 13:55   ` [Qemu-devel] " Zhuangyanying
@ 2019-01-17 15:55     ` Sean Christopherson
  -1 siblings, 0 replies; 24+ messages in thread
From: Sean Christopherson @ 2019-01-17 15:55 UTC (permalink / raw)
  To: Zhuangyanying
  Cc: liu.jinsong, kvm, wangxinxin.wang, xiaoguangrong, qemu-devel,
	arei.gonglei, pbonzini

On Thu, Jan 17, 2019 at 01:55:29PM +0000, Zhuangyanying wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> It is used to track possible writable sptes on the shadow page on
> which the bit is set to 1 for the sptes that are already writable
> or can be locklessly updated to writable on the fast_page_fault
> path, also a counter for the number of possible writable sptes is
> introduced to speed up bitmap walking
> 
> Later patch will benefit good performance by using this bitmap and
> counter to fast figure out writable sptes and write protect them
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  6 ++++-
>  arch/x86/kvm/mmu.c              | 53 ++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 57 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 4660ce9..5c30aa0 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -128,6 +128,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
>  #define KVM_MIN_ALLOC_MMU_PAGES 64
>  #define KVM_MMU_HASH_SHIFT 12
>  #define KVM_NUM_MMU_PAGES (1 << KVM_MMU_HASH_SHIFT)
> +#define KVM_MMU_SP_ENTRY_NR 512
>  #define KVM_MIN_FREE_MMU_PAGES 5
>  #define KVM_REFILL_PAGES 25
>  #define KVM_MAX_CPUID_ENTRIES 80
> @@ -331,12 +332,15 @@ struct kvm_mmu_page {
>  	gfn_t *gfns;
>  	int root_count;          /* Currently serving as active root */
>  	unsigned int unsync_children;
> +	unsigned int possiable_writable_sptes;

s/possiable/possible

>  	struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
>  
>  	/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen.  */
>  	unsigned long mmu_valid_gen;
>  
> -	DECLARE_BITMAP(unsync_child_bitmap, 512);
> +	DECLARE_BITMAP(unsync_child_bitmap, KVM_MMU_SP_ENTRY_NR);
> +
> +	DECLARE_BITMAP(possible_writable_spte_bitmap, KVM_MMU_SP_ENTRY_NR);
>  
>  #ifdef CONFIG_X86_32
>  	/*
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index eeb3bac..9daab00 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -718,6 +718,49 @@ static bool is_dirty_spte(u64 spte)
>  	return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
>  }
>  
> +static bool is_possible_writable_spte(u64 spte)
> +{
> +	if (!is_shadow_present_pte(spte))
> +		return false;
> +
> +	if (is_writable_pte(spte))
> +		return true;
> +
> +	if (spte_can_locklessly_be_made_writable(spte))
> +		return true;
> +
> +	/*
> +	 * although is_access_track_spte() sptes can be updated out of
> +	 * mmu-lock, we need not take them into account as access_track
> +	 * drops writable bit for them
> +	 */
> +	return false;
> +}
> +
> +static void
> +mmu_log_possible_writable_spte(u64 *sptep, u64 old_spte, u64 new_spte)
> +{
> +	struct kvm_mmu_page *sp = page_header(__pa(sptep));
> +	bool old_state, new_state;
> +
> +	old_state = is_possible_writable_spte(old_spte);
> +	new_state = is_possible_writable_spte(new_spte);
> +
> +	if (old_state == new_state)
> +		return;
> +
> +	/* a possible writable spte is dropped */
> +	if (old_state) {
> +		sp->possiable_writable_sptes--;
> +		__clear_bit(sptep - sp->spt, sp->possible_writable_spte_bitmap);
> +		return;
> +	}
> +
> +	/* a new possible writable spte is set */
> +	sp->possiable_writable_sptes++;
> +	__set_bit(sptep - sp->spt, sp->possible_writable_spte_bitmap);
> +}
> +
>  /* Rules for using mmu_spte_set:
>   * Set the sptep from nonpresent to present.
>   * Note: the sptep being assigned *must* be either not present
> @@ -728,6 +771,7 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte)
>  {
>  	WARN_ON(is_shadow_present_pte(*sptep));
>  	__set_spte(sptep, new_spte);
> +	mmu_log_possible_writable_spte(sptep, 0ull, new_spte);
>  }
>  
>  /*
> @@ -746,6 +790,7 @@ static void mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
>  	}
>  
>  	__update_clear_spte_fast(sptep, new_spte);
> +	mmu_log_possible_writable_spte(sptep, old_spte, new_spte);

Hmm, maybe it's just me, but I find the "track vs. no_track" terminology
quite confusing, e.g. the "no_track" functions are now logging sptes.
Track and log aren't exact synonyms, but they're pretty close.

>  }
>  
>  /*
> @@ -771,6 +816,7 @@ static u64 mmu_spte_update_track(u64 *sptep, u64 new_spte)
>  
>  	WARN_ON(spte_to_pfn(old_spte) != spte_to_pfn(new_spte));
>  
> +	mmu_log_possible_writable_spte(sptep, old_spte, new_spte);
>  	return old_spte;
>  }
>  
> @@ -836,6 +882,8 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
>  	else
>  		old_spte = __update_clear_spte_slow(sptep, 0ull);
>  
> +	mmu_log_possible_writable_spte(sptep, old_spte, 0ull);
> +
>  	if (!is_shadow_present_pte(old_spte))
>  		return 0;
>  
> @@ -864,7 +912,10 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
>   */
>  static void mmu_spte_clear_no_track(u64 *sptep)
>  {
> +	u64 old_spte = *sptep;
> +
>  	__update_clear_spte_fast(sptep, 0ull);
> +	mmu_log_possible_writable_spte(sptep, old_spte, 0ull);
>  }
>  
>  static u64 mmu_spte_get_lockless(u64 *sptep)
> @@ -2159,7 +2210,7 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
>  {
>  	int i, ret, nr_unsync_leaf = 0;
>  
> -	for_each_set_bit(i, sp->unsync_child_bitmap, 512) {
> +	for_each_set_bit(i, sp->unsync_child_bitmap, KVM_MMU_SP_ENTRY_NR) {
>  		struct kvm_mmu_page *child;
>  		u64 ent = sp->spt[i];
>  
> -- 
> 1.8.3.1
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] KVM: MMU: introduce kvm_mmu_write_protect_all_pages
  2019-01-17 13:55   ` [Qemu-devel] " Zhuangyanying
@ 2019-01-17 16:09     ` Sean Christopherson
  -1 siblings, 0 replies; 24+ messages in thread
From: Sean Christopherson @ 2019-01-17 16:09 UTC (permalink / raw)
  To: Zhuangyanying
  Cc: liu.jinsong, kvm, wangxinxin.wang, xiaoguangrong, qemu-devel,
	arei.gonglei, pbonzini

On Thu, Jan 17, 2019 at 01:55:30PM +0000, Zhuangyanying wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> The original idea is from Avi. kvm_mmu_write_protect_all_pages() is
> extremely fast to write protect all the guest memory. Comparing with
> the ordinary algorithm which write protects last level sptes based on
> the rmap one by one, it just simply updates the generation number to
> ask all vCPUs to reload its root page table, particularly, it can be
> done out of mmu-lock, so that it does not hurt vMMU's parallel. It is
> the O(1) algorithm which does not depends on the capacity of guest's
> memory and the number of guest's vCPUs
> 
> When reloading its root page table, the vCPU checks root page table's
> generation number with current global number, if it is not matched, it
> makes all the entries in page readonly and directly go to VM. So the
> read access is still going on smoothly without KVM's involvement and
> write access triggers page fault, then KVM moves the write protection
> from the upper level to the lower level page - by making all the entries
> in the lower page readonly first then make the upper level writable,
> this operation is repeated until we meet the last spte
> 
> In order to speed up the process of making all entries readonly, we
> introduce possible_writable_spte_bitmap which indicates the writable
> sptes and possiable_writable_sptes which is a counter indicating the
> number of writable sptes, this works very efficiently as usually only
> one entry in PML4 ( < 512 G),few entries in PDPT (only entry indicates
> 1G memory), PDEs and PTEs need to be write protected for the worst case.

The possible_writable_spte* stuff was introduced in the previous patch,
i.e. at least some part of this blurb belongs in patch 2/4.

> Note, the number of page fault and TLB flush are the same as the ordinary
> algorithm. During our test, for a VM which has 3G memory and 12 vCPUs,
> we benchmarked the performance of pure memory write after write protection,
> noticed only 3% is dropped, however, we also benchmarked the case that
> run the test case of pure memory-write in the new booted VM (i.e, it will
> trigger #PF to map memory), at the same time, live migration is going on,
> we noticed the diry page ratio is increased ~50%, that means, the memory's
> performance is hugely improved during live migration
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  19 +++++
>  arch/x86/kvm/mmu.c              | 179 ++++++++++++++++++++++++++++++++++++++--
>  arch/x86/kvm/mmu.h              |   1 +
>  arch/x86/kvm/paging_tmpl.h      |  13 ++-
>  4 files changed, 204 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5c30aa0..a581ff4 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -338,6 +338,13 @@ struct kvm_mmu_page {
>  	/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen.  */
>  	unsigned long mmu_valid_gen;
>  
> +	/*
> +	 * The generation number of write protection for all guest memory
> +	 * which is synced with kvm_arch.mmu_write_protect_all_indicator
> +	 * whenever it is linked into upper entry.
> +	 */
> +	u64 mmu_write_protect_all_gen;
> +
>  	DECLARE_BITMAP(unsync_child_bitmap, KVM_MMU_SP_ENTRY_NR);
>  
>  	DECLARE_BITMAP(possible_writable_spte_bitmap, KVM_MMU_SP_ENTRY_NR);
> @@ -851,6 +858,18 @@ struct kvm_arch {
>  	unsigned int n_max_mmu_pages;
>  	unsigned int indirect_shadow_pages;
>  	unsigned long mmu_valid_gen;
> +
> +	/*
> +	 * The indicator of write protection for all guest memory.
> +	 *
> +	 * The top bit indicates if the write-protect is enabled,
> +	 * remaining bits are used as a generation number which is
> +	 * increased whenever write-protect is enabled.
> +	 *
> +	 * The enable bit and generation number are squeezed into
> +	 * a single u64 so that it can be accessed atomically.
> +	 */
> +	atomic64_t mmu_write_protect_all_indicator;
>  	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
>  	/*
>  	 * Hash table of struct kvm_mmu_page.
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 9daab00..047b897 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -489,6 +489,34 @@ static void kvm_mmu_reset_all_pte_masks(void)
>  	shadow_nonpresent_or_rsvd_lower_gfn_mask =
>  		GENMASK_ULL(low_phys_bits - 1, PAGE_SHIFT);
>  }
> +/* see the comments in struct kvm_arch. */
> +#define WP_ALL_ENABLE_BIT	(63)
> +#define WP_ALL_ENABLE_MASK	(1ull << WP_ALL_ENABLE_BIT)
> +#define WP_ALL_GEN_MASK		(~0ull & ~WP_ALL_ENABLE_MASK)
> +
> +static bool is_write_protect_all_enabled(u64 indicator)
> +{
> +	return !!(indicator & WP_ALL_ENABLE_MASK);
> +}
> +
> +static u64 get_write_protect_all_gen(u64 indicator)
> +{
> +	return indicator & WP_ALL_GEN_MASK;
> +}
> +
> +static u64 get_write_protect_all_indicator(struct kvm *kvm)
> +{
> +	return atomic64_read(&kvm->arch.mmu_write_protect_all_indicator);
> +}
> +
> +static void
> +set_write_protect_all_indicator(struct kvm *kvm, bool enable, u64 generation)
> +{
> +	u64 value = (u64)(!!enable) << WP_ALL_ENABLE_BIT;
> +
> +	value |= generation & WP_ALL_GEN_MASK;
> +	atomic64_set(&kvm->arch.mmu_write_protect_all_indicator, value);
> +}
>  
>  static int is_cpuid_PSE36(void)
>  {
> @@ -2479,6 +2507,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  					     int direct,
>  					     unsigned access)
>  {
> +	u64 write_protect_indicator;
>  	union kvm_mmu_page_role role;
>  	unsigned quadrant;
>  	struct kvm_mmu_page *sp;
> @@ -2553,6 +2582,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  			flush |= kvm_sync_pages(vcpu, gfn, &invalid_list);
>  	}
>  	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> +	write_protect_indicator = get_write_protect_all_indicator(vcpu->kvm);
> +	sp->mmu_write_protect_all_gen =
> +			get_write_protect_all_gen(write_protect_indicator);

Oof.  This is a kludgy way to use an atomic.  What about:

static bool get_write_protect_all_indicator(struct kvm *kvm, u64 *gen)
{
	u64 val = atomic64_read(&kvm->arch.mmu_write_protect_all_indicator);
	*gen = val & WP_ALL_GEN_MASK;
	return !!(val & WP_ALL_ENABLE_MASK);
}
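
mmu_load_shadow_page() below could then consume both pieces with a single
call, e.g. (untested, just to show the shape):

	u64 kvm_wp_all_gen;

	if (!get_write_protect_all_indicator(kvm, &kvm_wp_all_gen))
		return false;

	if (sp->mmu_write_protect_all_gen == kvm_wp_all_gen)
		return false;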

>  	clear_page(sp->spt);
>  	trace_kvm_mmu_get_page(sp, true);
>  
> @@ -3201,6 +3233,70 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
>  	__direct_pte_prefetch(vcpu, sp, sptep);
>  }
>  
> +static bool mmu_load_shadow_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> +	unsigned int offset;
> +	u64 wp_all_indicator = get_write_protect_all_indicator(kvm);
> +	u64 kvm_wp_all_gen = get_write_protect_all_gen(wp_all_indicator);
> +	bool flush = false;
> +
> +	if (!is_write_protect_all_enabled(wp_all_indicator))
> +		return false;
> +
> +	if (sp->mmu_write_protect_all_gen == kvm_wp_all_gen)
> +		return false;
> +
> +	if (!sp->possiable_writable_sptes)
> +		return false;
> +
> +	for_each_set_bit(offset, sp->possible_writable_spte_bitmap,
> +	      KVM_MMU_SP_ENTRY_NR) {
> +		u64 *sptep = sp->spt + offset, spte = *sptep;
> +
> +		if (!sp->possiable_writable_sptes)
> +			break;
> +
> +		if (is_last_spte(spte, sp->role.level)) {
> +			flush |= spte_write_protect(sptep, false);
> +			continue;
> +		}
> +
> +		mmu_spte_update_no_track(sptep, spte & ~PT_WRITABLE_MASK);
> +		flush = true;
> +	}
> +
> +	sp->mmu_write_protect_all_gen = kvm_wp_all_gen;
> +	return flush;
> +}
> +
> +static bool
> +handle_readonly_upper_spte(struct kvm *kvm, u64 *sptep, int write_fault)
> +{
> +	u64 spte = *sptep;
> +	struct kvm_mmu_page *child = page_header(spte & PT64_BASE_ADDR_MASK);
> +	bool flush;
> +
> +	/*
> +	 * delay the spte update to the point when write permission is
> +	 * really needed.
> +	 */
> +	if (!write_fault)
> +		return false;
> +
> +	/*
> +	 * if it is already writable, that means the write-protection has
> +	 * been moved to lower level.
> +	 */
> +	if (is_writable_pte(spte))
> +		return false;
> +
> +	flush = mmu_load_shadow_page(kvm, child);
> +
> +	/* needn't flush tlb if the spte is changed from RO to RW. */
> +	mmu_spte_update_no_track(sptep, spte | PT_WRITABLE_MASK);
> +	return flush;
> +}
> +
>  static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable,
>  			int level, gfn_t gfn, kvm_pfn_t pfn, bool prefault)
>  {
> @@ -3208,6 +3304,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable,
>  	struct kvm_mmu_page *sp;
>  	int emulate = 0;
>  	gfn_t pseudo_gfn;
> +	bool flush = false;
>  
>  	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
>  		return 0;
> @@ -3230,10 +3327,19 @@ static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable,
>  			pseudo_gfn = base_addr >> PAGE_SHIFT;
>  			sp = kvm_mmu_get_page(vcpu, pseudo_gfn, iterator.addr,
>  					      iterator.level - 1, 1, ACC_ALL);
> +			if (write)
> +				flush |= mmu_load_shadow_page(vcpu->kvm, sp);
>  
>  			link_shadow_page(vcpu, iterator.sptep, sp);
> +			continue;
>  		}
> +
> +		flush |= handle_readonly_upper_spte(vcpu->kvm, iterator.sptep,
> +						    write);
>  	}
> +
> +	if (flush)
> +		kvm_flush_remote_tlbs(vcpu->kvm);
>  	return emulate;
>  }
>  
> @@ -3426,10 +3532,18 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
>  	do {
>  		u64 new_spte;
>  
> -		for_each_shadow_entry_lockless(vcpu, gva, iterator, spte)
> +		for_each_shadow_entry_lockless(vcpu, gva, iterator, spte) {
>  			if (!is_shadow_present_pte(spte) ||
>  			    iterator.level < level)
>  				break;
> +			/*
> +			 * the fast path can not fix the upper spte which
> +			 * is readonly.
> +			 */
> +			if ((error_code & PFERR_WRITE_MASK) &&
> +			      !is_writable_pte(spte))
> +				break;
> +		}
>  
>  		sp = page_header(__pa(iterator.sptep));
>  		if (!is_last_spte(spte, sp->role.level))
> @@ -3657,26 +3771,37 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>  		}
>  		sp = kvm_mmu_get_page(vcpu, 0, 0,
>  				vcpu->arch.mmu->shadow_root_level, 1, ACC_ALL);
> +		if (mmu_load_shadow_page(vcpu->kvm, sp))
> +			kvm_flush_remote_tlbs(vcpu->kvm);
> +
>  		++sp->root_count;
>  		spin_unlock(&vcpu->kvm->mmu_lock);
>  		vcpu->arch.mmu->root_hpa = __pa(sp->spt);
>  	} else if (vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL) {
> +		bool flush = false;
> +
> +		spin_lock(&vcpu->kvm->mmu_lock);
>  		for (i = 0; i < 4; ++i) {
>  			hpa_t root = vcpu->arch.mmu->pae_root[i];
>  
>  			MMU_WARN_ON(VALID_PAGE(root));
> -			spin_lock(&vcpu->kvm->mmu_lock);
>  			if (make_mmu_pages_available(vcpu) < 0) {
> +				if (flush)
> +					kvm_flush_remote_tlbs(vcpu->kvm);
>  				spin_unlock(&vcpu->kvm->mmu_lock);
>  				return -ENOSPC;
>  			}
>  			sp = kvm_mmu_get_page(vcpu, i << (30 - PAGE_SHIFT),
>  					i << 30, PT32_ROOT_LEVEL, 1, ACC_ALL);
> +			flush |= mmu_load_shadow_page(vcpu->kvm, sp);
>  			root = __pa(sp->spt);
>  			++sp->root_count;
> -			spin_unlock(&vcpu->kvm->mmu_lock);
>  			vcpu->arch.mmu->pae_root[i] = root | PT_PRESENT_MASK;
>  		}
> +
> +		if (flush)
> +			kvm_flush_remote_tlbs(vcpu->kvm);
> +		spin_unlock(&vcpu->kvm->mmu_lock);
>  		vcpu->arch.mmu->root_hpa = __pa(vcpu->arch.mmu->pae_root);
>  	} else
>  		BUG();
> @@ -3690,6 +3815,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  	u64 pdptr, pm_mask;
>  	gfn_t root_gfn;
>  	int i;
> +	bool flush = false;
>  
>  	root_gfn = vcpu->arch.mmu->get_cr3(vcpu) >> PAGE_SHIFT;
>  
> @@ -3712,6 +3838,9 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  		}
>  		sp = kvm_mmu_get_page(vcpu, root_gfn, 0,
>  				vcpu->arch.mmu->shadow_root_level, 0, ACC_ALL);
> +		if (mmu_load_shadow_page(vcpu->kvm, sp))
> +			kvm_flush_remote_tlbs(vcpu->kvm);
> +
>  		root = __pa(sp->spt);
>  		++sp->root_count;
>  		spin_unlock(&vcpu->kvm->mmu_lock);
> @@ -3728,6 +3857,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  	if (vcpu->arch.mmu->shadow_root_level == PT64_ROOT_4LEVEL)
>  		pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
>  
> +	spin_lock(&vcpu->kvm->mmu_lock);
>  	for (i = 0; i < 4; ++i) {
>  		hpa_t root = vcpu->arch.mmu->pae_root[i];
>  
> @@ -3739,22 +3869,30 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  				continue;
>  			}
>  			root_gfn = pdptr >> PAGE_SHIFT;
> -			if (mmu_check_root(vcpu, root_gfn))
> +			if (mmu_check_root(vcpu, root_gfn)) {
> +				if (flush)
> +					kvm_flush_remote_tlbs(vcpu->kvm);
> +				spin_unlock(&vcpu->kvm->mmu_lock);
>  				return 1;
> +			}
>  		}
> -		spin_lock(&vcpu->kvm->mmu_lock);
>  		if (make_mmu_pages_available(vcpu) < 0) {
> +			if (flush)
> +				kvm_flush_remote_tlbs(vcpu->kvm);
>  			spin_unlock(&vcpu->kvm->mmu_lock);
>  			return -ENOSPC;
>  		}
>  		sp = kvm_mmu_get_page(vcpu, root_gfn, i << 30, PT32_ROOT_LEVEL,
>  				      0, ACC_ALL);
> +		flush |= mmu_load_shadow_page(vcpu->kvm, sp);
>  		root = __pa(sp->spt);
>  		++sp->root_count;
> -		spin_unlock(&vcpu->kvm->mmu_lock);
> -
>  		vcpu->arch.mmu->pae_root[i] = root | pm_mask;
>  	}
> +
> +	if (flush)
> +		kvm_flush_remote_tlbs(vcpu->kvm);
> +	spin_unlock(&vcpu->kvm->mmu_lock);
>  	vcpu->arch.mmu->root_hpa = __pa(vcpu->arch.mmu->pae_root);
>  
>  	/*
> @@ -5972,6 +6110,33 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, struct kvm_memslots *slots)
>  	}
>  }
>  
> +void kvm_mmu_write_protect_all_pages(struct kvm *kvm, bool write_protect)
> +{
> +	u64 wp_all_indicator, kvm_wp_all_gen;
> +
> +	mutex_lock(&kvm->slots_lock);
> +	wp_all_indicator = get_write_protect_all_indicator(kvm);
> +	kvm_wp_all_gen = get_write_protect_all_gen(wp_all_indicator);
> +
> +	/*
> +	 * whenever it is enabled, we increase the generation to
> +	 * update shadow pages.
> +	 */
> +	if (write_protect)
> +		kvm_wp_all_gen++;
> +
> +	set_write_protect_all_indicator(kvm, write_protect, kvm_wp_all_gen);

This is an even more bizarre usage of an atomic since the gen is being
updated in a non-atomic fashion.  I assume there's no race due to
holding slots_lock, but it's still weird.  It begs the question, do we
actually need an atomic?
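
If all updates really are serialized by slots_lock, and readers only need an
un-torn snapshot of a naturally aligned u64 (which READ_ONCE()/WRITE_ONCE()
already provide on x86-64), a plain field would do.  A rough sketch keeping
the patch's names:

/* kvm->arch.mmu_write_protect_all_indicator becomes a plain u64 */
static u64 get_write_protect_all_indicator(struct kvm *kvm)
{
	return READ_ONCE(kvm->arch.mmu_write_protect_all_indicator);
}

static void
set_write_protect_all_indicator(struct kvm *kvm, bool enable, u64 generation)
{
	u64 value = ((u64)!!enable << WP_ALL_ENABLE_BIT) |
		    (generation & WP_ALL_GEN_MASK);

	/* all writers are assumed to hold slots_lock */
	lockdep_assert_held(&kvm->slots_lock);
	WRITE_ONCE(kvm->arch.mmu_write_protect_all_indicator, value);
}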

> +
> +	/*
> +	 * if it is enabled, we need to sync the root page tables
> +	 * immediately, otherwise, the write protection is dropped
> +	 * on demand, i.e, when page fault is triggered.
> +	 */
> +	if (write_protect)
> +		kvm_reload_remote_mmus(kvm);
> +	mutex_unlock(&kvm->slots_lock);
> +}
> +
>  static unsigned long
>  mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
>  {
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index c7b3331..d5f9adbd 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -210,5 +210,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
>  bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
>  				    struct kvm_memory_slot *slot, u64 gfn);
> +void kvm_mmu_write_protect_all_pages(struct kvm *kvm, bool write_protect);
>  int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
>  #endif
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 6bdca39..27166d7 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -602,6 +602,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
>  	struct kvm_shadow_walk_iterator it;
>  	unsigned direct_access, access = gw->pt_access;
>  	int top_level, ret;
> +	bool flush = false;
>  
>  	direct_access = gw->pte_access;
>  
> @@ -633,6 +634,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
>  			table_gfn = gw->table_gfn[it.level - 2];
>  			sp = kvm_mmu_get_page(vcpu, table_gfn, addr, it.level-1,
>  					      false, access);
> +			if (write_fault)
> +				flush |= mmu_load_shadow_page(vcpu->kvm, sp);
>  		}
>  
>  		/*
> @@ -644,6 +647,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
>  
>  		if (sp)
>  			link_shadow_page(vcpu, it.sptep, sp);
> +		else
> +			flush |= handle_readonly_upper_spte(vcpu->kvm, it.sptep,
> +							    write_fault);
>  	}
>  
>  	for (;
> @@ -656,13 +662,18 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
>  
>  		drop_large_spte(vcpu, it.sptep);
>  
> -		if (is_shadow_present_pte(*it.sptep))
> +		if (is_shadow_present_pte(*it.sptep)) {
> +			flush |= handle_readonly_upper_spte(vcpu->kvm,
> +							it.sptep, write_fault);
>  			continue;
> +		}
>  
>  		direct_gfn = gw->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
>  
>  		sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
>  				      true, direct_access);
> +		if (write_fault)
> +			flush |= mmu_load_shadow_page(vcpu->kvm, sp);
>  		link_shadow_page(vcpu, it.sptep, sp);
>  	}
>  
> -- 
> 1.8.3.1
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] KVM: MMU: fast cleanup D bit based on fast write protect
  2019-01-17 13:55   ` [Qemu-devel] " Zhuangyanying
@ 2019-01-17 16:32     ` Sean Christopherson
  -1 siblings, 0 replies; 24+ messages in thread
From: Sean Christopherson @ 2019-01-17 16:32 UTC (permalink / raw)
  To: Zhuangyanying
  Cc: liu.jinsong, kvm, wangxinxin.wang, xiaoguangrong, qemu-devel,
	arei.gonglei, pbonzini

On Thu, Jan 17, 2019 at 01:55:31PM +0000, Zhuangyanying wrote:
> From: Zhuang Yanying <ann.zhuangyanying@huawei.com>
> 
> When live-migration with large-memory guests, vcpu may hang for a long
> time while starting migration, such as 9s for 2T
> (linux-5.0.0-rc2+qemu-3.1.0).
> The reason is memory_global_dirty_log_start() taking too long, and the
> vcpu is waiting for BQL. The page-by-page D bit clearup is the main time
> consumption. I think that the idea of "KVM: MMU: fast write protect" by
> xiaoguangrong, especially the function kvm_mmu_write_protect_all_pages(),
> is very helpful. After a little modifcation, on his patch, can solve
> this problem, 9s to 0.5s.
> 
> At the beginning of live migration, write protection is only applied to the
> top-level SPTE. Then the write from vm trigger the EPT violation, with
> for_each_shadow_entry write protection is performed at dirct_map.
> Finally the Dirty bit of the target page(at level 1 page table) is
> cleared, and the dirty page tracking is started. Of coure, the page where
> GPA is located is marked dirty when mmu_set_spte.
> A similar implementation on xen, just emt instead of write protection.
> 
> Signed-off-by: Zhuang Yanying <ann.zhuangyanying@huawei.com>
> ---
>  arch/x86/kvm/mmu.c     | 8 +++++---
>  arch/x86/kvm/vmx/vmx.c | 3 +--
>  2 files changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 047b897..a18bcc0 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3257,7 +3257,10 @@ static bool mmu_load_shadow_page(struct kvm *kvm, struct kvm_mmu_page *sp)
>  			break;
>  
>  		if (is_last_spte(spte, sp->role.level)) {
> -			flush |= spte_write_protect(sptep, false);
> +			if (sp->role.level == PT_PAGE_TABLE_LEVEL)
> +				flush |= spte_clear_dirty(sptep);
> +			else
> +				flush |= spte_write_protect(sptep, false);
>  			continue;
>  		}
>  
> @@ -6114,7 +6117,6 @@ void kvm_mmu_write_protect_all_pages(struct kvm *kvm, bool write_protect)
>  {
>  	u64 wp_all_indicator, kvm_wp_all_gen;
>  
> -	mutex_lock(&kvm->slots_lock);
>  	wp_all_indicator = get_write_protect_all_indicator(kvm);
>  	kvm_wp_all_gen = get_write_protect_all_gen(wp_all_indicator);
>  
> @@ -6134,8 +6136,8 @@ void kvm_mmu_write_protect_all_pages(struct kvm *kvm, bool write_protect)
>  	 */
>  	if (write_protect)
>  		kvm_reload_remote_mmus(kvm);
> -	mutex_unlock(&kvm->slots_lock);

Why is the lock removed?  And why was it added in the first place?

>  }
> +EXPORT_SYMBOL_GPL(kvm_mmu_write_protect_all_pages);
>  
>  static unsigned long
>  mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index f6915f1..5236a07 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7180,8 +7180,7 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
>  static void vmx_slot_enable_log_dirty(struct kvm *kvm,
>  				     struct kvm_memory_slot *slot)
>  {
> -	kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
> -	kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
> +	kvm_mmu_write_protect_all_pages(kvm, true);

What's the purpose of having @write_protect if
kvm_mmu_write_protect_all_pages() is only ever called to enable
protection?  If there's no known scenario where write protection is
explicitly disabled then there's no need for WP_ALL_ENABLE_MASK, i.e. a
non-zero generation would indicate write protection is enabled.  That'd
simplify the code and clean up the atomic usage.
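
For illustration only, a minimal sketch of that simplification, assuming the
generation is kept in a hypothetical atomic64_t field (kvm->arch.wp_all_gen
is an illustrative name, not code from this series):

void kvm_mmu_write_protect_all_pages(struct kvm *kvm)
{
	/*
	 * An ever-increasing, non-zero generation is itself the
	 * "write protection is enabled" indicator, so no separate
	 * WP_ALL_ENABLE_MASK is needed.
	 */
	atomic64_inc(&kvm->arch.wp_all_gen);

	/* Force vcpus to reload their roots and pick up the new generation. */
	kvm_reload_remote_mmus(kvm);
}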

>  }
>  
>  static void vmx_slot_disable_log_dirty(struct kvm *kvm,
> -- 
> 1.8.3.1
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] KVM: MMU: fast cleanup D bit based on fast write protect
  2019-01-17 16:32     ` [Qemu-devel] " Sean Christopherson
@ 2019-01-21  6:37       ` Zhuangyanying
  -1 siblings, 0 replies; 24+ messages in thread
From: Zhuangyanying @ 2019-01-21  6:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Liujinsong (Paul), xiaoguangrong.eric, kvm, wangxin (U),
	xiaoguangrong, qemu-devel, Gonglei (Arei), Zhoujian (jay),
	pbonzini



> -----Original Message-----
> From: Sean Christopherson [mailto:sean.j.christopherson@intel.com]
> Sent: Friday, January 18, 2019 12:32 AM
> To: Zhuangyanying <ann.zhuangyanying@huawei.com>
> Cc: xiaoguangrong@tencent.com; pbonzini@redhat.com; Gonglei (Arei)
> <arei.gonglei@huawei.com>; qemu-devel@nongnu.org; kvm@vger.kernel.org;
> wangxin (U) <wangxinxin.wang@huawei.com>; Liujinsong (Paul)
> <liu.jinsong@huawei.com>
> Subject: Re: [PATCH 4/4] KVM: MMU: fast cleanup D bit based on fast write
> protect
> 
> On Thu, Jan 17, 2019 at 01:55:31PM +0000, Zhuangyanying wrote:
> > From: Zhuang Yanying <ann.zhuangyanying@huawei.com>
> >
> > When live-migrating large-memory guests, a vcpu may hang for a long
> > time while starting migration, such as 9s for 2T
> > (linux-5.0.0-rc2+qemu-3.1.0).
> > The reason is that memory_global_dirty_log_start() takes too long
> > while the vcpu is waiting for the BQL. The page-by-page D bit clearing
> > is the main time consumption. I think that the idea of "KVM: MMU: fast
> > write protect" by xiaoguangrong, especially the function
> > kvm_mmu_write_protect_all_pages(), is very helpful. With a small
> > modification on top of his patches it solves this problem, reducing
> > 9s to 0.5s.
> >
> > At the beginning of live migration, write protection is only applied
> > to the top-level SPTE. A write from the VM then triggers an EPT
> > violation, and write protection is performed on the direct map with
> > for_each_shadow_entry.
> > Finally the Dirty bit of the target page (at the level 1 page table)
> > is cleared and dirty page tracking is started. Of course, the page
> > where the GPA is located is marked dirty in mmu_set_spte.
> > Xen has a similar implementation, just using emt instead of write
> > protection.
> >
> > Signed-off-by: Zhuang Yanying <ann.zhuangyanying@huawei.com>
> > ---
> >  arch/x86/kvm/mmu.c     | 8 +++++---
> >  arch/x86/kvm/vmx/vmx.c | 3 +--
> >  2 files changed, 6 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index
> > 047b897..a18bcc0 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -3257,7 +3257,10 @@ static bool mmu_load_shadow_page(struct kvm
> *kvm, struct kvm_mmu_page *sp)
> >  			break;
> >
> >  		if (is_last_spte(spte, sp->role.level)) {
> > -			flush |= spte_write_protect(sptep, false);
> > +			if (sp->role.level == PT_PAGE_TABLE_LEVEL)
> > +				flush |= spte_clear_dirty(sptep);
> > +			else
> > +				flush |= spte_write_protect(sptep, false);
> >  			continue;
> >  		}
> >
> > @@ -6114,7 +6117,6 @@ void kvm_mmu_write_protect_all_pages(struct
> kvm
> > *kvm, bool write_protect)  {
> >  	u64 wp_all_indicator, kvm_wp_all_gen;
> >
> > -	mutex_lock(&kvm->slots_lock);
> >  	wp_all_indicator = get_write_protect_all_indicator(kvm);
> >  	kvm_wp_all_gen = get_write_protect_all_gen(wp_all_indicator);
> >
> > @@ -6134,8 +6136,8 @@ void kvm_mmu_write_protect_all_pages(struct
> kvm *kvm, bool write_protect)
> >  	 */
> >  	if (write_protect)
> >  		kvm_reload_remote_mmus(kvm);
> > -	mutex_unlock(&kvm->slots_lock);
> 
> Why is the lock removed?  And why was it added in the first place?
> 
The original purpose of fast write protect was to implement a lock-free
get_dirty_log; kvm_mmu_write_protect_all_pages was a stand-alone KVM API. See
https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg04370.html
That series has 7 patches in total, and I only need the first 3 to get the
step-by-step page table traversal. In order to keep xiaoguangrong's patches
intact, I did not modify them directly.
> >  }
> > +EXPORT_SYMBOL_GPL(kvm_mmu_write_protect_all_pages);
> >
> >  static unsigned long
> >  mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index
> > f6915f1..5236a07 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7180,8 +7180,7 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu,
> > int cpu)  static void vmx_slot_enable_log_dirty(struct kvm *kvm,
> >  				     struct kvm_memory_slot *slot)  {
> > -	kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
> > -	kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
> > +	kvm_mmu_write_protect_all_pages(kvm, true);
> 
> What's the purpose of having @write_protect if
> kvm_mmu_write_protect_all_pages() is only ever called to enable
> protection?  If there's no known scenario where write protection is
> explicitly disabled then there's no need for WP_ALL_ENABLE_MASK, i.e. a
> non-zero generation would indicate write protection is enabled.  That'd
> simplify the code and clean up the atomic usage.
> 
In live migration, the large page split depends on the creation of
memslot->dirty_bitmap in __kvm_set_memory_region().
The interface between qemu and kvm enables the dirty log slot by slot, so to
enable dirty page tracking for the entire VM it is necessary to call
kvm_mmu_write_protect_all_pages multiple times. The atomic usage lets the page
table update requests from those calls be merged. This method is not elegant,
but it works.
Creating all slots' dirty_bitmap in one API, so that
kvm_mmu_write_protect_all_pages only has to be called once, would need more
implementation changes, even in qemu.
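
A rough sketch of how the merging works (not code from the series; wp_all_gen
and the per-root cached generation are hypothetical names): every per-slot
enable call only bumps one shared counter, and the expensive write protection
is redone lazily, once per root, the next time that root is loaded, so N
back-to-back calls still cost a single pass:

static bool root_needs_write_protect(struct kvm *kvm,
				     struct kvm_mmu_page *root)
{
	u64 cur = atomic64_read(&kvm->arch.wp_all_gen);

	if (root->wp_all_gen == cur)
		return false;		/* request already covered by an earlier call */

	root->wp_all_gen = cur;		/* repeated bumps merge into one update */
	return true;			/* caller write-protects this root once */
}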

Best regards,
-Zhuang Yanying
> >  }
> >
> >  static void vmx_slot_disable_log_dirty(struct kvm *kvm,
> > --
> > 1.8.3.1
> >
> >

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] KVM: MMU: fast cleanup D bit based on fast write protect
  2019-01-21  6:37       ` [Qemu-devel] " Zhuangyanying
@ 2019-01-22 15:17         ` Sean Christopherson
  -1 siblings, 0 replies; 24+ messages in thread
From: Sean Christopherson @ 2019-01-22 15:17 UTC (permalink / raw)
  To: Zhuangyanying
  Cc: Liujinsong (Paul), xiaoguangrong.eric, kvm, wangxin (U),
	xiaoguangrong, qemu-devel, Gonglei (Arei), Zhoujian (jay),
	pbonzini

On Mon, Jan 21, 2019 at 06:37:36AM +0000, Zhuangyanying wrote:
> 
> > >  	u64 wp_all_indicator, kvm_wp_all_gen;
> > >
> > > -	mutex_lock(&kvm->slots_lock);
> > >  	wp_all_indicator = get_write_protect_all_indicator(kvm);
> > >  	kvm_wp_all_gen = get_write_protect_all_gen(wp_all_indicator);
> > >
> > > @@ -6134,8 +6136,8 @@ void kvm_mmu_write_protect_all_pages(struct
> > kvm *kvm, bool write_protect)
> > >  	 */
> > >  	if (write_protect)
> > >  		kvm_reload_remote_mmus(kvm);
> > > -	mutex_unlock(&kvm->slots_lock);
> > 
> > Why is the lock removed?  And why was it added in the first place?
> > 
> The original purpose of fast write protect was to implement a lock-free
> get_dirty_log; kvm_mmu_write_protect_all_pages was a stand-alone KVM API. See
> https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg04370.html
> That series has 7 patches in total, and I only need the first 3 to get the
> step-by-step page table traversal. In order to keep xiaoguangrong's patches
> intact, I did not modify them directly.

That's not a sufficient argument for adding locking and removing it one
patch later.

> > >  }
> > > +EXPORT_SYMBOL_GPL(kvm_mmu_write_protect_all_pages);
> > >
> > >  static unsigned long
> > >  mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index
> > > f6915f1..5236a07 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -7180,8 +7180,7 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu,
> > > int cpu)  static void vmx_slot_enable_log_dirty(struct kvm *kvm,
> > >  				     struct kvm_memory_slot *slot)  {
> > > -	kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
> > > -	kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
> > > +	kvm_mmu_write_protect_all_pages(kvm, true);
> > 
> > What's the purpose of having @write_protect if
> > kvm_mmu_write_protect_all_pages() is only ever called to enable
> > protection?  If there's no known scenario where write protection is
> > explicitly disabled then there's no need for WP_ALL_ENABLE_MASK, i.e. a
> > non-zero generation would indicate write protection is enabled.  That'd
> > simplify the code and clean up the atomic usage.
> > 
> In live migration, the large page split depends on the creation of
> memslot->dirty_bitmap in __kvm_set_memory_region().
> The interface between qemu and kvm enables the dirty log slot by slot, so to
> enable dirty page tracking for the entire VM it is necessary to call
> kvm_mmu_write_protect_all_pages multiple times. The atomic usage lets the
> page table update requests from those calls be merged. This method is not
> elegant, but it works.
> Creating all slots' dirty_bitmap in one API, so that
> kvm_mmu_write_protect_all_pages only has to be called once, would need more
> implementation changes, even in qemu.

Calling kvm_mmu_write_protect_all_pages() multiple times is fine.  My
question was regarding the 'write_protect' parameter.  If all callers
always pass %true for 'write_protect' then why does the parameter exist?
And eliminating the parameter means you don't need an 'enable' flag
buried in the generation, which would simplify the implementation.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] KVM: MMU: fast cleanup D bit based on fast write protect
  2019-01-22 15:17         ` [Qemu-devel] " Sean Christopherson
@ 2019-01-23 18:38           ` Zhuangyanying
  -1 siblings, 0 replies; 24+ messages in thread
From: Zhuangyanying @ 2019-01-23 18:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Liujinsong (Paul), xiaoguangrong.eric, kvm, wangxin (U),
	xiaoguangrong, qemu-devel, Gonglei (Arei), Zhoujian (jay),
	pbonzini



> -----Original Message-----
> From: Sean Christopherson [mailto:sean.j.christopherson@intel.com]
> Sent: Tuesday, January 22, 2019 11:17 PM
> To: Zhuangyanying <ann.zhuangyanying@huawei.com>
> Cc: xiaoguangrong@tencent.com; pbonzini@redhat.com; Gonglei (Arei)
> <arei.gonglei@huawei.com>; qemu-devel@nongnu.org; kvm@vger.kernel.org;
> wangxin (U) <wangxinxin.wang@huawei.com>; Liujinsong (Paul)
> <liu.jinsong@huawei.com>; Zhoujian (jay) <jianjay.zhou@huawei.com>;
> xiaoguangrong.eric@gmail.com
> Subject: Re: [PATCH 4/4] KVM: MMU: fast cleanup D bit based on fast write
> protect
> 
> On Mon, Jan 21, 2019 at 06:37:36AM +0000, Zhuangyanying wrote:
> >
> > > >  	u64 wp_all_indicator, kvm_wp_all_gen;
> > > >
> > > > -	mutex_lock(&kvm->slots_lock);
> > > >  	wp_all_indicator = get_write_protect_all_indicator(kvm);
> > > >  	kvm_wp_all_gen = get_write_protect_all_gen(wp_all_indicator);
> > > >
> > > > @@ -6134,8 +6136,8 @@ void
> kvm_mmu_write_protect_all_pages(struct
> > > kvm *kvm, bool write_protect)
> > > >  	 */
> > > >  	if (write_protect)
> > > >  		kvm_reload_remote_mmus(kvm);
> > > > -	mutex_unlock(&kvm->slots_lock);
> > >
> > > Why is the lock removed?  And why was it added in the first place?
> > >
> > The original purpose of fast write protect was to implement a lock-free
> > get_dirty_log; kvm_mmu_write_protect_all_pages was a stand-alone KVM
> > API. See
> > https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg04370.html
> > That series has 7 patches in total, and I only need the first 3 to get
> > the step-by-step page table traversal. In order to keep xiaoguangrong's
> > patches intact, I did not modify them directly.
> 
> That's not a sufficient argument for adding locking and removing it one patch
> later.
> 
> > > >  }
> > > > +EXPORT_SYMBOL_GPL(kvm_mmu_write_protect_all_pages);
> > > >
> > > >  static unsigned long
> > > >  mmu_shrink_scan(struct shrinker *shrink, struct shrink_control
> > > > *sc) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > > index
> > > > f6915f1..5236a07 100644
> > > > --- a/arch/x86/kvm/vmx/vmx.c
> > > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > > @@ -7180,8 +7180,7 @@ static void vmx_sched_in(struct kvm_vcpu
> > > > *vcpu, int cpu)  static void vmx_slot_enable_log_dirty(struct kvm *kvm,
> > > >  				     struct kvm_memory_slot *slot)  {
> > > > -	kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
> > > > -	kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
> > > > +	kvm_mmu_write_protect_all_pages(kvm, true);
> > >
> > > What's the purpose of having @write_protect if
> > > kvm_mmu_write_protect_all_pages() is only ever called to enable
> > > protection?  If there's no known scenario where write protection is
> > > explicitly disabled then there's no need for WP_ALL_ENABLE_MASK,
> > > i.e. a non-zero generation would indicate write protection is
> > > enabled.  That'd simplify the code and clean up the atomic usage.
> > >
> > In live migration, the large page split depends on the creation of
> > memslot->dirty_bitmap in __kvm_set_memory_region().
> > The interface between qemu and kvm enables the dirty log slot by slot,
> > so to enable dirty page tracking for the entire VM it is necessary to
> > call kvm_mmu_write_protect_all_pages multiple times. The atomic usage
> > lets the page table update requests from those calls be merged. This
> > method is not elegant, but it works.
> > Creating all slots' dirty_bitmap in one API, so that
> > kvm_mmu_write_protect_all_pages only has to be called once, would need
> > more implementation changes, even in qemu.
> 
> Calling kvm_mmu_write_protect_all_pages() multiple times is fine.  My
> question was regarding the 'write_protect' parameter.  If all callers always
> pass %true for 'write_protect' then why does the parameter exist?
> And eliminating the parameter means you don't need an 'enable' flag buried in
> the generation, which would simplify the implementation.

In fact, when cancelling migration for a 2T VM, memory_global_dirty_log_stop()
holds the BQL for 12s: vmx_slot_disable_log_dirty() takes 4s and
kvm_mmu_zap_collapsible_sptes() takes 2s, doubled because SMM is enabled by
default. What about:

static void vmx_slot_disable_log_dirty(struct kvm *kvm,
				       struct kvm_memory_slot *slot)
{
-	kvm_mmu_slot_set_dirty(kvm, slot);
+	kvm_mmu_write_protect_all_pages(kvm, false);
}

Sorry, this patch is not complete. I will send patch v2 soon.

Best regards,
-Zhuang Yanying

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2019-01-23 18:39 UTC | newest]

Thread overview: 24+ messages
2019-01-17 13:55 [PATCH 0/4] KVM: MMU: fast cleanup D bit based on fast write protect Zhuangyanying
2019-01-17 13:55 ` [Qemu-devel] " Zhuangyanying
2019-01-17 13:55 ` [PATCH 1/4] KVM: MMU: correct the behavior of mmu_spte_update_no_track Zhuangyanying
2019-01-17 13:55   ` [Qemu-devel] " Zhuangyanying
2019-01-17 15:44   ` Sean Christopherson
2019-01-17 15:44     ` [Qemu-devel] " Sean Christopherson
2019-01-17 13:55 ` [PATCH 2/4] KVM: MMU: introduce possible_writable_spte_bitmap Zhuangyanying
2019-01-17 13:55   ` [Qemu-devel] " Zhuangyanying
2019-01-17 15:55   ` Sean Christopherson
2019-01-17 15:55     ` [Qemu-devel] " Sean Christopherson
2019-01-17 13:55 ` [PATCH 3/4] KVM: MMU: introduce kvm_mmu_write_protect_all_pages Zhuangyanying
2019-01-17 13:55   ` [Qemu-devel] " Zhuangyanying
2019-01-17 16:09   ` Sean Christopherson
2019-01-17 16:09     ` [Qemu-devel] " Sean Christopherson
2019-01-17 13:55 ` [PATCH 4/4] KVM: MMU: fast cleanup D bit based on fast write protect Zhuangyanying
2019-01-17 13:55   ` [Qemu-devel] " Zhuangyanying
2019-01-17 16:32   ` Sean Christopherson
2019-01-17 16:32     ` [Qemu-devel] " Sean Christopherson
2019-01-21  6:37     ` Zhuangyanying
2019-01-21  6:37       ` [Qemu-devel] " Zhuangyanying
2019-01-22 15:17       ` Sean Christopherson
2019-01-22 15:17         ` [Qemu-devel] " Sean Christopherson
2019-01-23 18:38         ` Zhuangyanying
2019-01-23 18:38           ` [Qemu-devel] " Zhuangyanying
