* [RFC PATCH v2 0/2] KVM: x86: Enable dirty logging lazily for huge pages
@ 2021-04-16  8:25 Keqian Zhu
  2021-04-16  8:25 ` [RFC PATCH v2 1/2] KVM: x86: Support write protect gfn with min_level Keqian Zhu
  2021-04-16  8:25 ` [RFC PATCH v2 2/2] KVM: x86: Not wr-protect huge page with init_all_set dirty log Keqian Zhu
  0 siblings, 2 replies; 11+ messages in thread
From: Keqian Zhu @ 2021-04-16  8:25 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson, Ben Gardon
  Cc: wanghaibin.wang

Currently, when we start dirty logging with init-all-set, we write
protect huge pages and leave normal pages untouched, so that dirty
logging can be enabled lazily for those normal pages.

Enabling dirty logging lazily for huge pages is feasible too, which
not only reduces the time needed to start dirty logging, but also
greatly reduces the side effect on the guest when the dirty rate is
high.

Thanks,
Keqian

Keqian Zhu (2):
  KVM: x86: Support write protect gfn with min_level
  KVM: x86: Not wr-protect huge page with init_all_set dirty log

 arch/x86/kvm/mmu/mmu.c          | 57 ++++++++++++++++++++++++++++-----
 arch/x86/kvm/mmu/mmu_internal.h |  3 +-
 arch/x86/kvm/mmu/page_track.c   |  2 +-
 arch/x86/kvm/mmu/tdp_mmu.c      | 16 ++++++---
 arch/x86/kvm/mmu/tdp_mmu.h      |  3 +-
 arch/x86/kvm/x86.c              | 37 ++++++---------------
 6 files changed, 76 insertions(+), 42 deletions(-)

-- 
2.23.0



* [RFC PATCH v2 1/2] KVM: x86: Support write protect gfn with min_level
  2021-04-16  8:25 [RFC PATCH v2 0/2] KVM: x86: Enable dirty logging lazily for huge pages Keqian Zhu
@ 2021-04-16  8:25 ` Keqian Zhu
  2021-04-16  8:25 ` [RFC PATCH v2 2/2] KVM: x86: Not wr-protect huge page with init_all_set dirty log Keqian Zhu
  1 sibling, 0 replies; 11+ messages in thread
From: Keqian Zhu @ 2021-04-16  8:25 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson, Ben Gardon
  Cc: wanghaibin.wang

Under some circumstances, we only need to write protect the large page
mappings of a gfn. This prepares for lazily write protecting large
pages during dirty log tracking.

No functional or performance change expected.
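
For example (a sketch based on the callers touched by this series),
existing callers keep today's behaviour by passing PG_LEVEL_4K, while a
later patch can restrict the walk to the huge page levels:

	/* Unchanged behaviour: consider all levels starting at 4K. */
	kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K);

	/* Planned use: only write protect the 2M (and larger) mappings. */
	kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_2M);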

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 arch/x86/kvm/mmu/mmu.c          |  9 +++++----
 arch/x86/kvm/mmu/mmu_internal.h |  3 ++-
 arch/x86/kvm/mmu/page_track.c   |  2 +-
 arch/x86/kvm/mmu/tdp_mmu.c      | 16 ++++++++++++----
 arch/x86/kvm/mmu/tdp_mmu.h      |  3 ++-
 5 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 486aa94ecf1d..2ce5bc2ea46d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1265,20 +1265,21 @@ int kvm_cpu_dirty_log_size(void)
 }
 
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
-				    struct kvm_memory_slot *slot, u64 gfn)
+				    struct kvm_memory_slot *slot, u64 gfn,
+				    int min_level)
 {
 	struct kvm_rmap_head *rmap_head;
 	int i;
 	bool write_protected = false;
 
-	for (i = PG_LEVEL_4K; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
+	for (i = min_level; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
 		rmap_head = __gfn_to_rmap(gfn, i, slot);
 		write_protected |= __rmap_write_protect(kvm, rmap_head, true);
 	}
 
 	if (is_tdp_mmu_enabled(kvm))
 		write_protected |=
-			kvm_tdp_mmu_write_protect_gfn(kvm, slot, gfn);
+			kvm_tdp_mmu_write_protect_gfn(kvm, slot, gfn, min_level);
 
 	return write_protected;
 }
@@ -1288,7 +1289,7 @@ static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
 	struct kvm_memory_slot *slot;
 
 	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-	return kvm_mmu_slot_gfn_write_protect(vcpu->kvm, slot, gfn);
+	return kvm_mmu_slot_gfn_write_protect(vcpu->kvm, slot, gfn, PG_LEVEL_4K);
 }
 
 static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1f6f98c76bdf..4c7c42bb8cf8 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -104,7 +104,8 @@ bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
-				    struct kvm_memory_slot *slot, u64 gfn);
+				    struct kvm_memory_slot *slot, u64 gfn,
+				    int min_level);
 void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
 					u64 start_gfn, u64 pages);
 
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 34bb0ec69bd8..91a9f7e0fd91 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -100,7 +100,7 @@ void kvm_slot_page_track_add_page(struct kvm *kvm,
 	kvm_mmu_gfn_disallow_lpage(slot, gfn);
 
 	if (mode == KVM_PAGE_TRACK_WRITE)
-		if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn))
+		if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
 			kvm_flush_remote_tlbs(kvm);
 }
 EXPORT_SYMBOL_GPL(kvm_slot_page_track_add_page);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 018d82e73e31..6cf0284e2e6a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1338,15 +1338,22 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
  * Returns true if an SPTE was set and a TLB flush is needed.
  */
 static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
-			      gfn_t gfn)
+			      gfn_t gfn, int min_level)
 {
 	struct tdp_iter iter;
 	u64 new_spte;
 	bool spte_set = false;
 
+	BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
+
 	rcu_read_lock();
 
-	tdp_root_for_each_leaf_pte(iter, root, gfn, gfn + 1) {
+	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
+				   min_level, gfn, gfn + 1) {
+		if (!is_shadow_present_pte(iter.old_spte) ||
+		    !is_last_spte(iter.old_spte, iter.level))
+			continue;
+
 		if (!is_writable_pte(iter.old_spte))
 			break;
 
@@ -1368,7 +1375,8 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
  * Returns true if an SPTE was set and a TLB flush is needed.
  */
 bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
-				   struct kvm_memory_slot *slot, gfn_t gfn)
+				   struct kvm_memory_slot *slot, gfn_t gfn,
+				   int min_level)
 {
 	struct kvm_mmu_page *root;
 	int root_as_id;
@@ -1380,7 +1388,7 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 		if (root_as_id != slot->as_id)
 			continue;
 
-		spte_set |= write_protect_gfn(kvm, root, gfn);
+		spte_set |= write_protect_gfn(kvm, root, gfn, min_level);
 	}
 	return spte_set;
 }
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 31096ece9b14..cea787469016 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -59,7 +59,8 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				       struct kvm_memory_slot *slot);
 
 bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
-				   struct kvm_memory_slot *slot, gfn_t gfn);
+				   struct kvm_memory_slot *slot, gfn_t gfn,
+				   int min_level);
 
 int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 			 int *root_level);
-- 
2.23.0



* [RFC PATCH v2 2/2] KVM: x86: Not wr-protect huge page with init_all_set dirty log
  2021-04-16  8:25 [RFC PATCH v2 0/2] KVM: x86: Enable dirty logging lazily for huge pages Keqian Zhu
  2021-04-16  8:25 ` [RFC PATCH v2 1/2] KVM: x86: Support write protect gfn with min_level Keqian Zhu
@ 2021-04-16  8:25 ` Keqian Zhu
  2021-04-19 19:20   ` Ben Gardon
  1 sibling, 1 reply; 11+ messages in thread
From: Keqian Zhu @ 2021-04-16  8:25 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson, Ben Gardon
  Cc: wanghaibin.wang

Currently, when we start dirty logging with init-all-set, we write
protect huge pages and leave normal pages untouched, so that dirty
logging can be enabled lazily for those normal pages.

Enabling dirty logging lazily for huge pages is feasible too, which
not only reduces the time needed to start dirty logging, but also
greatly reduces the side effect on the guest when the dirty rate is
high.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 arch/x86/kvm/mmu/mmu.c | 48 ++++++++++++++++++++++++++++++++++++++----
 arch/x86/kvm/x86.c     | 37 +++++++++-----------------------
 2 files changed, 54 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2ce5bc2ea46d..98fa25172b9a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1188,8 +1188,7 @@ static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
  * @gfn_offset: start of the BITS_PER_LONG pages we care about
  * @mask: indicates which pages we should protect
  *
- * Used when we do not need to care about huge page mappings: e.g. during dirty
- * logging we do not have any such mappings.
+ * Used when we do not need to care about huge page mappings.
  */
 static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 				     struct kvm_memory_slot *slot,
@@ -1246,13 +1245,54 @@ static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
  * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
  * enable dirty logging for them.
  *
- * Used when we do not need to care about huge page mappings: e.g. during dirty
- * logging we do not have any such mappings.
+ * We need to care about huge page mappings: e.g. during dirty logging we may
+ * have any such mappings.
  */
 void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 				struct kvm_memory_slot *slot,
 				gfn_t gfn_offset, unsigned long mask)
 {
+	gfn_t start, end;
+
+	/*
+	 * Huge pages are NOT write protected when we start dirty log with
+	 * init-all-set, so we must write protect them at here.
+	 *
+	 * The gfn_offset is guaranteed to be aligned to 64, but the base_gfn
+	 * of memslot has no such restriction, so the range can cross two large
+	 * pages.
+	 */
+	if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
+		start = slot->base_gfn + gfn_offset + __ffs(mask);
+		end = slot->base_gfn + gfn_offset + __fls(mask);
+		kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
+
+		/* Cross two large pages? */
+		if (ALIGN(start << PAGE_SHIFT, PMD_SIZE) !=
+		    ALIGN(end << PAGE_SHIFT, PMD_SIZE))
+			kvm_mmu_slot_gfn_write_protect(kvm, slot, end,
+						       PG_LEVEL_2M);
+	}
+
+	/*
+	 * RFC:
+	 *
+	 * 1. I don't return early when kvm_mmu_slot_gfn_write_protect() returns
+	 * true, because I am not very clear about the relationship between
+	 * legacy mmu and tdp mmu. AFAICS, the code logic is NOT an if/else
+	 * manner.
+	 *
+	 * The kvm_mmu_slot_gfn_write_protect() returns true when we hit a
+	 * writable large page mapping in legacy mmu mapping or tdp mmu mapping.
+	 * Do we still have normal mapping in that case? (e.g. We have large
+	 * mapping in legacy mmu and normal mapping in tdp mmu).
+	 *
+	 * 2. kvm_mmu_slot_gfn_write_protect() doesn't tell us whether the large
+	 * page mapping exist. If it exists but is clean, we can return early.
+	 * However, we have to do invasive change.
+	 */
+
+	/* Then we can handle the PT level pages */
 	if (kvm_x86_ops.cpu_dirty_log_size)
 		kvm_mmu_clear_dirty_pt_masked(kvm, slot, gfn_offset, mask);
 	else
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eca63625aee4..dfd676ffa7da 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10888,36 +10888,19 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 		 */
 		kvm_mmu_zap_collapsible_sptes(kvm, new);
 	} else {
-		/* By default, write-protect everything to log writes. */
-		int level = PG_LEVEL_4K;
+		/*
+		 * If we're with initial-all-set, we don't need to write protect
+		 * any page because they're reported as dirty already.
+		 */
+		if (kvm_dirty_log_manual_protect_and_init_set(kvm))
+			return;
 
 		if (kvm_x86_ops.cpu_dirty_log_size) {
-			/*
-			 * Clear all dirty bits, unless pages are treated as
-			 * dirty from the get-go.
-			 */
-			if (!kvm_dirty_log_manual_protect_and_init_set(kvm))
-				kvm_mmu_slot_leaf_clear_dirty(kvm, new);
-
-			/*
-			 * Write-protect large pages on write so that dirty
-			 * logging happens at 4k granularity.  No need to
-			 * write-protect small SPTEs since write accesses are
-			 * logged by the CPU via dirty bits.
-			 */
-			level = PG_LEVEL_2M;
-		} else if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
-			/*
-			 * If we're with initial-all-set, we don't need
-			 * to write protect any small page because
-			 * they're reported as dirty already.  However
-			 * we still need to write-protect huge pages
-			 * so that the page split can happen lazily on
-			 * the first write to the huge page.
-			 */
-			level = PG_LEVEL_2M;
+			kvm_mmu_slot_leaf_clear_dirty(kvm, new);
+			kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_2M);
+		} else {
+			kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_4K);
 		}
-		kvm_mmu_slot_remove_write_access(kvm, new, level);
 	}
 }
 
-- 
2.23.0



* Re: [RFC PATCH v2 2/2] KVM: x86: Not wr-protect huge page with init_all_set dirty log
  2021-04-16  8:25 ` [RFC PATCH v2 2/2] KVM: x86: Not wr-protect huge page with init_all_set dirty log Keqian Zhu
@ 2021-04-19 19:20   ` Ben Gardon
  2021-04-20  7:49     ` Keqian Zhu
  0 siblings, 1 reply; 11+ messages in thread
From: Ben Gardon @ 2021-04-19 19:20 UTC (permalink / raw)
  To: Keqian Zhu; +Cc: LKML, kvm, Paolo Bonzini, Sean Christopherson, wanghaibin.wang

On Fri, Apr 16, 2021 at 1:25 AM Keqian Zhu <zhukeqian1@huawei.com> wrote:
>
> Currently during start dirty logging, if we're with init-all-set,
> we write protect huge pages and leave normal pages untouched, for
> that we can enable dirty logging for these pages lazily.
>
> Actually enable dirty logging lazily for huge pages is feasible
> too, which not only reduces the time of start dirty logging, also
> greatly reduces side-effect on guest when there is high dirty rate.
>
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 48 ++++++++++++++++++++++++++++++++++++++----
>  arch/x86/kvm/x86.c     | 37 +++++++++-----------------------
>  2 files changed, 54 insertions(+), 31 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2ce5bc2ea46d..98fa25172b9a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1188,8 +1188,7 @@ static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>   * @gfn_offset: start of the BITS_PER_LONG pages we care about
>   * @mask: indicates which pages we should protect
>   *
> - * Used when we do not need to care about huge page mappings: e.g. during dirty
> - * logging we do not have any such mappings.
> + * Used when we do not need to care about huge page mappings.
>   */
>  static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
>                                      struct kvm_memory_slot *slot,
> @@ -1246,13 +1245,54 @@ static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
>   * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
>   * enable dirty logging for them.
>   *
> - * Used when we do not need to care about huge page mappings: e.g. during dirty
> - * logging we do not have any such mappings.
> + * We need to care about huge page mappings: e.g. during dirty logging we may
> + * have any such mappings.
>   */
>  void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>                                 struct kvm_memory_slot *slot,
>                                 gfn_t gfn_offset, unsigned long mask)
>  {
> +       gfn_t start, end;
> +
> +       /*
> +        * Huge pages are NOT write protected when we start dirty log with
> +        * init-all-set, so we must write protect them at here.
> +        *
> +        * The gfn_offset is guaranteed to be aligned to 64, but the base_gfn
> +        * of memslot has no such restriction, so the range can cross two large
> +        * pages.
> +        */
> +       if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
> +               start = slot->base_gfn + gfn_offset + __ffs(mask);
> +               end = slot->base_gfn + gfn_offset + __fls(mask);
> +               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> +
> +               /* Cross two large pages? */
> +               if (ALIGN(start << PAGE_SHIFT, PMD_SIZE) !=
> +                   ALIGN(end << PAGE_SHIFT, PMD_SIZE))
> +                       kvm_mmu_slot_gfn_write_protect(kvm, slot, end,
> +                                                      PG_LEVEL_2M);
> +       }
> +
> +       /*
> +        * RFC:
> +        *
> +        * 1. I don't return early when kvm_mmu_slot_gfn_write_protect() returns
> +        * true, because I am not very clear about the relationship between
> +        * legacy mmu and tdp mmu. AFAICS, the code logic is NOT an if/else
> +        * manner.
> +        *
> +        * The kvm_mmu_slot_gfn_write_protect() returns true when we hit a
> +        * writable large page mapping in legacy mmu mapping or tdp mmu mapping.
> +        * Do we still have normal mapping in that case? (e.g. We have large
> +        * mapping in legacy mmu and normal mapping in tdp mmu).

Right, we can't return early because the two MMUs could map the page
in different ways, but each MMU could also map the page in multiple
ways independently.
For example, if the legacy MMU was being used and we were running a
nested VM, a page could be mapped 2M in EPT01 and 4K in EPT02, so we'd
still need kvm_mmu_slot_gfn_write_protect calls for both levels.
I don't think there's a case where we can return early here based on
the information that the first calls to kvm_mmu_slot_gfn_write_protect
provide.

> +        *
> +        * 2. kvm_mmu_slot_gfn_write_protect() doesn't tell us whether the large
> +        * page mapping exist. If it exists but is clean, we can return early.
> +        * However, we have to do invasive change.

What do you mean by invasive change?

> +        */
> +
> +       /* Then we can handle the PT level pages */
>         if (kvm_x86_ops.cpu_dirty_log_size)
>                 kvm_mmu_clear_dirty_pt_masked(kvm, slot, gfn_offset, mask);
>         else
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index eca63625aee4..dfd676ffa7da 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10888,36 +10888,19 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
>                  */
>                 kvm_mmu_zap_collapsible_sptes(kvm, new);
>         } else {
> -               /* By default, write-protect everything to log writes. */
> -               int level = PG_LEVEL_4K;
> +               /*
> +                * If we're with initial-all-set, we don't need to write protect
> +                * any page because they're reported as dirty already.
> +                */
> +               if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> +                       return;
>
>                 if (kvm_x86_ops.cpu_dirty_log_size) {
> -                       /*
> -                        * Clear all dirty bits, unless pages are treated as
> -                        * dirty from the get-go.
> -                        */
> -                       if (!kvm_dirty_log_manual_protect_and_init_set(kvm))
> -                               kvm_mmu_slot_leaf_clear_dirty(kvm, new);
> -
> -                       /*
> -                        * Write-protect large pages on write so that dirty
> -                        * logging happens at 4k granularity.  No need to
> -                        * write-protect small SPTEs since write accesses are
> -                        * logged by the CPU via dirty bits.
> -                        */
> -                       level = PG_LEVEL_2M;
> -               } else if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
> -                       /*
> -                        * If we're with initial-all-set, we don't need
> -                        * to write protect any small page because
> -                        * they're reported as dirty already.  However
> -                        * we still need to write-protect huge pages
> -                        * so that the page split can happen lazily on
> -                        * the first write to the huge page.
> -                        */
> -                       level = PG_LEVEL_2M;
> +                       kvm_mmu_slot_leaf_clear_dirty(kvm, new);
> +                       kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_2M);
> +               } else {
> +                       kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_4K);
>                 }
> -               kvm_mmu_slot_remove_write_access(kvm, new, level);
>         }
>  }
>
> --
> 2.23.0
>


* Re: [RFC PATCH v2 2/2] KVM: x86: Not wr-protect huge page with init_all_set dirty log
  2021-04-19 19:20   ` Ben Gardon
@ 2021-04-20  7:49     ` Keqian Zhu
  2021-04-20 16:30       ` Ben Gardon
  0 siblings, 1 reply; 11+ messages in thread
From: Keqian Zhu @ 2021-04-20  7:49 UTC (permalink / raw)
  To: Ben Gardon; +Cc: LKML, kvm, Paolo Bonzini, Sean Christopherson, wanghaibin.wang

Hi Ben,

On 2021/4/20 3:20, Ben Gardon wrote:
> On Fri, Apr 16, 2021 at 1:25 AM Keqian Zhu <zhukeqian1@huawei.com> wrote:
>>
>> Currently during start dirty logging, if we're with init-all-set,
>> we write protect huge pages and leave normal pages untouched, for
>> that we can enable dirty logging for these pages lazily.
>>
>> Actually enable dirty logging lazily for huge pages is feasible
>> too, which not only reduces the time of start dirty logging, also
>> greatly reduces side-effect on guest when there is high dirty rate.
>>
>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
>> ---
>>  arch/x86/kvm/mmu/mmu.c | 48 ++++++++++++++++++++++++++++++++++++++----
>>  arch/x86/kvm/x86.c     | 37 +++++++++-----------------------
>>  2 files changed, 54 insertions(+), 31 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 2ce5bc2ea46d..98fa25172b9a 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -1188,8 +1188,7 @@ static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>>   * @gfn_offset: start of the BITS_PER_LONG pages we care about
>>   * @mask: indicates which pages we should protect
>>   *
>> - * Used when we do not need to care about huge page mappings: e.g. during dirty
>> - * logging we do not have any such mappings.
>> + * Used when we do not need to care about huge page mappings.
>>   */
>>  static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
>>                                      struct kvm_memory_slot *slot,
>> @@ -1246,13 +1245,54 @@ static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
>>   * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
>>   * enable dirty logging for them.
>>   *
>> - * Used when we do not need to care about huge page mappings: e.g. during dirty
>> - * logging we do not have any such mappings.
>> + * We need to care about huge page mappings: e.g. during dirty logging we may
>> + * have any such mappings.
>>   */
>>  void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>>                                 struct kvm_memory_slot *slot,
>>                                 gfn_t gfn_offset, unsigned long mask)
>>  {
>> +       gfn_t start, end;
>> +
>> +       /*
>> +        * Huge pages are NOT write protected when we start dirty log with
>> +        * init-all-set, so we must write protect them at here.
>> +        *
>> +        * The gfn_offset is guaranteed to be aligned to 64, but the base_gfn
>> +        * of memslot has no such restriction, so the range can cross two large
>> +        * pages.
>> +        */
>> +       if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
>> +               start = slot->base_gfn + gfn_offset + __ffs(mask);
>> +               end = slot->base_gfn + gfn_offset + __fls(mask);
>> +               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
>> +
>> +               /* Cross two large pages? */
>> +               if (ALIGN(start << PAGE_SHIFT, PMD_SIZE) !=
>> +                   ALIGN(end << PAGE_SHIFT, PMD_SIZE))
>> +                       kvm_mmu_slot_gfn_write_protect(kvm, slot, end,
>> +                                                      PG_LEVEL_2M);
>> +       }
>> +
>> +       /*
>> +        * RFC:
>> +        *
>> +        * 1. I don't return early when kvm_mmu_slot_gfn_write_protect() returns
>> +        * true, because I am not very clear about the relationship between
>> +        * legacy mmu and tdp mmu. AFAICS, the code logic is NOT an if/else
>> +        * manner.
>> +        *
>> +        * The kvm_mmu_slot_gfn_write_protect() returns true when we hit a
>> +        * writable large page mapping in legacy mmu mapping or tdp mmu mapping.
>> +        * Do we still have normal mapping in that case? (e.g. We have large
>> +        * mapping in legacy mmu and normal mapping in tdp mmu).
> 
> Right, we can't return early because the two MMUs could map the page
> in different ways, but each MMU could also map the page in multiple
> ways independently.
> For example, if the legacy MMU was being used and we were running a
> nested VM, a page could be mapped 2M in EPT01 and 4K in EPT02, so we'd
> still need kvm_mmu_slot_gfn_write_protect  calls for both levels.
> I don't think there's a case where we can return early here with the
> information that the first calls to kvm_mmu_slot_gfn_write_protect
> access.
Thanks for the detailed explanation.

> 
>> +        *
>> +        * 2. kvm_mmu_slot_gfn_write_protect() doesn't tell us whether the large
>> +        * page mapping exist. If it exists but is clean, we can return early.
>> +        * However, we have to do invasive change.
> 
> What do you mean by invasive change?
We need kvm_mmu_slot_gfn_write_protect() to report whether all mappings are large
and clean, so that we can return early. However, that is not part of the semantics
of this function.

If this is the final code, then compared to the old code we have an extra
gfn_write_protect(); I don't know whether that's acceptable.
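
For illustration, such a change might look roughly like the following
(a sketch only, not part of this series; the out parameter name is made up):

	/*
	 * Hypothetical sketch: additionally report whether every mapping
	 * found for this gfn is a huge mapping that is already clean, so
	 * the caller could return early.
	 */
	bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
					    struct kvm_memory_slot *slot, u64 gfn,
					    int min_level,
					    bool *only_huge_and_clean);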

Thanks,
Keqian


> 
>> +        */
>> +
>> +       /* Then we can handle the PT level pages */
>>         if (kvm_x86_ops.cpu_dirty_log_size)
>>                 kvm_mmu_clear_dirty_pt_masked(kvm, slot, gfn_offset, mask);
>>         else
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index eca63625aee4..dfd676ffa7da 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -10888,36 +10888,19 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
>>                  */
>>                 kvm_mmu_zap_collapsible_sptes(kvm, new);
>>         } else {
>> -               /* By default, write-protect everything to log writes. */
>> -               int level = PG_LEVEL_4K;
>> +               /*
>> +                * If we're with initial-all-set, we don't need to write protect
>> +                * any page because they're reported as dirty already.
>> +                */
>> +               if (kvm_dirty_log_manual_protect_and_init_set(kvm))
>> +                       return;
>>
>>                 if (kvm_x86_ops.cpu_dirty_log_size) {
>> -                       /*
>> -                        * Clear all dirty bits, unless pages are treated as
>> -                        * dirty from the get-go.
>> -                        */
>> -                       if (!kvm_dirty_log_manual_protect_and_init_set(kvm))
>> -                               kvm_mmu_slot_leaf_clear_dirty(kvm, new);
>> -
>> -                       /*
>> -                        * Write-protect large pages on write so that dirty
>> -                        * logging happens at 4k granularity.  No need to
>> -                        * write-protect small SPTEs since write accesses are
>> -                        * logged by the CPU via dirty bits.
>> -                        */
>> -                       level = PG_LEVEL_2M;
>> -               } else if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
>> -                       /*
>> -                        * If we're with initial-all-set, we don't need
>> -                        * to write protect any small page because
>> -                        * they're reported as dirty already.  However
>> -                        * we still need to write-protect huge pages
>> -                        * so that the page split can happen lazily on
>> -                        * the first write to the huge page.
>> -                        */
>> -                       level = PG_LEVEL_2M;
>> +                       kvm_mmu_slot_leaf_clear_dirty(kvm, new);
>> +                       kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_2M);
>> +               } else {
>> +                       kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_4K);
>>                 }
>> -               kvm_mmu_slot_remove_write_access(kvm, new, level);
>>         }
>>  }
>>
>> --
>> 2.23.0
>>
> .
> 


* Re: [RFC PATCH v2 2/2] KVM: x86: Not wr-protect huge page with init_all_set dirty log
  2021-04-20  7:49     ` Keqian Zhu
@ 2021-04-20 16:30       ` Ben Gardon
  2021-04-27  5:03         ` Keqian Zhu
  0 siblings, 1 reply; 11+ messages in thread
From: Ben Gardon @ 2021-04-20 16:30 UTC (permalink / raw)
  To: Keqian Zhu; +Cc: LKML, kvm, Paolo Bonzini, Sean Christopherson, wanghaibin.wang

On Tue, Apr 20, 2021 at 12:49 AM Keqian Zhu <zhukeqian1@huawei.com> wrote:
>
> Hi Ben,
>
> On 2021/4/20 3:20, Ben Gardon wrote:
> > On Fri, Apr 16, 2021 at 1:25 AM Keqian Zhu <zhukeqian1@huawei.com> wrote:
> >>
> >> Currently during start dirty logging, if we're with init-all-set,
> >> we write protect huge pages and leave normal pages untouched, for
> >> that we can enable dirty logging for these pages lazily.
> >>
> >> Actually enable dirty logging lazily for huge pages is feasible
> >> too, which not only reduces the time of start dirty logging, also
> >> greatly reduces side-effect on guest when there is high dirty rate.
> >>
> >> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> >> ---
> >>  arch/x86/kvm/mmu/mmu.c | 48 ++++++++++++++++++++++++++++++++++++++----
> >>  arch/x86/kvm/x86.c     | 37 +++++++++-----------------------
> >>  2 files changed, 54 insertions(+), 31 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >> index 2ce5bc2ea46d..98fa25172b9a 100644
> >> --- a/arch/x86/kvm/mmu/mmu.c
> >> +++ b/arch/x86/kvm/mmu/mmu.c
> >> @@ -1188,8 +1188,7 @@ static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
> >>   * @gfn_offset: start of the BITS_PER_LONG pages we care about
> >>   * @mask: indicates which pages we should protect
> >>   *
> >> - * Used when we do not need to care about huge page mappings: e.g. during dirty
> >> - * logging we do not have any such mappings.
> >> + * Used when we do not need to care about huge page mappings.
> >>   */
> >>  static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> >>                                      struct kvm_memory_slot *slot,
> >> @@ -1246,13 +1245,54 @@ static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
> >>   * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
> >>   * enable dirty logging for them.
> >>   *
> >> - * Used when we do not need to care about huge page mappings: e.g. during dirty
> >> - * logging we do not have any such mappings.
> >> + * We need to care about huge page mappings: e.g. during dirty logging we may
> >> + * have any such mappings.
> >>   */
> >>  void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> >>                                 struct kvm_memory_slot *slot,
> >>                                 gfn_t gfn_offset, unsigned long mask)
> >>  {
> >> +       gfn_t start, end;
> >> +
> >> +       /*
> >> +        * Huge pages are NOT write protected when we start dirty log with
> >> +        * init-all-set, so we must write protect them at here.
> >> +        *
> >> +        * The gfn_offset is guaranteed to be aligned to 64, but the base_gfn
> >> +        * of memslot has no such restriction, so the range can cross two large
> >> +        * pages.
> >> +        */
> >> +       if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
> >> +               start = slot->base_gfn + gfn_offset + __ffs(mask);
> >> +               end = slot->base_gfn + gfn_offset + __fls(mask);
> >> +               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> >> +
> >> +               /* Cross two large pages? */
> >> +               if (ALIGN(start << PAGE_SHIFT, PMD_SIZE) !=
> >> +                   ALIGN(end << PAGE_SHIFT, PMD_SIZE))
> >> +                       kvm_mmu_slot_gfn_write_protect(kvm, slot, end,
> >> +                                                      PG_LEVEL_2M);
> >> +       }
> >> +
> >> +       /*
> >> +        * RFC:
> >> +        *
> >> +        * 1. I don't return early when kvm_mmu_slot_gfn_write_protect() returns
> >> +        * true, because I am not very clear about the relationship between
> >> +        * legacy mmu and tdp mmu. AFAICS, the code logic is NOT an if/else
> >> +        * manner.
> >> +        *
> >> +        * The kvm_mmu_slot_gfn_write_protect() returns true when we hit a
> >> +        * writable large page mapping in legacy mmu mapping or tdp mmu mapping.
> >> +        * Do we still have normal mapping in that case? (e.g. We have large
> >> +        * mapping in legacy mmu and normal mapping in tdp mmu).
> >
> > Right, we can't return early because the two MMUs could map the page
> > in different ways, but each MMU could also map the page in multiple
> > ways independently.
> > For example, if the legacy MMU was being used and we were running a
> > nested VM, a page could be mapped 2M in EPT01 and 4K in EPT02, so we'd
> > still need kvm_mmu_slot_gfn_write_protect  calls for both levels.
> > I don't think there's a case where we can return early here with the
> > information that the first calls to kvm_mmu_slot_gfn_write_protect
> > access.
> Thanks for the detailed explanation.
>
> >
> >> +        *
> >> +        * 2. kvm_mmu_slot_gfn_write_protect() doesn't tell us whether the large
> >> +        * page mapping exist. If it exists but is clean, we can return early.
> >> +        * However, we have to do invasive change.
> >
> > What do you mean by invasive change?
> We need the kvm_mmu_slot_gfn_write_protect to report whether all mapping are large
> and clean, so we can return early. However it's not a part of semantics of this function.
>
> If this is the final code, compared to old code, we have an extra gfn_write_protect(),
> I don't whether it's acceptable?

Ah, I see. Please correct me if I'm wrong, but I think that in order
to check that the only mappings on the GFN range are large, we'd still
have to go over the rmap for the 4k mappings, at least for the legacy
MMU. In that case, we're doing about as much work as the extra
gfn_write_protect and I don't think that we'd get any efficiency gain
for the change in semantics.

Likewise for the TDP MMU, if the GFN range is mapped both large and
4k, it would have to be in different TDP structures, so the efficiency
gains would again not be very big.

I'm really just guessing about those performance characteristics
though. It would definitely help to have some performance data to back
all this up. Even just a few runs of the dirty_log_perf_test (in
selftests) could provide some interesting results, and I'd be happy to
help review any improvements you might make to that test.

Regardless, I'd be inclined to keep this change as simple as possible
for now and the early return optimization could happen in a follow-up
patch. I think the extra gfn_write_protect is acceptable, especially
if you can show that it doesn't cause a big hit in performance when
running the dirty_log_perf_test with 4k and 2m backing memory.

>
> Thanks,
> Keqian
>
>
> >
> >> +        */
> >> +
> >> +       /* Then we can handle the PT level pages */
> >>         if (kvm_x86_ops.cpu_dirty_log_size)
> >>                 kvm_mmu_clear_dirty_pt_masked(kvm, slot, gfn_offset, mask);
> >>         else
> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >> index eca63625aee4..dfd676ffa7da 100644
> >> --- a/arch/x86/kvm/x86.c
> >> +++ b/arch/x86/kvm/x86.c
> >> @@ -10888,36 +10888,19 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
> >>                  */
> >>                 kvm_mmu_zap_collapsible_sptes(kvm, new);
> >>         } else {
> >> -               /* By default, write-protect everything to log writes. */
> >> -               int level = PG_LEVEL_4K;
> >> +               /*
> >> +                * If we're with initial-all-set, we don't need to write protect
> >> +                * any page because they're reported as dirty already.
> >> +                */
> >> +               if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> >> +                       return;
> >>
> >>                 if (kvm_x86_ops.cpu_dirty_log_size) {
> >> -                       /*
> >> -                        * Clear all dirty bits, unless pages are treated as
> >> -                        * dirty from the get-go.
> >> -                        */
> >> -                       if (!kvm_dirty_log_manual_protect_and_init_set(kvm))
> >> -                               kvm_mmu_slot_leaf_clear_dirty(kvm, new);
> >> -
> >> -                       /*
> >> -                        * Write-protect large pages on write so that dirty
> >> -                        * logging happens at 4k granularity.  No need to
> >> -                        * write-protect small SPTEs since write accesses are
> >> -                        * logged by the CPU via dirty bits.
> >> -                        */
> >> -                       level = PG_LEVEL_2M;
> >> -               } else if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
> >> -                       /*
> >> -                        * If we're with initial-all-set, we don't need
> >> -                        * to write protect any small page because
> >> -                        * they're reported as dirty already.  However
> >> -                        * we still need to write-protect huge pages
> >> -                        * so that the page split can happen lazily on
> >> -                        * the first write to the huge page.
> >> -                        */
> >> -                       level = PG_LEVEL_2M;
> >> +                       kvm_mmu_slot_leaf_clear_dirty(kvm, new);
> >> +                       kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_2M);
> >> +               } else {
> >> +                       kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_4K);
> >>                 }
> >> -               kvm_mmu_slot_remove_write_access(kvm, new, level);
> >>         }
> >>  }
> >>
> >> --
> >> 2.23.0
> >>
> > .
> >


* Re: [RFC PATCH v2 2/2] KVM: x86: Not wr-protect huge page with init_all_set dirty log
  2021-04-20 16:30       ` Ben Gardon
@ 2021-04-27  5:03         ` Keqian Zhu
  2021-04-27 16:33           ` Ben Gardon
  0 siblings, 1 reply; 11+ messages in thread
From: Keqian Zhu @ 2021-04-27  5:03 UTC (permalink / raw)
  To: Ben Gardon; +Cc: LKML, kvm, Paolo Bonzini, Sean Christopherson, wanghaibin.wang

Hi Ben,

Sorry for the delayed reply!

On 2021/4/21 0:30, Ben Gardon wrote:
> On Tue, Apr 20, 2021 at 12:49 AM Keqian Zhu <zhukeqian1@huawei.com> wrote:
>>
>> Hi Ben,
>>
>> On 2021/4/20 3:20, Ben Gardon wrote:
>>> On Fri, Apr 16, 2021 at 1:25 AM Keqian Zhu <zhukeqian1@huawei.com> wrote:
>>>>
>>>> Currently during start dirty logging, if we're with init-all-set,
>>>> we write protect huge pages and leave normal pages untouched, for
>>>> that we can enable dirty logging for these pages lazily.
>>>>
>>>> Actually enable dirty logging lazily for huge pages is feasible
>>>> too, which not only reduces the time of start dirty logging, also
>>>> greatly reduces side-effect on guest when there is high dirty rate.
>>>>
>>>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
>>>> ---
>>>>  arch/x86/kvm/mmu/mmu.c | 48 ++++++++++++++++++++++++++++++++++++++----
>>>>  arch/x86/kvm/x86.c     | 37 +++++++++-----------------------
>>>>  2 files changed, 54 insertions(+), 31 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>>> index 2ce5bc2ea46d..98fa25172b9a 100644
>>>> --- a/arch/x86/kvm/mmu/mmu.c
>>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>>> @@ -1188,8 +1188,7 @@ static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>>>>   * @gfn_offset: start of the BITS_PER_LONG pages we care about
>>>>   * @mask: indicates which pages we should protect
>>>>   *
>>>> - * Used when we do not need to care about huge page mappings: e.g. during dirty
>>>> - * logging we do not have any such mappings.
>>>> + * Used when we do not need to care about huge page mappings.
>>>>   */
>>>>  static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
>>>>                                      struct kvm_memory_slot *slot,
>>>> @@ -1246,13 +1245,54 @@ static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
>>>>   * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
>>>>   * enable dirty logging for them.
>>>>   *
>>>> - * Used when we do not need to care about huge page mappings: e.g. during dirty
>>>> - * logging we do not have any such mappings.
>>>> + * We need to care about huge page mappings: e.g. during dirty logging we may
>>>> + * have any such mappings.
>>>>   */
>>>>  void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>>>>                                 struct kvm_memory_slot *slot,
>>>>                                 gfn_t gfn_offset, unsigned long mask)
>>>>  {
>>>> +       gfn_t start, end;
>>>> +
>>>> +       /*
>>>> +        * Huge pages are NOT write protected when we start dirty log with
>>>> +        * init-all-set, so we must write protect them at here.
>>>> +        *
>>>> +        * The gfn_offset is guaranteed to be aligned to 64, but the base_gfn
>>>> +        * of memslot has no such restriction, so the range can cross two large
>>>> +        * pages.
>>>> +        */
>>>> +       if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
>>>> +               start = slot->base_gfn + gfn_offset + __ffs(mask);
>>>> +               end = slot->base_gfn + gfn_offset + __fls(mask);
>>>> +               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
>>>> +
>>>> +               /* Cross two large pages? */
>>>> +               if (ALIGN(start << PAGE_SHIFT, PMD_SIZE) !=
>>>> +                   ALIGN(end << PAGE_SHIFT, PMD_SIZE))
>>>> +                       kvm_mmu_slot_gfn_write_protect(kvm, slot, end,
>>>> +                                                      PG_LEVEL_2M);
>>>> +       }
>>>> +
>>>> +       /*
>>>> +        * RFC:
>>>> +        *
>>>> +        * 1. I don't return early when kvm_mmu_slot_gfn_write_protect() returns
>>>> +        * true, because I am not very clear about the relationship between
>>>> +        * legacy mmu and tdp mmu. AFAICS, the code logic is NOT an if/else
>>>> +        * manner.
>>>> +        *
>>>> +        * The kvm_mmu_slot_gfn_write_protect() returns true when we hit a
>>>> +        * writable large page mapping in legacy mmu mapping or tdp mmu mapping.
>>>> +        * Do we still have normal mapping in that case? (e.g. We have large
>>>> +        * mapping in legacy mmu and normal mapping in tdp mmu).
>>>
>>> Right, we can't return early because the two MMUs could map the page
>>> in different ways, but each MMU could also map the page in multiple
>>> ways independently.
>>> For example, if the legacy MMU was being used and we were running a
>>> nested VM, a page could be mapped 2M in EPT01 and 4K in EPT02, so we'd
>>> still need kvm_mmu_slot_gfn_write_protect  calls for both levels.
>>> I don't think there's a case where we can return early here with the
>>> information that the first calls to kvm_mmu_slot_gfn_write_protect
>>> access.
>> Thanks for the detailed explanation.
>>
>>>
>>>> +        *
>>>> +        * 2. kvm_mmu_slot_gfn_write_protect() doesn't tell us whether the large
>>>> +        * page mapping exist. If it exists but is clean, we can return early.
>>>> +        * However, we have to do invasive change.
>>>
>>> What do you mean by invasive change?
>> We need the kvm_mmu_slot_gfn_write_protect to report whether all mapping are large
>> and clean, so we can return early. However it's not a part of semantics of this function.
>>
>> If this is the final code, compared to old code, we have an extra gfn_write_protect(),
>> I don't whether it's acceptable?
> 
> Ah, I see. Please correct me if I'm wrong, but I think that in order
> to check that the only mappings on the GFN range are large, we'd still
> have to go over the rmap for the 4k mappings, at least for the legacy
> MMU. In that case, we're doing about as much work as the extra
> gfn_write_protect and I don't think that we'd get any efficiency gain
> for the change in semantics.
> 
> Likewise for the TDP MMU, if the GFN range is mapped both large and
> 4k, it would have to be in different TDP structures, so the efficiency
> gains would again not be very big.
I am not familiar with the MMU virtualization of x86 arch, but I think
you are right.

> 
> I'm really just guessing about those performance characteristics
> though. It would definitely help to have some performance data to back
> all this up. Even just a few runs of the dirty_log_perf_test (in
> selftests) could provide some interesting results, and I'd be happy to
> help review any improvements you might make to that test.
> 
> Regardless, I'd be inclined to keep this change as simple as possible
> for now and the early return optimization could happen in a follow-up
> patch. I think the extra gfn_write_protect is acceptable, especially
> if you can show that it doesn't cause a big hit in performance when
> running the dirty_log_perf_test with 4k and 2m backing memory.
I tested it using dirty_log_perf_test; the results show that clear_dirty_log
performance differs by less than 2%.

*Without this patch*

./dirty_log_perf_test -i 5 -v 16 -s anonymous

Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
guest physical test memory offset: 0xffbfffff000
Populate memory time: 3.105203579s
Enabling dirty logging time: 0.000323444s
[...]
Get dirty log over 5 iterations took 0.000595033s. (Avg 0.000119006s/iteration)
Clear dirty log over 5 iterations took 0.713212922s. (Avg 0.142642584s/iteration)

./dirty_log_perf_test -i 5 -v 16 -s anonymous_hugetlb

Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
guest physical test memory offset: 0xffbfffff000
Populate memory time: 3.922764235s
Enabling dirty logging time: 0.000316473s
[...]
Get dirty log over 5 iterations took 0.000485459s. (Avg 0.000097091s/iteration)
Clear dirty log over 5 iterations took 0.603749670s. (Avg 0.120749934s/iteration)


*With this patch*

./dirty_log_perf_test -i 5 -v 16 -s anonymous

Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
guest physical test memory offset: 0xffbfffff000
Populate memory time: 3.244515198s
Enabling dirty logging time: 0.000280207s
[...]
Get dirty log over 5 iterations took 0.000484953s. (Avg 0.000096990s/iteration)
Clear dirty log over 5 iterations took 0.727620114s. (Avg 0.145524022s/iteration)

./dirty_log_perf_test -i 5 -v 16 -s anonymous_hugetlb

Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
guest physical test memory offset: 0xffbfffff000
Populate memory time: 3.244294061s
Enabling dirty logging time: 0.000273590s
[...]
Get dirty log over 5 iterations took 0.000474244s. (Avg 0.000094848s/iteration)
Clear dirty log over 5 iterations took 0.600593672s. (Avg 0.120118734s/iteration)


I ran into a problem: there is no huge page mapping when testing with
"-s anonymous_hugetlb", with the TDP MMU either enabled or disabled.

However, the tests above do show that our optimization has little effect
on clear_dirty_log performance.

Thanks,
Keqian

> 
>>
>> Thanks,
>> Keqian
>>
>>
>>>
>>>> +        */
>>>> +
>>>> +       /* Then we can handle the PT level pages */
>>>>         if (kvm_x86_ops.cpu_dirty_log_size)
>>>>                 kvm_mmu_clear_dirty_pt_masked(kvm, slot, gfn_offset, mask);
>>>>         else
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index eca63625aee4..dfd676ffa7da 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -10888,36 +10888,19 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
>>>>                  */
>>>>                 kvm_mmu_zap_collapsible_sptes(kvm, new);
>>>>         } else {
>>>> -               /* By default, write-protect everything to log writes. */
>>>> -               int level = PG_LEVEL_4K;
>>>> +               /*
>>>> +                * If we're with initial-all-set, we don't need to write protect
>>>> +                * any page because they're reported as dirty already.
>>>> +                */
>>>> +               if (kvm_dirty_log_manual_protect_and_init_set(kvm))
>>>> +                       return;
>>>>
>>>>                 if (kvm_x86_ops.cpu_dirty_log_size) {
>>>> -                       /*
>>>> -                        * Clear all dirty bits, unless pages are treated as
>>>> -                        * dirty from the get-go.
>>>> -                        */
>>>> -                       if (!kvm_dirty_log_manual_protect_and_init_set(kvm))
>>>> -                               kvm_mmu_slot_leaf_clear_dirty(kvm, new);
>>>> -
>>>> -                       /*
>>>> -                        * Write-protect large pages on write so that dirty
>>>> -                        * logging happens at 4k granularity.  No need to
>>>> -                        * write-protect small SPTEs since write accesses are
>>>> -                        * logged by the CPU via dirty bits.
>>>> -                        */
>>>> -                       level = PG_LEVEL_2M;
>>>> -               } else if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
>>>> -                       /*
>>>> -                        * If we're with initial-all-set, we don't need
>>>> -                        * to write protect any small page because
>>>> -                        * they're reported as dirty already.  However
>>>> -                        * we still need to write-protect huge pages
>>>> -                        * so that the page split can happen lazily on
>>>> -                        * the first write to the huge page.
>>>> -                        */
>>>> -                       level = PG_LEVEL_2M;
>>>> +                       kvm_mmu_slot_leaf_clear_dirty(kvm, new);
>>>> +                       kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_2M);
>>>> +               } else {
>>>> +                       kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_4K);
>>>>                 }
>>>> -               kvm_mmu_slot_remove_write_access(kvm, new, level);
>>>>         }
>>>>  }
>>>>
>>>> --
>>>> 2.23.0
>>>>
>>> .
>>>
> .
> 


* Re: [RFC PATCH v2 2/2] KVM: x86: Not wr-protect huge page with init_all_set dirty log
  2021-04-27  5:03         ` Keqian Zhu
@ 2021-04-27 16:33           ` Ben Gardon
  2021-04-28 10:51             ` Keqian Zhu
  0 siblings, 1 reply; 11+ messages in thread
From: Ben Gardon @ 2021-04-27 16:33 UTC (permalink / raw)
  To: Keqian Zhu; +Cc: LKML, kvm, Paolo Bonzini, Sean Christopherson, wanghaibin.wang

On Mon, Apr 26, 2021 at 10:04 PM Keqian Zhu <zhukeqian1@huawei.com> wrote:
>
> Hi Ben,
>
> Sorry for the delay reply!
>
> On 2021/4/21 0:30, Ben Gardon wrote:
> > On Tue, Apr 20, 2021 at 12:49 AM Keqian Zhu <zhukeqian1@huawei.com> wrote:
> >>
> >> Hi Ben,
> >>
> >> On 2021/4/20 3:20, Ben Gardon wrote:
> >>> On Fri, Apr 16, 2021 at 1:25 AM Keqian Zhu <zhukeqian1@huawei.com> wrote:
> >>>>
> >>>> Currently during start dirty logging, if we're with init-all-set,
> >>>> we write protect huge pages and leave normal pages untouched, for
> >>>> that we can enable dirty logging for these pages lazily.
> >>>>
> >>>> Actually enable dirty logging lazily for huge pages is feasible
> >>>> too, which not only reduces the time of start dirty logging, also
> >>>> greatly reduces side-effect on guest when there is high dirty rate.
> >>>>
> >>>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> >>>> ---
> >>>>  arch/x86/kvm/mmu/mmu.c | 48 ++++++++++++++++++++++++++++++++++++++----
> >>>>  arch/x86/kvm/x86.c     | 37 +++++++++-----------------------
> >>>>  2 files changed, 54 insertions(+), 31 deletions(-)
> >>>>
> >>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >>>> index 2ce5bc2ea46d..98fa25172b9a 100644
> >>>> --- a/arch/x86/kvm/mmu/mmu.c
> >>>> +++ b/arch/x86/kvm/mmu/mmu.c
> >>>> @@ -1188,8 +1188,7 @@ static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
> >>>>   * @gfn_offset: start of the BITS_PER_LONG pages we care about
> >>>>   * @mask: indicates which pages we should protect
> >>>>   *
> >>>> - * Used when we do not need to care about huge page mappings: e.g. during dirty
> >>>> - * logging we do not have any such mappings.
> >>>> + * Used when we do not need to care about huge page mappings.
> >>>>   */
> >>>>  static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> >>>>                                      struct kvm_memory_slot *slot,
> >>>> @@ -1246,13 +1245,54 @@ static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
> >>>>   * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
> >>>>   * enable dirty logging for them.
> >>>>   *
> >>>> - * Used when we do not need to care about huge page mappings: e.g. during dirty
> >>>> - * logging we do not have any such mappings.
> >>>> + * We need to care about huge page mappings: e.g. during dirty logging we may
> >>>> + * have any such mappings.
> >>>>   */
> >>>>  void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> >>>>                                 struct kvm_memory_slot *slot,
> >>>>                                 gfn_t gfn_offset, unsigned long mask)
> >>>>  {
> >>>> +       gfn_t start, end;
> >>>> +
> >>>> +       /*
> >>>> +        * Huge pages are NOT write protected when we start dirty log with
> >>>> +        * init-all-set, so we must write protect them at here.
> >>>> +        *
> >>>> +        * The gfn_offset is guaranteed to be aligned to 64, but the base_gfn
> >>>> +        * of memslot has no such restriction, so the range can cross two large
> >>>> +        * pages.
> >>>> +        */
> >>>> +       if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
> >>>> +               start = slot->base_gfn + gfn_offset + __ffs(mask);
> >>>> +               end = slot->base_gfn + gfn_offset + __fls(mask);
> >>>> +               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> >>>> +
> >>>> +               /* Cross two large pages? */
> >>>> +               if (ALIGN(start << PAGE_SHIFT, PMD_SIZE) !=
> >>>> +                   ALIGN(end << PAGE_SHIFT, PMD_SIZE))
> >>>> +                       kvm_mmu_slot_gfn_write_protect(kvm, slot, end,
> >>>> +                                                      PG_LEVEL_2M);
> >>>> +       }
> >>>> +
> >>>> +       /*
> >>>> +        * RFC:
> >>>> +        *
> >>>> +        * 1. I don't return early when kvm_mmu_slot_gfn_write_protect() returns
> >>>> +        * true, because I am not very clear about the relationship between
> >>>> +        * legacy mmu and tdp mmu. AFAICS, the code logic is NOT an if/else
> >>>> +        * manner.
> >>>> +        *
> >>>> +        * The kvm_mmu_slot_gfn_write_protect() returns true when we hit a
> >>>> +        * writable large page mapping in legacy mmu mapping or tdp mmu mapping.
> >>>> +        * Do we still have normal mapping in that case? (e.g. We have large
> >>>> +        * mapping in legacy mmu and normal mapping in tdp mmu).
> >>>
> >>> Right, we can't return early because the two MMUs could map the page
> >>> in different ways, but each MMU could also map the page in multiple
> >>> ways independently.
> >>> For example, if the legacy MMU was being used and we were running a
> >>> nested VM, a page could be mapped 2M in EPT01 and 4K in EPT02, so we'd
> >>> still need kvm_mmu_slot_gfn_write_protect  calls for both levels.
> >>> I don't think there's a case where we can return early here with the
> >>> information that the first calls to kvm_mmu_slot_gfn_write_protect
> >>> access.
> >> Thanks for the detailed explanation.
> >>
> >>>
> >>>> +        *
> >>>> +        * 2. kvm_mmu_slot_gfn_write_protect() doesn't tell us whether the large
> >>>> +        * page mapping exist. If it exists but is clean, we can return early.
> >>>> +        * However, we have to do invasive change.
> >>>
> >>> What do you mean by invasive change?
> >> We need kvm_mmu_slot_gfn_write_protect to report whether all mappings are large
> >> and clean, so we can return early. However, that's not part of the semantics of this function.
> >>
> >> If this is the final code, compared to the old code we have an extra gfn_write_protect();
> >> I don't know whether that's acceptable.
> >
> > Ah, I see. Please correct me if I'm wrong, but I think that in order
> > to check that the only mappings on the GFN range are large, we'd still
> > have to go over the rmap for the 4k mappings, at least for the legacy
> > MMU. In that case, we're doing about as much work as the extra
> > gfn_write_protect and I don't think that we'd get any efficiency gain
> > for the change in semantics.
> >
> > Likewise for the TDP MMU, if the GFN range is mapped both large and
> > 4k, it would have to be in different TDP structures, so the efficiency
> > gains would again not be very big.
> I am not familiar with the MMU virtualization of x86 arch, but I think
> you are right.
>
> >
> > I'm really just guessing about those performance characteristics
> > though. It would definitely help to have some performance data to back
> > all this up. Even just a few runs of the dirty_log_perf_test (in
> > selftests) could provide some interesting results, and I'd be happy to
> > help review any improvements you might make to that test.
> >
> > Regardless, I'd be inclined to keep this change as simple as possible
> > for now and the early return optimization could happen in a follow-up
> > patch. I think the extra gfn_write_protect is acceptable, especially
> > if you can show that it doesn't cause a big hit in performance when
> > running the dirty_log_perf_test with 4k and 2m backing memory.
> I tested it using dirty_log_perf_test; the results show that the performance
> of clear_dirty_log differs within 2%.

I think there are a couple obstacles which make the stock
dirty_log_perf_test less useful for measuring this optimization.

1. Variance between runs
With only 16 vCPUs and whatever the associated default guest memory
size is, random system events and daemons introduce a lot of variance,
at least in my testing. I usually try to run the biggest VM I can to
smooth that out, but even with a 96 vCPU VM, a 2% difference is often
not statistically significant.  CPU pinning for the vCPU threads would
help a lot to reduce variance. I don't remember if anyone has
implemented this yet.
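
Just to sketch the idea (a hypothetical helper; "vcpu_threads" and "nr_vcpus"
are stand-ins for whatever the harness actually exposes, not existing selftest
symbols), pinning each vCPU worker thread 1:1 to a physical CPU could look
something like this:

/*
 * Hypothetical sketch only: pin vCPU worker thread i to physical CPU i to
 * reduce run-to-run variance.  "vcpu_threads" and "nr_vcpus" stand in for
 * whatever the selftest harness really provides.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static void pin_vcpu_threads(pthread_t *vcpu_threads, int nr_vcpus)
{
	cpu_set_t cpuset;
	int i, ret;

	for (i = 0; i < nr_vcpus; i++) {
		CPU_ZERO(&cpuset);
		CPU_SET(i, &cpuset);	/* naive 1:1 vCPU-to-pCPU mapping */

		ret = pthread_setaffinity_np(vcpu_threads[i],
					     sizeof(cpuset), &cpuset);
		if (ret) {
			fprintf(stderr, "pinning vCPU %d failed: %d\n", i, ret);
			exit(1);
		}
	}
}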

2. The guest dirty pattern
By default, each guest vCPU will dirty its entire partition of guest
memory on each iteration. This means that instead of amortizing out
the cost of write-protecting and splitting large pages, we simply move
the burden later in the process. I see you didn't include the time for
each iteration below, but I would expect this patch to move some of
the time from "Enabling dirty logging time" and "Dirtying memory time"
for pass 1 to "Clear dirty log time" and "Dirtying memory time" for
pass 2. I wouldn't expect the total time over 5 iterations to change
for this test.

It would probably also serve us well to have some kind of "hot" subset
of memory for each vCPU, since some of the benefit of lazy large page
splitting depend on that access pattern.
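
Purely as an illustration of that access pattern (the constants and the
function below are made up, not the test's real guest code), the per-vCPU
guest write loop could be shaped like this:

/*
 * Hypothetical guest-side sketch: dirty a small "hot" slice of the vCPU's
 * partition on every pass, and the remaining "cold" pages only every Nth
 * pass, so lazy large page splitting has something to win on.  All names
 * and constants here are illustrative.
 */
#define HOT_FRACTION	10	/* 1/10th of the partition is "hot" */
#define COLD_PERIOD	8	/* touch the cold pages every 8th pass */

static void guest_dirty_memory(char *base, unsigned long pages,
			       unsigned long page_size, unsigned long pass)
{
	unsigned long hot_pages = pages / HOT_FRACTION;
	unsigned long i;

	/* Hot set: written every pass, so it is split and re-dirtied early. */
	for (i = 0; i < hot_pages; i++)
		base[i * page_size] = (char)pass;

	/* Cold set: written rarely, so its large mappings survive longer. */
	if (pass % COLD_PERIOD == 0) {
		for (i = hot_pages; i < pages; i++)
			base[i * page_size] = (char)pass;
	}
}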

3. Lockstep dirtying and dirty log collection
While this test is currently great for timing dirty logging
operations, it's not great for trickier analysis, especially
reductions to guest degradation. In order to measure that we'd need to
change the test to collect the dirty log as quickly as possible,
independent of what the guest is doing and then also record how much
"progress" the guest is able to make while all that is happening.

I'd be happy to help review any improvements to the test which you
feel like making.

>
> *Without this patch*
>
> ./dirty_log_perf_test -i 5 -v 16 -s anonymous
>
> Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
> guest physical test memory offset: 0xffbfffff000
> Populate memory time: 3.105203579s
> Enabling dirty logging time: 0.000323444s
> [...]
> Get dirty log over 5 iterations took 0.000595033s. (Avg 0.000119006s/iteration)
> Clear dirty log over 5 iterations took 0.713212922s. (Avg 0.142642584s/iteration)
>
> ./dirty_log_perf_test -i 5 -v 16 -s anonymous_hugetlb
>
> Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
> guest physical test memory offset: 0xffbfffff000
> Populate memory time: 3.922764235s
> Enabling dirty logging time: 0.000316473s
> [...]
> Get dirty log over 5 iterations took 0.000485459s. (Avg 0.000097091s/iteration)
> Clear dirty log over 5 iterations took 0.603749670s. (Avg 0.120749934s/iteration)
>
>
> *With this patch*
>
> ./dirty_log_perf_test -i 5 -v 16 -s anonymous
>
> Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
> guest physical test memory offset: 0xffbfffff000
> Populate memory time: 3.244515198s
> Enabling dirty logging time: 0.000280207s
> [...]
> Get dirty log over 5 iterations took 0.000484953s. (Avg 0.000096990s/iteration)
> Clear dirty log over 5 iterations took 0.727620114s. (Avg 0.145524022s/iteration)
>
> ./dirty_log_perf_test -i 5 -v 16 -s anonymous_hugetlb
>
> Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
> guest physical test memory offset: 0xffbfffff000
> Populate memory time: 3.244294061s
> Enabling dirty logging time: 0.000273590s
> [...]
> Get dirty log over 5 iterations took 0.000474244s. (Avg 0.000094848s/iteration)
> Clear dirty log over 5 iterations took 0.600593672s. (Avg 0.120118734s/iteration)
>
>
> I hit a problem: there are no huge page mappings when testing with
> "-s anonymous_hugetlb", with TDP either enabled or disabled.

Do you mean that even before dirty logging was enabled, KVM didn't
create any large mappings? That's odd. I would assume the backing
memory allocation would just fail if there aren't enough hugepages
available.
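
If it helps debug that, one way to sanity-check the host before the run is to
look at HugePages_Free in /proc/meminfo; the helper below is only a sketch of
that check, not something the test currently does.

/*
 * Sketch of a host-side sanity check: make sure enough hugetlb pages are
 * actually free before backing guest memory with them.  /proc/meminfo is
 * a standard interface; the helper itself is only illustrative.
 */
#include <stdbool.h>
#include <stdio.h>

static bool enough_free_hugepages(unsigned long needed)
{
	char line[128];
	unsigned long free_pages = 0;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return false;

	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "HugePages_Free: %lu", &free_pages) == 1)
			break;
	}
	fclose(f);

	return free_pages >= needed;
}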

>
> However, the tests above show that our optimization has little effect
> on clear_dirty_log performance.
>
> Thanks,
> Keqian
>
> >
> >>
> >> Thanks,
> >> Keqian
> >>
> >>
> >>>
> >>>> +        */
> >>>> +
> >>>> +       /* Then we can handle the PT level pages */
> >>>>         if (kvm_x86_ops.cpu_dirty_log_size)
> >>>>                 kvm_mmu_clear_dirty_pt_masked(kvm, slot, gfn_offset, mask);
> >>>>         else
> >>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >>>> index eca63625aee4..dfd676ffa7da 100644
> >>>> --- a/arch/x86/kvm/x86.c
> >>>> +++ b/arch/x86/kvm/x86.c
> >>>> @@ -10888,36 +10888,19 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
> >>>>                  */
> >>>>                 kvm_mmu_zap_collapsible_sptes(kvm, new);
> >>>>         } else {
> >>>> -               /* By default, write-protect everything to log writes. */
> >>>> -               int level = PG_LEVEL_4K;
> >>>> +               /*
> >>>> +                * If we're with initial-all-set, we don't need to write protect
> >>>> +                * any page because they're reported as dirty already.
> >>>> +                */
> >>>> +               if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> >>>> +                       return;
> >>>>
> >>>>                 if (kvm_x86_ops.cpu_dirty_log_size) {
> >>>> -                       /*
> >>>> -                        * Clear all dirty bits, unless pages are treated as
> >>>> -                        * dirty from the get-go.
> >>>> -                        */
> >>>> -                       if (!kvm_dirty_log_manual_protect_and_init_set(kvm))
> >>>> -                               kvm_mmu_slot_leaf_clear_dirty(kvm, new);
> >>>> -
> >>>> -                       /*
> >>>> -                        * Write-protect large pages on write so that dirty
> >>>> -                        * logging happens at 4k granularity.  No need to
> >>>> -                        * write-protect small SPTEs since write accesses are
> >>>> -                        * logged by the CPU via dirty bits.
> >>>> -                        */
> >>>> -                       level = PG_LEVEL_2M;
> >>>> -               } else if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
> >>>> -                       /*
> >>>> -                        * If we're with initial-all-set, we don't need
> >>>> -                        * to write protect any small page because
> >>>> -                        * they're reported as dirty already.  However
> >>>> -                        * we still need to write-protect huge pages
> >>>> -                        * so that the page split can happen lazily on
> >>>> -                        * the first write to the huge page.
> >>>> -                        */
> >>>> -                       level = PG_LEVEL_2M;
> >>>> +                       kvm_mmu_slot_leaf_clear_dirty(kvm, new);
> >>>> +                       kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_2M);
> >>>> +               } else {
> >>>> +                       kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_4K);
> >>>>                 }
> >>>> -               kvm_mmu_slot_remove_write_access(kvm, new, level);
> >>>>         }
> >>>>  }
> >>>>
> >>>> --
> >>>> 2.23.0
> >>>>
> >>> .
> >>>
> > .
> >

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH v2 2/2] KVM: x86: Not wr-protect huge page with init_all_set dirty log
  2021-04-27 16:33           ` Ben Gardon
@ 2021-04-28 10:51             ` Keqian Zhu
       [not found]               ` <60894846.1c69fb81.6e765.161bSMTPIN_ADDED_BROKEN@mx.google.com>
  0 siblings, 1 reply; 11+ messages in thread
From: Keqian Zhu @ 2021-04-28 10:51 UTC (permalink / raw)
  To: Ben Gardon; +Cc: LKML, kvm, Paolo Bonzini, Sean Christopherson, wanghaibin.wang



On 2021/4/28 0:33, Ben Gardon wrote:
> On Mon, Apr 26, 2021 at 10:04 PM Keqian Zhu <zhukeqian1@huawei.com> wrote:
>>
>> Hi Ben,
>>
>> Sorry for the delayed reply!
>>
>> On 2021/4/21 0:30, Ben Gardon wrote:
>>> On Tue, Apr 20, 2021 at 12:49 AM Keqian Zhu <zhukeqian1@huawei.com> wrote:
>>>>
>>>> Hi Ben,
>>>>
>>>> On 2021/4/20 3:20, Ben Gardon wrote:
>>>>> On Fri, Apr 16, 2021 at 1:25 AM Keqian Zhu <zhukeqian1@huawei.com> wrote:
>>>>>>
>>>>>> Currently during start dirty logging, if we're with init-all-set,
>>>>>> we write protect huge pages and leave normal pages untouched, for
>>>>>> that we can enable dirty logging for these pages lazily.
>>>>>>
>>>>>> Actually enable dirty logging lazily for huge pages is feasible
>>>>>> too, which not only reduces the time of start dirty logging, also
>>>>>> greatly reduces side-effect on guest when there is high dirty rate.
>>>>>>
>>>>>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
>>>>>> ---
>>>>>>  arch/x86/kvm/mmu/mmu.c | 48 ++++++++++++++++++++++++++++++++++++++----
>>>>>>  arch/x86/kvm/x86.c     | 37 +++++++++-----------------------
>>>>>>  2 files changed, 54 insertions(+), 31 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>>>>> index 2ce5bc2ea46d..98fa25172b9a 100644
>>>>>> --- a/arch/x86/kvm/mmu/mmu.c
>>>>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>>>>> @@ -1188,8 +1188,7 @@ static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>>>>>>   * @gfn_offset: start of the BITS_PER_LONG pages we care about
>>>>>>   * @mask: indicates which pages we should protect
>>>>>>   *
>>>>>> - * Used when we do not need to care about huge page mappings: e.g. during dirty
>>>>>> - * logging we do not have any such mappings.
>>>>>> + * Used when we do not need to care about huge page mappings.
>>>>>>   */
>>>>>>  static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
>>>>>>                                      struct kvm_memory_slot *slot,
>>>>>> @@ -1246,13 +1245,54 @@ static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
>>>>>>   * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
>>>>>>   * enable dirty logging for them.
>>>>>>   *
>>>>>> - * Used when we do not need to care about huge page mappings: e.g. during dirty
>>>>>> - * logging we do not have any such mappings.
>>>>>> + * We need to care about huge page mappings: e.g. during dirty logging we may
>>>>>> + * have any such mappings.
>>>>>>   */
>>>>>>  void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>>>>>>                                 struct kvm_memory_slot *slot,
>>>>>>                                 gfn_t gfn_offset, unsigned long mask)
>>>>>>  {
>>>>>> +       gfn_t start, end;
>>>>>> +
>>>>>> +       /*
>>>>>> +        * Huge pages are NOT write protected when we start dirty log with
>>>>>> +        * init-all-set, so we must write protect them at here.
>>>>>> +        *
>>>>>> +        * The gfn_offset is guaranteed to be aligned to 64, but the base_gfn
>>>>>> +        * of memslot has no such restriction, so the range can cross two large
>>>>>> +        * pages.
>>>>>> +        */
>>>>>> +       if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
>>>>>> +               start = slot->base_gfn + gfn_offset + __ffs(mask);
>>>>>> +               end = slot->base_gfn + gfn_offset + __fls(mask);
>>>>>> +               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
>>>>>> +
>>>>>> +               /* Cross two large pages? */
>>>>>> +               if (ALIGN(start << PAGE_SHIFT, PMD_SIZE) !=
>>>>>> +                   ALIGN(end << PAGE_SHIFT, PMD_SIZE))
>>>>>> +                       kvm_mmu_slot_gfn_write_protect(kvm, slot, end,
>>>>>> +                                                      PG_LEVEL_2M);
>>>>>> +       }
>>>>>> +
>>>>>> +       /*
>>>>>> +        * RFC:
>>>>>> +        *
>>>>>> +        * 1. I don't return early when kvm_mmu_slot_gfn_write_protect() returns
>>>>>> +        * true, because I am not very clear about the relationship between
>>>>>> +        * legacy mmu and tdp mmu. AFAICS, the code logic is NOT an if/else
>>>>>> +        * manner.
>>>>>> +        *
>>>>>> +        * The kvm_mmu_slot_gfn_write_protect() returns true when we hit a
>>>>>> +        * writable large page mapping in legacy mmu mapping or tdp mmu mapping.
>>>>>> +        * Do we still have normal mapping in that case? (e.g. We have large
>>>>>> +        * mapping in legacy mmu and normal mapping in tdp mmu).
>>>>>
>>>>> Right, we can't return early because the two MMUs could map the page
>>>>> in different ways, but each MMU could also map the page in multiple
>>>>> ways independently.
>>>>> For example, if the legacy MMU was being used and we were running a
>>>>> nested VM, a page could be mapped 2M in EPT01 and 4K in EPT02, so we'd
>>>>> still need kvm_mmu_slot_gfn_write_protect  calls for both levels.
>>>>> I don't think there's a case where we can return early here with the
>>>>> information that the first calls to kvm_mmu_slot_gfn_write_protect
>>>>> access.
>>>> Thanks for the detailed explanation.
>>>>
>>>>>
>>>>>> +        *
>>>>>> +        * 2. kvm_mmu_slot_gfn_write_protect() doesn't tell us whether the large
>>>>>> +        * page mapping exist. If it exists but is clean, we can return early.
>>>>>> +        * However, we have to do invasive change.
>>>>>
>>>>> What do you mean by invasive change?
>>>> We need kvm_mmu_slot_gfn_write_protect to report whether all mappings are large
>>>> and clean, so we can return early. However, that's not part of the semantics of this function.
>>>>
>>>> If this is the final code, compared to the old code we have an extra gfn_write_protect();
>>>> I don't know whether that's acceptable.
>>>
>>> Ah, I see. Please correct me if I'm wrong, but I think that in order
>>> to check that the only mappings on the GFN range are large, we'd still
>>> have to go over the rmap for the 4k mappings, at least for the legacy
>>> MMU. In that case, we're doing about as much work as the extra
>>> gfn_write_protect and I don't think that we'd get any efficiency gain
>>> for the change in semantics.
>>>
>>> Likewise for the TDP MMU, if the GFN range is mapped both large and
>>> 4k, it would have to be in different TDP structures, so the efficiency
>>> gains would again not be very big.
>> I am not familiar with the MMU virtualization of x86 arch, but I think
>> you are right.
>>
>>>
>>> I'm really just guessing about those performance characteristics
>>> though. It would definitely help to have some performance data to back
>>> all this up. Even just a few runs of the dirty_log_perf_test (in
>>> selftests) could provide some interesting results, and I'd be happy to
>>> help review any improvements you might make to that test.
>>>
>>> Regardless, I'd be inclined to keep this change as simple as possible
>>> for now and the early return optimization could happen in a follow-up
>>> patch. I think the extra gfn_write_protect is acceptable, especially
>>> if you can show that it doesn't cause a big hit in performance when
>>> running the dirty_log_perf_test with 4k and 2m backing memory.
>> I tested it using dirty_log_perf_test; the results show that the performance
>> of clear_dirty_log differs within 2%.
> 
> I think there are a couple obstacles which make the stock
> dirty_log_perf_test less useful for measuring this optimization.
> 
> 1. Variance between runs
> With only 16 vCPUs and whatever the associated default guest memory
> size is, random system events and daemons introduce a lot of variance,
> at least in my testing. I usually try to run the biggest VM I can to
> smooth that out, but even with a 96 vCPU VM, a 2% difference is often
> not statistically significant.  CPU pinning for the vCPU threads would
> help a lot to reduce variance. I don't remember if anyone has
> implemented this yet.
Yes, this makes sense.

> 
> 2. The guest dirty pattern
> By default, each guest vCPU will dirty its entire partition of guest
> memory on each iteration. This means that instead of amortizing out
> the cost of write-protecting and splitting large pages, we simply move
> the burden later in the process. I see you didn't include the time for
> each iteration below, but I would expect this patch to move some of
> the time from "Enabling dirty logging time" and "Dirtying memory time"
> for pass 1 to "Clear dirty log time" and "Dirtying memory time" for
> pass 2. I wouldn't expect the total time over 5 iterations to change
> for this test.
If we have large page mappings and this optimization is applied, the "Enabling dirty logging time"
and the first-round "Dirtying memory time" will be greatly reduced.

However, I don't think the other times (dirty_memory except for the first round, get_log, clear_log)
are expected to change compared to the case without the optimization, because after the first round
of "Dirtying memory" all large mappings have been split to normal mappings, so the situation is the
same as without this optimization.

Maybe I'm missing something?

> 
> It would probably also serve us well to have some kind of "hot" subset
> of memory for each vCPU, since some of the benefit of lazy large page
> splitting depend on that access pattern.
>
> 3. Lockstep dirtying and dirty log collection
> While this test is currently great for timing dirty logging
> operations, it's not great for trickier analysis, especially
> reductions to guest degradation. In order to measure that we'd need to
> change the test to collect the dirty log as quickly as possible,
> independent of what the guest is doing and then also record how much
> "progress" the guest is able to make while all that is happening.
Yes, that makes sense.

Does the "dirty log collection" contain "dirty log clear"? As I understand it, dirty log
collection is very fast, just some memory copying, but for "dirty log clear" we have to modify
mappings and perform TLBI, which takes much longer.

> 
> I'd be happy to help review any improvements to the test which you
> feel like making.
Thanks, Ben. Hmm... I'm sorry, but I probably don't have enough time to do this; a lot of other work is queued...
On the other hand, I think the first-round "Dirtying memory time" can already show the effect of the optimization.

> 
>>
>> *Without this patch*
>>
>> ./dirty_log_perf_test -i 5 -v 16 -s anonymous
>>
>> Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
>> guest physical test memory offset: 0xffbfffff000
>> Populate memory time: 3.105203579s
>> Enabling dirty logging time: 0.000323444s
>> [...]
>> Get dirty log over 5 iterations took 0.000595033s. (Avg 0.000119006s/iteration)
>> Clear dirty log over 5 iterations took 0.713212922s. (Avg 0.142642584s/iteration)
>>
>> ./dirty_log_perf_test -i 5 -v 16 -s anonymous_hugetlb
>>
>> Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
>> guest physical test memory offset: 0xffbfffff000
>> Populate memory time: 3.922764235s
>> Enabling dirty logging time: 0.000316473s
>> [...]
>> Get dirty log over 5 iterations took 0.000485459s. (Avg 0.000097091s/iteration)
>> Clear dirty log over 5 iterations took 0.603749670s. (Avg 0.120749934s/iteration)
>>
>>
>> *With this patch*
>>
>> ./dirty_log_perf_test -i 5 -v 16 -s anonymous
>>
>> Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
>> guest physical test memory offset: 0xffbfffff000
>> Populate memory time: 3.244515198s
>> Enabling dirty logging time: 0.000280207s
>> [...]
>> Get dirty log over 5 iterations took 0.000484953s. (Avg 0.000096990s/iteration)
>> Clear dirty log over 5 iterations took 0.727620114s. (Avg 0.145524022s/iteration)
>>
>> ./dirty_log_perf_test -i 5 -v 16 -s anonymous_hugetlb
>>
>> Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
>> guest physical test memory offset: 0xffbfffff000
>> Populate memory time: 3.244294061s
>> Enabling dirty logging time: 0.000273590s
>> [...]
>> Get dirty log over 5 iterations took 0.000474244s. (Avg 0.000094848s/iteration)
>> Clear dirty log over 5 iterations took 0.600593672s. (Avg 0.120118734s/iteration)
>>
>>
>> I hit a problem: there are no huge page mappings when testing with
>> "-s anonymous_hugetlb", with TDP either enabled or disabled.
> 
> Do you mean that even before dirty logging was enabled, KVM didn't
> create any large mappings? That's odd. I would assume the backing
> memory allocation would just fail if there aren't enough hugepages
> available.
It's odd indeed. I can see large mappings when I do a normal migration, but I
don't see any large mappings when running this test.

I have verified that the time of "clear dirty log" is not affected; what about sending a
formal patch?

Thanks,
Keqian

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH v2 2/2] KVM: x86: Not wr-protect huge page with init_all_set dirty log
       [not found]               ` <60894846.1c69fb81.6e765.161bSMTPIN_ADDED_BROKEN@mx.google.com>
@ 2021-04-28 16:22                 ` Ben Gardon
  2021-04-29  3:30                   ` Keqian Zhu
  0 siblings, 1 reply; 11+ messages in thread
From: Ben Gardon @ 2021-04-28 16:22 UTC (permalink / raw)
  To: zhukeqian; +Cc: LKML, kvm, Paolo Bonzini, Sean Christopherson, Wanghaibin (D)

On Wed, Apr 28, 2021 at 4:34 AM zhukeqian <zhukeqian1@huawei.com> wrote:
>
> Oh, I have to correct myself.
>
> without this opt:
> first round dirtying: write fault and split large mapping
> second round: write fault
>
> with this opt:
> first round dirtying: no write fault
> second round: write fault and split large mapping
>
> the total test time is expected to be reduced.

Oh yeah, good point. So we should really see the savings in the first
round dirty memory time. Good catch.

>
> On 2021/4/28 0:33, Ben Gardon wrote:
> > On Mon, Apr 26, 2021 at 10:04 PM Keqian Zhu < zhukeqian1@huawei.com> wrote:
> >>
> >> Hi Ben,
> >>
> >> Sorry for the delayed reply!
> >>
> >> On 2021/4/21 0:30, Ben Gardon wrote:
> >>> On Tue, Apr 20, 2021 at 12:49 AM Keqian Zhu < zhukeqian1@huawei.com> wrote:
> >>>>
> >>>> Hi Ben,
> >>>>
> >>>> On 2021/4/20 3:20, Ben Gardon wrote:
> >>>>> On Fri, Apr 16, 2021 at 1:25 AM Keqian Zhu < zhukeqian1@huawei.com> wrote:
> >>>>>>
> >>>>>> Currently during start dirty logging, if we're with init-all-set,
> >>>>>> we write protect huge pages and leave normal pages untouched, for
> >>>>>> that we can enable dirty logging for these pages lazily.
> >>>>>>
> >>>>>> Actually enable dirty logging lazily for huge pages is feasible
> >>>>>> too, which not only reduces the time of start dirty logging, also
> >>>>>> greatly reduces side-effect on guest when there is high dirty rate.
> >>>>>>
> >>>>>> Signed-off-by: Keqian Zhu < zhukeqian1@huawei.com>
> >>>>>> ---
> >>>>>> arch/x86/kvm/mmu/mmu.c | 48 ++++++++++++++++++++++++++++++++++++++----
> >>>>>> arch/x86/kvm/x86.c     | 37 +++++++++-----------------------
> >>>>>> 2 files changed, 54 insertions(+), 31 deletions(-)
> >>>>>>
> >>>>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >>>>>> index 2ce5bc2ea46d..98fa25172b9a 100644
> >>>>>> --- a/arch/x86/kvm/mmu/mmu.c
> >>>>>> +++ b/arch/x86/kvm/mmu/mmu.c
> >>>>>> @@ -1188,8 +1188,7 @@ static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
> >>>>>>   * @gfn_offset: start of the BITS_PER_LONG pages we care about
> >>>>>>   * @mask: indicates which pages we should protect
> >>>>>>   *
> >>>>>> - * Used when we do not need to care about huge page mappings: e.g. during dirty
> >>>>>> - * logging we do not have any such mappings.
> >>>>>> + * Used when we do not need to care about huge page mappings.
> >>>>>>   */
> >>>>>> static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> >>>>>>                                      struct kvm_memory_slot *slot,
> >>>>>> @@ -1246,13 +1245,54 @@ static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
> >>>>>>   * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
> >>>>>>   * enable dirty logging for them.
> >>>>>>   *
> >>>>>> - * Used when we do not need to care about huge page mappings: e.g. during dirty
> >>>>>> - * logging we do not have any such mappings.
> >>>>>> + * We need to care about huge page mappings: e.g. during dirty logging we may
> >>>>>> + * have any such mappings.
> >>>>>>   */
> >>>>>> void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> >>>>>>                                 struct kvm_memory_slot *slot,
> >>>>>>                                 gfn_t gfn_offset, unsigned long mask)
> >>>>>> {
> >>>>>> +       gfn_t start, end;
> >>>>>> +
> >>>>>> +       /*
> >>>>>> +        * Huge pages are NOT write protected when we start dirty log with
> >>>>>> +        * init-all-set, so we must write protect them at here.
> >>>>>> +        *
> >>>>>> +        * The gfn_offset is guaranteed to be aligned to 64, but the base_gfn
> >>>>>> +        * of memslot has no such restriction, so the range can cross two large
> >>>>>> +        * pages.
> >>>>>> +        */
> >>>>>> +       if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
> >>>>>> +               start = slot->base_gfn + gfn_offset + __ffs(mask);
> >>>>>> +               end = slot->base_gfn + gfn_offset + __fls(mask);
> >>>>>> +               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> >>>>>> +
> >>>>>> +               /* Cross two large pages? */
> >>>>>> +               if (ALIGN(start << PAGE_SHIFT, PMD_SIZE) !=
> >>>>>> +                   ALIGN(end << PAGE_SHIFT, PMD_SIZE))
> >>>>>> +                       kvm_mmu_slot_gfn_write_protect(kvm, slot, end,
> >>>>>> +                                                      PG_LEVEL_2M);
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       /*
> >>>>>> +        * RFC:
> >>>>>> +        *
> >>>>>> +        * 1. I don't return early when kvm_mmu_slot_gfn_write_protect() returns
> >>>>>> +        * true, because I am not very clear about the relationship between
> >>>>>> +        * legacy mmu and tdp mmu. AFAICS, the code logic is NOT an if/else
> >>>>>> +        * manner.
> >>>>>> +        *
> >>>>>> +        * The kvm_mmu_slot_gfn_write_protect() returns true when we hit a
> >>>>>> +        * writable large page mapping in legacy mmu mapping or tdp mmu mapping.
> >>>>>> +        * Do we still have normal mapping in that case? (e.g. We have large
> >>>>>> +        * mapping in legacy mmu and normal mapping in tdp mmu).
> >>>>>
> >>>>> Right, we can't return early because the two MMUs could map the page
> >>>>> in different ways, but each MMU could also map the page in multiple
> >>>>> ways independently.
> >>>>> For example, if the legacy MMU was being used and we were running a
> >>>>> nested VM, a page could be mapped 2M in EPT01 and 4K in EPT02, so we'd
> >>>>> still need kvm_mmu_slot_gfn_write_protect calls for both levels.
> >>>>> I don't think there's a case where we can return early here with the
> >>>>> information that the first calls to kvm_mmu_slot_gfn_write_protect
> >>>>> access.
> >>>> Thanks for the detailed explanation.
> >>>>
> >>>>>
> >>>>>> +        *
> >>>>>> +        * 2. kvm_mmu_slot_gfn_write_protect() doesn't tell us whether the large
> >>>>>> +        * page mapping exist. If it exists but is clean, we can return early.
> >>>>>> +        * However, we have to do invasive change.
> >>>>>
> >>>>> What do you mean by invasive change?
> >>>> We need kvm_mmu_slot_gfn_write_protect to report whether all mappings are large
> >>>> and clean, so we can return early. However, that's not part of the semantics of this function.
> >>>>
> >>>> If this is the final code, compared to the old code we have an extra gfn_write_protect();
> >>>> I don't know whether that's acceptable.
> >>>
> >>> Ah, I see. Please correct me if I'm wrong, but I think that in order
> >>> to check that the only mappings on the GFN range are large, we'd still
> >>> have to go over the rmap for the 4k mappings, at least for the legacy
> >>> MMU. In that case, we're doing about as much work as the extra
> >>> gfn_write_protect and I don't think that we'd get any efficiency gain
> >>> for the change in semantics.
> >>>
> >>> Likewise for the TDP MMU, if the GFN range is mapped both large and
> >>> 4k, it would have to be in different TDP structures, so the efficiency
> >>> gains would again not be very big.
> >> I am not familiar with the MMU virtualization of x86 arch, but I think
> >> you are right.
> >>
> >>>
> >>> I'm really just guessing about those performance characteristics
> >>> though. It would definitely help to have some performance data to back
> >>> all this up. Even just a few runs of the dirty_log_perf_test (in
> >>> selftests) could provide some interesting results, and I'd be happy to
> >>> help review any improvements you might make to that test.
> >>>
> >>> Regardless, I'd be inclined to keep this change as simple as possible
> >>> for now and the early return optimization could happen in a follow-up
> >>> patch. I think the extra gfn_write_protect is acceptable, especially
> >>> if you can show that it doesn't cause a big hit in performance when
> >>> running the dirty_log_perf_test with 4k and 2m backing memory.
> >> I tested it using dirty_log_perf_test; the results show that the performance
> >> of clear_dirty_log differs within 2%.
> >
> > I think there are a couple obstacles which make the stock
> > dirty_log_perf_test less useful for measuring this optimization.
> >
> > 1. Variance between runs
> > With only 16 vCPUs and whatever the associated default guest memory
> > size is, random system events and daemons introduce a lot of variance,
> > at least in my testing. I usually try to run the biggest VM I can to
> > smooth that out, but even with a 96 vCPU VM, a 2% difference is often
> > not statistically significant. CPU pinning for the vCPU threads would
> > help a lot to reduce variance. I don't remember if anyone has
> > implemented this yet.
> Yes, this makes sense.
>
> >
> > 2. The guest dirty pattern
> > By default, each guest vCPU will dirty its entire partition of guest
> > memory on each iteration. This means that instead of amortizing out
> > the cost of write-protecting and splitting large pages, we simply move
> > the burden later in the process. I see you didn't include the time for
> > each iteration below, but I would expect this patch to move some of
> > the time from "Enabling dirty logging time" and "Dirtying memory time"
> > for pass 1 to "Clear dirty log time" and "Dirtying memory time" for
> > pass 2. I wouldn't expect the total time over 5 iterations to change
> > for this test.
> If we have large page mappings and this optimization is applied, the "Enabling dirty logging time"
> and the first-round "Dirtying memory time" will be greatly reduced.
>
> However, I don't think the other times (dirty_memory except for the first round, get_log, clear_log)
> are expected to change compared to the case without the optimization, because after the first round
> of "Dirtying memory" all large mappings have been split to normal mappings, so the situation is the
> same as without this optimization.
>
> Maybe I'm missing something?
>
> >
> > It would probably also serve us well to have some kind of "hot" subset
> > of memory for each vCPU, since some of the benefit of lazy large page
> > splitting depend on that access pattern.
> >
> > 3. Lockstep dirtying and dirty log collection
> > While this test is currently great for timing dirty logging
> > operations, it's not great for trickier analysis, especially
> > reductions to guest degradation. In order to measure that we'd need to
> > change the test to collect the dirty log as quickly as possible,
> > independent of what the guest is doing and then also record how much
> > "progress" the guest is able to make while all that is happening.
> Yes, that makes sense.
>
> Does the "dirty log collection" contain "dirty log clear"? As I understand it, dirty log
> collection is very fast, just some memory copying, but for "dirty log clear" we have to modify
> mappings and perform TLBI, which takes much longer.

Yeah, sorry. By dirty log collection I meant get + clear since the
test does both before it waits for the guest to dirty all memory
again.

>
> >
> > I'd be happy to help review any improvements to the test which you
> > feel like making.
> Thanks, Ben. Hmm... I'm sorry, but I probably don't have enough time to do this; a lot of other work is queued...
> On the other hand, I think the first-round "Dirtying memory time" can already show the effect of the optimization.

No worries, I think this is a good patch either way. No need to block
on test improvements, from my perspective.

>
> >
> >>
> >> *Without this patch*
> >>
> >> ./dirty_log_perf_test -i 5 -v 16 -s anonymous
> >>
> >> Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> >> guest physical test memory offset: 0xffbfffff000
> >> Populate memory time: 3.105203579s
> >> Enabling dirty logging time: 0.000323444s
> >> [...]
> >> Get dirty log over 5 iterations took 0.000595033s. (Avg 0.000119006s/iteration)
> >> Clear dirty log over 5 iterations took 0.713212922s. (Avg 0.142642584s/iteration)
> >>
> >> ./dirty_log_perf_test -i 5 -v 16 -s anonymous_hugetlb
> >>
> >> Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> >> guest physical test memory offset: 0xffbfffff000
> >> Populate memory time: 3.922764235s
> >> Enabling dirty logging time: 0.000316473s
> >> [...]
> >> Get dirty log over 5 iterations took 0.000485459s. (Avg 0.000097091s/iteration)
> >> Clear dirty log over 5 iterations took 0.603749670s. (Avg 0.120749934s/iteration)
> >>
> >>
> >> *With this patch*
> >>
> >> ./dirty_log_perf_test -i 5 -v 16 -s anonymous
> >>
> >> Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> >> guest physical test memory offset: 0xffbfffff000
> >> Populate memory time: 3.244515198s
> >> Enabling dirty logging time: 0.000280207s
> >> [...]
> >> Get dirty log over 5 iterations took 0.000484953s. (Avg 0.000096990s/iteration)
> >> Clear dirty log over 5 iterations took 0.727620114s. (Avg 0.145524022s/iteration)
> >>
> >> ./dirty_log_perf_test -i 5 -v 16 -s anonymous_hugetlb
> >>
> >> Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> >> guest physical test memory offset: 0xffbfffff000
> >> Populate memory time: 3.244294061s
> >> Enabling dirty logging time: 0.000273590s
> >> [...]
> >> Get dirty log over 5 iterations took 0.000474244s. (Avg 0.000094848s/iteration)
> >> Clear dirty log over 5 iterations took 0.600593672s. (Avg 0.120118734s/iteration)
> >>
> >>
> >> I hit a problem: there are no huge page mappings when testing with
> >> "-s anonymous_hugetlb", with TDP either enabled or disabled.
> >
> > Do you mean that even before dirty logging was enabled, KVM didn't
> > create any large mappings? That's odd. I would assume the backing
> > memory allocation would just fail if there aren't enough hugepages
> > available.
> It's odd indeed. I can see large mappings when I do a normal migration, but I
> don't see any large mappings when running this test.
>
> I have verified that the time of "clear dirty log" is not affected; what about sending a
> formal patch?

That sounds good to me.

>
> Thanks,
> Keqian
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH v2 2/2] KVM: x86: Not wr-protect huge page with init_all_set dirty log
  2021-04-28 16:22                 ` Ben Gardon
@ 2021-04-29  3:30                   ` Keqian Zhu
  0 siblings, 0 replies; 11+ messages in thread
From: Keqian Zhu @ 2021-04-29  3:30 UTC (permalink / raw)
  To: Ben Gardon; +Cc: LKML, kvm, Paolo Bonzini, Sean Christopherson, Wanghaibin (D)

Hi Ben,

On 2021/4/29 0:22, Ben Gardon wrote:
> On Wed, Apr 28, 2021 at 4:34 AM zhukeqian <zhukeqian1@huawei.com> wrote:
>>
>> Oh, I have to correct myself.
>>
>> without this opt:
>> first round dirtying: write fault and split large mapping
>> second round: write fault
>>
>> with this opt:
>> first round dirtying: no write fault
>> second round: write fault and split large mapping
>>
>> the total test time is expected to be reduced.
> 
> Oh yeah, good point. So we should really see the savings in the first
> round dirty memory time. Good catch.
> 
[...]

>>> It would probably also serve us well to have some kind of "hot" subset
>>> of memory for each vCPU, since some of the benefit of lazy large page
>>> splitting depend on that access pattern.
>>>
>>> 3. Lockstep dirtying and dirty log collection
>>> While this test is currently great for timing dirty logging
>>> operations, it's not great for trickier analysis, especially
>>> reductions to guest degradation. In order to measure that we'd need to
>>> change the test to collect the dirty log as quickly as possible,
>>> independent of what the guest is doing and then also record how much
>>> "progress" the guest is able to make while all that is happening.
>> Yes, make sense.
>>
>> Does the "dirty log collection" contains "dirty log clear"? As I understand, the dirty log
>> collection is very fast, just some memory copy. But for "dirty log clear", we should modify mappings
>> and perform TLBI, the time is much longer.
> 
> Yeah, sorry. By dirty log collection I meant get + clear since the
> test does both before it waits for the guest to dirty all memory
> again.
I see.

> 
>>
>>>
>>> I'd be happy to help review any improvements to the test which you
>>> feel like making.
>> Thanks, Ben. emm... I feel very sorry that perhaps I don't have enough time to do this, many works are queued...
>> On the other hand, I think the "Dirtying memory time" of first round can show us the optimization.
> 
> No worries, I think this is a good patch either way. No need to block
> on test improvements, from my perspective.
OK, thanks.


BRs,
Keqian

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-04-29  3:31 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-16  8:25 [RFC PATCH v2 0/2] KVM: x86: Enable dirty logging lazily for huge pages Keqian Zhu
2021-04-16  8:25 ` [RFC PATCH v2 1/2] KVM: x86: Support write protect gfn with min_level Keqian Zhu
2021-04-16  8:25 ` [RFC PATCH v2 2/2] KVM: x86: Not wr-protect huge page with init_all_set dirty log Keqian Zhu
2021-04-19 19:20   ` Ben Gardon
2021-04-20  7:49     ` Keqian Zhu
2021-04-20 16:30       ` Ben Gardon
2021-04-27  5:03         ` Keqian Zhu
2021-04-27 16:33           ` Ben Gardon
2021-04-28 10:51             ` Keqian Zhu
     [not found]               ` <60894846.1c69fb81.6e765.161bSMTPIN_ADDED_BROKEN@mx.google.com>
2021-04-28 16:22                 ` Ben Gardon
2021-04-29  3:30                   ` Keqian Zhu
