* [RFC PATCH 0/7] kvm: arm64: Implement SW/HW combined dirty log
@ 2021-01-26 12:44 Keqian Zhu
  2021-01-26 12:44 ` [RFC PATCH 1/7] arm64: cpufeature: Add API to report system support of HWDBM Keqian Zhu
                   ` (8 more replies)
  0 siblings, 9 replies; 12+ messages in thread
From: Keqian Zhu @ 2021-01-26 12:44 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, Marc Zyngier,
	Will Deacon, Catalin Marinas
  Cc: Alex Williamson, Kirti Wankhede, Cornelia Huck, Mark Rutland,
	James Morse, Robin Murphy, Suzuki K Poulose, wanghaibin.wang,
	jiangkunkun, xiexiangyou, zhengchuan, yubihong

The intention:

On the arm64 platform, we track the dirty log of vCPUs through guest memory aborts.
KVM consumes some guest vCPU time to change the stage2 mapping and mark pages dirty.
This has a heavy side effect on the VM, especially when multiple vCPUs race and
some of them block on the kvm mmu_lock.

DBM is a hardware-assisted approach to dirty logging: the MMU changes a PTE to be
writable if its DBM bit is set, so KVM does not need to consume vCPU time to log dirty pages.

About this patch series:

The biggest problem of applying DBM to stage2 is that software must scan the PTs to
collect the dirty state, which may cost much time and affect the downtime of migration.

This series implements a SW/HW combined dirty log that effectively solves this
problem (the SMMU side can also use this approach for DMA dirty log tracking).

The core idea is that we do not enable hardware dirty tracking at the start (we do not
set the DBM bit). When an arbitrary PT takes a fault, we perform software tracking for
that PT and enable hardware tracking for its *nearby* PTs (e.g. set the DBM bit for the
nearby 64 PTEs, as in the diagram below). Then, when syncing the dirty log, we already
know all PTs that have hardware dirty tracking enabled, so we do not need to scan all PTs.

        mem abort point             mem abort point
              ↓                            ↓
---------------------------------------------------------------
        |********|        |        |********|        |        |
---------------------------------------------------------------
             ↑                            ↑
        set DBM bit of               set DBM bit of
     this PT section (64PTEs)      this PT section (64PTEs)
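
A minimal sketch of the per-fault step (mirroring what patch 7 implements in
kvm_arm_enable_nearby_hwdbm(); the 64-page granule and the per-memslot
hwdbm_bitmap come from that patch):

	/* On a write fault at gfn: the page is logged dirty in software via
	 * the usual memory abort path, then HW dirty (DBM) is enabled for
	 * the whole 64-PTE section containing gfn, and the section is
	 * remembered in a per-memslot bitmap so that dirty log sync only
	 * scans the recorded sections. */
	rel_gfn = gfn - memslot->base_gfn;
	dbm_idx = rel_gfn >> HWDBM_GRANULE_SHIFT;	/* 64 pages per bit */
	if (!test_and_set_bit(dbm_idx, memslot->arch.hwdbm_bitmap)) {
		start_page = dbm_idx << HWDBM_GRANULE_SHIFT;
		npages = min(memslot->npages - start_page,
			     1UL << HWDBM_GRANULE_SHIFT);
		kvm_stage2_set_dbm(kvm, memslot, start_page, npages);
	}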

One may worry that when the dirty rate is very high we still need to scan too many PTs.
Our main concern is the VM stop time. With QEMU dirty rate throttling, the amount of dirty
memory approaches the VM stop threshold, so there are only a few PTs to scan after the VM stops.

This approach has the advantage of hardware tracking, which minimizes the side effect
on vCPUs, and also the advantage of software tracking, which lets the vCPU dirty rate
be controlled. Moreover, software tracking lets us scan PTs at a few fixed points, which
greatly reduces the scanning time. The biggest benefit is that we can also apply this
solution to DMA dirty tracking.

Test:

Host: Kunpeng 920 with 128 CPUs and 512GB RAM. Transparent Hugepage disabled (to ensure the
      test result is not affected by the dissolution of block page tables at the early stage
      of migration).
VM:   16 CPUs, 16GB RAM. Runs 4 pairs of (redis_benchmark + redis_server).

Each configuration was run 5 times, for the software dirty log and the SW/HW combined dirty log.

Test result:

Redis QPS gains a 5%~7% improvement during VM migration.
VM downtime is not fundamentally affected.
About 56.7% of the set DBM bits are effectively used.

Keqian Zhu (7):
  arm64: cpufeature: Add API to report system support of HWDBM
  kvm: arm64: Use atomic operation when update PTE
  kvm: arm64: Add level_apply parameter for stage2_attr_walker
  kvm: arm64: Add some HW_DBM related pgtable interfaces
  kvm: arm64: Add some HW_DBM related mmu interfaces
  kvm: arm64: Only write protect selected PTE
  kvm: arm64: Start up SW/HW combined dirty log

 arch/arm64/include/asm/cpufeature.h  |  12 +++
 arch/arm64/include/asm/kvm_host.h    |   6 ++
 arch/arm64/include/asm/kvm_mmu.h     |   7 ++
 arch/arm64/include/asm/kvm_pgtable.h |  45 ++++++++++
 arch/arm64/kvm/arm.c                 | 125 ++++++++++++++++++++++++++
 arch/arm64/kvm/hyp/pgtable.c         | 130 ++++++++++++++++++++++-----
 arch/arm64/kvm/mmu.c                 |  47 +++++++++-
 arch/arm64/kvm/reset.c               |   8 +-
 8 files changed, 351 insertions(+), 29 deletions(-)

-- 
2.19.1



* [RFC PATCH 1/7] arm64: cpufeature: Add API to report system support of HWDBM
  2021-01-26 12:44 [RFC PATCH 0/7] kvm: arm64: Implement SW/HW combined dirty log Keqian Zhu
@ 2021-01-26 12:44 ` Keqian Zhu
  2021-01-26 12:44 ` [RFC PATCH 2/7] kvm: arm64: Use atomic operation when update PTE Keqian Zhu
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Keqian Zhu @ 2021-01-26 12:44 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, Marc Zyngier,
	Will Deacon, Catalin Marinas
  Cc: Alex Williamson, Kirti Wankhede, Cornelia Huck, Mark Rutland,
	James Morse, Robin Murphy, Suzuki K Poulose, wanghaibin.wang,
	jiangkunkun, xiexiangyou, zhengchuan, yubihong

Though we already have a CPU capability named ARM64_HW_DBM, it is a
LOCAL_CPU cap and is conditionally compiled under CONFIG_ARM64_HW_AFDBM.

This adds an API that reports the system-wide support of HW_DBM.
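
Callers in this series use it to gate all HW_DBM bookkeeping; for example,
kvm_arm_init_hwdbm_bitmap() in patch 7 starts with:

	if (!system_supports_hw_dbm())
		return 0;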

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 arch/arm64/include/asm/cpufeature.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 9a555809b89c..dfded86c7684 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -664,6 +664,18 @@ static inline bool system_supports_mixed_endian(void)
 	return val == 0x1;
 }
 
+static inline bool system_supports_hw_dbm(void)
+{
+	u64 mmfr1;
+	u32 val;
+
+	mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
+	val = cpuid_feature_extract_unsigned_field(mmfr1,
+						ID_AA64MMFR1_HADBS_SHIFT);
+
+	return val == 0x2;
+}
+
 static __always_inline bool system_supports_fpsimd(void)
 {
 	return !cpus_have_const_cap(ARM64_HAS_NO_FPSIMD);
-- 
2.19.1



* [RFC PATCH 2/7] kvm: arm64: Use atomic operation when update PTE
  2021-01-26 12:44 [RFC PATCH 0/7] kvm: arm64: Implement SW/HW combined dirty log Keqian Zhu
  2021-01-26 12:44 ` [RFC PATCH 1/7] arm64: cpufeature: Add API to report system support of HWDBM Keqian Zhu
@ 2021-01-26 12:44 ` Keqian Zhu
  2021-01-26 12:44 ` [RFC PATCH 3/7] kvm: arm64: Add level_apply parameter for stage2_attr_walker Keqian Zhu
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Keqian Zhu @ 2021-01-26 12:44 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, Marc Zyngier,
	Will Deacon, Catalin Marinas
  Cc: Alex Williamson, Kirti Wankhede, Cornelia Huck, Mark Rutland,
	James Morse, Robin Murphy, Suzuki K Poulose, wanghaibin.wang,
	jiangkunkun, xiexiangyou, zhengchuan, yubihong

We are about to add HW_DBM support for the stage2 dirty log, so software
updates of a PTE may race with the MMU trying to set the access flag or
dirty state.

Use atomic operations to avoid reverting these bits set by the MMU.
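
For illustration only, here is a userspace C11 analogue of the retry loop added
below as kvm_update_pte() (not kernel code; the kernel version uses
cmpxchg_relaxed()). The compare-and-exchange retry preserves any access/dirty
bits the MMU sets concurrently, which a plain read-modify-write store could
silently revert:

	#include <stdatomic.h>
	#include <stdint.h>

	typedef uint64_t pte_t;

	static pte_t update_pte(_Atomic pte_t *ptep, pte_t set, pte_t clr)
	{
		/* Load the current PTE, then retry until our modified value
		 * is installed on top of whatever is currently there. */
		pte_t old = atomic_load_explicit(ptep, memory_order_relaxed);
		pte_t new;

		do {
			new = (old & ~clr) | set;
			if (new == old)
				break;
			/* On CAS failure, 'old' is refreshed with the current value. */
		} while (!atomic_compare_exchange_weak_explicit(ptep, &old, new,
				memory_order_relaxed, memory_order_relaxed));

		return old;	/* the original PTE, as kvm_update_pte() returns */
	}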

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 41 ++++++++++++++++++++++++------------
 1 file changed, 27 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index bdf8e55ed308..4915ba35f93b 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -153,10 +153,34 @@ static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte)
 	return __va(kvm_pte_to_phys(pte));
 }
 
+/*
+ * We may race with the MMU trying to set the access flag or dirty state;
+ * use atomic operations to avoid reverting these bits.
+ *
+ * Return the original PTE.
+ */
+static kvm_pte_t kvm_update_pte(kvm_pte_t *ptep, kvm_pte_t bit_set,
+				kvm_pte_t bit_clr)
+{
+	kvm_pte_t old_pte, pte = *ptep;
+
+	do {
+		old_pte = pte;
+		pte &= ~bit_clr;
+		pte |= bit_set;
+
+		if (old_pte == pte)
+			break;
+
+		pte = cmpxchg_relaxed(ptep, old_pte, pte);
+	} while (pte != old_pte);
+
+	return old_pte;
+}
+
 static void kvm_set_invalid_pte(kvm_pte_t *ptep)
 {
-	kvm_pte_t pte = *ptep;
-	WRITE_ONCE(*ptep, pte & ~KVM_PTE_VALID);
+	kvm_update_pte(ptep, 0, KVM_PTE_VALID);
 }
 
 static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t *childp)
@@ -723,18 +747,7 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 		return 0;
 
 	data->level = level;
-	data->pte = pte;
-	pte &= ~data->attr_clr;
-	pte |= data->attr_set;
-
-	/*
-	 * We may race with the CPU trying to set the access flag here,
-	 * but worst-case the access flag update gets lost and will be
-	 * set on the next access instead.
-	 */
-	if (data->pte != pte)
-		WRITE_ONCE(*ptep, pte);
-
+	data->pte = kvm_update_pte(ptep, data->attr_set, data->attr_clr);
 	return 0;
 }
 
-- 
2.19.1



* [RFC PATCH 3/7] kvm: arm64: Add level_apply parameter for stage2_attr_walker
  2021-01-26 12:44 [RFC PATCH 0/7] kvm: arm64: Implement SW/HW combined dirty log Keqian Zhu
  2021-01-26 12:44 ` [RFC PATCH 1/7] arm64: cpufeature: Add API to report system support of HWDBM Keqian Zhu
  2021-01-26 12:44 ` [RFC PATCH 2/7] kvm: arm64: Use atomic operation when update PTE Keqian Zhu
@ 2021-01-26 12:44 ` Keqian Zhu
  2021-01-26 12:44 ` [RFC PATCH 4/7] kvm: arm64: Add some HW_DBM related pgtable interfaces Keqian Zhu
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Keqian Zhu @ 2021-01-26 12:44 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, Marc Zyngier,
	Will Deacon, Catalin Marinas
  Cc: Alex Williamson, Kirti Wankhede, Cornelia Huck, Mark Rutland,
	James Morse, Robin Murphy, Suzuki K Poulose, wanghaibin.wang,
	jiangkunkun, xiexiangyou, zhengchuan, yubihong

In order to change PTEs only at specific levels, the level_apply
parameter can be used as a level mask.

This introduces no functional change for the current code.
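
For instance, patch 4 uses the mask to restrict a DBM update to last-level
(level 3) entries only, while existing callers pass -1 to keep applying the
update at every level:

	/* only level-3 leaf PTEs are touched */
	stage2_update_leaf_attrs(pgt, addr, size,
				 KVM_PTE_LEAF_ATTR_HI_S2_DBM, 0, BIT(3),
				 NULL, NULL);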

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 4915ba35f93b..0f8a319f16fe 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -734,6 +734,7 @@ struct stage2_attr_data {
 	kvm_pte_t	attr_clr;
 	kvm_pte_t	pte;
 	u32		level;
+	u32		level_apply;
 };
 
 static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
@@ -743,6 +744,9 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	kvm_pte_t pte = *ptep;
 	struct stage2_attr_data *data = arg;
 
+	if (!(data->level_apply & BIT(level)))
+		return 0;
+
 	if (!kvm_pte_valid(pte))
 		return 0;
 
@@ -753,14 +757,15 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 
 static int stage2_update_leaf_attrs(struct kvm_pgtable *pgt, u64 addr,
 				    u64 size, kvm_pte_t attr_set,
-				    kvm_pte_t attr_clr, kvm_pte_t *orig_pte,
-				    u32 *level)
+				    kvm_pte_t attr_clr, u32 level_apply,
+				    kvm_pte_t *orig_pte, u32 *level)
 {
 	int ret;
 	kvm_pte_t attr_mask = KVM_PTE_LEAF_ATTR_LO | KVM_PTE_LEAF_ATTR_HI;
 	struct stage2_attr_data data = {
 		.attr_set	= attr_set & attr_mask,
 		.attr_clr	= attr_clr & attr_mask,
+		.level_apply	= level_apply,
 	};
 	struct kvm_pgtable_walker walker = {
 		.cb		= stage2_attr_walker,
@@ -783,7 +788,7 @@ static int stage2_update_leaf_attrs(struct kvm_pgtable *pgt, u64 addr,
 int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size)
 {
 	return stage2_update_leaf_attrs(pgt, addr, size, 0,
-					KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W,
+					KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W, -1,
 					NULL, NULL);
 }
 
@@ -791,7 +796,7 @@ kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr)
 {
 	kvm_pte_t pte = 0;
 	stage2_update_leaf_attrs(pgt, addr, 1, KVM_PTE_LEAF_ATTR_LO_S2_AF, 0,
-				 &pte, NULL);
+				 -1, &pte, NULL);
 	dsb(ishst);
 	return pte;
 }
@@ -800,7 +805,7 @@ kvm_pte_t kvm_pgtable_stage2_mkold(struct kvm_pgtable *pgt, u64 addr)
 {
 	kvm_pte_t pte = 0;
 	stage2_update_leaf_attrs(pgt, addr, 1, 0, KVM_PTE_LEAF_ATTR_LO_S2_AF,
-				 &pte, NULL);
+				 -1, &pte, NULL);
 	/*
 	 * "But where's the TLBI?!", you scream.
 	 * "Over in the core code", I sigh.
@@ -813,7 +818,7 @@ kvm_pte_t kvm_pgtable_stage2_mkold(struct kvm_pgtable *pgt, u64 addr)
 bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr)
 {
 	kvm_pte_t pte = 0;
-	stage2_update_leaf_attrs(pgt, addr, 1, 0, 0, &pte, NULL);
+	stage2_update_leaf_attrs(pgt, addr, 1, 0, 0, -1, &pte, NULL);
 	return pte & KVM_PTE_LEAF_ATTR_LO_S2_AF;
 }
 
@@ -833,7 +838,7 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
 	if (prot & KVM_PGTABLE_PROT_X)
 		clr |= KVM_PTE_LEAF_ATTR_HI_S2_XN;
 
-	ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, NULL, &level);
+	ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, -1, NULL, &level);
 	if (!ret)
 		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, pgt->mmu, addr, level);
 	return ret;
-- 
2.19.1



* [RFC PATCH 4/7] kvm: arm64: Add some HW_DBM related pgtable interfaces
  2021-01-26 12:44 [RFC PATCH 0/7] kvm: arm64: Implement SW/HW combined dirty log Keqian Zhu
                   ` (2 preceding siblings ...)
  2021-01-26 12:44 ` [RFC PATCH 3/7] kvm: arm64: Add level_apply parameter for stage2_attr_walker Keqian Zhu
@ 2021-01-26 12:44 ` Keqian Zhu
  2021-01-26 12:44 ` [RFC PATCH 5/7] kvm: arm64: Add some HW_DBM related mmu interfaces Keqian Zhu
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Keqian Zhu @ 2021-01-26 12:44 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, Marc Zyngier,
	Will Deacon, Catalin Marinas
  Cc: Alex Williamson, Kirti Wankhede, Cornelia Huck, Mark Rutland,
	James Morse, Robin Murphy, Suzuki K Poulose, wanghaibin.wang,
	jiangkunkun, xiexiangyou, zhengchuan, yubihong

This adds set_dbm, clear_dbm and sync_dirty interfaces to the pgtable layer.
(1) set_dbm: Set the DBM bit for the last-level PTEs of a specified range;
TLB invalidation is performed. (2) clear_dbm: Clear the DBM bit for the
last-level PTEs of a specified range; TLB invalidation is not performed.
(3) sync_dirty: Scan the last-level PTEs of a specified range and log a
page as dirty if its PTE is writable.

Besides, save the dirty state of a PTE if it is invalidated by map or
unmap.
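
A sketch of how the three interfaces are meant to be combined at dirty-log
sync time (patch 7 does this, via the mmu-layer wrappers, in
kvm_arch_sync_dirty_log()):

	/* For each range whose DBM was enabled since the last sync: */
	kvm_pgtable_stage2_clear_dbm(pgt, addr, size);	/* stop HW dirtying; no TLBI yet */
	/* ... once every such range has been cleared ... */
	kvm_flush_remote_tlbs(kvm);			/* make the cleared DBM bits visible */
	kvm_pgtable_stage2_sync_dirty(pgt, addr, size);	/* writable PTE => mark page dirty */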

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 arch/arm64/include/asm/kvm_pgtable.h | 45 ++++++++++++++++++
 arch/arm64/kvm/hyp/pgtable.c         | 70 ++++++++++++++++++++++++++++
 2 files changed, 115 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 52ab38db04c7..8984b5227cfc 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -204,6 +204,51 @@ int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
  */
 int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size);
 
+/**
+ * kvm_pgtable_stage2_clear_dbm() - Clear DBM of guest stage-2 address range
+ *                                  without TLB invalidation (only last level).
+ * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init().
+ * @addr:	Intermediate physical address from which to clear DBM.
+ * @size:	Size of the range.
+ *
+ * The offset of @addr within a page is ignored and @size is rounded-up to
+ * the next page boundary.
+ *
+ * Note that it is the caller's responsibility to invalidate the TLB after
+ * calling this function, to ensure that the disabling of HW dirty tracking
+ * is visible to the CPUs.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int kvm_pgtable_stage2_clear_dbm(struct kvm_pgtable *pgt, u64 addr, u64 size);
+
+/**
+ * kvm_pgtable_stage2_set_dbm() - Set DBM of guest stage-2 address range to
+ *                                enable HW dirty (only last level).
+ * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init().
+ * @addr:	Intermediate physical address from which to set DBM.
+ * @size:	Size of the range.
+ *
+ * The offset of @addr within a page is ignored and @size is rounded-up to
+ * the next page boundary.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int kvm_pgtable_stage2_set_dbm(struct kvm_pgtable *pgt, u64 addr, u64 size);
+
+/**
+ * kvm_pgtable_stage2_sync_dirty() - Sync HW dirty state into memslot.
+ * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init().
+ * @addr:	Intermediate physical address from which to sync.
+ * @size:	Size of the range.
+ *
+ * The offset of @addr within a page is ignored and @size is rounded-up to
+ * the next page boundary.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int kvm_pgtable_stage2_sync_dirty(struct kvm_pgtable *pgt, u64 addr, u64 size);
+
 /**
  * kvm_pgtable_stage2_mkyoung() - Set the access flag in a page-table entry.
  * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init().
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0f8a319f16fe..b6f0d2f3aee4 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -43,6 +43,7 @@
 
 #define KVM_PTE_LEAF_ATTR_HI_S1_XN	BIT(54)
 
+#define KVM_PTE_LEAF_ATTR_HI_S2_DBM	BIT(51)
 #define KVM_PTE_LEAF_ATTR_HI_S2_XN	BIT(54)
 
 struct kvm_pgtable_walk_data {
@@ -485,6 +486,11 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
 	return 0;
 }
 
+static bool stage2_pte_writable(kvm_pte_t pte)
+{
+	return pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
+}
+
 static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 				       kvm_pte_t *ptep,
 				       struct stage2_map_data *data)
@@ -509,6 +515,11 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 	/* There's an existing valid leaf entry, so perform break-before-make */
 	kvm_set_invalid_pte(ptep);
 	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
+
+	/* Save the possible hardware dirty info */
+	if ((level == KVM_PGTABLE_MAX_LEVELS - 1) && stage2_pte_writable(*ptep))
+		mark_page_dirty(data->mmu->kvm, addr >> PAGE_SHIFT);
+
 	kvm_set_valid_leaf_pte(ptep, phys, data->attr, level);
 out:
 	data->phys += granule;
@@ -547,6 +558,10 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 		if (kvm_pte_valid(pte))
 			put_page(page);
 
+		/*
+		 * HW DBM does not take effect during page merging, so we don't
+		 * need to save possible hardware dirty info here.
+		 */
 		return 0;
 	}
 
@@ -707,6 +722,10 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, addr, level);
 	put_page(virt_to_page(ptep));
 
+	/* Save the possible hardware dirty info */
+	if ((level == KVM_PGTABLE_MAX_LEVELS - 1) && stage2_pte_writable(*ptep))
+		mark_page_dirty(mmu->kvm, addr >> PAGE_SHIFT);
+
 	if (need_flush) {
 		stage2_flush_dcache(kvm_pte_follow(pte),
 				    kvm_granule_size(level));
@@ -792,6 +811,30 @@ int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size)
 					NULL, NULL);
 }
 
+int kvm_pgtable_stage2_set_dbm(struct kvm_pgtable *pgt, u64 addr, u64 size)
+{
+	int ret;
+	u64 offset;
+
+	ret = stage2_update_leaf_attrs(pgt, addr, size,
+				       KVM_PTE_LEAF_ATTR_HI_S2_DBM, 0, BIT(3),
+				       NULL, NULL);
+	if (ret)
+		return ret;
+
+	for (offset = 0; offset < size; offset += PAGE_SIZE)
+		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, pgt->mmu, addr + offset, 3);
+
+	return 0;
+}
+
+int kvm_pgtable_stage2_clear_dbm(struct kvm_pgtable *pgt, u64 addr, u64 size)
+{
+	return stage2_update_leaf_attrs(pgt, addr, size,
+					0, KVM_PTE_LEAF_ATTR_HI_S2_DBM, BIT(3),
+					NULL, NULL);
+}
+
 kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr)
 {
 	kvm_pte_t pte = 0;
@@ -844,6 +887,33 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
 	return ret;
 }
 
+static int stage2_sync_dirty_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
+				    enum kvm_pgtable_walk_flags flag,
+				    void * const arg)
+{
+	kvm_pte_t pte = *ptep;
+	struct kvm *kvm = arg;
+
+	if (!kvm_pte_valid(pte))
+		return 0;
+
+	if ((level == KVM_PGTABLE_MAX_LEVELS - 1) && stage2_pte_writable(pte))
+		mark_page_dirty(kvm, addr >> PAGE_SHIFT);
+
+	return 0;
+}
+
+int kvm_pgtable_stage2_sync_dirty(struct kvm_pgtable *pgt, u64 addr, u64 size)
+{
+	struct kvm_pgtable_walker walker = {
+		.cb	= stage2_sync_dirty_walker,
+		.flags	= KVM_PGTABLE_WALK_LEAF,
+		.arg	= pgt->mmu->kvm,
+	};
+
+	return kvm_pgtable_walk(pgt, addr, size, &walker);
+}
+
 static int stage2_flush_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 			       enum kvm_pgtable_walk_flags flag,
 			       void * const arg)
-- 
2.19.1



* [RFC PATCH 5/7] kvm: arm64: Add some HW_DBM related mmu interfaces
  2021-01-26 12:44 [RFC PATCH 0/7] kvm: arm64: Implement SW/HW combined dirty log Keqian Zhu
                   ` (3 preceding siblings ...)
  2021-01-26 12:44 ` [RFC PATCH 4/7] kvm: arm64: Add some HW_DBM related pgtable interfaces Keqian Zhu
@ 2021-01-26 12:44 ` Keqian Zhu
  2021-01-26 12:44 ` [RFC PATCH 6/7] kvm: arm64: Only write protect selected PTE Keqian Zhu
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Keqian Zhu @ 2021-01-26 12:44 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, Marc Zyngier,
	Will Deacon, Catalin Marinas
  Cc: Alex Williamson, Kirti Wankhede, Cornelia Huck, Mark Rutland,
	James Morse, Robin Murphy, Suzuki K Poulose, wanghaibin.wang,
	jiangkunkun, xiexiangyou, zhengchuan, yubihong

This adds set_dbm, clear_dbm and sync_dirty interfaces to the mmu layer.
They simply wrap the corresponding pgtable-layer interfaces, adding
reschedule semantics.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 arch/arm64/include/asm/kvm_mmu.h |  7 +++++++
 arch/arm64/kvm/mmu.c             | 30 ++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index e52d82aeadca..51dd31ba1b94 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -183,6 +183,13 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
 			     void **haddr);
 void free_hyp_pgds(void);
 
+void kvm_stage2_clear_dbm(struct kvm *kvm, struct kvm_memory_slot *slot,
+		gfn_t gfn_offset, unsigned long npages);
+void kvm_stage2_set_dbm(struct kvm *kvm, struct kvm_memory_slot *slot,
+		gfn_t gfn_offset, unsigned long npages);
+void kvm_stage2_sync_dirty(struct kvm *kvm, struct kvm_memory_slot *slot,
+		gfn_t gfn_offset, unsigned long npages);
+
 void stage2_unmap_vm(struct kvm *kvm);
 int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu);
 void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7d2257cc5438..18717fd12731 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -609,6 +609,36 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
+void kvm_stage2_clear_dbm(struct kvm *kvm, struct kvm_memory_slot *slot,
+		gfn_t gfn_offset, unsigned long npages)
+{
+	phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
+	phys_addr_t addr = base_gfn << PAGE_SHIFT;
+	phys_addr_t end = (base_gfn + npages) << PAGE_SHIFT;
+
+	stage2_apply_range_resched(kvm, addr, end, kvm_pgtable_stage2_clear_dbm);
+}
+
+void kvm_stage2_set_dbm(struct kvm *kvm, struct kvm_memory_slot *slot,
+		gfn_t gfn_offset, unsigned long npages)
+{
+	phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
+	phys_addr_t addr = base_gfn << PAGE_SHIFT;
+	phys_addr_t end = (base_gfn + npages) << PAGE_SHIFT;
+
+	stage2_apply_range_resched(kvm, addr, end, kvm_pgtable_stage2_set_dbm);
+}
+
+void kvm_stage2_sync_dirty(struct kvm *kvm, struct kvm_memory_slot *slot,
+		gfn_t gfn_offset, unsigned long npages)
+{
+	phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
+	phys_addr_t addr = base_gfn << PAGE_SHIFT;
+	phys_addr_t end = (base_gfn + npages) << PAGE_SHIFT;
+
+	stage2_apply_range_resched(kvm, addr, end, kvm_pgtable_stage2_sync_dirty);
+}
+
 static void clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
 {
 	__clean_dcache_guest_page(pfn, size);
-- 
2.19.1



* [RFC PATCH 6/7] kvm: arm64: Only write protect selected PTE
  2021-01-26 12:44 [RFC PATCH 0/7] kvm: arm64: Implement SW/HW combined dirty log Keqian Zhu
                   ` (4 preceding siblings ...)
  2021-01-26 12:44 ` [RFC PATCH 5/7] kvm: arm64: Add some HW_DBM related mmu interfaces Keqian Zhu
@ 2021-01-26 12:44 ` Keqian Zhu
  2021-01-26 12:44 ` [RFC PATCH 7/7] kvm: arm64: Start up SW/HW combined dirty log Keqian Zhu
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Keqian Zhu @ 2021-01-26 12:44 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, Marc Zyngier,
	Will Deacon, Catalin Marinas
  Cc: Alex Williamson, Kirti Wankhede, Cornelia Huck, Mark Rutland,
	James Morse, Robin Murphy, Suzuki K Poulose, wanghaibin.wang,
	jiangkunkun, xiexiangyou, zhengchuan, yubihong

This function write protects all PTEs between the ffs and fls of mask,
but there may be unset bits within this range. That works fine under a
pure software dirty log, as software dirty logging is not active during
this process.

However, it will unexpectedly clear the dirty status of PTEs when the
hardware dirty log is enabled. So change it to only write protect the
selected PTEs.
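
As a small worked example (gfn arithmetic only; the real code shifts by
PAGE_SHIFT): with mask = 0b101, pages 0 and 2 are dirty and page 1 is clean.
The old code write protects base_gfn + 0 .. base_gfn + 2 in one range, so a
page 1 whose dirty state is held only in its writable bit (hardware dirty log)
silently loses that state. Iterating over set regions touches only pages 0 and 2:

	/* old: one range that also covers the clean page in the middle */
	stage2_wp_range(&kvm->arch.mmu, base_gfn + 0, base_gfn + 3);

	/* new: one range per set region, skipping page 1 */
	stage2_wp_range(&kvm->arch.mmu, base_gfn + 0, base_gfn + 1);
	stage2_wp_range(&kvm->arch.mmu, base_gfn + 2, base_gfn + 3);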

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 arch/arm64/kvm/mmu.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 18717fd12731..2f8c6770a4dc 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -589,10 +589,14 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 		gfn_t gfn_offset, unsigned long mask)
 {
 	phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
-	phys_addr_t start = (base_gfn +  __ffs(mask)) << PAGE_SHIFT;
-	phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
+	phys_addr_t start, end;
+	int rs, re;
 
-	stage2_wp_range(&kvm->arch.mmu, start, end);
+	bitmap_for_each_set_region(&mask, rs, re, 0, BITS_PER_LONG) {
+		start = (base_gfn + rs) << PAGE_SHIFT;
+		end = (base_gfn + re) << PAGE_SHIFT;
+		stage2_wp_range(&kvm->arch.mmu, start, end);
+	}
 }
 
 /*
-- 
2.19.1



* [RFC PATCH 7/7] kvm: arm64: Start up SW/HW combined dirty log
  2021-01-26 12:44 [RFC PATCH 0/7] kvm: arm64: Implement SW/HW combined dirty log Keqian Zhu
                   ` (5 preceding siblings ...)
  2021-01-26 12:44 ` [RFC PATCH 6/7] kvm: arm64: Only write protect selected PTE Keqian Zhu
@ 2021-01-26 12:44 ` Keqian Zhu
  2021-02-01 13:12 ` [RFC PATCH 0/7] kvm: arm64: Implement " Keqian Zhu
  2021-03-02 11:23 ` Keqian Zhu
  8 siblings, 0 replies; 12+ messages in thread
From: Keqian Zhu @ 2021-01-26 12:44 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, Marc Zyngier,
	Will Deacon, Catalin Marinas
  Cc: Alex Williamson, Kirti Wankhede, Cornelia Huck, Mark Rutland,
	James Morse, Robin Murphy, Suzuki K Poulose, wanghaibin.wang,
	jiangkunkun, xiexiangyou, zhengchuan, yubihong

We do not enable hardware dirty tracking at the start (we do not set the
DBM bit). When an arbitrary PT takes a fault, we perform software tracking
for that PT and enable hardware tracking for its nearby PTs (set the DBM
bit for the nearby 64 PTEs). Then, when syncing the dirty log, we already
know all PTs that have hardware dirty tracking enabled, so we do not need
to scan all PTs.
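
As a worked example of the bookkeeping cost (assuming 4KiB pages): a 16GiB
memslot covers 4,194,304 pages; at 64 pages per bitmap bit that is 65,536
bits, i.e. 8KiB per bitmap, or 16KiB per memslot counting the second bitmap
that kvm_arch_sync_dirty_log() uses as scratch space.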

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 arch/arm64/include/asm/kvm_host.h |   6 ++
 arch/arm64/kvm/arm.c              | 125 ++++++++++++++++++++++++++++++
 arch/arm64/kvm/mmu.c              |   7 +-
 arch/arm64/kvm/reset.c            |   8 +-
 4 files changed, 141 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 8fcfab0c2567..e9ea5b546326 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -99,6 +99,8 @@ struct kvm_s2_mmu {
 };
 
 struct kvm_arch_memory_slot {
+	#define HWDBM_GRANULE_SHIFT 6  /* 64 pages per bit */
+	unsigned long *hwdbm_bitmap;
 };
 
 struct kvm_arch {
@@ -565,6 +567,10 @@ struct kvm_vcpu_stat {
 	u64 exits;
 };
 
+int kvm_arm_init_hwdbm_bitmap(struct kvm_memory_slot *memslot);
+void kvm_arm_destroy_hwdbm_bitmap(struct kvm_memory_slot *memslot);
+void kvm_arm_enable_nearby_hwdbm(struct kvm *kvm, gfn_t gfn);
+
 int kvm_vcpu_preferred_target(struct kvm_vcpu_init *init);
 unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
 int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 04c44853b103..9e05d45fa6be 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1257,9 +1257,134 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 	return r;
 }
 
+static unsigned long kvm_hwdbm_bitmap_bytes(struct kvm_memory_slot *memslot)
+{
+	unsigned long nbits = DIV_ROUND_UP(memslot->npages, 1 << HWDBM_GRANULE_SHIFT);
+
+	return ALIGN(nbits, BITS_PER_LONG) / 8;
+}
+
+static unsigned long *kvm_second_hwdbm_bitmap(struct kvm_memory_slot *memslot)
+{
+	unsigned long len = kvm_hwdbm_bitmap_bytes(memslot);
+
+	return (void *)memslot->arch.hwdbm_bitmap + len;
+}
+
+/*
+ * Allocate twice the space. Refer to kvm_arch_sync_dirty_log() to see why
+ * the second bitmap is needed.
+ */
+int kvm_arm_init_hwdbm_bitmap(struct kvm_memory_slot *memslot)
+{
+	unsigned long bytes = 2 * kvm_hwdbm_bitmap_bytes(memslot);
+
+	if (!system_supports_hw_dbm())
+		return 0;
+
+	if (memslot->arch.hwdbm_bitmap) {
+		/* Inherited from old memslot */
+		bitmap_clear(memslot->arch.hwdbm_bitmap, 0, bytes * 8);
+	} else {
+		memslot->arch.hwdbm_bitmap = kvzalloc(bytes, GFP_KERNEL_ACCOUNT);
+		if (!memslot->arch.hwdbm_bitmap)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+void kvm_arm_destroy_hwdbm_bitmap(struct kvm_memory_slot *memslot)
+{
+	if (!memslot->arch.hwdbm_bitmap)
+		return;
+
+	kvfree(memslot->arch.hwdbm_bitmap);
+	memslot->arch.hwdbm_bitmap = NULL;
+}
+
+/* Add DBM for nearby pagetables but do not cross the memslot boundary */
+void kvm_arm_enable_nearby_hwdbm(struct kvm *kvm, gfn_t gfn)
+{
+	struct kvm_memory_slot *memslot;
+
+	memslot = gfn_to_memslot(kvm, gfn);
+	if (memslot && kvm_slot_dirty_track_enabled(memslot) &&
+	    memslot->arch.hwdbm_bitmap) {
+		unsigned long rel_gfn = gfn - memslot->base_gfn;
+		unsigned long dbm_idx = rel_gfn >> HWDBM_GRANULE_SHIFT;
+		unsigned long start_page, npages;
+
+		if (!test_and_set_bit(dbm_idx, memslot->arch.hwdbm_bitmap)) {
+			start_page = dbm_idx << HWDBM_GRANULE_SHIFT;
+			npages = 1 << HWDBM_GRANULE_SHIFT;
+			npages = min(memslot->npages - start_page, npages);
+			kvm_stage2_set_dbm(kvm, memslot, start_page, npages);
+		}
+	}
+}
+
+/*
+ * We have to find a place to clear hwdbm_bitmap, and clearing it means
+ * clearing the DBM bit of all related pgtables. Note that between clearing
+ * the DBM bit and flushing the TLB, HW dirty logging may still occur, so we
+ * must scan all related pgtables after the TLB flush. Given the above, the
+ * best choice is to clear hwdbm_bitmap before syncing the HW dirty log.
+ */
 void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
 {
+	unsigned long *second_bitmap = kvm_second_hwdbm_bitmap(memslot);
+	unsigned long start_page, npages;
+	unsigned int end, rs, re;
+	bool has_hwdbm = false;
+
+	if (!memslot->arch.hwdbm_bitmap)
+		return;
 
+	end = kvm_hwdbm_bitmap_bytes(memslot) * 8;
+	bitmap_clear(second_bitmap, 0, end);
+
+	spin_lock(&kvm->mmu_lock);
+	bitmap_for_each_set_region(memslot->arch.hwdbm_bitmap, rs, re, 0, end) {
+		has_hwdbm = true;
+
+		/*
+		 * Must clear the bitmap before clearing the DBM bits. While we
+		 * clear DBM (the mmu spinlock is released periodically), SW dirty
+		 * tracking has a chance to set DBM bits that overlap what we are
+		 * clearing. So if we cleared the bitmap after clearing DBM, we
+		 * could end up with the bitmap cleared but DBM bits left behind,
+		 * and we would never get a chance to scan those pgtables again.
+		 */
+		bitmap_clear(memslot->arch.hwdbm_bitmap, rs, re - rs);
+
+		/* Record the bitmap cleared */
+		bitmap_set(second_bitmap, rs, re - rs);
+
+		start_page = rs << HWDBM_GRANULE_SHIFT;
+		npages = (re - rs) << HWDBM_GRANULE_SHIFT;
+		npages = min(memslot->npages - start_page, npages);
+		kvm_stage2_clear_dbm(kvm, memslot, start_page, npages);
+	}
+	spin_unlock(&kvm->mmu_lock);
+
+	if (!has_hwdbm)
+		return;
+
+	/*
+	 * Ensure that vCPU writes which occur after we clear hwdbm_bitmap can
+	 * be caught by the guest memory abort handler.
+	 */
+	kvm_flush_remote_tlbs(kvm);
+
+	spin_lock(&kvm->mmu_lock);
+	bitmap_for_each_set_region(second_bitmap, rs, re, 0, end) {
+		start_page = rs << HWDBM_GRANULE_SHIFT;
+		npages = (re - rs) << HWDBM_GRANULE_SHIFT;
+		npages = min(memslot->npages - start_page, npages);
+		kvm_stage2_sync_dirty(kvm, memslot, start_page, npages);
+	}
+	spin_unlock(&kvm->mmu_lock);
 }
 
 void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 2f8c6770a4dc..1a8702035ddd 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -939,6 +939,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 */
 	if (fault_status == FSC_PERM && vma_pagesize == fault_granule) {
 		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
+
+		/* Placed here because nearby PTEs are likely to be valid */
+		if (!ret && vma_pagesize == PAGE_SIZE && writable)
+			kvm_arm_enable_nearby_hwdbm(kvm, gfn);
 	} else {
 		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
 					     __pfn_to_phys(pfn), prot,
@@ -1407,11 +1411,12 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 	spin_unlock(&kvm->mmu_lock);
 out:
 	mmap_read_unlock(current->mm);
-	return ret;
+	return ret ? : kvm_arm_init_hwdbm_bitmap(memslot);
 }
 
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
+	kvm_arm_destroy_hwdbm_bitmap(slot);
 }
 
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 47f3f035f3ea..231d11009db7 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -376,11 +376,11 @@ int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type)
 	vtcr |= VTCR_EL2_LVLS_TO_SL0(lvls);
 
 	/*
-	 * Enable the Hardware Access Flag management, unconditionally
-	 * on all CPUs. The features is RES0 on CPUs without the support
-	 * and must be ignored by the CPUs.
+	 * Enable the Hardware Access Flag and Dirty State management
+	 * unconditionally on all CPUs. The features are RES0 on CPUs
+	 * without the support and must be ignored by the CPUs.
 	 */
-	vtcr |= VTCR_EL2_HA;
+	vtcr |= VTCR_EL2_HA | VTCR_EL2_HD;
 
 	/* Set the vmid bits */
 	vtcr |= (kvm_get_vmid_bits() == 16) ?
-- 
2.19.1



* Re: [RFC PATCH 0/7] kvm: arm64: Implement SW/HW combined dirty log
  2021-01-26 12:44 [RFC PATCH 0/7] kvm: arm64: Implement SW/HW combined dirty log Keqian Zhu
                   ` (6 preceding siblings ...)
  2021-01-26 12:44 ` [RFC PATCH 7/7] kvm: arm64: Start up SW/HW combined dirty log Keqian Zhu
@ 2021-02-01 13:12 ` Keqian Zhu
  2021-02-01 13:17   ` Marc Zyngier
  2021-03-02 11:23 ` Keqian Zhu
  8 siblings, 1 reply; 12+ messages in thread
From: Keqian Zhu @ 2021-02-01 13:12 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: linux-kernel, linux-arm-kernel, kvm, kvmarm, Will Deacon,
	Catalin Marinas, Alex Williamson, Kirti Wankhede, Cornelia Huck,
	Mark Rutland, James Morse, Robin Murphy, Suzuki K Poulose,
	wanghaibin.wang, jiangkunkun, xiexiangyou, zhengchuan, yubihong

Hi Marc,

Do you have time to have a look at this? Thanks ;-)

Keqian.

On 2021/1/26 20:44, Keqian Zhu wrote:
> [...]


* Re: [RFC PATCH 0/7] kvm: arm64: Implement SW/HW combined dirty log
  2021-02-01 13:12 ` [RFC PATCH 0/7] kvm: arm64: Implement " Keqian Zhu
@ 2021-02-01 13:17   ` Marc Zyngier
  2021-02-01 13:25     ` Keqian Zhu
  0 siblings, 1 reply; 12+ messages in thread
From: Marc Zyngier @ 2021-02-01 13:17 UTC (permalink / raw)
  To: Keqian Zhu
  Cc: linux-kernel, linux-arm-kernel, kvm, kvmarm, Will Deacon,
	Catalin Marinas, Alex Williamson, Kirti Wankhede, Cornelia Huck,
	Mark Rutland, James Morse, Robin Murphy, Suzuki K Poulose,
	wanghaibin.wang, jiangkunkun, xiexiangyou, zhengchuan, yubihong

On 2021-02-01 13:12, Keqian Zhu wrote:
> Hi Marc,
> 
> Do you have time to have a look at this? Thanks ;-)

Not immediately. I'm busy with stuff that is planned to go
in 5.12, which isn't the case for this series. I'll get to
it eventually.

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...


* Re: [RFC PATCH 0/7] kvm: arm64: Implement SW/HW combined dirty log
  2021-02-01 13:17   ` Marc Zyngier
@ 2021-02-01 13:25     ` Keqian Zhu
  0 siblings, 0 replies; 12+ messages in thread
From: Keqian Zhu @ 2021-02-01 13:25 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: linux-kernel, linux-arm-kernel, kvm, kvmarm, Will Deacon,
	Catalin Marinas, Alex Williamson, Kirti Wankhede, Cornelia Huck,
	Mark Rutland, James Morse, Robin Murphy, Suzuki K Poulose,
	wanghaibin.wang, jiangkunkun, xiexiangyou, zhengchuan, yubihong



On 2021/2/1 21:17, Marc Zyngier wrote:
> On 2021-02-01 13:12, Keqian Zhu wrote:
>> Hi Marc,
>>
>> Do you have time to have a look at this? Thanks ;-)
> 
> Not immediately. I'm busy with stuff that is planned to go
> in 5.12, which isn't the case for this series. I'll get to
> it eventually.
> 
> Thanks,
> 
>         M.
Sure, I am not in a hurry. Please concentrate on your urgent work first. ;-) Thanks.

Keqian.


* Re: [RFC PATCH 0/7] kvm: arm64: Implement SW/HW combined dirty log
  2021-01-26 12:44 [RFC PATCH 0/7] kvm: arm64: Implement SW/HW combined dirty log Keqian Zhu
                   ` (7 preceding siblings ...)
  2021-02-01 13:12 ` [RFC PATCH 0/7] kvm: arm64: Implement " Keqian Zhu
@ 2021-03-02 11:23 ` Keqian Zhu
  8 siblings, 0 replies; 12+ messages in thread
From: Keqian Zhu @ 2021-03-02 11:23 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvm, kvmarm, Marc Zyngier,
	Will Deacon, Catalin Marinas
  Cc: Alex Williamson, Kirti Wankhede, Cornelia Huck, Mark Rutland,
	James Morse, Robin Murphy, Suzuki K Poulose, wanghaibin.wang,
	jiangkunkun, xiexiangyou, zhengchuan, yubihong

Hi everyone,

Any comments are welcome :).

Thanks,
Keqian

On 2021/1/26 20:44, Keqian Zhu wrote:
> [...]

