kvmarm.lists.cs.columbia.edu archive mirror
* [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM
@ 2020-07-02 13:55 Keqian Zhu
  2020-07-02 13:55 ` [PATCH v2 1/8] KVM: arm64: Set DBM bit for writable PTEs Keqian Zhu
                   ` (8 more replies)
  0 siblings, 9 replies; 18+ messages in thread
From: Keqian Zhu @ 2020-07-02 13:55 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Sean Christopherson, Steven Price, liangpeng10,
	Alexios Zavras, Mark Brown, Marc Zyngier, Thomas Gleixner,
	Will Deacon, Andrew Morton

This patch series adds support for dirty log tracking based on HW DBM.

It works well under several migration test cases, including VMs with 4K
pages or 2M THP. I checked the SHA256 hash digest of all memory and it
stays the same for the source VM and the destination VM, which means no
dirty page is missed under hardware DBM.

Some key points:

1. Only hardware updates of dirty status for PTEs are supported. PMDs and
   PUDs are not involved for now.

2. About *performance*: In the RFC patch, I mentioned that for every 64GB
   of memory, KVM consumes about 40ms to scan all PTEs to collect the
   dirty log. This series addresses the problem in two ways: HW/SW
   dynamic switch and multi-core offload.

   HW/SW dynamic switch: Give userspace the right to enable/disable hw
   dirty log. This adds a new KVM cap named KVM_CAP_ARM_HW_DIRTY_LOG. We
   achieve this by changing the kvm->arch.vtcr value and kicking vCPUs
   out of guest mode so that they reload this value into VTCR_EL2.
   Userspace can then enable hw dirty log at the beginning of migration
   and disable it when few pages are dirty and the VM is about to be
   stopped, so VM downtime is not affected (a userspace sketch follows
   this list).

   Multi-core offload: Offloading the PT scanning workload to multiple
   cores greatly reduces scanning time. To make sure we complete in time,
   I use smp_call_function to implement this policy, which uses IPIs to
   dispatch the workload to other CPUs. On a 128-core Kunpeng 920
   platform, it takes only about 5ms to scan the PTs of 256GB of RAM
   (using mempress, with almost all PTs populated). And we dispatch the
   workload iteratively (each CPU scans the PTs of only 512M of RAM per
   iteration), so physical CPUs are not seriously affected.

3. About correctness: The DBM bit is only set when a PTE is already
   writable, so we still have read-only PTEs and mechanisms which rely on
   read-only PTs are not broken.

4. About PT modification races: There are two kinds of PT modification.

   The first is adding or clearing a specific bit, such as AF or RW. All
   these operations have been converted to be atomic, to avoid covering
   the dirty status set by hardware.

   The second is replacement, such as PTE unmapping or change. All these
   operations invoke kvm_set_pte in the end. kvm_set_pte has been
   converted to be atomic and we save the dirty status to the underlying
   bitmap if the dirty status is covered.
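
As a rough illustration, below is a minimal userspace sketch of the
intended workflow, assuming the KVM_CAP_ARM_HW_DIRTY_LOG cap and the
args[0] bit layout introduced by patches 6 and 7 of this series (bit 0
enables hw dirty log, bit 1 enables parallel sync); the helper name
set_hw_dirty_log is made up for the example:

	#include <stdbool.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static int set_hw_dirty_log(int vm_fd, bool enable, bool parallel)
	{
		struct kvm_enable_cap cap = {
			.cap = KVM_CAP_ARM_HW_DIRTY_LOG,
			/* bit 0: hw dirty log, bit 1: parallel dirty log sync */
			.args[0] = (enable ? 0x1 : 0) | (parallel ? 0x2 : 0),
		};

		return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
	}

Userspace would call set_hw_dirty_log(vm_fd, true, true) when migration
starts and set_hw_dirty_log(vm_fd, false, false) shortly before stopping
the VM; per patch 6, disabling also triggers a final dirty log sync of
all memslots.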

Change log:

v2:
 - Address Steven's comments.
 - Add support of parallel dirty log sync.
 - Simplify and merge patches of v1.

v1:
 - Address Catalin's comments.

Keqian Zhu (8):
  KVM: arm64: Set DBM bit for writable PTEs
  KVM: arm64: Scan PTEs to sync dirty log
  KVM: arm64: Modify stage2 young mechanism to support hw DBM
  KVM: arm64: Save stage2 PTE dirty status if it is covered
  KVM: arm64: Steply write protect page table by mask bit
  KVM: arm64: Add KVM_CAP_ARM_HW_DIRTY_LOG capability
  KVM: arm64: Sync dirty log parallel
  KVM: Omit dirty log sync in log clear if initially all set

 arch/arm64/include/asm/kvm_host.h |   5 +
 arch/arm64/include/asm/kvm_mmu.h  |  43 ++++-
 arch/arm64/kvm/arm.c              |  45 ++++-
 arch/arm64/kvm/mmu.c              | 307 ++++++++++++++++++++++++++++--
 arch/arm64/kvm/reset.c            |   5 +
 include/uapi/linux/kvm.h          |   1 +
 tools/include/uapi/linux/kvm.h    |   1 +
 virt/kvm/kvm_main.c               |   3 +-
 8 files changed, 389 insertions(+), 21 deletions(-)

-- 
2.19.1

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 1/8] KVM: arm64: Set DBM bit for writable PTEs
  2020-07-02 13:55 [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM Keqian Zhu
@ 2020-07-02 13:55 ` Keqian Zhu
  2020-07-02 13:55 ` [PATCH v2 2/8] KVM: arm64: Scan PTEs to sync dirty log Keqian Zhu
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Keqian Zhu @ 2020-07-02 13:55 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Sean Christopherson, Steven Price, liangpeng10,
	Alexios Zavras, Mark Brown, Marc Zyngier, Thomas Gleixner,
	Will Deacon, Andrew Morton

The DBM bit is used by the MMU to differentiate a genuinely non-writable
page from a page that is only temporarily marked non-writable in order
to track dirty state.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Peng Liang <liangpeng10@huawei.com>
---
 arch/arm64/include/asm/kvm_mmu.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index b12bfc1f051a..2700442b0f75 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -169,6 +169,10 @@ void kvm_clear_hyp_idmap(void);
 static inline pte_t kvm_s2pte_mkwrite(pte_t pte)
 {
 	pte_val(pte) |= PTE_S2_RDWR;
+
+	if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
+		pte_val(pte) |= PTE_DBM;
+
 	return pte;
 }
 
-- 
2.19.1

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 2/8] KVM: arm64: Scan PTEs to sync dirty log
  2020-07-02 13:55 [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM Keqian Zhu
  2020-07-02 13:55 ` [PATCH v2 1/8] KVM: arm64: Set DBM bit for writable PTEs Keqian Zhu
@ 2020-07-02 13:55 ` Keqian Zhu
  2020-07-02 13:55 ` [PATCH v2 3/8] KVM: arm64: Modify stage2 young mechanism to support hw DBM Keqian Zhu
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Keqian Zhu @ 2020-07-02 13:55 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Sean Christopherson, Steven Price, liangpeng10,
	Alexios Zavras, Mark Brown, Marc Zyngier, Thomas Gleixner,
	Will Deacon, Andrew Morton

With hardware management of dirty state, the dirty state is stored in
the PTEs. We have to scan all PTEs to sync the dirty log into the
memslot dirty bitmap.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Peng Liang <liangpeng10@huawei.com>
---
 arch/arm64/include/asm/kvm_host.h |   1 +
 arch/arm64/include/asm/kvm_mmu.h  |  13 +++
 arch/arm64/kvm/arm.c              |   3 +-
 arch/arm64/kvm/mmu.c              | 142 ++++++++++++++++++++++++++++++
 4 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index c3e6fcc664b1..86b9c210ba43 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -480,6 +480,7 @@ u64 __kvm_call_hyp(void *hypfn, ...);
 
 void force_vm_exit(const cpumask_t *mask);
 void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
+void kvm_mmu_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot);
 
 int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
 		int exception_index);
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 2700442b0f75..4c12b7ad8ae8 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -269,6 +269,19 @@ static inline bool kvm_s2pud_young(pud_t pud)
 	return pud_young(pud);
 }
 
+static inline bool arm_mmu_hw_dbm_supported(void)
+{
+	u8 hadbs = (read_sysreg(id_aa64mmfr1_el1) >>
+		ID_AA64MMFR1_HADBS_SHIFT) & 0xf;
+
+	return hadbs == 0x2;
+}
+
+static inline bool kvm_mmu_hw_dbm_enabled(struct kvm *kvm)
+{
+	return arm_mmu_hw_dbm_supported() && !!(kvm->arch.vtcr & VTCR_EL2_HD);
+}
+
 #define hyp_pte_table_empty(ptep) kvm_page_empty(ptep)
 
 #ifdef __PAGETABLE_PMD_FOLDED
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 90cb90561446..fefa5406e037 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1199,7 +1199,8 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 
 void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
 {
-
+	if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM) && kvm_mmu_hw_dbm_enabled(kvm))
+		kvm_mmu_sync_dirty_log(kvm, memslot);
 }
 
 void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8c0035cab6b6..b3cb8b6da4c2 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2411,6 +2411,148 @@ int kvm_mmu_init(void)
 	return err;
 }
 
+#ifdef CONFIG_ARM64_HW_AFDBM
+/**
+ * stage2_sync_dirty_log_ptes() - synchronize dirty log from PMD range
+ * @kvm:	The KVM pointer
+ * @pmd:	pointer to pmd entry
+ * @addr:	range start address
+ * @end:	range end address
+ */
+static void stage2_sync_dirty_log_ptes(struct kvm *kvm, pmd_t *pmd,
+				       phys_addr_t addr, phys_addr_t end)
+{
+	pte_t *pte;
+
+	pte = pte_offset_kernel(pmd, addr);
+	do {
+		if (!pte_none(*pte) && !kvm_s2pte_readonly(pte))
+			mark_page_dirty(kvm, addr >> PAGE_SHIFT);
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+}
+
+/**
+ * stage2_sync_dirty_log_pmds() - synchronize dirty log from PUD range
+ * @kvm:	The KVM pointer
+ * @pud:	pointer to pud entry
+ * @addr:	range start address
+ * @end:	range end address
+ */
+static void stage2_sync_dirty_log_pmds(struct kvm *kvm, pud_t *pud,
+				       phys_addr_t addr, phys_addr_t end)
+{
+	pmd_t *pmd;
+	phys_addr_t next;
+
+	pmd = stage2_pmd_offset(kvm, pud, addr);
+	do {
+		next = stage2_pmd_addr_end(kvm, addr, end);
+		if (!pmd_none(*pmd) && !pmd_thp_or_huge(*pmd))
+			stage2_sync_dirty_log_ptes(kvm, pmd, addr, next);
+	} while (pmd++, addr = next, addr != end);
+}
+
+/**
+ * stage2_sync_dirty_log_puds() - synchronize dirty log from P4D range
+ * @kvm:	The KVM pointer
+ * @p4d:	pointer to p4d entry
+ * @addr:	range start address
+ * @end:	range end address
+ */
+static void stage2_sync_dirty_log_puds(struct kvm *kvm, p4d_t *p4d,
+				       phys_addr_t addr, phys_addr_t end)
+{
+	pud_t *pud;
+	phys_addr_t next;
+
+	pud = stage2_pud_offset(kvm, p4d, addr);
+	do {
+		next = stage2_pud_addr_end(kvm, addr, end);
+		if (!stage2_pud_none(kvm, *pud) && !stage2_pud_huge(kvm, *pud))
+			stage2_sync_dirty_log_pmds(kvm, pud, addr, next);
+	} while (pud++, addr = next, addr != end);
+}
+
+/**
+ * stage2_sync_dirty_log_p4ds() - synchronize dirty log from PGD range
+ * @kvm:	The KVM pointer
+ * @pgd:	pointer to pgd entry
+ * @addr:	range start address
+ * @end:	range end address
+ */
+static void stage2_sync_dirty_log_p4ds(struct kvm *kvm, pgd_t *pgd,
+				       phys_addr_t addr, phys_addr_t end)
+{
+	p4d_t *p4d;
+	phys_addr_t next;
+
+	p4d = stage2_p4d_offset(kvm, pgd, addr);
+	do {
+		next = stage2_p4d_addr_end(kvm, addr, end);
+		if (!stage2_p4d_none(kvm, *p4d))
+			stage2_sync_dirty_log_puds(kvm, p4d, addr, next);
+	} while (p4d++, addr = next, addr != end);
+}
+
+/**
+ * stage2_sync_dirty_log_range() - synchronize dirty log from stage2 memory
+ * region range
+ * @kvm:	The KVM pointer
+ * @addr:	Start address of range
+ * @end:	End address of range
+ */
+static void stage2_sync_dirty_log_range(struct kvm *kvm, phys_addr_t addr,
+					phys_addr_t end)
+{
+	pgd_t *pgd;
+	phys_addr_t next;
+
+	pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
+	do {
+		cond_resched_lock(&kvm->mmu_lock);
+		if (!READ_ONCE(kvm->arch.pgd))
+			break;
+		next = stage2_pgd_addr_end(kvm, addr, end);
+		if (stage2_pgd_present(kvm, *pgd))
+			stage2_sync_dirty_log_p4ds(kvm, pgd, addr, next);
+	} while (pgd++, addr = next, addr != end);
+}
+
+/**
+ * kvm_mmu_sync_dirty_log() - synchronize dirty log from stage2 PTEs for
+ * memory slot
+ * @kvm:	The KVM pointer
+ * @slot:	The memory slot to synchronize dirty log
+ *
+ * Called to synchronize the dirty log (as marked by hw) after the
+ * KVM_GET_DIRTY_LOG operation is performed on a memory region. After this
+ * function returns, all dirty log information is collected into the
+ * memslot dirty_bitmap, which can then be copied to userspace. Since hw may
+ * keep modifying page tables while this routine runs, this strictly holds
+ * only when the guest is stopped, but no dirty log is missed in the end.
+ *
+ * Acquires kvm_mmu_lock. Called with kvm->slots_lock mutex acquired,
+ * serializing operations for VM memory regions.
+ */
+void kvm_mmu_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
+{
+	phys_addr_t start = memslot->base_gfn << PAGE_SHIFT;
+	phys_addr_t end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
+	int srcu_idx;
+
+	if (WARN_ON_ONCE(!memslot->dirty_bitmap))
+		return;
+
+	srcu_idx = srcu_read_lock(&kvm->srcu);
+	spin_lock(&kvm->mmu_lock);
+
+	stage2_sync_dirty_log_range(kvm, start, end);
+
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, srcu_idx);
+}
+#endif /* CONFIG_ARM64_HW_AFDBM */
+
 void kvm_arch_commit_memory_region(struct kvm *kvm,
 				   const struct kvm_userspace_memory_region *mem,
 				   struct kvm_memory_slot *old,
-- 
2.19.1

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 3/8] KVM: arm64: Modify stage2 young mechanism to support hw DBM
  2020-07-02 13:55 [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM Keqian Zhu
  2020-07-02 13:55 ` [PATCH v2 1/8] KVM: arm64: Set DBM bit for writable PTEs Keqian Zhu
  2020-07-02 13:55 ` [PATCH v2 2/8] KVM: arm64: Scan PTEs to sync dirty log Keqian Zhu
@ 2020-07-02 13:55 ` Keqian Zhu
  2020-07-02 13:55 ` [PATCH v2 4/8] KVM: arm64: Save stage2 PTE dirty status if it is covered Keqian Zhu
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Keqian Zhu @ 2020-07-02 13:55 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Sean Christopherson, Steven Price, liangpeng10,
	Alexios Zavras, Mark Brown, Marc Zyngier, Thomas Gleixner,
	Will Deacon, Andrew Morton

Marking PTs young (setting the AF bit) should be atomic, to avoid
covering the dirty status set by hardware.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Peng Liang <liangpeng10@huawei.com>
---
 arch/arm64/include/asm/kvm_mmu.h | 31 ++++++++++++++++++++++---------
 arch/arm64/kvm/mmu.c             | 15 ++++++++-------
 2 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 4c12b7ad8ae8..a1b6131d980c 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -219,6 +219,18 @@ static inline void kvm_set_s2pte_readonly(pte_t *ptep)
 	} while (pteval != old_pteval);
 }
 
+static inline void kvm_set_s2pte_young(pte_t *ptep)
+{
+	pteval_t old_pteval, pteval;
+
+	pteval = READ_ONCE(pte_val(*ptep));
+	do {
+		old_pteval = pteval;
+		pteval |= PTE_AF;
+		pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
+	} while (pteval != old_pteval);
+}
+
 static inline bool kvm_s2pte_readonly(pte_t *ptep)
 {
 	return (READ_ONCE(pte_val(*ptep)) & PTE_S2_RDWR) == PTE_S2_RDONLY;
@@ -234,6 +246,11 @@ static inline void kvm_set_s2pmd_readonly(pmd_t *pmdp)
 	kvm_set_s2pte_readonly((pte_t *)pmdp);
 }
 
+static inline void kvm_set_s2pmd_young(pmd_t *pmdp)
+{
+	kvm_set_s2pte_young((pte_t *)pmdp);
+}
+
 static inline bool kvm_s2pmd_readonly(pmd_t *pmdp)
 {
 	return kvm_s2pte_readonly((pte_t *)pmdp);
@@ -249,6 +266,11 @@ static inline void kvm_set_s2pud_readonly(pud_t *pudp)
 	kvm_set_s2pte_readonly((pte_t *)pudp);
 }
 
+static inline void kvm_set_s2pud_young(pud_t *pudp)
+{
+	kvm_set_s2pte_young((pte_t *)pudp);
+}
+
 static inline bool kvm_s2pud_readonly(pud_t *pudp)
 {
 	return kvm_s2pte_readonly((pte_t *)pudp);
@@ -259,15 +281,6 @@ static inline bool kvm_s2pud_exec(pud_t *pudp)
 	return !(READ_ONCE(pud_val(*pudp)) & PUD_S2_XN);
 }
 
-static inline pud_t kvm_s2pud_mkyoung(pud_t pud)
-{
-	return pud_mkyoung(pud);
-}
-
-static inline bool kvm_s2pud_young(pud_t pud)
-{
-	return pud_young(pud);
-}
 
 static inline bool arm_mmu_hw_dbm_supported(void)
 {
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index b3cb8b6da4c2..ab8a6ceecbd8 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2008,8 +2008,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
  * Resolve the access fault by making the page young again.
  * Note that because the faulting entry is guaranteed not to be
  * cached in the TLB, we don't need to invalidate anything.
- * Only the HW Access Flag updates are supported for Stage 2 (no DBM),
- * so there is no need for atomic (pte|pmd)_mkyoung operations.
+ *
+ * Note: Both DBM and HW AF updates are supported for Stage2, so
+ * young operations should be atomic.
  */
 static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
 {
@@ -2027,15 +2028,15 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
 		goto out;
 
 	if (pud) {		/* HugeTLB */
-		*pud = kvm_s2pud_mkyoung(*pud);
+		kvm_set_s2pud_young(pud);
 		pfn = kvm_pud_pfn(*pud);
 		pfn_valid = true;
 	} else	if (pmd) {	/* THP, HugeTLB */
-		*pmd = pmd_mkyoung(*pmd);
+		kvm_set_s2pmd_young(pmd);
 		pfn = pmd_pfn(*pmd);
 		pfn_valid = true;
-	} else {
-		*pte = pte_mkyoung(*pte);	/* Just a page... */
+	} else {		/* Just a page... */
+		kvm_set_s2pte_young(pte);
 		pfn = pte_pfn(*pte);
 		pfn_valid = true;
 	}
@@ -2280,7 +2281,7 @@ static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *
 		return 0;
 
 	if (pud)
-		return kvm_s2pud_young(*pud);
+		return pud_young(*pud);
 	else if (pmd)
 		return pmd_young(*pmd);
 	else
-- 
2.19.1

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 4/8] KVM: arm64: Save stage2 PTE dirty status if it is covered
  2020-07-02 13:55 [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM Keqian Zhu
                   ` (2 preceding siblings ...)
  2020-07-02 13:55 ` [PATCH v2 3/8] KVM: arm64: Modify stage2 young mechanism to support hw DBM Keqian Zhu
@ 2020-07-02 13:55 ` Keqian Zhu
  2020-07-02 13:55 ` [PATCH v2 5/8] KVM: arm64: Steply write protect page table by mask bit Keqian Zhu
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Keqian Zhu @ 2020-07-02 13:55 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Sean Christopherson, Steven Price, liangpeng10,
	Alexios Zavras, Mark Brown, Marc Zyngier, Thomas Gleixner,
	Will Deacon, Andrew Morton

kvm_set_pte is called to replace a target PTE with a desired one. We
always do this without changing the desired one, but if the dirty
status set by hardware is covered, let the caller know.

There are two types of operations that change PTEs and may cover the
dirty status set by hardware.

1. Stage2 PTE unmapping: page table merging (the revert of huge page
table dissolving), kvm_unmap_hva_range() and so on.

2. Stage2 PTE changing: including user_mem_abort(),
kvm_mmu_notifier_change_pte() and so on.

All the operations above invoke kvm_set_pte() in the end. We should
save the dirty status into the memslot bitmap.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Peng Liang <liangpeng10@huawei.com>
---
 arch/arm64/include/asm/kvm_mmu.h |  5 ++++
 arch/arm64/kvm/mmu.c             | 42 ++++++++++++++++++++++++++++----
 2 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index a1b6131d980c..adb5c8edb29e 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -295,6 +295,11 @@ static inline bool kvm_mmu_hw_dbm_enabled(struct kvm *kvm)
 	return arm_mmu_hw_dbm_supported() && !!(kvm->arch.vtcr & VTCR_EL2_HD);
 }
 
+static inline bool kvm_s2pte_dbm(pte_t *ptep)
+{
+	return !!(READ_ONCE(pte_val(*ptep)) & PTE_DBM);
+}
+
 #define hyp_pte_table_empty(ptep) kvm_page_empty(ptep)
 
 #ifdef __PAGETABLE_PMD_FOLDED
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index ab8a6ceecbd8..d0c34549ef3b 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -194,10 +194,26 @@ static void clear_stage2_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr
 	put_page(virt_to_page(pmd));
 }
 
-static inline void kvm_set_pte(pte_t *ptep, pte_t new_pte)
+/**
+ * Return: true if the dirty status set by hardware is covered.
+ */
+static inline bool kvm_set_pte(pte_t *ptep, pte_t new_pte)
 {
-	WRITE_ONCE(*ptep, new_pte);
-	dsb(ishst);
+	pteval_t old_pteval;
+	bool old_logging, new_no_write;
+
+	old_logging = IS_ENABLED(CONFIG_ARM64_HW_AFDBM) &&
+		arm_mmu_hw_dbm_supported() && kvm_s2pte_dbm(ptep);
+	new_no_write = pte_none(new_pte) || kvm_s2pte_readonly(&new_pte);
+
+	if (!old_logging || !new_no_write) {
+		WRITE_ONCE(*ptep, new_pte);
+		dsb(ishst);
+		return false;
+	}
+
+	old_pteval = xchg(&pte_val(*ptep), pte_val(new_pte));
+	return !kvm_s2pte_readonly(&__pte(old_pteval));
 }
 
 static inline void kvm_set_pmd(pmd_t *pmdp, pmd_t new_pmd)
@@ -260,15 +276,23 @@ static void unmap_stage2_ptes(struct kvm *kvm, pmd_t *pmd,
 {
 	phys_addr_t start_addr = addr;
 	pte_t *pte, *start_pte;
+	bool dirty_covered;
+	int idx;
 
 	start_pte = pte = pte_offset_kernel(pmd, addr);
 	do {
 		if (!pte_none(*pte)) {
 			pte_t old_pte = *pte;
 
-			kvm_set_pte(pte, __pte(0));
+			dirty_covered = kvm_set_pte(pte, __pte(0));
 			kvm_tlb_flush_vmid_ipa(kvm, addr);
 
+			if (dirty_covered) {
+				idx = srcu_read_lock(&kvm->srcu);
+				mark_page_dirty(kvm, addr >> PAGE_SHIFT);
+				srcu_read_unlock(&kvm->srcu, idx);
+			}
+
 			/* No need to invalidate the cache for device mappings */
 			if (!kvm_is_device_pfn(pte_pfn(old_pte)))
 				kvm_flush_dcache_pte(old_pte);
@@ -1354,6 +1378,8 @@ static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
 	pte_t *pte, old_pte;
 	bool iomap = flags & KVM_S2PTE_FLAG_IS_IOMAP;
 	bool logging_active = flags & KVM_S2_FLAG_LOGGING_ACTIVE;
+	bool dirty_covered;
+	int idx;
 
 	VM_BUG_ON(logging_active && !cache);
 
@@ -1419,8 +1445,14 @@ static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
 		if (pte_val(old_pte) == pte_val(*new_pte))
 			return 0;
 
-		kvm_set_pte(pte, __pte(0));
+		dirty_covered = kvm_set_pte(pte, __pte(0));
 		kvm_tlb_flush_vmid_ipa(kvm, addr);
+
+		if (dirty_covered) {
+			idx = srcu_read_lock(&kvm->srcu);
+			mark_page_dirty(kvm, addr >> PAGE_SHIFT);
+			srcu_read_unlock(&kvm->srcu, idx);
+		}
 	} else {
 		get_page(virt_to_page(pte));
 	}
-- 
2.19.1

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 5/8] KVM: arm64: Steply write protect page table by mask bit
  2020-07-02 13:55 [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM Keqian Zhu
                   ` (3 preceding siblings ...)
  2020-07-02 13:55 ` [PATCH v2 4/8] KVM: arm64: Save stage2 PTE dirty status if it is covered Keqian Zhu
@ 2020-07-02 13:55 ` Keqian Zhu
  2020-07-02 13:55 ` [PATCH v2 6/8] KVM: arm64: Add KVM_CAP_ARM_HW_DIRTY_LOG capability Keqian Zhu
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Keqian Zhu @ 2020-07-02 13:55 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Sean Christopherson, Steven Price, liangpeng10,
	Alexios Zavras, Mark Brown, Marc Zyngier, Thomas Gleixner,
	Will Deacon, Andrew Morton

During dirty log clear, page table entries are write protected
according to a mask. In the past we write protected all entries
corresponding to the mask from ffs to fls. Though there may be zero
bits within this range, we are holding the kvm mmu lock, so we won't
write protect entries that we don't want to.

We are about to add support for hardware management of dirty state
to arm64, and holding the kvm mmu lock will not be enough. We should
write protect entries step by step, according to each mask bit.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Peng Liang <liangpeng10@huawei.com>
---
 arch/arm64/kvm/mmu.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d0c34549ef3b..adfa62f1fced 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1703,10 +1703,16 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 		gfn_t gfn_offset, unsigned long mask)
 {
 	phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
-	phys_addr_t start = (base_gfn +  __ffs(mask)) << PAGE_SHIFT;
-	phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
+	phys_addr_t start, end;
+	u32 i;
 
-	stage2_wp_range(kvm, start, end);
+	for (i = __ffs(mask); i <= __fls(mask); i++) {
+		if (test_bit_le(i, &mask)) {
+			start = (base_gfn + i) << PAGE_SHIFT;
+			end = (base_gfn + i + 1) << PAGE_SHIFT;
+			stage2_wp_range(kvm, start, end);
+		}
+	}
 }
 
 /*
-- 
2.19.1

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 6/8] KVM: arm64: Add KVM_CAP_ARM_HW_DIRTY_LOG capability
  2020-07-02 13:55 [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM Keqian Zhu
                   ` (4 preceding siblings ...)
  2020-07-02 13:55 ` [PATCH v2 5/8] KVM: arm64: Steply write protect page table by mask bit Keqian Zhu
@ 2020-07-02 13:55 ` Keqian Zhu
  2020-07-06  1:08   ` zhukeqian
  2020-07-02 13:55 ` [PATCH v2 7/8] KVM: arm64: Sync dirty log parallel Keqian Zhu
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Keqian Zhu @ 2020-07-02 13:55 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Sean Christopherson, Steven Price, liangpeng10,
	Alexios Zavras, Mark Brown, Marc Zyngier, Thomas Gleixner,
	Will Deacon, Andrew Morton

Since using arm64 DBM to log dirty pages has the side effect of a long
dirty log sync time, we should give userspace the opportunity to enable
or disable this feature, so it can implement its own policy.

This feature is disabled by default.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Peng Liang <liangpeng10@huawei.com>
---
 arch/arm64/include/asm/kvm_host.h |  1 +
 arch/arm64/kvm/arm.c              | 35 +++++++++++++++++++++++++++++++
 arch/arm64/kvm/mmu.c              | 22 +++++++++++++++++++
 arch/arm64/kvm/reset.c            |  5 +++++
 include/uapi/linux/kvm.h          |  1 +
 tools/include/uapi/linux/kvm.h    |  1 +
 6 files changed, 65 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 86b9c210ba43..69a5317c7049 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -481,6 +481,7 @@ u64 __kvm_call_hyp(void *hypfn, ...);
 void force_vm_exit(const cpumask_t *mask);
 void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
 void kvm_mmu_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot);
+void kvm_mmu_sync_dirty_log_all(struct kvm *kvm);
 
 int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
 		int exception_index);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index fefa5406e037..9e3f765d5467 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -78,6 +78,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 			    struct kvm_enable_cap *cap)
 {
 	int r;
+#ifdef CONFIG_ARM64_HW_AFDBM
+	int i;
+	struct kvm_vcpu *vcpu;
+	bool enable_hw_dirty_log;
+#endif
 
 	if (cap->flags)
 		return -EINVAL;
@@ -87,6 +92,36 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		r = 0;
 		kvm->arch.return_nisv_io_abort_to_user = true;
 		break;
+#ifdef CONFIG_ARM64_HW_AFDBM
+	case KVM_CAP_ARM_HW_DIRTY_LOG:
+		if (!arm_mmu_hw_dbm_supported() || (cap->args[0] & ~0x1))
+			r = -EINVAL;
+
+		enable_hw_dirty_log = !!(cap->args[0] & 0x1);
+		if (!!(kvm->arch.vtcr & VTCR_EL2_HD) != enable_hw_dirty_log) {
+			if (enable_hw_dirty_log)
+				kvm->arch.vtcr |= VTCR_EL2_HD;
+			else
+				kvm->arch.vtcr &= ~VTCR_EL2_HD;
+
+			/*
+			 * We should kick vcpus out of guest mode here to
+			 * load new vtcr value to vtcr_el2 register when
+			 * re-enter guest mode.
+			 */
+			kvm_for_each_vcpu(i, vcpu, kvm)
+				kvm_vcpu_kick(vcpu);
+
+			if (!enable_hw_dirty_log) {
+				mutex_lock(&kvm->slots_lock);
+				kvm_mmu_sync_dirty_log_all(kvm);
+				mutex_unlock(&kvm->slots_lock);
+			}
+		}
+
+		r = 0;
+		break;
+#endif
 	default:
 		r = -EINVAL;
 		break;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index adfa62f1fced..1a48554accb0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2590,6 +2590,28 @@ void kvm_mmu_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
 	spin_unlock(&kvm->mmu_lock);
 	srcu_read_unlock(&kvm->srcu, srcu_idx);
 }
+
+/**
+ * kvm_mmu_sync_dirty_log_all() - synchronize dirty log from PTEs for whole VM
+ * @kvm:	The KVM pointer
+ *
+ * Called with kvm->slots_lock mutex acquired
+ */
+void kvm_mmu_sync_dirty_log_all(struct kvm *kvm)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	struct kvm_memory_slot *memslots = slots->memslots;
+	struct kvm_memory_slot *memslot;
+	int slot;
+
+	if (unlikely(!slots->used_slots))
+		return;
+
+	for (slot = 0; slot < slots->used_slots; slot++) {
+		memslot = &memslots[slot];
+		kvm_mmu_sync_dirty_log(kvm, memslot);
+	}
+}
 #endif /* CONFIG_ARM64_HW_AFDBM */
 
 void kvm_arch_commit_memory_region(struct kvm *kvm,
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index d3b209023727..a3be703dd54b 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -83,6 +83,11 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		r = has_vhe() && system_supports_address_auth() &&
 				 system_supports_generic_auth();
 		break;
+#ifdef CONFIG_ARM64_HW_AFDBM
+	case KVM_CAP_ARM_HW_DIRTY_LOG:
+		r = arm_mmu_hw_dbm_supported();
+		break;
+#endif /* CONFIG_ARM64_HW_AFDBM */
 	default:
 		r = 0;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4fdf30316582..e0b12c43397b 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1031,6 +1031,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_SECURE_GUEST 181
 #define KVM_CAP_HALT_POLL 182
 #define KVM_CAP_ASYNC_PF_INT 183
+#define KVM_CAP_ARM_HW_DIRTY_LOG 184
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 4fdf30316582..e0b12c43397b 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -1031,6 +1031,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_SECURE_GUEST 181
 #define KVM_CAP_HALT_POLL 182
 #define KVM_CAP_ASYNC_PF_INT 183
+#define KVM_CAP_ARM_HW_DIRTY_LOG 184
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.19.1

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 7/8] KVM: arm64: Sync dirty log parallel
  2020-07-02 13:55 [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM Keqian Zhu
                   ` (5 preceding siblings ...)
  2020-07-02 13:55 ` [PATCH v2 6/8] KVM: arm64: Add KVM_CAP_ARM_HW_DIRTY_LOG capability Keqian Zhu
@ 2020-07-02 13:55 ` Keqian Zhu
  2020-07-02 13:55 ` [PATCH v2 8/8] KVM: Omit dirty log sync in log clear if initially all set Keqian Zhu
  2020-07-06  1:28 ` [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM zhukeqian
  8 siblings, 0 replies; 18+ messages in thread
From: Keqian Zhu @ 2020-07-02 13:55 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Sean Christopherson, Steven Price, liangpeng10,
	Alexios Zavras, Mark Brown, Marc Zyngier, Thomas Gleixner,
	Will Deacon, Andrew Morton

Give userspace another option to address the high cost of dirty log
sync, called multi-core offload. Userspace can enable this policy
through KVM_CAP_ARM_HW_DIRTY_LOG.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Peng Liang <liangpeng10@huawei.com>
---
 arch/arm64/include/asm/kvm_host.h |  3 ++
 arch/arm64/kvm/arm.c              |  9 +++-
 arch/arm64/kvm/mmu.c              | 82 +++++++++++++++++++++++++++++--
 arch/arm64/kvm/reset.c            |  2 +-
 4 files changed, 90 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 69a5317c7049..05da819f9adc 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -95,6 +95,9 @@ struct kvm_arch {
 	 * supported.
 	 */
 	bool return_nisv_io_abort_to_user;
+
+	/* Sync dirty log parallel when hw dirty log enabled */
+	bool sync_dirty_log_parallel;
 };
 
 #define KVM_NR_MEM_OBJS     40
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 9e3f765d5467..89614984831d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -82,6 +82,7 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 	int i;
 	struct kvm_vcpu *vcpu;
 	bool enable_hw_dirty_log;
+	bool enable_sync_parallel;
 #endif
 
 	if (cap->flags)
@@ -94,10 +95,14 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		break;
 #ifdef CONFIG_ARM64_HW_AFDBM
 	case KVM_CAP_ARM_HW_DIRTY_LOG:
-		if (!arm_mmu_hw_dbm_supported() || (cap->args[0] & ~0x1))
+		if (!arm_mmu_hw_dbm_supported() || (cap->args[0] & ~0x3))
 			r = -EINVAL;
 
 		enable_hw_dirty_log = !!(cap->args[0] & 0x1);
+		enable_sync_parallel = !!(cap->args[0] & 0x2);
+		if (!enable_hw_dirty_log && enable_sync_parallel)
+			r = -EINVAL;
+
 		if (!!(kvm->arch.vtcr & VTCR_EL2_HD) != enable_hw_dirty_log) {
 			if (enable_hw_dirty_log)
 				kvm->arch.vtcr |= VTCR_EL2_HD;
@@ -119,6 +124,8 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 			}
 		}
 
+		kvm->arch.sync_dirty_log_parallel = enable_sync_parallel;
+
 		r = 0;
 		break;
 #endif
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 1a48554accb0..be360e0fd20b 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2548,15 +2548,55 @@ static void stage2_sync_dirty_log_range(struct kvm *kvm, phys_addr_t addr,
 
 	pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
 	do {
-		cond_resched_lock(&kvm->mmu_lock);
-		if (!READ_ONCE(kvm->arch.pgd))
-			break;
+		if (!kvm->arch.sync_dirty_log_parallel) {
+			cond_resched_lock(&kvm->mmu_lock);
+			if (!READ_ONCE(kvm->arch.pgd))
+				break;
+		}
 		next = stage2_pgd_addr_end(kvm, addr, end);
 		if (stage2_pgd_present(kvm, *pgd))
 			stage2_sync_dirty_log_p4ds(kvm, pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
 }
 
+static struct dirty_sync_task {
+	struct kvm *kvm;
+	struct kvm_memory_slot *memslot;
+	u32 cpu_cnt;
+	u16 cpu_idx_map[NR_CPUS];
+	u32 ite_npages;
+	u32 ite;
+	bool finished;
+} sync_task;
+
+static void stage2_sync_dirty_log_smp(void *task)
+{
+	struct dirty_sync_task *t = task;
+	struct kvm_memory_slot *memslot = t->memslot;
+	unsigned long ite_idx, base_page, end_page;
+	gfn_t base_gfn;
+
+	ite_idx = t->cpu_cnt * t->ite + t->cpu_idx_map[smp_processor_id()];
+
+	base_page = ite_idx * t->ite_npages;
+	if (base_page >= memslot->npages) {
+		t->finished = true;
+		trace_printk("stage2_sync_dirty_log_smp finished 1.\n");
+		return;
+	}
+
+	end_page = min(memslot->npages, base_page + t->ite_npages);
+	if (end_page == memslot->npages) {
+		t->finished = true;
+		trace_printk("stage2_sync_dirty_log_smp finished 2.\n");
+	}
+
+	base_gfn = memslot->base_gfn;
+	trace_printk("base_page 0x%lx, end_page 0x%lx.\n", base_page, end_page);
+	stage2_sync_dirty_log_range(t->kvm, (base_gfn + base_page) << PAGE_SHIFT,
+				    (base_gfn + end_page) << PAGE_SHIFT);
+}
+
 /**
  * kvm_mmu_sync_dirty_log() - synchronize dirty log from stage2 PTEs for
  * memory slot
@@ -2577,18 +2617,52 @@ void kvm_mmu_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
 {
 	phys_addr_t start = memslot->base_gfn << PAGE_SHIFT;
 	phys_addr_t end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
+	u32 ite_npages, cpu_cnt, this_cpu, cpu;
+	u16 cpu_idx;
 	int srcu_idx;
 
 	if (WARN_ON_ONCE(!memslot->dirty_bitmap))
 		return;
 
+	get_online_cpus();
 	srcu_idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
 
-	stage2_sync_dirty_log_range(kvm, start, end);
+	ite_npages = (1 << 17); /* 512MB max per iteration per CPU */
+	cpu_cnt = num_online_cpus();
+
+	/* Use single CPU */
+	if (!kvm->arch.sync_dirty_log_parallel ||
+	    memslot->npages <= ite_npages ||
+	    unlikely(cpu_cnt == 1)) {
+		stage2_sync_dirty_log_range(kvm, start, end);
+		goto out_unlock;
+	}
 
+	/* Use many CPUs through IPI */
+	cpu_idx = 0;
+	this_cpu = smp_processor_id();
+	for_each_online_cpu(cpu) {
+		if (cpu != this_cpu)
+			sync_task.cpu_idx_map[cpu] = cpu_idx++;
+	}
+
+	sync_task.kvm = kvm;
+	sync_task.memslot = memslot;
+	sync_task.cpu_cnt = cpu_cnt - 1; /* Not include this CPU */
+	sync_task.ite_npages = ite_npages;
+
+	sync_task.ite = 0;
+	sync_task.finished = false;
+	do {
+		smp_call_function(stage2_sync_dirty_log_smp, &sync_task, 1);
+		sync_task.ite++;
+	} while (!sync_task.finished);
+
+out_unlock:
 	spin_unlock(&kvm->mmu_lock);
 	srcu_read_unlock(&kvm->srcu, srcu_idx);
+	put_online_cpus();
 }
 
 /**
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index a3be703dd54b..4171d6c1d400 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -85,7 +85,7 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		break;
 #ifdef CONFIG_ARM64_HW_AFDBM
 	case KVM_CAP_ARM_HW_DIRTY_LOG:
-		r = arm_mmu_hw_dbm_supported();
+		r = arm_mmu_hw_dbm_supported() ? 0x3 : 0x0;
 		break;
 #endif /* CONFIG_ARM64_HW_AFDBM */
 	default:
-- 
2.19.1

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 8/8] KVM: Omit dirty log sync in log clear if initially all set
  2020-07-02 13:55 [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM Keqian Zhu
                   ` (6 preceding siblings ...)
  2020-07-02 13:55 ` [PATCH v2 7/8] KVM: arm64: Sync dirty log parallel Keqian Zhu
@ 2020-07-02 13:55 ` Keqian Zhu
  2020-07-06  1:28 ` [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM zhukeqian
  8 siblings, 0 replies; 18+ messages in thread
From: Keqian Zhu @ 2020-07-02 13:55 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Sean Christopherson, Steven Price, liangpeng10,
	Alexios Zavras, Mark Brown, Marc Zyngier, Thomas Gleixner,
	Will Deacon, Andrew Morton

Synchronizing the dirty log during log clear is useful only when the
userspace dirty bitmap contains dirty bits that the memslot dirty bitmap
does not, because then we can sync the new dirty bits into the memslot
dirty bitmap, clear them along the way and avoid reporting them to
userspace later.

With the dirty bitmap "initially all set" feature, the above situation
does not arise if the userspace logic is sane, so we can omit the dirty
log sync during log clear. This is valuable when dirty log sync is a
high-cost operation, such as with arm64 DBM.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 virt/kvm/kvm_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a852af5c3214..3f9e51d52b7a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1554,7 +1554,8 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 	    (log->num_pages < memslot->npages - log->first_page && (log->num_pages & 63)))
 	    return -EINVAL;
 
-	kvm_arch_sync_dirty_log(kvm, memslot);
+	if (!kvm_dirty_log_manual_protect_and_init_set(kvm))
+		kvm_arch_sync_dirty_log(kvm, memslot);
 
 	flush = false;
 	dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
-- 
2.19.1

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 6/8] KVM: arm64: Add KVM_CAP_ARM_HW_DIRTY_LOG capability
  2020-07-02 13:55 ` [PATCH v2 6/8] KVM: arm64: Add KVM_CAP_ARM_HW_DIRTY_LOG capability Keqian Zhu
@ 2020-07-06  1:08   ` zhukeqian
  0 siblings, 0 replies; 18+ messages in thread
From: zhukeqian @ 2020-07-06  1:08 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Sean Christopherson, Steven Price, liangpeng10,
	Alexios Zavras, Mark Brown, Marc Zyngier, Thomas Gleixner,
	Will Deacon, Andrew Morton

Hi,

On 2020/7/2 21:55, Keqian Zhu wrote:
> For that using arm64 DBM to log dirty pages has the side effect
> of long time dirty log sync, we should give userspace opportunity
> to enable or disable this feature, to realize some policy.
> 
> This feature is disabled by default.
> 
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Peng Liang <liangpeng10@huawei.com>
> ---
>  arch/arm64/include/asm/kvm_host.h |  1 +
>  arch/arm64/kvm/arm.c              | 35 +++++++++++++++++++++++++++++++
>  arch/arm64/kvm/mmu.c              | 22 +++++++++++++++++++
>  arch/arm64/kvm/reset.c            |  5 +++++
>  include/uapi/linux/kvm.h          |  1 +
>  tools/include/uapi/linux/kvm.h    |  1 +
>  6 files changed, 65 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 86b9c210ba43..69a5317c7049 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -481,6 +481,7 @@ u64 __kvm_call_hyp(void *hypfn, ...);
>  void force_vm_exit(const cpumask_t *mask);
>  void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
>  void kvm_mmu_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot);
> +void kvm_mmu_sync_dirty_log_all(struct kvm *kvm);
>  
>  int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
>  		int exception_index);
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index fefa5406e037..9e3f765d5467 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -78,6 +78,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  			    struct kvm_enable_cap *cap)
>  {
>  	int r;
> +#ifdef CONFIG_ARM64_HW_AFDBM
> +	int i;
> +	struct kvm_vcpu *vcpu;
> +	bool enable_hw_dirty_log;
> +#endif
>  
>  	if (cap->flags)
>  		return -EINVAL;
> @@ -87,6 +92,36 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  		r = 0;
>  		kvm->arch.return_nisv_io_abort_to_user = true;
>  		break;
> +#ifdef CONFIG_ARM64_HW_AFDBM
> +	case KVM_CAP_ARM_HW_DIRTY_LOG:
> +		if (!arm_mmu_hw_dbm_supported() || (cap->args[0] & ~0x1))
> +			r = -EINVAL;
Miss "break" here :-( . Code is converted from "return -EINVAL".
> +
> +		enable_hw_dirty_log = !!(cap->args[0] & 0x1);
> +		if (!!(kvm->arch.vtcr & VTCR_EL2_HD) != enable_hw_dirty_log) {
> +			if (enable_hw_dirty_log)
> +				kvm->arch.vtcr |= VTCR_EL2_HD;
> +			else
> +				kvm->arch.vtcr &= ~VTCR_EL2_HD;
> +
> +			/*
> +			 * We should kick vcpus out of guest mode here to
> +			 * load new vtcr value to vtcr_el2 register when
> +			 * re-enter guest mode.
> +			 */
> +			kvm_for_each_vcpu(i, vcpu, kvm)
> +				kvm_vcpu_kick(vcpu);
> +
> +			if (!enable_hw_dirty_log) {
> +				mutex_lock(&kvm->slots_lock);
> +				kvm_mmu_sync_dirty_log_all(kvm);
> +				mutex_unlock(&kvm->slots_lock);
> +			}
> +		}
> +
> +		r = 0;
> +		break;
> +#endif
>  	default:
>  		r = -EINVAL;
>  		break;
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index adfa62f1fced..1a48554accb0 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -2590,6 +2590,28 @@ void kvm_mmu_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
>  	spin_unlock(&kvm->mmu_lock);
>  	srcu_read_unlock(&kvm->srcu, srcu_idx);
>  }
> +
> +/**
> + * kvm_mmu_sync_dirty_log_all() - synchronize dirty log from PTEs for whole VM
> + * @kvm:	The KVM pointer
> + *
> + * Called with kvm->slots_lock mutex acquired
> + */
> +void kvm_mmu_sync_dirty_log_all(struct kvm *kvm)
> +{
> +	struct kvm_memslots *slots = kvm_memslots(kvm);
> +	struct kvm_memory_slot *memslots = slots->memslots;
> +	struct kvm_memory_slot *memslot;
> +	int slot;
> +
> +	if (unlikely(!slots->used_slots))
> +		return;
> +
> +	for (slot = 0; slot < slots->used_slots; slot++) {
> +		memslot = &memslots[slot];
> +		kvm_mmu_sync_dirty_log(kvm, memslot);
> +	}
> +}
>  #endif /* CONFIG_ARM64_HW_AFDBM */
>  
>  void kvm_arch_commit_memory_region(struct kvm *kvm,
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index d3b209023727..a3be703dd54b 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -83,6 +83,11 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  		r = has_vhe() && system_supports_address_auth() &&
>  				 system_supports_generic_auth();
>  		break;
> +#ifdef CONFIG_ARM64_HW_AFDBM
> +	case KVM_CAP_ARM_HW_DIRTY_LOG:
> +		r = arm_mmu_hw_dbm_supported();
> +		break;
> +#endif /* CONFIG_ARM64_HW_AFDBM */
>  	default:
>  		r = 0;
>  	}
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 4fdf30316582..e0b12c43397b 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1031,6 +1031,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_PPC_SECURE_GUEST 181
>  #define KVM_CAP_HALT_POLL 182
>  #define KVM_CAP_ASYNC_PF_INT 183
> +#define KVM_CAP_ARM_HW_DIRTY_LOG 184
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
> index 4fdf30316582..e0b12c43397b 100644
> --- a/tools/include/uapi/linux/kvm.h
> +++ b/tools/include/uapi/linux/kvm.h
> @@ -1031,6 +1031,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_PPC_SECURE_GUEST 181
>  #define KVM_CAP_HALT_POLL 182
>  #define KVM_CAP_ASYNC_PF_INT 183
> +#define KVM_CAP_ARM_HW_DIRTY_LOG 184
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> 
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM
  2020-07-02 13:55 [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM Keqian Zhu
                   ` (7 preceding siblings ...)
  2020-07-02 13:55 ` [PATCH v2 8/8] KVM: Omit dirty log sync in log clear if initially all set Keqian Zhu
@ 2020-07-06  1:28 ` zhukeqian
  2020-07-06  7:54   ` Marc Zyngier
  8 siblings, 1 reply; 18+ messages in thread
From: zhukeqian @ 2020-07-06  1:28 UTC (permalink / raw)
  To: Marc Zyngier, Catalin Marinas; +Cc: kvmarm

Hi Catalin and Marc,

On 2020/7/2 21:55, Keqian Zhu wrote:
> This patch series add support for dirty log based on HW DBM.
> 
> It works well under some migration test cases, including VM with 4K
> pages or 2M THP. I checked the SHA256 hash digest of all memory and
> they keep same for source VM and destination VM, which means no dirty
> pages is missed under hardware DBM.
> 
> Some key points:
> 
> 1. Only support hardware updates of dirty status for PTEs. PMDs and PUDs
>    are not involved for now.
> 
> 2. About *performance*: In RFC patch, I have mentioned that for every 64GB
>    memory, KVM consumes about 40ms to scan all PTEs to collect dirty log.
>    This patch solves this problem through two ways: HW/SW dynamic switch
>    and Multi-core offload.
> 
>    HW/SW dynamic switch: Give userspace right to enable/disable hw dirty
>    log. This adds a new KVM cap named KVM_CAP_ARM_HW_DIRTY_LOG. We can
>    achieve this by change the kvm->arch.vtcr value and kick vCPUs out to
>    reload this value to VCTR_EL2. Then userspace can enable hw dirty log
>    at the begining and disable it when dirty pages is little and about to
>    stop VM, so VM downtime is not affected.
> 
>    Multi-core offload: Offload the PT scanning workload to multi-core can
>    greatly reduce scanning time. To promise we can complete in time, I use
>    smp_call_fuction to realize this policy, which utilize IPI to dispatch
>    workload to other CPUs. Under 128U Kunpeng 920 platform, it just takes
>    about 5ms to scan PTs of 256 RAM (use mempress and almost all PTs have
>    been established). And We dispatch workload iterately (every CPU just
>    scan PTs of 512M RAM for each iteration), so it won't affect physical
>    CPUs seriously.

What do you think of these two methods to solve the high cost of PT
scanning? Maybe you are waiting for a PML-like feature on ARM :-) , but
in my tests, DBM is usable once these two methods are applied.

Thanks,
Keqian
> 
> 3. About correctness: Only add DBM bit when PTE is already writable, so
>    we still have readonly PTE and some mechanisms which rely on readonly
>    PTs are not broken.
> 
> 4. About PTs modification races: There are two kinds of PTs modification.
>    
>    The first is adding or clearing specific bit, such as AF or RW. All
>    these operations have been converted to be atomic, avoid covering
>    dirty status set by hardware.
>    
>    The second is replacement, such as PTEs unmapping or changement. All
>    these operations will invoke kvm_set_pte finally. kvm_set_pte have
>    been converted to be atomic and we save the dirty status to underlying
>    bitmap if dirty status is coverred.
> 
> Change log:
> 
> v2:
>  - Address Steven's comments.
>  - Add support of parallel dirty log sync.
>  - Simplify and merge patches of v1.
> 
> v1:
>  - Address Catalin's comments.
> 
> Keqian Zhu (8):
>   KVM: arm64: Set DBM bit for writable PTEs
>   KVM: arm64: Scan PTEs to sync dirty log
>   KVM: arm64: Modify stage2 young mechanism to support hw DBM
>   KVM: arm64: Save stage2 PTE dirty status if it is covered
>   KVM: arm64: Steply write protect page table by mask bit
>   KVM: arm64: Add KVM_CAP_ARM_HW_DIRTY_LOG capability
>   KVM: arm64: Sync dirty log parallel
>   KVM: Omit dirty log sync in log clear if initially all set
> 
>  arch/arm64/include/asm/kvm_host.h |   5 +
>  arch/arm64/include/asm/kvm_mmu.h  |  43 ++++-
>  arch/arm64/kvm/arm.c              |  45 ++++-
>  arch/arm64/kvm/mmu.c              | 307 ++++++++++++++++++++++++++++--
>  arch/arm64/kvm/reset.c            |   5 +
>  include/uapi/linux/kvm.h          |   1 +
>  tools/include/uapi/linux/kvm.h    |   1 +
>  virt/kvm/kvm_main.c               |   3 +-
>  8 files changed, 389 insertions(+), 21 deletions(-)
> 
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM
  2020-07-06  1:28 ` [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM zhukeqian
@ 2020-07-06  7:54   ` Marc Zyngier
  2020-07-13  2:47     ` zhukeqian
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Zyngier @ 2020-07-06  7:54 UTC (permalink / raw)
  To: zhukeqian; +Cc: Catalin Marinas, kvmarm

Hi Keqian,

On 2020-07-06 02:28, zhukeqian wrote:
> Hi Catalin and Marc,
> 
> On 2020/7/2 21:55, Keqian Zhu wrote:
>> This patch series add support for dirty log based on HW DBM.
>> 
>> It works well under some migration test cases, including VM with 4K
>> pages or 2M THP. I checked the SHA256 hash digest of all memory and
>> they keep same for source VM and destination VM, which means no dirty
>> pages is missed under hardware DBM.
>> 
>> Some key points:
>> 
>> 1. Only support hardware updates of dirty status for PTEs. PMDs and 
>> PUDs
>>    are not involved for now.
>> 
>> 2. About *performance*: In RFC patch, I have mentioned that for every 
>> 64GB
>>    memory, KVM consumes about 40ms to scan all PTEs to collect dirty 
>> log.
>>    This patch solves this problem through two ways: HW/SW dynamic 
>> switch
>>    and Multi-core offload.
>> 
>>    HW/SW dynamic switch: Give userspace right to enable/disable hw 
>> dirty
>>    log. This adds a new KVM cap named KVM_CAP_ARM_HW_DIRTY_LOG. We can
>>    achieve this by change the kvm->arch.vtcr value and kick vCPUs out 
>> to
>>    reload this value to VCTR_EL2. Then userspace can enable hw dirty 
>> log
>>    at the begining and disable it when dirty pages is little and about 
>> to
>>    stop VM, so VM downtime is not affected.
>> 
>>    Multi-core offload: Offload the PT scanning workload to multi-core 
>> can
>>    greatly reduce scanning time. To promise we can complete in time, I 
>> use
>>    smp_call_fuction to realize this policy, which utilize IPI to 
>> dispatch
>>    workload to other CPUs. Under 128U Kunpeng 920 platform, it just 
>> takes
>>    about 5ms to scan PTs of 256 RAM (use mempress and almost all PTs 
>> have
>>    been established). And We dispatch workload iterately (every CPU 
>> just
>>    scan PTs of 512M RAM for each iteration), so it won't affect 
>> physical
>>    CPUs seriously.
> 
> What do you think of these two methods to solve high-cost PTs scaning? 
> Maybe
> you are waiting for PML like feature on ARM :-) , but for my test, DBM 
> is usable
> after these two methods applied.

Useable, maybe. But leaving to userspace the decision to switch from one
mode to another isn't an acceptable outcome. Userspace doesn't need nor
want to know about this.

Another thing is that sending IPIs all over to trigger scanning may
work well on a system that runs a limited number of guests (or some
other userspace, actually), but I seriously doubt that it is impact
free once you start doing this on an otherwise loaded system.

You may have better results by having an alternative mapping of your
S2 page tables so that they are accessible linearly, which would
sidestep the PT parsing altogether, probably saving some cycles. But
this is still a marginal gain compared to the overall overhead of
scanning 4kB of memory per 2MB of guest RAM, as opposed to 64 *bytes*
per 2MB (assuming strict 4kB mappings at S2, no block mappings).

Finally, this doesn't work with pages dirtied from DMA, which is the
biggest problem. If you cannot track pages that are dirtied behind your
back, what is the purpose of scanning the dirty bits?

As for a PML-like feature, this would only be useful if the SMMU
architecture took part in it and provided consistent logging of
the dirtied pages in the IPA space. Only having it at the CPU level
would be making the exact same mistake.

So, given the above, I remain unconvinced by this approach.

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM
  2020-07-06  7:54   ` Marc Zyngier
@ 2020-07-13  2:47     ` zhukeqian
  2020-07-13 14:53       ` Marc Zyngier
  0 siblings, 1 reply; 18+ messages in thread
From: zhukeqian @ 2020-07-13  2:47 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: Catalin Marinas, kvmarm

Hi Marc,

Sorry for the delayed reply.

On 2020/7/6 15:54, Marc Zyngier wrote:
> Hi Keqian,
> 
> On 2020-07-06 02:28, zhukeqian wrote:
>> Hi Catalin and Marc,
>>
>> On 2020/7/2 21:55, Keqian Zhu wrote:
>>> This patch series add support for dirty log based on HW DBM.
>>>
>>> It works well under some migration test cases, including VM with 4K
>>> pages or 2M THP. I checked the SHA256 hash digest of all memory and
>>> they keep same for source VM and destination VM, which means no dirty
>>> pages is missed under hardware DBM.
>>>
>>> Some key points:
>>>
>>> 1. Only support hardware updates of dirty status for PTEs. PMDs and PUDs
>>>    are not involved for now.
>>>
>>> 2. About *performance*: In RFC patch, I have mentioned that for every 64GB
>>>    memory, KVM consumes about 40ms to scan all PTEs to collect dirty log.
>>>    This patch solves this problem through two ways: HW/SW dynamic switch
>>>    and Multi-core offload.
>>>
>>>    HW/SW dynamic switch: Give userspace right to enable/disable hw dirty
>>>    log. This adds a new KVM cap named KVM_CAP_ARM_HW_DIRTY_LOG. We can
>>>    achieve this by change the kvm->arch.vtcr value and kick vCPUs out to
>>>    reload this value to VCTR_EL2. Then userspace can enable hw dirty log
>>>    at the begining and disable it when dirty pages is little and about to
>>>    stop VM, so VM downtime is not affected.
>>>
>>>    Multi-core offload: Offload the PT scanning workload to multi-core can
>>>    greatly reduce scanning time. To promise we can complete in time, I use
>>>    smp_call_fuction to realize this policy, which utilize IPI to dispatch
>>>    workload to other CPUs. Under 128U Kunpeng 920 platform, it just takes
>>>    about 5ms to scan PTs of 256 RAM (use mempress and almost all PTs have
>>>    been established). And We dispatch workload iterately (every CPU just
>>>    scan PTs of 512M RAM for each iteration), so it won't affect physical
>>>    CPUs seriously.
>>
>> What do you think of these two methods to solve high-cost PTs scaning? Maybe
>> you are waiting for PML like feature on ARM :-) , but for my test, DBM is usable
>> after these two methods applied.
> 
> Useable, maybe. But leaving to userspace the decision to switch from one
> mode to another isn't an acceptable outcome. Userspace doesn't need nor
> want to know about this.
> 
OK, maybe this is worth discussing. The switch logic can be encapsulated in Qemu
and kept invisible to VM users. Well, I think that may be acceptable. :)

> Another thing is that sending IPIs all over to trigger scanning may
> work well on a system that runs a limited number of guests (or some
> other userspace, actually), but I seriously doubt that it is impact
> free once you start doing this on an otherwise loaded system.
> 
Yes, it is not suitable to send IPIs to all other physical CPUs. Currently I just
want to show you the idea and prove that it is effective. In a real cloud product we
have a resource isolation mechanism, so the result will be a bit worse (compared to 5ms),
but we won't affect other VMs.
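
For reference, the dispatch I am experimenting with is roughly the following
simplified sketch (not the actual patch; struct scan_arg and scan_slice() are
illustrative names only):

	struct scan_arg {
		struct kvm *kvm;
		gfn_t start;
		gfn_t nr_pages;
	};

	static void scan_slice(void *info)
	{
		/*
		 * info points at a struct scan_arg describing the range.
		 * Pick the next unscanned slice (e.g. via an atomic cursor)
		 * and transfer the DBM dirty status of its PTEs into the
		 * dirty bitmap.  Body omitted in this sketch.
		 */
	}

	static void sync_dirty_log_parallel(struct scan_arg *arg)
	{
		/*
		 * One round: every online CPU scans one slice via IPI.
		 * Repeat rounds until the whole range has been covered.
		 */
		on_each_cpu(scan_slice, arg, 1);
	}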

> You may have better results by having an alternative mapping of your
> S2 page tables so that they are accessible linearly, which would
> sidestep the PT parsing altogether, probably saving some cycles. But
Yeah, this is a good idea. But as I understand it, to make them linear we would have to
reserve enough physical memory at VM start (which may waste a lot of memory), and the effect
of this optimization *may* not be obvious.

> this is still a marginal gain compared to the overall overhead of
> scanning 4kB of memory per 2MB of guest RAM, as opposed to 64 *bytes*
> per 2MB (assuming strict 4kB mappings at S2, no block mappings).
> 
I once tested scanning the PTs while reading only one byte of each PTE, and the result stayed the same.
So when we scan PTs with just one core, the bottleneck is CPU speed, not memory bandwidth.

> Finally, this doesn't work with pages dirtied from DMA, which is the
> biggest problem. If you cannot track pages that are dirtied behind your
> back, what is the purpose of scanning the dirty bits?
> 
> As for a PML-like feature, this would only be useful if the SMMU
> architecture took part in it and provided consistent logging of
> the dirtied pages in the IPA space. Only having it at the CPU level
> would be making the exact same mistake.
Even if the SMMU is equipped with a PML-like feature, we still rely on device suspend to
avoid missing dirty pages, so I think the only advantage of PML is reducing the dirty
log sync time compared to multi-core offload. Maybe I am missing something?

Thanks,
Keqian

* Re: [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM
  2020-07-13  2:47     ` zhukeqian
@ 2020-07-13 14:53       ` Marc Zyngier
  2020-07-28  2:11         ` zhukeqian
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Zyngier @ 2020-07-13 14:53 UTC (permalink / raw)
  To: zhukeqian; +Cc: Catalin Marinas, kvmarm

Hi Keqian,

On Mon, 13 Jul 2020 03:47:25 +0100,
zhukeqian <zhukeqian1@huawei.com> wrote:
> 
> Hi Marc,
> 
> Sorry for the delay reply.
> 
> On 2020/7/6 15:54, Marc Zyngier wrote:
> > Hi Keqian,
> > 
> > On 2020-07-06 02:28, zhukeqian wrote:
> >> Hi Catalin and Marc,
> >>
> >> On 2020/7/2 21:55, Keqian Zhu wrote:
> >>> This patch series add support for dirty log based on HW DBM.
> >>>
> >>> It works well under some migration test cases, including VM with 4K
> >>> pages or 2M THP. I checked the SHA256 hash digest of all memory and
> >>> they keep same for source VM and destination VM, which means no dirty
> >>> pages is missed under hardware DBM.
> >>>
> >>> Some key points:
> >>>
> >>> 1. Only support hardware updates of dirty status for PTEs. PMDs and PUDs
> >>>    are not involved for now.
> >>>
> >>> 2. About *performance*: In RFC patch, I have mentioned that for every 64GB
> >>>    memory, KVM consumes about 40ms to scan all PTEs to collect dirty log.
> >>>    This patch solves this problem through two ways: HW/SW dynamic switch
> >>>    and Multi-core offload.
> >>>
> >>>    HW/SW dynamic switch: Give userspace right to enable/disable hw dirty
> >>>    log. This adds a new KVM cap named KVM_CAP_ARM_HW_DIRTY_LOG. We can
> >>>    achieve this by change the kvm->arch.vtcr value and kick vCPUs out to
> >>>    reload this value to VCTR_EL2. Then userspace can enable hw dirty log
> >>>    at the begining and disable it when dirty pages is little and about to
> >>>    stop VM, so VM downtime is not affected.
> >>>
> >>>    Multi-core offload: Offload the PT scanning workload to multi-core can
> >>>    greatly reduce scanning time. To promise we can complete in time, I use
> >>>    smp_call_fuction to realize this policy, which utilize IPI to dispatch
> >>>    workload to other CPUs. Under 128U Kunpeng 920 platform, it just takes
> >>>    about 5ms to scan PTs of 256 RAM (use mempress and almost all PTs have
> >>>    been established). And We dispatch workload iterately (every CPU just
> >>>    scan PTs of 512M RAM for each iteration), so it won't affect physical
> >>>    CPUs seriously.
> >>
> >> What do you think of these two methods to solve high-cost PTs scaning? Maybe
> >> you are waiting for PML like feature on ARM :-) , but for my test, DBM is usable
> >> after these two methods applied.
> > 
> > Useable, maybe. But leaving to userspace the decision to switch from one
> > mode to another isn't an acceptable outcome. Userspace doesn't need nor
> > want to know about this.
> > 
> OK, maybe this is worth discussing. The switch logic can be
> encapsulated into Qemu and can not be seen from VM users. Well, I
> think it maybe acceptable. :)

I'm sorry, but no, this isn't acceptable. Userspace is concerned with
finding out about the dirty pages, and nothing else. The method by
which you expose which pages are dirty is not the business of
userspace *at all*.

> 
> > Another thing is that sending IPIs all over to trigger scanning may
> > work well on a system that runs a limited number of guests (or some
> > other userspace, actually), but I seriously doubt that it is impact
> > free once you start doing this on an otherwise loaded system.
> > 
> Yes, it is not suitable to send IPIs to all other physical
> CPUs. Currently I just want to show you my idea and to prove it is
> effective. In real cloud product, we have resource isolation
> mechanism, so we will have a bit worse result (compared to 5ms) but
> we won't effect other VMs.

If you limit the IPIs to the CPUs running the VM, then you are already
in a situation where you effectively stop the guest to parse the S2
page tables.

> 
> > You may have better results by having an alternative mapping of your
> > S2 page tables so that they are accessible linearly, which would
> > sidestep the PT parsing altogether, probably saving some cycles. But
> Yeah, this is a good idea. But for my understanding, to make them
> linear, we have to preserve enough physical memory at VM start (may
> waste much memory), and the effect of this optimization *maybe* not
> obvious.

Well, you'd have to have another set of PT to map the S2
linearly. Yes, it consumes memory.

> 
> > this is still a marginal gain compared to the overall overhead of
> > scanning 4kB of memory per 2MB of guest RAM, as opposed to 64 *bytes*
> > per 2MB (assuming strict 4kB mappings at S2, no block mappings).
> > 
> I ever tested scanning PTs by reading only one byte of each PTE and
> the test result keeps same.  So, when we scan PTs using just one
> core, the bottle-neck is CPU speed, instead of memory bandwidth.

But you are still reading the leaf entries of the PTs, hence defeating
any sort of prefetch that the CPU could do for you. And my claim is
that reading the bitmap is much faster than parsing the PTs. Are you
saying that this isn't the case?

> > Finally, this doesn't work with pages dirtied from DMA, which is the
> > biggest problem. If you cannot track pages that are dirtied behind your
> > back, what is the purpose of scanning the dirty bits?
> > 
> > As for a PML-like feature, this would only be useful if the SMMU
> > architecture took part in it and provided consistent logging of
> > the dirtied pages in the IPA space. Only having it at the CPU level
> > would be making the exact same mistake.
>
> Even SMMU is equipped with PML like feature, we still rely on device
> suspend to avoid omitting dirty pages, so I think the only advantage
> of PML is reducing dirty log sync time compared to multi-core
> offload. Maybe I missing something?

I don't see why you would rely on stopping DMA if you can track all
the dirty pages in a unified way (SMMU + CPU sharing a single log). It
doesn't matter anyway, since such HW doesn't exist, and won't exist
for the foreseeable future.

As for the S2 DBM, I still remain unconvinced. My take still is that
without a good story for tracking DMA, this only makes the S2 MMU code
more complex, without any concrete benefit. Without DMA, what we have
today is adequate, and doesn't require exposing yet another userspace
interface.

If you solve the DMA issue, I'll be more than happy to reconsider.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

* Re: [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM
  2020-07-13 14:53       ` Marc Zyngier
@ 2020-07-28  2:11         ` zhukeqian
  2020-07-28  7:52           ` Marc Zyngier
  0 siblings, 1 reply; 18+ messages in thread
From: zhukeqian @ 2020-07-28  2:11 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: Catalin Marinas, kvmarm

Hi Marc,

On 2020/7/13 22:53, Marc Zyngier wrote:
> Hi Keqian,
> 
> On Mon, 13 Jul 2020 03:47:25 +0100,
> zhukeqian <zhukeqian1@huawei.com> wrote:
>>
>> Hi Marc,
>>
>> Sorry for the delay reply.
>>
>> On 2020/7/6 15:54, Marc Zyngier wrote:
>>> Hi Keqian,
>>>
>>> On 2020-07-06 02:28, zhukeqian wrote:
>>>> Hi Catalin and Marc,
>>>>
>>>> On 2020/7/2 21:55, Keqian Zhu wrote:
>>>>> This patch series add support for dirty log based on HW DBM.
>>>>>
>>>>> It works well under some migration test cases, including VM with 4K
>>>>> pages or 2M THP. I checked the SHA256 hash digest of all memory and
>>>>> they keep same for source VM and destination VM, which means no dirty
>>>>> pages is missed under hardware DBM.
>>>>>
>>>>> Some key points:
>>>>>
>>>>> 1. Only support hardware updates of dirty status for PTEs. PMDs and PUDs
>>>>>    are not involved for now.
>>>>>
>>>>> 2. About *performance*: In RFC patch, I have mentioned that for every 64GB
>>>>>    memory, KVM consumes about 40ms to scan all PTEs to collect dirty log.
>>>>>    This patch solves this problem through two ways: HW/SW dynamic switch
>>>>>    and Multi-core offload.
>>>>>
>>>>>    HW/SW dynamic switch: Give userspace right to enable/disable hw dirty
>>>>>    log. This adds a new KVM cap named KVM_CAP_ARM_HW_DIRTY_LOG. We can
>>>>>    achieve this by change the kvm->arch.vtcr value and kick vCPUs out to
>>>>>    reload this value to VCTR_EL2. Then userspace can enable hw dirty log
>>>>>    at the begining and disable it when dirty pages is little and about to
>>>>>    stop VM, so VM downtime is not affected.
>>>>>
>>>>>    Multi-core offload: Offload the PT scanning workload to multi-core can
>>>>>    greatly reduce scanning time. To promise we can complete in time, I use
>>>>>    smp_call_fuction to realize this policy, which utilize IPI to dispatch
>>>>>    workload to other CPUs. Under 128U Kunpeng 920 platform, it just takes
>>>>>    about 5ms to scan PTs of 256 RAM (use mempress and almost all PTs have
>>>>>    been established). And We dispatch workload iterately (every CPU just
>>>>>    scan PTs of 512M RAM for each iteration), so it won't affect physical
>>>>>    CPUs seriously.
>>>>
>>>> What do you think of these two methods to solve high-cost PTs scaning? Maybe
>>>> you are waiting for PML like feature on ARM :-) , but for my test, DBM is usable
>>>> after these two methods applied.
>>>
>>> Useable, maybe. But leaving to userspace the decision to switch from one
>>> mode to another isn't an acceptable outcome. Userspace doesn't need nor
>>> want to know about this.
>>>
>> OK, maybe this is worth discussing. The switch logic can be
>> encapsulated into Qemu and can not be seen from VM users. Well, I
>> think it maybe acceptable. :)
> 
> I'm sorry, but no, this isn't acceptable. Userspace is concerned with
> finding out about the dirty pages, and nothing else. The method by
> which you exposes which pages are dirty is not the business of
> userspace *at all*.
OK.
> 
>>
>>> Another thing is that sending IPIs all over to trigger scanning may
>>> work well on a system that runs a limited number of guests (or some
>>> other userspace, actually), but I seriously doubt that it is impact
>>> free once you start doing this on an otherwise loaded system.
>>>
>> Yes, it is not suitable to send IPIs to all other physical
>> CPUs. Currently I just want to show you my idea and to prove it is
>> effective. In real cloud product, we have resource isolation
>> mechanism, so we will have a bit worse result (compared to 5ms) but
>> we won't effect other VMs.
> 
> If you limit the IPIs to the CPUs running the VM, they you are already
> in a situation where you effectively stop the guest to parse the S2
> page tables.
> 
>>
>>> You may have better results by having an alternative mapping of your
>>> S2 page tables so that they are accessible linearly, which would
>>> sidestep the PT parsing altogether, probably saving some cycles. But
>> Yeah, this is a good idea. But for my understanding, to make them
>> linear, we have to preserve enough physical memory at VM start (may
>> waste much memory), and the effect of this optimization *maybe* not
>> obvious.
> 
> Well, you'd have to have another set of PT to map the S2
> linearly. Yes, it consumes memory.
> 
>>
>>> this is still a marginal gain compared to the overall overhead of
>>> scanning 4kB of memory per 2MB of guest RAM, as opposed to 64 *bytes*
>>> per 2MB (assuming strict 4kB mappings at S2, no block mappings).
>>>
>> I ever tested scanning PTs by reading only one byte of each PTE and
>> the test result keeps same.  So, when we scan PTs using just one
>> core, the bottle-neck is CPU speed, instead of memory bandwidth.
> 
> But you are still reading the leaf entries of the PTs, hence defeating
> any sort of prefetch that the CPU could do for you. And my claim is
> that reading the bitmap is much faster than parsing the PTs. Are you
> saying that this isn't the case?
I am confused here. MMU DBM just updates S2AP[1] in the PTEs, so the dirty information
is not contiguous. The smallest granularity of a read instruction is one byte, so we must
read one byte of each PTE to determine whether it is dirty. Therefore I think the minimum
amount to read is 512 bytes per 2MB.
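
To make that concrete, the scan I have in mind looks roughly like the sketch
below. It is illustrative only (not the patch itself), and it assumes the
hardware-dirty information is S2AP[1], i.e. bit 7 in the low byte of each
little-endian stage-2 descriptor:

	/*
	 * Rough sketch: count the HW-dirty PTEs backing one 2MB block.
	 * Reads only the low byte of each descriptor, but that is still
	 * 512 bytes per 2MB.
	 */
	static unsigned int count_dirty_in_2m(const u64 *ptep)
	{
		unsigned int i, dirty = 0;

		for (i = 0; i < 512; i++) {		/* 512 PTEs per 2MB   */
			u8 lo = *(const u8 *)&ptep[i];	/* one byte per PTE   */

			if (lo & BIT(7))		/* S2AP[1] set by DBM */
				dirty++;
		}

		return dirty;
	}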

I wrote a demo and found that scanning linearly mapped PTs (without parsing) to sync the dirty log
can cut the time in half, but this is still unacceptable for VM migration.
> 
>>> Finally, this doesn't work with pages dirtied from DMA, which is the
>>> biggest problem. If you cannot track pages that are dirtied behind your
>>> back, what is the purpose of scanning the dirty bits?
>>>
>>> As for a PML-like feature, this would only be useful if the SMMU
>>> architecture took part in it and provided consistent logging of
>>> the dirtied pages in the IPA space. Only having it at the CPU level
>>> would be making the exact same mistake.
>>
>> Even SMMU is equipped with PML like feature, we still rely on device
>> suspend to avoid omitting dirty pages, so I think the only advantage
>> of PML is reducing dirty log sync time compared to multi-core
>> offload. Maybe I missing something?
> 
> I don't see why you would rely on stopping DMA if you can track all
> the dirty pages in a unified way (SMMU + CPU sharing a single log). It
> doesn't matter anyway, since such HW doesn't exist, and won't exist
> for the foreseeable future.
Because it takes some time to send dirty pages to the destination host, and during that time DMA
operations generate new dirty pages. If DMA cannot be stopped, we cannot find a point in time
at which all dirty memory has been sent out. Likewise, an x86 MMU with hardware dirty log and PML
also needs to stop the vCPUs when migration is about to complete.

> 
> As for the S2 DBM, I still remain unconvinced. My take still is that
> without a good story for tracking DMA, this only makes the S2 MMU code
> more complex, without any concrete benefit. Without DMA, what we have
> today is adequate, and doesn't require exposing yet another userspace
> interface.
> 
> If you solve the DMA issue, I'll be more than happy to reconsider.
> 
> Thanks,
> 
> 	M.
> 
Thanks,
Keqian

* Re: [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM
  2020-07-28  2:11         ` zhukeqian
@ 2020-07-28  7:52           ` Marc Zyngier
  2020-07-28  8:32             ` zhukeqian
  2021-01-06  6:55             ` Keqian Zhu
  0 siblings, 2 replies; 18+ messages in thread
From: Marc Zyngier @ 2020-07-28  7:52 UTC (permalink / raw)
  To: zhukeqian; +Cc: Catalin Marinas, kvmarm

On 2020-07-28 03:11, zhukeqian wrote:
> Hi Marc,

[...]

>> But you are still reading the leaf entries of the PTs, hence defeating
>> any sort of prefetch that the CPU could do for you. And my claim is
>> that reading the bitmap is much faster than parsing the PTs. Are you
>> saying that this isn't the case?
> I am confused here. MMU DBM just updates the S2AP[1] of PTs, so dirty
> information is not continuous. The smallest granularity of read instruction
> is one byte, we must read one byte of each PTE to determine whether it is
> dirty. So I think the smallest reading amount is 512 bytes per 2MB.

Which is why using DBM as a way to implement dirty-logging doesn't work.
Forcing the vcpu to take faults in order to update the dirty bitmap
has the benefit of (a) telling you exactly what page has been written to,
(b) *slowing the vcpu down*.

See? no additional read, better convergence ratio because you are not
trying to catch up with a vcpu running wild. You are in control of the
dirtying rate, not the vcpu, and the information you get requires no
extra work (just set the corresponding bit in the dirty bitmap).
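
Schematically (this is not the actual KVM code, just the shape of it), the
write-protection path amounts to:

	/* On a permission fault caused by a write to a logged page: */
	static void handle_logged_write(struct kvm *kvm, gfn_t gfn)
	{
		mark_page_dirty(kvm, gfn);	/* set the bit in the dirty bitmap */

		/*
		 * Then restore write permission on the faulting PTE and
		 * resume the vcpu.  The fault itself is what throttles
		 * the dirtying rate.
		 */
	}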

Honestly, I think you are looking at the problem the wrong way.
DBM at S2 is not a solution to anything, because the information is
way too sparse, and  it doesn't solve the real problem, which is
the tracking of dirty pages caused by devices.

As I said twice before, I will not consider these patches without
a solution to the DMA problem, and I don't think DBM is part of
that solution.

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...

* Re: [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM
  2020-07-28  7:52           ` Marc Zyngier
@ 2020-07-28  8:32             ` zhukeqian
  2021-01-06  6:55             ` Keqian Zhu
  1 sibling, 0 replies; 18+ messages in thread
From: zhukeqian @ 2020-07-28  8:32 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: Catalin Marinas, kvmarm

Hi Marc,

On 2020/7/28 15:52, Marc Zyngier wrote:
> On 2020-07-28 03:11, zhukeqian wrote:
>> Hi Marc,
> 
> [...]
> 
>>> But you are still reading the leaf entries of the PTs, hence defeating
>>> any sort of prefetch that the CPU could do for you. And my claim is
>>> that reading the bitmap is much faster than parsing the PTs. Are you
>>> saying that this isn't the case?
>> I am confused here. MMU DBM just updates the S2AP[1] of PTs, so dirty
>> information is not continuous. The smallest granularity of read instruction
>> is one byte, we must read one byte of each PTE to determine whether it is
>> dirty. So I think the smallest reading amount is 512 bytes per 2MB.
> 
> Which is why using DBM as a way to implement dirty-logging doesn't work.
> Forcing the vcpu to take faults in order to update the dirty bitmap
> has the benefit of (a) telling you exactly what page has been written to,
> (b) *slowing the vcpu down*.
> 
> See? no additional read, better convergence ratio because you are not
> trying to catch up with a vcpu running wild. You are in control of the
> dirtying rate, not the vcpu, and the information you get requires no
> extra work (just set the corresponding bit in the dirty bitmap).
OK, in fact I have considered some of these things before. You are right: DBM dirty logging
is not suitable for a guest with a high dirty rate, which even triggers Qemu throttling. It only
reduces the side effects of migration for a guest with a low dirty rate.

I am not trying to push this flawed patch now; rather, I am trying to find a software
approach that works around the hardware drawback. However, we do not currently have a perfect approach.

> 
> Honestly, I think you are looking at the problem the wrong way.
> DBM at S2 is not a solution to anything, because the information is
> way too sparse, and  it doesn't solve the real problem, which is
> the tracking of dirty pages caused by devices.
> 
> As I said twice before, I will not consider these patches without
> a solution to the DMA problem, and I don't think DBM is part of
> that solution.
The ARM SMMU HTTU does not have a PML-like feature, so the behavior of dirty log sync will
be very similar to that of MMU DBM. My original idea was to support MMU DBM
first and then support SMMU HTTU on top of it.

Sure, we can leave this patch here and hope we can pick it up in the future once the hardware is ready :-) .
Many thanks for your review ;-) .

Thanks,
Keqian

* Re: [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM
  2020-07-28  7:52           ` Marc Zyngier
  2020-07-28  8:32             ` zhukeqian
@ 2021-01-06  6:55             ` Keqian Zhu
  1 sibling, 0 replies; 18+ messages in thread
From: Keqian Zhu @ 2021-01-06  6:55 UTC (permalink / raw)
  To: Marc Zyngier, Catalin Marinas; +Cc: kvmarm

Hi Marc, Catalin,

I'd like to propose a new dirty tracking solution that combines hardware-based
dirty tracking (DBM) and software-based dirty tracking (memory abort). I hope to
get your opinions on it.

The core idea is that we do not enable hardware dirty tracking at the start (we do not add the DBM bit).
When any PT takes a write fault, we do software tracking for that PT and enable
hardware tracking for its *nearby* PTs (e.g. add the DBM bit for the nearby 16 PTs). Then, when
syncing the dirty log, we already know all the PTs that have hardware dirty tracking enabled,
so we do not need to scan all PTs.
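
In pseudo-C, the fault-path half of the idea is roughly the following. This is
only an illustration: set_dbm_for_group() and record_group_for_scan() are
hypothetical helpers, and the group size of 16 is just a placeholder:

	#define DBM_GROUP_PTES	16	/* placeholder group size */

	/* On a write fault to a page whose PTE has no DBM bit yet: */
	static void handle_write_fault_hybrid(struct kvm *kvm, gfn_t gfn)
	{
		mark_page_dirty(kvm, gfn);			/* software tracking   */
		set_dbm_for_group(kvm, gfn, DBM_GROUP_PTES);	/* hypothetical helper */
		record_group_for_scan(kvm, gfn);		/* hypothetical helper */
	}

At dirty log sync time we would then only scan the recorded groups, instead of
the whole stage-2 page table.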

One worry is that when the dirty rate is very high we still need to scan too many PTs.
Our main concern is the VM stop time. With Qemu dirty rate throttling, the amount of dirty memory
approaches the VM stop threshold, so there are only a few PTs left to scan after the VM stops.

It has the advantage of hardware tracking, which minimizes the side effects on vCPUs,
and also the advantage of software tracking, which controls the vCPU dirty rate.
Moreover, software tracking lets us scan PTs at fixed points, which
greatly reduces scanning time. And the biggest benefit is that we can apply this
solution to DMA dirty tracking.

Thanks,
Keqian


On 2020/7/28 15:52, Marc Zyngier wrote:
> On 2020-07-28 03:11, zhukeqian wrote:
>> Hi Marc,
> 
> [...]
> 
>>> But you are still reading the leaf entries of the PTs, hence defeating
>>> any sort of prefetch that the CPU could do for you. And my claim is
>>> that reading the bitmap is much faster than parsing the PTs. Are you
>>> saying that this isn't the case?
>> I am confused here. MMU DBM just updates the S2AP[1] of PTs, so dirty
>> information is not continuous. The smallest granularity of read instruction
>> is one byte, we must read one byte of each PTE to determine whether it is
>> dirty. So I think the smallest reading amount is 512 bytes per 2MB.
> 
> Which is why using DBM as a way to implement dirty-logging doesn't work.
> Forcing the vcpu to take faults in order to update the dirty bitmap
> has the benefit of (a) telling you exactly what page has been written to,
> (b) *slowing the vcpu down*.
> 
> See? no additional read, better convergence ratio because you are not
> trying to catch up with a vcpu running wild. You are in control of the
> dirtying rate, not the vcpu, and the information you get requires no
> extra work (just set the corresponding bit in the dirty bitmap).
> 
> Honestly, I think you are looking at the problem the wrong way.
> DBM at S2 is not a solution to anything, because the information is
> way too sparse, and  it doesn't solve the real problem, which is
> the tracking of dirty pages caused by devices.
> 
> As I said twice before, I will not consider these patches without
> a solution to the DMA problem, and I don't think DBM is part of
> that solution.
> 
> Thanks,
> 
>         M.

end of thread

Thread overview: 18+ messages
2020-07-02 13:55 [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM Keqian Zhu
2020-07-02 13:55 ` [PATCH v2 1/8] KVM: arm64: Set DBM bit for writable PTEs Keqian Zhu
2020-07-02 13:55 ` [PATCH v2 2/8] KVM: arm64: Scan PTEs to sync dirty log Keqian Zhu
2020-07-02 13:55 ` [PATCH v2 3/8] KVM: arm64: Modify stage2 young mechanism to support hw DBM Keqian Zhu
2020-07-02 13:55 ` [PATCH v2 4/8] KVM: arm64: Save stage2 PTE dirty status if it is covered Keqian Zhu
2020-07-02 13:55 ` [PATCH v2 5/8] KVM: arm64: Steply write protect page table by mask bit Keqian Zhu
2020-07-02 13:55 ` [PATCH v2 6/8] KVM: arm64: Add KVM_CAP_ARM_HW_DIRTY_LOG capability Keqian Zhu
2020-07-06  1:08   ` zhukeqian
2020-07-02 13:55 ` [PATCH v2 7/8] KVM: arm64: Sync dirty log parallel Keqian Zhu
2020-07-02 13:55 ` [PATCH v2 8/8] KVM: Omit dirty log sync in log clear if initially all set Keqian Zhu
2020-07-06  1:28 ` [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM zhukeqian
2020-07-06  7:54   ` Marc Zyngier
2020-07-13  2:47     ` zhukeqian
2020-07-13 14:53       ` Marc Zyngier
2020-07-28  2:11         ` zhukeqian
2020-07-28  7:52           ` Marc Zyngier
2020-07-28  8:32             ` zhukeqian
2021-01-06  6:55             ` Keqian Zhu
