* [PATCH 00/12] KVM: arm64: Support stage2 hardware DBM
@ 2020-06-16  9:35 Keqian Zhu
  2020-06-16  9:35 ` [PATCH 01/12] KVM: arm64: Add some basic functions to support hw DBM Keqian Zhu
                   ` (12 more replies)
  0 siblings, 13 replies; 16+ messages in thread
From: Keqian Zhu @ 2020-06-16  9:35 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	liangpeng10, zhengxiang9, wanghaibin.wang, Keqian Zhu

This patch series adds support for stage2 hardware DBM; for now it is
only used for dirty logging.

It works well under several migration test cases, including VMs with 4K
pages or 2M THP. I checked the SHA256 hash digest of all memory, and it
stays the same for the source VM and the destination VM, which means no
dirty pages are missed under hardware DBM.

Some key points:

1. Only hardware updates of dirty status for PTEs are supported. PMDs and
   PUDs are not involved for now.

2. About *performance*: In the RFC patch, I mentioned that for every 64GB
   of memory, KVM takes about 40ms to scan all PTEs to collect the dirty
   log.

   Initially, I planned to solve this problem using parallel CPUs. However,
   I faced two problems.

   The first is the memory bandwidth bottleneck. A single thread occupies
   about 500GB/s of bandwidth, so we can support about 4 parallel threads
   at most, and the ideal speedup ratio is low.

   The second is the huge impact on other CPUs. To scan PTs quickly, I use
   smp_call_function_many, which is based on IPIs, to dispatch the workload
   to other CPUs. Though it can complete the work in time, interrupts are
   disabled while scanning PTs, which has a huge impact on other CPUs.

   Now, hardware dirty logging can be dynamically enabled and disabled.
   Userspace can enable it before VM migration and disable it when few
   dirty pages remain. Thus VM downtime is not affected.
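
   As an illustration (not part of this series), the intended userspace
   flow might look roughly as follows; enable_hw_dirty_log() is a
   hypothetical wrapper around the KVM_ENABLE_CAP ioctl with
   KVM_CAP_ARM_HW_DIRTY_LOG (see patch 5), and the migration helpers
   are placeholders:

       /* Trade write-protection faults for a PTE scan at sync time. */
       enable_hw_dirty_log(vm_fd, true);
       migrate_ram_iteratively(vm_fd);

       /*
        * Few dirty pages remain: go back to write protection, so the
        * final dirty log sync needs no full PTE scan and VM downtime
        * is unaffected.
        */
       enable_hw_dirty_log(vm_fd, false);
       migrate_remaining_and_stop(vm_fd);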


3. About correctness: The DBM bit is only added when the PTE is already
   writable, so we still have readonly PTEs, and mechanisms that rely on
   readonly PTs are not broken.

4. About PT modification races: There are two kinds of PT modification.

   The first is adding or clearing a specific bit, such as AF or RW. All
   these operations have been converted to be atomic, to avoid covering
   dirty status set by hardware.

   The second is replacement, such as PTE unmapping or changing. All
   these operations eventually invoke kvm_set_pte. kvm_set_pte has been
   converted to be atomic, and we save the dirty status to the underlying
   bitmap if it is covered.


Keqian Zhu (12):
  KVM: arm64: Add some basic functions to support hw DBM
  KVM: arm64: Modify stage2 young mechanism to support hw DBM
  KVM: arm64: Report hardware dirty status of stage2 PTE if coverred
  KVM: arm64: Support clear DBM bit for PTEs
  KVM: arm64: Add KVM_CAP_ARM_HW_DIRTY_LOG capability
  KVM: arm64: Set DBM bit of PTEs during write protecting
  KVM: arm64: Scan PTEs to sync dirty log
  KVM: Omit dirty log sync in log clear if initially all set
  KVM: arm64: Steply write protect page table by mask bit
  KVM: arm64: Save stage2 PTE dirty status if it is coverred
  KVM: arm64: Support disable hw dirty log after enable
  KVM: arm64: Enable stage2 hardware DBM

 arch/arm64/include/asm/kvm_host.h |  11 +
 arch/arm64/include/asm/kvm_mmu.h  |  56 +++-
 arch/arm64/include/asm/sysreg.h   |   2 +
 arch/arm64/kvm/arm.c              |  22 +-
 arch/arm64/kvm/mmu.c              | 411 ++++++++++++++++++++++++++++--
 arch/arm64/kvm/reset.c            |  14 +-
 include/uapi/linux/kvm.h          |   1 +
 tools/include/uapi/linux/kvm.h    |   1 +
 virt/kvm/kvm_main.c               |   7 +-
 9 files changed, 499 insertions(+), 26 deletions(-)

-- 
2.19.1



* [PATCH 01/12] KVM: arm64: Add some basic functions to support hw DBM
  2020-06-16  9:35 [PATCH 00/12] KVM: arm64: Support stage2 hardware DBM Keqian Zhu
@ 2020-06-16  9:35 ` Keqian Zhu
  2020-06-16  9:35 ` [PATCH 02/12] KVM: arm64: Modify stage2 young mechanism " Keqian Zhu
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Keqian Zhu @ 2020-06-16  9:35 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	liangpeng10, zhengxiang9, wanghaibin.wang, Keqian Zhu

Prepare some basic functions to support hardware DBM for PTEs.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Peng Liang <liangpeng10@huawei.com>
---
 arch/arm64/include/asm/kvm_mmu.h | 36 ++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index b12bfc1f051a..e0ee6e23d626 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -265,6 +265,42 @@ static inline bool kvm_s2pud_young(pud_t pud)
 	return pud_young(pud);
 }
 
+#ifdef CONFIG_ARM64_HW_AFDBM
+static inline bool kvm_hw_dbm_enabled(void)
+{
+	return !!(read_sysreg(vtcr_el2) & VTCR_EL2_HD);
+}
+
+static inline void kvm_set_s2pte_dbm(pte_t *ptep)
+{
+	pteval_t old_pteval, pteval;
+
+	pteval = READ_ONCE(pte_val(*ptep));
+	do {
+		old_pteval = pteval;
+		pteval |= PTE_DBM;
+		pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
+	} while (pteval != old_pteval);
+}
+
+static inline void kvm_clear_s2pte_dbm(pte_t *ptep)
+{
+	pteval_t old_pteval, pteval;
+
+	pteval = READ_ONCE(pte_val(*ptep));
+	do {
+		old_pteval = pteval;
+		pteval &= ~PTE_DBM;
+		pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
+	} while (pteval != old_pteval);
+}
+
+static inline bool kvm_s2pte_dbm(pte_t *ptep)
+{
+	return !!(READ_ONCE(pte_val(*ptep)) & PTE_DBM);
+}
+#endif /* CONFIG_ARM64_HW_AFDBM */
+
 #define hyp_pte_table_empty(ptep) kvm_page_empty(ptep)
 
 #ifdef __PAGETABLE_PMD_FOLDED
-- 
2.19.1



* [PATCH 02/12] KVM: arm64: Modify stage2 young mechanism to support hw DBM
  2020-06-16  9:35 [PATCH 00/12] KVM: arm64: Support stage2 hardware DBM Keqian Zhu
  2020-06-16  9:35 ` [PATCH 01/12] KVM: arm64: Add some basic functions to support hw DBM Keqian Zhu
@ 2020-06-16  9:35 ` Keqian Zhu
  2020-06-16  9:35 ` [PATCH 03/12] KVM: arm64: Report hardware dirty status of stage2 PTE if coverred Keqian Zhu
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Keqian Zhu @ 2020-06-16  9:35 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	liangpeng10, zhengxiang9, wanghaibin.wang, Keqian Zhu

Marking PTs young (setting the AF bit) should be atomic, to avoid
covering dirty status set by hardware.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 arch/arm64/include/asm/kvm_mmu.h | 32 ++++++++++++++++++++++----------
 arch/arm64/kvm/mmu.c             | 15 ++++++++-------
 2 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index e0ee6e23d626..51af71505fbc 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -215,6 +215,18 @@ static inline void kvm_set_s2pte_readonly(pte_t *ptep)
 	} while (pteval != old_pteval);
 }
 
+static inline void kvm_set_s2pte_young(pte_t *ptep)
+{
+	pteval_t old_pteval, pteval;
+
+	pteval = READ_ONCE(pte_val(*ptep));
+	do {
+		old_pteval = pteval;
+		pteval |= PTE_AF;
+		pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
+	} while (pteval != old_pteval);
+}
+
 static inline bool kvm_s2pte_readonly(pte_t *ptep)
 {
 	return (READ_ONCE(pte_val(*ptep)) & PTE_S2_RDWR) == PTE_S2_RDONLY;
@@ -230,6 +242,11 @@ static inline void kvm_set_s2pmd_readonly(pmd_t *pmdp)
 	kvm_set_s2pte_readonly((pte_t *)pmdp);
 }
 
+static inline void kvm_set_s2pmd_young(pmd_t *pmdp)
+{
+	kvm_set_s2pte_young((pte_t *)pmdp);
+}
+
 static inline bool kvm_s2pmd_readonly(pmd_t *pmdp)
 {
 	return kvm_s2pte_readonly((pte_t *)pmdp);
@@ -245,6 +262,11 @@ static inline void kvm_set_s2pud_readonly(pud_t *pudp)
 	kvm_set_s2pte_readonly((pte_t *)pudp);
 }
 
+static inline void kvm_set_s2pud_young(pud_t *pudp)
+{
+	kvm_set_s2pte_young((pte_t *)pudp);
+}
+
 static inline bool kvm_s2pud_readonly(pud_t *pudp)
 {
 	return kvm_s2pte_readonly((pte_t *)pudp);
@@ -255,16 +277,6 @@ static inline bool kvm_s2pud_exec(pud_t *pudp)
 	return !(READ_ONCE(pud_val(*pudp)) & PUD_S2_XN);
 }
 
-static inline pud_t kvm_s2pud_mkyoung(pud_t pud)
-{
-	return pud_mkyoung(pud);
-}
-
-static inline bool kvm_s2pud_young(pud_t pud)
-{
-	return pud_young(pud);
-}
-
 #ifdef CONFIG_ARM64_HW_AFDBM
 static inline bool kvm_hw_dbm_enabled(void)
 {
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8c0035cab6b6..5ad87bce23c0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2008,8 +2008,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
  * Resolve the access fault by making the page young again.
  * Note that because the faulting entry is guaranteed not to be
  * cached in the TLB, we don't need to invalidate anything.
- * Only the HW Access Flag updates are supported for Stage 2 (no DBM),
- * so there is no need for atomic (pte|pmd)_mkyoung operations.
+ *
+ * Note: Both DBM and HW AF updates are supported for Stage2, so
+ * young operations should be atomic.
  */
 static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
 {
@@ -2027,15 +2028,15 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
 		goto out;
 
 	if (pud) {		/* HugeTLB */
-		*pud = kvm_s2pud_mkyoung(*pud);
+		kvm_set_s2pud_young(pud);
 		pfn = kvm_pud_pfn(*pud);
 		pfn_valid = true;
 	} else	if (pmd) {	/* THP, HugeTLB */
-		*pmd = pmd_mkyoung(*pmd);
+		kvm_set_s2pmd_young(pmd);
 		pfn = pmd_pfn(*pmd);
 		pfn_valid = true;
-	} else {
-		*pte = pte_mkyoung(*pte);	/* Just a page... */
+	} else {		/* Just a page... */
+		kvm_set_s2pte_young(pte);
 		pfn = pte_pfn(*pte);
 		pfn_valid = true;
 	}
@@ -2280,7 +2281,7 @@ static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *
 		return 0;
 
 	if (pud)
-		return kvm_s2pud_young(*pud);
+		return pud_young(*pud);
 	else if (pmd)
 		return pmd_young(*pmd);
 	else
-- 
2.19.1



* [PATCH 03/12] KVM: arm64: Report hardware dirty status of stage2 PTE if coverred
  2020-06-16  9:35 [PATCH 00/12] KVM: arm64: Support stage2 hardware DBM Keqian Zhu
  2020-06-16  9:35 ` [PATCH 01/12] KVM: arm64: Add some basic functions to support hw DBM Keqian Zhu
  2020-06-16  9:35 ` [PATCH 02/12] KVM: arm64: Modify stage2 young mechanism " Keqian Zhu
@ 2020-06-16  9:35 ` Keqian Zhu
  2020-07-01 11:28   ` Steven Price
  2020-06-16  9:35 ` [PATCH 04/12] KVM: arm64: Support clear DBM bit for PTEs Keqian Zhu
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 16+ messages in thread
From: Keqian Zhu @ 2020-06-16  9:35 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	liangpeng10, zhengxiang9, wanghaibin.wang, Keqian Zhu

kvm_set_pte is called to replace a target PTE with a desired one.
We always do this without changing the desired one, but if dirty
status set by hardware is coverred, let the caller know.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 arch/arm64/kvm/mmu.c | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 5ad87bce23c0..27407153121b 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -194,11 +194,45 @@ static void clear_stage2_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr
 	put_page(virt_to_page(pmd));
 }
 
-static inline void kvm_set_pte(pte_t *ptep, pte_t new_pte)
+#ifdef CONFIG_ARM64_HW_AFDBM
+/**
+ * @ret: true if dirty status set by hardware is coverred.
+ */
+static bool kvm_set_pte(pte_t *ptep, pte_t new_pte)
+{
+	pteval_t old_pteval, new_pteval, pteval;
+	bool old_logging, new_no_write;
+
+	old_logging = kvm_hw_dbm_enabled() && !pte_none(*ptep) &&
+		      kvm_s2pte_dbm(ptep);
+	new_no_write = pte_none(new_pte) || kvm_s2pte_readonly(&new_pte);
+
+	if (!old_logging || !new_no_write) {
+		WRITE_ONCE(*ptep, new_pte);
+		dsb(ishst);
+		return false;
+	}
+
+	new_pteval = pte_val(new_pte);
+	pteval = READ_ONCE(pte_val(*ptep));
+	do {
+		old_pteval = pteval;
+		pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, new_pteval);
+	} while (pteval != old_pteval);
+
+	return !kvm_s2pte_readonly(&__pte(pteval));
+}
+#else
+/**
+ * @ret: true if dirty status set by hardware is coverred.
+ */
+static inline bool kvm_set_pte(pte_t *ptep, pte_t new_pte)
 {
 	WRITE_ONCE(*ptep, new_pte);
 	dsb(ishst);
+	return false;
 }
+#endif /* CONFIG_ARM64_HW_AFDBM */
 
 static inline void kvm_set_pmd(pmd_t *pmdp, pmd_t new_pmd)
 {
-- 
2.19.1



* [PATCH 04/12] KVM: arm64: Support clear DBM bit for PTEs
  2020-06-16  9:35 [PATCH 00/12] KVM: arm64: Support stage2 hardware DBM Keqian Zhu
                   ` (2 preceding siblings ...)
  2020-06-16  9:35 ` [PATCH 03/12] KVM: arm64: Report hardware dirty status of stage2 PTE if coverred Keqian Zhu
@ 2020-06-16  9:35 ` Keqian Zhu
  2020-06-16  9:35 ` [PATCH 05/12] KVM: arm64: Add KVM_CAP_ARM_HW_DIRTY_LOG capability Keqian Zhu
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Keqian Zhu @ 2020-06-16  9:35 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	liangpeng10, zhengxiang9, wanghaibin.wang, Keqian Zhu

This adds support for clearing the DBM bit of PTEs, to realize
dynamic enabling and disabling of hardware DBM.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 arch/arm64/include/asm/kvm_host.h |   2 +
 arch/arm64/kvm/mmu.c              | 151 ++++++++++++++++++++++++++++++
 2 files changed, 153 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index c3e6fcc664b1..9ea2dcfd609c 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -480,6 +480,8 @@ u64 __kvm_call_hyp(void *hypfn, ...);
 
 void force_vm_exit(const cpumask_t *mask);
 void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
+void kvm_mmu_clear_dbm(struct kvm *kvm, struct kvm_memory_slot *memslot);
+void kvm_mmu_clear_dbm_all(struct kvm *kvm);
 
 int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
 		int exception_index);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 27407153121b..f08b0fbca0a0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2446,6 +2446,157 @@ int kvm_mmu_init(void)
 	return err;
 }
 
+#ifdef CONFIG_ARM64_HW_AFDBM
+/**
+ * stage2_clear_dbm_ptes() - clear DBM bit from PMD range
+ * @pmd:	pointer to pmd entry
+ * @addr:	range start address
+ * @end:	range end address
+ */
+static void stage2_clear_dbm_ptes(pmd_t *pmd, phys_addr_t addr,
+				  phys_addr_t end)
+{
+	pte_t *pte;
+
+	pte = pte_offset_kernel(pmd, addr);
+	do {
+		if (!pte_none(*pte) && kvm_s2pte_dbm(pte))
+			kvm_clear_s2pte_dbm(pte);
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+}
+
+/**
+ * stage2_clear_dbm_pmds() - clear DBM bit from PUD range
+ * @kvm:	The KVM pointer
+ * @pud:	pointer to pud entry
+ * @addr:	range start address
+ * @end:	range end address
+ */
+static void stage2_clear_dbm_pmds(struct kvm *kvm, pud_t *pud,
+				  phys_addr_t addr, phys_addr_t end)
+{
+	pmd_t *pmd;
+	phys_addr_t next;
+
+	pmd = stage2_pmd_offset(kvm, pud, addr);
+	do {
+		next = stage2_pmd_addr_end(kvm, addr, end);
+		if (!pmd_none(*pmd) && !pmd_thp_or_huge(*pmd))
+			stage2_clear_dbm_ptes(pmd, addr, next);
+	} while (pmd++, addr = next, addr != end);
+}
+
+/**
+ * stage2_clear_dbm_puds() - clear DBM bit from P4D range
+ * @kvm:	The KVM pointer
+ * @p4d:	pointer to p4d entry
+ * @addr:	range start address
+ * @end:	range end address
+ */
+static void stage2_clear_dbm_puds(struct kvm *kvm, p4d_t *p4d,
+				  phys_addr_t addr, phys_addr_t end)
+{
+	pud_t *pud;
+	phys_addr_t next;
+
+	pud = stage2_pud_offset(kvm, p4d, addr);
+	do {
+		next = stage2_pud_addr_end(kvm, addr, end);
+		if (!stage2_pud_none(kvm, *pud) && !stage2_pud_huge(kvm, *pud))
+			stage2_clear_dbm_pmds(kvm, pud, addr, next);
+	} while (pud++, addr = next, addr != end);
+}
+
+/**
+ * stage2_clear_dbm_p4ds() - clear DBM bit from PGD range
+ * @kvm:	The KVM pointer
+ * @pgd:	pointer to pgd entry
+ * @addr:	range start address
+ * @end:	range end address
+ */
+static void stage2_clear_dbm_p4ds(struct kvm *kvm, pgd_t *pgd,
+				  phys_addr_t addr, phys_addr_t end)
+{
+	p4d_t *p4d;
+	phys_addr_t next;
+
+	p4d = stage2_p4d_offset(kvm, pgd, addr);
+	do {
+		next = stage2_p4d_addr_end(kvm, addr, end);
+		if (!stage2_p4d_none(kvm, *p4d))
+			stage2_clear_dbm_puds(kvm, p4d, addr, next);
+	} while (p4d++, addr = next, addr != end);
+}
+
+/**
+ * stage2_clear_dbm_range() - clear DBM bit from stage2 memory
+ * region range
+ * @kvm:	The KVM pointer
+ * @addr:	Start address of range
+ * @end:	End address of range
+ */
+static void stage2_clear_dbm_range(struct kvm *kvm, phys_addr_t addr,
+				   phys_addr_t end)
+{
+	pgd_t *pgd;
+	phys_addr_t next;
+
+	pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
+	do {
+		cond_resched_lock(&kvm->mmu_lock);
+		if (!READ_ONCE(kvm->arch.pgd))
+			break;
+		next = stage2_pgd_addr_end(kvm, addr, end);
+		if (stage2_pgd_present(kvm, *pgd))
+			stage2_clear_dbm_p4ds(kvm, pgd, addr, next);
+	} while (pgd++, addr = next, addr != end);
+}
+
+/**
+ * kvm_mmu_clear_dbm() - clear DBM bit from stage2 PTEs for memory slot
+ * @kvm:	The KVM pointer
+ * @memslot:	The memory slot to clear the DBM bit for
+ *
+ * After this function returns, the DBM bit of all block or page
+ * descriptors is cleared.
+ *
+ * Acquires kvm_mmu_lock. Called with kvm->slots_lock mutex acquired,
+ * serializing operations for VM memory regions.
+ */
+void kvm_mmu_clear_dbm(struct kvm *kvm, struct kvm_memory_slot *memslot)
+{
+	phys_addr_t start = memslot->base_gfn << PAGE_SHIFT;
+	phys_addr_t end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
+
+	spin_lock(&kvm->mmu_lock);
+	stage2_clear_dbm_range(kvm, start, end);
+	spin_unlock(&kvm->mmu_lock);
+	kvm_flush_remote_tlbs(kvm);
+}
+
+/**
+ * kvm_mmu_clear_dbm_all() - clear DBM bit from stage2 PTEs for whole VM
+ * @kvm:	The KVM pointer
+ *
+ * Called with kvm->slots_lock mutex acquired.
+ */
+void kvm_mmu_clear_dbm_all(struct kvm *kvm)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	struct kvm_memory_slot *memslots = slots->memslots;
+	struct kvm_memory_slot *memslot;
+	int slot;
+
+	if (unlikely(!slots->used_slots))
+		return;
+
+	for (slot = 0; slot < slots->used_slots; slot++) {
+		memslot = &memslots[slot];
+		kvm_mmu_clear_dbm(kvm, memslot);
+	}
+}
+#endif /* CONFIG_ARM64_HW_AFDBM */
+
 void kvm_arch_commit_memory_region(struct kvm *kvm,
 				   const struct kvm_userspace_memory_region *mem,
 				   struct kvm_memory_slot *old,
-- 
2.19.1



* [PATCH 05/12] KVM: arm64: Add KVM_CAP_ARM_HW_DIRTY_LOG capability
  2020-06-16  9:35 [PATCH 00/12] KVM: arm64: Support stage2 hardware DBM Keqian Zhu
                   ` (3 preceding siblings ...)
  2020-06-16  9:35 ` [PATCH 04/12] KVM: arm64: Support clear DBM bit for PTEs Keqian Zhu
@ 2020-06-16  9:35 ` Keqian Zhu
  2020-06-16  9:35 ` [PATCH 06/12] KVM: arm64: Set DBM bit of PTEs during write protecting Keqian Zhu
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Keqian Zhu @ 2020-06-16  9:35 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	liangpeng10, zhengxiang9, wanghaibin.wang, Keqian Zhu

Because using arm64 DBM to log dirty pages has the side effect of a long
dirty log sync time, we should give userspace the opportunity to enable
or disable this feature, to realize its own policy.
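
For illustration (not part of this patch), userspace could probe and
toggle the capability roughly as follows, assuming the uapi header from
this series is installed and vm_fd is an open KVM VM file descriptor;
error handling is minimal:

  #include <stdbool.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static int set_hw_dirty_log(int vm_fd, bool enable)
  {
          struct kvm_enable_cap cap = {
                  .cap = KVM_CAP_ARM_HW_DIRTY_LOG,
                  .args[0] = enable,
          };

          /* KVM_CHECK_EXTENSION returns > 0 when the capability exists. */
          if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_ARM_HW_DIRTY_LOG) <= 0)
                  return -1;

          return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
  }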

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 arch/arm64/include/asm/kvm_host.h |  7 +++++++
 arch/arm64/kvm/arm.c              | 10 ++++++++++
 arch/arm64/kvm/reset.c            |  5 +++++
 include/uapi/linux/kvm.h          |  1 +
 tools/include/uapi/linux/kvm.h    |  1 +
 5 files changed, 24 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 9ea2dcfd609c..2bc3256759e3 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -95,6 +95,13 @@ struct kvm_arch {
 	 * supported.
 	 */
 	bool return_nisv_io_abort_to_user;
+
+	/*
+	 * Use hardware management of dirty status (DBM) to log dirty pages.
+	 * Userspace can enable this feature if KVM_CAP_ARM_HW_DIRTY_LOG is
+	 * supported.
+	 */
+	bool hw_dirty_log;
 };
 
 #define KVM_NR_MEM_OBJS     40
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 90cb90561446..850cc5cbc6f0 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -87,6 +87,16 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		r = 0;
 		kvm->arch.return_nisv_io_abort_to_user = true;
 		break;
+#ifdef CONFIG_ARM64_HW_AFDBM
+	case KVM_CAP_ARM_HW_DIRTY_LOG:
+		if ((cap->args[0] & ~1) || !kvm_hw_dbm_enabled()) {
+			r = -EINVAL;
+		} else {
+			r = 0;
+			kvm->arch.hw_dirty_log = cap->args[0];
+		}
+		break;
+#endif
 	default:
 		r = -EINVAL;
 		break;
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index d3b209023727..52bb801c9b2c 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -83,6 +83,11 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		r = has_vhe() && system_supports_address_auth() &&
 				 system_supports_generic_auth();
 		break;
+#ifdef CONFIG_ARM64_HW_AFDBM
+	case KVM_CAP_ARM_HW_DIRTY_LOG:
+		r = kvm_hw_dbm_enabled();
+		break;
+#endif /* CONFIG_ARM64_HW_AFDBM */
 	default:
 		r = 0;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4fdf30316582..e0b12c43397b 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1031,6 +1031,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_SECURE_GUEST 181
 #define KVM_CAP_HALT_POLL 182
 #define KVM_CAP_ASYNC_PF_INT 183
+#define KVM_CAP_ARM_HW_DIRTY_LOG 184
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index fdd632c833b4..53908a8881a4 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -1017,6 +1017,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_VCPU_RESETS 179
 #define KVM_CAP_S390_PROTECTED 180
 #define KVM_CAP_PPC_SECURE_GUEST 181
+#define KVM_CAP_ARM_HW_DIRTY_LOG 184
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.19.1



* [PATCH 06/12] KVM: arm64: Set DBM bit of PTEs during write protecting
  2020-06-16  9:35 [PATCH 00/12] KVM: arm64: Support stage2 hardware DBM Keqian Zhu
                   ` (4 preceding siblings ...)
  2020-06-16  9:35 ` [PATCH 05/12] KVM: arm64: Add KVM_CAP_ARM_HW_DIRTY_LOG capability Keqian Zhu
@ 2020-06-16  9:35 ` Keqian Zhu
  2020-06-16  9:35 ` [PATCH 07/12] KVM: arm64: Scan PTEs to sync dirty log Keqian Zhu
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Keqian Zhu @ 2020-06-16  9:35 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	liangpeng10, zhengxiang9, wanghaibin.wang, Keqian Zhu

When write protecting PTEs, if hardware dirty logging is enabled, set
the DBM bit of PTEs that are *already writable*. This ensures that
mechanisms which rely on "write fault", such as CoW, are not broken.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Peng Liang <liangpeng10@huawei.com>
---
 arch/arm64/kvm/mmu.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index f08b0fbca0a0..742c7943176f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1536,19 +1536,24 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 
 /**
  * stage2_wp_ptes - write protect PMD range
+ * @kvm:	kvm instance for the VM
  * @pmd:	pointer to pmd entry
  * @addr:	range start address
  * @end:	range end address
  */
-static void stage2_wp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)
+static void stage2_wp_ptes(struct kvm *kvm, pmd_t *pmd,
+			   phys_addr_t addr, phys_addr_t end)
 {
 	pte_t *pte;
 
 	pte = pte_offset_kernel(pmd, addr);
 	do {
-		if (!pte_none(*pte)) {
-			if (!kvm_s2pte_readonly(pte))
-				kvm_set_s2pte_readonly(pte);
+		if (!pte_none(*pte) && !kvm_s2pte_readonly(pte)) {
+#ifdef CONFIG_ARM64_HW_AFDBM
+			if (kvm->arch.hw_dirty_log && !kvm_s2pte_dbm(pte))
+				kvm_set_s2pte_dbm(pte);
+#endif
+			kvm_set_s2pte_readonly(pte);
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 }
@@ -1575,7 +1580,7 @@ static void stage2_wp_pmds(struct kvm *kvm, pud_t *pud,
 				if (!kvm_s2pmd_readonly(pmd))
 					kvm_set_s2pmd_readonly(pmd);
 			} else {
-				stage2_wp_ptes(pmd, addr, next);
+				stage2_wp_ptes(kvm, pmd, addr, next);
 			}
 		}
 	} while (pmd++, addr = next, addr != end);
-- 
2.19.1



* [PATCH 07/12] KVM: arm64: Scan PTEs to sync dirty log
  2020-06-16  9:35 [PATCH 00/12] KVM: arm64: Support stage2 hardware DBM Keqian Zhu
                   ` (5 preceding siblings ...)
  2020-06-16  9:35 ` [PATCH 06/12] KVM: arm64: Set DBM bit of PTEs during write protecting Keqian Zhu
@ 2020-06-16  9:35 ` Keqian Zhu
  2020-06-16  9:35 ` [PATCH 08/12] KVM: Omit dirty log sync in log clear if initially all set Keqian Zhu
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Keqian Zhu @ 2020-06-16  9:35 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	liangpeng10, zhengxiang9, wanghaibin.wang, Keqian Zhu

With hardware management of dirty state, the dirty state is stored in
PTEs. We have to scan all PTEs to sync the dirty log to the memslot
dirty bitmap.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Peng Liang <liangpeng10@huawei.com>
---
 arch/arm64/include/asm/kvm_host.h |   2 +
 arch/arm64/kvm/arm.c              |   6 +-
 arch/arm64/kvm/mmu.c              | 162 ++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c               |   4 +-
 4 files changed, 172 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 2bc3256759e3..910ec33afea8 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -489,6 +489,8 @@ void force_vm_exit(const cpumask_t *mask);
 void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
 void kvm_mmu_clear_dbm(struct kvm *kvm, struct kvm_memory_slot *memslot);
 void kvm_mmu_clear_dbm_all(struct kvm *kvm);
+void kvm_mmu_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot);
+void kvm_mmu_sync_dirty_log_all(struct kvm *kvm);
 
 int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
 		int exception_index);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 850cc5cbc6f0..92f0b40a30fa 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1209,7 +1209,11 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 
 void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
 {
-
+#ifdef CONFIG_ARM64_HW_AFDBM
+	if (kvm->arch.hw_dirty_log) {
+		kvm_mmu_sync_dirty_log(kvm, memslot);
+	}
+#endif
 }
 
 void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 742c7943176f..3aa0303d83f0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2600,6 +2600,168 @@ void kvm_mmu_clear_dbm_all(struct kvm *kvm)
 		kvm_mmu_clear_dbm(kvm, memslot);
 	}
 }
+
+/**
+ * stage2_sync_dirty_log_ptes() - synchronize dirty log from PMD range
+ * @kvm:	The KVM pointer
+ * @pmd:	pointer to pmd entry
+ * @addr:	range start address
+ * @end:	range end address
+ */
+static void stage2_sync_dirty_log_ptes(struct kvm *kvm, pmd_t *pmd,
+				       phys_addr_t addr, phys_addr_t end)
+{
+	pte_t *pte;
+
+	pte = pte_offset_kernel(pmd, addr);
+	do {
+		if (!pte_none(*pte) && !kvm_s2pte_readonly(pte))
+			mark_page_dirty(kvm, addr >> PAGE_SHIFT);
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+}
+
+/**
+ * stage2_sync_dirty_log_pmds() - synchronize dirty log from PUD range
+ * @kvm:	The KVM pointer
+ * @pud:	pointer to pud entry
+ * @addr:	range start address
+ * @end:	range end address
+ */
+static void stage2_sync_dirty_log_pmds(struct kvm *kvm, pud_t *pud,
+				       phys_addr_t addr, phys_addr_t end)
+{
+	pmd_t *pmd;
+	phys_addr_t next;
+
+	pmd = stage2_pmd_offset(kvm, pud, addr);
+	do {
+		next = stage2_pmd_addr_end(kvm, addr, end);
+		if (!pmd_none(*pmd) && !pmd_thp_or_huge(*pmd))
+			stage2_sync_dirty_log_ptes(kvm, pmd, addr, next);
+	} while (pmd++, addr = next, addr != end);
+}
+
+/**
+ * stage2_sync_dirty_log_puds() - synchronize dirty log from P4D range
+ * @kvm:	The KVM pointer
+ * @p4d:	pointer to p4d entry
+ * @addr:	range start address
+ * @end:	range end address
+ */
+static void stage2_sync_dirty_log_puds(struct kvm *kvm, p4d_t *p4d,
+				       phys_addr_t addr, phys_addr_t end)
+{
+	pud_t *pud;
+	phys_addr_t next;
+
+	pud = stage2_pud_offset(kvm, p4d, addr);
+	do {
+		next = stage2_pud_addr_end(kvm, addr, end);
+		if (!stage2_pud_none(kvm, *pud) && !stage2_pud_huge(kvm, *pud))
+			stage2_sync_dirty_log_pmds(kvm, pud, addr, next);
+	} while (pud++, addr = next, addr != end);
+}
+
+/**
+ * stage2_sync_dirty_log_p4ds() - synchronize dirty log from PGD range
+ * @kvm:	The KVM pointer
+ * @pgd:	pointer to pgd entry
+ * @addr:	range start address
+ * @end:	range end address
+ */
+static void stage2_sync_dirty_log_p4ds(struct kvm *kvm, pgd_t *pgd,
+				       phys_addr_t addr, phys_addr_t end)
+{
+	p4d_t *p4d;
+	phys_addr_t next;
+
+	p4d = stage2_p4d_offset(kvm, pgd, addr);
+	do {
+		next = stage2_p4d_addr_end(kvm, addr, end);
+		if (!stage2_p4d_none(kvm, *p4d))
+			stage2_sync_dirty_log_puds(kvm, p4d, addr, next);
+	} while (p4d++, addr = next, addr != end);
+}
+
+/**
+ * stage2_sync_dirty_log_range() - synchronize dirty log from stage2 memory
+ * region range
+ * @kvm:	The KVM pointer
+ * @addr:	Start address of range
+ * @end:	End address of range
+ */
+static void stage2_sync_dirty_log_range(struct kvm *kvm, phys_addr_t addr,
+					phys_addr_t end)
+{
+	pgd_t *pgd;
+	phys_addr_t next;
+
+	pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
+	do {
+		cond_resched_lock(&kvm->mmu_lock);
+		if (!READ_ONCE(kvm->arch.pgd))
+			break;
+		next = stage2_pgd_addr_end(kvm, addr, end);
+		if (stage2_pgd_present(kvm, *pgd))
+			stage2_sync_dirty_log_p4ds(kvm, pgd, addr, next);
+	} while (pgd++, addr = next, addr != end);
+}
+
+/**
+ * kvm_mmu_sync_dirty_log() - synchronize dirty log from stage2 PTEs for
+ * memory slot
+ * @kvm:	The KVM pointer
+ * @memslot:	The memory slot to synchronize the dirty log for
+ *
+ * Called to synchronize the dirty log (as marked by hardware) when the
+ * KVM_GET_DIRTY_LOG operation is issued for a memory region. After this
+ * function returns, all dirty log information is collected into the
+ * memslot dirty_bitmap (strictly true only when the guest is stopped,
+ * since hardware may dirty pages during the scan, but no dirty page is
+ * missed in the end). Afterwards dirty_bitmap can be copied to userspace.
+ *
+ * Acquires kvm_mmu_lock. Called with kvm->slots_lock mutex acquired,
+ * serializing operations for VM memory regions.
+ */
+void kvm_mmu_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
+{
+	phys_addr_t start = memslot->base_gfn << PAGE_SHIFT;
+	phys_addr_t end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
+	int idx;
+
+	if (WARN_ON_ONCE(!memslot->dirty_bitmap))
+		return;
+
+	idx = srcu_read_lock(&kvm->srcu);
+	spin_lock(&kvm->mmu_lock);
+
+	stage2_sync_dirty_log_range(kvm, start, end);
+
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
+}
+
+/**
+ * kvm_mmu_sync_dirty_log_all() - synchronize dirty log from PTEs for whole VM
+ * @kvm:	The KVM pointer
+ *
+ * Called with kvm->slots_lock mutex acquired
+ */
+void kvm_mmu_sync_dirty_log_all(struct kvm *kvm)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	struct kvm_memory_slot *memslots = slots->memslots;
+	struct kvm_memory_slot *memslot;
+	int slot;
+
+	if (unlikely(!slots->used_slots))
+		return;
+
+	for (slot = 0; slot < slots->used_slots; slot++) {
+		memslot = &memslots[slot];
+		kvm_mmu_sync_dirty_log(kvm, memslot);
+	}
+}
 #endif /* CONFIG_ARM64_HW_AFDBM */
 
 void kvm_arch_commit_memory_region(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a852af5c3214..3722343fd460 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2581,7 +2581,9 @@ static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 
-		set_bit_le(rel_gfn, memslot->dirty_bitmap);
+		/* Speed up if this bit has already been set */
+		if (!test_bit_le(rel_gfn, memslot->dirty_bitmap))
+			set_bit_le(rel_gfn, memslot->dirty_bitmap);
 	}
 }
 
-- 
2.19.1



* [PATCH 08/12] KVM: Omit dirty log sync in log clear if initially all set
  2020-06-16  9:35 [PATCH 00/12] KVM: arm64: Support stage2 hardware DBM Keqian Zhu
                   ` (6 preceding siblings ...)
  2020-06-16  9:35 ` [PATCH 07/12] KVM: arm64: Scan PTEs to sync dirty log Keqian Zhu
@ 2020-06-16  9:35 ` Keqian Zhu
  2020-06-16  9:35 ` [PATCH 09/12] KVM: arm64: Steply write protect page table by mask bit Keqian Zhu
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Keqian Zhu @ 2020-06-16  9:35 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	liangpeng10, zhengxiang9, wanghaibin.wang, Keqian Zhu

Synchronizing the dirty log during log clear is useful only when the
userspace dirty bitmap contains dirty bits that the memslot dirty bitmap
does not: we can sync the new dirty bits into the memslot dirty bitmap,
clear them along the way, and avoid reporting them to userspace later.

With the dirty bitmap "initially all set" feature, the above situation
does not arise as long as userspace behaves normally, so we can omit the
dirty log sync during log clear. This is valuable when dirty log sync is
a high-cost operation, as with arm64 DBM.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 virt/kvm/kvm_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3722343fd460..6c147d6f9da6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1554,7 +1554,8 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 	    (log->num_pages < memslot->npages - log->first_page && (log->num_pages & 63)))
 	    return -EINVAL;
 
-	kvm_arch_sync_dirty_log(kvm, memslot);
+	if (!kvm_dirty_log_manual_protect_and_init_set(kvm))
+		kvm_arch_sync_dirty_log(kvm, memslot);
 
 	flush = false;
 	dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
-- 
2.19.1



* [PATCH 09/12] KVM: arm64: Steply write protect page table by mask bit
  2020-06-16  9:35 [PATCH 00/12] KVM: arm64: Support stage2 hardware DBM Keqian Zhu
                   ` (7 preceding siblings ...)
  2020-06-16  9:35 ` [PATCH 08/12] KVM: Omit dirty log sync in log clear if initially all set Keqian Zhu
@ 2020-06-16  9:35 ` Keqian Zhu
  2020-06-16  9:35 ` [PATCH 10/12] KVM: arm64: Save stage2 PTE dirty status if it is coverred Keqian Zhu
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Keqian Zhu @ 2020-06-16  9:35 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	liangpeng10, zhengxiang9, wanghaibin.wang, Keqian Zhu

During dirty log clear, page table entries are write protected
according to a mask. In the past we write protected all entries
corresponding to the mask from ffs to fls. Though there may be zero
bits in between, since we hold the KVM mmu lock we will not write
protect entries that we do not want to.

We are about to add support for hardware management of dirty state
to arm64, and holding the KVM mmu lock will no longer be enough. We
should write protect entries step by step, per mask bit.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 arch/arm64/kvm/mmu.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 3aa0303d83f0..898e272a2c07 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1710,10 +1710,16 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 		gfn_t gfn_offset, unsigned long mask)
 {
 	phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
-	phys_addr_t start = (base_gfn +  __ffs(mask)) << PAGE_SHIFT;
-	phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
+	phys_addr_t start, end;
+	u32 i;
 
-	stage2_wp_range(kvm, start, end);
+	for (i = __ffs(mask); i <= __fls(mask); i++) {
+		if (test_bit_le(i, &mask)) {
+			start = (base_gfn + i) << PAGE_SHIFT;
+			end = (base_gfn + i + 1) << PAGE_SHIFT;
+			stage2_wp_range(kvm, start, end);
+		}
+	}
 }
 
 /*
-- 
2.19.1



* [PATCH 10/12] KVM: arm64: Save stage2 PTE dirty status if it is coverred
  2020-06-16  9:35 [PATCH 00/12] KVM: arm64: Support stage2 hardware DBM Keqian Zhu
                   ` (8 preceding siblings ...)
  2020-06-16  9:35 ` [PATCH 09/12] KVM: arm64: Steply write protect page table by mask bit Keqian Zhu
@ 2020-06-16  9:35 ` Keqian Zhu
  2020-06-16  9:35 ` [PATCH 11/12] KVM: arm64: Support disable hw dirty log after enable Keqian Zhu
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Keqian Zhu @ 2020-06-16  9:35 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	liangpeng10, zhengxiang9, wanghaibin.wang, Keqian Zhu

There are two types of operations that will change a PTE and may cover
dirty status set by hardware.

1. Stage2 PTE unmapping: page table merging (the revert of huge page
table dissolving), kvm_unmap_hva_range() and so on.

2. Stage2 PTE changing: including user_mem_abort(),
kvm_mmu_notifier_change_pte() and so on.

All of the operations above eventually invoke kvm_set_pte(). We should
save the dirty status into the memslot bitmap.

Question: Should we acquire kvm->slots_lock when invoking
mark_page_dirty()? It seems that user_mem_abort() does not acquire this
lock when invoking it.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 arch/arm64/kvm/mmu.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 898e272a2c07..a230fbcf3889 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -294,15 +294,23 @@ static void unmap_stage2_ptes(struct kvm *kvm, pmd_t *pmd,
 {
 	phys_addr_t start_addr = addr;
 	pte_t *pte, *start_pte;
+	bool dirty_coverred;
+	int idx;
 
 	start_pte = pte = pte_offset_kernel(pmd, addr);
 	do {
 		if (!pte_none(*pte)) {
 			pte_t old_pte = *pte;
 
-			kvm_set_pte(pte, __pte(0));
+			dirty_coverred = kvm_set_pte(pte, __pte(0));
 			kvm_tlb_flush_vmid_ipa(kvm, addr);
 
+			if (dirty_coverred) {
+				idx = srcu_read_lock(&kvm->srcu);
+				mark_page_dirty(kvm, addr >> PAGE_SHIFT);
+				srcu_read_unlock(&kvm->srcu, idx);
+			}
+
 			/* No need to invalidate the cache for device mappings */
 			if (!kvm_is_device_pfn(pte_pfn(old_pte)))
 				kvm_flush_dcache_pte(old_pte);
@@ -1388,6 +1396,8 @@ static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
 	pte_t *pte, old_pte;
 	bool iomap = flags & KVM_S2PTE_FLAG_IS_IOMAP;
 	bool logging_active = flags & KVM_S2_FLAG_LOGGING_ACTIVE;
+	bool dirty_coverred;
+	int idx;
 
 	VM_BUG_ON(logging_active && !cache);
 
@@ -1453,8 +1463,14 @@ static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
 		if (pte_val(old_pte) == pte_val(*new_pte))
 			return 0;
 
-		kvm_set_pte(pte, __pte(0));
+		dirty_coverred = kvm_set_pte(pte, __pte(0));
 		kvm_tlb_flush_vmid_ipa(kvm, addr);
+
+		if (dirty_coverred) {
+			idx = srcu_read_lock(&kvm->srcu);
+			mark_page_dirty(kvm, addr >> PAGE_SHIFT);
+			srcu_read_unlock(&kvm->srcu, idx);
+		}
 	} else {
 		get_page(virt_to_page(pte));
 	}
-- 
2.19.1



* [PATCH 11/12] KVM: arm64: Support disable hw dirty log after enable
  2020-06-16  9:35 [PATCH 00/12] KVM: arm64: Support stage2 hardware DBM Keqian Zhu
                   ` (9 preceding siblings ...)
  2020-06-16  9:35 ` [PATCH 10/12] KVM: arm64: Save stage2 PTE dirty status if it is coverred Keqian Zhu
@ 2020-06-16  9:35 ` Keqian Zhu
  2020-06-16  9:35 ` [PATCH 12/12] KVM: arm64: Enable stage2 hardware DBM Keqian Zhu
  2020-06-18  4:13 ` [PATCH 00/12] KVM: arm64: Support " zhukeqian
  12 siblings, 0 replies; 16+ messages in thread
From: Keqian Zhu @ 2020-06-16  9:35 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	liangpeng10, zhengxiang9, wanghaibin.wang, Keqian Zhu

We should clear the DBM bit of all PTEs and flush the TLB, then sync
the dirty log, which promises we won't miss any dirty status set by
hardware.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 arch/arm64/kvm/arm.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 92f0b40a30fa..76cab4c0b5a6 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -93,6 +93,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 			r = -EINVAL;
 		} else {
 			r = 0;
+			if (kvm->arch.hw_dirty_log && !cap->args[0]) {
+				mutex_lock(&kvm->slots_lock);
+				kvm_mmu_clear_dbm_all(kvm);
+				kvm_mmu_sync_dirty_log_all(kvm);
+				mutex_unlock(&kvm->slots_lock);
+			}
 			kvm->arch.hw_dirty_log = cap->args[0];
 		}
 		break;
-- 
2.19.1



* [PATCH 12/12] KVM: arm64: Enable stage2 hardware DBM
  2020-06-16  9:35 [PATCH 00/12] KVM: arm64: Support stage2 hardware DBM Keqian Zhu
                   ` (10 preceding siblings ...)
  2020-06-16  9:35 ` [PATCH 11/12] KVM: arm64: Support disable hw dirty log after enable Keqian Zhu
@ 2020-06-16  9:35 ` Keqian Zhu
  2020-06-18  4:13 ` [PATCH 00/12] KVM: arm64: Support " zhukeqian
  12 siblings, 0 replies; 16+ messages in thread
From: Keqian Zhu @ 2020-06-16  9:35 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	liangpeng10, zhengxiang9, wanghaibin.wang, Keqian Zhu

We are ready to support hardware management of dirty state, so enable
it if the hardware supports it.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Peng Liang <liangpeng10@huawei.com>
---
 arch/arm64/include/asm/sysreg.h | 2 ++
 arch/arm64/kvm/reset.c          | 9 ++++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 463175f80341..b22bd903284d 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -744,6 +744,8 @@
 #define ID_AA64MMFR1_VMIDBITS_8		0
 #define ID_AA64MMFR1_VMIDBITS_16	2
 
+#define ID_AA64MMFR1_HADBS_DBS		2
+
 /* id_aa64mmfr2 */
 #define ID_AA64MMFR2_E0PD_SHIFT		60
 #define ID_AA64MMFR2_FWB_SHIFT		40
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 52bb801c9b2c..c1215b13bdd5 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -427,7 +427,7 @@ int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type)
 {
 	u64 vtcr = VTCR_EL2_FLAGS, mmfr0;
 	u32 parange, phys_shift;
-	u8 lvls;
+	u8 lvls, hadbs;
 
 	if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
 		return -EINVAL;
@@ -465,6 +465,13 @@ int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type)
 	 */
 	vtcr |= VTCR_EL2_HA;
 
+#ifdef CONFIG_ARM64_HW_AFDBM
+	hadbs = (read_sysreg(id_aa64mmfr1_el1) >>
+			ID_AA64MMFR1_HADBS_SHIFT) & 0xf;
+	if (hadbs == ID_AA64MMFR1_HADBS_DBS)
+		vtcr |= VTCR_EL2_HD;
+#endif
+
 	/* Set the vmid bits */
 	vtcr |= (kvm_get_vmid_bits() == 16) ?
 		VTCR_EL2_VS_16BIT :
-- 
2.19.1



* Re: [PATCH 00/12] KVM: arm64: Support stage2 hardware DBM
  2020-06-16  9:35 [PATCH 00/12] KVM: arm64: Support stage2 hardware DBM Keqian Zhu
                   ` (11 preceding siblings ...)
  2020-06-16  9:35 ` [PATCH 12/12] KVM: arm64: Enable stage2 hardware DBM Keqian Zhu
@ 2020-06-18  4:13 ` zhukeqian
  12 siblings, 0 replies; 16+ messages in thread
From: zhukeqian @ 2020-06-18  4:13 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm, Catalin Marinas,
	Marc Zyngier, James Morse, Will Deacon
  Cc: Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	liangpeng10, zhengxiang9, wanghaibin.wang

Hi,

On 2020/6/16 17:35, Keqian Zhu wrote:
> This patch series adds support for stage2 hardware DBM; for now it is
> only used for dirty logging.
> 
> It works well under several migration test cases, including VMs with 4K
> pages or 2M THP. I checked the SHA256 hash digest of all memory, and it
> stays the same for the source VM and the destination VM, which means no
> dirty pages are missed under hardware DBM.
> 
> Some key points:
> 
> 1. Only hardware updates of dirty status for PTEs are supported. PMDs and
>    PUDs are not involved for now.
> 
> 2. About *performance*: In the RFC patch, I mentioned that for every 64GB
>    of memory, KVM takes about 40ms to scan all PTEs to collect the dirty
>    log.
> 
>    Initially, I planned to solve this problem using parallel CPUs. However,
>    I faced two problems.
> 
>    The first is the memory bandwidth bottleneck. A single thread occupies
>    about 500GB/s of bandwidth, so we can support about 4 parallel threads
>    at most, and the ideal speedup ratio is low.
Aha, I got it wrong here. I tested it again and found that the speedup ratio
can be about 23x when I use 32 CPUs to scan PTs (it takes about 5ms to scan
the PTs of 200GB of RAM).

> 
>    The second is the huge impact on other CPUs. To scan PTs quickly, I use
>    smp_call_function_many, which is based on IPIs, to dispatch the workload
>    to other CPUs. Though it can complete the work in time, interrupts are
>    disabled while scanning PTs, which has a huge impact on other CPUs.
I think we can divide the scanning workload into smaller pieces, so that
interrupts can be disabled and enabled periodically.
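
Something like the following (a rough, untested sketch; scan_window_fn,
struct scan_args and WINDOW are hypothetical):

    /*
     * Split [start, end) into windows and issue one cross-call per
     * window, so target CPUs re-enable interrupts between windows.
     */
    for (addr = start; addr < end; addr += WINDOW) {
            struct scan_args args = {
                    .kvm   = kvm,
                    .start = addr,
                    .end   = min(addr + WINDOW, end),
            };

            smp_call_function_many(cpu_online_mask, scan_window_fn,
                                   &args, true);
    }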

> 
>    Now, hardware dirty logging can be dynamically enabled and disabled.
>    Userspace can enable it before VM migration and disable it when few
>    dirty pages remain. Thus VM downtime is not affected.
BTW, we can reserve this interface for userspace in case CPU computing
resources are not enough.

Thanks,
Keqian
> 
> 
> 3. About correctness: The DBM bit is only added when the PTE is already
>    writable, so we still have readonly PTEs, and mechanisms that rely on
>    readonly PTs are not broken.
> 
> 4. About PT modification races: There are two kinds of PT modification.
> 
>    The first is adding or clearing a specific bit, such as AF or RW. All
>    these operations have been converted to be atomic, to avoid covering
>    dirty status set by hardware.
> 
>    The second is replacement, such as PTE unmapping or changing. All
>    these operations eventually invoke kvm_set_pte. kvm_set_pte has been
>    converted to be atomic, and we save the dirty status to the underlying
>    bitmap if it is covered.
> 
> 
> Keqian Zhu (12):
>   KVM: arm64: Add some basic functions to support hw DBM
>   KVM: arm64: Modify stage2 young mechanism to support hw DBM
>   KVM: arm64: Report hardware dirty status of stage2 PTE if coverred
>   KVM: arm64: Support clear DBM bit for PTEs
>   KVM: arm64: Add KVM_CAP_ARM_HW_DIRTY_LOG capability
>   KVM: arm64: Set DBM bit of PTEs during write protecting
>   KVM: arm64: Scan PTEs to sync dirty log
>   KVM: Omit dirty log sync in log clear if initially all set
>   KVM: arm64: Steply write protect page table by mask bit
>   KVM: arm64: Save stage2 PTE dirty status if it is coverred
>   KVM: arm64: Support disable hw dirty log after enable
>   KVM: arm64: Enable stage2 hardware DBM
> 
>  arch/arm64/include/asm/kvm_host.h |  11 +
>  arch/arm64/include/asm/kvm_mmu.h  |  56 +++-
>  arch/arm64/include/asm/sysreg.h   |   2 +
>  arch/arm64/kvm/arm.c              |  22 +-
>  arch/arm64/kvm/mmu.c              | 411 ++++++++++++++++++++++++++++--
>  arch/arm64/kvm/reset.c            |  14 +-
>  include/uapi/linux/kvm.h          |   1 +
>  tools/include/uapi/linux/kvm.h    |   1 +
>  virt/kvm/kvm_main.c               |   7 +-
>  9 files changed, 499 insertions(+), 26 deletions(-)
> 


* Re: [PATCH 03/12] KVM: arm64: Report hardware dirty status of stage2 PTE if coverred
  2020-06-16  9:35 ` [PATCH 03/12] KVM: arm64: Report hardware dirty status of stage2 PTE if coverred Keqian Zhu
@ 2020-07-01 11:28   ` Steven Price
  2020-07-02 11:28     ` zhukeqian
  0 siblings, 1 reply; 16+ messages in thread
From: Steven Price @ 2020-07-01 11:28 UTC (permalink / raw)
  To: Keqian Zhu, linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Sean Christopherson, liangpeng10,
	Alexios Zavras, Mark Brown, Marc Zyngier, Thomas Gleixner,
	Will Deacon, Andrew Morton

Hi,

On 16/06/2020 10:35, Keqian Zhu wrote:
> kvm_set_pte is called to replace a target PTE with a desired one.
> We always do this without changing the desired one, but if dirty
> status set by hardware is coverred, let the caller know.
> 
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> ---
>   arch/arm64/kvm/mmu.c | 36 +++++++++++++++++++++++++++++++++++-
>   1 file changed, 35 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 5ad87bce23c0..27407153121b 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -194,11 +194,45 @@ static void clear_stage2_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr
>   	put_page(virt_to_page(pmd));
>   }
>   
> -static inline void kvm_set_pte(pte_t *ptep, pte_t new_pte)
> +#ifdef CONFIG_ARM64_HW_AFDBM
> +/**
> + * @ret: true if dirty status set by hardware is coverred.

NIT: s/coverred/covered/, this is in several places.

> + */
> +static bool kvm_set_pte(pte_t *ptep, pte_t new_pte)
> +{
> +	pteval_t old_pteval, new_pteval, pteval;
> +	bool old_logging, new_no_write;
> +
> +	old_logging = kvm_hw_dbm_enabled() && !pte_none(*ptep) &&
> +		      kvm_s2pte_dbm(ptep);
> +	new_no_write = pte_none(new_pte) || kvm_s2pte_readonly(&new_pte);
> +
> +	if (!old_logging || !new_no_write) {
> +		WRITE_ONCE(*ptep, new_pte);
> +		dsb(ishst);
> +		return false;
> +	}
> +
> +	new_pteval = pte_val(new_pte);
> +	pteval = READ_ONCE(pte_val(*ptep));

This usage of *ptep looks wrong - it's read twice using READ_ONCE (once 
in kvm_s2pte_dbm()) and once without any decoration (in the pte_none() 
call). Which looks a bit dodgy and at the very least needs some 
justification. AFAICT you would be better taking a local copy and using 
that rather than reading from memory repeatedly.

> +	do {
> +		old_pteval = pteval;
> +		pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, new_pteval);
> +	} while (pteval != old_pteval);
This loop appears to be reinventing xchg_relaxed(). Any reason we can't
just use xchg_relaxed()? Also we had a dsb() after the WRITE_ONCE but 
you are using the _relaxed variant here. What is the justification for 
not having a barrier?

> +
> +	return !kvm_s2pte_readonly(&__pte(pteval));
> +}
> +#else
> +/**
> + * @ret: true if dirty status set by hardware is coverred.
> + */
> +static inline bool kvm_set_pte(pte_t *ptep, pte_t new_pte)
>   {
>   	WRITE_ONCE(*ptep, new_pte);
>   	dsb(ishst);
> +	return false;
>   }
> +#endif /* CONFIG_ARM64_HW_AFDBM */

You might be able to avoid this #ifdef by redefining old_logging as:

   old_logging = IS_ENABLED(CONFIG_ARM64_HW_AFDBM) && ...

I *think* the compiler should be able to kill the dead code and leave 
you with just the above when the config symbol is off.

Steve

>   
>   static inline void kvm_set_pmd(pmd_t *pmdp, pmd_t new_pmd)
>   {
> 



* Re: [PATCH 03/12] KVM: arm64: Report hardware dirty status of stage2 PTE if coverred
  2020-07-01 11:28   ` Steven Price
@ 2020-07-02 11:28     ` zhukeqian
  0 siblings, 0 replies; 16+ messages in thread
From: zhukeqian @ 2020-07-02 11:28 UTC (permalink / raw)
  To: Steven Price, linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Sean Christopherson, liangpeng10,
	Alexios Zavras, Mark Brown, Marc Zyngier, Thomas Gleixner,
	Will Deacon, Andrew Morton

Hi Steven,

On 2020/7/1 19:28, Steven Price wrote:
> Hi,
> 
> On 16/06/2020 10:35, Keqian Zhu wrote:
>> kvm_set_pte is called to replace a target PTE with a desired one.
>> We always do this without changing the desired one, but if dirty
>> status set by hardware is coverred, let the caller know.
>>
>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
>> ---
>>   arch/arm64/kvm/mmu.c | 36 +++++++++++++++++++++++++++++++++++-
>>   1 file changed, 35 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 5ad87bce23c0..27407153121b 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -194,11 +194,45 @@ static void clear_stage2_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr
>>       put_page(virt_to_page(pmd));
>>   }
>>   -static inline void kvm_set_pte(pte_t *ptep, pte_t new_pte)
>> +#ifdef CONFIG_ARM64_HW_AFDBM
>> +/**
>> + * @ret: true if dirty status set by hardware is coverred.
> 
> NIT: s/coverred/covered/, this is in several places.
> 
OK.
>> + */
>> +static bool kvm_set_pte(pte_t *ptep, pte_t new_pte)
>> +{
>> +    pteval_t old_pteval, new_pteval, pteval;
>> +    bool old_logging, new_no_write;
>> +
>> +    old_logging = kvm_hw_dbm_enabled() && !pte_none(*ptep) &&
>> +              kvm_s2pte_dbm(ptep);
>> +    new_no_write = pte_none(new_pte) || kvm_s2pte_readonly(&new_pte);
>> +
>> +    if (!old_logging || !new_no_write) {
>> +        WRITE_ONCE(*ptep, new_pte);
>> +        dsb(ishst);
>> +        return false;
>> +    }
>> +
>> +    new_pteval = pte_val(new_pte);
>> +    pteval = READ_ONCE(pte_val(*ptep));
> 
> This usage of *ptep looks wrong - it's read twice using READ_ONCE (once in kvm_s2pte_dbm()) and once without any decoration (in the pte_none() call). Which looks a bit dodgy and at the very least needs some justification. AFAICT you would be better taking a local copy and using that rather than reading from memory repeatedly.
> 
oh yes. Here we can use a local copy to get higher performance.

I am not sure there is a correctness problem here. I *think* we should always acquire the corresponding lock before manipulating PTs,
so no other agent will update PTs during our several PT accesses (except the MMU, but it just modifies AF and [S2]AP).
But am I missing something?

>> +    do {
>> +        old_pteval = pteval;
>> +        pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, new_pteval);
>> +    } while (pteval != old_pteval);
> This loop appears to be reinventing xchg_relaxed(). Any reason we can't just use xchg_relaxed()? Also we had a dsb() after the WRITE_ONCE but you are using the _relaxed variant here. What is the justification for not having a barrier?
> 
Aha, I have changed the code several times and it is equal to xchg_relaxed now, thanks.
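
That is, roughly (a sketch using the same variables as the code above):

    pteval = xchg_relaxed(&pte_val(*ptep), new_pteval);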

I used _relaxed here because I am not clear about the reason that we use _relaxed in kvm_set_s2XXX_XXX but use dsb() in kvm_set_pte.
But I will correct this in patch v2.
>> +
>> +    return !kvm_s2pte_readonly(&__pte(pteval));
>> +}
>> +#else
>> +/**
>> + * @ret: true if dirty status set by hardware is coverred.
>> + */
>> +static inline bool kvm_set_pte(pte_t *ptep, pte_t new_pte)
>>   {
>>       WRITE_ONCE(*ptep, new_pte);
>>       dsb(ishst);
>> +    return false;
>>   }
>> +#endif /* CONFIG_ARM64_HW_AFDBM */
> 
> You might be able to avoid this #ifdef by redefining old_logging as:
> 
>   old_logging = IS_ENABLED(CONFIG_ARM64_HW_AFDBM) && ...
> 
> I *think* the compiler should be able to kill the dead code and leave you with just the above when the config symbol is off.
> 
After my test, you are right ;-).


Thanks,
Keqian
> Steve
> 
>>     static inline void kvm_set_pmd(pmd_t *pmdp, pmd_t new_pmd)
>>   {
>>
> 
> .
> 

