* [RFC PATCH 0/7] kvm: arm64: Support stage2 hardware DBM
@ 2020-05-25 11:23 Keqian Zhu
  2020-05-25 11:24 ` [RFC PATCH 1/7] KVM: arm64: Add some basic functions for hw DBM Keqian Zhu
                   ` (7 more replies)
  0 siblings, 8 replies; 12+ messages in thread
From: Keqian Zhu @ 2020-05-25 11:23 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	wanghaibin.wang, zhengxiang9, Keqian Zhu

This patch series adds support for stage2 hardware DBM; for now it is
only used for the dirty log.

It works well under some migration test cases, including a VM with 4K
pages or 2M THP. I checked the SHA256 hash digest of all memory, and it
stays the same for the source VM and the destination VM, which means no
dirty pages are missed under hardware DBM.

However, there are some known issues that are not solved yet.

1. Some mechanisms that rely on "write permission fault" become invalid,
   such as kvm_set_pfn_dirty and "mmap page sharing" (see the sketch
   after this list).

   kvm_set_pfn_dirty is called in user_mem_abort when the guest takes a
   write fault. This guarantees the physical page will not be dropped
   directly when the host kernel reclaims memory. With hardware dirty
   management, we get no chance to call kvm_set_pfn_dirty.

   For the "mmap page sharing" mechanism, the host kernel allocates a
   new physical page when the guest writes a page that is shared with
   other page table entries. With hardware dirty management, we get no
   chance to do this either.

   I need to do some survey on how stage1 hardware DBM solves these
   problems. It would help if anyone can figure it out.

2. Page table modification races: though I have found and solved some
   data races that occur when the kernel changes page table entries, I
   suspect there are data races I am not aware of. It would be great if
   anyone can point them out.

3. Performance: on the Kunpeng 920 platform, KVM takes about 40ms per
   64GB of memory to traverse all PTEs and collect the dirty log; a
   512GB VM, for example, costs roughly 320ms per scan. This causes
   unbearable downtime for migration if the memory size is big. I will
   try to solve this problem in v1.
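
For reference, the write-fault path we lose is roughly this (simplified
from user_mem_abort() in virt/kvm/arm/mmu.c, details omitted):

	if (writable) {
		new_pte = kvm_s2pte_mkwrite(new_pte);
		/* Tell the host the page is dirty, so it is not
		 * dropped directly on memory reclaim. */
		kvm_set_pfn_dirty(pfn);
		mark_page_dirty(kvm, gfn);
	}

With hardware DBM, a write to a DBM-enabled page no longer faults, so
none of this runs.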

Keqian Zhu (7):
  KVM: arm64: Add some basic functions for hw DBM
  KVM: arm64: Set DBM bit of PTEs if hw DBM enabled
  KVM: arm64: Traverse page table entries when sync dirty log
  KVM: arm64: Stepwise write protect page table by mask bit
  kvm: arm64: Modify stage2 young mechanism to support hw DBM
  kvm: arm64: Save stage2 PTE dirty info if it is covered
  KVM: arm64: Enable stage2 hardware DBM

 arch/arm64/include/asm/kvm_host.h     |   1 +
 arch/arm64/include/asm/kvm_mmu.h      |  44 +++++-
 arch/arm64/include/asm/pgtable-prot.h |   1 +
 arch/arm64/include/asm/sysreg.h       |   2 +
 arch/arm64/kvm/reset.c                |   9 +-
 virt/kvm/arm/arm.c                    |   6 +-
 virt/kvm/arm/mmu.c                    | 202 ++++++++++++++++++++++++--
 7 files changed, 246 insertions(+), 19 deletions(-)

-- 
2.19.1


* [RFC PATCH 1/7] KVM: arm64: Add some basic functions for hw DBM
  2020-05-25 11:23 [RFC PATCH 0/7] kvm: arm64: Support stage2 hardware DBM Keqian Zhu
@ 2020-05-25 11:24 ` Keqian Zhu
  2020-05-25 11:24 ` [RFC PATCH 2/7] KVM: arm64: Set DBM bit of PTEs if hw DBM enabled Keqian Zhu
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Keqian Zhu @ 2020-05-25 11:24 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	wanghaibin.wang, zhengxiang9, Keqian Zhu, Peng Liang

Prepare some basic functions that will be used by the following
patches to support hardware DBM.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Peng Liang <liangpeng10@huawei.com>
---
 arch/arm64/include/asm/kvm_mmu.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 30b0e8d6b895..8df078f0ee67 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -285,6 +285,30 @@ static inline bool kvm_s2pud_young(pud_t pud)
 	return pud_young(pud);
 }
 
+#ifdef CONFIG_ARM64_HW_AFDBM
+static inline bool kvm_hw_dbm_enabled(void)
+{
+	return !!(read_sysreg(vtcr_el2) & VTCR_EL2_HD);
+}
+
+static inline void kvm_set_s2pte_dbm(pte_t *ptep)
+{
+	pteval_t old_pteval, pteval;
+
+	pteval = READ_ONCE(pte_val(*ptep));
+	do {
+		old_pteval = pteval;
+		pteval |= PTE_DBM;
+		pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
+	} while (pteval != old_pteval);
+}
+
+static inline bool kvm_s2pte_dbm(pte_t *ptep)
+{
+	return !!(READ_ONCE(pte_val(*ptep)) & PTE_DBM);
+}
+#endif /* CONFIG_ARM64_HW_AFDBM */
+
 #define hyp_pte_table_empty(ptep) kvm_page_empty(ptep)
 
 #ifdef __PAGETABLE_PMD_FOLDED
-- 
2.19.1


* [RFC PATCH 2/7] KVM: arm64: Set DBM bit of PTEs if hw DBM enabled
  2020-05-25 11:23 [RFC PATCH 0/7] kvm: arm64: Support stage2 hardware DBM Keqian Zhu
  2020-05-25 11:24 ` [RFC PATCH 1/7] KVM: arm64: Add some basic functions for hw DBM Keqian Zhu
@ 2020-05-25 11:24 ` Keqian Zhu
  2020-05-26 11:49   ` Catalin Marinas
  2020-05-25 11:24 ` [RFC PATCH 3/7] KVM: arm64: Traverse page table entries when sync dirty log Keqian Zhu
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 12+ messages in thread
From: Keqian Zhu @ 2020-05-25 11:24 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	wanghaibin.wang, zhengxiang9, Keqian Zhu, Peng Liang

In user_mem_abort, for the normal case (mem_type is PAGE_S2), set the
DBM bit of PTEs if hw DBM is enabled. We also check and set the DBM
bit when write protecting PTEs, so that things still work if we miss
some cases.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Peng Liang <liangpeng10@huawei.com>
---
 arch/arm64/include/asm/pgtable-prot.h |  1 +
 virt/kvm/arm/mmu.c                    | 14 +++++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 1305e28225fc..f9910ba2afd8 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -79,6 +79,7 @@ extern bool arm64_use_ng_mappings;
 	})
 
 #define PAGE_S2			__pgprot(_PROT_DEFAULT | PAGE_S2_MEMATTR(NORMAL) | PTE_S2_RDONLY | PAGE_S2_XN)
+#define PAGE_S2_DBM		__pgprot(_PROT_DEFAULT | PAGE_S2_MEMATTR(NORMAL) | PTE_S2_RDONLY | PAGE_S2_XN | PTE_DBM)
 #define PAGE_S2_DEVICE		__pgprot(_PROT_DEFAULT | PAGE_S2_MEMATTR(DEVICE_nGnRE) | PTE_S2_RDONLY | PTE_S2_XN)
 
 #define PAGE_NONE		__pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index e3b9ee268823..dc97988eb2e0 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1426,6 +1426,10 @@ static void stage2_wp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)
 	pte = pte_offset_kernel(pmd, addr);
 	do {
 		if (!pte_none(*pte)) {
+#ifdef CONFIG_ARM64_HW_AFDBM
+			if (kvm_hw_dbm_enabled() && !kvm_s2pte_dbm(pte))
+				kvm_set_s2pte_dbm(pte);
+#endif
 			if (!kvm_s2pte_readonly(pte))
 				kvm_set_s2pte_readonly(pte);
 		}
@@ -1827,7 +1831,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
 		ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
 	} else {
-		pte_t new_pte = kvm_pfn_pte(pfn, mem_type);
+		pte_t new_pte;
+
+#ifdef CONFIG_ARM64_HW_AFDBM
+		if (kvm_hw_dbm_enabled() &&
+		    pgprot_val(mem_type) == pgprot_val(PAGE_S2)) {
+			mem_type = PAGE_S2_DBM;
+		}
+#endif
+		new_pte = kvm_pfn_pte(pfn, mem_type);
 
 		if (writable) {
 			new_pte = kvm_s2pte_mkwrite(new_pte);
-- 
2.19.1


* [RFC PATCH 3/7] KVM: arm64: Traverse page table entries when sync dirty log
  2020-05-25 11:23 [RFC PATCH 0/7] kvm: arm64: Support stage2 hardware DBM Keqian Zhu
  2020-05-25 11:24 ` [RFC PATCH 1/7] KVM: arm64: Add some basic functions for hw DBM Keqian Zhu
  2020-05-25 11:24 ` [RFC PATCH 2/7] KVM: arm64: Set DBM bit of PTEs if hw DBM enabled Keqian Zhu
@ 2020-05-25 11:24 ` Keqian Zhu
  2020-05-25 11:24 ` [RFC PATCH 4/7] KVM: arm64: Stepwise write protect page table by mask bit Keqian Zhu
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Keqian Zhu @ 2020-05-25 11:24 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	wanghaibin.wang, zhengxiang9, Keqian Zhu, Peng Liang

With hardware management of dirty state, the dirty state is stored in
the page table entries, so we have to traverse the page table entries
when syncing the dirty log.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Peng Liang <liangpeng10@huawei.com>
---
 arch/arm64/include/asm/kvm_host.h |   1 +
 virt/kvm/arm/arm.c                |   6 +-
 virt/kvm/arm/mmu.c                | 127 ++++++++++++++++++++++++++++++
 3 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 32c8a675e5a4..916617d3fed6 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -480,6 +480,7 @@ u64 __kvm_call_hyp(void *hypfn, ...);
 
 void force_vm_exit(const cpumask_t *mask);
 void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
+int kvm_mmu_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot);
 
 int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
 		int exception_index);
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 48d0ec44ad77..975311fa3a27 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -1191,7 +1191,11 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 
 void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
 {
-
+#ifdef CONFIG_ARM64_HW_AFDBM
+	if (kvm_hw_dbm_enabled()) {
+		kvm_mmu_sync_dirty_log(kvm, memslot);
+	}
+#endif
 }
 
 void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index dc97988eb2e0..ff8df9702e04 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -2266,6 +2266,133 @@ int kvm_mmu_init(void)
 	return err;
 }
 
+#ifdef CONFIG_ARM64_HW_AFDBM
+/**
+ * stage2_sync_dirty_log_ptes() - synchronize dirty log from PMD range
+ * @kvm:        The KVM pointer
+ * @pmd:	pointer to pmd entry
+ * @addr:	range start address
+ * @end:	range end address
+ */
+static void stage2_sync_dirty_log_ptes(struct kvm *kvm, pmd_t *pmd,
+				       phys_addr_t addr, phys_addr_t end)
+{
+	pte_t *pte;
+
+	pte = pte_offset_kernel(pmd, addr);
+	do {
+		if (!pte_none(*pte) && !kvm_s2pte_readonly(pte)) {
+			mark_page_dirty(kvm, addr >> PAGE_SHIFT);
+		}
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+}
+
+/**
+ * stage2_sync_dirty_log_pmds() - synchronize dirty log from PUD range
+ * @kvm:	The KVM pointer
+ * @pud:	pointer to pud entry
+ * @addr:	range start address
+ * @end:	range end address
+ */
+static void stage2_sync_dirty_log_pmds(struct kvm *kvm, pud_t *pud,
+				       phys_addr_t addr, phys_addr_t end)
+{
+	pmd_t *pmd;
+	phys_addr_t next;
+
+	pmd = stage2_pmd_offset(kvm, pud, addr);
+	do {
+		next = stage2_pmd_addr_end(kvm, addr, end);
+		if (!pmd_none(*pmd) && !pmd_thp_or_huge(*pmd)) {
+			stage2_sync_dirty_log_ptes(kvm, pmd, addr, next);
+		}
+	} while (pmd++, addr = next, addr != end);
+}
+
+/**
+ * stage2_sync_dirty_log_puds() - synchronize dirty log from PGD range
+ * @kvm:	The KVM pointer
+ * @pgd:	pointer to pgd entry
+ * @addr:	range start address
+ * @end:	range end address
+ */
+static void stage2_sync_dirty_log_puds(struct kvm *kvm, pgd_t *pgd,
+				       phys_addr_t addr, phys_addr_t end)
+{
+	pud_t *pud;
+	phys_addr_t next;
+
+	pud = stage2_pud_offset(kvm, pgd, addr);
+	do {
+		next = stage2_pud_addr_end(kvm, addr, end);
+		if (!stage2_pud_none(kvm, *pud) && !stage2_pud_huge(kvm, *pud)) {
+			stage2_sync_dirty_log_pmds(kvm, pud, addr, next);
+		}
+	} while (pud++, addr = next, addr != end);
+}
+
+/**
+ * stage2_sync_dirty_log_range() - synchronize dirty log from stage2 memory
+ * region range
+ * @kvm:	The KVM pointer
+ * @addr:	Start address of range
+ * @end:	End address of range
+ */
+static void stage2_sync_dirty_log_range(struct kvm *kvm, phys_addr_t addr,
+					phys_addr_t end)
+{
+	pgd_t *pgd;
+	phys_addr_t next;
+
+	pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
+	do {
+		/*
+		 * Release kvm_mmu_lock periodically if the memory region is
+		 * large. Otherwise, we may see kernel panics with
+		 * CONFIG_DETECT_HUNG_TASK, CONFIG_LOCKUP_DETECTOR,
+		 * CONFIG_LOCKDEP. Additionally, holding the lock too long
+		 * will also starve other vCPUs. We have to also make sure
+		 * that the page tables are not freed while we released
+		 * the lock.
+		 */
+		cond_resched_lock(&kvm->mmu_lock);
+		if (!READ_ONCE(kvm->arch.pgd))
+			break;
+		next = stage2_pgd_addr_end(kvm, addr, end);
+		if (stage2_pgd_present(kvm, *pgd))
+			stage2_sync_dirty_log_puds(kvm, pgd, addr, next);
+	} while (pgd++, addr = next, addr != end);
+}
+
+/**
+ * kvm_mmu_sync_dirty_log() - synchronize dirty log from stage2 entries for
+ * memory slot
+ * @kvm:	The KVM pointer
+ * @slot:	The memory slot to synchronize dirty log
+ *
+ * Called to synchronize the dirty log (as marked by hw) when the
+ * KVM_GET_DIRTY_LOG operation is issued on a memory region. After this
+ * function returns, all dirty log information is collected into the
+ * memslot dirty_bitmap (hw may still modify page tables while this
+ * runs, so the log is only complete when the guest is stopped, but no
+ * dirty page is missed in the end). dirty_bitmap is then copied to userspace.
+ *
+ * Acquires kvm_mmu_lock. Called with kvm->slots_lock mutex acquired,
+ * serializing operations for VM memory regions.
+ */
+int kvm_mmu_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
+{
+	phys_addr_t start = memslot->base_gfn << PAGE_SHIFT;
+	phys_addr_t end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
+
+	spin_lock(&kvm->mmu_lock);
+	stage2_sync_dirty_log_range(kvm, start, end);
+	spin_unlock(&kvm->mmu_lock);
+
+	return 0;
+}
+#endif /* CONFIG_ARM64_HW_AFDBM */
+
 void kvm_arch_commit_memory_region(struct kvm *kvm,
 				   const struct kvm_userspace_memory_region *mem,
 				   struct kvm_memory_slot *old,
-- 
2.19.1


* [RFC PATCH 4/7] KVM: arm64: Stepwise write protect page table by mask bit
  2020-05-25 11:23 [RFC PATCH 0/7] kvm: arm64: Support stage2 hardware DBM Keqian Zhu
                   ` (2 preceding siblings ...)
  2020-05-25 11:24 ` [RFC PATCH 3/7] KVM: arm64: Traverse page table entries when sync dirty log Keqian Zhu
@ 2020-05-25 11:24 ` Keqian Zhu
  2020-05-25 11:24 ` [RFC PATCH 5/7] kvm: arm64: Modify stage2 young mechanism to support hw DBM Keqian Zhu
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Keqian Zhu @ 2020-05-25 11:24 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	wanghaibin.wang, zhengxiang9, Keqian Zhu

During dirty log clear, page table entries are write protected
according to a mask. Previously we write protected all entries
covered by the mask, from its ffs to its fls bit. Though there may
be zero bits within that range, we hold the KVM MMU lock, so we
won't write protect entries that we don't want to.

We are about to add support for hardware management of dirty state
on arm64, and then holding the KVM MMU lock will not be enough. We
should write protect entries stepwise, one mask bit at a time: e.g.
for mask 0b1011, write protect gfn_offset + 0, + 1 and + 3
individually, instead of the whole range [0, 4).

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 virt/kvm/arm/mmu.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index ff8df9702e04..779859b85d6d 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1568,10 +1568,16 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 		gfn_t gfn_offset, unsigned long mask)
 {
 	phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
-	phys_addr_t start = (base_gfn +  __ffs(mask)) << PAGE_SHIFT;
-	phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
+	phys_addr_t start, end;
+	u32 i;
 
-	stage2_wp_range(kvm, start, end);
+	for (i = __ffs(mask); i <= __fls(mask); i++) {
+		if (test_bit(i, &mask)) {
+			start = (base_gfn + i) << PAGE_SHIFT;
+			end = (base_gfn + i + 1) << PAGE_SHIFT;
+			stage2_wp_range(kvm, start, end);
+		}
+	}
 }
 
 /*
-- 
2.19.1


* [RFC PATCH 5/7] kvm: arm64: Modify stage2 young mechanism to support hw DBM
  2020-05-25 11:23 [RFC PATCH 0/7] kvm: arm64: Support stage2 hardware DBM Keqian Zhu
                   ` (3 preceding siblings ...)
  2020-05-25 11:24 ` [RFC PATCH 4/7] KVM: arm64: Stepwise write protect page table by mask bit Keqian Zhu
@ 2020-05-25 11:24 ` Keqian Zhu
  2020-05-25 11:24 ` [RFC PATCH 6/7] kvm: arm64: Save stage2 PTE dirty info if it is covered Keqian Zhu
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Keqian Zhu @ 2020-05-25 11:24 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	wanghaibin.wang, zhengxiang9, Keqian Zhu

Making page table entries young (setting the AF bit) should be done
atomically, to avoid overwriting dirty info that is set by hardware.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 arch/arm64/include/asm/kvm_mmu.h | 32 ++++++++++++++++++++++----------
 virt/kvm/arm/mmu.c               | 10 +++++-----
 2 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 8df078f0ee67..a4620d87e456 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -235,6 +235,18 @@ static inline void kvm_set_s2pte_readonly(pte_t *ptep)
 	} while (pteval != old_pteval);
 }
 
+static inline void kvm_set_s2pte_young(pte_t *ptep)
+{
+	pteval_t old_pteval, pteval;
+
+	pteval = READ_ONCE(pte_val(*ptep));
+	do {
+		old_pteval = pteval;
+		pteval |= PTE_AF;
+		pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
+	} while (pteval != old_pteval);
+}
+
 static inline bool kvm_s2pte_readonly(pte_t *ptep)
 {
 	return (READ_ONCE(pte_val(*ptep)) & PTE_S2_RDWR) == PTE_S2_RDONLY;
@@ -250,6 +262,11 @@ static inline void kvm_set_s2pmd_readonly(pmd_t *pmdp)
 	kvm_set_s2pte_readonly((pte_t *)pmdp);
 }
 
+static inline void kvm_set_s2pmd_young(pmd_t *pmdp)
+{
+	kvm_set_s2pte_young((pte_t *)pmdp);
+}
+
 static inline bool kvm_s2pmd_readonly(pmd_t *pmdp)
 {
 	return kvm_s2pte_readonly((pte_t *)pmdp);
@@ -265,6 +282,11 @@ static inline void kvm_set_s2pud_readonly(pud_t *pudp)
 	kvm_set_s2pte_readonly((pte_t *)pudp);
 }
 
+static inline void kvm_set_s2pud_young(pud_t *pudp)
+{
+	kvm_set_s2pte_young((pte_t *)pudp);
+}
+
 static inline bool kvm_s2pud_readonly(pud_t *pudp)
 {
 	return kvm_s2pte_readonly((pte_t *)pudp);
@@ -275,16 +297,6 @@ static inline bool kvm_s2pud_exec(pud_t *pudp)
 	return !(READ_ONCE(pud_val(*pudp)) & PUD_S2_XN);
 }
 
-static inline pud_t kvm_s2pud_mkyoung(pud_t pud)
-{
-	return pud_mkyoung(pud);
-}
-
-static inline bool kvm_s2pud_young(pud_t pud)
-{
-	return pud_young(pud);
-}
-
 #ifdef CONFIG_ARM64_HW_AFDBM
 static inline bool kvm_hw_dbm_enabled(void)
 {
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 779859b85d6d..e1d9e4b98cb6 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1888,15 +1888,15 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
 		goto out;
 
 	if (pud) {		/* HugeTLB */
-		*pud = kvm_s2pud_mkyoung(*pud);
+		kvm_set_s2pud_young(pud);
 		pfn = kvm_pud_pfn(*pud);
 		pfn_valid = true;
 	} else	if (pmd) {	/* THP, HugeTLB */
-		*pmd = pmd_mkyoung(*pmd);
+		kvm_set_s2pmd_young(pmd);
 		pfn = pmd_pfn(*pmd);
 		pfn_valid = true;
-	} else {
-		*pte = pte_mkyoung(*pte);	/* Just a page... */
+	} else {		/* Just a page... */
+		kvm_set_s2pte_young(pte);
 		pfn = pte_pfn(*pte);
 		pfn_valid = true;
 	}
@@ -2141,7 +2141,7 @@ static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *
 		return 0;
 
 	if (pud)
-		return kvm_s2pud_young(*pud);
+		return pud_young(*pud);
 	else if (pmd)
 		return pmd_young(*pmd);
 	else
-- 
2.19.1


* [RFC PATCH 6/7] kvm: arm64: Save stage2 PTE dirty info if it is covered
  2020-05-25 11:23 [RFC PATCH 0/7] kvm: arm64: Support stage2 hardware DBM Keqian Zhu
                   ` (4 preceding siblings ...)
  2020-05-25 11:24 ` [RFC PATCH 5/7] kvm: arm64: Modify stage2 young mechanism to support hw DBM Keqian Zhu
@ 2020-05-25 11:24 ` Keqian Zhu
  2020-05-25 11:24 ` [RFC PATCH 7/7] KVM: arm64: Enable stage2 hardware DBM Keqian Zhu
  2020-05-25 15:44 ` [RFC PATCH 0/7] kvm: arm64: Support " Marc Zyngier
  7 siblings, 0 replies; 12+ messages in thread
From: Keqian Zhu @ 2020-05-25 11:24 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	wanghaibin.wang, zhengxiang9, Keqian Zhu

kvm_set_pte is called to replace a target PTE with a desired one.
We always replace it, but if hw DBM is enabled and the dirty info
would be covered, we should let the caller know. The caller can then
decide whether to save the dirty info.

kvm_set_pmd and kvm_set_pud are not modified, because we only use
DBM in PTEs for now.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
 virt/kvm/arm/mmu.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index e1d9e4b98cb6..43d89c6333f0 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -185,10 +185,34 @@ static void clear_stage2_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr
 	put_page(virt_to_page(pmd));
 }
 
-static inline void kvm_set_pte(pte_t *ptep, pte_t new_pte)
+/*
+ * @ret: true if dirty info is covered.
+ */
+static inline bool kvm_set_pte(pte_t *ptep, pte_t new_pte)
 {
+#ifdef CONFIG_ARM64_HW_AFDBM
+	pteval_t old_pteval, new_pteval, pteval;
+
+	if (!kvm_hw_dbm_enabled() || pte_none(*ptep) ||
+	    !kvm_s2pte_readonly(&new_pte)) {
+		WRITE_ONCE(*ptep, new_pte);
+		dsb(ishst);
+		return false;
+	}
+
+	new_pteval = pte_val(new_pte);
+	pteval = READ_ONCE(pte_val(*ptep));
+	do {
+		old_pteval = pteval;
+		pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, new_pteval);
+	} while (pteval != old_pteval);
+
+	return !kvm_s2pte_readonly((pte_t *)&pteval);
+#else
 	WRITE_ONCE(*ptep, new_pte);
 	dsb(ishst);
+	return false;
+#endif
 }
 
 static inline void kvm_set_pmd(pmd_t *pmdp, pmd_t new_pmd)
@@ -249,7 +273,10 @@ static void unmap_stage2_ptes(struct kvm *kvm, pmd_t *pmd,
 		if (!pte_none(*pte)) {
 			pte_t old_pte = *pte;
 
-			kvm_set_pte(pte, __pte(0));
+			if (kvm_set_pte(pte, __pte(0))) {
+				mark_page_dirty(kvm, addr >> PAGE_SHIFT);
+			}
+
 			kvm_tlb_flush_vmid_ipa(kvm, addr);
 
 			/* No need to invalidate the cache for device mappings */
@@ -1291,13 +1318,17 @@ static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
 		if (pte_val(old_pte) == pte_val(*new_pte))
 			return 0;
 
-		kvm_set_pte(pte, __pte(0));
+		if (kvm_set_pte(pte, __pte(0))) {
+			mark_page_dirty(kvm, addr >> PAGE_SHIFT);
+		}
 		kvm_tlb_flush_vmid_ipa(kvm, addr);
 	} else {
 		get_page(virt_to_page(pte));
 	}
 
-	kvm_set_pte(pte, *new_pte);
+	if (kvm_set_pte(pte, *new_pte)) {
+		mark_page_dirty(kvm, addr >> PAGE_SHIFT);
+	}
 	return 0;
 }
 
-- 
2.19.1


* [RFC PATCH 7/7] KVM: arm64: Enable stage2 hardware DBM
  2020-05-25 11:23 [RFC PATCH 0/7] kvm: arm64: Support stage2 hardware DBM Keqian Zhu
                   ` (5 preceding siblings ...)
  2020-05-25 11:24 ` [RFC PATCH 6/7] kvm: arm64: Save stage2 PTE dirty info if it is covered Keqian Zhu
@ 2020-05-25 11:24 ` Keqian Zhu
  2020-05-25 15:44 ` [RFC PATCH 0/7] kvm: arm64: Support " Marc Zyngier
  7 siblings, 0 replies; 12+ messages in thread
From: Keqian Zhu @ 2020-05-25 11:24 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm
  Cc: Catalin Marinas, Marc Zyngier, James Morse, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Julien Thierry,
	Mark Brown, Thomas Gleixner, Andrew Morton, Alexios Zavras,
	wanghaibin.wang, zhengxiang9, Keqian Zhu, Peng Liang

We are now ready to support hw management of dirty state; enable it
if the hardware supports it.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Peng Liang <liangpeng10@huawei.com>
---
 arch/arm64/include/asm/sysreg.h | 2 ++
 arch/arm64/kvm/reset.c          | 9 ++++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index ebc622432831..371ea6d65c16 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -721,6 +721,8 @@
 #define ID_AA64MMFR1_VMIDBITS_8		0
 #define ID_AA64MMFR1_VMIDBITS_16	2
 
+#define ID_AA64MMFR1_HADBS_DBS		2
+
 /* id_aa64mmfr2 */
 #define ID_AA64MMFR2_E0PD_SHIFT		60
 #define ID_AA64MMFR2_FWB_SHIFT		40
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 30b7ea680f66..cb727e1fb581 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -392,7 +392,7 @@ int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type)
 {
 	u64 vtcr = VTCR_EL2_FLAGS;
 	u32 parange, phys_shift;
-	u8 lvls;
+	u8 lvls, hadbs;
 
 	if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
 		return -EINVAL;
@@ -428,6 +428,13 @@ int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type)
 	 */
 	vtcr |= VTCR_EL2_HA;
 
+	hadbs = (read_sysreg(id_aa64mmfr1_el1) >>
+			ID_AA64MMFR1_HADBS_SHIFT) & 0xf;
+#ifdef CONFIG_ARM64_HW_AFDBM
+	if (hadbs == ID_AA64MMFR1_HADBS_DBS)
+		vtcr |= VTCR_EL2_HD;
+#endif
+
 	/* Set the vmid bits */
 	vtcr |= (kvm_get_vmid_bits() == 16) ?
 		VTCR_EL2_VS_16BIT :
-- 
2.19.1


* Re: [RFC PATCH 0/7] kvm: arm64: Support stage2 hardware DBM
  2020-05-25 11:23 [RFC PATCH 0/7] kvm: arm64: Support stage2 hardware DBM Keqian Zhu
                   ` (6 preceding siblings ...)
  2020-05-25 11:24 ` [RFC PATCH 7/7] KVM: arm64: Enable stage2 hardware DBM Keqian Zhu
@ 2020-05-25 15:44 ` Marc Zyngier
  2020-05-26  2:08   ` zhukeqian
  7 siblings, 1 reply; 12+ messages in thread
From: Marc Zyngier @ 2020-05-25 15:44 UTC (permalink / raw)
  To: Keqian Zhu
  Cc: linux-kernel, linux-arm-kernel, kvmarm, kvm, Catalin Marinas,
	James Morse, Will Deacon, Suzuki K Poulose, Sean Christopherson,
	Julien Thierry, Mark Brown, Thomas Gleixner, Andrew Morton,
	Alexios Zavras, wanghaibin.wang, zhengxiang9

On 2020-05-25 12:23, Keqian Zhu wrote:
> This patch series adds support for stage2 hardware DBM; for now it is
> only used for the dirty log.
>
> It works well under some migration test cases, including a VM with 4K
> pages or 2M THP. I checked the SHA256 hash digest of all memory, and it
> stays the same for the source VM and the destination VM, which means no
> dirty pages are missed under hardware DBM.
>
> However, there are some known issues that are not solved yet.
>
> 1. Some mechanisms that rely on "write permission fault" become invalid,
>    such as kvm_set_pfn_dirty and "mmap page sharing".
>
>    kvm_set_pfn_dirty is called in user_mem_abort when the guest takes a
>    write fault. This guarantees the physical page will not be dropped
>    directly when the host kernel reclaims memory. With hardware dirty
>    management, we get no chance to call kvm_set_pfn_dirty.

Then you will end up with memory corruption under memory pressure.
This also breaks things like CoW, which we depend on.

>
>    For the "mmap page sharing" mechanism, the host kernel allocates a
>    new physical page when the guest writes a page that is shared with
>    other page table entries. With hardware dirty management, we get no
>    chance to do this either.
>
>    I need to do some survey on how stage1 hardware DBM solves these
>    problems. It would help if anyone can figure it out.
>
> 2. Page table modification races: though I have found and solved some
>    data races that occur when the kernel changes page table entries, I
>    suspect there are data races I am not aware of. It would be great if
>    anyone can point them out.
>
> 3. Performance: on the Kunpeng 920 platform, KVM takes about 40ms per
>    64GB of memory to traverse all PTEs and collect the dirty log; a
>    512GB VM, for example, costs roughly 320ms per scan. This causes
>    unbearable downtime for migration if the memory size is big. I will
>    try to solve this problem in v1.

This, in my opinion, is why Stage-2 DBM is fairly useless.
From a performance perspective, this is the worst possible
situation. You end up continuously scanning page tables, at
an arbitrary rate, without a way to evaluate the fault rate.

One thing S2-DBM would be useful for is SVA, where a device
write would mark the S2 PTs dirty as they are shared between
CPU and SMMU. Another thing is SPE, which is essentially a DMA
agent using the CPU's PTs.

But on its own, and just to log the dirty pages, S2-DBM is
pretty rubbish. I wish arm64 had something like Intel's PML,
which looks far more interesting for the purpose of tracking
accesses.

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...

* Re: [RFC PATCH 0/7] kvm: arm64: Support stage2 hardware DBM
  2020-05-25 15:44 ` [RFC PATCH 0/7] kvm: arm64: Support " Marc Zyngier
@ 2020-05-26  2:08   ` zhukeqian
  0 siblings, 0 replies; 12+ messages in thread
From: zhukeqian @ 2020-05-26  2:08 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: linux-kernel, linux-arm-kernel, kvmarm, kvm, Catalin Marinas,
	James Morse, Will Deacon, Suzuki K Poulose, Sean Christopherson,
	Julien Thierry, Mark Brown, Thomas Gleixner, Andrew Morton,
	Alexios Zavras, wanghaibin.wang, zhengxiang9

Hi Marc,

On 2020/5/25 23:44, Marc Zyngier wrote:
> On 2020-05-25 12:23, Keqian Zhu wrote:
>> This patch series add support for stage2 hardware DBM, and it is only
>> used for dirty log for now.
>>
>> It works well under some migration test cases, including VM with 4K
>> pages or 2M THP. I checked the SHA256 hash digest of all memory and
>> they keep same for source VM and destination VM, which means no dirty
>> pages is missed under hardware DBM.
>>
>> However, there are some known issues not solved.
>>
>> 1. Some mechanisms that rely on "write permission fault" become invalid,
>>    such as kvm_set_pfn_dirty and "mmap page sharing".
>>
>>    kvm_set_pfn_dirty is called in user_mem_abort when guest issues write
>>    fault. This guarantees physical page will not be dropped directly when
>>    host kernel recycle memory. After using hardware dirty management, we
>>    have no chance to call kvm_set_pfn_dirty.
> 
> Then you will end up with memory corruption under memory pressure.
> This also breaks things like CoW, which we depend on.
>
Yes, these problems look knotty, but I think x86 PML support faces
them too. I believe there must be some way to solve them.
>>
>>    For the "mmap page sharing" mechanism, the host kernel allocates a
>>    new physical page when the guest writes a page that is shared with
>>    other page table entries. With hardware dirty management, we get no
>>    chance to do this either.
>>
>>    I need to do some survey on how stage1 hardware DBM solves these
>>    problems. It would help if anyone can figure it out.
>>
>> 2. Page table modification races: though I have found and solved some
>>    data races that occur when the kernel changes page table entries, I
>>    suspect there are data races I am not aware of. It would be great if
>>    anyone can point them out.
>>
>> 3. Performance: on the Kunpeng 920 platform, KVM takes about 40ms per
>>    64GB of memory to traverse all PTEs and collect the dirty log; a
>>    512GB VM, for example, costs roughly 320ms per scan. This causes
>>    unbearable downtime for migration if the memory size is big. I will
>>    try to solve this problem in v1.
> 
> This, in my opinion, is why Stage-2 DBM is fairly useless.
> From a performance perspective, this is the worst possible
> situation. You end up continuously scanning page tables, at
> an arbitrary rate, without a way to evaluate the fault rate.
> 
> One thing S2-DBM would be useful for is SVA, where a device
> write would mark the S2 PTs dirty as they are shared between
> CPU and SMMU. Another thing is SPE, which is essentially a DMA
> agent using the CPU's PTs.
> 
> But on its own, and just to log the dirty pages, S2-DBM is
> pretty rubbish. I wish arm64 had something like Intel's PML,
> which looks far more interesting for the purpose of tracking
> accesses.

Sure, PML is a better solution for hardware management of dirty state.
However, compared to optimizing the hardware, optimizing the software
has a much shorter cycle time.

Here I have an optimization in mind to solve it: scanning the page
tables can be done in parallel, which can greatly reduce the time
consumption. Since there is almost no communication between the
parallel CPUs, we can achieve a high speedup ratio (see the sketch
below).
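
Just to sketch the idea (not real code; nr_sync_workers and
queue_sync_work() are made up here):

	phys_addr_t chunk, s, e;
	int i;

	/* Split the memslot IPA range into per-worker chunks. */
	chunk = ALIGN(DIV_ROUND_UP(end - start, nr_sync_workers),
		      PMD_SIZE);

	for (i = 0; i < nr_sync_workers; i++) {
		s = start + i * chunk;
		e = min(s + chunk, end);
		if (s >= end)
			break;
		/* Each worker scans and syncs [s, e) on its own CPU. */
		queue_sync_work(kvm, s, e);
	}

Each worker touches only its own sub-range of the page tables and the
dirty bitmap, so there is very little shared state between them.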


> 
> Thanks,
> 
>         M.
Thanks,
Keqian

* Re: [RFC PATCH 2/7] KVM: arm64: Set DBM bit of PTEs if hw DBM enabled
  2020-05-25 11:24 ` [RFC PATCH 2/7] KVM: arm64: Set DBM bit of PTEs if hw DBM enabled Keqian Zhu
@ 2020-05-26 11:49   ` Catalin Marinas
  2020-05-27  9:28     ` zhukeqian
  0 siblings, 1 reply; 12+ messages in thread
From: Catalin Marinas @ 2020-05-26 11:49 UTC (permalink / raw)
  To: Keqian Zhu
  Cc: linux-kernel, linux-arm-kernel, kvmarm, kvm, Marc Zyngier,
	James Morse, Will Deacon, Suzuki K Poulose, Sean Christopherson,
	Julien Thierry, Mark Brown, Thomas Gleixner, Andrew Morton,
	Alexios Zavras, wanghaibin.wang, zhengxiang9, Peng Liang

On Mon, May 25, 2020 at 07:24:01PM +0800, Keqian Zhu wrote:
> diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
> index 1305e28225fc..f9910ba2afd8 100644
> --- a/arch/arm64/include/asm/pgtable-prot.h
> +++ b/arch/arm64/include/asm/pgtable-prot.h
> @@ -79,6 +79,7 @@ extern bool arm64_use_ng_mappings;
>  	})
>  
>  #define PAGE_S2			__pgprot(_PROT_DEFAULT | PAGE_S2_MEMATTR(NORMAL) | PTE_S2_RDONLY | PAGE_S2_XN)
> +#define PAGE_S2_DBM		__pgprot(_PROT_DEFAULT | PAGE_S2_MEMATTR(NORMAL) | PTE_S2_RDONLY | PAGE_S2_XN | PTE_DBM)

You don't need a new page permission (see below).

>  #define PAGE_S2_DEVICE		__pgprot(_PROT_DEFAULT | PAGE_S2_MEMATTR(DEVICE_nGnRE) | PTE_S2_RDONLY | PTE_S2_XN)
>  
>  #define PAGE_NONE		__pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index e3b9ee268823..dc97988eb2e0 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1426,6 +1426,10 @@ static void stage2_wp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)
>  	pte = pte_offset_kernel(pmd, addr);
>  	do {
>  		if (!pte_none(*pte)) {
> +#ifdef CONFIG_ARM64_HW_AFDBM
> +			if (kvm_hw_dbm_enabled() && !kvm_s2pte_dbm(pte))
> +				kvm_set_s2pte_dbm(pte);
> +#endif
>  			if (!kvm_s2pte_readonly(pte))
>  				kvm_set_s2pte_readonly(pte);
>  		}

Setting the DBM bit is equivalent to marking the page writable. The
actual writable pte bit (S2AP[1] or HAP[2] as we call them in Linux for
legacy reasons) tells you whether the page has been dirtied but it is
still writable if you set DBM. Doing this in stage2_wp_ptes()
practically means that you no longer have read-only pages at S2. There
are several good reasons why you don't want to break this. For example,
the S2 pte may already be read-only for other reasons (CoW).

I think you should only set the DBM bit if the pte was previously
writable. In addition, any permission change to the S2 pte must take
into account the DBM bit and clear it while transferring the dirty
status to the underlying page. I'm not deeply familiar with all these
callbacks into KVM but two such paths are kvm_unmap_hva_range() and the
kvm_mmu_notifier_change_pte().
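
Something along these lines in those paths (completely untested, just
to illustrate; kvm_s2pte_readonly() is the helper you already have):

	pte_t old_pte = READ_ONCE(*pte);

	/*
	 * DBM set and currently writable means hw may have dirtied the
	 * page behind our back. Transfer the dirty status to the
	 * underlying page and clear DBM before changing the permissions.
	 */
	if ((pte_val(old_pte) & PTE_DBM) && !kvm_s2pte_readonly(&old_pte)) {
		kvm_set_pfn_dirty(pte_pfn(old_pte));
		pte_val(old_pte) &= ~PTE_DBM;
		WRITE_ONCE(*pte, old_pte);
	}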


> @@ -1827,7 +1831,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  
>  		ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
>  	} else {
> -		pte_t new_pte = kvm_pfn_pte(pfn, mem_type);
> +		pte_t new_pte;
> +
> +#ifdef CONFIG_ARM64_HW_AFDBM
> +		if (kvm_hw_dbm_enabled() &&
> +		    pgprot_val(mem_type) == pgprot_val(PAGE_S2)) {
> +			mem_type = PAGE_S2_DBM;
> +		}
> +#endif
> +		new_pte = kvm_pfn_pte(pfn, mem_type);
>  
>  		if (writable) {
>  			new_pte = kvm_s2pte_mkwrite(new_pte);

That's wrong here. Basically for any fault you get, you just turn the S2
page writable. The point of DBM is that you don't get write faults at
all if you have a writable page. So, as I said above, only set the DBM
bit if you stored a writable S2 pte (kvm_s2pte_mkwrite()).
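
IOW, something like this (untested):

		pte_t new_pte = kvm_pfn_pte(pfn, mem_type);

		if (writable) {
			new_pte = kvm_s2pte_mkwrite(new_pte);
			/*
			 * The pte is writable anyway, so DBM only means
			 * that hw can set the dirty state again after a
			 * later write protection, instead of faulting.
			 */
			if (kvm_hw_dbm_enabled())
				new_pte = __pte(pte_val(new_pte) | PTE_DBM);
		}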

-- 
Catalin

* Re: [RFC PATCH 2/7] KVM: arm64: Set DBM bit of PTEs if hw DBM enabled
  2020-05-26 11:49   ` Catalin Marinas
@ 2020-05-27  9:28     ` zhukeqian
  0 siblings, 0 replies; 12+ messages in thread
From: zhukeqian @ 2020-05-27  9:28 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: linux-kernel, linux-arm-kernel, kvmarm, kvm, Marc Zyngier,
	James Morse, Will Deacon, Suzuki K Poulose, Sean Christopherson,
	Julien Thierry, Mark Brown, Thomas Gleixner, Andrew Morton,
	Alexios Zavras, wanghaibin.wang, zhengxiang9, Peng Liang

Hi Catalin,

On 2020/5/26 19:49, Catalin Marinas wrote:
> On Mon, May 25, 2020 at 07:24:01PM +0800, Keqian Zhu wrote:
>> diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
>> index 1305e28225fc..f9910ba2afd8 100644
>> --- a/arch/arm64/include/asm/pgtable-prot.h
>> +++ b/arch/arm64/include/asm/pgtable-prot.h
>> @@ -79,6 +79,7 @@ extern bool arm64_use_ng_mappings;
>>  	})
>>  
>>  #define PAGE_S2			__pgprot(_PROT_DEFAULT | PAGE_S2_MEMATTR(NORMAL) | PTE_S2_RDONLY | PAGE_S2_XN)
>> +#define PAGE_S2_DBM		__pgprot(_PROT_DEFAULT | PAGE_S2_MEMATTR(NORMAL) | PTE_S2_RDONLY | PAGE_S2_XN | PTE_DBM)
> 
> You don't need a new page permission (see below).
> 
>>  #define PAGE_S2_DEVICE		__pgprot(_PROT_DEFAULT | PAGE_S2_MEMATTR(DEVICE_nGnRE) | PTE_S2_RDONLY | PTE_S2_XN)
>>  
>>  #define PAGE_NONE		__pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index e3b9ee268823..dc97988eb2e0 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -1426,6 +1426,10 @@ static void stage2_wp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)
>>  	pte = pte_offset_kernel(pmd, addr);
>>  	do {
>>  		if (!pte_none(*pte)) {
>> +#ifdef CONFIG_ARM64_HW_AFDBM
>> +			if (kvm_hw_dbm_enabled() && !kvm_s2pte_dbm(pte))
>> +				kvm_set_s2pte_dbm(pte);
>> +#endif
>>  			if (!kvm_s2pte_readonly(pte))
>>  				kvm_set_s2pte_readonly(pte);
>>  		}
> 
> Setting the DBM bit is equivalent to marking the page writable. The
> actual writable pte bit (S2AP[1] or HAP[2] as we call them in Linux for
> legacy reasons) tells you whether the page has been dirtied but it is
> still writable if you set DBM. Doing this in stage2_wp_ptes()
> practically means that you no longer have read-only pages at S2. There
> are several good reasons why you don't want to break this. For example,
> the S2 pte may already be read-only for other reasons (CoW).
> 
Thanks, your comments help to solve the first problem in the cover letter.

> I think you should only set the DBM bit if the pte was previously
> writable. In addition, any permission change to the S2 pte must take
> into account the DBM bit and clear it while transferring the dirty
> status to the underlying page. I'm not deeply familiar with all these
> callbacks into KVM but two such paths are kvm_unmap_hva_range() and the
> kvm_mmu_notifier_change_pte().
Yes, I agree.
> 
> 
>> @@ -1827,7 +1831,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  
>>  		ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
>>  	} else {
>> -		pte_t new_pte = kvm_pfn_pte(pfn, mem_type);
>> +		pte_t new_pte;
>> +
>> +#ifdef CONFIG_ARM64_HW_AFDBM
>> +		if (kvm_hw_dbm_enabled() &&
>> +		    pgprot_val(mem_type) == pgprot_val(PAGE_S2)) {
>> +			mem_type = PAGE_S2_DBM;
>> +		}
>> +#endif
>> +		new_pte = kvm_pfn_pte(pfn, mem_type);
>>  
>>  		if (writable) {
>>  			new_pte = kvm_s2pte_mkwrite(new_pte);
> 
> That's wrong here. Basically for any fault you get, you just turn the S2
> page writable. The point of DBM is that you don't get write faults at
> all if you have a writable page. So, as I said above, only set the DBM
> bit if you stored a writable S2 pte (kvm_s2pte_mkwrite()).
Yeah, you are right. I will correct it in v1.
> 

Thanks,
Keqian

