* [RFC PATCH v3 0/2] KVM: arm64: Improve efficiency of stage2 page table
@ 2021-03-26  3:16 Yanan Wang
  2021-03-26  3:16 ` [RFC PATCH v3 1/2] KVM: arm64: Move CMOs from user_mem_abort to the fault handlers Yanan Wang
  2021-03-26  3:16 ` [RFC PATCH v3 2/2] KVM: arm64: Distinguish cases of memcache allocations completely Yanan Wang
  0 siblings, 2 replies; 9+ messages in thread
From: Yanan Wang @ 2021-03-26  3:16 UTC (permalink / raw)
  To: Marc Zyngier, Will Deacon, Alexandru Elisei, Catalin Marinas,
	kvmarm, linux-arm-kernel, kvm, linux-kernel
  Cc: James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, zhukeqian1, yuzenghui,
	Yanan Wang

Hi,

This is a new version of the series [1] that I posted before. It makes some
efficiency improvements to the stage2 page table code, and there are test results
to quantify the benefit of each patch.
[1] v2: https://lore.kernel.org/lkml/20210310094319.18760-1-wangyanan55@huawei.com/

Although there hasn't been any feedback on v2, the series has changed
significantly after plenty of discussion with Alexandru Elisei. The conclusion
was drawn that CMOs are still needed for the scenario of coalescing tables, and
as a result the benefit of patch #3 in v2 becomes rather small judging from the
test results. So drop that patch and keep the others, which remain meaningful.

Changelogs:
v2->v3:
- drop patch #3 in v2
- retest v3 based on v5.12-rc2

v1->v2:
- rebased on top of mainline v5.12-rc2
- also move CMOs of I-cache to the fault handlers
- retest v2 based on v5.12-rc2
- v1: https://lore.kernel.org/lkml/20210208112250.163568-1-wangyanan55@huawei.com/

About this v3 series:
Patch #1:
We currently perform CMOs of the D-cache and I-cache uniformly in
user_mem_abort() before calling the fault handlers. If we get concurrent
guest faults (e.g. translation faults, permission faults) or some really
unnecessary guest faults caused by BBM, the CMOs for the first vcpu are
necessary while those for the later vcpus are not.

By moving CMOs to the fault handlers, we can easily identify the conditions
where they are really needed and avoid the unnecessary ones. Performing CMOs
is time consuming, especially when flushing a block range, so this solution
considerably reduces the load on KVM and improves the efficiency of the page
table code.

So let's move both clean of D-cache and invalidation of I-cache to the
map path and move only invalidation of I-cache to the permission path.
Since the original APIs for CMOs in mmu.c are only called in function
user_mem_abort, we now also move them to pgtable.c.
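
As a rough illustration of the conditions the handlers can check once the CMOs
live there, here is a small self-contained model. It is not the kernel code:
the bit positions, the model_ names and the struct are made up for this sketch,
and only the boolean conditions mirror the checks added by patch #1.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PTE layout for this model only; not the real stage-2 format. */
#define MODEL_PTE_VALID     (1ULL << 0)
#define MODEL_PTE_CACHEABLE (1ULL << 1)
#define MODEL_PTE_XN        (1ULL << 2) /* execute-never */

/* Which cache maintenance operations a handler would perform. */
struct model_cmos {
	bool clean_dcache;
	bool inval_icache;
};

static bool model_pte_valid(uint64_t pte)
{
	return (pte & MODEL_PTE_VALID) != 0;
}

static bool model_pte_cacheable(uint64_t pte)
{
	return (pte & MODEL_PTE_CACHEABLE) != 0;
}

static bool model_pte_executable(uint64_t pte)
{
	return (pte & MODEL_PTE_XN) == 0;
}

/* Map path: D-cache clean and I-cache invalidation can both be needed. */
static struct model_cmos model_map_path_cmos(uint64_t old_pte, uint64_t new_pte)
{
	struct model_cmos c = {
		.clean_dcache = !model_pte_valid(old_pte) ||
				model_pte_cacheable(old_pte),
		.inval_icache = model_pte_executable(new_pte),
	};
	return c;
}

/* Permission path: only I-cache invalidation, and only when adding exec. */
static struct model_cmos model_perm_path_cmos(uint64_t old_pte, uint64_t new_pte)
{
	struct model_cmos c = {
		.clean_dcache = false,
		.inval_icache = !model_pte_executable(old_pte) &&
				model_pte_executable(new_pte),
	};
	return c;
}
```

In this model, a permission update that does not add the executable permission
performs no CMO at all, which is the kind of case the series aims to skip.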

The following results represent the benefit of patch #1 alone, and they
were tested by [2] (kvm/selftest) that I have posted recently.
[2] https://lore.kernel.org/lkml/20210302125751.19080-1-wangyanan55@huawei.com/

When there are multiple vcpus concurrently accessing the same memory region,
we can test the execution time of KVM creating new mappings, updating the
permissions of old mappings from RO to RW, and rebuilding the blocks after
they have been split.

hardware platform: HiSilicon Kunpeng920 Server
host kernel: Linux mainline v5.12-rc2

cmdline: ./kvm_page_table_test -m 4 -s anonymous -b 1G -v 80
           (80 vcpus, 1G memory, page mappings(normal 4K))
KVM_CREATE_MAPPINGS: before 104.35s -> after  90.42s  +13.35%
KVM_UPDATE_MAPPINGS: before  78.64s -> after  75.45s  + 4.06%

cmdline: ./kvm_page_table_test -m 4 -s anonymous_thp -b 20G -v 40
           (40 vcpus, 20G memory, block mappings(THP 2M))
KVM_CREATE_MAPPINGS: before  15.66s -> after   6.92s  +55.80%
KVM_UPDATE_MAPPINGS: before 178.80s -> after 123.35s  +31.00%
KVM_REBUILD_BLOCKS:  before 187.34s -> after 131.76s  +30.65%

cmdline: ./kvm_page_table_test -m 4 -s anonymous_hugetlb_1gb -b 20G -v 40
           (40 vcpus, 20G memory, block mappings(HUGETLB 1G))
KVM_CREATE_MAPPINGS: before 104.54s -> after   3.70s  +96.46%
KVM_UPDATE_MAPPINGS: before 174.20s -> after 115.94s  +33.44%
KVM_REBUILD_BLOCKS:  before 103.95s -> after   2.96s  +97.15%
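
The percentage column appears to be the reduction in execution time relative
to the "before" value. That reading is an assumption, checked here only against
the 4K KVM_CREATE_MAPPINGS row and the 1G KVM_CREATE_MAPPINGS row; the helper
name below is hypothetical.

```c
/*
 * Relative reduction in execution time, as a percentage of the
 * "before" time. Assumed formula: (before - after) / before * 100.
 */
static double time_reduction_pct(double before_s, double after_s)
{
	return (before_s - after_s) / before_s * 100.0;
}
```

For instance, time_reduction_pct(104.35, 90.42) lands close to the 13.35%
reported for the 4K case.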

Patch #2:
A new method to distinguish cases of memcache allocations is introduced.
By comparing fault_granule and vma_pagesize, cases that require allocations
from memcache and cases that don't can be distinguished completely.
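
A minimal sketch of that comparison, assuming a 4K base granule with 2M and 1G
block sizes. The MODEL_ constants and the function name are hypothetical; only
the comparison itself comes from patch #2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed mapping sizes for a 4K base granule. */
#define MODEL_SZ_4K (1ULL << 12)
#define MODEL_SZ_2M (1ULL << 21)
#define MODEL_SZ_1G (1ULL << 30)

/*
 * The memcache is needed only when the fault happened at a level whose
 * granule is larger than the size we are about to map, because only then
 * do new page-table pages have to be created.
 */
static bool model_needs_memcache_topup(uint64_t fault_granule,
				       uint64_t vma_pagesize)
{
	return fault_granule > vma_pagesize;
}
```

With these sizes, a translation fault at the last level with a 4K page
(4K vs 4K) needs no top-up, while a write fault on a 2M block that must be
mapped with 4K pages during dirty logging (2M vs 4K) does.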

Yanan Wang (2):
  KVM: arm64: Move CMOs from user_mem_abort to the fault handlers
  KVM: arm64: Distinguish cases of memcache allocations completely

 arch/arm64/include/asm/kvm_mmu.h | 31 ---------------
 arch/arm64/kvm/hyp/pgtable.c     | 68 +++++++++++++++++++++++++-------
 arch/arm64/kvm/mmu.c             | 48 ++++++++--------------
 3 files changed, 69 insertions(+), 78 deletions(-)

-- 
2.19.1



* [RFC PATCH v3 1/2] KVM: arm64: Move CMOs from user_mem_abort to the fault handlers
  2021-03-26  3:16 [RFC PATCH v3 0/2] KVM: arm64: Improve efficiency of stage2 page table Yanan Wang
@ 2021-03-26  3:16 ` Yanan Wang
  2021-04-07 15:31   ` Alexandru Elisei
  2021-03-26  3:16 ` [RFC PATCH v3 2/2] KVM: arm64: Distinguish cases of memcache allocations completely Yanan Wang
  1 sibling, 1 reply; 9+ messages in thread
From: Yanan Wang @ 2021-03-26  3:16 UTC (permalink / raw)
  To: Marc Zyngier, Will Deacon, Alexandru Elisei, Catalin Marinas,
	kvmarm, linux-arm-kernel, kvm, linux-kernel
  Cc: James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, zhukeqian1, yuzenghui,
	Yanan Wang

We currently perform CMOs of the D-cache and I-cache uniformly in
user_mem_abort() before calling the fault handlers. If we get concurrent
guest faults (e.g. translation faults, permission faults) or some really
unnecessary guest faults caused by BBM, the CMOs for the first vcpu are
necessary while those for the later vcpus are not.

By moving CMOs to the fault handlers, we can easily identify the conditions
where they are really needed and avoid the unnecessary ones. Performing CMOs
is time consuming, especially when flushing a block range, so this solution
considerably reduces the load on KVM and improves the efficiency of the page
table code.

So let's move both clean of D-cache and invalidation of I-cache to the
map path and move only invalidation of I-cache to the permission path.
Since the original APIs for CMOs in mmu.c are only called in function
user_mem_abort, we now also move them to pgtable.c.

Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
---
 arch/arm64/include/asm/kvm_mmu.h | 31 ---------------
 arch/arm64/kvm/hyp/pgtable.c     | 68 +++++++++++++++++++++++++-------
 arch/arm64/kvm/mmu.c             | 23 ++---------
 3 files changed, 57 insertions(+), 65 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 90873851f677..c31f88306d4e 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -177,37 +177,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
 	return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
 }
 
-static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
-{
-	void *va = page_address(pfn_to_page(pfn));
-
-	/*
-	 * With FWB, we ensure that the guest always accesses memory using
-	 * cacheable attributes, and we don't have to clean to PoC when
-	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
-	 * PoU is not required either in this case.
-	 */
-	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
-		return;
-
-	kvm_flush_dcache_to_poc(va, size);
-}
-
-static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
-						  unsigned long size)
-{
-	if (icache_is_aliasing()) {
-		/* any kind of VIPT cache */
-		__flush_icache_all();
-	} else if (is_kernel_in_hyp_mode() || !icache_is_vpipt()) {
-		/* PIPT or VPIPT at EL2 (see comment in __kvm_tlb_flush_vmid_ipa) */
-		void *va = page_address(pfn_to_page(pfn));
-
-		invalidate_icache_range((unsigned long)va,
-					(unsigned long)va + size);
-	}
-}
-
 void kvm_set_way_flush(struct kvm_vcpu *vcpu);
 void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled);
 
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 4d177ce1d536..829a34eea526 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -464,6 +464,43 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
 	return 0;
 }
 
+static bool stage2_pte_cacheable(kvm_pte_t pte)
+{
+	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
+	return memattr == PAGE_S2_MEMATTR(NORMAL);
+}
+
+static bool stage2_pte_executable(kvm_pte_t pte)
+{
+	return !(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN);
+}
+
+static void stage2_flush_dcache(void *addr, u64 size)
+{
+	/*
+	 * With FWB, we ensure that the guest always accesses memory using
+	 * cacheable attributes, and we don't have to clean to PoC when
+	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
+	 * PoU is not required either in this case.
+	 */
+	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
+		return;
+
+	__flush_dcache_area(addr, size);
+}
+
+static void stage2_invalidate_icache(void *addr, u64 size)
+{
+	if (icache_is_aliasing()) {
+		/* Flush any kind of VIPT icache */
+		__flush_icache_all();
+	} else if (is_kernel_in_hyp_mode() || !icache_is_vpipt()) {
+		/* PIPT or VPIPT at EL2 */
+		invalidate_icache_range((unsigned long)addr,
+					(unsigned long)addr + size);
+	}
+}
+
 static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 				      kvm_pte_t *ptep,
 				      struct stage2_map_data *data)
@@ -495,6 +532,13 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 		put_page(page);
 	}
 
+	/* Perform CMOs before installation of the new PTE */
+	if (!kvm_pte_valid(old) || stage2_pte_cacheable(old))
+		stage2_flush_dcache(__va(phys), granule);
+
+	if (stage2_pte_executable(new))
+		stage2_invalidate_icache(__va(phys), granule);
+
 	smp_store_release(ptep, new);
 	get_page(page);
 	data->phys += granule;
@@ -651,20 +695,6 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
 	return ret;
 }
 
-static void stage2_flush_dcache(void *addr, u64 size)
-{
-	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
-		return;
-
-	__flush_dcache_area(addr, size);
-}
-
-static bool stage2_pte_cacheable(kvm_pte_t pte)
-{
-	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
-	return memattr == PAGE_S2_MEMATTR(NORMAL);
-}
-
 static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 			       enum kvm_pgtable_walk_flags flag,
 			       void * const arg)
@@ -743,8 +773,16 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	 * but worst-case the access flag update gets lost and will be
 	 * set on the next access instead.
 	 */
-	if (data->pte != pte)
+	if (data->pte != pte) {
+		/*
+		 * Invalidate the instruction cache before updating
+		 * if we are going to add the executable permission.
+		 */
+		if (!stage2_pte_executable(*ptep) && stage2_pte_executable(pte))
+			stage2_invalidate_icache(kvm_pte_follow(pte),
+						 kvm_granule_size(level));
 		WRITE_ONCE(*ptep, pte);
+	}
 
 	return 0;
 }
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 77cb2d28f2a4..1eec9f63bc6f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -609,16 +609,6 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
-static void clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
-{
-	__clean_dcache_guest_page(pfn, size);
-}
-
-static void invalidate_icache_guest_page(kvm_pfn_t pfn, unsigned long size)
-{
-	__invalidate_icache_guest_page(pfn, size);
-}
-
 static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
 {
 	send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
@@ -882,13 +872,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (writable)
 		prot |= KVM_PGTABLE_PROT_W;
 
-	if (fault_status != FSC_PERM && !device)
-		clean_dcache_guest_page(pfn, vma_pagesize);
-
-	if (exec_fault) {
+	if (exec_fault)
 		prot |= KVM_PGTABLE_PROT_X;
-		invalidate_icache_guest_page(pfn, vma_pagesize);
-	}
 
 	if (device)
 		prot |= KVM_PGTABLE_PROT_DEVICE;
@@ -1144,10 +1129,10 @@ int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
 	trace_kvm_set_spte_hva(hva);
 
 	/*
-	 * We've moved a page around, probably through CoW, so let's treat it
-	 * just like a translation fault and clean the cache to the PoC.
+	 * We've moved a page around, probably through CoW, so let's treat
+	 * it just like a translation fault and the map handler will clean
+	 * the cache to the PoC.
 	 */
-	clean_dcache_guest_page(pfn, PAGE_SIZE);
 	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pfn);
 	return 0;
 }
-- 
2.19.1



* [RFC PATCH v3 2/2] KVM: arm64: Distinguish cases of memcache allocations completely
  2021-03-26  3:16 [RFC PATCH v3 0/2] KVM: arm64: Improve efficiency of stage2 page table Yanan Wang
  2021-03-26  3:16 ` [RFC PATCH v3 1/2] KVM: arm64: Move CMOs from user_mem_abort to the fault handlers Yanan Wang
@ 2021-03-26  3:16 ` Yanan Wang
  2021-04-07 15:35   ` Alexandru Elisei
  1 sibling, 1 reply; 9+ messages in thread
From: Yanan Wang @ 2021-03-26  3:16 UTC (permalink / raw)
  To: Marc Zyngier, Will Deacon, Alexandru Elisei, Catalin Marinas,
	kvmarm, linux-arm-kernel, kvm, linux-kernel
  Cc: James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, zhukeqian1, yuzenghui,
	Yanan Wang

With a guest translation fault, the memcache pages are not needed if KVM
is only about to install a new leaf entry into the existing page table.
And with a guest permission fault, the memcache pages are also not needed
for a write_fault in dirty-logging time if KVM is only about to update
the existing leaf entry instead of collapsing a block entry into a table.

By comparing fault_granule and vma_pagesize, cases that require allocations
from memcache and cases that don't can be distinguished completely.

Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
---
 arch/arm64/kvm/mmu.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 1eec9f63bc6f..05af40dc60c1 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -810,19 +810,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	gfn = fault_ipa >> PAGE_SHIFT;
 	mmap_read_unlock(current->mm);
 
-	/*
-	 * Permission faults just need to update the existing leaf entry,
-	 * and so normally don't require allocations from the memcache. The
-	 * only exception to this is when dirty logging is enabled at runtime
-	 * and a write fault needs to collapse a block entry into a table.
-	 */
-	if (fault_status != FSC_PERM || (logging_active && write_fault)) {
-		ret = kvm_mmu_topup_memory_cache(memcache,
-						 kvm_mmu_cache_min_pages(kvm));
-		if (ret)
-			return ret;
-	}
-
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	/*
 	 * Ensure the read of mmu_notifier_seq happens before we call
@@ -880,6 +867,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
 		prot |= KVM_PGTABLE_PROT_X;
 
+	/*
+	 * Allocations from the memcache are required only when granule of the
+	 * lookup level where the guest fault happened exceeds vma_pagesize,
+	 * which means new page tables will be created in the fault handlers.
+	 */
+	if (fault_granule > vma_pagesize) {
+		ret = kvm_mmu_topup_memory_cache(memcache,
+						 kvm_mmu_cache_min_pages(kvm));
+		if (ret)
+			return ret;
+	}
+
 	/*
 	 * Under the premise of getting a FSC_PERM fault, we just need to relax
 	 * permissions only if vma_pagesize equals fault_granule. Otherwise,
-- 
2.19.1



* Re: [RFC PATCH v3 1/2] KVM: arm64: Move CMOs from user_mem_abort to the fault handlers
  2021-03-26  3:16 ` [RFC PATCH v3 1/2] KVM: arm64: Move CMOs from user_mem_abort to the fault handlers Yanan Wang
@ 2021-04-07 15:31   ` Alexandru Elisei
  2021-04-07 20:57     ` Will Deacon
  2021-04-08  9:23     ` wangyanan (Y)
  0 siblings, 2 replies; 9+ messages in thread
From: Alexandru Elisei @ 2021-04-07 15:31 UTC (permalink / raw)
  To: Yanan Wang, Marc Zyngier, Will Deacon, Catalin Marinas, kvmarm,
	linux-arm-kernel, kvm, linux-kernel
  Cc: James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, zhukeqian1, yuzenghui

Hi Yanan,

On 3/26/21 3:16 AM, Yanan Wang wrote:
> We currently uniformly permorm CMOs of D-cache and I-cache in function
> user_mem_abort before calling the fault handlers. If we get concurrent
> guest faults(e.g. translation faults, permission faults) or some really
> unnecessary guest faults caused by BBM, CMOs for the first vcpu are

I can't figure out what BBM means.

> necessary while the others later are not.
>
> By moving CMOs to the fault handlers, we can easily identify conditions
> where they are really needed and avoid the unnecessary ones. As it's a
> time consuming process to perform CMOs especially when flushing a block
> range, so this solution reduces much load of kvm and improve efficiency
> of the page table code.
>
> So let's move both clean of D-cache and invalidation of I-cache to the
> map path and move only invalidation of I-cache to the permission path.
> Since the original APIs for CMOs in mmu.c are only called in function
> user_mem_abort, we now also move them to pgtable.c.
>
> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
> ---
>  arch/arm64/include/asm/kvm_mmu.h | 31 ---------------
>  arch/arm64/kvm/hyp/pgtable.c     | 68 +++++++++++++++++++++++++-------
>  arch/arm64/kvm/mmu.c             | 23 ++---------
>  3 files changed, 57 insertions(+), 65 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 90873851f677..c31f88306d4e 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -177,37 +177,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
>  	return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
>  }
>  
> -static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
> -{
> -	void *va = page_address(pfn_to_page(pfn));
> -
> -	/*
> -	 * With FWB, we ensure that the guest always accesses memory using
> -	 * cacheable attributes, and we don't have to clean to PoC when
> -	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
> -	 * PoU is not required either in this case.
> -	 */
> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
> -		return;
> -
> -	kvm_flush_dcache_to_poc(va, size);
> -}
> -
> -static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
> -						  unsigned long size)
> -{
> -	if (icache_is_aliasing()) {
> -		/* any kind of VIPT cache */
> -		__flush_icache_all();
> -	} else if (is_kernel_in_hyp_mode() || !icache_is_vpipt()) {
> -		/* PIPT or VPIPT at EL2 (see comment in __kvm_tlb_flush_vmid_ipa) */
> -		void *va = page_address(pfn_to_page(pfn));
> -
> -		invalidate_icache_range((unsigned long)va,
> -					(unsigned long)va + size);
> -	}
> -}
> -
>  void kvm_set_way_flush(struct kvm_vcpu *vcpu);
>  void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled);
>  
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 4d177ce1d536..829a34eea526 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -464,6 +464,43 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
>  	return 0;
>  }
>  
> +static bool stage2_pte_cacheable(kvm_pte_t pte)
> +{
> +	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
> +	return memattr == PAGE_S2_MEMATTR(NORMAL);
> +}
> +
> +static bool stage2_pte_executable(kvm_pte_t pte)
> +{
> +	return !(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN);
> +}
> +
> +static void stage2_flush_dcache(void *addr, u64 size)
> +{
> +	/*
> +	 * With FWB, we ensure that the guest always accesses memory using
> +	 * cacheable attributes, and we don't have to clean to PoC when
> +	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
> +	 * PoU is not required either in this case.
> +	 */
> +	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
> +		return;
> +
> +	__flush_dcache_area(addr, size);
> +}
> +
> +static void stage2_invalidate_icache(void *addr, u64 size)
> +{
> +	if (icache_is_aliasing()) {
> +		/* Flush any kind of VIPT icache */
> +		__flush_icache_all();
> +	} else if (is_kernel_in_hyp_mode() || !icache_is_vpipt()) {
> +		/* PIPT or VPIPT at EL2 */
> +		invalidate_icache_range((unsigned long)addr,
> +					(unsigned long)addr + size);
> +	}
> +}
> +
>  static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>  				      kvm_pte_t *ptep,
>  				      struct stage2_map_data *data)
> @@ -495,6 +532,13 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>  		put_page(page);
>  	}
>  
> +	/* Perform CMOs before installation of the new PTE */
> +	if (!kvm_pte_valid(old) || stage2_pte_cacheable(old))

I'm not sure why the stage2_pte_cacheable(old) condition is needed.

kvm_handle_guest_abort() handles three types of stage 2 data or instruction
aborts: translation faults (fault_status == FSC_FAULT), access faults
(fault_status == FSC_ACCESS) and permission faults (fault_status == FSC_PERM).

Access faults are handled in handle_access_fault(), which means user_mem_abort()
handles translation and permission faults. The original code did the dcache clean
+ inval when not a permission fault, which means the CMO was done only on a
translation fault. Translation faults mean that the IPA was not mapped, so the old
entry will always be invalid. Even if we're coalescing multiple last level leaf
entries into a block mapping, the table entry which is replaced is invalid
because it's marked as such in stage2_map_walk_table_pre().

Is there something I'm missing?
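
To make the reasoning above concrete, here is a simplified model of the
dispatch and of the old D-cache condition. The enum and function names are
hypothetical stand-ins, and the model ignores everything else
kvm_handle_guest_abort() does; only the old condition
(fault_status != FSC_PERM && !device) is taken from the removed code.

```c
#include <stdbool.h>

/* Stand-ins for FSC_FAULT, FSC_ACCESS and FSC_PERM. */
enum model_fault_status {
	MODEL_FSC_FAULT,  /* translation fault */
	MODEL_FSC_ACCESS, /* access flag fault */
	MODEL_FSC_PERM,   /* permission fault */
};

enum model_handler {
	MODEL_HANDLE_ACCESS_FAULT,
	MODEL_USER_MEM_ABORT,
};

/* Access faults go to their own handler; the rest reach user_mem_abort(). */
static enum model_handler model_dispatch(enum model_fault_status status)
{
	return status == MODEL_FSC_ACCESS ? MODEL_HANDLE_ACCESS_FAULT
					  : MODEL_USER_MEM_ABORT;
}

/* Old condition for the D-cache clean, as it stood in user_mem_abort(). */
static bool model_old_dcache_clean(enum model_fault_status status, bool device)
{
	return status != MODEL_FSC_PERM && !device;
}
```

Since only translation and permission faults reach user_mem_abort() in this
model, the old condition holds only for translation faults on non-device memory.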

> +		stage2_flush_dcache(__va(phys), granule);
> +
> +	if (stage2_pte_executable(new))
> +		stage2_invalidate_icache(__va(phys), granule);

This, together with the stage2_attr_walker() changes below, looks identical to the
current code in user_mem_abort(). The executable permission is set on an exec
fault (instruction abort not on a stage 2 translation table walk), and as a result
of the fault we either need to map a new page here, or relax permissions in
kvm_pgtable_stage2_relax_perms() -> stage2_attr_walker() below.

Thanks,

Alex

> +
>  	smp_store_release(ptep, new);
>  	get_page(page);
>  	data->phys += granule;
> @@ -651,20 +695,6 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
>  	return ret;
>  }
>  
> -static void stage2_flush_dcache(void *addr, u64 size)
> -{
> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
> -		return;
> -
> -	__flush_dcache_area(addr, size);
> -}
> -
> -static bool stage2_pte_cacheable(kvm_pte_t pte)
> -{
> -	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
> -	return memattr == PAGE_S2_MEMATTR(NORMAL);
> -}
> -
>  static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>  			       enum kvm_pgtable_walk_flags flag,
>  			       void * const arg)
> @@ -743,8 +773,16 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>  	 * but worst-case the access flag update gets lost and will be
>  	 * set on the next access instead.
>  	 */
> -	if (data->pte != pte)
> +	if (data->pte != pte) {
> +		/*
> +		 * Invalidate the instruction cache before updating
> +		 * if we are going to add the executable permission.
> +		 */
> +		if (!stage2_pte_executable(*ptep) && stage2_pte_executable(pte))
> +			stage2_invalidate_icache(kvm_pte_follow(pte),
> +						 kvm_granule_size(level));
>  		WRITE_ONCE(*ptep, pte);
> +	}
>  
>  	return 0;
>  }
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 77cb2d28f2a4..1eec9f63bc6f 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -609,16 +609,6 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>  	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
>  }
>  
> -static void clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
> -{
> -	__clean_dcache_guest_page(pfn, size);
> -}
> -
> -static void invalidate_icache_guest_page(kvm_pfn_t pfn, unsigned long size)
> -{
> -	__invalidate_icache_guest_page(pfn, size);
> -}
> -
>  static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
>  {
>  	send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
> @@ -882,13 +872,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (writable)
>  		prot |= KVM_PGTABLE_PROT_W;
>  
> -	if (fault_status != FSC_PERM && !device)
> -		clean_dcache_guest_page(pfn, vma_pagesize);
> -
> -	if (exec_fault) {
> +	if (exec_fault)
>  		prot |= KVM_PGTABLE_PROT_X;
> -		invalidate_icache_guest_page(pfn, vma_pagesize);
> -	}
>  
>  	if (device)
>  		prot |= KVM_PGTABLE_PROT_DEVICE;
> @@ -1144,10 +1129,10 @@ int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
>  	trace_kvm_set_spte_hva(hva);
>  
>  	/*
> -	 * We've moved a page around, probably through CoW, so let's treat it
> -	 * just like a translation fault and clean the cache to the PoC.
> +	 * We've moved a page around, probably through CoW, so let's treat
> +	 * it just like a translation fault and the map handler will clean
> +	 * the cache to the PoC.
>  	 */
> -	clean_dcache_guest_page(pfn, PAGE_SIZE);
>  	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pfn);
>  	return 0;
>  }


* Re: [RFC PATCH v3 2/2] KVM: arm64: Distinguish cases of memcache allocations completely
  2021-03-26  3:16 ` [RFC PATCH v3 2/2] KVM: arm64: Distinguish cases of memcache allocations completely Yanan Wang
@ 2021-04-07 15:35   ` Alexandru Elisei
  2021-04-08  9:31     ` wangyanan (Y)
  0 siblings, 1 reply; 9+ messages in thread
From: Alexandru Elisei @ 2021-04-07 15:35 UTC (permalink / raw)
  To: Yanan Wang, Marc Zyngier, Will Deacon, Catalin Marinas, kvmarm,
	linux-arm-kernel, kvm, linux-kernel
  Cc: James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, zhukeqian1, yuzenghui

Hi Yanan,

On 3/26/21 3:16 AM, Yanan Wang wrote:
> With a guest translation fault, the memcache pages are not needed if KVM
> is only about to install a new leaf entry into the existing page table.
> And with a guest permission fault, the memcache pages are also not needed
> for a write_fault in dirty-logging time if KVM is only about to update
> the existing leaf entry instead of collapsing a block entry into a table.
>
> By comparing fault_granule and vma_pagesize, cases that require allocations
> from memcache and cases that don't can be distinguished completely.
>
> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
> ---
>  arch/arm64/kvm/mmu.c | 25 ++++++++++++-------------
>  1 file changed, 12 insertions(+), 13 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 1eec9f63bc6f..05af40dc60c1 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -810,19 +810,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	gfn = fault_ipa >> PAGE_SHIFT;
>  	mmap_read_unlock(current->mm);
>  
> -	/*
> -	 * Permission faults just need to update the existing leaf entry,
> -	 * and so normally don't require allocations from the memcache. The
> -	 * only exception to this is when dirty logging is enabled at runtime
> -	 * and a write fault needs to collapse a block entry into a table.
> -	 */
> -	if (fault_status != FSC_PERM || (logging_active && write_fault)) {
> -		ret = kvm_mmu_topup_memory_cache(memcache,
> -						 kvm_mmu_cache_min_pages(kvm));
> -		if (ret)
> -			return ret;
> -	}
> -
>  	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>  	/*
>  	 * Ensure the read of mmu_notifier_seq happens before we call
> @@ -880,6 +867,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
>  		prot |= KVM_PGTABLE_PROT_X;
>  
> +	/*
> +	 * Allocations from the memcache are required only when granule of the
> +	 * lookup level where the guest fault happened exceeds vma_pagesize,
> +	 * which means new page tables will be created in the fault handlers.
> +	 */
> +	if (fault_granule > vma_pagesize) {
> +		ret = kvm_mmu_topup_memory_cache(memcache,
> +						 kvm_mmu_cache_min_pages(kvm));
> +		if (ret)
> +			return ret;
> +	}

As I explained in v1 [1], this looks correct to me. I still think that someone
else should have a look, but if Marc decides to pick up this patch as-is, he can
add my Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>.

[1] https://lore.kernel.org/lkml/2c65bff2-be7f-b20c-9265-939bc73185b6@arm.com/

Thanks,

Alex

> +
>  	/*
>  	 * Under the premise of getting a FSC_PERM fault, we just need to relax
>  	 * permissions only if vma_pagesize equals fault_granule. Otherwise,


* Re: [RFC PATCH v3 1/2] KVM: arm64: Move CMOs from user_mem_abort to the fault handlers
  2021-04-07 15:31   ` Alexandru Elisei
@ 2021-04-07 20:57     ` Will Deacon
  2021-04-08  9:23     ` wangyanan (Y)
  1 sibling, 0 replies; 9+ messages in thread
From: Will Deacon @ 2021-04-07 20:57 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: Yanan Wang, Marc Zyngier, Catalin Marinas, kvmarm,
	linux-arm-kernel, kvm, linux-kernel, James Morse, Julien Thierry,
	Suzuki K Poulose, Gavin Shan, Quentin Perret, wanghaibin.wang,
	zhukeqian1, yuzenghui

On Wed, Apr 07, 2021 at 04:31:31PM +0100, Alexandru Elisei wrote:
> On 3/26/21 3:16 AM, Yanan Wang wrote:
> > We currently uniformly permorm CMOs of D-cache and I-cache in function
> > user_mem_abort before calling the fault handlers. If we get concurrent
> > guest faults(e.g. translation faults, permission faults) or some really
> > unnecessary guest faults caused by BBM, CMOs for the first vcpu are
> 
> I can't figure out what BBM means.

Oh, I know that one! BBM means "Break Before Make". Not to be confused with
DBM (Dirty Bit Management) or BFM (Bit Field Move).

Will


* Re: [RFC PATCH v3 1/2] KVM: arm64: Move CMOs from user_mem_abort to the fault handlers
  2021-04-07 15:31   ` Alexandru Elisei
  2021-04-07 20:57     ` Will Deacon
@ 2021-04-08  9:23     ` wangyanan (Y)
  2021-04-08 15:59       ` Alexandru Elisei
  1 sibling, 1 reply; 9+ messages in thread
From: wangyanan (Y) @ 2021-04-08  9:23 UTC (permalink / raw)
  To: Alexandru Elisei, Marc Zyngier, Will Deacon, Catalin Marinas,
	kvmarm, linux-arm-kernel, kvm, linux-kernel
  Cc: James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, zhukeqian1, yuzenghui

Hi Alex,

On 2021/4/7 23:31, Alexandru Elisei wrote:
> Hi Yanan,
>
> On 3/26/21 3:16 AM, Yanan Wang wrote:
>> We currently uniformly permorm CMOs of D-cache and I-cache in function
>> user_mem_abort before calling the fault handlers. If we get concurrent
>> guest faults(e.g. translation faults, permission faults) or some really
>> unnecessary guest faults caused by BBM, CMOs for the first vcpu are
> I can't figure out what BBM means.
As Will explained, it's the Break-Before-Make rule. When we need to
replace an old table entry with a new one, we should first invalidate
the old table entry (Break) before installing the new entry (Make).
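
The ordering can be sketched with a toy model. This is not how the kernel
implements it: the struct, the single modelled entry and the counting stub for
TLB invalidation are assumptions made for illustration, and the barriers a real
sequence needs are left out.

```c
#include <stdint.h>

/* Toy MMU state: one table entry and a count of TLB invalidations. */
struct model_mmu {
	uint64_t pte;
	unsigned int tlb_flushes;
};

/* Stub standing in for a real TLB invalidation. */
static void model_tlb_invalidate(struct model_mmu *mmu)
{
	mmu->tlb_flushes++;
}

/* Replace an entry following Break-Before-Make. */
static void model_bbm_replace(struct model_mmu *mmu, uint64_t new_pte)
{
	mmu->pte = 0;              /* Break: the entry becomes invalid */
	model_tlb_invalidate(mmu); /* drop cached copies of the old entry */
	mmu->pte = new_pte;        /* Make: install the new entry */
}
```

Between Break and Make the entry is invalid, so another vcpu touching the same
range in that window would take a translation fault. That is presumably the
kind of unnecessary fault caused by BBM mentioned in the commit message.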

And I think this patch mainly introduces benefits in two specific scenarios:
1) At VM startup, it will improve the efficiency of handling page faults
incurred by vCPUs when initially populating the stage2 page tables.
2) After live migration, the heavy workload will be resumed on the destination
VMs, while all the stage2 page tables need to be rebuilt.
>> necessary while the others later are not.
>>
>> By moving CMOs to the fault handlers, we can easily identify conditions
>> where they are really needed and avoid the unnecessary ones. As it's a
>> time consuming process to perform CMOs especially when flushing a block
>> range, so this solution reduces much load of kvm and improve efficiency
>> of the page table code.
>>
>> So let's move both clean of D-cache and invalidation of I-cache to the
>> map path and move only invalidation of I-cache to the permission path.
>> Since the original APIs for CMOs in mmu.c are only called in function
>> user_mem_abort, we now also move them to pgtable.c.
>>
>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>> ---
>>   arch/arm64/include/asm/kvm_mmu.h | 31 ---------------
>>   arch/arm64/kvm/hyp/pgtable.c     | 68 +++++++++++++++++++++++++-------
>>   arch/arm64/kvm/mmu.c             | 23 ++---------
>>   3 files changed, 57 insertions(+), 65 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>> index 90873851f677..c31f88306d4e 100644
>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> @@ -177,37 +177,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
>>   	return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
>>   }
>>   
>> -static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
>> -{
>> -	void *va = page_address(pfn_to_page(pfn));
>> -
>> -	/*
>> -	 * With FWB, we ensure that the guest always accesses memory using
>> -	 * cacheable attributes, and we don't have to clean to PoC when
>> -	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
>> -	 * PoU is not required either in this case.
>> -	 */
>> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>> -		return;
>> -
>> -	kvm_flush_dcache_to_poc(va, size);
>> -}
>> -
>> -static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
>> -						  unsigned long size)
>> -{
>> -	if (icache_is_aliasing()) {
>> -		/* any kind of VIPT cache */
>> -		__flush_icache_all();
>> -	} else if (is_kernel_in_hyp_mode() || !icache_is_vpipt()) {
>> -		/* PIPT or VPIPT at EL2 (see comment in __kvm_tlb_flush_vmid_ipa) */
>> -		void *va = page_address(pfn_to_page(pfn));
>> -
>> -		invalidate_icache_range((unsigned long)va,
>> -					(unsigned long)va + size);
>> -	}
>> -}
>> -
>>   void kvm_set_way_flush(struct kvm_vcpu *vcpu);
>>   void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled);
>>   
>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> index 4d177ce1d536..829a34eea526 100644
>> --- a/arch/arm64/kvm/hyp/pgtable.c
>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>> @@ -464,6 +464,43 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
>>   	return 0;
>>   }
>>   
>> +static bool stage2_pte_cacheable(kvm_pte_t pte)
>> +{
>> +	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
>> +	return memattr == PAGE_S2_MEMATTR(NORMAL);
>> +}
>> +
>> +static bool stage2_pte_executable(kvm_pte_t pte)
>> +{
>> +	return !(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN);
>> +}
>> +
>> +static void stage2_flush_dcache(void *addr, u64 size)
>> +{
>> +	/*
>> +	 * With FWB, we ensure that the guest always accesses memory using
>> +	 * cacheable attributes, and we don't have to clean to PoC when
>> +	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
>> +	 * PoU is not required either in this case.
>> +	 */
>> +	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>> +		return;
>> +
>> +	__flush_dcache_area(addr, size);
>> +}
>> +
>> +static void stage2_invalidate_icache(void *addr, u64 size)
>> +{
>> +	if (icache_is_aliasing()) {
>> +		/* Flush any kind of VIPT icache */
>> +		__flush_icache_all();
>> +	} else if (is_kernel_in_hyp_mode() || !icache_is_vpipt()) {
>> +		/* PIPT or VPIPT at EL2 */
>> +		invalidate_icache_range((unsigned long)addr,
>> +					(unsigned long)addr + size);
>> +	}
>> +}
>> +
>>   static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>   				      kvm_pte_t *ptep,
>>   				      struct stage2_map_data *data)
>> @@ -495,6 +532,13 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>   		put_page(page);
>>   	}
>>   
>> +	/* Perform CMOs before installation of the new PTE */
>> +	if (!kvm_pte_valid(old) || stage2_pte_cacheable(old))
> I'm not sure why the stage2_pte_cacheable(old) condition is needed.
>
> kvm_handle_guest_abort() handles three types of stage 2 data or instruction
> aborts: translation faults (fault_status == FSC_FAULT), access faults
> (fault_status == FSC_ACCESS) and permission faults (fault_status == FSC_PERM).
>
> Access faults are handled in handle_access_fault(), which means user_mem_abort()
> handles translation and permission faults.
Yes, and we are certain that it's a translation fault here in 
stage2_map_walker_try_leaf.
> The original code did the dcache clean
> + inval when not a permission fault, which means the CMO was done only on a
> translation fault. Translation faults mean that the IPA was not mapped, so the old
> entry will always be invalid. Even if we're coalescing multiple last level leaf
> entries into a block mapping, the table entry which is replaced is invalid
> because it's marked as such in stage2_map_walk_table_pre().
>
> Is there something I'm missing?
I originally thought that we could possibly get a translation fault on a valid
stage2 table descriptor in some special cases, and that's the reason the
stage2_pte_cacheable(old) condition exists, but I can't imagine any such
scenario.

I think your explanation above is right, so maybe I should just drop that
condition.
>
>> +		stage2_flush_dcache(__va(phys), granule);
>> +
>> +	if (stage2_pte_executable(new))
>> +		stage2_invalidate_icache(__va(phys), granule);
> This, together with the stage2_attr_walker() changes below, look identical to the
> current code in user_mem_abort(). The executable permission is set on an exec
> fault (instruction abort not on a stage 2 translation table walk), and as a result
> of the fault we either need to map a new page here, or relax permissions in
> kvm_pgtable_stage2_relax_perms() -> stage2_attr_walker() below.
I agree.
Do you mean that this part of the change is right?

Thanks,
Yanan
> Thanks,
>
> Alex
>
>> +
>>   	smp_store_release(ptep, new);
>>   	get_page(page);
>>   	data->phys += granule;
>> @@ -651,20 +695,6 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
>>   	return ret;
>>   }
>>   
>> -static void stage2_flush_dcache(void *addr, u64 size)
>> -{
>> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>> -		return;
>> -
>> -	__flush_dcache_area(addr, size);
>> -}
>> -
>> -static bool stage2_pte_cacheable(kvm_pte_t pte)
>> -{
>> -	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
>> -	return memattr == PAGE_S2_MEMATTR(NORMAL);
>> -}
>> -
>>   static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>   			       enum kvm_pgtable_walk_flags flag,
>>   			       void * const arg)
>> @@ -743,8 +773,16 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>   	 * but worst-case the access flag update gets lost and will be
>>   	 * set on the next access instead.
>>   	 */
>> -	if (data->pte != pte)
>> +	if (data->pte != pte) {
>> +		/*
>> +		 * Invalidate the instruction cache before updating
>> +		 * if we are going to add the executable permission.
>> +		 */
>> +		if (!stage2_pte_executable(*ptep) && stage2_pte_executable(pte))
>> +			stage2_invalidate_icache(kvm_pte_follow(pte),
>> +						 kvm_granule_size(level));
>>   		WRITE_ONCE(*ptep, pte);
>> +	}
>>   
>>   	return 0;
>>   }
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 77cb2d28f2a4..1eec9f63bc6f 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -609,16 +609,6 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>>   	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
>>   }
>>   
>> -static void clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
>> -{
>> -	__clean_dcache_guest_page(pfn, size);
>> -}
>> -
>> -static void invalidate_icache_guest_page(kvm_pfn_t pfn, unsigned long size)
>> -{
>> -	__invalidate_icache_guest_page(pfn, size);
>> -}
>> -
>>   static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
>>   {
>>   	send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
>> @@ -882,13 +872,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>   	if (writable)
>>   		prot |= KVM_PGTABLE_PROT_W;
>>   
>> -	if (fault_status != FSC_PERM && !device)
>> -		clean_dcache_guest_page(pfn, vma_pagesize);
>> -
>> -	if (exec_fault) {
>> +	if (exec_fault)
>>   		prot |= KVM_PGTABLE_PROT_X;
>> -		invalidate_icache_guest_page(pfn, vma_pagesize);
>> -	}
>>   
>>   	if (device)
>>   		prot |= KVM_PGTABLE_PROT_DEVICE;
>> @@ -1144,10 +1129,10 @@ int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
>>   	trace_kvm_set_spte_hva(hva);
>>   
>>   	/*
>> -	 * We've moved a page around, probably through CoW, so let's treat it
>> -	 * just like a translation fault and clean the cache to the PoC.
>> +	 * We've moved a page around, probably through CoW, so let's treat
>> +	 * it just like a translation fault and the map handler will clean
>> +	 * the cache to the PoC.
>>   	 */
>> -	clean_dcache_guest_page(pfn, PAGE_SIZE);
>>   	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pfn);
>>   	return 0;
>>   }
> .


* Re: [RFC PATCH v3 2/2] KVM: arm64: Distinguish cases of memcache allocations completely
  2021-04-07 15:35   ` Alexandru Elisei
@ 2021-04-08  9:31     ` wangyanan (Y)
  0 siblings, 0 replies; 9+ messages in thread
From: wangyanan (Y) @ 2021-04-08  9:31 UTC (permalink / raw)
  To: Alexandru Elisei, Marc Zyngier, Will Deacon, Catalin Marinas,
	kvmarm, linux-arm-kernel, kvm, linux-kernel
  Cc: James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, zhukeqian1, yuzenghui


On 2021/4/7 23:35, Alexandru Elisei wrote:
> Hi Yanan,
>
> On 3/26/21 3:16 AM, Yanan Wang wrote:
>> With a guest translation fault, the memcache pages are not needed if KVM
>> is only about to install a new leaf entry into the existing page table.
>> And with a guest permission fault, the memcache pages are also not needed
>> for a write_fault in dirty-logging time if KVM is only about to update
>> the existing leaf entry instead of collapsing a block entry into a table.
>>
>> By comparing fault_granule and vma_pagesize, cases that require allocations
>> from memcache and cases that don't can be distinguished completely.
>>
>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>> ---
>>   arch/arm64/kvm/mmu.c | 25 ++++++++++++-------------
>>   1 file changed, 12 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 1eec9f63bc6f..05af40dc60c1 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -810,19 +810,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>   	gfn = fault_ipa >> PAGE_SHIFT;
>>   	mmap_read_unlock(current->mm);
>>   
>> -	/*
>> -	 * Permission faults just need to update the existing leaf entry,
>> -	 * and so normally don't require allocations from the memcache. The
>> -	 * only exception to this is when dirty logging is enabled at runtime
>> -	 * and a write fault needs to collapse a block entry into a table.
>> -	 */
>> -	if (fault_status != FSC_PERM || (logging_active && write_fault)) {
>> -		ret = kvm_mmu_topup_memory_cache(memcache,
>> -						 kvm_mmu_cache_min_pages(kvm));
>> -		if (ret)
>> -			return ret;
>> -	}
>> -
>>   	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>>   	/*
>>   	 * Ensure the read of mmu_notifier_seq happens before we call
>> @@ -880,6 +867,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>   	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
>>   		prot |= KVM_PGTABLE_PROT_X;
>>   
>> +	/*
>> +	 * Allocations from the memcache are required only when granule of the
>> +	 * lookup level where the guest fault happened exceeds vma_pagesize,
>> +	 * which means new page tables will be created in the fault handlers.
>> +	 */
>> +	if (fault_granule > vma_pagesize) {
>> +		ret = kvm_mmu_topup_memory_cache(memcache,
>> +						 kvm_mmu_cache_min_pages(kvm));
>> +		if (ret)
>> +			return ret;
>> +	}
> As I explained in v1 [1], this looks correct to me. I still think that someone
> else should have a look, but if Marc decides to pick up this patch as-is, he can
> add my Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>.
Thanks again for this, Alex!

Hi Marc, Will,
Any thoughts about this patch?
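
For reference, the decision rule in this patch can be sketched outside the kernel as below, next to the rule it replaces. This is only an illustration: the toy_* names and the size macros are hypothetical, and the example sizes assume 4K pages (level 3 granule 4K, level 2 granule 2M).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Example granule sizes, assuming 4K pages (illustrative only). */
#define TOY_SZ_4K (1ULL << 12)
#define TOY_SZ_2M (1ULL << 21)

/*
 * Old rule: top up the memcache for every non-permission fault, and for
 * write permission faults while dirty logging is active.
 */
static bool toy_old_needs_topup(bool is_perm_fault, bool logging_active,
				bool write_fault)
{
	return !is_perm_fault || (logging_active && write_fault);
}

/*
 * New rule: top up only when the granule of the level where the fault
 * happened is larger than the size being mapped, i.e. when new page
 * tables have to be created below the faulting level.
 */
static bool toy_new_needs_topup(uint64_t fault_granule, uint64_t vma_pagesize)
{
	return fault_granule > vma_pagesize;
}
```

The two rules differ for a translation fault at the last level with a 4K mapping: the old rule tops up the memcache, while the new rule skips it because only a leaf entry is installed into an existing table.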

Thanks,
Yanan
> [1] https://lore.kernel.org/lkml/2c65bff2-be7f-b20c-9265-939bc73185b6@arm.com/
>
> Thanks,
>
> Alex
>
>> +
>>   	/*
>>   	 * Under the premise of getting a FSC_PERM fault, we just need to relax
>>   	 * permissions only if vma_pagesize equals fault_granule. Otherwise,
> .


* Re: [RFC PATCH v3 1/2] KVM: arm64: Move CMOs from user_mem_abort to the fault handlers
  2021-04-08  9:23     ` wangyanan (Y)
@ 2021-04-08 15:59       ` Alexandru Elisei
  0 siblings, 0 replies; 9+ messages in thread
From: Alexandru Elisei @ 2021-04-08 15:59 UTC (permalink / raw)
  To: wangyanan (Y),
	Marc Zyngier, Will Deacon, Catalin Marinas, kvmarm,
	linux-arm-kernel, kvm, linux-kernel
  Cc: James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, zhukeqian1, yuzenghui

Hi Yanan,

On 4/8/21 10:23 AM, wangyanan (Y) wrote:
> Hi Alex,
>
> On 2021/4/7 23:31, Alexandru Elisei wrote:
>> Hi Yanan,
>>
>> On 3/26/21 3:16 AM, Yanan Wang wrote:
>>> We currently uniformly perform CMOs of D-cache and I-cache in function
>>> user_mem_abort before calling the fault handlers. If we get concurrent
>>> guest faults(e.g. translation faults, permission faults) or some really
>>> unnecessary guest faults caused by BBM, CMOs for the first vcpu are
>> I can't figure out what BBM means.
> Just as Will has explained, it's Break-Before-Make rule. When we need to
> replace an old table entry with a new one, we should firstly invalidate
> the old table entry(Break), before installation of the new entry(Make).

Got it, thank you and Will for the explanation.

>
>
> And I think this patch mainly introduces benefits in two specific scenarios:
> 1) In a VM startup, it will improve efficiency of handling page faults incurred
> by vCPUs, when initially populating stage2 page tables.
> 2) After live migration, the heavy workload will be resumed on the destination
> VMs, however all the stage2 page tables need to be rebuilt.
>>> necessary while the others later are not.
>>>
>>> By moving CMOs to the fault handlers, we can easily identify conditions
>>> where they are really needed and avoid the unnecessary ones. As it's a
>>> time consuming process to perform CMOs especially when flushing a block
>>> range, so this solution reduces much load of kvm and improve efficiency
>>> of the page table code.
>>>
>>> So let's move both clean of D-cache and invalidation of I-cache to the
>>> map path and move only invalidation of I-cache to the permission path.
>>> Since the original APIs for CMOs in mmu.c are only called in function
>>> user_mem_abort, we now also move them to pgtable.c.
>>>
>>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>>> ---
>>>   arch/arm64/include/asm/kvm_mmu.h | 31 ---------------
>>>   arch/arm64/kvm/hyp/pgtable.c     | 68 +++++++++++++++++++++++++-------
>>>   arch/arm64/kvm/mmu.c             | 23 ++---------
>>>   3 files changed, 57 insertions(+), 65 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>>> index 90873851f677..c31f88306d4e 100644
>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>> @@ -177,37 +177,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
>>>       return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
>>>   }
>>>
>>> -static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
>>> -{
>>> -    void *va = page_address(pfn_to_page(pfn));
>>> -
>>> -    /*
>>> -     * With FWB, we ensure that the guest always accesses memory using
>>> -     * cacheable attributes, and we don't have to clean to PoC when
>>> -     * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
>>> -     * PoU is not required either in this case.
>>> -     */
>>> -    if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>>> -        return;
>>> -
>>> -    kvm_flush_dcache_to_poc(va, size);
>>> -}
>>> -
>>> -static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
>>> -                          unsigned long size)
>>> -{
>>> -    if (icache_is_aliasing()) {
>>> -        /* any kind of VIPT cache */
>>> -        __flush_icache_all();
>>> -    } else if (is_kernel_in_hyp_mode() || !icache_is_vpipt()) {
>>> -        /* PIPT or VPIPT at EL2 (see comment in __kvm_tlb_flush_vmid_ipa) */
>>> -        void *va = page_address(pfn_to_page(pfn));
>>> -
>>> -        invalidate_icache_range((unsigned long)va,
>>> -                    (unsigned long)va + size);
>>> -    }
>>> -}
>>> -
>>>   void kvm_set_way_flush(struct kvm_vcpu *vcpu);
>>>   void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled);
>>>
>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>> index 4d177ce1d536..829a34eea526 100644
>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>> @@ -464,6 +464,43 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
>>>       return 0;
>>>   }
>>>
>>> +static bool stage2_pte_cacheable(kvm_pte_t pte)
>>> +{
>>> +    u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
>>> +    return memattr == PAGE_S2_MEMATTR(NORMAL);
>>> +}
>>> +
>>> +static bool stage2_pte_executable(kvm_pte_t pte)
>>> +{
>>> +    return !(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN);
>>> +}
>>> +
>>> +static void stage2_flush_dcache(void *addr, u64 size)
>>> +{
>>> +    /*
>>> +     * With FWB, we ensure that the guest always accesses memory using
>>> +     * cacheable attributes, and we don't have to clean to PoC when
>>> +     * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
>>> +     * PoU is not required either in this case.
>>> +     */
>>> +    if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>>> +        return;
>>> +
>>> +    __flush_dcache_area(addr, size);
>>> +}
>>> +
>>> +static void stage2_invalidate_icache(void *addr, u64 size)
>>> +{
>>> +    if (icache_is_aliasing()) {
>>> +        /* Flush any kind of VIPT icache */
>>> +        __flush_icache_all();
>>> +    } else if (is_kernel_in_hyp_mode() || !icache_is_vpipt()) {
>>> +        /* PIPT or VPIPT at EL2 */
>>> +        invalidate_icache_range((unsigned long)addr,
>>> +                    (unsigned long)addr + size);
>>> +    }
>>> +}
>>> +
>>>   static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>>                         kvm_pte_t *ptep,
>>>                         struct stage2_map_data *data)
>>> @@ -495,6 +532,13 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>>           put_page(page);
>>>       }
>>>
>>> +    /* Perform CMOs before installation of the new PTE */
>>> +    if (!kvm_pte_valid(old) || stage2_pte_cacheable(old))
>> I'm not sure why the stage2_pte_cacheable(old) condition is needed.
>>
>> kvm_handle_guest_abort() handles three types of stage 2 data or instruction
>> aborts: translation faults (fault_status == FSC_FAULT), access faults
>> (fault_status == FSC_ACCESS) and permission faults (fault_status == FSC_PERM).
>>
>> Access faults are handled in handle_access_fault(), which means user_mem_abort()
>> handles translation and permission faults.
> Yes, and we are certain that it's a translation fault here in
> stage2_map_walker_try_leaf.
>> The original code did the dcache clean
>> + inval when not a permission fault, which means the CMO was done only on a
>> translation fault. Translation faults mean that the IPA was not mapped, so the old
>> entry will always be invalid. Even if we're coalescing multiple last level leaf
>> entries into a block mapping, the table entry which is replaced is invalid
>> because it's marked as such in stage2_map_walk_table_pre().
>>
>> Is there something I'm missing?
> I originally thought that we could possibly have a translation fault on a valid
> stage2 table descriptor due to some special cases, and that's the reason the
> stage2_pte_cacheable(old) condition exists, but I can't imagine any scenario
> like this.
>
> I think your above explanation is right, maybe I should just drop that condition.
>>
>>> +        stage2_flush_dcache(__va(phys), granule);
>>> +
>>> +    if (stage2_pte_executable(new))
>>> +        stage2_invalidate_icache(__va(phys), granule);
>> This, together with the stage2_attr_walker() changes below, look identical to the
>> current code in user_mem_abort(). The executable permission is set on an exec
>> fault (instruction abort not on a stage 2 translation table walk), and as a result
>> of the fault we either need to map a new page here, or relax permissions in
>> kvm_pgtable_stage2_relax_perms() -> stage2_attr_walker() below.
> I agree.
> Do you mean this part of change is right?

Yes, I was trying to explain that the behaviour with regard to icache invalidation
from this patch is identical to the current behaviour of user_mem_abort()
(without this patch).
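
To spell that out, here is a small standalone model of which CMOs each path performs with the patch as posted. It is a sketch only: struct toy_cmo and the toy_* helpers are made-up names, and the FWB and icache-type special cases handled by stage2_flush_dcache() and stage2_invalidate_icache() are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Which cache maintenance operations a fault path would perform. */
struct toy_cmo {
	bool clean_dcache;
	bool inval_icache;
};

/*
 * Map path (translation fault), mirroring the checks the patch adds to
 * stage2_map_walker_try_leaf(): clean the dcache when the old entry is
 * invalid or cacheable, and invalidate the icache when the new entry
 * is executable.
 */
static struct toy_cmo toy_map_path_cmos(bool old_valid, bool old_cacheable,
					bool new_exec)
{
	struct toy_cmo cmo = {
		.clean_dcache = !old_valid || old_cacheable,
		.inval_icache = new_exec,
	};

	return cmo;
}

/*
 * Permission path, mirroring the check the patch adds to
 * stage2_attr_walker(): only invalidate the icache, and only when the
 * executable permission is being added.
 */
static struct toy_cmo toy_perm_path_cmos(bool old_exec, bool new_exec)
{
	struct toy_cmo cmo = {
		.clean_dcache = false,
		.inval_icache = !old_exec && new_exec,
	};

	return cmo;
}
```

In both paths the icache is invalidated only when an executable mapping is about to become visible to the guest, which is the case user_mem_abort() covers today with its exec_fault check.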

Thanks,

Alex

>
> Thanks,
> Yanan
>> Thanks,
>>
>> Alex
>>
>>> +
>>>       smp_store_release(ptep, new);
>>>       get_page(page);
>>>       data->phys += granule;
>>> @@ -651,20 +695,6 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
>>>       return ret;
>>>   }
>>>
>>> -static void stage2_flush_dcache(void *addr, u64 size)
>>> -{
>>> -    if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>>> -        return;
>>> -
>>> -    __flush_dcache_area(addr, size);
>>> -}
>>> -
>>> -static bool stage2_pte_cacheable(kvm_pte_t pte)
>>> -{
>>> -    u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
>>> -    return memattr == PAGE_S2_MEMATTR(NORMAL);
>>> -}
>>> -
>>>   static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>>                      enum kvm_pgtable_walk_flags flag,
>>>                      void * const arg)
>>> @@ -743,8 +773,16 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>>        * but worst-case the access flag update gets lost and will be
>>>        * set on the next access instead.
>>>        */
>>> -    if (data->pte != pte)
>>> +    if (data->pte != pte) {
>>> +        /*
>>> +         * Invalidate the instruction cache before updating
>>> +         * if we are going to add the executable permission.
>>> +         */
>>> +        if (!stage2_pte_executable(*ptep) && stage2_pte_executable(pte))
>>> +            stage2_invalidate_icache(kvm_pte_follow(pte),
>>> +                         kvm_granule_size(level));
>>>           WRITE_ONCE(*ptep, pte);
>>> +    }
>>>         return 0;
>>>   }
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index 77cb2d28f2a4..1eec9f63bc6f 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -609,16 +609,6 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>>>       kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
>>>   }
>>>
>>> -static void clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
>>> -{
>>> -    __clean_dcache_guest_page(pfn, size);
>>> -}
>>> -
>>> -static void invalidate_icache_guest_page(kvm_pfn_t pfn, unsigned long size)
>>> -{
>>> -    __invalidate_icache_guest_page(pfn, size);
>>> -}
>>> -
>>>   static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
>>>   {
>>>       send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
>>> @@ -882,13 +872,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>       if (writable)
>>>           prot |= KVM_PGTABLE_PROT_W;
>>>
>>> -    if (fault_status != FSC_PERM && !device)
>>> -        clean_dcache_guest_page(pfn, vma_pagesize);
>>> -
>>> -    if (exec_fault) {
>>> +    if (exec_fault)
>>>           prot |= KVM_PGTABLE_PROT_X;
>>> -        invalidate_icache_guest_page(pfn, vma_pagesize);
>>> -    }
>>>         if (device)
>>>           prot |= KVM_PGTABLE_PROT_DEVICE;
>>> @@ -1144,10 +1129,10 @@ int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
>>>       trace_kvm_set_spte_hva(hva);
>>>         /*
>>> -     * We've moved a page around, probably through CoW, so let's treat it
>>> -     * just like a translation fault and clean the cache to the PoC.
>>> +     * We've moved a page around, probably through CoW, so let's treat
>>> +     * it just like a translation fault and the map handler will clean
>>> +     * the cache to the PoC.
>>>        */
>>> -    clean_dcache_guest_page(pfn, PAGE_SIZE);
>>>       handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pfn);
>>>       return 0;
>>>   }
>> .


