* [RFC PATCH 0/3] Fix several bugs in KVM stage 2 translation
@ 2020-11-30 12:18 Yanan Wang
  2020-11-30 12:18 ` [RFC PATCH 1/3] KVM: arm64: Fix possible memory leak in kvm stage2 Yanan Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Yanan Wang @ 2020-11-30 12:18 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, Marc Zyngier, Catalin Marinas,
	Will Deacon, James Morse, Julien Thierry, Suzuki K Poulose,
	Gavin Shan, Quentin Perret
  Cc: wanghaibin.wang, yezengruan, zhukeqian1, yuzenghui, jiangkunkun,
	wangjingyi11, lushenming, Yanan Wang

Several problems with KVM stage 2 translation were found when testing based
on the mainline code. The following is a description of the problems and the
corresponding patches.

When installing a new pte entry or updating an old valid entry in stage 2
translation, we use get_page()/put_page() to track the page_count of the
page-table pages. PATCH 1/3 aims to fix incorrect use of get_page()/put_page()
in stage 2, which might leave page-table pages unable to be freed when
unmapping a range.

When dirty logging of a guest with hugepages is finished, we should merge
tables back into a block entry if adjustment of the huge mapping is found
necessary. In addition to installing the block entry, the lower-level mappings
for the block should also be unmapped to avoid multiple TLB entries.
PATCH 2/3 adds an unmap operation when merging tables into a block entry.

The rewrite of the page-table code and fault handling adds two different
handlers for "just relaxing permissions" and "map by stage2 page-table walk",
which is a great improvement. Yet, in user_mem_abort(), the conditions under
which we choose between the two fault handlers are not strictly distinguished.
This can cause guest errors such as an infinite loop (resulting in a soft
lockup), because the inappropriate fault handler gets called. So a solution
that strictly distinguishes the conditions is introduced in PATCH 3/3.

Yanan Wang (3):
  KVM: arm64: Fix possible memory leak in kvm stage2
  KVM: arm64: Fix handling of merging tables into a block entry
  KVM: arm64: Add usage of stage 2 fault lookup level in
    user_mem_abort()

 arch/arm64/include/asm/esr.h         |  1 +
 arch/arm64/include/asm/kvm_emulate.h |  5 +++++
 arch/arm64/kvm/hyp/pgtable.c         | 22 +++++++++++++++++-----
 arch/arm64/kvm/mmu.c                 | 11 +++++++++--
 4 files changed, 32 insertions(+), 7 deletions(-)

-- 
2.19.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC PATCH 1/3] KVM: arm64: Fix possible memory leak in kvm stage2
  2020-11-30 12:18 [RFC PATCH 0/3] Fix several bugs in KVM stage 2 translation Yanan Wang
@ 2020-11-30 12:18 ` Yanan Wang
  2020-11-30 13:21   ` Will Deacon
  2020-11-30 12:18 ` [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry Yanan Wang
  2020-11-30 12:18 ` [RFC PATCH 3/3] KVM: arm64: Add usage of stage 2 fault lookup level in user_mem_abort() Yanan Wang
  2 siblings, 1 reply; 24+ messages in thread
From: Yanan Wang @ 2020-11-30 12:18 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, Marc Zyngier, Catalin Marinas,
	Will Deacon, James Morse, Julien Thierry, Suzuki K Poulose,
	Gavin Shan, Quentin Perret
  Cc: wanghaibin.wang, yezengruan, zhukeqian1, yuzenghui, jiangkunkun,
	wangjingyi11, lushenming, Yanan Wang

When installing a new leaf pte onto an invalid ptep, we need to get_page(ptep).
When just updating a valid leaf ptep, we shouldn't get_page(ptep).
Incorrect page_count of translation tables might lead to a memory leak
when unmapping a stage 2 memory range.
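
To illustrate the rule, here is a standalone toy model (plain userspace C, not
kernel code; the single-slot "table" and the names are made up for the
example). The refcount of a page-table page should only go up when an entry
changes from invalid to valid; otherwise unmapping can never bring the count
back down far enough for the page to be freed:

#include <stdio.h>

static int refcount;        /* references held on the page-table page       */
static unsigned long pte;   /* a single PTE slot in that page, 0 == invalid */

static void map_entry(unsigned long new_pte, int old_behaviour)
{
	int was_valid = pte != 0;

	pte = new_pte;
	/* Correct rule: only take a reference on an invalid -> valid change. */
	if (!was_valid || old_behaviour)
		refcount++;
}

static void unmap_entry(void)
{
	if (pte != 0) {
		pte = 0;
		refcount--;	/* drop the reference held for the valid entry */
	}
}

int main(void)
{
	/* Old behaviour: mapping over an already-valid entry counts it again. */
	map_entry(0x1234, 1);
	map_entry(0x1234, 1);
	unmap_entry();
	printf("old: refcount after unmap = %d (page never freed)\n", refcount);

	/* Fixed behaviour: only the invalid -> valid transition is counted. */
	refcount = 0;
	pte = 0;
	map_entry(0x1234, 0);
	map_entry(0x1234, 0);
	unmap_entry();
	printf("new: refcount after unmap = %d (page can be freed)\n", refcount);
	return 0;
}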

Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0271b4a3b9fe..696b6aa83faf 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -186,6 +186,7 @@ static bool kvm_set_valid_leaf_pte(kvm_pte_t *ptep, u64 pa, kvm_pte_t attr,
 		return old == pte;
 
 	smp_store_release(ptep, pte);
+	get_page(virt_to_page(ptep));
 	return true;
 }
 
@@ -476,6 +477,7 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 	/* There's an existing valid leaf entry, so perform break-before-make */
 	kvm_set_invalid_pte(ptep);
 	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
+	put_page(virt_to_page(ptep));
 	kvm_set_valid_leaf_pte(ptep, phys, data->attr, level);
 out:
 	data->phys += granule;
@@ -512,7 +514,7 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	}
 
 	if (stage2_map_walker_try_leaf(addr, end, level, ptep, data))
-		goto out_get_page;
+		return 0;
 
 	if (WARN_ON(level == KVM_PGTABLE_MAX_LEVELS - 1))
 		return -EINVAL;
@@ -536,9 +538,8 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	}
 
 	kvm_set_table_pte(ptep, childp);
-
-out_get_page:
 	get_page(page);
+
 	return 0;
 }
 
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry
  2020-11-30 12:18 [RFC PATCH 0/3] Fix several bugs in KVM stage 2 translation Yanan Wang
  2020-11-30 12:18 ` [RFC PATCH 1/3] KVM: arm64: Fix possible memory leak in kvm stage2 Yanan Wang
@ 2020-11-30 12:18 ` Yanan Wang
  2020-11-30 13:34   ` Will Deacon
  2020-11-30 12:18 ` [RFC PATCH 3/3] KVM: arm64: Add usage of stage 2 fault lookup level in user_mem_abort() Yanan Wang
  2 siblings, 1 reply; 24+ messages in thread
From: Yanan Wang @ 2020-11-30 12:18 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, Marc Zyngier, Catalin Marinas,
	Will Deacon, James Morse, Julien Thierry, Suzuki K Poulose,
	Gavin Shan, Quentin Perret
  Cc: wanghaibin.wang, yezengruan, zhukeqian1, yuzenghui, jiangkunkun,
	wangjingyi11, lushenming, Yanan Wang

In the dirty logging case (logging_active == true), we need to collapse a block
entry into a table if necessary. After dirty logging is canceled, when merging
tables back into a block entry, we should not only free the non-huge page
tables but also unmap the non-huge mapping for the block. Without the unmap,
inconsistent TLB entries for the pages in the block will be created.

We could also use the unmap_stage2_range API to unmap the non-huge mapping,
but this could potentially free the upper-level page-table page, which
will be useful later.
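
As an illustration of why the unmap is needed, here is a standalone sketch
(plain userspace C; the tiny TLB model, the addresses and the sizes are
made-up simplifications, not KVM code). If the page-level translations cached
while dirty logging was active are not invalidated before the block entry is
installed, the same IPA ends up covered by two cached translations of
different sizes:

#include <stdio.h>

#define SZ_4K	0x1000UL
#define SZ_2M	0x200000UL

struct tlb_entry { unsigned long ipa, size; int valid; };
static struct tlb_entry tlb[16];

static void tlb_insert(unsigned long ipa, unsigned long size)
{
	for (int i = 0; i < 16; i++) {
		if (!tlb[i].valid) {
			tlb[i] = (struct tlb_entry){ ipa, size, 1 };
			return;
		}
	}
}

static int tlb_hits(unsigned long ipa)
{
	int hits = 0;

	for (int i = 0; i < 16; i++)
		if (tlb[i].valid && ipa >= tlb[i].ipa && ipa < tlb[i].ipa + tlb[i].size)
			hits++;
	return hits;
}

int main(void)
{
	unsigned long block = 0x40200000UL;	/* hypothetical 2M-aligned IPA */

	/* Dirty logging: the guest touched some pages, so 4K translations got cached. */
	tlb_insert(block + 0 * SZ_4K, SZ_4K);
	tlb_insert(block + 7 * SZ_4K, SZ_4K);

	/* Merge back into a block without invalidating the page-level entries. */
	tlb_insert(block, SZ_2M);

	printf("translations covering IPA 0x%lx: %d (more than 1 means a TLB conflict)\n",
	       block + 7 * SZ_4K, tlb_hits(block + 7 * SZ_4K));
	return 0;
}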

Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 696b6aa83faf..fec8dc9f2baa 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 	return 0;
 }
 
+static void stage2_flush_dcache(void *addr, u64 size);
+static bool stage2_pte_cacheable(kvm_pte_t pte);
+
 static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 				struct stage2_map_data *data)
 {
@@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	struct page *page = virt_to_page(ptep);
 
 	if (data->anchor) {
-		if (kvm_pte_valid(pte))
+		if (kvm_pte_valid(pte)) {
+			kvm_set_invalid_pte(ptep);
+			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
+				     addr, level);
 			put_page(page);
 
+			if (stage2_pte_cacheable(pte))
+				stage2_flush_dcache(kvm_pte_follow(pte),
+						    kvm_granule_size(level));
+		}
+
 		return 0;
 	}
 
@@ -574,7 +585,7 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
  * The behaviour of the LEAF callback then depends on whether or not the
  * anchor has been set. If not, then we're not using a block mapping higher
  * up the table and we perform the mapping at the existing leaves instead.
- * If, on the other hand, the anchor _is_ set, then we drop references to
+ * If, on the other hand, the anchor _is_ set, then we unmap the mapping of
  * all valid leaves so that the pages beneath the anchor can be freed.
  *
  * Finally, the TABLE_POST callback does nothing if the anchor has not
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 3/3] KVM: arm64: Add usage of stage 2 fault lookup level in user_mem_abort()
  2020-11-30 12:18 [RFC PATCH 0/3] Fix several bugs in KVM stage 2 translation Yanan Wang
  2020-11-30 12:18 ` [RFC PATCH 1/3] KVM: arm64: Fix possible memory leak in kvm stage2 Yanan Wang
  2020-11-30 12:18 ` [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry Yanan Wang
@ 2020-11-30 12:18 ` Yanan Wang
  2020-11-30 13:49   ` Will Deacon
  2 siblings, 1 reply; 24+ messages in thread
From: Yanan Wang @ 2020-11-30 12:18 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, Marc Zyngier, Catalin Marinas,
	Will Deacon, James Morse, Julien Thierry, Suzuki K Poulose,
	Gavin Shan, Quentin Perret
  Cc: wanghaibin.wang, yezengruan, zhukeqian1, yuzenghui, jiangkunkun,
	wangjingyi11, lushenming, Yanan Wang

If we get an FSC_PERM fault, we currently use only (logging_active && writable)
to decide whether to call kvm_pgtable_stage2_map(). There are two more cases we
should consider.

(1) After logging_active is configured back to false from true. When we get an
FSC_PERM fault with write_fault and adjustment of the hugepage is needed, we
should merge tables back into a block entry. This case is ignored by still
calling kvm_pgtable_stage2_relax_perms(), which will lead to an endless loop
and a guest panic due to soft lockup.

(2) We use (FSC_PERM && logging_active && writable) to decide whether to
collapse a block entry into a table by calling kvm_pgtable_stage2_map(). But
sometimes we may only need to relax permissions when trying to write to a page
rather than a block. In this condition, using kvm_pgtable_stage2_relax_perms()
will be fine.

The ISS field bits[1:0] in the ESR_EL2 register indicate the stage 2 lookup
level at which a D-abort or I-abort occurred. By comparing the granule of the
fault lookup level with vma_pagesize, we can strictly distinguish the
conditions for calling kvm_pgtable_stage2_relax_perms() or
kvm_pgtable_stage2_map(), and the above two cases will be well covered.
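
For illustration, a standalone sketch of the comparison this relies on,
assuming 4KB pages (PAGE_SHIFT == 12); the level-shift arithmetic is
reproduced here only for the example, and the 2MB vma_pagesize is a made-up
value:

#include <stdio.h>

#define PAGE_SHIFT	12
/* Same arithmetic as the arm64 macro of this name, repeated for illustration. */
#define ARM64_HW_PGTABLE_LEVEL_SHIFT(n)	((PAGE_SHIFT - 3) * (4 - (n)) + 3)

int main(void)
{
	unsigned long vma_pagesize = 1UL << 21;	/* e.g. a 2MB THP-backed VMA */

	for (unsigned int level = 1; level <= 3; level++) {
		unsigned long fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(level);

		printf("fault level %u: granule %8lu KB -> %s\n",
		       level, fault_granule >> 10,
		       fault_granule == vma_pagesize ?
		       "relax permissions" : "call kvm_pgtable_stage2_map()");
	}
	return 0;
}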

Suggested-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
---
 arch/arm64/include/asm/esr.h         |  1 +
 arch/arm64/include/asm/kvm_emulate.h |  5 +++++
 arch/arm64/kvm/mmu.c                 | 11 +++++++++--
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
index 22c81f1edda2..85a3e49f92f4 100644
--- a/arch/arm64/include/asm/esr.h
+++ b/arch/arm64/include/asm/esr.h
@@ -104,6 +104,7 @@
 /* Shared ISS fault status code(IFSC/DFSC) for Data/Instruction aborts */
 #define ESR_ELx_FSC		(0x3F)
 #define ESR_ELx_FSC_TYPE	(0x3C)
+#define ESR_ELx_FSC_LEVEL	(0x03)
 #define ESR_ELx_FSC_EXTABT	(0x10)
 #define ESR_ELx_FSC_SERROR	(0x11)
 #define ESR_ELx_FSC_ACCESS	(0x08)
diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index 5ef2669ccd6c..2e0e8edf6306 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -350,6 +350,11 @@ static __always_inline u8 kvm_vcpu_trap_get_fault_type(const struct kvm_vcpu *vc
 	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC_TYPE;
 }
 
+static __always_inline u8 kvm_vcpu_trap_get_fault_level(const struct kvm_vcpu *vcpu)
+{
+	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC_LEVEL;
+}
+
 static __always_inline bool kvm_vcpu_abt_issea(const struct kvm_vcpu *vcpu)
 {
 	switch (kvm_vcpu_trap_get_fault(vcpu)) {
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 1a01da9fdc99..75814a02d189 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -754,10 +754,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	gfn_t gfn;
 	kvm_pfn_t pfn;
 	bool logging_active = memslot_is_logging(memslot);
-	unsigned long vma_pagesize;
+	unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
+	unsigned long vma_pagesize, fault_granule;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
 	struct kvm_pgtable *pgt;
 
+	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
 	write_fault = kvm_is_write_fault(vcpu);
 	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
 	VM_BUG_ON(write_fault && exec_fault);
@@ -896,7 +898,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
 		prot |= KVM_PGTABLE_PROT_X;
 
-	if (fault_status == FSC_PERM && !(logging_active && writable)) {
+	/*
+	 * Under the premise of getting a FSC_PERM fault, we just need to relax
+	 * permissions only if vma_pagesize equals fault_granule. Otherwise,
+	 * kvm_pgtable_stage2_map() should be called to change block size.
+	 */
+	if (fault_status == FSC_PERM && vma_pagesize == fault_granule) {
 		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
 	} else {
 		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 1/3] KVM: arm64: Fix possible memory leak in kvm stage2
  2020-11-30 12:18 ` [RFC PATCH 1/3] KVM: arm64: Fix possible memory leak in kvm stage2 Yanan Wang
@ 2020-11-30 13:21   ` Will Deacon
  2020-12-01  7:21     ` wangyanan (Y)
  0 siblings, 1 reply; 24+ messages in thread
From: Will Deacon @ 2020-11-30 13:21 UTC (permalink / raw)
  To: Yanan Wang
  Cc: linux-kernel, linux-arm-kernel, Marc Zyngier, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, yezengruan, zhukeqian1,
	yuzenghui, jiangkunkun, wangjingyi11, lushenming

On Mon, Nov 30, 2020 at 08:18:45PM +0800, Yanan Wang wrote:
> When installing a new leaf pte onto an invalid ptep, we need to get_page(ptep).
> When just updating a valid leaf ptep, we shouldn't get_page(ptep).
> Incorrect page_count of translation tables might lead to a memory leak
> when unmapping a stage 2 memory range.

Did you find this by inspection, or did you hit this in practice? I'd be
interested to see the backtrace for mapping over an existing mapping.

> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
> ---
>  arch/arm64/kvm/hyp/pgtable.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 0271b4a3b9fe..696b6aa83faf 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -186,6 +186,7 @@ static bool kvm_set_valid_leaf_pte(kvm_pte_t *ptep, u64 pa, kvm_pte_t attr,
>  		return old == pte;
>  
>  	smp_store_release(ptep, pte);
> +	get_page(virt_to_page(ptep));

This is also used for the hypervisor stage-1 page-table, so I'd prefer to
leave this function as-is.

>  	return true;
>  }
>  
> @@ -476,6 +477,7 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>  	/* There's an existing valid leaf entry, so perform break-before-make */
>  	kvm_set_invalid_pte(ptep);
>  	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
> +	put_page(virt_to_page(ptep));
>  	kvm_set_valid_leaf_pte(ptep, phys, data->attr, level);
>  out:
>  	data->phys += granule;

Isn't this hunk alone sufficient to solve the problem?

Will

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry
  2020-11-30 12:18 ` [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry Yanan Wang
@ 2020-11-30 13:34   ` Will Deacon
  2020-11-30 15:24     ` wangyanan (Y)
  0 siblings, 1 reply; 24+ messages in thread
From: Will Deacon @ 2020-11-30 13:34 UTC (permalink / raw)
  To: Yanan Wang
  Cc: linux-kernel, linux-arm-kernel, Marc Zyngier, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, yezengruan, zhukeqian1,
	yuzenghui, jiangkunkun, wangjingyi11, lushenming

On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
> In the dirty logging case (logging_active == true), we need to collapse a block
> entry into a table if necessary. After dirty logging is canceled, when merging
> tables back into a block entry, we should not only free the non-huge page
> tables but also unmap the non-huge mapping for the block. Without the unmap,
> inconsistent TLB entries for the pages in the block will be created.
> 
> We could also use the unmap_stage2_range API to unmap the non-huge mapping,
> but this could potentially free the upper-level page-table page, which
> will be useful later.
> 
> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
> ---
>  arch/arm64/kvm/hyp/pgtable.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 696b6aa83faf..fec8dc9f2baa 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>  	return 0;
>  }
>  
> +static void stage2_flush_dcache(void *addr, u64 size);
> +static bool stage2_pte_cacheable(kvm_pte_t pte);
> +
>  static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>  				struct stage2_map_data *data)
>  {
> @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>  	struct page *page = virt_to_page(ptep);
>  
>  	if (data->anchor) {
> -		if (kvm_pte_valid(pte))
> +		if (kvm_pte_valid(pte)) {
> +			kvm_set_invalid_pte(ptep);
> +			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
> +				     addr, level);
>  			put_page(page);

This doesn't make sense to me: the page-table pages we're walking when the
anchor is set are not accessible to the hardware walker because we unhooked
the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
TLB invalidation.

Are you seeing a problem in practice here?

> +			if (stage2_pte_cacheable(pte))
> +				stage2_flush_dcache(kvm_pte_follow(pte),
> +						    kvm_granule_size(level));

I don't understand the need for the flush either, as we're just coalescing
existing entries into a larger block mapping.

Will

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 3/3] KVM: arm64: Add usage of stage 2 fault lookup level in user_mem_abort()
  2020-11-30 12:18 ` [RFC PATCH 3/3] KVM: arm64: Add usage of stage 2 fault lookup level in user_mem_abort() Yanan Wang
@ 2020-11-30 13:49   ` Will Deacon
  2020-12-01  6:04     ` wangyanan (Y)
  0 siblings, 1 reply; 24+ messages in thread
From: Will Deacon @ 2020-11-30 13:49 UTC (permalink / raw)
  To: Yanan Wang
  Cc: linux-kernel, linux-arm-kernel, Marc Zyngier, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, yezengruan, zhukeqian1,
	yuzenghui, jiangkunkun, wangjingyi11, lushenming

On Mon, Nov 30, 2020 at 08:18:47PM +0800, Yanan Wang wrote:
> If we get an FSC_PERM fault, we currently use only (logging_active && writable)
> to decide whether to call kvm_pgtable_stage2_map(). There are two more cases we
> should consider.
> 
> (1) After logging_active is configured back to false from true. When we get an
> FSC_PERM fault with write_fault and adjustment of the hugepage is needed, we
> should merge tables back into a block entry. This case is ignored by still
> calling kvm_pgtable_stage2_relax_perms(), which will lead to an endless loop
> and a guest panic due to soft lockup.
> 
> (2) We use (FSC_PERM && logging_active && writable) to decide whether to
> collapse a block entry into a table by calling kvm_pgtable_stage2_map(). But
> sometimes we may only need to relax permissions when trying to write to a page
> rather than a block. In this condition, using kvm_pgtable_stage2_relax_perms()
> will be fine.
> 
> The ISS field bits[1:0] in the ESR_EL2 register indicate the stage 2 lookup
> level at which a D-abort or I-abort occurred. By comparing the granule of the
> fault lookup level with vma_pagesize, we can strictly distinguish the
> conditions for calling kvm_pgtable_stage2_relax_perms() or
> kvm_pgtable_stage2_map(), and the above two cases will be well covered.
> 
> Suggested-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
> ---
>  arch/arm64/include/asm/esr.h         |  1 +
>  arch/arm64/include/asm/kvm_emulate.h |  5 +++++
>  arch/arm64/kvm/mmu.c                 | 11 +++++++++--
>  3 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
> index 22c81f1edda2..85a3e49f92f4 100644
> --- a/arch/arm64/include/asm/esr.h
> +++ b/arch/arm64/include/asm/esr.h
> @@ -104,6 +104,7 @@
>  /* Shared ISS fault status code(IFSC/DFSC) for Data/Instruction aborts */
>  #define ESR_ELx_FSC		(0x3F)
>  #define ESR_ELx_FSC_TYPE	(0x3C)
> +#define ESR_ELx_FSC_LEVEL	(0x03)
>  #define ESR_ELx_FSC_EXTABT	(0x10)
>  #define ESR_ELx_FSC_SERROR	(0x11)
>  #define ESR_ELx_FSC_ACCESS	(0x08)
> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> index 5ef2669ccd6c..2e0e8edf6306 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -350,6 +350,11 @@ static __always_inline u8 kvm_vcpu_trap_get_fault_type(const struct kvm_vcpu *vc
>  	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC_TYPE;
>  }
>  
> +static __always_inline u8 kvm_vcpu_trap_get_fault_level(const struct kvm_vcpu *vcpu)
> +{
> +	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC_LEVEL;
> +}
> +
>  static __always_inline bool kvm_vcpu_abt_issea(const struct kvm_vcpu *vcpu)
>  {
>  	switch (kvm_vcpu_trap_get_fault(vcpu)) {
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 1a01da9fdc99..75814a02d189 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -754,10 +754,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	gfn_t gfn;
>  	kvm_pfn_t pfn;
>  	bool logging_active = memslot_is_logging(memslot);
> -	unsigned long vma_pagesize;
> +	unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
> +	unsigned long vma_pagesize, fault_granule;
>  	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>  	struct kvm_pgtable *pgt;
>  
> +	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);

I like the idea, but is this macro reliable for stage-2 page-tables, given
that we could have a concatenated pgd?

Will

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry
  2020-11-30 13:34   ` Will Deacon
@ 2020-11-30 15:24     ` wangyanan (Y)
  2020-11-30 16:01       ` Will Deacon
  0 siblings, 1 reply; 24+ messages in thread
From: wangyanan (Y) @ 2020-11-30 15:24 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, Marc Zyngier, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, yezengruan, zhukeqian1,
	yuzenghui, jiangkunkun, wangjingyi11, lushenming


On 2020/11/30 21:34, Will Deacon wrote:
> On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
>> In the dirty logging case (logging_active == true), we need to collapse a block
>> entry into a table if necessary. After dirty logging is canceled, when merging
>> tables back into a block entry, we should not only free the non-huge page
>> tables but also unmap the non-huge mapping for the block. Without the unmap,
>> inconsistent TLB entries for the pages in the block will be created.
>>
>> We could also use the unmap_stage2_range API to unmap the non-huge mapping,
>> but this could potentially free the upper-level page-table page, which
>> will be useful later.
>>
>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>> ---
>>   arch/arm64/kvm/hyp/pgtable.c | 15 +++++++++++++--
>>   1 file changed, 13 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> index 696b6aa83faf..fec8dc9f2baa 100644
>> --- a/arch/arm64/kvm/hyp/pgtable.c
>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>> @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>>   	return 0;
>>   }
>>   
>> +static void stage2_flush_dcache(void *addr, u64 size);
>> +static bool stage2_pte_cacheable(kvm_pte_t pte);
>> +
>>   static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>   				struct stage2_map_data *data)
>>   {
>> @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>   	struct page *page = virt_to_page(ptep);
>>   
>>   	if (data->anchor) {
>> -		if (kvm_pte_valid(pte))
>> +		if (kvm_pte_valid(pte)) {
>> +			kvm_set_invalid_pte(ptep);
>> +			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
>> +				     addr, level);
>>   			put_page(page);
> This doesn't make sense to me: the page-table pages we're walking when the
> anchor is set are not accessible to the hardware walker because we unhooked
> the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
> TLB invalidation.
>
> Are you seeing a problem in practice here?

Yes, I did indeed find a problem in practice.

When the migration was cancelled, a TLB conflict abort was found in the guest.

This problem was fixed before the rework of the page table code; you can have
a look at the following two links:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277

https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html

>> +			if (stage2_pte_cacheable(pte))
>> +				stage2_flush_dcache(kvm_pte_follow(pte),
>> +						    kvm_granule_size(level));
> I don't understand the need for the flush either, as we're just coalescing
> existing entries into a larger block mapping.

In my opinion, after unmapping, it is necessary to ensure cache coherency,
because theoretically it is unknown whether the memory attributes of a
future mapping will change or not (cacheable -> non-cacheable).

> Will
> .

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry
  2020-11-30 15:24     ` wangyanan (Y)
@ 2020-11-30 16:01       ` Will Deacon
  2020-12-01  2:30         ` wangyanan (Y)
  0 siblings, 1 reply; 24+ messages in thread
From: Will Deacon @ 2020-11-30 16:01 UTC (permalink / raw)
  To: wangyanan (Y)
  Cc: linux-kernel, linux-arm-kernel, Marc Zyngier, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, yezengruan, zhukeqian1,
	yuzenghui, jiangkunkun, wangjingyi11, lushenming

Hi,

Cheers for the quick reply. See below for more questions...

On Mon, Nov 30, 2020 at 11:24:19PM +0800, wangyanan (Y) wrote:
> On 2020/11/30 21:34, Will Deacon wrote:
> > On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index 696b6aa83faf..fec8dc9f2baa 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
> > >   	return 0;
> > >   }
> > > +static void stage2_flush_dcache(void *addr, u64 size);
> > > +static bool stage2_pte_cacheable(kvm_pte_t pte);
> > > +
> > >   static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> > >   				struct stage2_map_data *data)
> > >   {
> > > @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> > >   	struct page *page = virt_to_page(ptep);
> > >   	if (data->anchor) {
> > > -		if (kvm_pte_valid(pte))
> > > +		if (kvm_pte_valid(pte)) {
> > > +			kvm_set_invalid_pte(ptep);
> > > +			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
> > > +				     addr, level);
> > >   			put_page(page);
> > This doesn't make sense to me: the page-table pages we're walking when the
> > anchor is set are not accessible to the hardware walker because we unhooked
> > the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
> > TLB invalidation.
> > 
> > Are you seeing a problem in practice here?
> 
> Yes, I did indeed find a problem in practice.
> 
> When the migration was cancelled, a TLB conflict abort was found in the guest.
> 
> This problem was fixed before the rework of the page table code; you can have
> a look at the following two links:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277
> 
> https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html

Ok, let's go through this, because I still don't see the bug. Please correct
me if you spot any mistakes:

  1. We have a block mapping for X => Y
  2. Dirty logging is enabled, so the block mapping is write-protected and
     ends up being split into page mappings
  3. Dirty logging is disabled due to a failed migration.

--- At this point, I think we agree that the state of the MMU is alright ---

  4. We take a stage-2 fault and want to reinstall the block mapping:

     a. kvm_pgtable_stage2_map() is invoked to install the block mapping
     b. stage2_map_walk_table_pre() finds a table where we would like to
        install the block:

	i.   The anchor is set to point at this entry
	ii.  The entry is made invalid
	iii. We invalidate the TLB for the input address. This is
	     TLBI IPAS2E1IS without a level hint and then TLBI VMALLE1IS.

	*** At this point, the page-table pointed to by the old table entry
	    is not reachable to the hardware walker ***

     c. stage2_map_walk_leaf() is called for each leaf entry in the
        now-unreachable subtree, dropping page-references for each valid
	entry it finds.
     d. stage2_map_walk_table_post() is eventually called for the entry
        which we cleared back in b.ii, so we install the new block mapping.

You are proposing to add additional TLB invalidation to (c), but I don't
think that is necessary, thanks to the invalidation already performed in
b.iii. What am I missing here?

> > > +			if (stage2_pte_cacheable(pte))
> > > +				stage2_flush_dcache(kvm_pte_follow(pte),
> > > +						    kvm_granule_size(level));
> > I don't understand the need for the flush either, as we're just coalescing
> > existing entries into a larger block mapping.
> 
> In my opinion, after unmapping, it is necessary to ensure cache coherency,
> because theoretically it is unknown whether the memory attributes of a future
> mapping will change or not (cacheable -> non-cacheable).

But in this case we're just changing the structure of the page-tables,
not the pages which are mapped, are we?

Will

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry
  2020-11-30 16:01       ` Will Deacon
@ 2020-12-01  2:30         ` wangyanan (Y)
  2020-12-01 13:46           ` Will Deacon
  0 siblings, 1 reply; 24+ messages in thread
From: wangyanan (Y) @ 2020-12-01  2:30 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, Marc Zyngier, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, yezengruan, zhukeqian1,
	yuzenghui, jiangkunkun, wangjingyi11, lushenming

Hi Will,

On 2020/12/1 0:01, Will Deacon wrote:
> Hi,
>
> Cheers for the quick reply. See below for more questions...
>
> On Mon, Nov 30, 2020 at 11:24:19PM +0800, wangyanan (Y) wrote:
>> On 2020/11/30 21:34, Will Deacon wrote:
>>> On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
>>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>>> index 696b6aa83faf..fec8dc9f2baa 100644
>>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>>> @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>>>>    	return 0;
>>>>    }
>>>> +static void stage2_flush_dcache(void *addr, u64 size);
>>>> +static bool stage2_pte_cacheable(kvm_pte_t pte);
>>>> +
>>>>    static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>>>    				struct stage2_map_data *data)
>>>>    {
>>>> @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>>>    	struct page *page = virt_to_page(ptep);
>>>>    	if (data->anchor) {
>>>> -		if (kvm_pte_valid(pte))
>>>> +		if (kvm_pte_valid(pte)) {
>>>> +			kvm_set_invalid_pte(ptep);
>>>> +			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
>>>> +				     addr, level);
>>>>    			put_page(page);
>>> This doesn't make sense to me: the page-table pages we're walking when the
>>> anchor is set are not accessible to the hardware walker because we unhooked
>>> the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
>>> TLB invalidation.
>>>
>>> Are you seeing a problem in practice here?
>> Yes, I did indeed find a problem in practice.
>>
>> When the migration was cancelled, a TLB conflict abort was found in the guest.
>>
>> This problem was fixed before the rework of the page table code; you can have
>> a look at the following two links:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277
>>
>> https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html
> Ok, let's go through this, because I still don't see the bug. Please correct
> me if you spot any mistakes:
>
>    1. We have a block mapping for X => Y
>    2. Dirty logging is enabled, so the block mapping is write-protected and
>       ends up being split into page mappings
>    3. Dirty logging is disabled due to a failed migration.
>
> --- At this point, I think we agree that the state of the MMU is alright ---
>
>    4. We take a stage-2 fault and want to reinstall the block mapping:
>
>       a. kvm_pgtable_stage2_map() is invoked to install the block mapping
>       b. stage2_map_walk_table_pre() finds a table where we would like to
>          install the block:
>
> 	i.   The anchor is set to point at this entry
> 	ii.  The entry is made invalid
> 	iii. We invalidate the TLB for the input address. This is
> 	     TLBI IPAS2E1IS without a level hint and then TLBI VMALLE1IS.
>
> 	*** At this point, the page-table pointed to by the old table entry
> 	    is not reachable to the hardware walker ***
>
>       c. stage2_map_walk_leaf() is called for each leaf entry in the
>          now-unreachable subtree, dropping page-references for each valid
> 	entry it finds.
>       d. stage2_map_walk_table_post() is eventually called for the entry
>          which we cleared back in b.ii, so we install the new block mapping.
>
> You are proposing to add additional TLB invalidation to (c), but I don't
> think that is necessary, thanks to the invalidation already performed in
> b.iii. What am I missing here?

The point is at b.iii where the TLBI is not enough. There are many page
mappings that we need to merge into a block mapping.

We invalidate the TLB for the input address without a level hint at b.iii,
but this operation only flushes the TLB for one page mapping; there are
still TLB entries for the other page mappings in the cache, and the MMU
hardware walker can still hit these entries next time.

Maybe we can imagine a concrete example here. Suppose we now need to merge
page mappings into a 1G block mapping, and the fault_ipa to user_mem_abort()
is 0x225043000. After alignment to 1G, the input address will be 0x200000000,
so the TLBI operation at b.iii will invalidate the TLB entry for address
0x200000000. But what about addresses 0x200001000, 0x200002000, ...?

After the installation of the 1G block mapping is finished, when the fault_ipa
is 0x200007000, the MMU can still hit the TLB entry for the page mapping in
the cache and can also access memory through the new block entry.

So the TLBI operation added in stage2_map_walk_leaf() aims to invalidate the
TLB entries for all the page mappings that will be merged. In this way, after
the block mapping is installed, the MMU can only access memory through the
new block entry.
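
The arithmetic of the example as a standalone snippet (plain userspace C,
just for illustration):

#include <stdio.h>

#define SZ_4K	0x1000UL
#define SZ_1G	0x40000000UL

int main(void)
{
	unsigned long fault_ipa = 0x225043000UL;
	unsigned long block_ipa = fault_ipa & ~(SZ_1G - 1);

	printf("fault IPA:         0x%lx\n", fault_ipa);
	printf("1G block base:     0x%lx (the only IPA passed to the TLBI)\n", block_ipa);

	/* A few of the page mappings whose TLB entries are never invalidated. */
	for (int i = 1; i <= 3; i++)
		printf("still cached:      0x%lx\n", block_ipa + i * SZ_4K);
	printf("...and so on up to 0x%lx\n", block_ipa + SZ_1G - SZ_4K);
	return 0;
}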

>>>> +			if (stage2_pte_cacheable(pte))
>>>> +				stage2_flush_dcache(kvm_pte_follow(pte),
>>>> +						    kvm_granule_size(level));
>>> I don't understand the need for the flush either, as we're just coalescing
>>> existing entries into a larger block mapping.
>> In my opinion, after unmapping, it is necessary to ensure cache coherency,
>> because theoretically it is unknown whether the memory attributes of a future
>> mapping will change or not (cacheable -> non-cacheable).
> But in this case we're just changing the structure of the page-tables,
> not the pages which are mapped, are we?
>
> Will
> .

Yes, for now there is no case here where the cache attributes will be
changed.

Maybe this part of the dcache flush can be dropped.


Yanan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 3/3] KVM: arm64: Add usage of stage 2 fault lookup level in user_mem_abort()
  2020-11-30 13:49   ` Will Deacon
@ 2020-12-01  6:04     ` wangyanan (Y)
  0 siblings, 0 replies; 24+ messages in thread
From: wangyanan (Y) @ 2020-12-01  6:04 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, Marc Zyngier, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, yezengruan, zhukeqian1,
	yuzenghui, jiangkunkun, wangjingyi11, lushenming

Hi Will,

On 2020/11/30 21:49, Will Deacon wrote:
> On Mon, Nov 30, 2020 at 08:18:47PM +0800, Yanan Wang wrote:
>> If we get an FSC_PERM fault, we currently use only (logging_active && writable)
>> to decide whether to call kvm_pgtable_stage2_map(). There are two more cases we
>> should consider.
>>
>> (1) After logging_active is configured back to false from true. When we get an
>> FSC_PERM fault with write_fault and adjustment of the hugepage is needed, we
>> should merge tables back into a block entry. This case is ignored by still
>> calling kvm_pgtable_stage2_relax_perms(), which will lead to an endless loop
>> and a guest panic due to soft lockup.
>>
>> (2) We use (FSC_PERM && logging_active && writable) to decide whether to
>> collapse a block entry into a table by calling kvm_pgtable_stage2_map(). But
>> sometimes we may only need to relax permissions when trying to write to a page
>> rather than a block. In this condition, using kvm_pgtable_stage2_relax_perms()
>> will be fine.
>>
>> The ISS field bits[1:0] in the ESR_EL2 register indicate the stage 2 lookup
>> level at which a D-abort or I-abort occurred. By comparing the granule of the
>> fault lookup level with vma_pagesize, we can strictly distinguish the
>> conditions for calling kvm_pgtable_stage2_relax_perms() or
>> kvm_pgtable_stage2_map(), and the above two cases will be well covered.
>>
>> Suggested-by: Keqian Zhu <zhukeqian1@huawei.com>
>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>> ---
>>   arch/arm64/include/asm/esr.h         |  1 +
>>   arch/arm64/include/asm/kvm_emulate.h |  5 +++++
>>   arch/arm64/kvm/mmu.c                 | 11 +++++++++--
>>   3 files changed, 15 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
>> index 22c81f1edda2..85a3e49f92f4 100644
>> --- a/arch/arm64/include/asm/esr.h
>> +++ b/arch/arm64/include/asm/esr.h
>> @@ -104,6 +104,7 @@
>>   /* Shared ISS fault status code(IFSC/DFSC) for Data/Instruction aborts */
>>   #define ESR_ELx_FSC		(0x3F)
>>   #define ESR_ELx_FSC_TYPE	(0x3C)
>> +#define ESR_ELx_FSC_LEVEL	(0x03)
>>   #define ESR_ELx_FSC_EXTABT	(0x10)
>>   #define ESR_ELx_FSC_SERROR	(0x11)
>>   #define ESR_ELx_FSC_ACCESS	(0x08)
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>> index 5ef2669ccd6c..2e0e8edf6306 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -350,6 +350,11 @@ static __always_inline u8 kvm_vcpu_trap_get_fault_type(const struct kvm_vcpu *vc
>>   	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC_TYPE;
>>   }
>>   
>> +static __always_inline u8 kvm_vcpu_trap_get_fault_level(const struct kvm_vcpu *vcpu)
>> +{
>> +	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC_LEVEL;
>> +}
>> +
>>   static __always_inline bool kvm_vcpu_abt_issea(const struct kvm_vcpu *vcpu)
>>   {
>>   	switch (kvm_vcpu_trap_get_fault(vcpu)) {
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 1a01da9fdc99..75814a02d189 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -754,10 +754,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>   	gfn_t gfn;
>>   	kvm_pfn_t pfn;
>>   	bool logging_active = memslot_is_logging(memslot);
>> -	unsigned long vma_pagesize;
>> +	unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
>> +	unsigned long vma_pagesize, fault_granule;
>>   	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>>   	struct kvm_pgtable *pgt;
>>   
>> +	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
> I like the idea, but is this macro reliable for stage-2 page-tables, given
> that we could have a concatenated pgd?
>
> Will
> .

Yes, it's fine even when we have a concatenated pgd table.

Whether or not a concatenated pgd is made, the initial lookup level
(start_level) is set in the VTCR_EL2 register, so the MMU hardware walker
will know the start_level according to the information in VTCR_EL2.

This idea runs well in practice on a host where ia_bits is 40, PAGE_SIZE is
4K, and a concatenated pgd is made for the guest stage 2. According to the
kernel info printed, the start_level is 1, and stage 2 translation runs as
expected.
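
A rough arithmetic sketch of that configuration (standalone C, assuming 4KB
pages; it only reproduces the architectural 9-bits-per-level split, not the
kernel's helpers):

#include <stdio.h>

#define PAGE_SHIFT	12

int main(void)
{
	unsigned int ia_bits = 40;
	unsigned int bits_per_level = PAGE_SHIFT - 3;	/* 9 for 4KB pages */
	unsigned int level0_bits_used;

	for (int level = 0; level <= 3; level++) {
		unsigned int low = PAGE_SHIFT + (3 - level) * bits_per_level;
		unsigned int high = low + bits_per_level - 1;

		printf("level %d resolves IA bits [%u:%u]\n", level, high, low);
	}

	/*
	 * Level 0 would resolve bits [47:39]; with ia_bits = 40 only bit 39 is
	 * used, i.e. a 2-entry table. Stage 2 concatenates two level-1 tables
	 * instead and the walk starts at level 1, as advertised in VTCR_EL2.
	 */
	level0_bits_used = ia_bits > 39 ? ia_bits - 39 : 0;
	printf("ia_bits = %u: %u level-0 entries needed -> concatenate %u level-1 tables, start_level = 1\n",
	       ia_bits, 1u << level0_bits_used, 1u << level0_bits_used);
	return 0;
}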


Yanan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 1/3] KVM: arm64: Fix possible memory leak in kvm stage2
  2020-11-30 13:21   ` Will Deacon
@ 2020-12-01  7:21     ` wangyanan (Y)
  2020-12-01 14:16       ` Will Deacon
  0 siblings, 1 reply; 24+ messages in thread
From: wangyanan (Y) @ 2020-12-01  7:21 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, Marc Zyngier, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, yezengruan, zhukeqian1,
	yuzenghui, jiangkunkun, wangjingyi11, lushenming

Hi Will,

On 2020/11/30 21:21, Will Deacon wrote:
> On Mon, Nov 30, 2020 at 08:18:45PM +0800, Yanan Wang wrote:
>> When installing a new leaf pte onto an invalid ptep, we need to get_page(ptep).
>> When just updating a valid leaf ptep, we shouldn't get_page(ptep).
>> Incorrect page_count of translation tables might lead to a memory leak
>> when unmapping a stage 2 memory range.
> Did you find this by inspection, or did you hit this in practice? I'd be
> interested to see the backtrace for mapping over an existing mapping.

Actually, this was found by inspection.

In the current code, get_page() will uniformly be called at "out_get_page"
in stage2_map_walk_leaf(), no matter whether the old ptep is valid or not.

When using the stage2_unmap_walker() API to unmap a memory range, some
page-table pages might not be freed if the page_count of the pages is not
right.

>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>> ---
>>   arch/arm64/kvm/hyp/pgtable.c | 7 ++++---
>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> index 0271b4a3b9fe..696b6aa83faf 100644
>> --- a/arch/arm64/kvm/hyp/pgtable.c
>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>> @@ -186,6 +186,7 @@ static bool kvm_set_valid_leaf_pte(kvm_pte_t *ptep, u64 pa, kvm_pte_t attr,
>>   		return old == pte;
>>   
>>   	smp_store_release(ptep, pte);
>> +	get_page(virt_to_page(ptep));
> This is also used for the hypervisor stage-1 page-table, so I'd prefer to
> leave this function as-is.
I agree on this point.
>>   	return true;
>>   }
>>   
>> @@ -476,6 +477,7 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>   	/* There's an existing valid leaf entry, so perform break-before-make */
>>   	kvm_set_invalid_pte(ptep);
>>   	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
>> +	put_page(virt_to_page(ptep));
>>   	kvm_set_valid_leaf_pte(ptep, phys, data->attr, level);
>>   out:
>>   	data->phys += granule;
> Isn't this hunk alone sufficient to solve the problem?
>
> Will
> .

Not sufficient. When the old ptep is valid and the old pte equals the new
pte, "true" is also returned by kvm_set_valid_leaf_pte() and get_page()
will still be called.


Yanan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry
  2020-12-01  2:30         ` wangyanan (Y)
@ 2020-12-01 13:46           ` Will Deacon
  2020-12-01 14:05             ` Marc Zyngier
  2020-12-01 14:11             ` wangyanan (Y)
  0 siblings, 2 replies; 24+ messages in thread
From: Will Deacon @ 2020-12-01 13:46 UTC (permalink / raw)
  To: wangyanan (Y)
  Cc: linux-kernel, linux-arm-kernel, Marc Zyngier, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, yezengruan, zhukeqian1,
	yuzenghui, jiangkunkun, wangjingyi11, lushenming

On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:
> On 2020/12/1 0:01, Will Deacon wrote:
> > On Mon, Nov 30, 2020 at 11:24:19PM +0800, wangyanan (Y) wrote:
> > > On 2020/11/30 21:34, Will Deacon wrote:
> > > > On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
> > > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > > index 696b6aa83faf..fec8dc9f2baa 100644
> > > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > > @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
> > > > >    	return 0;
> > > > >    }
> > > > > +static void stage2_flush_dcache(void *addr, u64 size);
> > > > > +static bool stage2_pte_cacheable(kvm_pte_t pte);
> > > > > +
> > > > >    static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> > > > >    				struct stage2_map_data *data)
> > > > >    {
> > > > > @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> > > > >    	struct page *page = virt_to_page(ptep);
> > > > >    	if (data->anchor) {
> > > > > -		if (kvm_pte_valid(pte))
> > > > > +		if (kvm_pte_valid(pte)) {
> > > > > +			kvm_set_invalid_pte(ptep);
> > > > > +			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
> > > > > +				     addr, level);
> > > > >    			put_page(page);
> > > > This doesn't make sense to me: the page-table pages we're walking when the
> > > > anchor is set are not accessible to the hardware walker because we unhooked
> > > > the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
> > > > TLB invalidation.
> > > > 
> > > > Are you seeing a problem in practice here?
> > > Yes, I did indeed find a problem in practice.
> > > 
> > > When the migration was cancelled, a TLB conflict abort was found in the guest.
> > > 
> > > This problem was fixed before the rework of the page table code; you can have
> > > a look at the following two links:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277
> > > 
> > > https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html
> > Ok, let's go through this, because I still don't see the bug. Please correct
> > me if you spot any mistakes:
> > 
> >    1. We have a block mapping for X => Y
> >    2. Dirty logging is enabled, so the block mapping is write-protected and
> >       ends up being split into page mappings
> >    3. Dirty logging is disabled due to a failed migration.
> > 
> > --- At this point, I think we agree that the state of the MMU is alright ---
> > 
> >    4. We take a stage-2 fault and want to reinstall the block mapping:
> > 
> >       a. kvm_pgtable_stage2_map() is invoked to install the block mapping
> >       b. stage2_map_walk_table_pre() finds a table where we would like to
> >          install the block:
> > 
> > 	i.   The anchor is set to point at this entry
> > 	ii.  The entry is made invalid
> > 	iii. We invalidate the TLB for the input address. This is
> > 	     TLBI IPAS2E1IS without a level hint and then TLBI VMALLE1IS.
> > 
> > 	*** At this point, the page-table pointed to by the old table entry
> > 	    is not reachable to the hardware walker ***
> > 
> >       c. stage2_map_walk_leaf() is called for each leaf entry in the
> >          now-unreachable subtree, dropping page-references for each valid
> > 	entry it finds.
> >       d. stage2_map_walk_table_post() is eventually called for the entry
> >          which we cleared back in b.ii, so we install the new block mapping.
> > 
> > You are proposing to add additional TLB invalidation to (c), but I don't
> > think that is necessary, thanks to the invalidation already performed in
> > b.iii. What am I missing here?
> 
> The point is at b.iii where the TLBI is not enough. There are many page
> mappings that we need to merge into a block mapping.
> 
> We invalidate the TLB for the input address without a level hint at b.iii,
> but this operation only flushes the TLB for one page mapping; there are
> still TLB entries for the other page mappings in the cache, and the MMU
> hardware walker can still hit these entries next time.

Ah, yes, I see. Thanks. I hadn't considered the case where there are table
entries beneath the anchor. So how about the diff below?

Will

--->8

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0271b4a3b9fe..12526d8c7ae4 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -493,7 +493,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 		return 0;
 
 	kvm_set_invalid_pte(ptep);
-	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
+	/* TLB invalidation is deferred until the _post handler */
 	data->anchor = ptep;
 	return 0;
 }
@@ -547,11 +547,21 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
 				      struct stage2_map_data *data)
 {
 	int ret = 0;
+	kvm_pte_t pte = *ptep;
 
 	if (!data->anchor)
 		return 0;
 
-	free_page((unsigned long)kvm_pte_follow(*ptep));
+	kvm_set_invalid_pte(ptep);
+
+	/*
+	 * Invalidate the whole stage-2, as we may have numerous leaf
+	 * entries below us which would otherwise need invalidating
+	 * individually.
+	 */
+	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
+
+	free_page((unsigned long)kvm_pte_follow(pte));
 	put_page(virt_to_page(ptep));
 
 	if (data->anchor == ptep) {

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry
  2020-12-01 13:46           ` Will Deacon
@ 2020-12-01 14:05             ` Marc Zyngier
  2020-12-01 14:23               ` Will Deacon
  2020-12-01 14:11             ` wangyanan (Y)
  1 sibling, 1 reply; 24+ messages in thread
From: Marc Zyngier @ 2020-12-01 14:05 UTC (permalink / raw)
  To: Will Deacon
  Cc: wangyanan (Y),
	linux-kernel, linux-arm-kernel, Catalin Marinas, James Morse,
	Julien Thierry, Suzuki K Poulose, Gavin Shan, Quentin Perret,
	wanghaibin.wang, yezengruan, zhukeqian1, yuzenghui, jiangkunkun,
	wangjingyi11, lushenming

On 2020-12-01 13:46, Will Deacon wrote:
> On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:
>> On 2020/12/1 0:01, Will Deacon wrote:
>> > On Mon, Nov 30, 2020 at 11:24:19PM +0800, wangyanan (Y) wrote:
>> > > On 2020/11/30 21:34, Will Deacon wrote:
>> > > > On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
>> > > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> > > > > index 696b6aa83faf..fec8dc9f2baa 100644
>> > > > > --- a/arch/arm64/kvm/hyp/pgtable.c
>> > > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
>> > > > > @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>> > > > >    	return 0;
>> > > > >    }
>> > > > > +static void stage2_flush_dcache(void *addr, u64 size);
>> > > > > +static bool stage2_pte_cacheable(kvm_pte_t pte);
>> > > > > +
>> > > > >    static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>> > > > >    				struct stage2_map_data *data)
>> > > > >    {
>> > > > > @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>> > > > >    	struct page *page = virt_to_page(ptep);
>> > > > >    	if (data->anchor) {
>> > > > > -		if (kvm_pte_valid(pte))
>> > > > > +		if (kvm_pte_valid(pte)) {
>> > > > > +			kvm_set_invalid_pte(ptep);
>> > > > > +			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
>> > > > > +				     addr, level);
>> > > > >    			put_page(page);
>> > > > This doesn't make sense to me: the page-table pages we're walking when the
>> > > > anchor is set are not accessible to the hardware walker because we unhooked
>> > > > the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
>> > > > TLB invalidation.
>> > > >
>> > > > Are you seeing a problem in practice here?
>> > > Yes, I did indeed find a problem in practice.
>> > >
>> > > When the migration was cancelled, a TLB conflict abort was found in the guest.
>> > >
>> > > This problem was fixed before the rework of the page table code; you can have
>> > > a look at the following two links:
>> > >
>> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277
>> > >
>> > > https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html
>> > Ok, let's go through this, because I still don't see the bug. Please correct
>> > me if you spot any mistakes:
>> >
>> >    1. We have a block mapping for X => Y
>> >    2. Dirty logging is enabled, so the block mapping is write-protected and
>> >       ends up being split into page mappings
>> >    3. Dirty logging is disabled due to a failed migration.
>> >
>> > --- At this point, I think we agree that the state of the MMU is alright ---
>> >
>> >    4. We take a stage-2 fault and want to reinstall the block mapping:
>> >
>> >       a. kvm_pgtable_stage2_map() is invoked to install the block mapping
>> >       b. stage2_map_walk_table_pre() finds a table where we would like to
>> >          install the block:
>> >
>> > 	i.   The anchor is set to point at this entry
>> > 	ii.  The entry is made invalid
>> > 	iii. We invalidate the TLB for the input address. This is
>> > 	     TLBI IPAS2E1IS without a level hint and then TLBI VMALLE1IS.
>> >
>> > 	*** At this point, the page-table pointed to by the old table entry
>> > 	    is not reachable to the hardware walker ***
>> >
>> >       c. stage2_map_walk_leaf() is called for each leaf entry in the
>> >          now-unreachable subtree, dropping page-references for each valid
>> > 	entry it finds.
>> >       d. stage2_map_walk_table_post() is eventually called for the entry
>> >          which we cleared back in b.ii, so we install the new block mapping.
>> >
>> > You are proposing to add additional TLB invalidation to (c), but I don't
>> > think that is necessary, thanks to the invalidation already performed in
>> > b.iii. What am I missing here?
>> 
>> The point is at b.iii where the TLBI is not enough. There are many 
>> page
>> mappings that we need to merge into a block mapping.
>> 
>> We invalidate the TLB for the input address without a level hint at
>> b.iii, but this operation only flushes the TLB for one page mapping;
>> there are still TLB entries for the other page mappings in the cache,
>> and the MMU hardware walker can still hit these entries next time.
> 
> Ah, yes, I see. Thanks. I hadn't considered the case where there are 
> table
> entries beneath the anchor. So how about the diff below?
> 
> Will
> 
> --->8
> 
> diff --git a/arch/arm64/kvm/hyp/pgtable.c 
> b/arch/arm64/kvm/hyp/pgtable.c
> index 0271b4a3b9fe..12526d8c7ae4 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -493,7 +493,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64
> end, u32 level,
>  		return 0;
> 
>  	kvm_set_invalid_pte(ptep);
> -	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
> +	/* TLB invalidation is deferred until the _post handler */
>  	data->anchor = ptep;
>  	return 0;
>  }
> @@ -547,11 +547,21 @@ static int stage2_map_walk_table_post(u64 addr,
> u64 end, u32 level,
>  				      struct stage2_map_data *data)
>  {
>  	int ret = 0;
> +	kvm_pte_t pte = *ptep;
> 
>  	if (!data->anchor)
>  		return 0;
> 
> -	free_page((unsigned long)kvm_pte_follow(*ptep));
> +	kvm_set_invalid_pte(ptep);
> +
> +	/*
> +	 * Invalidate the whole stage-2, as we may have numerous leaf
> +	 * entries below us which would otherwise need invalidating
> +	 * individually.
> +	 */
> +	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);

That's a big hammer, and we so far have been pretty careful not to
over-invalidate. Is the block-replacing-table *without* an unmap
in between the only case where this triggers?

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry
  2020-12-01 13:46           ` Will Deacon
  2020-12-01 14:05             ` Marc Zyngier
@ 2020-12-01 14:11             ` wangyanan (Y)
  2020-12-01 14:35               ` Marc Zyngier
  1 sibling, 1 reply; 24+ messages in thread
From: wangyanan (Y) @ 2020-12-01 14:11 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, Marc Zyngier, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, yezengruan, zhukeqian1,
	yuzenghui, jiangkunkun, wangjingyi11, lushenming


On 2020/12/1 21:46, Will Deacon wrote:
> On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:
>> On 2020/12/1 0:01, Will Deacon wrote:
>>> On Mon, Nov 30, 2020 at 11:24:19PM +0800, wangyanan (Y) wrote:
>>>> On 2020/11/30 21:34, Will Deacon wrote:
>>>>> On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
>>>>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>>>>> index 696b6aa83faf..fec8dc9f2baa 100644
>>>>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>>>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>>>>> @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>>>>>>     	return 0;
>>>>>>     }
>>>>>> +static void stage2_flush_dcache(void *addr, u64 size);
>>>>>> +static bool stage2_pte_cacheable(kvm_pte_t pte);
>>>>>> +
>>>>>>     static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>>>>>     				struct stage2_map_data *data)
>>>>>>     {
>>>>>> @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>>>>>     	struct page *page = virt_to_page(ptep);
>>>>>>     	if (data->anchor) {
>>>>>> -		if (kvm_pte_valid(pte))
>>>>>> +		if (kvm_pte_valid(pte)) {
>>>>>> +			kvm_set_invalid_pte(ptep);
>>>>>> +			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
>>>>>> +				     addr, level);
>>>>>>     			put_page(page);
>>>>> This doesn't make sense to me: the page-table pages we're walking when the
>>>>> anchor is set are not accessible to the hardware walker because we unhooked
>>>>> the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
>>>>> TLB invalidation.
>>>>>
>>>>> Are you seeing a problem in practice here?
>>>> Yes, I did indeed find a problem in practice.
>>>>
>>>> When the migration was cancelled, a TLB conflict abort was found in the guest.
>>>>
>>>> This problem was fixed before the rework of the page-table code; you can have a
>>>> look at the following two links:
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277
>>>>
>>>> https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html
>>> Ok, let's go through this, because I still don't see the bug. Please correct
>>> me if you spot any mistakes:
>>>
>>>     1. We have a block mapping for X => Y
>>>     2. Dirty logging is enabled, so the block mapping is write-protected and
>>>        ends up being split into page mappings
>>>     3. Dirty logging is disabled due to a failed migration.
>>>
>>> --- At this point, I think we agree that the state of the MMU is alright ---
>>>
>>>     4. We take a stage-2 fault and want to reinstall the block mapping:
>>>
>>>        a. kvm_pgtable_stage2_map() is invoked to install the block mapping
>>>        b. stage2_map_walk_table_pre() finds a table where we would like to
>>>           install the block:
>>>
>>> 	i.   The anchor is set to point at this entry
>>> 	ii.  The entry is made invalid
>>> 	iii. We invalidate the TLB for the input address. This is
>>> 	     TLBI IPAS2SE1IS without level hint and then TLBI VMALLE1IS.
>>>
>>> 	*** At this point, the page-table pointed to by the old table entry
>>> 	    is not reachable to the hardware walker ***
>>>
>>>        c. stage2_map_walk_leaf() is called for each leaf entry in the
>>>           now-unreachable subtree, dropping page-references for each valid
>>> 	entry it finds.
>>>        d. stage2_map_walk_table_post() is eventually called for the entry
>>>           which we cleared back in b.ii, so we install the new block mapping.
>>>
>>> You are proposing to add additional TLB invalidation to (c), but I don't
>>> think that is necessary, thanks to the invalidation already performed in
>>> b.iii. What am I missing here?
>> The point is at b.iii where the TLBI is not enough. There are many page
>> mappings that we need to merge into a block mapping.
>>
>> We invalidate the TLB for the input address without a level hint at b.iii, but
>> this operation just flushes the TLB for one page mapping; there are still some
>> TLB entries for the other page mappings in the cache, and the MMU hardware
>> walker can still hit these entries next time.
> Ah, yes, I see. Thanks. I hadn't considered the case where there are table
> entries beneath the anchor. So how about the diff below?
>
> Will
>
> --->8

Hi, I think it's inappropriate to put the TLBI for all the leaf entries in
stage2_map_walk_table_post(), because *ptep must be an upper table entry when
we enter stage2_map_walk_table_post().

We should issue the TLBI for every leaf entry (not the table entry) at the
last lookup level, which is why I am proposing to add the additional TLBI in
stage2_map_walk_leaf().
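
Roughly, what I have in mind is the hunk from my original patch, i.e. while
the anchor is set, invalidate each valid leaf by IPA (with the level hint) as
it is dropped. A sketch only, assuming the surrounding stage2_map_walk_leaf()
context stays as it is today:

	if (data->anchor) {
		if (kvm_pte_valid(pte)) {
			/* Unhook and invalidate this leaf individually */
			kvm_set_invalid_pte(ptep);
			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
				     addr, level);
			put_page(page);
		}

		return 0;
	}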

Thanks.


Yanan

>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 0271b4a3b9fe..12526d8c7ae4 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -493,7 +493,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>   		return 0;
>   
>   	kvm_set_invalid_pte(ptep);
> -	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
> +	/* TLB invalidation is deferred until the _post handler */
>   	data->anchor = ptep;
>   	return 0;
>   }
> @@ -547,11 +547,21 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
>   				      struct stage2_map_data *data)
>   {
>   	int ret = 0;
> +	kvm_pte_t pte = *ptep;
>   
>   	if (!data->anchor)
>   		return 0;
>   
> -	free_page((unsigned long)kvm_pte_follow(*ptep));
> +	kvm_set_invalid_pte(ptep);
> +
> +	/*
> +	 * Invalidate the whole stage-2, as we may have numerous leaf
> +	 * entries below us which would otherwise need invalidating
> +	 * individually.
> +	 */
> +	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
> +
> +	free_page((unsigned long)kvm_pte_follow(pte));
>   	put_page(virt_to_page(ptep));
>   
>   	if (data->anchor == ptep) {
> .

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 1/3] KVM: arm64: Fix possible memory leak in kvm stage2
  2020-12-01  7:21     ` wangyanan (Y)
@ 2020-12-01 14:16       ` Will Deacon
  2020-12-01 17:19         ` wangyanan (Y)
  0 siblings, 1 reply; 24+ messages in thread
From: Will Deacon @ 2020-12-01 14:16 UTC (permalink / raw)
  To: wangyanan (Y)
  Cc: linux-kernel, linux-arm-kernel, Marc Zyngier, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, yezengruan, zhukeqian1,
	yuzenghui, jiangkunkun, wangjingyi11, lushenming

On Tue, Dec 01, 2020 at 03:21:23PM +0800, wangyanan (Y) wrote:
> On 2020/11/30 21:21, Will Deacon wrote:
> > On Mon, Nov 30, 2020 at 08:18:45PM +0800, Yanan Wang wrote:
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index 0271b4a3b9fe..696b6aa83faf 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -186,6 +186,7 @@ static bool kvm_set_valid_leaf_pte(kvm_pte_t *ptep, u64 pa, kvm_pte_t attr,
> > >   		return old == pte;
> > >   	smp_store_release(ptep, pte);
> > > +	get_page(virt_to_page(ptep));
> > This is also used for the hypervisor stage-1 page-table, so I'd prefer to
> > leave this function as-is.
> I agree at this point.
> > >   	return true;
> > >   }
> > > @@ -476,6 +477,7 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> > >   	/* There's an existing valid leaf entry, so perform break-before-make */
> > >   	kvm_set_invalid_pte(ptep);
> > >   	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
> > > +	put_page(virt_to_page(ptep));
> > >   	kvm_set_valid_leaf_pte(ptep, phys, data->attr, level);
> > >   out:
> > >   	data->phys += granule;
> > Isn't this hunk alone sufficient to solve the problem?
> > 
> > Will
> > .
> 
> Not sufficient. When the old ptep is valid and the old pte equals the new
> pte, "true" is still returned by kvm_set_valid_leaf_pte()
> 
> and get_page() will still be called.

I had a go at fixing this without introducing refcounting to the hyp stage-1
case, and ended up with the diff below. What do you think?

Will

--->8

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0271b4a3b9fe..78e2c0dc47ae 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -170,23 +170,16 @@ static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t *childp)
 	smp_store_release(ptep, pte);
 }
 
-static bool kvm_set_valid_leaf_pte(kvm_pte_t *ptep, u64 pa, kvm_pte_t attr,
-				   u32 level)
+static kvm_pte_t kvm_init_valid_leaf_pte(u64 pa, kvm_pte_t attr, u32 level)
 {
-	kvm_pte_t old = *ptep, pte = kvm_phys_to_pte(pa);
+	kvm_pte_t pte = kvm_phys_to_pte(pa);
 	u64 type = (level == KVM_PGTABLE_MAX_LEVELS - 1) ? KVM_PTE_TYPE_PAGE :
 							   KVM_PTE_TYPE_BLOCK;
 
 	pte |= attr & (KVM_PTE_LEAF_ATTR_LO | KVM_PTE_LEAF_ATTR_HI);
 	pte |= FIELD_PREP(KVM_PTE_TYPE, type);
 	pte |= KVM_PTE_VALID;
-
-	/* Tolerate KVM recreating the exact same mapping. */
-	if (kvm_pte_valid(old))
-		return old == pte;
-
-	smp_store_release(ptep, pte);
-	return true;
+	return pte;
 }
 
 static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
@@ -341,12 +334,17 @@ static int hyp_map_set_prot_attr(enum kvm_pgtable_prot prot,
 static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 				    kvm_pte_t *ptep, struct hyp_map_data *data)
 {
+	kvm_pte_t new, old = *ptep;
 	u64 granule = kvm_granule_size(level), phys = data->phys;
 
 	if (!kvm_block_mapping_supported(addr, end, phys, level))
 		return false;
 
-	WARN_ON(!kvm_set_valid_leaf_pte(ptep, phys, data->attr, level));
+	/* Tolerate KVM recreating the exact same mapping. */
+	new = kvm_init_valid_leaf_pte(phys, data->attr, level);
+	if (old != new && !WARN_ON(kvm_pte_valid(old)))
+		smp_store_release(ptep, new);
+
 	data->phys += granule;
 	return true;
 }
@@ -465,19 +463,24 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 				       kvm_pte_t *ptep,
 				       struct stage2_map_data *data)
 {
+	kvm_pte_t new, old = *ptep;
 	u64 granule = kvm_granule_size(level), phys = data->phys;
 
 	if (!kvm_block_mapping_supported(addr, end, phys, level))
 		return false;
 
-	if (kvm_set_valid_leaf_pte(ptep, phys, data->attr, level))
-		goto out;
+	new = kvm_init_valid_leaf_pte(phys, data->attr, level);
+	if (kvm_pte_valid(old)) {
+		/*
+		 * There's an existing valid leaf entry, so perform
+		 * break-before-make.
+		 */
+		kvm_set_invalid_pte(ptep);
+		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
+		put_page(virt_to_page(ptep));
+	}
 
-	/* There's an existing valid leaf entry, so perform break-before-make */
-	kvm_set_invalid_pte(ptep);
-	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
-	kvm_set_valid_leaf_pte(ptep, phys, data->attr, level);
-out:
+	smp_store_release(ptep, new);
 	data->phys += granule;
 	return true;
 }

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry
  2020-12-01 14:05             ` Marc Zyngier
@ 2020-12-01 14:23               ` Will Deacon
  2020-12-01 14:32                 ` Marc Zyngier
  0 siblings, 1 reply; 24+ messages in thread
From: Will Deacon @ 2020-12-01 14:23 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: wangyanan (Y),
	linux-kernel, linux-arm-kernel, Catalin Marinas, James Morse,
	Julien Thierry, Suzuki K Poulose, Gavin Shan, Quentin Perret,
	wanghaibin.wang, yezengruan, zhukeqian1, yuzenghui, jiangkunkun,
	wangjingyi11, lushenming

On Tue, Dec 01, 2020 at 02:05:03PM +0000, Marc Zyngier wrote:
> On 2020-12-01 13:46, Will Deacon wrote:
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 0271b4a3b9fe..12526d8c7ae4 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -493,7 +493,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
> >  		return 0;
> > 
> >  	kvm_set_invalid_pte(ptep);
> > -	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
> > +	/* TLB invalidation is deferred until the _post handler */
> >  	data->anchor = ptep;
> >  	return 0;
> >  }
> > @@ -547,11 +547,21 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
> >  				      struct stage2_map_data *data)
> >  {
> >  	int ret = 0;
> > +	kvm_pte_t pte = *ptep;
> > 
> >  	if (!data->anchor)
> >  		return 0;
> > 
> > -	free_page((unsigned long)kvm_pte_follow(*ptep));
> > +	kvm_set_invalid_pte(ptep);
> > +
> > +	/*
> > +	 * Invalidate the whole stage-2, as we may have numerous leaf
> > +	 * entries below us which would otherwise need invalidating
> > +	 * individually.
> > +	 */
> > +	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
> 
> That's a big hammer, and we so far have been pretty careful not to
> over-invalidate. Is the block-replacing-table *without* an unmap
> in between the only case where this triggers?

Yes, this only happens in that case. The alternative would be to issue
invalidations for every single entry we unmap, which I can implement if
you prefer, but it felt worse to me (given that by-IPA invalidation
isn't really great either).

Will

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry
  2020-12-01 14:23               ` Will Deacon
@ 2020-12-01 14:32                 ` Marc Zyngier
  0 siblings, 0 replies; 24+ messages in thread
From: Marc Zyngier @ 2020-12-01 14:32 UTC (permalink / raw)
  To: Will Deacon
  Cc: wangyanan (Y),
	linux-kernel, linux-arm-kernel, Catalin Marinas, James Morse,
	Julien Thierry, Suzuki K Poulose, Gavin Shan, Quentin Perret,
	wanghaibin.wang, yezengruan, zhukeqian1, yuzenghui, jiangkunkun,
	wangjingyi11, lushenming

On 2020-12-01 14:23, Will Deacon wrote:
> On Tue, Dec 01, 2020 at 02:05:03PM +0000, Marc Zyngier wrote:
>> On 2020-12-01 13:46, Will Deacon wrote:
>> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> > index 0271b4a3b9fe..12526d8c7ae4 100644
>> > --- a/arch/arm64/kvm/hyp/pgtable.c
>> > +++ b/arch/arm64/kvm/hyp/pgtable.c
>> > @@ -493,7 +493,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>> >  		return 0;
>> >
>> >  	kvm_set_invalid_pte(ptep);
>> > -	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
>> > +	/* TLB invalidation is deferred until the _post handler */
>> >  	data->anchor = ptep;
>> >  	return 0;
>> >  }
>> > @@ -547,11 +547,21 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
>> >  				      struct stage2_map_data *data)
>> >  {
>> >  	int ret = 0;
>> > +	kvm_pte_t pte = *ptep;
>> >
>> >  	if (!data->anchor)
>> >  		return 0;
>> >
>> > -	free_page((unsigned long)kvm_pte_follow(*ptep));
>> > +	kvm_set_invalid_pte(ptep);
>> > +
>> > +	/*
>> > +	 * Invalidate the whole stage-2, as we may have numerous leaf
>> > +	 * entries below us which would otherwise need invalidating
>> > +	 * individually.
>> > +	 */
>> > +	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>> 
>> That's a big hammer, and we so far have been pretty careful not to
>> over-invalidate. Is the block-replacing-table *without* an unmap
>> in between the only case where this triggers?
> 
> Yes, this only happens in that case. The alternative would be to issue
> invalidations for every single entry we unmap, which I can implement if
> you prefer, but it felt worse to me (given that by-IPA invalidation
> isn't really great either).

Right. If that's the only case where this happens, I'm not too bothered.
What I want to make sure is that the normal table->block upgrade path
(which goes via a MMU notifier that unmaps the table) stays relatively
quick and doesn't suffer from the above.

It looks like Yanan still has some concerns though, and I'd like to
understand what they are.

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry
  2020-12-01 14:11             ` wangyanan (Y)
@ 2020-12-01 14:35               ` Marc Zyngier
  2020-12-01 17:20                 ` wangyanan (Y)
  0 siblings, 1 reply; 24+ messages in thread
From: Marc Zyngier @ 2020-12-01 14:35 UTC (permalink / raw)
  To: wangyanan (Y)
  Cc: Will Deacon, linux-kernel, linux-arm-kernel, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, yezengruan, zhukeqian1,
	yuzenghui, jiangkunkun, wangjingyi11, lushenming

Hi Yanan,

On 2020-12-01 14:11, wangyanan (Y) wrote:
> On 2020/12/1 21:46, Will Deacon wrote:
>> On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:

[...]

>>> The point is at b.iii where the TLBI is not enough. There are many page
>>> mappings that we need to merge into a block mapping.
>>> 
>>> We invalidate the TLB for the input address without a level hint at b.iii,
>>> but this operation just flushes the TLB for one page mapping; there are
>>> still some TLB entries for the other page mappings in the cache, and the
>>> MMU hardware walker can still hit these entries next time.
>> Ah, yes, I see. Thanks. I hadn't considered the case where there are table
>> entries beneath the anchor. So how about the diff below?
>> 
>> Will
>> 
>> --->8
> 
> Hi, I think it's inappropriate to put the TLBI for all the leaf entries
> in stage2_map_walk_table_post(),
> 
> because *ptep must be an upper table entry when we enter
> stage2_map_walk_table_post().
> 
> We should issue the TLBI for every leaf entry (not the table entry) at the
> last lookup level, which is why I am proposing
> 
> to add the additional TLBI in stage2_map_walk_leaf().

Could you make your concerns explicit? As far as I can tell, this should
address the bug you found, at least from a correctness perspective.

Are you worried about the impact of the full S2 invalidation? Or do you
see another correctness issue?

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 1/3] KVM: arm64: Fix possible memory leak in kvm stage2
  2020-12-01 14:16       ` Will Deacon
@ 2020-12-01 17:19         ` wangyanan (Y)
  2020-12-01 18:15           ` Will Deacon
  0 siblings, 1 reply; 24+ messages in thread
From: wangyanan (Y) @ 2020-12-01 17:19 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, Marc Zyngier, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, yezengruan, zhukeqian1,
	yuzenghui, jiangkunkun, wangjingyi11, lushenming

On 2020/12/1 22:16, Will Deacon wrote:

> On Tue, Dec 01, 2020 at 03:21:23PM +0800, wangyanan (Y) wrote:
>> On 2020/11/30 21:21, Will Deacon wrote:
>>> On Mon, Nov 30, 2020 at 08:18:45PM +0800, Yanan Wang wrote:
>>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>>> index 0271b4a3b9fe..696b6aa83faf 100644
>>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>>> @@ -186,6 +186,7 @@ static bool kvm_set_valid_leaf_pte(kvm_pte_t *ptep, u64 pa, kvm_pte_t attr,
>>>>    		return old == pte;
>>>>    	smp_store_release(ptep, pte);
>>>> +	get_page(virt_to_page(ptep));
>>> This is also used for the hypervisor stage-1 page-table, so I'd prefer to
>>> leave this function as-is.
>> I agree at this point.
>>>>    	return true;
>>>>    }
>>>> @@ -476,6 +477,7 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>>>    	/* There's an existing valid leaf entry, so perform break-before-make */
>>>>    	kvm_set_invalid_pte(ptep);
>>>>    	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
>>>> +	put_page(virt_to_page(ptep));
>>>>    	kvm_set_valid_leaf_pte(ptep, phys, data->attr, level);
>>>>    out:
>>>>    	data->phys += granule;
>>> Isn't this hunk alone sufficient to solve the problem?
>>>
>>> Will
>>> .
>> Not sufficient. When the old ptep is valid and the old pte equals the new
>> pte, "true" is still returned by kvm_set_valid_leaf_pte()
>>
>> and get_page() will still be called.
> I had a go at fixing this without introducing refcounting to the hyp stage-1
> case, and ended up with the diff below. What do you think?

Hi Will,

Functionally this diff looks fine to me. A small comment inline, please 
see below.

I had made an alternative fix (after sending v1) and it looks much more 
concise.

If you're ok with it, I can send it as v2 (together with patch#2 and #3) 
after some tests.


Thanks,

Yanan


diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0271b4a3b9fe..b232bdd142a6 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -470,6 +470,9 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
         if (!kvm_block_mapping_supported(addr, end, phys, level))
                 return false;

+       if (kvm_pte_valid(*ptep))
+               put_page(virt_to_page(ptep));
+
         if (kvm_set_valid_leaf_pte(ptep, phys, data->attr, level))
                 goto out;
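
To spell out the intent of that hunk (just a sketch of the resulting flow
with comments added, not a separate change): the put_page() drops the
reference that the existing valid entry took on the table page when it was
installed, and the get_page() added by the original patch then takes a fresh
one when the new entry is installed, so the refcount stays balanced.

	if (!kvm_block_mapping_supported(addr, end, phys, level))
		return false;

	/*
	 * A valid old entry took a reference on this table page when it
	 * was installed; drop it before the entry is replaced below.
	 */
	if (kvm_pte_valid(*ptep))
		put_page(virt_to_page(ptep));

	/* Installation takes a fresh reference via get_page() */
	if (kvm_set_valid_leaf_pte(ptep, phys, data->attr, level))
		goto out;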

>
> --->8
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 0271b4a3b9fe..78e2c0dc47ae 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -170,23 +170,16 @@ static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t *childp)
>   	smp_store_release(ptep, pte);
>   }
>   
> -static bool kvm_set_valid_leaf_pte(kvm_pte_t *ptep, u64 pa, kvm_pte_t attr,
> -				   u32 level)
> +static kvm_pte_t kvm_init_valid_leaf_pte(u64 pa, kvm_pte_t attr, u32 level)
>   {
> -	kvm_pte_t old = *ptep, pte = kvm_phys_to_pte(pa);
> +	kvm_pte_t pte = kvm_phys_to_pte(pa);
>   	u64 type = (level == KVM_PGTABLE_MAX_LEVELS - 1) ? KVM_PTE_TYPE_PAGE :
>   							   KVM_PTE_TYPE_BLOCK;
>   
>   	pte |= attr & (KVM_PTE_LEAF_ATTR_LO | KVM_PTE_LEAF_ATTR_HI);
>   	pte |= FIELD_PREP(KVM_PTE_TYPE, type);
>   	pte |= KVM_PTE_VALID;
> -
> -	/* Tolerate KVM recreating the exact same mapping. */
> -	if (kvm_pte_valid(old))
> -		return old == pte;
> -
> -	smp_store_release(ptep, pte);
> -	return true;
> +	return pte;
>   }
>   
>   static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
> @@ -341,12 +334,17 @@ static int hyp_map_set_prot_attr(enum kvm_pgtable_prot prot,
>   static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>   				    kvm_pte_t *ptep, struct hyp_map_data *data)
>   {
> +	kvm_pte_t new, old = *ptep;
>   	u64 granule = kvm_granule_size(level), phys = data->phys;
>   
>   	if (!kvm_block_mapping_supported(addr, end, phys, level))
>   		return false;
>   
> -	WARN_ON(!kvm_set_valid_leaf_pte(ptep, phys, data->attr, level));
> +	/* Tolerate KVM recreating the exact same mapping. */
> +	new = kvm_init_valid_leaf_pte(phys, data->attr, level);
> +	if (old != new && !WARN_ON(kvm_pte_valid(old)))
> +		smp_store_release(ptep, new);
> +
>   	data->phys += granule;
>   	return true;
>   }
> @@ -465,19 +463,24 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>   				       kvm_pte_t *ptep,
>   				       struct stage2_map_data *data)
>   {
> +	kvm_pte_t new, old = *ptep;
>   	u64 granule = kvm_granule_size(level), phys = data->phys;
>   
>   	if (!kvm_block_mapping_supported(addr, end, phys, level))
>   		return false;
>   
> -	if (kvm_set_valid_leaf_pte(ptep, phys, data->attr, level))
> -		goto out;
> +	new = kvm_init_valid_leaf_pte(phys, data->attr, level);
> +	if (kvm_pte_valid(old)) {
> +		/*
> +		 * There's an existing valid leaf entry, so perform
> +		 * break-before-make.
> +		 */
> +		kvm_set_invalid_pte(ptep);
> +		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
> +		put_page(virt_to_page(ptep));

When the old PTE is valid and equal to the new one, we will also perform
break-before-make and the new PTE installation, but they're actually not
necessary in this case.

> +	}
>   
> -	/* There's an existing valid leaf entry, so perform break-before-make */
> -	kvm_set_invalid_pte(ptep);
> -	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
> -	kvm_set_valid_leaf_pte(ptep, phys, data->attr, level);
> -out:
> +	smp_store_release(ptep, new);
>   	data->phys += granule;
>   	return true;
>   }
> .

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry
  2020-12-01 14:35               ` Marc Zyngier
@ 2020-12-01 17:20                 ` wangyanan (Y)
  2020-12-01 18:17                   ` Will Deacon
  0 siblings, 1 reply; 24+ messages in thread
From: wangyanan (Y) @ 2020-12-01 17:20 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Will Deacon, linux-kernel, linux-arm-kernel, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, yezengruan, zhukeqian1,
	yuzenghui, jiangkunkun, wangjingyi11, lushenming

On 2020/12/1 22:35, Marc Zyngier wrote:

> Hi Yanan,
>
> On 2020-12-01 14:11, wangyanan (Y) wrote:
>> On 2020/12/1 21:46, Will Deacon wrote:
>>> On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:
>
> [...]
>
>>>> The point is at b.iii where the TLBI is not enough. There are many page
>>>> mappings that we need to merge into a block mapping.
>>>>
>>>> We invalidate the TLB for the input address without a level hint at b.iii,
>>>> but this operation just flushes the TLB for one page mapping; there are
>>>> still some TLB entries for the other page mappings in the cache, and the
>>>> MMU hardware walker can still hit these entries next time.
>>> Ah, yes, I see. Thanks. I hadn't considered the case where there are table
>>> entries beneath the anchor. So how about the diff below?
>>>
>>> Will
>>>
>>> --->8
>>
>> Hi, I think it's inappropriate to put the TLBI for all the leaf entries
>> in stage2_map_walk_table_post(),
>>
>> because *ptep must be an upper table entry when we enter
>> stage2_map_walk_table_post().
>>
>> We should issue the TLBI for every leaf entry (not the table entry) at the
>> last lookup level, which is why I am proposing
>>
>> to add the additional TLBI in stage2_map_walk_leaf().
>
> Could you make your concerns explicit? As far as I can tell, this should
> address the bug you found, at least from a correctness perspective.
>
> Are you worried about the impact of the full S2 invalidation? Or do you
> see another correctness issue?


Hi Will, Marc,


After rechecking the diff, the full S2 invalidation in
stage2_map_walk_table_post() should indeed be enough to solve this problem.

But I was wondering if we could do the full S2 invalidation in
stage2_map_walk_table_pre() instead, where __kvm_tlb_flush_vmid() will be
called only once.

If we add the full TLBI in stage2_map_walk_table_post(),
__kvm_tlb_flush_vmid() might be called many times in the loop, and lots of
(unnecessary) CPU instructions will be wasted.

What I'm suggesting is something like below; please let me know what you
think.

If this is OK, I can update the diff in v2 and send it with your SOB (is
that appropriate?) after some tests.


diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index b232bdd142a6..f11fb2996080 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -496,7 +496,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
                 return 0;

         kvm_set_invalid_pte(ptep);
-       kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
+       kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
         data->anchor = ptep;
         return 0;
  }


Thanks,

Yanan


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 1/3] KVM: arm64: Fix possible memory leak in kvm stage2
  2020-12-01 17:19         ` wangyanan (Y)
@ 2020-12-01 18:15           ` Will Deacon
  2020-12-01 20:08             ` wangyanan (Y)
  0 siblings, 1 reply; 24+ messages in thread
From: Will Deacon @ 2020-12-01 18:15 UTC (permalink / raw)
  To: wangyanan (Y)
  Cc: linux-kernel, linux-arm-kernel, Marc Zyngier, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, yezengruan, zhukeqian1,
	yuzenghui, jiangkunkun, wangjingyi11, lushenming

On Wed, Dec 02, 2020 at 01:19:35AM +0800, wangyanan (Y) wrote:
> On 2020/12/1 22:16, Will Deacon wrote:
> > On Tue, Dec 01, 2020 at 03:21:23PM +0800, wangyanan (Y) wrote:
> > > On 2020/11/30 21:21, Will Deacon wrote:
> > > > On Mon, Nov 30, 2020 at 08:18:45PM +0800, Yanan Wang wrote:
> > > > > @@ -476,6 +477,7 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> > > > >    	/* There's an existing valid leaf entry, so perform break-before-make */
> > > > >    	kvm_set_invalid_pte(ptep);
> > > > >    	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
> > > > > +	put_page(virt_to_page(ptep));
> > > > >    	kvm_set_valid_leaf_pte(ptep, phys, data->attr, level);
> > > > >    out:
> > > > >    	data->phys += granule;
> > > > Isn't this hunk alone sufficient to solve the problem?
> > > > 
> > > Not sufficient. When the old ptep is valid and the old pte equals the new
> > > pte, "true" is still returned by kvm_set_valid_leaf_pte()
> > > 
> > > and get_page() will still be called.
> > I had a go at fixing this without introducing refcounting to the hyp stage-1
> > case, and ended up with the diff below. What do you think?
> 
> Functionally this diff looks fine to me. A small comment inline, please see
> below.
> 
> I had made an alternative fix (after sending v1) and it looks much more
> concise.
> 
> If you're ok with it, I can send it as v2 (together with patch#2 and #3)
> after some tests.

Thanks.

> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 0271b4a3b9fe..b232bdd142a6 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -470,6 +470,9 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>         if (!kvm_block_mapping_supported(addr, end, phys, level))
>                 return false;
> 
> +       if (kvm_pte_valid(*ptep))
> +               put_page(virt_to_page(ptep));
> +
>         if (kvm_set_valid_leaf_pte(ptep, phys, data->attr, level))
>                 goto out;

This is certainly smaller, but see below.

> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 0271b4a3b9fe..78e2c0dc47ae 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -170,23 +170,16 @@ static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t *childp)
> >   	smp_store_release(ptep, pte);
> >   }
> > -static bool kvm_set_valid_leaf_pte(kvm_pte_t *ptep, u64 pa, kvm_pte_t attr,
> > -				   u32 level)
> > +static kvm_pte_t kvm_init_valid_leaf_pte(u64 pa, kvm_pte_t attr, u32 level)
> >   {
> > -	kvm_pte_t old = *ptep, pte = kvm_phys_to_pte(pa);
> > +	kvm_pte_t pte = kvm_phys_to_pte(pa);
> >   	u64 type = (level == KVM_PGTABLE_MAX_LEVELS - 1) ? KVM_PTE_TYPE_PAGE :
> >   							   KVM_PTE_TYPE_BLOCK;
> >   	pte |= attr & (KVM_PTE_LEAF_ATTR_LO | KVM_PTE_LEAF_ATTR_HI);
> >   	pte |= FIELD_PREP(KVM_PTE_TYPE, type);
> >   	pte |= KVM_PTE_VALID;
> > -
> > -	/* Tolerate KVM recreating the exact same mapping. */
> > -	if (kvm_pte_valid(old))
> > -		return old == pte;
> > -
> > -	smp_store_release(ptep, pte);
> > -	return true;
> > +	return pte;
> >   }
> >   static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
> > @@ -341,12 +334,17 @@ static int hyp_map_set_prot_attr(enum kvm_pgtable_prot prot,
> >   static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> >   				    kvm_pte_t *ptep, struct hyp_map_data *data)
> >   {
> > +	kvm_pte_t new, old = *ptep;
> >   	u64 granule = kvm_granule_size(level), phys = data->phys;
> >   	if (!kvm_block_mapping_supported(addr, end, phys, level))
> >   		return false;
> > -	WARN_ON(!kvm_set_valid_leaf_pte(ptep, phys, data->attr, level));
> > +	/* Tolerate KVM recreating the exact same mapping. */
> > +	new = kvm_init_valid_leaf_pte(phys, data->attr, level);
> > +	if (old != new && !WARN_ON(kvm_pte_valid(old)))
> > +		smp_store_release(ptep, new);
> > +
> >   	data->phys += granule;
> >   	return true;
> >   }
> > @@ -465,19 +463,24 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> >   				       kvm_pte_t *ptep,
> >   				       struct stage2_map_data *data)
> >   {
> > +	kvm_pte_t new, old = *ptep;
> >   	u64 granule = kvm_granule_size(level), phys = data->phys;
> >   	if (!kvm_block_mapping_supported(addr, end, phys, level))
> >   		return false;
> > -	if (kvm_set_valid_leaf_pte(ptep, phys, data->attr, level))
> > -		goto out;
> > +	new = kvm_init_valid_leaf_pte(phys, data->attr, level);
> > +	if (kvm_pte_valid(old)) {
> > +		/*
> > +		 * There's an existing valid leaf entry, so perform
> > +		 * break-before-make.
> > +		 */
> > +		kvm_set_invalid_pte(ptep);
> > +		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
> > +		put_page(virt_to_page(ptep));
> 
> When the old PTE is valid and equal to the new one, we will also perform
> break-before-make and the new PTE installation, but they're actually not
> necessary in this case.

Agreed, but I don't think that case should happen for stage-2 mappings.
That's why I prefer my diff here, as it makes that 'old == new' logic
specific to the hyp mappings.

But I'll leave it all up to you (these diffs are only intended to be
helpful).

Will

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry
  2020-12-01 17:20                 ` wangyanan (Y)
@ 2020-12-01 18:17                   ` Will Deacon
  0 siblings, 0 replies; 24+ messages in thread
From: Will Deacon @ 2020-12-01 18:17 UTC (permalink / raw)
  To: wangyanan (Y)
  Cc: Marc Zyngier, linux-kernel, linux-arm-kernel, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, yezengruan, zhukeqian1,
	yuzenghui, jiangkunkun, wangjingyi11, lushenming

On Wed, Dec 02, 2020 at 01:20:33AM +0800, wangyanan (Y) wrote:
> On 2020/12/1 22:35, Marc Zyngier wrote:
> 
> > Hi Yanan,
> > 
> > On 2020-12-01 14:11, wangyanan (Y) wrote:
> > > On 2020/12/1 21:46, Will Deacon wrote:
> > > > On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:
> > 
> > [...]
> > 
> > > > > The point is at b.iii where the TLBI is not enough. There are many page
> > > > > mappings that we need to merge into a block mapping.
> > > > > 
> > > > > We invalidate the TLB for the input address without a level hint at
> > > > > b.iii, but this operation just flushes the TLB for one page mapping;
> > > > > there are still some TLB entries for the other page mappings in the
> > > > > cache, and the MMU hardware walker can still hit these entries next time.
> > > > Ah, yes, I see. Thanks. I hadn't considered the case where there are
> > > > table entries beneath the anchor. So how about the diff below?
> > > > 
> > > > Will
> > > > 
> > > > --->8
> > > 
> > > Hi, I think it's inappropriate to put the TLBI for all the leaf entries
> > > in stage2_map_walk_table_post(),
> > > 
> > > because *ptep must be an upper table entry when we enter
> > > stage2_map_walk_table_post().
> > > 
> > > We should issue the TLBI for every leaf entry (not the table entry) at the
> > > last lookup level, which is why I am proposing
> > > 
> > > to add the additional TLBI in stage2_map_walk_leaf().
> > 
> > Could you make your concerns explicit? As far as I can tell, this should
> > address the bug you found, at least from a correctness perspective.
> > 
> > Are you worried about the impact of the full S2 invalidation? Or do you
> > see another correctness issue?
> 
> 
> Hi Will, Marc,
> 
> 
> After rechecking the diff, the full S2 invalidation in
> stage2_map_walk_table_post() should indeed be enough to solve this problem.
> 
> But I was wondering if we could do the full S2 invalidation in
> stage2_map_walk_table_pre() instead, where __kvm_tlb_flush_vmid() will be
> called only once.
> 
> If we add the full TLBI in stage2_map_walk_table_post(),
> __kvm_tlb_flush_vmid() might be called many times in the loop, and lots
> of (unnecessary) CPU instructions will be wasted.
> 
> What I'm suggesting is something like below; please let me know what you
> think.
> 
> If this is OK, I can update the diff in v2 and send it with your SOB (is
> that appropriate?) after some tests.
> 
> 
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index b232bdd142a6..f11fb2996080 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -496,7 +496,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>                 return 0;
> 
>         kvm_set_invalid_pte(ptep);
> -       kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
> +       kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>         data->anchor = ptep;
>         return 0;

Yes, I think that's much better, but please add a comment! (you can
probably more-or-less copy the one I had in the post handler)
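
Something like this, perhaps (a sketch of the updated hunk with the comment
carried over):

	kvm_set_invalid_pte(ptep);

	/*
	 * Invalidate the whole stage-2, as we may have numerous leaf
	 * entries below us which would otherwise need invalidating
	 * individually.
	 */
	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
	data->anchor = ptep;
	return 0;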

Will

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 1/3] KVM: arm64: Fix possible memory leak in kvm stage2
  2020-12-01 18:15           ` Will Deacon
@ 2020-12-01 20:08             ` wangyanan (Y)
  0 siblings, 0 replies; 24+ messages in thread
From: wangyanan (Y) @ 2020-12-01 20:08 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, Marc Zyngier, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, wanghaibin.wang, yezengruan, zhukeqian1,
	yuzenghui, jiangkunkun, wangjingyi11, lushenming


On 2020/12/2 2:15, Will Deacon wrote:
> On Wed, Dec 02, 2020 at 01:19:35AM +0800, wangyanan (Y) wrote:
>> On 2020/12/1 22:16, Will Deacon wrote:
>>> On Tue, Dec 01, 2020 at 03:21:23PM +0800, wangyanan (Y) wrote:
>>>> On 2020/11/30 21:21, Will Deacon wrote:
>>>>> On Mon, Nov 30, 2020 at 08:18:45PM +0800, Yanan Wang wrote:
>>>>>> @@ -476,6 +477,7 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>>>>>     	/* There's an existing valid leaf entry, so perform break-before-make */
>>>>>>     	kvm_set_invalid_pte(ptep);
>>>>>>     	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
>>>>>> +	put_page(virt_to_page(ptep));
>>>>>>     	kvm_set_valid_leaf_pte(ptep, phys, data->attr, level);
>>>>>>     out:
>>>>>>     	data->phys += granule;
>>>>> Isn't this hunk alone sufficient to solve the problem?
>>>>>
>>>> Not sufficient. When the old ptep is valid and the old pte equals the new
>>>> pte, "true" is still returned by kvm_set_valid_leaf_pte()
>>>>
>>>> and get_page() will still be called.
>>> I had a go at fixing this without introducing refcounting to the hyp stage-1
>>> case, and ended up with the diff below. What do you think?
>> Functionally this diff looks fine to me. A small comment inline, please see
>> below.
>>
>> I had made an alternative fix (after sending v1) and it looks much more
>> concise.
>>
>> If you're ok with it, I can send it as v2 (together with patch#2 and #3)
>> after some tests.
> Thanks.
>
>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> index 0271b4a3b9fe..b232bdd142a6 100644
>> --- a/arch/arm64/kvm/hyp/pgtable.c
>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>> @@ -470,6 +470,9 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>          if (!kvm_block_mapping_supported(addr, end, phys, level))
>>                  return false;
>>
>> +       if (kvm_pte_valid(*ptep))
>> +               put_page(virt_to_page(ptep));
>> +
>>          if (kvm_set_valid_leaf_pte(ptep, phys, data->attr, level))
>>                  goto out;
> This is certainly smaller, but see below.
>
>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>> index 0271b4a3b9fe..78e2c0dc47ae 100644
>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>> @@ -170,23 +170,16 @@ static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t *childp)
>>>    	smp_store_release(ptep, pte);
>>>    }
>>> -static bool kvm_set_valid_leaf_pte(kvm_pte_t *ptep, u64 pa, kvm_pte_t attr,
>>> -				   u32 level)
>>> +static kvm_pte_t kvm_init_valid_leaf_pte(u64 pa, kvm_pte_t attr, u32 level)
>>>    {
>>> -	kvm_pte_t old = *ptep, pte = kvm_phys_to_pte(pa);
>>> +	kvm_pte_t pte = kvm_phys_to_pte(pa);
>>>    	u64 type = (level == KVM_PGTABLE_MAX_LEVELS - 1) ? KVM_PTE_TYPE_PAGE :
>>>    							   KVM_PTE_TYPE_BLOCK;
>>>    	pte |= attr & (KVM_PTE_LEAF_ATTR_LO | KVM_PTE_LEAF_ATTR_HI);
>>>    	pte |= FIELD_PREP(KVM_PTE_TYPE, type);
>>>    	pte |= KVM_PTE_VALID;
>>> -
>>> -	/* Tolerate KVM recreating the exact same mapping. */
>>> -	if (kvm_pte_valid(old))
>>> -		return old == pte;
>>> -
>>> -	smp_store_release(ptep, pte);
>>> -	return true;
>>> +	return pte;
>>>    }
>>>    static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
>>> @@ -341,12 +334,17 @@ static int hyp_map_set_prot_attr(enum kvm_pgtable_prot prot,
>>>    static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>>    				    kvm_pte_t *ptep, struct hyp_map_data *data)
>>>    {
>>> +	kvm_pte_t new, old = *ptep;
>>>    	u64 granule = kvm_granule_size(level), phys = data->phys;
>>>    	if (!kvm_block_mapping_supported(addr, end, phys, level))
>>>    		return false;
>>> -	WARN_ON(!kvm_set_valid_leaf_pte(ptep, phys, data->attr, level));
>>> +	/* Tolerate KVM recreating the exact same mapping. */
>>> +	new = kvm_init_valid_leaf_pte(phys, data->attr, level);
>>> +	if (old != new && !WARN_ON(kvm_pte_valid(old)))
>>> +		smp_store_release(ptep, new);
>>> +
>>>    	data->phys += granule;
>>>    	return true;
>>>    }
>>> @@ -465,19 +463,24 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>>    				       kvm_pte_t *ptep,
>>>    				       struct stage2_map_data *data)
>>>    {
>>> +	kvm_pte_t new, old = *ptep;
>>>    	u64 granule = kvm_granule_size(level), phys = data->phys;
>>>    	if (!kvm_block_mapping_supported(addr, end, phys, level))
>>>    		return false;
>>> -	if (kvm_set_valid_leaf_pte(ptep, phys, data->attr, level))
>>> -		goto out;
>>> +	new = kvm_init_valid_leaf_pte(phys, data->attr, level);
>>> +	if (kvm_pte_valid(old)) {
>>> +		/*
>>> +		 * There's an existing valid leaf entry, so perform
>>> +		 * break-before-make.
>>> +		 */
>>> +		kvm_set_invalid_pte(ptep);
>>> +		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
>>> +		put_page(virt_to_page(ptep));
>> When the old PTE is valid and equal to the new one, we will also perform
>> break-before-make and the new PTE installation, but they're actually not
>> necessary in this case.
> Agreed, but I don't think that case should happen for stage-2 mappings.
> That's why I prefer my diff here, as it makes that 'old == new' logic
> specific to the hyp mappings.
>
> But I'll leave it all up to you (these diffs are only intended to be
> helpful).
>
> Will
> .

Hi Will,

In my opinion, the 'old == new' case might happen in stage-2 translation
too, especially with concurrent accesses to the same range by multiple
vCPUs.

For example, when merging tables into a block, we perform break-before-make
on a valid old PTE in stage2_map_walk_table_pre().

If other vCPUs are trying to access the same GPA range, their MMUs will
target the same PTE as above, and they might just catch the moment when the
target PTE is set invalid during break-before-make by the vCPU holding the
mmu_lock, even though that PTE will be made valid again later.

As a result, a translation fault will be triggered on these vCPUs, and they
will then wait for the mmu_lock to map exactly the *same* mapping (old ==
new).
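
If we want to avoid the unnecessary break-before-make in that case, the
stage-2 leaf path could bail out early when nothing would change, similar to
the 'old == new' check in your hyp diff. A rough sketch of the idea (the
refcounting on this early-return path would still need to match whatever the
callers expect):

	new = kvm_init_valid_leaf_pte(phys, data->attr, level);
	if (kvm_pte_valid(old)) {
		/* Another vCPU raced us and installed the exact same mapping */
		if (old == new) {
			data->phys += granule;
			return true;
		}

		/* A different valid entry exists: break-before-make */
		kvm_set_invalid_pte(ptep);
		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
		put_page(virt_to_page(ptep));
	}

	smp_store_release(ptep, new);
	data->phys += granule;
	return true;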


Thanks,

Yanan


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2020-12-01 20:09 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-30 12:18 [RFC PATCH 0/3] Fix several bugs in KVM stage 2 translation Yanan Wang
2020-11-30 12:18 ` [RFC PATCH 1/3] KVM: arm64: Fix possible memory leak in kvm stage2 Yanan Wang
2020-11-30 13:21   ` Will Deacon
2020-12-01  7:21     ` wangyanan (Y)
2020-12-01 14:16       ` Will Deacon
2020-12-01 17:19         ` wangyanan (Y)
2020-12-01 18:15           ` Will Deacon
2020-12-01 20:08             ` wangyanan (Y)
2020-11-30 12:18 ` [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry Yanan Wang
2020-11-30 13:34   ` Will Deacon
2020-11-30 15:24     ` wangyanan (Y)
2020-11-30 16:01       ` Will Deacon
2020-12-01  2:30         ` wangyanan (Y)
2020-12-01 13:46           ` Will Deacon
2020-12-01 14:05             ` Marc Zyngier
2020-12-01 14:23               ` Will Deacon
2020-12-01 14:32                 ` Marc Zyngier
2020-12-01 14:11             ` wangyanan (Y)
2020-12-01 14:35               ` Marc Zyngier
2020-12-01 17:20                 ` wangyanan (Y)
2020-12-01 18:17                   ` Will Deacon
2020-11-30 12:18 ` [RFC PATCH 3/3] KVM: arm64: Add usage of stage 2 fault lookup level in user_mem_abort() Yanan Wang
2020-11-30 13:49   ` Will Deacon
2020-12-01  6:04     ` wangyanan (Y)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).