* [RFC PATCH 0/4] KVM: arm64: Improve efficiency of stage2 page table
@ 2021-02-08 11:22 ` Yanan Wang
From: Yanan Wang @ 2021-02-08 11:22 UTC (permalink / raw)
  To: Marc Zyngier, Will Deacon, Catalin Marinas, James Morse,
	Julien Thierry, Suzuki K Poulose, Gavin Shan, Quentin Perret,
	kvmarm, linux-arm-kernel, kvm, linux-kernel
  Cc: wanghaibin.wang, zhukeqian1, yuzenghui, Yanan Wang

Hi,

This series makes some efficiency improvements to the stage2 page table
code. Test results showing the performance changes are included below;
they were collected with a KVM selftest [1] that I have posted:
[1] https://lore.kernel.org/lkml/20210208090841.333724-1-wangyanan55@huawei.com/

About patch 1:
We currently uniformly clean the dcache in user_mem_abort() before calling
the fault handlers, if we take a translation fault and the pfn is cacheable.
But if there are concurrent translation faults on the same page or block,
only the first dcache clean is necessary; the others are not.

By moving the dcache clean into the map handler, we can easily identify the
conditions where CMOs are really needed and avoid the unnecessary ones.
Since performing CMOs is a time-consuming process, especially when flushing
a block range, this reduces KVM's load considerably and improves the
efficiency of creating mappings.
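
In essence (a condensed sketch based on the hunks in patch 1 below; all the
helpers shown are the ones the patch adds to or already uses in pgtable.c),
the CMO is now performed only when a cacheable leaf PTE is actually about
to be installed:

static void stage2_flush_dcache(void *addr, u64 size)
{
	/* With FWB the guest always uses cacheable attributes, no CMO needed */
	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
		return;

	__flush_dcache_area(addr, size);
}

	/* In stage2_map_walker_try_leaf(), right before installing the PTE */
	if (stage2_pte_cacheable(new))
		stage2_flush_dcache(__va(phys), granule);

	smp_store_release(ptep, new);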

Test results:
(1) when 20 vCPUs concurrently access 20G RAM (all 1G hugepages):
KVM create block mappings time: 52.83s -> 3.70s
KVM recover block mappings time (after dirty logging): 52.0s -> 2.87s

(2) when 40 vCPUs concurrently access 20G RAM (all 1G hugepages):
KVM create block mappings time: 104.56s -> 3.70s
KVM recover block mappings time (after dirty logging): 103.93s -> 2.96s

About patch 2, 3:
When KVM needs to coalesce the normal page mappings into a block mapping,
we currently invalidate the old table entry first followed by invalidation
of TLB, then unmap the page mappings, and install the block entry at last.

It will cost a lot of time to unmap the numerous page mappings, which means
the table entry will be left invalid for a long time before installation of
the block entry, and this will cause many spurious translation faults.

So let's quickly install the block entry at first to ensure uninterrupted
memory access of the other vCPUs, and then unmap the page mappings after
installation. This will reduce most of the time when the table entry is
invalid, and avoid most of the unnecessary translation faults.
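
A condensed sketch of the reworked flow, taken from the diffs in patches 2
and 3 below (stage2_coalesce_tables_into_block() is the new helper added
there):

static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
					      kvm_pte_t *ptep,
					      struct stage2_map_data *data)
{
	u64 granule = kvm_granule_size(level), phys = data->phys;
	kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);

	kvm_set_invalid_pte(ptep);

	/* Invalidate the whole stage-2 rather than each leaf individually */
	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
	smp_store_release(ptep, new);
	data->phys += granule;
}

The table-pre walker records the old table pointer in data->follow and then
calls the helper above, so the block entry becomes visible immediately; the
sub-level page-table pages are only freed later, in
stage2_map_walk_table_post().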

Test results based on patch 1:
(1) when 20 vCPUs concurrently access 20G RAM (all 1G hugepages):
KVM recover block mappings time (after dirty logging): 2.87s -> 0.30s

(2) when 40 vCPUs concurrently access 20G RAM (all 1G hugepages):
KVM recover block mappings time (after dirty logging): 2.96s -> 0.35s

So combined with patch 1, this makes a big difference to how fast KVM
creates mappings and recovers block mappings, with not much code change.

About patch 4:
A new method of distinguishing the cases of memcache allocations is
introduced. By comparing fault_granule and vma_pagesize, the cases that
require allocations from the memcache and the cases that don't can be
distinguished completely.
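
The core of the change, condensed from the diff of patch 4 below (the
comment is paraphrased):

	/*
	 * Memcache pages are needed only when new page tables may be
	 * created, i.e. when the granule of the faulting lookup level
	 * exceeds the size of the mapping about to be installed.
	 */
	if (fault_granule > vma_pagesize) {
		ret = kvm_mmu_topup_memory_cache(memcache,
						 kvm_mmu_cache_min_pages(kvm));
		if (ret)
			return ret;
	}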

---

Details of test results
platform: HiSilicon Kunpeng920 (FWB not supported)
host kernel: Linux mainline (v5.11-rc6)

(1) performance change of patch 1
cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
	   (20 vcpus, 20G memory, block mappings (granule 1G))
Before patch: KVM_CREATE_MAPPINGS: 52.8338s 52.8327s 52.8336s 52.8255s 52.8303s
After  patch: KVM_CREATE_MAPPINGS:  3.7022s  3.7031s  3.7028s  3.7012s  3.7024s

Before patch: KVM_ADJUST_MAPPINGS: 52.0466s 52.0473s 52.0550s 52.0518s 52.0467s
After  patch: KVM_ADJUST_MAPPINGS:  2.8787s  2.8781s  2.8785s  2.8742s  2.8759s

cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
	   (40 vcpus, 20G memory, block mappings (granule 1G))
Before patch: KVM_CREATE_MAPPINGS: 104.560s 104.556s 104.554s 104.556s 104.550s
After  patch: KVM_CREATE_MAPPINGS:  3.7011s  3.7103s  3.7005s  3.7024s  3.7106s

Before patch: KVM_ADJUST_MAPPINGS: 103.931s 103.936s 103.927s 103.942s 103.927s
After  patch: KVM_ADJUST_MAPPINGS:  2.9621s  2.9648s  2.9474s  2.9587s  2.9603s

(2) performance change of patches 2, 3 (based on patch 1)
cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 1
	   (1 vcpu, 20G memory, block mappings (granule 1G))
Before patch: KVM_ADJUST_MAPPINGS: 2.8241s 2.8234s 2.8245s 2.8230s 2.8652s
After  patch: KVM_ADJUST_MAPPINGS: 0.2444s 0.2442s 0.2423s 0.2441s 0.2429s

cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
	   (20 vcpus, 20G memory, block mappings (granule 1G))
Before patch: KVM_ADJUST_MAPPINGS: 2.8787s 2.8781s 2.8785s 2.8742s 2.8759s
After  patch: KVM_ADJUST_MAPPINGS: 0.3008s 0.3004s 0.2974s 0.2917s 0.2900s

cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
	   (40 vcpus, 20G memory, block mappings (granule 1G))
Before patch: KVM_ADJUST_MAPPINGS: 2.9621s 2.9648s 2.9474s 2.9587s 2.9603s
After  patch: KVM_ADJUST_MAPPINGS: 0.3541s 0.3694s 0.3656s 0.3693s 0.3687s

---

Yanan Wang (4):
  KVM: arm64: Move the clean of dcache to the map handler
  KVM: arm64: Add an independent API for coalescing tables
  KVM: arm64: Install the block entry before unmapping the page mappings
  KVM: arm64: Distinguish cases of memcache allocations completely

 arch/arm64/include/asm/kvm_mmu.h | 16 -------
 arch/arm64/kvm/hyp/pgtable.c     | 82 +++++++++++++++++++++-----------
 arch/arm64/kvm/mmu.c             | 39 ++++++---------
 3 files changed, 69 insertions(+), 68 deletions(-)

-- 
2.23.0


* [RFC PATCH 1/4] KVM: arm64: Move the clean of dcache to the map handler
@ 2021-02-08 11:22   ` Yanan Wang
From: Yanan Wang @ 2021-02-08 11:22 UTC (permalink / raw)
  To: Marc Zyngier, Will Deacon, Catalin Marinas, James Morse,
	Julien Thierry, Suzuki K Poulose, Gavin Shan, Quentin Perret,
	kvmarm, linux-arm-kernel, kvm, linux-kernel
  Cc: wanghaibin.wang, zhukeqian1, yuzenghui, Yanan Wang

We currently uniformly clean the dcache in user_mem_abort() before calling
the fault handlers, if we take a translation fault and the pfn is cacheable.
But if there are concurrent translation faults on the same page or block,
only the first dcache clean is necessary; the others are not.

By moving the dcache clean into the map handler, we can easily identify the
conditions where CMOs are really needed and avoid the unnecessary ones.
Since performing CMOs is a time-consuming process, especially when flushing
a block range, this reduces KVM's load considerably and improves the
efficiency of creating mappings.

Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
---
 arch/arm64/include/asm/kvm_mmu.h | 16 --------------
 arch/arm64/kvm/hyp/pgtable.c     | 38 ++++++++++++++++++++------------
 arch/arm64/kvm/mmu.c             | 14 +++---------
 3 files changed, 27 insertions(+), 41 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index e52d82aeadca..4ec9879e82ed 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -204,22 +204,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
 	return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
 }
 
-static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
-{
-	void *va = page_address(pfn_to_page(pfn));
-
-	/*
-	 * With FWB, we ensure that the guest always accesses memory using
-	 * cacheable attributes, and we don't have to clean to PoC when
-	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
-	 * PoU is not required either in this case.
-	 */
-	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
-		return;
-
-	kvm_flush_dcache_to_poc(va, size);
-}
-
 static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
 						  unsigned long size)
 {
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 4d177ce1d536..2f4f87021980 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -464,6 +464,26 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
 	return 0;
 }
 
+static bool stage2_pte_cacheable(kvm_pte_t pte)
+{
+	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
+	return memattr == PAGE_S2_MEMATTR(NORMAL);
+}
+
+static void stage2_flush_dcache(void *addr, u64 size)
+{
+	/*
+	 * With FWB, we ensure that the guest always accesses memory using
+	 * cacheable attributes, and we don't have to clean to PoC when
+	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
+	 * PoU is not required either in this case.
+	 */
+	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
+		return;
+
+	__flush_dcache_area(addr, size);
+}
+
 static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 				      kvm_pte_t *ptep,
 				      struct stage2_map_data *data)
@@ -495,6 +515,10 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 		put_page(page);
 	}
 
+	/* Flush data cache before installation of the new PTE */
+	if (stage2_pte_cacheable(new))
+		stage2_flush_dcache(__va(phys), granule);
+
 	smp_store_release(ptep, new);
 	get_page(page);
 	data->phys += granule;
@@ -651,20 +675,6 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
 	return ret;
 }
 
-static void stage2_flush_dcache(void *addr, u64 size)
-{
-	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
-		return;
-
-	__flush_dcache_area(addr, size);
-}
-
-static bool stage2_pte_cacheable(kvm_pte_t pte)
-{
-	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
-	return memattr == PAGE_S2_MEMATTR(NORMAL);
-}
-
 static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 			       enum kvm_pgtable_walk_flags flag,
 			       void * const arg)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 77cb2d28f2a4..d151927a7d62 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -609,11 +609,6 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
-static void clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
-{
-	__clean_dcache_guest_page(pfn, size);
-}
-
 static void invalidate_icache_guest_page(kvm_pfn_t pfn, unsigned long size)
 {
 	__invalidate_icache_guest_page(pfn, size);
@@ -882,9 +877,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (writable)
 		prot |= KVM_PGTABLE_PROT_W;
 
-	if (fault_status != FSC_PERM && !device)
-		clean_dcache_guest_page(pfn, vma_pagesize);
-
 	if (exec_fault) {
 		prot |= KVM_PGTABLE_PROT_X;
 		invalidate_icache_guest_page(pfn, vma_pagesize);
@@ -1144,10 +1136,10 @@ int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
 	trace_kvm_set_spte_hva(hva);
 
 	/*
-	 * We've moved a page around, probably through CoW, so let's treat it
-	 * just like a translation fault and clean the cache to the PoC.
+	 * We've moved a page around, probably through CoW, so let's treat
+	 * it just like a translation fault and the map handler will clean
+	 * the cache to the PoC.
 	 */
-	clean_dcache_guest_page(pfn, PAGE_SIZE);
 	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pfn);
 	return 0;
 }
-- 
2.23.0


* [RFC PATCH 2/4] KVM: arm64: Add an independent API for coalescing tables
@ 2021-02-08 11:22   ` Yanan Wang
From: Yanan Wang @ 2021-02-08 11:22 UTC (permalink / raw)
  To: Marc Zyngier, Will Deacon, Catalin Marinas, James Morse,
	Julien Thierry, Suzuki K Poulose, Gavin Shan, Quentin Perret,
	kvmarm, linux-arm-kernel, kvm, linux-kernel
  Cc: wanghaibin.wang, zhukeqian1, yuzenghui, Yanan Wang

The process of coalescing page mappings back into a block mapping differs
from the normal map path in areas such as TLB invalidation and CMOs, so
add an independent API for this case.

Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 2f4f87021980..78a560446f80 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -525,6 +525,24 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 	return 0;
 }
 
+static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
+					      kvm_pte_t *ptep,
+					      struct stage2_map_data *data)
+{
+	u64 granule = kvm_granule_size(level), phys = data->phys;
+	kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);
+
+	kvm_set_invalid_pte(ptep);
+
+	/*
+	 * Invalidate the whole stage-2, as we may have numerous leaf entries
+	 * below us which would otherwise need invalidating individually.
+	 */
+	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
+	smp_store_release(ptep, new);
+	data->phys += granule;
+}
+
 static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 				     kvm_pte_t *ptep,
 				     struct stage2_map_data *data)
-- 
2.23.0


* [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings
@ 2021-02-08 11:22   ` Yanan Wang
From: Yanan Wang @ 2021-02-08 11:22 UTC (permalink / raw)
  To: Marc Zyngier, Will Deacon, Catalin Marinas, James Morse,
	Julien Thierry, Suzuki K Poulose, Gavin Shan, Quentin Perret,
	kvmarm, linux-arm-kernel, kvm, linux-kernel
  Cc: wanghaibin.wang, zhukeqian1, yuzenghui, Yanan Wang

When KVM needs to coalesce normal page mappings into a block mapping, we
currently invalidate the old table entry first, then invalidate the TLB,
then unmap the page mappings, and finally install the block entry.

Unmapping the numerous page mappings takes a long time, which means there
is a long period during which the table entry can be found invalid. If
other vCPUs access any guest page within the block range and find the
table entry invalid, they will all exit from the guest with an unnecessary
translation fault, and KVM will have to spend effort handling these faults,
especially when performing CMOs over the block range.

So let's install the block entry first to ensure uninterrupted memory
access for the other vCPUs, and then unmap the page mappings after the
installation. This removes most of the window in which the table entry is
invalid and avoids most of the unnecessary translation faults.

Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 78a560446f80..308c36b9cd21 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -434,6 +434,7 @@ struct stage2_map_data {
 	kvm_pte_t			attr;
 
 	kvm_pte_t			*anchor;
+	kvm_pte_t			*follow;
 
 	struct kvm_s2_mmu		*mmu;
 	struct kvm_mmu_memory_cache	*memcache;
@@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 	if (!kvm_block_mapping_supported(addr, end, data->phys, level))
 		return 0;
 
-	kvm_set_invalid_pte(ptep);
-
 	/*
-	 * Invalidate the whole stage-2, as we may have numerous leaf
-	 * entries below us which would otherwise need invalidating
-	 * individually.
+	 * If we need to coalesce existing table entries into a block here,
+	 * then install the block entry first and the sub-level page mappings
+	 * will be unmapped later.
 	 */
-	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
 	data->anchor = ptep;
+	data->follow = kvm_pte_follow(*ptep);
+	stage2_coalesce_tables_into_block(addr, level, ptep, data);
 	return 0;
 }
 
@@ -614,20 +614,18 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
 				      kvm_pte_t *ptep,
 				      struct stage2_map_data *data)
 {
-	int ret = 0;
-
 	if (!data->anchor)
 		return 0;
 
-	free_page((unsigned long)kvm_pte_follow(*ptep));
-	put_page(virt_to_page(ptep));
-
-	if (data->anchor == ptep) {
+	if (data->anchor != ptep) {
+		free_page((unsigned long)kvm_pte_follow(*ptep));
+		put_page(virt_to_page(ptep));
+	} else {
+		free_page((unsigned long)data->follow);
 		data->anchor = NULL;
-		ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
 	}
 
-	return ret;
+	return 0;
 }
 
 /*
-- 
2.23.0


* [RFC PATCH 4/4] KVM: arm64: Distinguish cases of memcache allocations completely
@ 2021-02-08 11:22   ` Yanan Wang
From: Yanan Wang @ 2021-02-08 11:22 UTC (permalink / raw)
  To: Marc Zyngier, Will Deacon, Catalin Marinas, James Morse,
	Julien Thierry, Suzuki K Poulose, Gavin Shan, Quentin Perret,
	kvmarm, linux-arm-kernel, kvm, linux-kernel
  Cc: wanghaibin.wang, zhukeqian1, yuzenghui, Yanan Wang

On a guest translation fault, the memcache pages are not needed if KVM is
only about to install a new leaf entry into the existing page table. And
on a guest permission fault, the memcache pages are also not needed for a
write fault during dirty logging if KVM is only about to update the
existing leaf entry instead of collapsing a block entry into a table.

By comparing fault_granule and vma_pagesize, the cases that require
allocations from the memcache and the cases that don't can be distinguished
completely.

Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
---
 arch/arm64/kvm/mmu.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d151927a7d62..550498a9104e 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -815,19 +815,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	gfn = fault_ipa >> PAGE_SHIFT;
 	mmap_read_unlock(current->mm);
 
-	/*
-	 * Permission faults just need to update the existing leaf entry,
-	 * and so normally don't require allocations from the memcache. The
-	 * only exception to this is when dirty logging is enabled at runtime
-	 * and a write fault needs to collapse a block entry into a table.
-	 */
-	if (fault_status != FSC_PERM || (logging_active && write_fault)) {
-		ret = kvm_mmu_topup_memory_cache(memcache,
-						 kvm_mmu_cache_min_pages(kvm));
-		if (ret)
-			return ret;
-	}
-
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	/*
 	 * Ensure the read of mmu_notifier_seq happens before we call
@@ -887,6 +874,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
 		prot |= KVM_PGTABLE_PROT_X;
 
+	/*
+	 * Allocations from the memcache are required only when granule of the
+	 * lookup level where the guest fault happened exceeds vma_pagesize,
+	 * which means new page tables will be created in the fault handlers.
+	 */
+	if (fault_granule > vma_pagesize) {
+		ret = kvm_mmu_topup_memory_cache(memcache,
+						 kvm_mmu_cache_min_pages(kvm));
+		if (ret)
+			return ret;
+	}
+
 	/*
 	 * Under the premise of getting a FSC_PERM fault, we just need to relax
 	 * permissions only if vma_pagesize equals fault_granule. Otherwise,
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [RFC PATCH 4/4] KVM: arm64: Distinguish cases of memcache allocations completely
@ 2021-02-08 11:22   ` Yanan Wang
  0 siblings, 0 replies; 80+ messages in thread
From: Yanan Wang @ 2021-02-08 11:22 UTC (permalink / raw)
  To: Marc Zyngier, Will Deacon, Catalin Marinas, James Morse,
	Julien Thierry, Suzuki K Poulose, Gavin Shan, Quentin Perret,
	kvmarm, linux-arm-kernel, kvm, linux-kernel

With a guest translation fault, the memcache pages are not needed if KVM
is only about to install a new leaf entry into the existing page table.
And with a guest permission fault, the memcache pages are also not needed
for a write_fault in dirty-logging time if KVM is only about to update
the existing leaf entry instead of collapsing a block entry into a table.

By comparing fault_granule and vma_pagesize, cases that require allocations
from memcache and cases that don't can be distinguished completely.

Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
---
 arch/arm64/kvm/mmu.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d151927a7d62..550498a9104e 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -815,19 +815,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	gfn = fault_ipa >> PAGE_SHIFT;
 	mmap_read_unlock(current->mm);
 
-	/*
-	 * Permission faults just need to update the existing leaf entry,
-	 * and so normally don't require allocations from the memcache. The
-	 * only exception to this is when dirty logging is enabled at runtime
-	 * and a write fault needs to collapse a block entry into a table.
-	 */
-	if (fault_status != FSC_PERM || (logging_active && write_fault)) {
-		ret = kvm_mmu_topup_memory_cache(memcache,
-						 kvm_mmu_cache_min_pages(kvm));
-		if (ret)
-			return ret;
-	}
-
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	/*
 	 * Ensure the read of mmu_notifier_seq happens before we call
@@ -887,6 +874,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
 		prot |= KVM_PGTABLE_PROT_X;
 
+	/*
+	 * Allocations from the memcache are required only when granule of the
+	 * lookup level where the guest fault happened exceeds vma_pagesize,
+	 * which means new page tables will be created in the fault handlers.
+	 */
+	if (fault_granule > vma_pagesize) {
+		ret = kvm_mmu_topup_memory_cache(memcache,
+						 kvm_mmu_cache_min_pages(kvm));
+		if (ret)
+			return ret;
+	}
+
 	/*
 	 * Under the premise of getting a FSC_PERM fault, we just need to relax
 	 * permissions only if vma_pagesize equals fault_granule. Otherwise,
-- 
2.23.0

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [RFC PATCH 4/4] KVM: arm64: Distinguish cases of memcache allocations completely
@ 2021-02-08 11:22   ` Yanan Wang
  0 siblings, 0 replies; 80+ messages in thread
From: Yanan Wang @ 2021-02-08 11:22 UTC (permalink / raw)
  To: Marc Zyngier, Will Deacon, Catalin Marinas, James Morse,
	Julien Thierry, Suzuki K Poulose, Gavin Shan, Quentin Perret,
	kvmarm, linux-arm-kernel, kvm, linux-kernel
  Cc: yuzenghui, wanghaibin.wang, Yanan Wang, zhukeqian1

With a guest translation fault, the memcache pages are not needed if KVM
is only about to install a new leaf entry into the existing page table.
With a guest permission fault, the memcache pages are likewise not needed
for a write fault taken during dirty logging, if KVM is only about to
update the existing leaf entry instead of collapsing a block entry into
a table.

By comparing fault_granule with vma_pagesize, the cases that require
allocations from the memcache and the cases that don't can be
distinguished completely.
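
To make the comparison concrete, the intended check can be sketched as
below. This is only an illustration of the rule described above, with a
made-up helper name; the real check is the hunk added to user_mem_abort()
in the diff that follows.

/*
 * Rough sketch of the case split (hypothetical helper, not part of the
 * patch):
 *
 * - a translation fault that only installs a new leaf entry into the
 *   existing page table, or a permission fault that only updates an
 *   existing leaf entry: fault_granule == vma_pagesize, so no memcache
 *   pages are needed;
 * - a fault for which new page table levels must be created, e.g. a
 *   write fault during dirty logging that collapses a block entry into
 *   a table of page mappings: fault_granule > vma_pagesize, so the
 *   memcache must be topped up first.
 */
static bool need_memcache_topup(unsigned long fault_granule,
				unsigned long vma_pagesize)
{
	return fault_granule > vma_pagesize;
}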

Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
---
 arch/arm64/kvm/mmu.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d151927a7d62..550498a9104e 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -815,19 +815,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	gfn = fault_ipa >> PAGE_SHIFT;
 	mmap_read_unlock(current->mm);
 
-	/*
-	 * Permission faults just need to update the existing leaf entry,
-	 * and so normally don't require allocations from the memcache. The
-	 * only exception to this is when dirty logging is enabled at runtime
-	 * and a write fault needs to collapse a block entry into a table.
-	 */
-	if (fault_status != FSC_PERM || (logging_active && write_fault)) {
-		ret = kvm_mmu_topup_memory_cache(memcache,
-						 kvm_mmu_cache_min_pages(kvm));
-		if (ret)
-			return ret;
-	}
-
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	/*
 	 * Ensure the read of mmu_notifier_seq happens before we call
@@ -887,6 +874,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
 		prot |= KVM_PGTABLE_PROT_X;
 
+	/*
+	 * Allocations from the memcache are required only when the granule of
+	 * the lookup level where the guest fault happened exceeds vma_pagesize,
+	 * which means new page tables will be created in the fault handlers.
+	 */
+	if (fault_granule > vma_pagesize) {
+		ret = kvm_mmu_topup_memory_cache(memcache,
+						 kvm_mmu_cache_min_pages(kvm));
+		if (ret)
+			return ret;
+	}
+
 	/*
 	 * Under the premise of getting a FSC_PERM fault, we just need to relax
 	 * permissions only if vma_pagesize equals fault_granule. Otherwise,
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 0/4] KVM: arm64: Improve efficiency of stage2 page table
  2021-02-08 11:22 ` Yanan Wang
  (?)
@ 2021-02-23 15:55   ` Alexandru Elisei
  -1 siblings, 0 replies; 80+ messages in thread
From: Alexandru Elisei @ 2021-02-23 15:55 UTC (permalink / raw)
  To: Yanan Wang, Marc Zyngier, Will Deacon, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, kvmarm, linux-arm-kernel, kvm, linux-kernel

Hi Yanan,

I wanted to review the patches, but unfortunately I get an error when trying to
apply the first patch in the series:

Applying: KVM: arm64: Move the clean of dcache to the map handler
error: patch failed: arch/arm64/kvm/hyp/pgtable.c:464
error: arch/arm64/kvm/hyp/pgtable.c: patch does not apply
error: patch failed: arch/arm64/kvm/mmu.c:882
error: arch/arm64/kvm/mmu.c: patch does not apply
Patch failed at 0001 KVM: arm64: Move the clean of dcache to the map handler
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

I tried this with Linux tags v5.11-rc1 to v5.11-rc7. It looks like the pgtable.c and
mmu.c in your patch are different from what is found on upstream master. Did you
use another branch as the base for your patches?

Thanks,

Alex

On 2/8/21 11:22 AM, Yanan Wang wrote:
> Hi,
>
> This series makes some efficiency improvement of stage2 page table code,
> and there are some test results to present the performance changes, which
> were tested by a kvm selftest [1] that I have post:
> [1] https://lore.kernel.org/lkml/20210208090841.333724-1-wangyanan55@huawei.com/ 
>
> About patch 1:
> We currently uniformly clean dcache in user_mem_abort() before calling the
> fault handlers, if we take a translation fault and the pfn is cacheable.
> But if there are concurrent translation faults on the same page or block,
> clean of dcache for the first time is necessary while the others are not.
>
> By moving clean of dcache to the map handler, we can easily identify the
> conditions where CMOs are really needed and avoid the unnecessary ones.
> As it's a time consuming process to perform CMOs especially when flushing
> a block range, so this solution reduces much load of kvm and improve the
> efficiency of creating mappings.
>
> Test results:
> (1) when 20 vCPUs concurrently access 20G ram (all 1G hugepages):
> KVM create block mappings time: 52.83s -> 3.70s
> KVM recover block mappings time(after dirty-logging): 52.0s -> 2.87s
>
> (2) when 40 vCPUs concurrently access 20G ram (all 1G hugepages):
> KVM creating block mappings time: 104.56s -> 3.70s
> KVM recover block mappings time(after dirty-logging): 103.93s -> 2.96s
>
> About patch 2, 3:
> When KVM needs to coalesce the normal page mappings into a block mapping,
> we currently invalidate the old table entry first followed by invalidation
> of TLB, then unmap the page mappings, and install the block entry at last.
>
> It will cost a lot of time to unmap the numerous page mappings, which means
> the table entry will be left invalid for a long time before installation of
> the block entry, and this will cause many spurious translation faults.
>
> So let's quickly install the block entry at first to ensure uninterrupted
> memory access of the other vCPUs, and then unmap the page mappings after
> installation. This will reduce most of the time when the table entry is
> invalid, and avoid most of the unnecessary translation faults.
>
> Test results based on patch 1:
> (1) when 20 vCPUs concurrently access 20G ram (all 1G hugepages):
> KVM recover block mappings time(after dirty-logging): 2.87s -> 0.30s
>
> (2) when 40 vCPUs concurrently access 20G ram (all 1G hugepages):
> KVM recover block mappings time(after dirty-logging): 2.96s -> 0.35s
>
> So combined with patch 1, it makes a big difference of KVM creating mappings
> and recovering block mappings with not much code change.
>
> About patch 4:
> A new method to distinguish cases of memcache allocations is introduced.
> By comparing fault_granule and vma_pagesize, cases that require allocations
> from memcache and cases that don't can be distinguished completely.
>
> ---
>
> Details of test results
> platform: HiSilicon Kunpeng920 (FWB not supported)
> host kernel: Linux mainline (v5.11-rc6)
>
> (1) performance change of patch 1
> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
> 	   (20 vcpus, 20G memory, block mappings(granule 1G))
> Before patch: KVM_CREATE_MAPPINGS: 52.8338s 52.8327s 52.8336s 52.8255s 52.8303s
> After  patch: KVM_CREATE_MAPPINGS:  3.7022s  3.7031s  3.7028s  3.7012s  3.7024s
>
> Before patch: KVM_ADJUST_MAPPINGS: 52.0466s 52.0473s 52.0550s 52.0518s 52.0467s
> After  patch: KVM_ADJUST_MAPPINGS:  2.8787s  2.8781s  2.8785s  2.8742s  2.8759s
>
> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
> 	   (40 vcpus, 20G memory, block mappings(granule 1G))
> Before patch: KVM_CREATE_MAPPINGS: 104.560s 104.556s 104.554s 104.556s 104.550s
> After  patch: KVM_CREATE_MAPPINGS:  3.7011s  3.7103s  3.7005s  3.7024s  3.7106s
>
> Before patch: KVM_ADJUST_MAPPINGS: 103.931s 103.936s 103.927s 103.942s 103.927s
> After  patch: KVM_ADJUST_MAPPINGS:  2.9621s  2.9648s  2.9474s  2.9587s  2.9603s
>
> (2) performance change of patch 2, 3(based on patch 1)
> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 1
> 	   (1 vcpu, 20G memory, block mappings(granule 1G))
> Before patch: KVM_ADJUST_MAPPINGS: 2.8241s 2.8234s 2.8245s 2.8230s 2.8652s
> After  patch: KVM_ADJUST_MAPPINGS: 0.2444s 0.2442s 0.2423s 0.2441s 0.2429s
>
> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
> 	   (20 vcpus, 20G memory, block mappings(granule 1G))
> Before patch: KVM_ADJUST_MAPPINGS: 2.8787s 2.8781s 2.8785s 2.8742s 2.8759s
> After  patch: KVM_ADJUST_MAPPINGS: 0.3008s 0.3004s 0.2974s 0.2917s 0.2900s
>
> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
> 	   (40 vcpus, 20G memory, block mappings(granule 1G))
> Before patch: KVM_ADJUST_MAPPINGS: 2.9621s 2.9648s 2.9474s 2.9587s 2.9603s
> After  patch: KVM_ADJUST_MAPPINGS: 0.3541s 0.3694s 0.3656s 0.3693s 0.3687s
>
> ---
>
> Yanan Wang (4):
>   KVM: arm64: Move the clean of dcache to the map handler
>   KVM: arm64: Add an independent API for coalescing tables
>   KVM: arm64: Install the block entry before unmapping the page mappings
>   KVM: arm64: Distinguish cases of memcache allocations completely
>
>  arch/arm64/include/asm/kvm_mmu.h | 16 -------
>  arch/arm64/kvm/hyp/pgtable.c     | 82 +++++++++++++++++++++-----------
>  arch/arm64/kvm/mmu.c             | 39 ++++++---------
>  3 files changed, 69 insertions(+), 68 deletions(-)
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 0/4] KVM: arm64: Improve efficiency of stage2 page table
  2021-02-23 15:55   ` Alexandru Elisei
  (?)
@ 2021-02-24  2:35     ` wangyanan (Y)
  -1 siblings, 0 replies; 80+ messages in thread
From: wangyanan (Y) @ 2021-02-24  2:35 UTC (permalink / raw)
  To: Alexandru Elisei, Marc Zyngier, Will Deacon, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, kvmarm, linux-arm-kernel, kvm, linux-kernel

Hi Alex,

On 2021/2/23 23:55, Alexandru Elisei wrote:
> Hi Yanan,
>
> I wanted to review the patches, but unfortunately I get an error when trying to
> apply the first patch in the series:
>
> Applying: KVM: arm64: Move the clean of dcache to the map handler
> error: patch failed: arch/arm64/kvm/hyp/pgtable.c:464
> error: arch/arm64/kvm/hyp/pgtable.c: patch does not apply
> error: patch failed: arch/arm64/kvm/mmu.c:882
> error: arch/arm64/kvm/mmu.c: patch does not apply
> Patch failed at 0001 KVM: arm64: Move the clean of dcache to the map handler
> hint: Use 'git am --show-current-patch=diff' to see the failed patch
> When you have resolved this problem, run "git am --continue".
> If you prefer to skip this patch, run "git am --skip" instead.
> To restore the original branch and stop patching, run "git am --abort".
>
> Tried this with Linux tags v5.11-rc1 to v5.11-rc7. It looks like pgtable.c and
> mmu.c from your patch is different than what is found on upstream master. Did you
> use another branch as the base for your patches?
Thanks for your attention.
Indeed, this series was more or less based on the patches I posted before
(Link: https://lore.kernel.org/r/20210114121350.123684-4-wangyanan55@huawei.com).
They have already been merged into the up-to-date upstream master
(commit: 509552e65ae8287178a5cdea2d734dcd2d6380ab), but not into tags
v5.11-rc1 to v5.11-rc7.
Could you please try the newest upstream master (since commit
509552e65ae8287178a5cdea2d734dcd2d6380ab)? I have tested on my local
machine and no apply errors occur.

Thanks,

Yanan.

> Thanks,
>
> Alex
>
> On 2/8/21 11:22 AM, Yanan Wang wrote:
>> Hi,
>>
>> This series makes some efficiency improvement of stage2 page table code,
>> and there are some test results to present the performance changes, which
>> were tested by a kvm selftest [1] that I have post:
>> [1] https://lore.kernel.org/lkml/20210208090841.333724-1-wangyanan55@huawei.com/
>>
>> About patch 1:
>> We currently uniformly clean dcache in user_mem_abort() before calling the
>> fault handlers, if we take a translation fault and the pfn is cacheable.
>> But if there are concurrent translation faults on the same page or block,
>> clean of dcache for the first time is necessary while the others are not.
>>
>> By moving clean of dcache to the map handler, we can easily identify the
>> conditions where CMOs are really needed and avoid the unnecessary ones.
>> As it's a time consuming process to perform CMOs especially when flushing
>> a block range, so this solution reduces much load of kvm and improve the
>> efficiency of creating mappings.
>>
>> Test results:
>> (1) when 20 vCPUs concurrently access 20G ram (all 1G hugepages):
>> KVM create block mappings time: 52.83s -> 3.70s
>> KVM recover block mappings time(after dirty-logging): 52.0s -> 2.87s
>>
>> (2) when 40 vCPUs concurrently access 20G ram (all 1G hugepages):
>> KVM creating block mappings time: 104.56s -> 3.70s
>> KVM recover block mappings time(after dirty-logging): 103.93s -> 2.96s
>>
>> About patch 2, 3:
>> When KVM needs to coalesce the normal page mappings into a block mapping,
>> we currently invalidate the old table entry first followed by invalidation
>> of TLB, then unmap the page mappings, and install the block entry at last.
>>
>> It will cost a lot of time to unmap the numerous page mappings, which means
>> the table entry will be left invalid for a long time before installation of
>> the block entry, and this will cause many spurious translation faults.
>>
>> So let's quickly install the block entry at first to ensure uninterrupted
>> memory access of the other vCPUs, and then unmap the page mappings after
>> installation. This will reduce most of the time when the table entry is
>> invalid, and avoid most of the unnecessary translation faults.
>>
>> Test results based on patch 1:
>> (1) when 20 vCPUs concurrently access 20G ram (all 1G hugepages):
>> KVM recover block mappings time(after dirty-logging): 2.87s -> 0.30s
>>
>> (2) when 40 vCPUs concurrently access 20G ram (all 1G hugepages):
>> KVM recover block mappings time(after dirty-logging): 2.96s -> 0.35s
>>
>> So combined with patch 1, it makes a big difference of KVM creating mappings
>> and recovering block mappings with not much code change.
>>
>> About patch 4:
>> A new method to distinguish cases of memcache allocations is introduced.
>> By comparing fault_granule and vma_pagesize, cases that require allocations
>> from memcache and cases that don't can be distinguished completely.
>>
>> ---
>>
>> Details of test results
>> platform: HiSilicon Kunpeng920 (FWB not supported)
>> host kernel: Linux mainline (v5.11-rc6)
>>
>> (1) performance change of patch 1
>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
>> 	   (20 vcpus, 20G memory, block mappings(granule 1G))
>> Before patch: KVM_CREATE_MAPPINGS: 52.8338s 52.8327s 52.8336s 52.8255s 52.8303s
>> After  patch: KVM_CREATE_MAPPINGS:  3.7022s  3.7031s  3.7028s  3.7012s  3.7024s
>>
>> Before patch: KVM_ADJUST_MAPPINGS: 52.0466s 52.0473s 52.0550s 52.0518s 52.0467s
>> After  patch: KVM_ADJUST_MAPPINGS:  2.8787s  2.8781s  2.8785s  2.8742s  2.8759s
>>
>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
>> 	   (40 vcpus, 20G memory, block mappings(granule 1G))
>> Before patch: KVM_CREATE_MAPPINGS: 104.560s 104.556s 104.554s 104.556s 104.550s
>> After  patch: KVM_CREATE_MAPPINGS:  3.7011s  3.7103s  3.7005s  3.7024s  3.7106s
>>
>> Before patch: KVM_ADJUST_MAPPINGS: 103.931s 103.936s 103.927s 103.942s 103.927s
>> After  patch: KVM_ADJUST_MAPPINGS:  2.9621s  2.9648s  2.9474s  2.9587s  2.9603s
>>
>> (2) performance change of patch 2, 3(based on patch 1)
>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 1
>> 	   (1 vcpu, 20G memory, block mappings(granule 1G))
>> Before patch: KVM_ADJUST_MAPPINGS: 2.8241s 2.8234s 2.8245s 2.8230s 2.8652s
>> After  patch: KVM_ADJUST_MAPPINGS: 0.2444s 0.2442s 0.2423s 0.2441s 0.2429s
>>
>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
>> 	   (20 vcpus, 20G memory, block mappings(granule 1G))
>> Before patch: KVM_ADJUST_MAPPINGS: 2.8787s 2.8781s 2.8785s 2.8742s 2.8759s
>> After  patch: KVM_ADJUST_MAPPINGS: 0.3008s 0.3004s 0.2974s 0.2917s 0.2900s
>>
>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
>> 	   (40 vcpus, 20G memory, block mappings(granule 1G))
>> Before patch: KVM_ADJUST_MAPPINGS: 2.9621s 2.9648s 2.9474s 2.9587s 2.9603s
>> After  patch: KVM_ADJUST_MAPPINGS: 0.3541s 0.3694s 0.3656s 0.3693s 0.3687s
>>
>> ---
>>
>> Yanan Wang (4):
>>    KVM: arm64: Move the clean of dcache to the map handler
>>    KVM: arm64: Add an independent API for coalescing tables
>>    KVM: arm64: Install the block entry before unmapping the page mappings
>>    KVM: arm64: Distinguish cases of memcache allocations completely
>>
>>   arch/arm64/include/asm/kvm_mmu.h | 16 -------
>>   arch/arm64/kvm/hyp/pgtable.c     | 82 +++++++++++++++++++++-----------
>>   arch/arm64/kvm/mmu.c             | 39 ++++++---------
>>   3 files changed, 69 insertions(+), 68 deletions(-)
>>
> .

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 0/4] KVM: arm64: Improve efficiency of stage2 page table
  2021-02-24  2:35     ` wangyanan (Y)
  (?)
@ 2021-02-24 17:20       ` Alexandru Elisei
  -1 siblings, 0 replies; 80+ messages in thread
From: Alexandru Elisei @ 2021-02-24 17:20 UTC (permalink / raw)
  To: wangyanan (Y),
	Marc Zyngier, Will Deacon, Catalin Marinas, James Morse,
	Julien Thierry, Suzuki K Poulose, Gavin Shan, Quentin Perret,
	kvmarm, linux-arm-kernel, kvm, linux-kernel

Hi,

On 2/24/21 2:35 AM, wangyanan (Y) wrote:

> Hi Alex,
>
> On 2021/2/23 23:55, Alexandru Elisei wrote:
>> Hi Yanan,
>>
>> I wanted to review the patches, but unfortunately I get an error when trying to
>> apply the first patch in the series:
>>
>> Applying: KVM: arm64: Move the clean of dcache to the map handler
>> error: patch failed: arch/arm64/kvm/hyp/pgtable.c:464
>> error: arch/arm64/kvm/hyp/pgtable.c: patch does not apply
>> error: patch failed: arch/arm64/kvm/mmu.c:882
>> error: arch/arm64/kvm/mmu.c: patch does not apply
>> Patch failed at 0001 KVM: arm64: Move the clean of dcache to the map handler
>> hint: Use 'git am --show-current-patch=diff' to see the failed patch
>> When you have resolved this problem, run "git am --continue".
>> If you prefer to skip this patch, run "git am --skip" instead.
>> To restore the original branch and stop patching, run "git am --abort".
>>
>> Tried this with Linux tags v5.11-rc1 to v5.11-rc7. It looks like pgtable.c and
>> mmu.c from your patch is different than what is found on upstream master. Did you
>> use another branch as the base for your patches?
> Thanks for your attention.
> Indeed, this series was  more or less based on the patches I post before (Link:
> https://lore.kernel.org/r/20210114121350.123684-4-wangyanan55@huawei.com).
> And they have already been merged into up-to-data upstream master (commit:
> 509552e65ae8287178a5cdea2d734dcd2d6380ab), but not into tags v5.11-rc1 to
> v5.11-rc7.
> Could you please try the newest upstream master(since commit:
> 509552e65ae8287178a5cdea2d734dcd2d6380ab) ? I have tested on my local and no
> apply errors occur.

That worked for me, thank you for the quick reply.

Just to double check, when you run the benchmarks, the before results are for a
kernel built from commit 509552e65ae8 ("KVM: arm64: Mark the page dirty only if
the fault is handled successfully"), and the after results are with this series on
top, right?

Thanks,

Alex

>
> Thanks,
>
> Yanan.
>
>> Thanks,
>>
>> Alex
>>
>> On 2/8/21 11:22 AM, Yanan Wang wrote:
>>> Hi,
>>>
>>> This series makes some efficiency improvement of stage2 page table code,
>>> and there are some test results to present the performance changes, which
>>> were tested by a kvm selftest [1] that I have post:
>>> [1] https://lore.kernel.org/lkml/20210208090841.333724-1-wangyanan55@huawei.com/
>>>
>>> About patch 1:
>>> We currently uniformly clean dcache in user_mem_abort() before calling the
>>> fault handlers, if we take a translation fault and the pfn is cacheable.
>>> But if there are concurrent translation faults on the same page or block,
>>> clean of dcache for the first time is necessary while the others are not.
>>>
>>> By moving clean of dcache to the map handler, we can easily identify the
>>> conditions where CMOs are really needed and avoid the unnecessary ones.
>>> As it's a time consuming process to perform CMOs especially when flushing
>>> a block range, so this solution reduces much load of kvm and improve the
>>> efficiency of creating mappings.
>>>
>>> Test results:
>>> (1) when 20 vCPUs concurrently access 20G ram (all 1G hugepages):
>>> KVM create block mappings time: 52.83s -> 3.70s
>>> KVM recover block mappings time(after dirty-logging): 52.0s -> 2.87s
>>>
>>> (2) when 40 vCPUs concurrently access 20G ram (all 1G hugepages):
>>> KVM creating block mappings time: 104.56s -> 3.70s
>>> KVM recover block mappings time(after dirty-logging): 103.93s -> 2.96s
>>>
>>> About patch 2, 3:
>>> When KVM needs to coalesce the normal page mappings into a block mapping,
>>> we currently invalidate the old table entry first followed by invalidation
>>> of TLB, then unmap the page mappings, and install the block entry at last.
>>>
>>> It will cost a lot of time to unmap the numerous page mappings, which means
>>> the table entry will be left invalid for a long time before installation of
>>> the block entry, and this will cause many spurious translation faults.
>>>
>>> So let's quickly install the block entry at first to ensure uninterrupted
>>> memory access of the other vCPUs, and then unmap the page mappings after
>>> installation. This will reduce most of the time when the table entry is
>>> invalid, and avoid most of the unnecessary translation faults.
>>>
>>> Test results based on patch 1:
>>> (1) when 20 vCPUs concurrently access 20G ram (all 1G hugepages):
>>> KVM recover block mappings time(after dirty-logging): 2.87s -> 0.30s
>>>
>>> (2) when 40 vCPUs concurrently access 20G ram (all 1G hugepages):
>>> KVM recover block mappings time(after dirty-logging): 2.96s -> 0.35s
>>>
>>> So combined with patch 1, it makes a big difference of KVM creating mappings
>>> and recovering block mappings with not much code change.
>>>
>>> About patch 4:
>>> A new method to distinguish cases of memcache allocations is introduced.
>>> By comparing fault_granule and vma_pagesize, cases that require allocations
>>> from memcache and cases that don't can be distinguished completely.
>>>
>>> ---
>>>
>>> Details of test results
>>> platform: HiSilicon Kunpeng920 (FWB not supported)
>>> host kernel: Linux mainline (v5.11-rc6)
>>>
>>> (1) performance change of patch 1
>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
>>>        (20 vcpus, 20G memory, block mappings(granule 1G))
>>> Before patch: KVM_CREATE_MAPPINGS: 52.8338s 52.8327s 52.8336s 52.8255s 52.8303s
>>> After  patch: KVM_CREATE_MAPPINGS:  3.7022s  3.7031s  3.7028s  3.7012s  3.7024s
>>>
>>> Before patch: KVM_ADJUST_MAPPINGS: 52.0466s 52.0473s 52.0550s 52.0518s 52.0467s
>>> After  patch: KVM_ADJUST_MAPPINGS:  2.8787s  2.8781s  2.8785s  2.8742s  2.8759s
>>>
>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
>>>        (40 vcpus, 20G memory, block mappings(granule 1G))
>>> Before patch: KVM_CREATE_MAPPINGS: 104.560s 104.556s 104.554s 104.556s 104.550s
>>> After  patch: KVM_CREATE_MAPPINGS:  3.7011s  3.7103s  3.7005s  3.7024s  3.7106s
>>>
>>> Before patch: KVM_ADJUST_MAPPINGS: 103.931s 103.936s 103.927s 103.942s 103.927s
>>> After  patch: KVM_ADJUST_MAPPINGS:  2.9621s  2.9648s  2.9474s  2.9587s  2.9603s
>>>
>>> (2) performance change of patch 2, 3(based on patch 1)
>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 1
>>>        (1 vcpu, 20G memory, block mappings(granule 1G))
>>> Before patch: KVM_ADJUST_MAPPINGS: 2.8241s 2.8234s 2.8245s 2.8230s 2.8652s
>>> After  patch: KVM_ADJUST_MAPPINGS: 0.2444s 0.2442s 0.2423s 0.2441s 0.2429s
>>>
>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
>>>        (20 vcpus, 20G memory, block mappings(granule 1G))
>>> Before patch: KVM_ADJUST_MAPPINGS: 2.8787s 2.8781s 2.8785s 2.8742s 2.8759s
>>> After  patch: KVM_ADJUST_MAPPINGS: 0.3008s 0.3004s 0.2974s 0.2917s 0.2900s
>>>
>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
>>>        (40 vcpus, 20G memory, block mappings(granule 1G))
>>> Before patch: KVM_ADJUST_MAPPINGS: 2.9621s 2.9648s 2.9474s 2.9587s 2.9603s
>>> After  patch: KVM_ADJUST_MAPPINGS: 0.3541s 0.3694s 0.3656s 0.3693s 0.3687s
>>>
>>> ---
>>>
>>> Yanan Wang (4):
>>>    KVM: arm64: Move the clean of dcache to the map handler
>>>    KVM: arm64: Add an independent API for coalescing tables
>>>    KVM: arm64: Install the block entry before unmapping the page mappings
>>>    KVM: arm64: Distinguish cases of memcache allocations completely
>>>
>>>   arch/arm64/include/asm/kvm_mmu.h | 16 -------
>>>   arch/arm64/kvm/hyp/pgtable.c     | 82 +++++++++++++++++++++-----------
>>>   arch/arm64/kvm/mmu.c             | 39 ++++++---------
>>>   3 files changed, 69 insertions(+), 68 deletions(-)
>>>
>> .

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 0/4] KVM: arm64: Improve efficiency of stage2 page table
@ 2021-02-24 17:20       ` Alexandru Elisei
  0 siblings, 0 replies; 80+ messages in thread
From: Alexandru Elisei @ 2021-02-24 17:20 UTC (permalink / raw)
  To: wangyanan (Y),
	Marc Zyngier, Will Deacon, Catalin Marinas, James Morse,
	Julien Thierry, Suzuki K Poulose, Gavin Shan, Quentin Perret,
	kvmarm, linux-arm-kernel, kvm, linux-kernel

Hi,

On 2/24/21 2:35 AM, wangyanan (Y) wrote:

> Hi Alex,
>
> On 2021/2/23 23:55, Alexandru Elisei wrote:
>> Hi Yanan,
>>
>> I wanted to review the patches, but unfortunately I get an error when trying to
>> apply the first patch in the series:
>>
>> Applying: KVM: arm64: Move the clean of dcache to the map handler
>> error: patch failed: arch/arm64/kvm/hyp/pgtable.c:464
>> error: arch/arm64/kvm/hyp/pgtable.c: patch does not apply
>> error: patch failed: arch/arm64/kvm/mmu.c:882
>> error: arch/arm64/kvm/mmu.c: patch does not apply
>> Patch failed at 0001 KVM: arm64: Move the clean of dcache to the map handler
>> hint: Use 'git am --show-current-patch=diff' to see the failed patch
>> When you have resolved this problem, run "git am --continue".
>> If you prefer to skip this patch, run "git am --skip" instead.
>> To restore the original branch and stop patching, run "git am --abort".
>>
>> Tried this with Linux tags v5.11-rc1 to v5.11-rc7. It looks like pgtable.c and
>> mmu.c from your patch is different than what is found on upstream master. Did you
>> use another branch as the base for your patches?
> Thanks for your attention.
> Indeed, this series was  more or less based on the patches I post before (Link:
> https://lore.kernel.org/r/20210114121350.123684-4-wangyanan55@huawei.com).
> And they have already been merged into up-to-data upstream master (commit:
> 509552e65ae8287178a5cdea2d734dcd2d6380ab), but not into tags v5.11-rc1 to
> v5.11-rc7.
> Could you please try the newest upstream master(since commit:
> 509552e65ae8287178a5cdea2d734dcd2d6380ab) ? I have tested on my local and no
> apply errors occur.

That worked for me, thank you for the quick reply.

Just to double check, when you run the benchmarks, the before results are for a
kernel built from commit 509552e65ae8 ("KVM: arm64: Mark the page dirty only if
the fault is handled successfully"), and the after results are with this series on
top, right?

Thanks,

Alex

>
> Thanks,
>
> Yanan.
>
>> Thanks,
>>
>> Alex
>>
>> On 2/8/21 11:22 AM, Yanan Wang wrote:
>>> Hi,
>>>
>>> This series makes some efficiency improvements to the stage2 page table code,
>>> and there are some test results to present the performance changes, which
>>> were measured with a kvm selftest [1] that I have posted:
>>> [1] https://lore.kernel.org/lkml/20210208090841.333724-1-wangyanan55@huawei.com/
>>>
>>> About patch 1:
>>> We currently uniformly clean dcache in user_mem_abort() before calling the
>>> fault handlers, if we take a translation fault and the pfn is cacheable.
>>> But if there are concurrent translation faults on the same page or block,
>>> only the first clean of dcache is necessary; the others are not.
>>>
>>> By moving clean of dcache to the map handler, we can easily identify the
>>> conditions where CMOs are really needed and avoid the unnecessary ones.
>>> Since it's a time-consuming process to perform CMOs, especially when flushing
>>> a block range, this solution reduces much of the load on KVM and improves the
>>> efficiency of creating mappings.
>>>
>>> Test results:
>>> (1) when 20 vCPUs concurrently access 20G ram (all 1G hugepages):
>>> KVM create block mappings time: 52.83s -> 3.70s
>>> KVM recover block mappings time(after dirty-logging): 52.0s -> 2.87s
>>>
>>> (2) when 40 vCPUs concurrently access 20G ram (all 1G hugepages):
>>> KVM creating block mappings time: 104.56s -> 3.70s
>>> KVM recover block mappings time(after dirty-logging): 103.93s -> 2.96s
>>>
>>> About patch 2, 3:
>>> When KVM needs to coalesce the normal page mappings into a block mapping,
>>> we currently invalidate the old table entry first, followed by invalidation
>>> of the TLB, then unmap the page mappings, and install the block entry last.
>>>
>>> It will cost a lot of time to unmap the numerous page mappings, which means
>>> the table entry will be left invalid for a long time before installation of
>>> the block entry, and this will cause many spurious translation faults.
>>>
>>> So let's quickly install the block entry at first to ensure uninterrupted
>>> memory access of the other vCPUs, and then unmap the page mappings after
>>> installation. This will reduce most of the time when the table entry is
>>> invalid, and avoid most of the unnecessary translation faults.
>>>
>>> Test results based on patch 1:
>>> (1) when 20 vCPUs concurrently access 20G ram (all 1G hugepages):
>>> KVM recover block mappings time(after dirty-logging): 2.87s -> 0.30s
>>>
>>> (2) when 40 vCPUs concurrently access 20G ram (all 1G hugepages):
>>> KVM recover block mappings time(after dirty-logging): 2.96s -> 0.35s
>>>
>>> So combined with patch 1, this makes a big difference to KVM creating mappings
>>> and recovering block mappings, with not much code change.
>>>
>>> About patch 4:
>>> A new method to distinguish cases of memcache allocations is introduced.
>>> By comparing fault_granule and vma_pagesize, cases that require allocations
>>> from memcache and cases that don't can be distinguished completely.
>>>
>>> ---
>>>
>>> Details of test results
>>> platform: HiSilicon Kunpeng920 (FWB not supported)
>>> host kernel: Linux mainline (v5.11-rc6)
>>>
>>> (1) performance change of patch 1
>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
>>>        (20 vcpus, 20G memory, block mappings(granule 1G))
>>> Before patch: KVM_CREATE_MAPPINGS: 52.8338s 52.8327s 52.8336s 52.8255s 52.8303s
>>> After  patch: KVM_CREATE_MAPPINGS:  3.7022s  3.7031s  3.7028s  3.7012s  3.7024s
>>>
>>> Before patch: KVM_ADJUST_MAPPINGS: 52.0466s 52.0473s 52.0550s 52.0518s 52.0467s
>>> After  patch: KVM_ADJUST_MAPPINGS:  2.8787s  2.8781s  2.8785s  2.8742s  2.8759s
>>>
>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
>>>        (40 vcpus, 20G memory, block mappings(granule 1G))
>>> Before patch: KVM_CREATE_MAPPINGS: 104.560s 104.556s 104.554s 104.556s 104.550s
>>> After  patch: KVM_CREATE_MAPPINGS:  3.7011s  3.7103s  3.7005s  3.7024s  3.7106s
>>>
>>> Before patch: KVM_ADJUST_MAPPINGS: 103.931s 103.936s 103.927s 103.942s 103.927s
>>> After  patch: KVM_ADJUST_MAPPINGS:  2.9621s  2.9648s  2.9474s  2.9587s  2.9603s
>>>
>>> (2) performance change of patch 2, 3(based on patch 1)
>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 1
>>>        (1 vcpu, 20G memory, block mappings(granule 1G))
>>> Before patch: KVM_ADJUST_MAPPINGS: 2.8241s 2.8234s 2.8245s 2.8230s 2.8652s
>>> After  patch: KVM_ADJUST_MAPPINGS: 0.2444s 0.2442s 0.2423s 0.2441s 0.2429s
>>>
>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
>>>        (20 vcpus, 20G memory, block mappings(granule 1G))
>>> Before patch: KVM_ADJUST_MAPPINGS: 2.8787s 2.8781s 2.8785s 2.8742s 2.8759s
>>> After  patch: KVM_ADJUST_MAPPINGS: 0.3008s 0.3004s 0.2974s 0.2917s 0.2900s
>>>
>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
>>>        (40 vcpus, 20G memory, block mappings(granule 1G))
>>> Before patch: KVM_ADJUST_MAPPINGS: 2.9621s 2.9648s 2.9474s 2.9587s 2.9603s
>>> After  patch: KVM_ADJUST_MAPPINGS: 0.3541s 0.3694s 0.3656s 0.3693s 0.3687s
>>>
>>> ---
>>>
>>> Yanan Wang (4):
>>>    KVM: arm64: Move the clean of dcache to the map handler
>>>    KVM: arm64: Add an independent API for coalescing tables
>>>    KVM: arm64: Install the block entry before unmapping the page mappings
>>>    KVM: arm64: Distinguish cases of memcache allocations completely
>>>
>>>   arch/arm64/include/asm/kvm_mmu.h | 16 -------
>>>   arch/arm64/kvm/hyp/pgtable.c     | 82 +++++++++++++++++++++-----------
>>>   arch/arm64/kvm/mmu.c             | 39 ++++++---------
>>>   3 files changed, 69 insertions(+), 68 deletions(-)
>>>
>> .
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 80+ messages in thread
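
A minimal sketch of the reordering described above for patches 2 and 3, purely
for illustration (pseudo-C, not meant to compile): every helper below is a
hypothetical placeholder rather than a symbol from the series or from mainline,
and only the relative ordering of the steps matters.

    /* Current flow: the table entry stays invalid for the whole unmap. */
    static void coalesce_into_block_before(kvm_pte_t *ptep, kvm_pte_t block_pte)
    {
        zap_table_entry(ptep);              /* hypothetical: invalidate entry */
        tlb_flush_range();                  /* hypothetical: TLBI for range   */
        unmap_page_mappings();              /* slow walk; vCPUs fault here    */
        smp_store_release(ptep, block_pte); /* block entry installed last     */
    }

    /* Flow after patches 2/3: publish the block entry first. */
    static void coalesce_into_block_after(kvm_pte_t *ptep, kvm_pte_t block_pte)
    {
        zap_table_entry(ptep);
        tlb_flush_range();
        smp_store_release(ptep, block_pte); /* window of spurious faults is tiny */
        free_detached_tables();             /* old tables are now unreachable    */
    }
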

* Re: [RFC PATCH 1/4] KVM: arm64: Move the clean of dcache to the map handler
  2021-02-08 11:22   ` Yanan Wang
  (?)
@ 2021-02-24 17:21     ` Alexandru Elisei
  -1 siblings, 0 replies; 80+ messages in thread
From: Alexandru Elisei @ 2021-02-24 17:21 UTC (permalink / raw)
  To: Yanan Wang, Marc Zyngier, Will Deacon, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, kvmarm, linux-arm-kernel, kvm, linux-kernel

Hello,

On 2/8/21 11:22 AM, Yanan Wang wrote:
> We currently uniformly clean dcache in user_mem_abort() before calling the
> fault handlers, if we take a translation fault and the pfn is cacheable.
> But if there are concurrent translation faults on the same page or block,
> only the first clean of dcache is necessary; the others are not.
>
> By moving clean of dcache to the map handler, we can easily identify the
> conditions where CMOs are really needed and avoid the unnecessary ones.
> Since it's a time-consuming process to perform CMOs, especially when flushing
> a block range, this solution reduces much of the load on KVM and improves the
> efficiency of creating mappings.
>
> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
> ---
>  arch/arm64/include/asm/kvm_mmu.h | 16 --------------
>  arch/arm64/kvm/hyp/pgtable.c     | 38 ++++++++++++++++++++------------
>  arch/arm64/kvm/mmu.c             | 14 +++---------
>  3 files changed, 27 insertions(+), 41 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index e52d82aeadca..4ec9879e82ed 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -204,22 +204,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
>  	return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
>  }
>  
> -static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
> -{
> -	void *va = page_address(pfn_to_page(pfn));
> -
> -	/*
> -	 * With FWB, we ensure that the guest always accesses memory using
> -	 * cacheable attributes, and we don't have to clean to PoC when
> -	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
> -	 * PoU is not required either in this case.
> -	 */
> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
> -		return;
> -
> -	kvm_flush_dcache_to_poc(va, size);
> -}
> -
>  static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
>  						  unsigned long size)
>  {
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 4d177ce1d536..2f4f87021980 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -464,6 +464,26 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
>  	return 0;
>  }
>  
> +static bool stage2_pte_cacheable(kvm_pte_t pte)
> +{
> +	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
> +	return memattr == PAGE_S2_MEMATTR(NORMAL);
> +}
> +
> +static void stage2_flush_dcache(void *addr, u64 size)
> +{
> +	/*
> +	 * With FWB, we ensure that the guest always accesses memory using
> +	 * cacheable attributes, and we don't have to clean to PoC when
> +	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
> +	 * PoU is not required either in this case.
> +	 */
> +	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
> +		return;
> +
> +	__flush_dcache_area(addr, size);
> +}
> +
>  static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>  				      kvm_pte_t *ptep,
>  				      struct stage2_map_data *data)
> @@ -495,6 +515,10 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>  		put_page(page);
>  	}
>  
> +	/* Flush data cache before installation of the new PTE */
> +	if (stage2_pte_cacheable(new))
> +		stage2_flush_dcache(__va(phys), granule);

This makes sense to me. kvm_pgtable_stage2_map() is protected against concurrent
calls by the kvm->mmu_lock, so only one VCPU can change the stage 2 translation
table at any given moment. In the case of concurrent translation faults on the
same IPA, the first VCPU that will take the lock will create the mapping and do
the dcache clean+invalidate. The other VCPUs will return -EAGAIN because the
mapping they are trying to install is almost identical* to the mapping created by
the first VCPU that took the lock.

I have a question. Why are you doing the cache maintenance *before* installing the
new mapping? This is what the kernel already does, so I'm not saying it's
incorrect, I'm just curious about the reason behind it.

*permissions might be different.

Thanks,

Alex

> +
>  	smp_store_release(ptep, new);
>  	get_page(page);
>  	data->phys += granule;
> @@ -651,20 +675,6 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
>  	return ret;
>  }
>  
> -static void stage2_flush_dcache(void *addr, u64 size)
> -{
> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
> -		return;
> -
> -	__flush_dcache_area(addr, size);
> -}
> -
> -static bool stage2_pte_cacheable(kvm_pte_t pte)
> -{
> -	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
> -	return memattr == PAGE_S2_MEMATTR(NORMAL);
> -}
> -
>  static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>  			       enum kvm_pgtable_walk_flags flag,
>  			       void * const arg)
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 77cb2d28f2a4..d151927a7d62 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -609,11 +609,6 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>  	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
>  }
>  
> -static void clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
> -{
> -	__clean_dcache_guest_page(pfn, size);
> -}
> -
>  static void invalidate_icache_guest_page(kvm_pfn_t pfn, unsigned long size)
>  {
>  	__invalidate_icache_guest_page(pfn, size);
> @@ -882,9 +877,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (writable)
>  		prot |= KVM_PGTABLE_PROT_W;
>  
> -	if (fault_status != FSC_PERM && !device)
> -		clean_dcache_guest_page(pfn, vma_pagesize);
> -
>  	if (exec_fault) {
>  		prot |= KVM_PGTABLE_PROT_X;
>  		invalidate_icache_guest_page(pfn, vma_pagesize);
> @@ -1144,10 +1136,10 @@ int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
>  	trace_kvm_set_spte_hva(hva);
>  
>  	/*
> -	 * We've moved a page around, probably through CoW, so let's treat it
> -	 * just like a translation fault and clean the cache to the PoC.
> +	 * We've moved a page around, probably through CoW, so let's treat
> +	 * it just like a translation fault and the map handler will clean
> +	 * the cache to the PoC.
>  	 */
> -	clean_dcache_guest_page(pfn, PAGE_SIZE);
>  	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pfn);
>  	return 0;
>  }

^ permalink raw reply	[flat|nested] 80+ messages in thread
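
The serialization described above can be pictured with a short sketch
(pseudo-C, not meant to compile): the helper name
pte_targets_same_block_or_page() is a hypothetical placeholder, and this is
not the actual stage2_map_walker_try_leaf(); it only mirrors the behaviour
Alexandru describes, with the caller holding kvm->mmu_lock.

    static int try_install_leaf_sketch(kvm_pte_t *ptep, kvm_pte_t new)
    {
        kvm_pte_t old = *ptep;      /* caller holds kvm->mmu_lock */

        /* A racing vCPU already mapped this IPA; let the caller retry. */
        if (kvm_pte_valid(old) && pte_targets_same_block_or_page(old, new))
            return -EAGAIN;

        /* Otherwise clean dcache to PoC for the new range, then publish it. */
        smp_store_release(ptep, new);
        return 0;
    }
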

* Re: [RFC PATCH 1/4] KVM: arm64: Move the clean of dcache to the map handler
  2021-02-24 17:21     ` Alexandru Elisei
  (?)
@ 2021-02-24 17:39       ` Marc Zyngier
  -1 siblings, 0 replies; 80+ messages in thread
From: Marc Zyngier @ 2021-02-24 17:39 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: Yanan Wang, Will Deacon, Catalin Marinas, James Morse,
	Julien Thierry, Suzuki K Poulose, Gavin Shan, Quentin Perret,
	kvmarm, linux-arm-kernel, kvm, linux-kernel

On Wed, 24 Feb 2021 17:21:22 +0000,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Hello,
> 
> On 2/8/21 11:22 AM, Yanan Wang wrote:
> > We currently uniformly clean dcache in user_mem_abort() before calling the
> > fault handlers, if we take a translation fault and the pfn is cacheable.
> > But if there are concurrent translation faults on the same page or block,
> > only the first clean of dcache is necessary; the others are not.
> >
> > By moving clean of dcache to the map handler, we can easily identify the
> > conditions where CMOs are really needed and avoid the unnecessary ones.
> > Since it's a time-consuming process to perform CMOs, especially when flushing
> > a block range, this solution reduces much of the load on KVM and improves the
> > efficiency of creating mappings.
> >
> > Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
> > ---
> >  arch/arm64/include/asm/kvm_mmu.h | 16 --------------
> >  arch/arm64/kvm/hyp/pgtable.c     | 38 ++++++++++++++++++++------------
> >  arch/arm64/kvm/mmu.c             | 14 +++---------
> >  3 files changed, 27 insertions(+), 41 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > index e52d82aeadca..4ec9879e82ed 100644
> > --- a/arch/arm64/include/asm/kvm_mmu.h
> > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > @@ -204,22 +204,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
> >  	return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
> >  }
> >  
> > -static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
> > -{
> > -	void *va = page_address(pfn_to_page(pfn));
> > -
> > -	/*
> > -	 * With FWB, we ensure that the guest always accesses memory using
> > -	 * cacheable attributes, and we don't have to clean to PoC when
> > -	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
> > -	 * PoU is not required either in this case.
> > -	 */
> > -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
> > -		return;
> > -
> > -	kvm_flush_dcache_to_poc(va, size);
> > -}
> > -
> >  static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
> >  						  unsigned long size)
> >  {
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 4d177ce1d536..2f4f87021980 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -464,6 +464,26 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
> >  	return 0;
> >  }
> >  
> > +static bool stage2_pte_cacheable(kvm_pte_t pte)
> > +{
> > +	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
> > +	return memattr == PAGE_S2_MEMATTR(NORMAL);
> > +}
> > +
> > +static void stage2_flush_dcache(void *addr, u64 size)
> > +{
> > +	/*
> > +	 * With FWB, we ensure that the guest always accesses memory using
> > +	 * cacheable attributes, and we don't have to clean to PoC when
> > +	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
> > +	 * PoU is not required either in this case.
> > +	 */
> > +	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
> > +		return;
> > +
> > +	__flush_dcache_area(addr, size);
> > +}
> > +
> >  static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> >  				      kvm_pte_t *ptep,
> >  				      struct stage2_map_data *data)
> > @@ -495,6 +515,10 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> >  		put_page(page);
> >  	}
> >  
> > +	/* Flush data cache before installation of the new PTE */
> > +	if (stage2_pte_cacheable(new))
> > +		stage2_flush_dcache(__va(phys), granule);
> 
> This makes sense to me. kvm_pgtable_stage2_map() is protected
> against concurrent calls by the kvm->mmu_lock, so only one VCPU can
> change the stage 2 translation table at any given moment. In the
> case of concurrent translation faults on the same IPA, the first
> VCPU that will take the lock will create the mapping and do the
> dcache clean+invalidate. The other VCPUs will return -EAGAIN because
> the mapping they are trying to install is almost identical* to the
> mapping created by the first VCPU that took the lock.
> 
> I have a question. Why are you doing the cache maintenance *before*
> installing the new mapping? This is what the kernel already does, so
> I'm not saying it's incorrect, I'm just curious about the reason
> behind it.

The guarantee KVM offers to the guest is that by the time it can
access the memory, it is cleaned to the PoC. If you establish a
mapping before cleaning, another vcpu can access the PoC (no fault,
you just set up S2) and not see it up to date.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 80+ messages in thread
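
Marc's point is the usual publish-after-prepare ordering: make the data fully
observable (here, cleaned to the PoC) before the release store that makes the
mapping reachable by other vCPUs. Below is a standalone C11 analogy of that
ordering (plain userspace code, not KVM code; the cache-maintenance step is
only mirrored by "the payload must be ready before the release store"):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int payload;                 /* stands in for the guest page  */
    static _Atomic(int *) published;    /* stands in for the stage-2 PTE */

    static void *producer(void *arg)
    {
        payload = 42;   /* "clean to PoC" analogue: data is ready first */
        atomic_store_explicit(&published, &payload, memory_order_release);
        return NULL;
    }

    static void *consumer(void *arg)
    {
        int *p;
        /* Like another vCPU: once the "PTE" is visible, access just works. */
        while (!(p = atomic_load_explicit(&published, memory_order_acquire)))
            ;
        printf("%d\n", *p);   /* guaranteed 42 because of the ordering */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, consumer, NULL);
        pthread_create(&b, NULL, producer, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }
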

* Re: [RFC PATCH 0/4] KVM: arm64: Improve efficiency of stage2 page table
  2021-02-24 17:20       ` Alexandru Elisei
  (?)
@ 2021-02-25  6:13         ` wangyanan (Y)
  -1 siblings, 0 replies; 80+ messages in thread
From: wangyanan (Y) @ 2021-02-25  6:13 UTC (permalink / raw)
  To: Alexandru Elisei, Marc Zyngier, Will Deacon, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, kvmarm, linux-arm-kernel, kvm, linux-kernel


On 2021/2/25 1:20, Alexandru Elisei wrote:
> Hi,
>
> On 2/24/21 2:35 AM, wangyanan (Y) wrote:
>
>> Hi Alex,
>>
>> On 2021/2/23 23:55, Alexandru Elisei wrote:
>>> Hi Yanan,
>>>
>>> I wanted to review the patches, but unfortunately I get an error when trying to
>>> apply the first patch in the series:
>>>
>>> Applying: KVM: arm64: Move the clean of dcache to the map handler
>>> error: patch failed: arch/arm64/kvm/hyp/pgtable.c:464
>>> error: arch/arm64/kvm/hyp/pgtable.c: patch does not apply
>>> error: patch failed: arch/arm64/kvm/mmu.c:882
>>> error: arch/arm64/kvm/mmu.c: patch does not apply
>>> Patch failed at 0001 KVM: arm64: Move the clean of dcache to the map handler
>>> hint: Use 'git am --show-current-patch=diff' to see the failed patch
>>> When you have resolved this problem, run "git am --continue".
>>> If you prefer to skip this patch, run "git am --skip" instead.
>>> To restore the original branch and stop patching, run "git am --abort".
>>>
>>> Tried this with Linux tags v5.11-rc1 to v5.11-rc7. It looks like pgtable.c and
>>> mmu.c from your patch are different from what is found on upstream master. Did you
>>> use another branch as the base for your patches?
>> Thanks for your attention.
>> Indeed, this series was more or less based on the patches I posted before (Link:
>> https://lore.kernel.org/r/20210114121350.123684-4-wangyanan55@huawei.com).
>> They have already been merged into the up-to-date upstream master (commit:
>> 509552e65ae8287178a5cdea2d734dcd2d6380ab), but not into tags v5.11-rc1 to
>> v5.11-rc7.
>> Could you please try the newest upstream master (since commit
>> 509552e65ae8287178a5cdea2d734dcd2d6380ab)? I have tested locally and no
>> apply errors occur.
> That worked for me, thank you for the quick reply.
>
> Just to double check, when you run the benchmarks, the before results are for a
> kernel built from commit 509552e65ae8 ("KVM: arm64: Mark the page dirty only if
> the fault is handled successfully"), and the after results are with this series on
> top, right?

Yes, that's right. So the performance changes come from this series itself and
have nothing to do with the earlier series that ended at commit 509552e65ae8.

Thanks,

Yanan

>
> Thanks,
>
> Alex
>
>> Thanks,
>>
>> Yanan.
>>
>>> Thanks,
>>>
>>> Alex
>>>
>>> On 2/8/21 11:22 AM, Yanan Wang wrote:
>>>> Hi,
>>>>
>>>> This series makes some efficiency improvements to the stage2 page table code,
>>>> and there are some test results to present the performance changes, which
>>>> were measured with a kvm selftest [1] that I have posted:
>>>> [1] https://lore.kernel.org/lkml/20210208090841.333724-1-wangyanan55@huawei.com/
>>>>
>>>> About patch 1:
>>>> We currently uniformly clean dcache in user_mem_abort() before calling the
>>>> fault handlers, if we take a translation fault and the pfn is cacheable.
>>>> But if there are concurrent translation faults on the same page or block,
>>>> only the first clean of dcache is necessary; the others are not.
>>>>
>>>> By moving clean of dcache to the map handler, we can easily identify the
>>>> conditions where CMOs are really needed and avoid the unnecessary ones.
>>>> Since it's a time-consuming process to perform CMOs, especially when flushing
>>>> a block range, this solution reduces much of the load on KVM and improves the
>>>> efficiency of creating mappings.
>>>>
>>>> Test results:
>>>> (1) when 20 vCPUs concurrently access 20G ram (all 1G hugepages):
>>>> KVM create block mappings time: 52.83s -> 3.70s
>>>> KVM recover block mappings time(after dirty-logging): 52.0s -> 2.87s
>>>>
>>>> (2) when 40 vCPUs concurrently access 20G ram (all 1G hugepages):
>>>> KVM creating block mappings time: 104.56s -> 3.70s
>>>> KVM recover block mappings time(after dirty-logging): 103.93s -> 2.96s
>>>>
>>>> About patch 2, 3:
>>>> When KVM needs to coalesce the normal page mappings into a block mapping,
>>>> we currently invalidate the old table entry first, followed by invalidation
>>>> of the TLB, then unmap the page mappings, and install the block entry last.
>>>>
>>>> It will cost a lot of time to unmap the numerous page mappings, which means
>>>> the table entry will be left invalid for a long time before installation of
>>>> the block entry, and this will cause many spurious translation faults.
>>>>
>>>> So let's quickly install the block entry at first to ensure uninterrupted
>>>> memory access of the other vCPUs, and then unmap the page mappings after
>>>> installation. This will reduce most of the time when the table entry is
>>>> invalid, and avoid most of the unnecessary translation faults.
>>>>
>>>> Test results based on patch 1:
>>>> (1) when 20 vCPUs concurrently access 20G ram (all 1G hugepages):
>>>> KVM recover block mappings time(after dirty-logging): 2.87s -> 0.30s
>>>>
>>>> (2) when 40 vCPUs concurrently access 20G ram (all 1G hugepages):
>>>> KVM recover block mappings time(after dirty-logging): 2.96s -> 0.35s
>>>>
>>>> So combined with patch 1, this makes a big difference to KVM creating mappings
>>>> and recovering block mappings, with not much code change.
>>>>
>>>> About patch 4:
>>>> A new method to distinguish cases of memcache allocations is introduced.
>>>> By comparing fault_granule and vma_pagesize, cases that require allocations
>>>> from memcache and cases that don't can be distinguished completely.
>>>>
>>>> ---
>>>>
>>>> Details of test results
>>>> platform: HiSilicon Kunpeng920 (FWB not supported)
>>>> host kernel: Linux mainline (v5.11-rc6)
>>>>
>>>> (1) performance change of patch 1
>>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
>>>>         (20 vcpus, 20G memory, block mappings(granule 1G))
>>>> Before patch: KVM_CREATE_MAPPINGS: 52.8338s 52.8327s 52.8336s 52.8255s 52.8303s
>>>> After  patch: KVM_CREATE_MAPPINGS:  3.7022s  3.7031s  3.7028s  3.7012s  3.7024s
>>>>
>>>> Before patch: KVM_ADJUST_MAPPINGS: 52.0466s 52.0473s 52.0550s 52.0518s 52.0467s
>>>> After  patch: KVM_ADJUST_MAPPINGS:  2.8787s  2.8781s  2.8785s  2.8742s  2.8759s
>>>>
>>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
>>>>         (40 vcpus, 20G memory, block mappings(granule 1G))
>>>> Before patch: KVM_CREATE_MAPPINGS: 104.560s 104.556s 104.554s 104.556s 104.550s
>>>> After  patch: KVM_CREATE_MAPPINGS:  3.7011s  3.7103s  3.7005s  3.7024s  3.7106s
>>>>
>>>> Before patch: KVM_ADJUST_MAPPINGS: 103.931s 103.936s 103.927s 103.942s 103.927s
>>>> After  patch: KVM_ADJUST_MAPPINGS:  2.9621s  2.9648s  2.9474s  2.9587s  2.9603s
>>>>
>>>> (2) performance change of patch 2, 3(based on patch 1)
>>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 1
>>>>         (1 vcpu, 20G memory, block mappings(granule 1G))
>>>> Before patch: KVM_ADJUST_MAPPINGS: 2.8241s 2.8234s 2.8245s 2.8230s 2.8652s
>>>> After  patch: KVM_ADJUST_MAPPINGS: 0.2444s 0.2442s 0.2423s 0.2441s 0.2429s
>>>>
>>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
>>>>         (20 vcpus, 20G memory, block mappings(granule 1G))
>>>> Before patch: KVM_ADJUST_MAPPINGS: 2.8787s 2.8781s 2.8785s 2.8742s 2.8759s
>>>> After  patch: KVM_ADJUST_MAPPINGS: 0.3008s 0.3004s 0.2974s 0.2917s 0.2900s
>>>>
>>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
>>>>         (40 vcpus, 20G memory, block mappings(granule 1G))
>>>> Before patch: KVM_ADJUST_MAPPINGS: 2.9621s 2.9648s 2.9474s 2.9587s 2.9603s
>>>> After  patch: KVM_ADJUST_MAPPINGS: 0.3541s 0.3694s 0.3656s 0.3693s 0.3687s
>>>>
>>>> ---
>>>>
>>>> Yanan Wang (4):
>>>>     KVM: arm64: Move the clean of dcache to the map handler
>>>>     KVM: arm64: Add an independent API for coalescing tables
>>>>     KVM: arm64: Install the block entry before unmapping the page mappings
>>>>     KVM: arm64: Distinguish cases of memcache allocations completely
>>>>
>>>>    arch/arm64/include/asm/kvm_mmu.h | 16 -------
>>>>    arch/arm64/kvm/hyp/pgtable.c     | 82 +++++++++++++++++++++-----------
>>>>    arch/arm64/kvm/mmu.c             | 39 ++++++---------
>>>>    3 files changed, 69 insertions(+), 68 deletions(-)
>>>>
>>> .
> .

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 0/4] KVM: arm64: Improve efficiency of stage2 page table
@ 2021-02-25  6:13         ` wangyanan (Y)
  0 siblings, 0 replies; 80+ messages in thread
From: wangyanan (Y) @ 2021-02-25  6:13 UTC (permalink / raw)
  To: Alexandru Elisei, Marc Zyngier, Will Deacon, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, kvmarm, linux-arm-kernel, kvm, linux-kernel


On 2021/2/25 1:20, Alexandru Elisei wrote:
> Hi,
>
> On 2/24/21 2:35 AM, wangyanan (Y) wrote:
>
>> Hi Alex,
>>
>> On 2021/2/23 23:55, Alexandru Elisei wrote:
>>> Hi Yanan,
>>>
>>> I wanted to review the patches, but unfortunately I get an error when trying to
>>> apply the first patch in the series:
>>>
>>> Applying: KVM: arm64: Move the clean of dcache to the map handler
>>> error: patch failed: arch/arm64/kvm/hyp/pgtable.c:464
>>> error: arch/arm64/kvm/hyp/pgtable.c: patch does not apply
>>> error: patch failed: arch/arm64/kvm/mmu.c:882
>>> error: arch/arm64/kvm/mmu.c: patch does not apply
>>> Patch failed at 0001 KVM: arm64: Move the clean of dcache to the map handler
>>> hint: Use 'git am --show-current-patch=diff' to see the failed patch
>>> When you have resolved this problem, run "git am --continue".
>>> If you prefer to skip this patch, run "git am --skip" instead.
>>> To restore the original branch and stop patching, run "git am --abort".
>>>
>>> Tried this with Linux tags v5.11-rc1 to v5.11-rc7. It looks like pgtable.c and
>>> mmu.c from your patch is different than what is found on upstream master. Did you
>>> use another branch as the base for your patches?
>> Thanks for your attention.
>> Indeed, this series was more or less based on the patches I posted earlier (Link:
>> https://lore.kernel.org/r/20210114121350.123684-4-wangyanan55@huawei.com).
>> They have already been merged into the up-to-date upstream master (commit:
>> 509552e65ae8287178a5cdea2d734dcd2d6380ab), but not into tags v5.11-rc1 to
>> v5.11-rc7.
>> Could you please try the newest upstream master (since commit
>> 509552e65ae8287178a5cdea2d734dcd2d6380ab)? I have tested locally and no
>> apply errors occur.
> That worked for me, thank you for the quick reply.
>
> Just to double check, when you run the benchmarks, the before results are for a
> kernel built from commit 509552e65ae8 ("KVM: arm64: Mark the page dirty only if
> the fault is handled successfully"), and the after results are with this series on
> top, right?

Yes, that's right. So the performance changes have nothing to do with the
previously merged series containing commit 509552e65ae8.

Thanks,

Yanan

>
> Thanks,
>
> Alex
>
>> Thanks,
>>
>> Yanan.
>>
>>> Thanks,
>>>
>>> Alex
>>>
>>> On 2/8/21 11:22 AM, Yanan Wang wrote:
>>>> Hi,
>>>>
>>>> This series makes some efficiency improvement of stage2 page table code,
>>>> and there are some test results to present the performance changes, which
>>>> were tested by a kvm selftest [1] that I have post:
>>>> [1] https://lore.kernel.org/lkml/20210208090841.333724-1-wangyanan55@huawei.com/
>>>>
>>>> About patch 1:
>>>> We currently uniformly clean dcache in user_mem_abort() before calling the
>>>> fault handlers, if we take a translation fault and the pfn is cacheable.
>>>> But if there are concurrent translation faults on the same page or block,
>>>> clean of dcache for the first time is necessary while the others are not.
>>>>
>>>> By moving clean of dcache to the map handler, we can easily identify the
>>>> conditions where CMOs are really needed and avoid the unnecessary ones.
>>>> As it's a time consuming process to perform CMOs especially when flushing
>>>> a block range, so this solution reduces much load of kvm and improve the
>>>> efficiency of creating mappings.
>>>>
>>>> Test results:
>>>> (1) when 20 vCPUs concurrently access 20G ram (all 1G hugepages):
>>>> KVM create block mappings time: 52.83s -> 3.70s
>>>> KVM recover block mappings time(after dirty-logging): 52.0s -> 2.87s
>>>>
>>>> (2) when 40 vCPUs concurrently access 20G ram (all 1G hugepages):
>>>> KVM creating block mappings time: 104.56s -> 3.70s
>>>> KVM recover block mappings time(after dirty-logging): 103.93s -> 2.96s
>>>>
>>>> About patch 2, 3:
>>>> When KVM needs to coalesce the normal page mappings into a block mapping,
>>>> we currently invalidate the old table entry first followed by invalidation
>>>> of TLB, then unmap the page mappings, and install the block entry at last.
>>>>
>>>> It will cost a lot of time to unmap the numerous page mappings, which means
>>>> the table entry will be left invalid for a long time before installation of
>>>> the block entry, and this will cause many spurious translation faults.
>>>>
>>>> So let's quickly install the block entry at first to ensure uninterrupted
>>>> memory access of the other vCPUs, and then unmap the page mappings after
>>>> installation. This will reduce most of the time when the table entry is
>>>> invalid, and avoid most of the unnecessary translation faults.
>>>>
>>>> Test results based on patch 1:
>>>> (1) when 20 vCPUs concurrently access 20G ram (all 1G hugepages):
>>>> KVM recover block mappings time(after dirty-logging): 2.87s -> 0.30s
>>>>
>>>> (2) when 40 vCPUs concurrently access 20G ram (all 1G hugepages):
>>>> KVM recover block mappings time(after dirty-logging): 2.96s -> 0.35s
>>>>
>>>> So combined with patch 1, it makes a big difference of KVM creating mappings
>>>> and recovering block mappings with not much code change.
>>>>
>>>> About patch 4:
>>>> A new method to distinguish cases of memcache allocations is introduced.
>>>> By comparing fault_granule and vma_pagesize, cases that require allocations
>>>> from memcache and cases that don't can be distinguished completely.
>>>>
>>>> ---
>>>>
>>>> Details of test results
>>>> platform: HiSilicon Kunpeng920 (FWB not supported)
>>>> host kernel: Linux mainline (v5.11-rc6)
>>>>
>>>> (1) performance change of patch 1
>>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
>>>>         (20 vcpus, 20G memory, block mappings(granule 1G))
>>>> Before patch: KVM_CREATE_MAPPINGS: 52.8338s 52.8327s 52.8336s 52.8255s 52.8303s
>>>> After  patch: KVM_CREATE_MAPPINGS:  3.7022s  3.7031s  3.7028s  3.7012s  3.7024s
>>>>
>>>> Before patch: KVM_ADJUST_MAPPINGS: 52.0466s 52.0473s 52.0550s 52.0518s 52.0467s
>>>> After  patch: KVM_ADJUST_MAPPINGS:  2.8787s  2.8781s  2.8785s  2.8742s  2.8759s
>>>>
>>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
>>>>         (40 vcpus, 20G memory, block mappings(granule 1G))
>>>> Before patch: KVM_CREATE_MAPPINGS: 104.560s 104.556s 104.554s 104.556s 104.550s
>>>> After  patch: KVM_CREATE_MAPPINGS:  3.7011s  3.7103s  3.7005s  3.7024s  3.7106s
>>>>
>>>> Before patch: KVM_ADJUST_MAPPINGS: 103.931s 103.936s 103.927s 103.942s 103.927s
>>>> After  patch: KVM_ADJUST_MAPPINGS:  2.9621s  2.9648s  2.9474s  2.9587s  2.9603s
>>>>
>>>> (2) performance change of patch 2, 3(based on patch 1)
>>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 1
>>>>         (1 vcpu, 20G memory, block mappings(granule 1G))
>>>> Before patch: KVM_ADJUST_MAPPINGS: 2.8241s 2.8234s 2.8245s 2.8230s 2.8652s
>>>> After  patch: KVM_ADJUST_MAPPINGS: 0.2444s 0.2442s 0.2423s 0.2441s 0.2429s
>>>>
>>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
>>>>         (20 vcpus, 20G memory, block mappings(granule 1G))
>>>> Before patch: KVM_ADJUST_MAPPINGS: 2.8787s 2.8781s 2.8785s 2.8742s 2.8759s
>>>> After  patch: KVM_ADJUST_MAPPINGS: 0.3008s 0.3004s 0.2974s 0.2917s 0.2900s
>>>>
>>>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
>>>>         (40 vcpus, 20G memory, block mappings(granule 1G))
>>>> Before patch: KVM_ADJUST_MAPPINGS: 2.9621s 2.9648s 2.9474s 2.9587s 2.9603s
>>>> After  patch: KVM_ADJUST_MAPPINGS: 0.3541s 0.3694s 0.3656s 0.3693s 0.3687s
>>>>
>>>> ---
>>>>
>>>> Yanan Wang (4):
>>>>     KVM: arm64: Move the clean of dcache to the map handler
>>>>     KVM: arm64: Add an independent API for coalescing tables
>>>>     KVM: arm64: Install the block entry before unmapping the page mappings
>>>>     KVM: arm64: Distinguish cases of memcache allocations completely
>>>>
>>>>    arch/arm64/include/asm/kvm_mmu.h | 16 -------
>>>>    arch/arm64/kvm/hyp/pgtable.c     | 82 +++++++++++++++++++++-----------
>>>>    arch/arm64/kvm/mmu.c             | 39 ++++++---------
>>>>    3 files changed, 69 insertions(+), 68 deletions(-)
>>>>
>>> .
> .

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 1/4] KVM: arm64: Move the clean of dcache to the map handler
  2021-02-08 11:22   ` Yanan Wang
@ 2021-02-25  9:55     ` Marc Zyngier
  -1 siblings, 0 replies; 80+ messages in thread
From: Marc Zyngier @ 2021-02-25  9:55 UTC (permalink / raw)
  To: Yanan Wang
  Cc: kvm, Catalin Marinas, linux-kernel, linux-arm-kernel,
	Will Deacon, kvmarm

Hi Yanan,

On Mon, 08 Feb 2021 11:22:47 +0000,
Yanan Wang <wangyanan55@huawei.com> wrote:
> 
> We currently uniformly clean dcache in user_mem_abort() before calling the
> fault handlers, if we take a translation fault and the pfn is cacheable.
> But if there are concurrent translation faults on the same page or block,
> clean of dcache for the first time is necessary while the others are not.
> 
> By moving clean of dcache to the map handler, we can easily identify the
> conditions where CMOs are really needed and avoid the unnecessary ones.
> As it's a time consuming process to perform CMOs especially when flushing
> a block range, so this solution reduces much load of kvm and improve the
> efficiency of creating mappings.

That's an interesting approach. However, wouldn't it be better to
identify early that there is already something mapped, and return to
the guest ASAP?

Can you quantify the benefit of this patch alone?
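
A hypothetical sketch of that alternative, with helpers invented purely for illustration (no such lookup API is implied by the series): check for an existing valid leaf first and resume the guest without redoing the CMOs or the map call.

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t kvm_pte_t;

/* hypothetical stub: a real version would walk the stage 2 tables */
static kvm_pte_t stage2_lookup_leaf(uint64_t ipa)
{
        (void)ipa;
        return 0;
}

static bool leaf_valid(kvm_pte_t pte)
{
        return pte & 1;         /* bit 0 is the valid bit of the descriptor */
}

/* Early out: if another vCPU has already mapped this IPA, just return to
 * the guest instead of cleaning and remapping. */
static bool fault_already_handled(uint64_t fault_ipa)
{
        return leaf_valid(stage2_lookup_leaf(fault_ipa));
}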

> 
> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
> ---
>  arch/arm64/include/asm/kvm_mmu.h | 16 --------------
>  arch/arm64/kvm/hyp/pgtable.c     | 38 ++++++++++++++++++++------------
>  arch/arm64/kvm/mmu.c             | 14 +++---------
>  3 files changed, 27 insertions(+), 41 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index e52d82aeadca..4ec9879e82ed 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -204,22 +204,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
>  	return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
>  }
>  
> -static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
> -{
> -	void *va = page_address(pfn_to_page(pfn));
> -
> -	/*
> -	 * With FWB, we ensure that the guest always accesses memory using
> -	 * cacheable attributes, and we don't have to clean to PoC when
> -	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
> -	 * PoU is not required either in this case.
> -	 */
> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
> -		return;
> -
> -	kvm_flush_dcache_to_poc(va, size);
> -}
> -
>  static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
>  						  unsigned long size)
>  {
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 4d177ce1d536..2f4f87021980 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -464,6 +464,26 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
>  	return 0;
>  }
>  
> +static bool stage2_pte_cacheable(kvm_pte_t pte)
> +{
> +	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
> +	return memattr == PAGE_S2_MEMATTR(NORMAL);
> +}
> +
> +static void stage2_flush_dcache(void *addr, u64 size)
> +{
> +	/*
> +	 * With FWB, we ensure that the guest always accesses memory using
> +	 * cacheable attributes, and we don't have to clean to PoC when
> +	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
> +	 * PoU is not required either in this case.
> +	 */
> +	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
> +		return;
> +
> +	__flush_dcache_area(addr, size);
> +}
> +
>  static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>  				      kvm_pte_t *ptep,
>  				      struct stage2_map_data *data)
> @@ -495,6 +515,10 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>  		put_page(page);
>  	}
>  
> +	/* Flush data cache before installation of the new PTE */
> +	if (stage2_pte_cacheable(new))
> +		stage2_flush_dcache(__va(phys), granule);
> +
>  	smp_store_release(ptep, new);
>  	get_page(page);
>  	data->phys += granule;
> @@ -651,20 +675,6 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
>  	return ret;
>  }
>  
> -static void stage2_flush_dcache(void *addr, u64 size)
> -{
> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
> -		return;
> -
> -	__flush_dcache_area(addr, size);
> -}
> -
> -static bool stage2_pte_cacheable(kvm_pte_t pte)
> -{
> -	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
> -	return memattr == PAGE_S2_MEMATTR(NORMAL);
> -}
> -
>  static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>  			       enum kvm_pgtable_walk_flags flag,
>  			       void * const arg)
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 77cb2d28f2a4..d151927a7d62 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -609,11 +609,6 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>  	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
>  }
>  
> -static void clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
> -{
> -	__clean_dcache_guest_page(pfn, size);
> -}
> -
>  static void invalidate_icache_guest_page(kvm_pfn_t pfn, unsigned long size)
>  {
>  	__invalidate_icache_guest_page(pfn, size);
> @@ -882,9 +877,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (writable)
>  		prot |= KVM_PGTABLE_PROT_W;
>  
> -	if (fault_status != FSC_PERM && !device)
> -		clean_dcache_guest_page(pfn, vma_pagesize);
> -
>  	if (exec_fault) {
>  		prot |= KVM_PGTABLE_PROT_X;
>  		invalidate_icache_guest_page(pfn, vma_pagesize);

It seems that the I-side CMO now happens *before* the D-side, which
seems odd. What prevents the CPU from speculatively fetching
instructions in the interval? I would also feel much more confident if
the two were kept close together.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 1/4] KVM: arm64: Move the clean of dcache to the map handler
  2021-02-24 17:39       ` Marc Zyngier
  (?)
@ 2021-02-25 16:45         ` Alexandru Elisei
  -1 siblings, 0 replies; 80+ messages in thread
From: Alexandru Elisei @ 2021-02-25 16:45 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Yanan Wang, Will Deacon, Catalin Marinas, James Morse,
	Julien Thierry, Suzuki K Poulose, Gavin Shan, Quentin Perret,
	kvmarm, linux-arm-kernel, kvm, linux-kernel

Hi Marc,

On 2/24/21 5:39 PM, Marc Zyngier wrote:
> On Wed, 24 Feb 2021 17:21:22 +0000,
> Alexandru Elisei <alexandru.elisei@arm.com> wrote:
>> Hello,
>>
>> On 2/8/21 11:22 AM, Yanan Wang wrote:
>>> We currently uniformly clean dcache in user_mem_abort() before calling the
>>> fault handlers, if we take a translation fault and the pfn is cacheable.
>>> But if there are concurrent translation faults on the same page or block,
>>> clean of dcache for the first time is necessary while the others are not.
>>>
>>> By moving clean of dcache to the map handler, we can easily identify the
>>> conditions where CMOs are really needed and avoid the unnecessary ones.
>>> As it's a time consuming process to perform CMOs especially when flushing
>>> a block range, so this solution reduces much load of kvm and improve the
>>> efficiency of creating mappings.
>>>
>>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>>> ---
>>>  arch/arm64/include/asm/kvm_mmu.h | 16 --------------
>>>  arch/arm64/kvm/hyp/pgtable.c     | 38 ++++++++++++++++++++------------
>>>  arch/arm64/kvm/mmu.c             | 14 +++---------
>>>  3 files changed, 27 insertions(+), 41 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>>> index e52d82aeadca..4ec9879e82ed 100644
>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>> @@ -204,22 +204,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
>>>  	return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
>>>  }
>>>  
>>> -static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
>>> -{
>>> -	void *va = page_address(pfn_to_page(pfn));
>>> -
>>> -	/*
>>> -	 * With FWB, we ensure that the guest always accesses memory using
>>> -	 * cacheable attributes, and we don't have to clean to PoC when
>>> -	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
>>> -	 * PoU is not required either in this case.
>>> -	 */
>>> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>>> -		return;
>>> -
>>> -	kvm_flush_dcache_to_poc(va, size);
>>> -}
>>> -
>>>  static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
>>>  						  unsigned long size)
>>>  {
>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>> index 4d177ce1d536..2f4f87021980 100644
>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>> @@ -464,6 +464,26 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
>>>  	return 0;
>>>  }
>>>  
>>> +static bool stage2_pte_cacheable(kvm_pte_t pte)
>>> +{
>>> +	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
>>> +	return memattr == PAGE_S2_MEMATTR(NORMAL);
>>> +}
>>> +
>>> +static void stage2_flush_dcache(void *addr, u64 size)
>>> +{
>>> +	/*
>>> +	 * With FWB, we ensure that the guest always accesses memory using
>>> +	 * cacheable attributes, and we don't have to clean to PoC when
>>> +	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
>>> +	 * PoU is not required either in this case.
>>> +	 */
>>> +	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>>> +		return;
>>> +
>>> +	__flush_dcache_area(addr, size);
>>> +}
>>> +
>>>  static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>>  				      kvm_pte_t *ptep,
>>>  				      struct stage2_map_data *data)
>>> @@ -495,6 +515,10 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>>  		put_page(page);
>>>  	}
>>>  
>>> +	/* Flush data cache before installation of the new PTE */
>>> +	if (stage2_pte_cacheable(new))
>>> +		stage2_flush_dcache(__va(phys), granule);
>> This makes sense to me. kvm_pgtable_stage2_map() is protected
>> against concurrent calls by the kvm->mmu_lock, so only one VCPU can
>> change the stage 2 translation table at any given moment. In the
>> case of concurrent translation faults on the same IPA, the first
>> VCPU that will take the lock will create the mapping and do the
>> dcache clean+invalidate. The other VCPUs will return -EAGAIN because
>> the mapping they are trying to install is almost identical* to the
>> mapping created by the first VCPU that took the lock.
>>
>> I have a question. Why are you doing the cache maintenance *before*
>> installing the new mapping? This is what the kernel already does, so
>> I'm not saying it's incorrect, I'm just curious about the reason
>> behind it.
> The guarantee KVM offers to the guest is that by the time it can
> access the memory, it is cleaned to the PoC. If you establish a
> mapping before cleaning, another vcpu can access the PoC (no fault,
> you just set up S2) and not see it up to date.

Right, I knew I was missing something, thanks for the explanation.
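
The guarantee quoted above can be modelled with two plain C threads (nothing KVM-specific, just stand-ins): the clean has to complete before the release store that publishes the leaf, otherwise the other vCPU can observe the mapping yet still read stale data.

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t leaf;           /* models the stage 2 leaf entry    */
static uint64_t page_contents;          /* models the guest page at the PoC */

static void clean_to_poc(void) { /* stub for the D-cache clean */ }

/* vCPU A, faulting path: clean first, then publish the mapping. */
static void map_page(uint64_t new_leaf)
{
        clean_to_poc();
        atomic_store_explicit(&leaf, new_leaf, memory_order_release);
}

/* vCPU B: once the leaf is visible it takes no fault, so it reads the
 * memory directly and must find it already clean to the PoC. */
static uint64_t guest_read(void)
{
        while (!atomic_load_explicit(&leaf, memory_order_acquire))
                ;                       /* spin until the mapping appears */
        return page_contents;
}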

Thanks,

Alex


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 1/4] KVM: arm64: Move the clean of dcache to the map handler
  2021-02-25  9:55     ` Marc Zyngier
  (?)
@ 2021-02-25 17:39       ` Alexandru Elisei
  -1 siblings, 0 replies; 80+ messages in thread
From: Alexandru Elisei @ 2021-02-25 17:39 UTC (permalink / raw)
  To: Marc Zyngier, Yanan Wang
  Cc: kvm, Catalin Marinas, linux-kernel, linux-arm-kernel,
	Will Deacon, kvmarm

Hi Marc,

On 2/25/21 9:55 AM, Marc Zyngier wrote:
> Hi Yanan,
>
> On Mon, 08 Feb 2021 11:22:47 +0000,
> Yanan Wang <wangyanan55@huawei.com> wrote:
>> We currently uniformly clean dcache in user_mem_abort() before calling the
>> fault handlers, if we take a translation fault and the pfn is cacheable.
>> But if there are concurrent translation faults on the same page or block,
>> clean of dcache for the first time is necessary while the others are not.
>>
>> By moving clean of dcache to the map handler, we can easily identify the
>> conditions where CMOs are really needed and avoid the unnecessary ones.
>> As it's a time consuming process to perform CMOs especially when flushing
>> a block range, so this solution reduces much load of kvm and improve the
>> efficiency of creating mappings.
> That's an interesting approach. However, wouldn't it be better to
> identify early that there is already something mapped, and return to
> the guest ASAP?

Wouldn't that introduce overhead for the common case, when there's only one VCPU
that faults on an address? For each data abort caused by a missing stage 2 entry
we would now have to determine if the IPA isn't already mapped and that means
walking the stage 2 tables.

Or am I mistaken and either:

(a) The common case is multiple simultaneous translation faults from different
VCPUs on the same IPA. Or

(b) There's a fast way to check if an IPA is mapped at stage 2 and the overhead
would be negligible.

>
> Can you quantify the benefit of this patch alone?
>
>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>> ---
>>  arch/arm64/include/asm/kvm_mmu.h | 16 --------------
>>  arch/arm64/kvm/hyp/pgtable.c     | 38 ++++++++++++++++++++------------
>>  arch/arm64/kvm/mmu.c             | 14 +++---------
>>  3 files changed, 27 insertions(+), 41 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>> index e52d82aeadca..4ec9879e82ed 100644
>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> @@ -204,22 +204,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
>>  	return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
>>  }
>>  
>> -static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
>> -{
>> -	void *va = page_address(pfn_to_page(pfn));
>> -
>> -	/*
>> -	 * With FWB, we ensure that the guest always accesses memory using
>> -	 * cacheable attributes, and we don't have to clean to PoC when
>> -	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
>> -	 * PoU is not required either in this case.
>> -	 */
>> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>> -		return;
>> -
>> -	kvm_flush_dcache_to_poc(va, size);
>> -}
>> -
>>  static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
>>  						  unsigned long size)
>>  {
>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> index 4d177ce1d536..2f4f87021980 100644
>> --- a/arch/arm64/kvm/hyp/pgtable.c
>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>> @@ -464,6 +464,26 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
>>  	return 0;
>>  }
>>  
>> +static bool stage2_pte_cacheable(kvm_pte_t pte)
>> +{
>> +	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
>> +	return memattr == PAGE_S2_MEMATTR(NORMAL);
>> +}
>> +
>> +static void stage2_flush_dcache(void *addr, u64 size)
>> +{
>> +	/*
>> +	 * With FWB, we ensure that the guest always accesses memory using
>> +	 * cacheable attributes, and we don't have to clean to PoC when
>> +	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
>> +	 * PoU is not required either in this case.
>> +	 */
>> +	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>> +		return;
>> +
>> +	__flush_dcache_area(addr, size);
>> +}
>> +
>>  static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>  				      kvm_pte_t *ptep,
>>  				      struct stage2_map_data *data)
>> @@ -495,6 +515,10 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>  		put_page(page);
>>  	}
>>  
>> +	/* Flush data cache before installation of the new PTE */
>> +	if (stage2_pte_cacheable(new))
>> +		stage2_flush_dcache(__va(phys), granule);
>> +
>>  	smp_store_release(ptep, new);
>>  	get_page(page);
>>  	data->phys += granule;
>> @@ -651,20 +675,6 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
>>  	return ret;
>>  }
>>  
>> -static void stage2_flush_dcache(void *addr, u64 size)
>> -{
>> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>> -		return;
>> -
>> -	__flush_dcache_area(addr, size);
>> -}
>> -
>> -static bool stage2_pte_cacheable(kvm_pte_t pte)
>> -{
>> -	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
>> -	return memattr == PAGE_S2_MEMATTR(NORMAL);
>> -}
>> -
>>  static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>  			       enum kvm_pgtable_walk_flags flag,
>>  			       void * const arg)
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 77cb2d28f2a4..d151927a7d62 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -609,11 +609,6 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>>  	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
>>  }
>>  
>> -static void clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
>> -{
>> -	__clean_dcache_guest_page(pfn, size);
>> -}
>> -
>>  static void invalidate_icache_guest_page(kvm_pfn_t pfn, unsigned long size)
>>  {
>>  	__invalidate_icache_guest_page(pfn, size);
>> @@ -882,9 +877,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  	if (writable)
>>  		prot |= KVM_PGTABLE_PROT_W;
>>  
>> -	if (fault_status != FSC_PERM && !device)
>> -		clean_dcache_guest_page(pfn, vma_pagesize);
>> -
>>  	if (exec_fault) {
>>  		prot |= KVM_PGTABLE_PROT_X;
>>  		invalidate_icache_guest_page(pfn, vma_pagesize);
> It seems that the I-side CMO now happens *before* the D-side, which
> seems odd. What prevents the CPU from speculatively fetching
> instructions in the interval? I would also feel much more confident if
> the two were kept close together.

I noticed yet another thing which I don't understand. When the CPU has the
ARM64_HAS_CACHE_DIC feature (CTR_EL0.DIC = 1), which means instruction invalidation
is not required for data to instruction coherence, we still do the icache
invalidation. I am wondering if the invalidation is necessary in this case.

If it's not, then I think it's correct (and straightforward) to move the icache
invalidation to stage2_map_walker_try_leaf() after the dcache clean+inval and make
it depend on the new mapping being executable *and*
!cpus_have_const_cap(ARM64_HAS_CACHE_DIC).

If the icache invalidation is required even if ARM64_HAS_CACHE_DIC is present,
then I'm not sure how we can distinguish between setting the executable
permissions because exec_fault (the code above) and setting the same permissions
because cpus_have_const_cap(ARM64_HAS_CACHE_DIC) (the code immediately following
the snippet above).
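
A minimal sketch of the first option (stand-in flag and CMO stubs; it assumes the invalidation really can be skipped when CTR_EL0.DIC is set, which is the open question above): both CMOs sit together in the map path, right before the new leaf is written, and the I-side one is gated on the executable permission and on the absence of ARM64_HAS_CACHE_DIC.

#include <stdbool.h>
#include <stdint.h>

#define PROT_X          (1ULL << 2)     /* stand-in for KVM_PGTABLE_PROT_X */

static bool have_cache_dic;             /* models ARM64_HAS_CACHE_DIC      */

static void clean_dcache(void *va, uint64_t size)      { /* D-side CMO stub */ }
static void invalidate_icache(void *va, uint64_t size) { /* I-side CMO stub */ }

/* Both maintenance operations done together, just before the new leaf is
 * published by the map walker. */
static void pre_install_cmos(void *va, uint64_t granule, uint64_t prot,
                             bool cacheable)
{
        if (cacheable)
                clean_dcache(va, granule);
        if ((prot & PROT_X) && !have_cache_dic)
                invalidate_icache(va, granule);
}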

Thanks,

Alex

>
> Thanks,
>
> 	M.
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 1/4] KVM: arm64: Move the clean of dcache to the map handler
  2021-02-25 17:39       ` Alexandru Elisei
  (?)
@ 2021-02-25 18:30         ` Marc Zyngier
  -1 siblings, 0 replies; 80+ messages in thread
From: Marc Zyngier @ 2021-02-25 18:30 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: Yanan Wang, kvm, Catalin Marinas, linux-kernel, linux-arm-kernel,
	Will Deacon, kvmarm

On Thu, 25 Feb 2021 17:39:00 +0000,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Hi Marc,
> 
> On 2/25/21 9:55 AM, Marc Zyngier wrote:
> > Hi Yanan,
> >
> > On Mon, 08 Feb 2021 11:22:47 +0000,
> > Yanan Wang <wangyanan55@huawei.com> wrote:
> >> We currently uniformly clean dcache in user_mem_abort() before calling the
> >> fault handlers, if we take a translation fault and the pfn is cacheable.
> >> But if there are concurrent translation faults on the same page or block,
> >> clean of dcache for the first time is necessary while the others are not.
> >>
> >> By moving clean of dcache to the map handler, we can easily identify the
> >> conditions where CMOs are really needed and avoid the unnecessary ones.
> >> As it's a time consuming process to perform CMOs especially when flushing
> >> a block range, so this solution reduces much load of kvm and improve the
> >> efficiency of creating mappings.
> > That's an interesting approach. However, wouldn't it be better to
> > identify early that there is already something mapped, and return to
> > the guest ASAP?
> 
> Wouldn't that introduce overhead for the common case, when there's
> only one VCPU that faults on an address? For each data abort caused
> by a missing stage 2 entry we would now have to determine if the IPA
> isn't already mapped and that means walking the stage 2 tables.

The problem is that there is no easy way to define the "common case". It all
depends on what you are running in the guest.

> Or am I mistaken and either:
> 
> (a) The common case is multiple simultaneous translation faults from
> different VCPUs on the same IPA. Or
> 
> (b) There's a fast way to check if an IPA is mapped at stage 2 and
> the overhead would be negligible.

Checking that something is mapped is simple enough: walk the S2 PT (in
SW or using AT/PAR), and return early if there is *anything*. You
already have taken the fault, which is the most expensive part of the
handling.
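
Roughly something like this at the top of user_mem_abort(), with the helper
name entirely made up (stage2_ipa_is_mapped() standing in for a SW walk of
the stage 2 tables), would be enough to illustrate the idea:

	/* Illustration only, not a real patch */
	if (fault_status != FSC_PERM &&
	    stage2_ipa_is_mapped(vcpu->arch.hw_mmu->pgt, fault_ipa))
		return 0;	/* something is there already, back to the guest */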

> 
> >
> > Can you quantify the benefit of this patch alone?

And this ^^^ part is crucial to evaluating the merit of this patch,
especially outside of the micro-benchmark space.

> >
> >> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
> >> ---
> >>  arch/arm64/include/asm/kvm_mmu.h | 16 --------------
> >>  arch/arm64/kvm/hyp/pgtable.c     | 38 ++++++++++++++++++++------------
> >>  arch/arm64/kvm/mmu.c             | 14 +++---------
> >>  3 files changed, 27 insertions(+), 41 deletions(-)
> >>
> >> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> >> index e52d82aeadca..4ec9879e82ed 100644
> >> --- a/arch/arm64/include/asm/kvm_mmu.h
> >> +++ b/arch/arm64/include/asm/kvm_mmu.h
> >> @@ -204,22 +204,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
> >>  	return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
> >>  }
> >>  
> >> -static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
> >> -{
> >> -	void *va = page_address(pfn_to_page(pfn));
> >> -
> >> -	/*
> >> -	 * With FWB, we ensure that the guest always accesses memory using
> >> -	 * cacheable attributes, and we don't have to clean to PoC when
> >> -	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
> >> -	 * PoU is not required either in this case.
> >> -	 */
> >> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
> >> -		return;
> >> -
> >> -	kvm_flush_dcache_to_poc(va, size);
> >> -}
> >> -
> >>  static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
> >>  						  unsigned long size)
> >>  {
> >> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> >> index 4d177ce1d536..2f4f87021980 100644
> >> --- a/arch/arm64/kvm/hyp/pgtable.c
> >> +++ b/arch/arm64/kvm/hyp/pgtable.c
> >> @@ -464,6 +464,26 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
> >>  	return 0;
> >>  }
> >>  
> >> +static bool stage2_pte_cacheable(kvm_pte_t pte)
> >> +{
> >> +	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
> >> +	return memattr == PAGE_S2_MEMATTR(NORMAL);
> >> +}
> >> +
> >> +static void stage2_flush_dcache(void *addr, u64 size)
> >> +{
> >> +	/*
> >> +	 * With FWB, we ensure that the guest always accesses memory using
> >> +	 * cacheable attributes, and we don't have to clean to PoC when
> >> +	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
> >> +	 * PoU is not required either in this case.
> >> +	 */
> >> +	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
> >> +		return;
> >> +
> >> +	__flush_dcache_area(addr, size);
> >> +}
> >> +
> >>  static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> >>  				      kvm_pte_t *ptep,
> >>  				      struct stage2_map_data *data)
> >> @@ -495,6 +515,10 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> >>  		put_page(page);
> >>  	}
> >>  
> >> +	/* Flush data cache before installation of the new PTE */
> >> +	if (stage2_pte_cacheable(new))
> >> +		stage2_flush_dcache(__va(phys), granule);
> >> +
> >>  	smp_store_release(ptep, new);
> >>  	get_page(page);
> >>  	data->phys += granule;
> >> @@ -651,20 +675,6 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
> >>  	return ret;
> >>  }
> >>  
> >> -static void stage2_flush_dcache(void *addr, u64 size)
> >> -{
> >> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
> >> -		return;
> >> -
> >> -	__flush_dcache_area(addr, size);
> >> -}
> >> -
> >> -static bool stage2_pte_cacheable(kvm_pte_t pte)
> >> -{
> >> -	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
> >> -	return memattr == PAGE_S2_MEMATTR(NORMAL);
> >> -}
> >> -
> >>  static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> >>  			       enum kvm_pgtable_walk_flags flag,
> >>  			       void * const arg)
> >> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> >> index 77cb2d28f2a4..d151927a7d62 100644
> >> --- a/arch/arm64/kvm/mmu.c
> >> +++ b/arch/arm64/kvm/mmu.c
> >> @@ -609,11 +609,6 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> >>  	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
> >>  }
> >>  
> >> -static void clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
> >> -{
> >> -	__clean_dcache_guest_page(pfn, size);
> >> -}
> >> -
> >>  static void invalidate_icache_guest_page(kvm_pfn_t pfn, unsigned long size)
> >>  {
> >>  	__invalidate_icache_guest_page(pfn, size);
> >> @@ -882,9 +877,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>  	if (writable)
> >>  		prot |= KVM_PGTABLE_PROT_W;
> >>  
> >> -	if (fault_status != FSC_PERM && !device)
> >> -		clean_dcache_guest_page(pfn, vma_pagesize);
> >> -
> >>  	if (exec_fault) {
> >>  		prot |= KVM_PGTABLE_PROT_X;
> >>  		invalidate_icache_guest_page(pfn, vma_pagesize);
> > It seems that the I-side CMO now happens *before* the D-side, which
> > seems odd. What prevents the CPU from speculatively fetching
> > instructions in the interval? I would also feel much more confident if
> > the two were kept close together.
> 
> I noticed yet another thing which I don't understand. When the CPU
> has the ARM64_HAS_CACHE_DIC feature (CTR_EL0.DIC = 1), which means
> instruction invalidation is not required for data to instruction
> coherence, we still do the icache invalidation. I am wondering if
> the invalidation is necessary in this case.

It isn't, and DIC is already taken care of in the leaf functions (see
__flush_icache_all() and invalidate_icache_range()).
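
For reference, __flush_icache_all() already bails out on DIC; from memory it
looks more or less like this in arch/arm64/include/asm/cacheflush.h (worth
double-checking against the tree):

	static inline void __flush_icache_all(void)
	{
		if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
			return;

		asm("ic ialluis");
		dsb(ish);
	}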

> If it's not, then I think it's correct (and straightforward) to move
> the icache invalidation to stage2_map_walker_try_leaf() after the
> dcache clean+inval and make it depend on the new mapping being
> executable *and* !cpus_have_const_cap(ARM64_HAS_CACHE_DIC).

It would also need to be duplicated on the permission fault path.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 1/4] KVM: arm64: Move the clean of dcache to the map handler
  2021-02-25 18:30         ` Marc Zyngier
  (?)
@ 2021-02-26 15:51           ` wangyanan (Y)
  -1 siblings, 0 replies; 80+ messages in thread
From: wangyanan (Y) @ 2021-02-26 15:51 UTC (permalink / raw)
  To: Marc Zyngier, Alexandru Elisei
  Cc: kvm, Catalin Marinas, linux-kernel, linux-arm-kernel,
	Will Deacon, kvmarm, wanghaibin.wang, yuzenghui

Hi Marc, Alex,

On 2021/2/26 2:30, Marc Zyngier wrote:
> On Thu, 25 Feb 2021 17:39:00 +0000,
> Alexandru Elisei <alexandru.elisei@arm.com> wrote:
>> Hi Marc,
>>
>> On 2/25/21 9:55 AM, Marc Zyngier wrote:
>>> Hi Yanan,
>>>
>>> On Mon, 08 Feb 2021 11:22:47 +0000,
>>> Yanan Wang <wangyanan55@huawei.com> wrote:
>>>> We currently uniformly clean dcache in user_mem_abort() before calling the
>>>> fault handlers, if we take a translation fault and the pfn is cacheable.
>>>> But if there are concurrent translation faults on the same page or block,
>>>> clean of dcache for the first time is necessary while the others are not.
>>>>
>>>> By moving clean of dcache to the map handler, we can easily identify the
>>>> conditions where CMOs are really needed and avoid the unnecessary ones.
>>>> As it's a time consuming process to perform CMOs especially when flushing
>>>> a block range, so this solution reduces much load of kvm and improve the
>>>> efficiency of creating mappings.
>>> That's an interesting approach. However, wouldn't it be better to
>>> identify early that there is already something mapped, and return to
>>> the guest ASAP?
>> Wouldn't that introduce overhead for the common case, when there's
>> only one VCPU that faults on an address? For each data abort caused
>> by a missing stage 2 entry we would now have to determine if the IPA
>> isn't already mapped and that means walking the stage 2 tables.
> The problem is that there is no easy to define "common case". It all
> depends on what you are running in the guest.
>
>> Or am I mistaken and either:
>>
>> (a) The common case is multiple simultaneous translation faults from
>> different VCPUs on the same IPA. Or
>>
>> (b) There's a fast way to check if an IPA is mapped at stage 2 and
>> the overhead would be negligible.
> Checking that something is mapped is simple enough: walk the S2 PT (in
> SW or using AT/PAR), and return early if there is *anything*. You
> already have taken the fault, which is the most expensive part of the
> handling.
I think it may be better to move the CMOs (both dcache and icache) to the
fault handlers. The map path and the permission path are effectively page
table walks, and in those paths we can now easily distinguish the conditions
that need CMOs from the ones that don't. Why add one more PTW up front just
to identify the CMO cases, while ignoring the walk we already do?

Besides, if we know in advance that there is already something mapped (the
table entry is valid), it may not be appropriate to just return early in all
cases. What if we are going to change the output address (OA) of the existing
table entry? We can't just return in that case. I'm not sure whether this is
a correct example :).

Actually, moving the CMOs to the fault handlers does not disturb the existing
stage 2 page table framework, and it does not require much code change.
Please see the diff further below.
>>> Can you quantify the benefit of this patch alone?
> And this ^^^ part is crucial to evaluating the merit of this patch,
> especially outside of the micro-benchmark space.
The following test results represent the benefit of this patch alone, and
they indicate that the benefit increases as the page table granularity
increases.
Selftest:
https://lore.kernel.org/lkml/20210208090841.333724-1-wangyanan55@huawei.com/


---
hardware platform: HiSilicon Kunpeng920 Server (FWB not supported)
host kernel: Linux mainline v5.11-rc6 (with series of
https://lore.kernel.org/r/20210114121350.123684-4-wangyanan55@huawei.com applied)

(1) multiple vcpus concurrently access 1G memory.
    execution time of: a) KVM create new page mappings (normal 4K), b) update the mappings from RO to RW.

cmdline: ./kvm_page_table_test -m 4 -t 0 -g 4K -s 1G -v 50
         (50 vcpus, 1G memory, page mappings (normal 4K))
a) Before patch: KVM_CREATE_MAPPINGS: 62.752s 62.123s 61.733s 62.562s 61.847s
   After  patch: KVM_CREATE_MAPPINGS: 58.800s 58.364s 58.163s 58.370s 58.677s *average 7% improvement*
b) Before patch: KVM_UPDATE_MAPPINGS: 49.083s 49.920s 49.484s 49.551s 49.410s
   After  patch: KVM_UPDATE_MAPPINGS: 48.723s 49.259s 49.204s 48.207s 49.112s *no change*

cmdline: ./kvm_page_table_test -m 4 -t 0 -g 4K -s 1G -v 100
         (100 vcpus, 1G memory, page mappings (normal 4K))
a) Before patch: KVM_CREATE_MAPPINGS: 129.70s 129.66s 126.78s 126.07s 130.21s
   After  patch: KVM_CREATE_MAPPINGS: 120.69s 120.28s 120.68s 121.09s 121.34s *average 9% improvement*
b) Before patch: KVM_UPDATE_MAPPINGS: 94.097s 94.501s 92.589s 93.957s 94.317s
   After  patch: KVM_UPDATE_MAPPINGS: 93.677s 93.701s 93.036s 93.484s 93.584s *no change*

(2) multiple vcpus concurrently access 20G memory.
    execution time of: a) KVM create new block mappings (THP 2M), b) split the blocks in dirty logging, c) reconstitute the blocks after dirty logging.

cmdline: ./kvm_page_table_test -m 4 -t 1 -g 2M -s 20G -v 20
         (20 vcpus, 20G memory, block mappings (THP 2M))
a) Before patch: KVM_CREATE_MAPPINGS: 12.546s 13.300s 12.448s 12.496s 12.420s
   After  patch: KVM_CREATE_MAPPINGS:  5.679s  5.773s  5.759s  5.698s  5.835s *average 54% improvement*
b) Before patch: KVM_UPDATE_MAPPINGS: 78.510s 78.026s 80.813s 80.681s 81.671s
   After  patch: KVM_UPDATE_MAPPINGS: 52.820s 57.652s 51.390s 56.468s 60.070s *average 30% improvement*
c) Before patch: KVM_ADJUST_MAPPINGS: 82.617s 83.551s 83.839s 83.844s 85.416s
   After  patch: KVM_ADJUST_MAPPINGS: 61.208s 57.212s 58.473s 57.521s 64.364s *average 30% improvement*

cmdline: ./kvm_page_table_test -m 4 -t 1 -g 2M -s 20G -v 40
         (40 vcpus, 20G memory, block mappings (THP 2M))
a) Before patch: KVM_CREATE_MAPPINGS: 13.226s 13.986s 13.671s 13.697s 13.077s
   After  patch: KVM_CREATE_MAPPINGS:  7.274s  7.139s  7.257s  7.012s  7.076s *average 48% improvement*
b) Before patch: KVM_UPDATE_MAPPINGS: 173.70s 177.45s 178.68s 175.45s 175.50s
   After  patch: KVM_UPDATE_MAPPINGS: 129.62s 131.61s 131.36s 123.58s 131.73s *average 28% improvement*
c) Before patch: KVM_ADJUST_MAPPINGS: 179.96s 179.61s 182.01s 181.35s 181.11s
   After  patch: KVM_ADJUST_MAPPINGS: 137.74s 139.92s 139.79s 132.52s 140.30s *average 25% improvement*

(3) multiple vcpus concurrently access 20G memory.
    execution time of: a) KVM create new block mappings (HUGETLB 1G), b) split the blocks in dirty logging, c) reconstitute the blocks after dirty logging.

cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
         (20 vcpus, 20G memory, block mappings (HUGETLB 1G))
a) Before patch: KVM_CREATE_MAPPINGS: 52.808s 52.814s 52.826s 52.833s 52.809s
   After  patch: KVM_CREATE_MAPPINGS:  3.701s  3.700s  3.702s  3.701s  3.706s *average 93% improvement*
b) Before patch: KVM_UPDATE_MAPPINGS: 80.886s 80.582s 78.190s 79.964s 80.561s
   After  patch: KVM_UPDATE_MAPPINGS: 55.546s 53.800s 57.103s 56.278s 56.372s *average 30% improvement*
c) Before patch: KVM_ADJUST_MAPPINGS: 52.027s 52.031s 52.026s 52.027s 52.024s
   After  patch: KVM_ADJUST_MAPPINGS:  2.881s  2.883s  2.885s  2.879s  2.882s *average 95% improvement*

cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
         (40 vcpus, 20G memory, block mappings (HUGETLB 1G))
a) Before patch: KVM_CREATE_MAPPINGS: 104.51s 104.53s 104.52s 104.53s 104.52s
   After  patch: KVM_CREATE_MAPPINGS:  3.698s  3.699s  3.726s  3.700s  3.697s *average 96% improvement*
b) Before patch: KVM_UPDATE_MAPPINGS: 171.75s 173.73s 172.11s 173.39s 170.69s
   After  patch: KVM_UPDATE_MAPPINGS: 126.66s 128.69s 126.59s 120.54s 127.08s *average 28% improvement*
c) Before patch: KVM_ADJUST_MAPPINGS: 103.93s 103.94s 103.90s 103.78s 103.78s
   After  patch: KVM_ADJUST_MAPPINGS:  2.954s  2.955s  2.949s  2.951s  2.953s *average 97% improvement*
>>>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>>>> ---
>>>>   arch/arm64/include/asm/kvm_mmu.h | 16 --------------
>>>>   arch/arm64/kvm/hyp/pgtable.c     | 38 ++++++++++++++++++++------------
>>>>   arch/arm64/kvm/mmu.c             | 14 +++---------
>>>>   3 files changed, 27 insertions(+), 41 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>>>> index e52d82aeadca..4ec9879e82ed 100644
>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>>> @@ -204,22 +204,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
>>>>   	return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
>>>>   }
>>>>   
>>>> -static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
>>>> -{
>>>> -	void *va = page_address(pfn_to_page(pfn));
>>>> -
>>>> -	/*
>>>> -	 * With FWB, we ensure that the guest always accesses memory using
>>>> -	 * cacheable attributes, and we don't have to clean to PoC when
>>>> -	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
>>>> -	 * PoU is not required either in this case.
>>>> -	 */
>>>> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>>>> -		return;
>>>> -
>>>> -	kvm_flush_dcache_to_poc(va, size);
>>>> -}
>>>> -
>>>>   static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
>>>>   						  unsigned long size)
>>>>   {
>>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>>> index 4d177ce1d536..2f4f87021980 100644
>>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>>> @@ -464,6 +464,26 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
>>>>   	return 0;
>>>>   }
>>>>   
>>>> +static bool stage2_pte_cacheable(kvm_pte_t pte)
>>>> +{
>>>> +	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
>>>> +	return memattr == PAGE_S2_MEMATTR(NORMAL);
>>>> +}
>>>> +
>>>> +static void stage2_flush_dcache(void *addr, u64 size)
>>>> +{
>>>> +	/*
>>>> +	 * With FWB, we ensure that the guest always accesses memory using
>>>> +	 * cacheable attributes, and we don't have to clean to PoC when
>>>> +	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
>>>> +	 * PoU is not required either in this case.
>>>> +	 */
>>>> +	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>>>> +		return;
>>>> +
>>>> +	__flush_dcache_area(addr, size);
>>>> +}
>>>> +
>>>>   static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>>>   				      kvm_pte_t *ptep,
>>>>   				      struct stage2_map_data *data)
>>>> @@ -495,6 +515,10 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>>>   		put_page(page);
>>>>   	}
>>>>   
>>>> +	/* Flush data cache before installation of the new PTE */
>>>> +	if (stage2_pte_cacheable(new))
>>>> +		stage2_flush_dcache(__va(phys), granule);
>>>> +
>>>>   	smp_store_release(ptep, new);
>>>>   	get_page(page);
>>>>   	data->phys += granule;
>>>> @@ -651,20 +675,6 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
>>>>   	return ret;
>>>>   }
>>>>   
>>>> -static void stage2_flush_dcache(void *addr, u64 size)
>>>> -{
>>>> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>>>> -		return;
>>>> -
>>>> -	__flush_dcache_area(addr, size);
>>>> -}
>>>> -
>>>> -static bool stage2_pte_cacheable(kvm_pte_t pte)
>>>> -{
>>>> -	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
>>>> -	return memattr == PAGE_S2_MEMATTR(NORMAL);
>>>> -}
>>>> -
>>>>   static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>>>   			       enum kvm_pgtable_walk_flags flag,
>>>>   			       void * const arg)
>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>> index 77cb2d28f2a4..d151927a7d62 100644
>>>> --- a/arch/arm64/kvm/mmu.c
>>>> +++ b/arch/arm64/kvm/mmu.c
>>>> @@ -609,11 +609,6 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>>>>   	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
>>>>   }
>>>>   
>>>> -static void clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
>>>> -{
>>>> -	__clean_dcache_guest_page(pfn, size);
>>>> -}
>>>> -
>>>>   static void invalidate_icache_guest_page(kvm_pfn_t pfn, unsigned long size)
>>>>   {
>>>>   	__invalidate_icache_guest_page(pfn, size);
>>>> @@ -882,9 +877,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>   	if (writable)
>>>>   		prot |= KVM_PGTABLE_PROT_W;
>>>>   
>>>> -	if (fault_status != FSC_PERM && !device)
>>>> -		clean_dcache_guest_page(pfn, vma_pagesize);
>>>> -
>>>>   	if (exec_fault) {
>>>>   		prot |= KVM_PGTABLE_PROT_X;
>>>>   		invalidate_icache_guest_page(pfn, vma_pagesize);
>>> It seems that the I-side CMO now happens *before* the D-side, which
>>> seems odd. What prevents the CPU from speculatively fetching
>>> instructions in the interval? I would also feel much more confident if
>>> the two were kept close together.
>> I noticed yet another thing which I don't understand. When the CPU
>> has the ARM64_HAS_CACHE_DIC feature (CTR_EL0.DIC = 1), which means
>> instruction invalidation is not required for data to instruction
>> coherence, we still do the icache invalidation. I am wondering if
>> the invalidation is necessary in this case.
> It isn't, and DIC is already taken care of in the leaf functions (see
> __flush_icache_all() and invalidate_icache_range()).
Then it would be simpler to also move the icache invalidation to both the
map path and the permission path. We can check whether the new PTE is going
to add the executable permission to the old mapping, and perform the icache
CMO if it is. A diff like the one below may work, what do you think?

---

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 4ec9879e82ed..534d42da2065 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -204,21 +204,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
 	return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
 }
 
-static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
-						  unsigned long size)
-{
-	if (icache_is_aliasing()) {
-		/* any kind of VIPT cache */
-		__flush_icache_all();
-	} else if (is_kernel_in_hyp_mode() || !icache_is_vpipt()) {
-		/* PIPT or VPIPT at EL2 (see comment in __kvm_tlb_flush_vmid_ipa) */
-		void *va = page_address(pfn_to_page(pfn));
-
-		invalidate_icache_range((unsigned long)va,
-					(unsigned long)va + size);
-	}
-}
-
 void kvm_set_way_flush(struct kvm_vcpu *vcpu);
 void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled);

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 308c36b9cd21..950102077676 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -120,7 +120,6 @@ static bool kvm_pte_valid(kvm_pte_t pte)
 {
 	return pte & KVM_PTE_VALID;
 }
-
 static bool kvm_pte_table(kvm_pte_t pte, u32 level)
 {
 	if (level == KVM_PGTABLE_MAX_LEVELS - 1)
@@ -485,6 +484,18 @@ static void stage2_flush_dcache(void *addr, u64 size)
 	__flush_dcache_area(addr, size);
 }
 
+static void stage2_invalidate_icache(void *addr, u64 size)
+{
+	if (icache_is_aliasing()) {
+		/* Flush any kind of VIPT icache */
+		__flush_icache_all();
+	} else if (is_kernel_in_hyp_mode() || !icache_is_vpipt()) {
+		/* Flush PIPT or VPIPT icache at EL2 */
+		invalidate_icache_range((unsigned long)addr,
+					(unsigned long)addr + size);
+	}
+}
+
 static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 				      kvm_pte_t *ptep,
 				      struct stage2_map_data *data)
@@ -516,7 +527,10 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 		put_page(page);
 	}
 
-	/* Flush data cache before installation of the new PTE */
+	/* Perform CMOs before installation of the new PTE */
+	if (!(new & KVM_PTE_LEAF_ATTR_HI_S2_XN))
+		stage2_invalidate_icache(__va(phys), granule);
+
 	if (stage2_pte_cacheable(new))
 		stage2_flush_dcache(__va(phys), granule);
 
@@ -769,8 +783,16 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	 * but worst-case the access flag update gets lost and will be
 	 * set on the next access instead.
 	 */
-	if (data->pte != pte)
+	if (data->pte != pte) {
+		/*
+		 * Invalidate the instruction cache before updating
+		 * if we are going to add the executable permission.
+		 */
+		if (!(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN))
+			stage2_invalidate_icache(kvm_pte_follow(pte),
+						 kvm_granule_size(level));
 		WRITE_ONCE(*ptep, pte);
+	}
 
 	return 0;
 }

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d151927a7d62..1eec9f63bc6f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -609,11 +609,6 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
-static void invalidate_icache_guest_page(kvm_pfn_t pfn, unsigned long size)
-{
-	__invalidate_icache_guest_page(pfn, size);
-}
-
 static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
 {
 	send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
@@ -877,10 +872,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (writable)
 		prot |= KVM_PGTABLE_PROT_W;
 
-	if (exec_fault) {
+	if (exec_fault)
 		prot |= KVM_PGTABLE_PROT_X;
-		invalidate_icache_guest_page(pfn, vma_pagesize);
-	}
 
 	if (device)
 		prot |= KVM_PGTABLE_PROT_DEVICE;

---

Thanks,

Yanan
>> If it's not, then I think it's correct (and straightforward) to move
>> the icache invalidation to stage2_map_walker_try_leaf() after the
>> dcache clean+inval and make it depend on the new mapping being
>> executable *and* !cpus_have_const_cap(ARM64_HAS_CACHE_DIC).
> It would also need to be duplicated on the permission fault path.
>
> Thanks,
>
> 	M.
>
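
A minimal sketch of the ordering suggested in the quoted reply, reusing the
helpers from the diff above (the explicit ARM64_HAS_CACHE_DIC check is shown
only for illustration, since __flush_icache_all() and invalidate_icache_range()
already honour DIC; this is a sketch, not part of the posted diff):

        /* D-side maintenance first, then I-side, right before installing the PTE */
        if (stage2_pte_cacheable(new))
                stage2_flush_dcache(__va(phys), granule);

        if (!(new & KVM_PTE_LEAF_ATTR_HI_S2_XN) &&
            !cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
                stage2_invalidate_icache(__va(phys), granule);

        smp_store_release(ptep, new);

As noted in the quote, the same pair would also have to be duplicated on the
permission fault path (the stage2_attr_walker() hunk).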

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 1/4] KVM: arm64: Move the clean of dcache to the map handler
  2021-02-25  9:55     ` Marc Zyngier
  (?)
@ 2021-02-26 15:58       ` wangyanan (Y)
  -1 siblings, 0 replies; 80+ messages in thread
From: wangyanan (Y) @ 2021-02-26 15:58 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Will Deacon, Catalin Marinas, kvmarm, linux-arm-kernel, kvm,
	linux-kernel, wanghaibin.wang, zhukeqian1, yuzenghui


On 2021/2/25 17:55, Marc Zyngier wrote:
> Hi Yanan,
>
> On Mon, 08 Feb 2021 11:22:47 +0000,
> Yanan Wang <wangyanan55@huawei.com> wrote:
>> We currently uniformly clean dcache in user_mem_abort() before calling the
>> fault handlers, if we take a translation fault and the pfn is cacheable.
>> But if there are concurrent translation faults on the same page or block,
>> clean of dcache for the first time is necessary while the others are not.
>>
>> By moving clean of dcache to the map handler, we can easily identify the
>> conditions where CMOs are really needed and avoid the unnecessary ones.
>> As it's a time consuming process to perform CMOs especially when flushing
>> a block range, so this solution reduces much load of kvm and improve the
>> efficiency of creating mappings.
> That's an interesting approach. However, wouldn't it be better to
> identify early that there is already something mapped, and return to
> the guest ASAP?
>
> Can you quantify the benefit of this patch alone?
>
>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>> ---
>>   arch/arm64/include/asm/kvm_mmu.h | 16 --------------
>>   arch/arm64/kvm/hyp/pgtable.c     | 38 ++++++++++++++++++++------------
>>   arch/arm64/kvm/mmu.c             | 14 +++---------
>>   3 files changed, 27 insertions(+), 41 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>> index e52d82aeadca..4ec9879e82ed 100644
>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> @@ -204,22 +204,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
>>   	return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
>>   }
>>   
>> -static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
>> -{
>> -	void *va = page_address(pfn_to_page(pfn));
>> -
>> -	/*
>> -	 * With FWB, we ensure that the guest always accesses memory using
>> -	 * cacheable attributes, and we don't have to clean to PoC when
>> -	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
>> -	 * PoU is not required either in this case.
>> -	 */
>> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>> -		return;
>> -
>> -	kvm_flush_dcache_to_poc(va, size);
>> -}
>> -
>>   static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
>>   						  unsigned long size)
>>   {
>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> index 4d177ce1d536..2f4f87021980 100644
>> --- a/arch/arm64/kvm/hyp/pgtable.c
>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>> @@ -464,6 +464,26 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
>>   	return 0;
>>   }
>>   
>> +static bool stage2_pte_cacheable(kvm_pte_t pte)
>> +{
>> +	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
>> +	return memattr == PAGE_S2_MEMATTR(NORMAL);
>> +}
>> +
>> +static void stage2_flush_dcache(void *addr, u64 size)
>> +{
>> +	/*
>> +	 * With FWB, we ensure that the guest always accesses memory using
>> +	 * cacheable attributes, and we don't have to clean to PoC when
>> +	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
>> +	 * PoU is not required either in this case.
>> +	 */
>> +	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>> +		return;
>> +
>> +	__flush_dcache_area(addr, size);
>> +}
>> +
>>   static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>   				      kvm_pte_t *ptep,
>>   				      struct stage2_map_data *data)
>> @@ -495,6 +515,10 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>   		put_page(page);
>>   	}
>>   
>> +	/* Flush data cache before installation of the new PTE */
>> +	if (stage2_pte_cacheable(new))
>> +		stage2_flush_dcache(__va(phys), granule);
>> +
>>   	smp_store_release(ptep, new);
>>   	get_page(page);
>>   	data->phys += granule;
>> @@ -651,20 +675,6 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
>>   	return ret;
>>   }
>>   
>> -static void stage2_flush_dcache(void *addr, u64 size)
>> -{
>> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>> -		return;
>> -
>> -	__flush_dcache_area(addr, size);
>> -}
>> -
>> -static bool stage2_pte_cacheable(kvm_pte_t pte)
>> -{
>> -	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
>> -	return memattr == PAGE_S2_MEMATTR(NORMAL);
>> -}
>> -
>>   static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>   			       enum kvm_pgtable_walk_flags flag,
>>   			       void * const arg)
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 77cb2d28f2a4..d151927a7d62 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -609,11 +609,6 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>>   	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
>>   }
>>   
>> -static void clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
>> -{
>> -	__clean_dcache_guest_page(pfn, size);
>> -}
>> -
>>   static void invalidate_icache_guest_page(kvm_pfn_t pfn, unsigned long size)
>>   {
>>   	__invalidate_icache_guest_page(pfn, size);
>> @@ -882,9 +877,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>   	if (writable)
>>   		prot |= KVM_PGTABLE_PROT_W;
>>   
>> -	if (fault_status != FSC_PERM && !device)
>> -		clean_dcache_guest_page(pfn, vma_pagesize);
>> -
>>   	if (exec_fault) {
>>   		prot |= KVM_PGTABLE_PROT_X;
>>   		invalidate_icache_guest_page(pfn, vma_pagesize);
> It seems that the I-side CMO now happens *before* the D-side, which
> seems odd.

Yes, indeed. In principle it is not right to invalidate the icache before cleaning the dcache.

Thanks,

Yanan

> What prevents the CPU from speculatively fetching
> instructions in the interval? I would also feel much more confident if
> the two were kept close together.
>
> Thanks,
>
> 	M.
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 1/4] KVM: arm64: Move the clean of dcache to the map handler
@ 2021-02-26 15:58       ` wangyanan (Y)
  0 siblings, 0 replies; 80+ messages in thread
From: wangyanan (Y) @ 2021-02-26 15:58 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, Catalin Marinas, linux-kernel, Will Deacon, kvmarm,
	linux-arm-kernel


On 2021/2/25 17:55, Marc Zyngier wrote:
> Hi Yanan,
>
> On Mon, 08 Feb 2021 11:22:47 +0000,
> Yanan Wang <wangyanan55@huawei.com> wrote:
>> We currently uniformly clean dcache in user_mem_abort() before calling the
>> fault handlers, if we take a translation fault and the pfn is cacheable.
>> But if there are concurrent translation faults on the same page or block,
>> clean of dcache for the first time is necessary while the others are not.
>>
>> By moving clean of dcache to the map handler, we can easily identify the
>> conditions where CMOs are really needed and avoid the unnecessary ones.
>> As it's a time consuming process to perform CMOs especially when flushing
>> a block range, so this solution reduces much load of kvm and improve the
>> efficiency of creating mappings.
> That's an interesting approach. However, wouldn't it be better to
> identify early that there is already something mapped, and return to
> the guest ASAP?
>
> Can you quantify the benefit of this patch alone?
>
>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>> ---
>>   arch/arm64/include/asm/kvm_mmu.h | 16 --------------
>>   arch/arm64/kvm/hyp/pgtable.c     | 38 ++++++++++++++++++++------------
>>   arch/arm64/kvm/mmu.c             | 14 +++---------
>>   3 files changed, 27 insertions(+), 41 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>> index e52d82aeadca..4ec9879e82ed 100644
>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> @@ -204,22 +204,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
>>   	return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
>>   }
>>   
>> -static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
>> -{
>> -	void *va = page_address(pfn_to_page(pfn));
>> -
>> -	/*
>> -	 * With FWB, we ensure that the guest always accesses memory using
>> -	 * cacheable attributes, and we don't have to clean to PoC when
>> -	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
>> -	 * PoU is not required either in this case.
>> -	 */
>> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>> -		return;
>> -
>> -	kvm_flush_dcache_to_poc(va, size);
>> -}
>> -
>>   static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
>>   						  unsigned long size)
>>   {
>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> index 4d177ce1d536..2f4f87021980 100644
>> --- a/arch/arm64/kvm/hyp/pgtable.c
>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>> @@ -464,6 +464,26 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
>>   	return 0;
>>   }
>>   
>> +static bool stage2_pte_cacheable(kvm_pte_t pte)
>> +{
>> +	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
>> +	return memattr == PAGE_S2_MEMATTR(NORMAL);
>> +}
>> +
>> +static void stage2_flush_dcache(void *addr, u64 size)
>> +{
>> +	/*
>> +	 * With FWB, we ensure that the guest always accesses memory using
>> +	 * cacheable attributes, and we don't have to clean to PoC when
>> +	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
>> +	 * PoU is not required either in this case.
>> +	 */
>> +	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>> +		return;
>> +
>> +	__flush_dcache_area(addr, size);
>> +}
>> +
>>   static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>   				      kvm_pte_t *ptep,
>>   				      struct stage2_map_data *data)
>> @@ -495,6 +515,10 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>   		put_page(page);
>>   	}
>>   
>> +	/* Flush data cache before installation of the new PTE */
>> +	if (stage2_pte_cacheable(new))
>> +		stage2_flush_dcache(__va(phys), granule);
>> +
>>   	smp_store_release(ptep, new);
>>   	get_page(page);
>>   	data->phys += granule;
>> @@ -651,20 +675,6 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
>>   	return ret;
>>   }
>>   
>> -static void stage2_flush_dcache(void *addr, u64 size)
>> -{
>> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>> -		return;
>> -
>> -	__flush_dcache_area(addr, size);
>> -}
>> -
>> -static bool stage2_pte_cacheable(kvm_pte_t pte)
>> -{
>> -	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
>> -	return memattr == PAGE_S2_MEMATTR(NORMAL);
>> -}
>> -
>>   static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>   			       enum kvm_pgtable_walk_flags flag,
>>   			       void * const arg)
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 77cb2d28f2a4..d151927a7d62 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -609,11 +609,6 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>>   	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
>>   }
>>   
>> -static void clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
>> -{
>> -	__clean_dcache_guest_page(pfn, size);
>> -}
>> -
>>   static void invalidate_icache_guest_page(kvm_pfn_t pfn, unsigned long size)
>>   {
>>   	__invalidate_icache_guest_page(pfn, size);
>> @@ -882,9 +877,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>   	if (writable)
>>   		prot |= KVM_PGTABLE_PROT_W;
>>   
>> -	if (fault_status != FSC_PERM && !device)
>> -		clean_dcache_guest_page(pfn, vma_pagesize);
>> -
>>   	if (exec_fault) {
>>   		prot |= KVM_PGTABLE_PROT_X;
>>   		invalidate_icache_guest_page(pfn, vma_pagesize);
> It seems that the I-side CMO now happens *before* the D-side, which
> seems odd.

Yes, indeed. It is not so right in principle to put invalidation of 
icache before flush of dcache.

Thanks,

Yanan

> What prevents the CPU from speculatively fetching
> instructions in the interval? I would also feel much more confident if
> the two were kept close together.
>
> Thanks,
>
> 	M.
>
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 1/4] KVM: arm64: Move the clean of dcache to the map handler
@ 2021-02-26 15:58       ` wangyanan (Y)
  0 siblings, 0 replies; 80+ messages in thread
From: wangyanan (Y) @ 2021-02-26 15:58 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, Catalin Marinas, zhukeqian1, linux-kernel, yuzenghui,
	wanghaibin.wang, Will Deacon, kvmarm, linux-arm-kernel


On 2021/2/25 17:55, Marc Zyngier wrote:
> Hi Yanan,
>
> On Mon, 08 Feb 2021 11:22:47 +0000,
> Yanan Wang <wangyanan55@huawei.com> wrote:
>> We currently uniformly clean dcache in user_mem_abort() before calling the
>> fault handlers, if we take a translation fault and the pfn is cacheable.
>> But if there are concurrent translation faults on the same page or block,
>> clean of dcache for the first time is necessary while the others are not.
>>
>> By moving clean of dcache to the map handler, we can easily identify the
>> conditions where CMOs are really needed and avoid the unnecessary ones.
>> As it's a time consuming process to perform CMOs especially when flushing
>> a block range, so this solution reduces much load of kvm and improve the
>> efficiency of creating mappings.
> That's an interesting approach. However, wouldn't it be better to
> identify early that there is already something mapped, and return to
> the guest ASAP?
>
> Can you quantify the benefit of this patch alone?
>
>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>> ---
>>   arch/arm64/include/asm/kvm_mmu.h | 16 --------------
>>   arch/arm64/kvm/hyp/pgtable.c     | 38 ++++++++++++++++++++------------
>>   arch/arm64/kvm/mmu.c             | 14 +++---------
>>   3 files changed, 27 insertions(+), 41 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>> index e52d82aeadca..4ec9879e82ed 100644
>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> @@ -204,22 +204,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
>>   	return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
>>   }
>>   
>> -static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
>> -{
>> -	void *va = page_address(pfn_to_page(pfn));
>> -
>> -	/*
>> -	 * With FWB, we ensure that the guest always accesses memory using
>> -	 * cacheable attributes, and we don't have to clean to PoC when
>> -	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
>> -	 * PoU is not required either in this case.
>> -	 */
>> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>> -		return;
>> -
>> -	kvm_flush_dcache_to_poc(va, size);
>> -}
>> -
>>   static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
>>   						  unsigned long size)
>>   {
>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> index 4d177ce1d536..2f4f87021980 100644
>> --- a/arch/arm64/kvm/hyp/pgtable.c
>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>> @@ -464,6 +464,26 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
>>   	return 0;
>>   }
>>   
>> +static bool stage2_pte_cacheable(kvm_pte_t pte)
>> +{
>> +	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
>> +	return memattr == PAGE_S2_MEMATTR(NORMAL);
>> +}
>> +
>> +static void stage2_flush_dcache(void *addr, u64 size)
>> +{
>> +	/*
>> +	 * With FWB, we ensure that the guest always accesses memory using
>> +	 * cacheable attributes, and we don't have to clean to PoC when
>> +	 * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
>> +	 * PoU is not required either in this case.
>> +	 */
>> +	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>> +		return;
>> +
>> +	__flush_dcache_area(addr, size);
>> +}
>> +
>>   static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>   				      kvm_pte_t *ptep,
>>   				      struct stage2_map_data *data)
>> @@ -495,6 +515,10 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>   		put_page(page);
>>   	}
>>   
>> +	/* Flush data cache before installation of the new PTE */
>> +	if (stage2_pte_cacheable(new))
>> +		stage2_flush_dcache(__va(phys), granule);
>> +
>>   	smp_store_release(ptep, new);
>>   	get_page(page);
>>   	data->phys += granule;
>> @@ -651,20 +675,6 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
>>   	return ret;
>>   }
>>   
>> -static void stage2_flush_dcache(void *addr, u64 size)
>> -{
>> -	if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>> -		return;
>> -
>> -	__flush_dcache_area(addr, size);
>> -}
>> -
>> -static bool stage2_pte_cacheable(kvm_pte_t pte)
>> -{
>> -	u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
>> -	return memattr == PAGE_S2_MEMATTR(NORMAL);
>> -}
>> -
>>   static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>   			       enum kvm_pgtable_walk_flags flag,
>>   			       void * const arg)
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 77cb2d28f2a4..d151927a7d62 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -609,11 +609,6 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>>   	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
>>   }
>>   
>> -static void clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
>> -{
>> -	__clean_dcache_guest_page(pfn, size);
>> -}
>> -
>>   static void invalidate_icache_guest_page(kvm_pfn_t pfn, unsigned long size)
>>   {
>>   	__invalidate_icache_guest_page(pfn, size);
>> @@ -882,9 +877,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>   	if (writable)
>>   		prot |= KVM_PGTABLE_PROT_W;
>>   
>> -	if (fault_status != FSC_PERM && !device)
>> -		clean_dcache_guest_page(pfn, vma_pagesize);
>> -
>>   	if (exec_fault) {
>>   		prot |= KVM_PGTABLE_PROT_X;
>>   		invalidate_icache_guest_page(pfn, vma_pagesize);
> It seems that the I-side CMO now happens *before* the D-side, which
> seems odd.

Yes, indeed. In principle it is not right to invalidate the icache before
the dcache has been cleaned.
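
Just to illustrate the direction (this is only a sketch, not part of this
series, and it assumes a hypothetical stage2_pte_executable() helper), the
I-side CMO could also move into the map handler so that both CMOs sit right
before the installation of the new PTE:

	/* Sketch: both CMOs together in stage2_map_walker_try_leaf() */
	if (stage2_pte_cacheable(new))
		stage2_flush_dcache(__va(phys), granule);
	if (stage2_pte_executable(new))		/* hypothetical helper */
		__invalidate_icache_guest_page(__phys_to_pfn(phys), granule);

	smp_store_release(ptep, new);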

Thanks,

Yanan

> What prevents the CPU from speculatively fetching
> instructions in the interval? I would also feel much more confident if
> the two were kept close together.
>
> Thanks,
>
> 	M.
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings
  2021-02-08 11:22   ` Yanan Wang
  (?)
@ 2021-02-28 11:11     ` wangyanan (Y)
  -1 siblings, 0 replies; 80+ messages in thread
From: wangyanan (Y) @ 2021-02-28 11:11 UTC (permalink / raw)
  To: kvmarm, linux-arm-kernel, kvm, linux-kernel
  Cc: Marc Zyngier, Will Deacon, Alexandru Elisei, Catalin Marinas,
	wanghaibin.wang, yuzenghui


On 2021/2/8 19:22, Yanan Wang wrote:
> When KVM needs to coalesce the normal page mappings into a block mapping,
> we currently invalidate the old table entry first followed by invalidation
> of TLB, then unmap the page mappings, and install the block entry at last.
>
> It will cost a long time to unmap the numerous page mappings, which means
> there will be a long period when the table entry can be found invalid.
> If other vCPUs access any guest page within the block range and find the
> table entry invalid, they will all exit from guest with a translation fault
> which is not necessary. And KVM will make efforts to handle these faults,
> especially when performing CMOs by block range.
>
> So let's quickly install the block entry at first to ensure uninterrupted
> memory access of the other vCPUs, and then unmap the page mappings after
> installation. This will reduce most of the time when the table entry is
> invalid, and avoid most of the unnecessary translation faults.
BTW: here are the benefits of this patch alone, for reference (tested on top
of patch 1). This patch aims to speed up the reconstruction of block mappings
(especially 1G blocks) after they have been split, and the following test
results show the significant change.
Selftest: 
https://lore.kernel.org/lkml/20210208090841.333724-1-wangyanan55@huawei.com/ 


---

hardware platform: HiSilicon Kunpeng920 Server (FWB not supported)
host kernel: Linux mainline v5.11-rc6 (with the series at
https://lore.kernel.org/r/20210114121350.123684-4-wangyanan55@huawei.com
applied)

Multiple vCPUs concurrently access 20G of memory; the numbers below are the
execution time of KVM reconstituting the block mappings after dirty logging.

cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
            (20 vcpus, 20G memory, block mappings (HUGETLB 1G))
Before patch: KVM_ADJUST_MAPPINGS: 2.881s 2.883s 2.885s 2.879s 2.882s
After  patch: KVM_ADJUST_MAPPINGS: 0.310s 0.301s 0.312s 0.299s 0.306s  
*average 89% improvement*

cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
            (40 vcpus, 20G memory, block mappings (HUGETLB 1G))
Before patch: KVM_ADJUST_MAPPINGS: 2.954s 2.955s 2.949s 2.951s 2.953s
After  patch: KVM_ADJUST_MAPPINGS: 0.381s 0.366s 0.381s 0.380s 0.378s  
*average 87% improvement*

cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 60
            (60 vcpus, 20G memory, block mappings (HUGETLB 1G))
Before patch: KVM_ADJUST_MAPPINGS: 3.118s 3.112s 3.130s 3.128s 3.119s
After  patch: KVM_ADJUST_MAPPINGS: 0.524s 0.534s 0.536s 0.525s 0.539s  
*average 83% improvement*

---

Thanks,

Yanan
>
> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
> ---
>   arch/arm64/kvm/hyp/pgtable.c | 26 ++++++++++++--------------
>   1 file changed, 12 insertions(+), 14 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 78a560446f80..308c36b9cd21 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -434,6 +434,7 @@ struct stage2_map_data {
>   	kvm_pte_t			attr;
>   
>   	kvm_pte_t			*anchor;
> +	kvm_pte_t			*follow;
>   
>   	struct kvm_s2_mmu		*mmu;
>   	struct kvm_mmu_memory_cache	*memcache;
> @@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>   	if (!kvm_block_mapping_supported(addr, end, data->phys, level))
>   		return 0;
>   
> -	kvm_set_invalid_pte(ptep);
> -
>   	/*
> -	 * Invalidate the whole stage-2, as we may have numerous leaf
> -	 * entries below us which would otherwise need invalidating
> -	 * individually.
> +	 * If we need to coalesce existing table entries into a block here,
> +	 * then install the block entry first and the sub-level page mappings
> +	 * will be unmapped later.
>   	 */
> -	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>   	data->anchor = ptep;
> +	data->follow = kvm_pte_follow(*ptep);
> +	stage2_coalesce_tables_into_block(addr, level, ptep, data);
>   	return 0;
>   }
>   
> @@ -614,20 +614,18 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
>   				      kvm_pte_t *ptep,
>   				      struct stage2_map_data *data)
>   {
> -	int ret = 0;
> -
>   	if (!data->anchor)
>   		return 0;
>   
> -	free_page((unsigned long)kvm_pte_follow(*ptep));
> -	put_page(virt_to_page(ptep));
> -
> -	if (data->anchor == ptep) {
> +	if (data->anchor != ptep) {
> +		free_page((unsigned long)kvm_pte_follow(*ptep));
> +		put_page(virt_to_page(ptep));
> +	} else {
> +		free_page((unsigned long)data->follow);
>   		data->anchor = NULL;
> -		ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
>   	}
>   
> -	return ret;
> +	return 0;
>   }
>   
>   /*

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings
  2021-02-08 11:22   ` Yanan Wang
  (?)
@ 2021-03-02 17:13     ` Alexandru Elisei
  -1 siblings, 0 replies; 80+ messages in thread
From: Alexandru Elisei @ 2021-03-02 17:13 UTC (permalink / raw)
  To: Yanan Wang, Marc Zyngier, Will Deacon, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, kvmarm, linux-arm-kernel, kvm, linux-kernel

Hello,

On 2/8/21 11:22 AM, Yanan Wang wrote:
> When KVM needs to coalesce the normal page mappings into a block mapping,
> we currently invalidate the old table entry first followed by invalidation
> of TLB, then unmap the page mappings, and install the block entry at last.
>
> It will cost a long time to unmap the numerous page mappings, which means
> there will be a long period when the table entry can be found invalid.
> If other vCPUs access any guest page within the block range and find the
> table entry invalid, they will all exit from guest with a translation fault
> which is not necessary. And KVM will make efforts to handle these faults,
> especially when performing CMOs by block range.
>
> So let's quickly install the block entry at first to ensure uninterrupted
> memory access of the other vCPUs, and then unmap the page mappings after
> installation. This will reduce most of the time when the table entry is
> invalid, and avoid most of the unnecessary translation faults.

I'm not convinced I've fully understood what is going on yet, but it seems to me
that the idea is sound. Some questions and comments below.

>
> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
> ---
>  arch/arm64/kvm/hyp/pgtable.c | 26 ++++++++++++--------------
>  1 file changed, 12 insertions(+), 14 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 78a560446f80..308c36b9cd21 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -434,6 +434,7 @@ struct stage2_map_data {
>  	kvm_pte_t			attr;
>  
>  	kvm_pte_t			*anchor;
> +	kvm_pte_t			*follow;
>  
>  	struct kvm_s2_mmu		*mmu;
>  	struct kvm_mmu_memory_cache	*memcache;
> @@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>  	if (!kvm_block_mapping_supported(addr, end, data->phys, level))
>  		return 0;
>  
> -	kvm_set_invalid_pte(ptep);
> -
>  	/*
> -	 * Invalidate the whole stage-2, as we may have numerous leaf
> -	 * entries below us which would otherwise need invalidating
> -	 * individually.
> +	 * If we need to coalesce existing table entries into a block here,
> +	 * then install the block entry first and the sub-level page mappings
> +	 * will be unmapped later.
>  	 */
> -	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>  	data->anchor = ptep;
> +	data->follow = kvm_pte_follow(*ptep);
> +	stage2_coalesce_tables_into_block(addr, level, ptep, data);

Here's how stage2_coalesce_tables_into_block() is implemented in the previous
patch (it might be worth merging it with this patch; I found it impossible to
judge whether the function is correct without seeing how it is used and what
it is replacing):

static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
                          kvm_pte_t *ptep,
                          struct stage2_map_data *data)
{
    u64 granule = kvm_granule_size(level), phys = data->phys;
    kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);

    kvm_set_invalid_pte(ptep);

    /*
     * Invalidate the whole stage-2, as we may have numerous leaf entries
     * below us which would otherwise need invalidating individually.
     */
    kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
    smp_store_release(ptep, new);
    data->phys += granule;
}

This works because __kvm_pgtable_visit() saves the *ptep value before calling the
pre callback, and it visits the next level table based on the initial pte value,
not the new value written by stage2_coalesce_tables_into_block().
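
(For reference, here is a simplified sketch of the relevant part of
__kvm_pgtable_visit(); the argument lists are abbreviated and error handling
is omitted:)

	kvm_pte_t pte = *ptep;		/* snapshot of the old table entry */

	visitor(..., ptep, KVM_PGTABLE_WALK_TABLE_PRE);	/* may rewrite *ptep */

	if (kvm_pte_table(pte, level))	/* decision and descent use the snapshot */
		__kvm_pgtable_walk(data, kvm_pte_follow(pte), level + 1);

	visitor(..., ptep, KVM_PGTABLE_WALK_TABLE_POST);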

Assuming the first patch in the series is merged ("KVM: arm64: Move the clean of
dcache to the map handler"), this function is missing the CMOs from
stage2_map_walker_try_leaf(). I can think of the following situation where they
are needed:

1. The 2nd level (PMD) table that will be turned into a block is mapped at stage 2
because one of the pages in the 3rd level (PTE) table it points to is accessed by
the guest.

2. The kernel decides to turn the userspace mapping into a transparent huge page
and calls the mmu notifier to remove the mapping from stage 2. The 2nd level table
is still valid.

3. Guest accesses a page which is not the page it accessed at step 1, which causes
a translation fault. KVM decides we can use a PMD block mapping to map the address
and we end up in stage2_coalesce_tables_into_block(). We need CMOs in this case
because the guest accesses memory it didn't access before.

What do you think, is that a valid situation?

>  	return 0;
>  }
>  
> @@ -614,20 +614,18 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
>  				      kvm_pte_t *ptep,
>  				      struct stage2_map_data *data)
>  {
> -	int ret = 0;
> -
>  	if (!data->anchor)
>  		return 0;
>  
> -	free_page((unsigned long)kvm_pte_follow(*ptep));
> -	put_page(virt_to_page(ptep));
> -
> -	if (data->anchor == ptep) {
> +	if (data->anchor != ptep) {
> +		free_page((unsigned long)kvm_pte_follow(*ptep));
> +		put_page(virt_to_page(ptep));
> +	} else {
> +		free_page((unsigned long)data->follow);
>  		data->anchor = NULL;
> -		ret = stage2_map_walk_leaf(addr, end, level, ptep, data);

stage2_map_walk_leaf() -> stage2_map_walker_try_leaf() calls put_page() and
get_page() once in our case (valid old mapping). It looks to me like we're missing
a put_page() call when the function is called for the anchor. Have you found the
call to be unnecessary?

>  	}
>  
> -	return ret;
> +	return 0;

I think it's correct for this function to succeed unconditionally. The error was
coming from stage2_map_walk_leaf() -> stage2_map_walker_try_leaf(). That function
can return an error code if the block mapping is not supported (which we know it
is, because we have an anchor), or if only the permissions differ between the old
and the new entry; but in our case we've changed both the valid and type bits.

Thanks,

Alex

>  }
>  
>  /*

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings
  2021-03-02 17:13     ` Alexandru Elisei
  (?)
@ 2021-03-03 11:04       ` wangyanan (Y)
  -1 siblings, 0 replies; 80+ messages in thread
From: wangyanan (Y) @ 2021-03-03 11:04 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: Marc Zyngier, Will Deacon, Catalin Marinas, Julien Thierry,
	James Morse, Suzuki K Poulose, Quentin Perret, Gavin Shan,
	kvmarm, linux-arm-kernel, kvm, linux-kernel

Hi Alex,

On 2021/3/3 1:13, Alexandru Elisei wrote:
> Hello,
>
> On 2/8/21 11:22 AM, Yanan Wang wrote:
>> When KVM needs to coalesce the normal page mappings into a block mapping,
>> we currently invalidate the old table entry first followed by invalidation
>> of TLB, then unmap the page mappings, and install the block entry at last.
>>
>> It will cost a long time to unmap the numerous page mappings, which means
>> there will be a long period when the table entry can be found invalid.
>> If other vCPUs access any guest page within the block range and find the
>> table entry invalid, they will all exit from guest with a translation fault
>> which is not necessary. And KVM will make efforts to handle these faults,
>> especially when performing CMOs by block range.
>>
>> So let's quickly install the block entry at first to ensure uninterrupted
>> memory access of the other vCPUs, and then unmap the page mappings after
>> installation. This will reduce most of the time when the table entry is
>> invalid, and avoid most of the unnecessary translation faults.
> I'm not convinced I've fully understood what is going on yet, but it seems to me
> that the idea is sound. Some questions and comments below.
What I am trying to do in this patch is to adjust the order in which block
mappings are rebuilt from page mappings.
Take the rebuilding of a 1G block mapping as an example.
Before this patch, the order is:
1) invalidate the table entry at the 1st level (PUD)
2) flush the TLB by VMID
3) unmap the old PMD/PTE tables
4) install the new block entry at the 1st level (PUD)

So the 1st-level entry can be found invalid by other vCPUs during 1), 2) and
3), and 3) takes a long time to unmap the numerous old PMD/PTE tables, which
means the total time the entry is invalid is long enough to affect performance.

After this patch, the order is:
1) invalidate the table entry at the 1st level (PUD)
2) flush the TLB by VMID
3) install the new block entry at the 1st level (PUD)
4) unmap the old PMD/PTE tables

The change ensures that the 1st-level (PUD) entry is only invalid during 1)
and 2), so if other vCPUs access memory within the 1G range, there is much
less chance of finding the entry invalid and, as a result, triggering an
unnecessary translation fault.
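
Putting the hunks of patch 2 and this patch together, the new flow for the
anchor level is roughly the following (error handling and the non-anchor
entries omitted):

	stage2_map_walk_table_pre():
		data->anchor = ptep;
		data->follow = kvm_pte_follow(*ptep);	/* remember the old tables */
		stage2_coalesce_tables_into_block():
			kvm_set_invalid_pte(ptep);			/* 1) invalidate */
			kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);	/* 2) TLBI by VMID */
			smp_store_release(ptep, new);			/* 3) install block */

	/* the walker then visits the old sub-level tables via the saved pte */

	stage2_map_walk_table_post():
		if (data->anchor == ptep)
			free_page((unsigned long)data->follow);		/* 4) free old tables */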
>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>> ---
>>   arch/arm64/kvm/hyp/pgtable.c | 26 ++++++++++++--------------
>>   1 file changed, 12 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> index 78a560446f80..308c36b9cd21 100644
>> --- a/arch/arm64/kvm/hyp/pgtable.c
>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>> @@ -434,6 +434,7 @@ struct stage2_map_data {
>>   	kvm_pte_t			attr;
>>   
>>   	kvm_pte_t			*anchor;
>> +	kvm_pte_t			*follow;
>>   
>>   	struct kvm_s2_mmu		*mmu;
>>   	struct kvm_mmu_memory_cache	*memcache;
>> @@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>>   	if (!kvm_block_mapping_supported(addr, end, data->phys, level))
>>   		return 0;
>>   
>> -	kvm_set_invalid_pte(ptep);
>> -
>>   	/*
>> -	 * Invalidate the whole stage-2, as we may have numerous leaf
>> -	 * entries below us which would otherwise need invalidating
>> -	 * individually.
>> +	 * If we need to coalesce existing table entries into a block here,
>> +	 * then install the block entry first and the sub-level page mappings
>> +	 * will be unmapped later.
>>   	 */
>> -	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>   	data->anchor = ptep;
>> +	data->follow = kvm_pte_follow(*ptep);
>> +	stage2_coalesce_tables_into_block(addr, level, ptep, data);
> Here's how stage2_coalesce_tables_into_block() is implemented from the previous
> patch (it might be worth merging it with this patch, I found it impossible to
> judge if the function is correct without seeing how it is used and what is replacing):
OK, will do this if a v2 is posted.
> static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
>                            kvm_pte_t *ptep,
>                            struct stage2_map_data *data)
> {
>      u64 granule = kvm_granule_size(level), phys = data->phys;
>      kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);
>
>      kvm_set_invalid_pte(ptep);
>
>      /*
>       * Invalidate the whole stage-2, as we may have numerous leaf entries
>       * below us which would otherwise need invalidating individually.
>       */
>      kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>      smp_store_release(ptep, new);
>      data->phys += granule;
> }
>
> This works because __kvm_pgtable_visit() saves the *ptep value before calling the
> pre callback, and it visits the next level table based on the initial pte value,
> not the new value written by stage2_coalesce_tables_into_block().
Right. So before the initial pte value is replaced with the new one, we have
to record the old sub-level table it points to, using
*data->follow = kvm_pte_follow(*ptep)* in stage2_map_walk_table_pre().
data->follow will then be used when we unmap the old sub-level tables later.
>
> Assuming the first patch in the series is merged ("KVM: arm64: Move the clean of
> dcache to the map handler"), this function is missing the CMOs from
> stage2_map_walker_try_leaf().
Yes, the CMOs are not currently performed in stage2_coalesce_tables_into_block(),
because I thought they were not needed when we rebuild block mappings from
normal page mappings.

At least, they are not needed when we rebuild block mappings backed by
hugetlbfs pages, because we must already have built those block mappings once
before, and are now only rebuilding them after they were split for dirty
logging. Can we agree on this? Then let's look at the following situation.
> I can think of the following situation where they
> are needed:
>
> 1. The 2nd level (PMD) table that will be turned into a block is mapped at stage 2
> because one of the pages in the 3rd level (PTE) table it points to is accessed by
> the guest.
>
> 2. The kernel decides to turn the userspace mapping into a transparent huge page
> and calls the mmu notifier to remove the mapping from stage 2. The 2nd level table
> is still valid.
I have a question here. Won't the PMD entry have been invalidated too in this
case? If the removal of the stage-2 mapping by the mmu notifier is an unmap
operation on a range, then it's correct and reasonable to both invalidate the
PMD entry and free the PTE table. As far as I know, kvm_pgtable_stage2_unmap()
does so when unmapping a range.

And if I'm right about this, we will not end up in
stage2_coalesce_tables_into_block() as step 3 describes, but in
stage2_map_walker_try_leaf() instead, because the PMD entry is invalid, so
KVM will create the new 2M block mapping.

If I'm wrong about this, then I think this is a valid situation.
> 3. Guest accesses a page which is not the page it accessed at step 1, which causes
> a translation fault. KVM decides we can use a PMD block mapping to map the address
> and we end up in stage2_coalesce_tables_into_block(). We need CMOs in this case
> because the guest accesses memory it didn't access before.
>
> What do you think, is that a valid situation?
>>   	return 0;
>>   }
>>   
>> @@ -614,20 +614,18 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
>>   				      kvm_pte_t *ptep,
>>   				      struct stage2_map_data *data)
>>   {
>> -	int ret = 0;
>> -
>>   	if (!data->anchor)
>>   		return 0;
>>   
>> -	free_page((unsigned long)kvm_pte_follow(*ptep));
>> -	put_page(virt_to_page(ptep));
>> -
>> -	if (data->anchor == ptep) {
>> +	if (data->anchor != ptep) {
>> +		free_page((unsigned long)kvm_pte_follow(*ptep));
>> +		put_page(virt_to_page(ptep));
>> +	} else {
>> +		free_page((unsigned long)data->follow);
>>   		data->anchor = NULL;
>> -		ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
> stage2_map_walk_leaf() -> stage2_map_walk_table_post calls put_page() and
> get_page() once in our case (valid old mapping). It looks to me like we're missing
> a put_page() call when the function is called for the anchor. Have you found the
> call to be unnecessary?
Before this patch:
When we find data->anchor == ptep, put_page() has already been called once for
the anchor in stage2_map_walk_table_post(). Then we call stage2_map_walk_leaf()
-> stage2_map_walker_try_leaf() to install the block entry, and only get_page()
is called once in stage2_map_walker_try_leaf(). So there is a put_page()
followed by a get_page() for the anchor, and the page counts stay balanced.

After this patch:
There is no put_page() call for the anchor, either before or after we find
data->anchor == ptep. That is because we don't call get_page() in
stage2_coalesce_tables_into_block() either when installing the block entry.
So I think there is no problem here either.
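
To spell out the refcount balance for the anchor's page-table page (a rough
summary only, not code from the series):

	/*
	 * Old flow, anchor entry:
	 *   stage2_map_walk_table_post():  put_page(virt_to_page(ptep));
	 *   stage2_map_walker_try_leaf():  get_page(virt_to_page(ptep));
	 *   -> net change: 0
	 *
	 * New flow, anchor entry:
	 *   stage2_coalesce_tables_into_block():  no get_page()
	 *   stage2_map_walk_table_post():         no put_page() for the anchor
	 *   -> net change: 0
	 */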

Does the above answer your point?
>>   	}
>>   
>> -	return ret;
>> +	return 0;
> I think it's correct for this function to succeed unconditionally. The error was
> coming from stage2_map_walk_leaf() -> stage2_map_walker_try_leaf(). The function
> can return an error code if block mapping is not supported, which we know is
> supported because we have an anchor, and if only the permissions are different
> between the old and the new entry, but in our case we've changed both the valid
> and type bits.
Agreed. Besides, we will definitely not end up updating an old valid entry for
the anchor in stage2_map_walker_try_leaf(), because *anchor has already been
invalidated in stage2_map_walk_table_pre() before the anchor is set, so it
will look like the creation of a new mapping.

Thanks,

Yanan
> Thanks,
>
> Alex
>
>>   }
>>   
>>   /*
> .

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings
@ 2021-03-03 11:04       ` wangyanan (Y)
  0 siblings, 0 replies; 80+ messages in thread
From: wangyanan (Y) @ 2021-03-03 11:04 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: Marc Zyngier, Will Deacon, Catalin Marinas, Julien Thierry,
	James Morse, Suzuki K Poulose, Quentin Perret, Gavin Shan,
	kvmarm, linux-arm-kernel, kvm, linux-kernel

Hi Alex,

On 2021/3/3 1:13, Alexandru Elisei wrote:
> Hello,
>
> On 2/8/21 11:22 AM, Yanan Wang wrote:
>> When KVM needs to coalesce the normal page mappings into a block mapping,
>> we currently invalidate the old table entry first followed by invalidation
>> of TLB, then unmap the page mappings, and install the block entry at last.
>>
>> It will cost a long time to unmap the numerous page mappings, which means
>> there will be a long period when the table entry can be found invalid.
>> If other vCPUs access any guest page within the block range and find the
>> table entry invalid, they will all exit from guest with a translation fault
>> which is not necessary. And KVM will make efforts to handle these faults,
>> especially when performing CMOs by block range.
>>
>> So let's quickly install the block entry at first to ensure uninterrupted
>> memory access of the other vCPUs, and then unmap the page mappings after
>> installation. This will reduce most of the time when the table entry is
>> invalid, and avoid most of the unnecessary translation faults.
> I'm not convinced I've fully understood what is going on yet, but it seems to me
> that the idea is sound. Some questions and comments below.
What I am trying to do in this patch is to adjust the order of 
rebuilding block mappings from page mappings.
Take the rebuilding of 1G block mappings as an example.
Before this patch, the order is like:
1) invalidate the table entry of the 1st level(PUD)
2) flush TLB by VMID
3) unmap the old PMD/PTE tables
4) install the new block entry to the 1st level(PUD)

So the entry at the 1st level can be found invalid by other vCPUs during 1),
2) and 3), and step 3) takes a long time to unmap the numerous old PMD/PTE
tables, which means the total time the entry stays invalid is long enough to
affect performance.

After this patch, the order is like:
1) invalidate the table entry of the 1st level(PUD)
2) flush TLB by VMID
3) install the new block entry to the 1st level(PUD)
4) unmap the old PMD/PTE tables

The change ensures that the period when the entry at the 1st level(PUD) is
invalid covers only 1) and 2), so if other vCPUs access memory within the 1G
range, there is much less chance of finding the entry invalid and, as a
result, triggering an unnecessary translation fault.
>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>> ---
>>   arch/arm64/kvm/hyp/pgtable.c | 26 ++++++++++++--------------
>>   1 file changed, 12 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> index 78a560446f80..308c36b9cd21 100644
>> --- a/arch/arm64/kvm/hyp/pgtable.c
>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>> @@ -434,6 +434,7 @@ struct stage2_map_data {
>>   	kvm_pte_t			attr;
>>   
>>   	kvm_pte_t			*anchor;
>> +	kvm_pte_t			*follow;
>>   
>>   	struct kvm_s2_mmu		*mmu;
>>   	struct kvm_mmu_memory_cache	*memcache;
>> @@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>>   	if (!kvm_block_mapping_supported(addr, end, data->phys, level))
>>   		return 0;
>>   
>> -	kvm_set_invalid_pte(ptep);
>> -
>>   	/*
>> -	 * Invalidate the whole stage-2, as we may have numerous leaf
>> -	 * entries below us which would otherwise need invalidating
>> -	 * individually.
>> +	 * If we need to coalesce existing table entries into a block here,
>> +	 * then install the block entry first and the sub-level page mappings
>> +	 * will be unmapped later.
>>   	 */
>> -	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>   	data->anchor = ptep;
>> +	data->follow = kvm_pte_follow(*ptep);
>> +	stage2_coalesce_tables_into_block(addr, level, ptep, data);
> Here's how stage2_coalesce_tables_into_block() is implemented from the previous
> patch (it might be worth merging it with this patch, I found it impossible to
> judge if the function is correct without seeing how it is used and what is replacing):
Ok, will do this if a v2 is going to be posted.
> static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
>                            kvm_pte_t *ptep,
>                            struct stage2_map_data *data)
> {
>      u64 granule = kvm_granule_size(level), phys = data->phys;
>      kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);
>
>      kvm_set_invalid_pte(ptep);
>
>      /*
>       * Invalidate the whole stage-2, as we may have numerous leaf entries
>       * below us which would otherwise need invalidating individually.
>       */
>      kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>      smp_store_release(ptep, new);
>      data->phys += granule;
> }
>
> This works because __kvm_pgtable_visit() saves the *ptep value before calling the
> pre callback, and it visits the next level table based on the initial pte value,
> not the new value written by stage2_coalesce_tables_into_block().
Right. So before replacing the initial pte value with the new value, we have
to use *data->follow = kvm_pte_follow(*ptep)* in stage2_map_walk_table_pre()
to save the initial pte value in advance. And data->follow will be used later
when we start to unmap the old sub-level tables.
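
For reference, here is roughly how the two walker callbacks look with this
patch applied, pieced together from the diff quoted in this thread (a sketch
for illustration, not a verbatim copy of pgtable.c):

static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
                                     kvm_pte_t *ptep,
                                     struct stage2_map_data *data)
{
        if (!kvm_block_mapping_supported(addr, end, data->phys, level))
                return 0;

        /*
         * Remember the anchor and the old table it points to, then install
         * the block entry right away; the old sub-level page mappings will
         * be unmapped later in the post callback.
         */
        data->anchor = ptep;
        data->follow = kvm_pte_follow(*ptep);
        stage2_coalesce_tables_into_block(addr, level, ptep, data);
        return 0;
}

static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
                                      kvm_pte_t *ptep,
                                      struct stage2_map_data *data)
{
        if (!data->anchor)
                return 0;

        if (data->anchor != ptep) {
                /* A sub-level table below the anchor: free it and drop
                 * the reference that was taken when it was installed. */
                free_page((unsigned long)kvm_pte_follow(*ptep));
                put_page(virt_to_page(ptep));
        } else {
                /* The anchor itself: the block entry is already in place,
                 * so just free the old table saved in data->follow. */
                free_page((unsigned long)data->follow);
                data->anchor = NULL;
        }

        return 0;
}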
>
> Assuming the first patch in the series is merged ("KVM: arm64: Move the clean of
> dcache to the map handler"), this function is missing the CMOs from
> stage2_map_walker_try_leaf().
Yes, the CMOs are not performed in stage2_coalesce_tables_into_block()
currently, because I thought they were not needed when we rebuild the block
mappings from normal page mappings.

At least, they are not needed if we rebuild block mappings backed by hugetlbfs
pages, because we must have built the block mappings for the first time before,
and now only need to rebuild them after they were split for dirty logging. Can
we agree on this?
Then let's see the following situation.
> I can think of the following situation where they
> are needed:
>
> 1. The 2nd level (PMD) table that will be turned into a block is mapped at stage 2
> because one of the pages in the 3rd level (PTE) table it points to is accessed by
> the guest.
>
> 2. The kernel decides to turn the userspace mapping into a transparent huge page
> and calls the mmu notifier to remove the mapping from stage 2. The 2nd level table
> is still valid.
I have a question here. Won't the PMD entry have been invalidated too in this
case? If removal of the stage 2 mapping by the mmu notifier is an unmap
operation on a range, then it's correct and reasonable to both invalidate the
PMD entry and free the PTE table. As far as I know, kvm_pgtable_stage2_unmap()
does so when unmapping a range.

And if I'm right about this, we will not end up in
stage2_coalesce_tables_into_block() like step 3 describes, but in
stage2_map_walker_try_leaf() instead, because the PMD entry is invalid, so KVM
will create the new 2M block mapping.

If I'm wrong about this, then I think this is a valid situation.
> 3. Guest accesses a page which is not the page it accessed at step 1, which causes
> a translation fault. KVM decides we can use a PMD block mapping to map the address
> and we end up in stage2_coalesce_tables_into_block(). We need CMOs in this case
> because the guest accesses memory it didn't access before.
>
> What do you think, is that a valid situation?
>>   	return 0;
>>   }
>>   
>> @@ -614,20 +614,18 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
>>   				      kvm_pte_t *ptep,
>>   				      struct stage2_map_data *data)
>>   {
>> -	int ret = 0;
>> -
>>   	if (!data->anchor)
>>   		return 0;
>>   
>> -	free_page((unsigned long)kvm_pte_follow(*ptep));
>> -	put_page(virt_to_page(ptep));
>> -
>> -	if (data->anchor == ptep) {
>> +	if (data->anchor != ptep) {
>> +		free_page((unsigned long)kvm_pte_follow(*ptep));
>> +		put_page(virt_to_page(ptep));
>> +	} else {
>> +		free_page((unsigned long)data->follow);
>>   		data->anchor = NULL;
>> -		ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
> stage2_map_walk_leaf() -> stage2_map_walk_table_post calls put_page() and
> get_page() once in our case (valid old mapping). It looks to me like we're missing
> a put_page() call when the function is called for the anchor. Have you found the
> call to be unnecessary?
Before this patch:
When we find data->anchor == ptep, put_page() has already been called once
for the anchor in stage2_map_walk_table_post(). Then we call
stage2_map_walk_leaf() -> stage2_map_walker_try_leaf() to install the block
entry, and only get_page() will be called once in stage2_map_walker_try_leaf().
So there is a put_page() followed by a get_page() for the anchor, and there
will not be a problem with the page counts.

After this patch:
Neither before nor after we find data->anchor == ptep is there a put_page()
call for the anchor. This is because we didn't call get_page() either in
stage2_coalesce_tables_into_block() when installing the block entry. So I
think there will not be a problem here either.
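
To spell out the accounting for the anchor entry in both cases (just my
reading of the code, as a worked trace for illustration):

/*
 * Before this patch, for the anchor entry:
 *   stage2_map_walk_table_pre():  kvm_set_invalid_pte(anchor)
 *   stage2_map_walk_table_post(): put_page(virt_to_page(anchor))
 *     -> stage2_map_walk_leaf()
 *        -> stage2_map_walker_try_leaf(): get_page(virt_to_page(anchor))
 *   Net change to the refcount of the anchor's page: 0
 *
 * After this patch, for the anchor entry:
 *   stage2_map_walk_table_pre():
 *     -> stage2_coalesce_tables_into_block(): no get_page()
 *   stage2_map_walk_table_post(): no put_page() for the anchor
 *   Net change to the refcount of the anchor's page: 0
 */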

Is the above the right answer to your point?
>>   	}
>>   
>> -	return ret;
>> +	return 0;
> I think it's correct for this function to succeed unconditionally. The error was
> coming from stage2_map_walk_leaf() -> stage2_map_walker_try_leaf(). The function
> can return an error code if block mapping is not supported, which we know is
> supported because we have an anchor, and if only the permissions are different
> between the old and the new entry, but in our case we've changed both the valid
> and type bits.
Agreed. Besides, we will definitely not end up updating an old valid entry
for the anchor in stage2_map_walker_try_leaf(), because *anchor has already
been invalidated in stage2_map_walk_table_pre() before setting the anchor,
so it will look like building a new mapping.

Thanks,

Yanan
> Thanks,
>
> Alex
>
>>   }
>>   
>>   /*
> .

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings
  2021-03-03 11:04       ` wangyanan (Y)
  (?)
@ 2021-03-03 17:27         ` Alexandru Elisei
  -1 siblings, 0 replies; 80+ messages in thread
From: Alexandru Elisei @ 2021-03-03 17:27 UTC (permalink / raw)
  To: wangyanan (Y)
  Cc: Marc Zyngier, Will Deacon, Catalin Marinas, Julien Thierry,
	James Morse, Suzuki K Poulose, Quentin Perret, Gavin Shan,
	kvmarm, linux-arm-kernel, kvm, linux-kernel

Hi Yanan,

On 3/3/21 11:04 AM, wangyanan (Y) wrote:
> Hi Alex,
>
> On 2021/3/3 1:13, Alexandru Elisei wrote:
>> Hello,
>>
>> On 2/8/21 11:22 AM, Yanan Wang wrote:
>>> When KVM needs to coalesce the normal page mappings into a block mapping,
>>> we currently invalidate the old table entry first followed by invalidation
>>> of TLB, then unmap the page mappings, and install the block entry at last.
>>>
>>> It will cost a long time to unmap the numerous page mappings, which means
>>> there will be a long period when the table entry can be found invalid.
>>> If other vCPUs access any guest page within the block range and find the
>>> table entry invalid, they will all exit from guest with a translation fault
>>> which is not necessary. And KVM will make efforts to handle these faults,
>>> especially when performing CMOs by block range.
>>>
>>> So let's quickly install the block entry at first to ensure uninterrupted
>>> memory access of the other vCPUs, and then unmap the page mappings after
>>> installation. This will reduce most of the time when the table entry is
>>> invalid, and avoid most of the unnecessary translation faults.
>> I'm not convinced I've fully understood what is going on yet, but it seems to me
>> that the idea is sound. Some questions and comments below.
> What I am trying to do in this patch is to adjust the order of rebuilding block
> mappings from page mappings.
> Take the rebuilding of 1G block mappings as an example.
> Before this patch, the order is like:
> 1) invalidate the table entry of the 1st level(PUD)
> 2) flush TLB by VMID
> 3) unmap the old PMD/PTE tables
> 4) install the new block entry to the 1st level(PUD)
>
> So entry in the 1st level can be found invalid by other vcpus in 1), 2), and 3),
> and it's a long time in 3) to unmap
> the numerous old PMD/PTE tables, which means the total time of the entry being
> invalid is long enough to
> affect the performance.
>
> After this patch, the order is like:
> 1) invalidate the table entry of the 1st level(PUD)
> 2) flush TLB by VMID
> 3) install the new block entry to the 1st level(PUD)
> 4) unmap the old PMD/PTE tables
>
> The change ensures that period of entry in the 1st level(PUD) being invalid is
> only in 1) and 2),
> so if other vcpus access memory within 1G, there will be less chance to find the
> entry invalid
> and as a result trigger an unnecessary translation fault.

Thank you for the explanation, that was my understanding of it also, and I
believe your idea is correct. I was more concerned that I had got some of the
details wrong, and you have kindly corrected me below.

>>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>>> ---
>>>   arch/arm64/kvm/hyp/pgtable.c | 26 ++++++++++++--------------
>>>   1 file changed, 12 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>> index 78a560446f80..308c36b9cd21 100644
>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>> @@ -434,6 +434,7 @@ struct stage2_map_data {
>>>       kvm_pte_t            attr;
>>>         kvm_pte_t            *anchor;
>>> +    kvm_pte_t            *follow;
>>>         struct kvm_s2_mmu        *mmu;
>>>       struct kvm_mmu_memory_cache    *memcache;
>>> @@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end,
>>> u32 level,
>>>       if (!kvm_block_mapping_supported(addr, end, data->phys, level))
>>>           return 0;
>>>   -    kvm_set_invalid_pte(ptep);
>>> -
>>>       /*
>>> -     * Invalidate the whole stage-2, as we may have numerous leaf
>>> -     * entries below us which would otherwise need invalidating
>>> -     * individually.
>>> +     * If we need to coalesce existing table entries into a block here,
>>> +     * then install the block entry first and the sub-level page mappings
>>> +     * will be unmapped later.
>>>        */
>>> -    kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>       data->anchor = ptep;
>>> +    data->follow = kvm_pte_follow(*ptep);
>>> +    stage2_coalesce_tables_into_block(addr, level, ptep, data);
>> Here's how stage2_coalesce_tables_into_block() is implemented from the previous
>> patch (it might be worth merging it with this patch, I found it impossible to
>> judge if the function is correct without seeing how it is used and what is
>> replacing):
> Ok, will do this if v2 is going to be post.
>> static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
>>                            kvm_pte_t *ptep,
>>                            struct stage2_map_data *data)
>> {
>>      u64 granule = kvm_granule_size(level), phys = data->phys;
>>      kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);
>>
>>      kvm_set_invalid_pte(ptep);
>>
>>      /*
>>       * Invalidate the whole stage-2, as we may have numerous leaf entries
>>       * below us which would otherwise need invalidating individually.
>>       */
>>      kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>      smp_store_release(ptep, new);
>>      data->phys += granule;
>> }
>>
>> This works because __kvm_pgtable_visit() saves the *ptep value before calling the
>> pre callback, and it visits the next level table based on the initial pte value,
>> not the new value written by stage2_coalesce_tables_into_block().
> Right. So before replacing the initial pte value with the new value, we have to use
> *data->follow = kvm_pte_follow(*ptep)* in stage2_map_walk_table_pre() to save
> the initial pte value in advance. And data->follow will be used when  we start to
> unmap the old sub-level tables later.

Right, stage2_map_walk_table_post() will use data->follow to free the table page
which is no longer needed because we've replaced the entire next level table with
a block mapping.

>>
>> Assuming the first patch in the series is merged ("KVM: arm64: Move the clean of
>> dcache to the map handler"), this function is missing the CMOs from
>> stage2_map_walker_try_leaf().
> Yes, the CMOs are not performed in stage2_coalesce_tables_into_block() currently,
> because I thought they were not needed when we rebuild the block mappings from
> normal page mappings.

This assumes that the *only* situation when we replace a table entry with a block
mapping is when the next level table (or tables) is *fully* populated. Is there a
way to prove that this is true? I think it's important to prove it unequivocally,
because if there's a corner case where this doesn't happen and we remove the
dcache maintenance, we can end up with hard to reproduce and hard to diagnose
errors in a guest.

>
> At least, they are not needed if we rebuild the block mappings backed by hugetlbfs
> pages, because we must have built the new block mappings for the first time before
> and now need to rebuild them after they were split in dirty logging. Can we
> agree on this?
> Then let's see the following situation.
>> I can think of the following situation where they
>> are needed:
>>
>> 1. The 2nd level (PMD) table that will be turned into a block is mapped at stage 2
>> because one of the pages in the 3rd level (PTE) table it points to is accessed by
>> the guest.
>>
>> 2. The kernel decides to turn the userspace mapping into a transparent huge page
>> and calls the mmu notifier to remove the mapping from stage 2. The 2nd level table
>> is still valid.
> I have a question here. Won't the PMD entry have been invalidated too in this case?
> If remove of the stage2 mapping by mmu notifier is an unmap operation of a range,
> then it's correct and reasonable to both invalidate the PMD entry and free the
> PTE table.
> As I know, kvm_pgtable_stage2_unmap() does so when unmapping a range.
>
> And if I was right about this, we will not end up in
> stage2_coalesce_tables_into_block()
> like step 3 describes, but in stage2_map_walker_try_leaf() instead. Because the
> PMD entry
> is invalid, so KVM will create the new 2M block mapping.

Looking at the code for stage2_unmap_walker(), I believe you are correct. After
the entire PTE table has been unmapped, the function will mark the PMD entry as
invalid. In the situation I described, at step 3 we would end up in the leaf
mapper function because the PMD entry is invalid. My example was wrong.

>
> If I'm wrong about this, then I think this is a valid situation.
>> 3. Guest accesses a page which is not the page it accessed at step 1, which causes
>> a translation fault. KVM decides we can use a PMD block mapping to map the address
>> and we end up in stage2_coalesce_tables_into_block(). We need CMOs in this case
>> because the guest accesses memory it didn't access before.
>>
>> What do you think, is that a valid situation?
>>>       return 0;
>>>   }
>>>   @@ -614,20 +614,18 @@ static int stage2_map_walk_table_post(u64 addr, u64
>>> end, u32 level,
>>>                         kvm_pte_t *ptep,
>>>                         struct stage2_map_data *data)
>>>   {
>>> -    int ret = 0;
>>> -
>>>       if (!data->anchor)
>>>           return 0;
>>>   -    free_page((unsigned long)kvm_pte_follow(*ptep));
>>> -    put_page(virt_to_page(ptep));
>>> -
>>> -    if (data->anchor == ptep) {
>>> +    if (data->anchor != ptep) {
>>> +        free_page((unsigned long)kvm_pte_follow(*ptep));
>>> +        put_page(virt_to_page(ptep));
>>> +    } else {
>>> +        free_page((unsigned long)data->follow);
>>>           data->anchor = NULL;
>>> -        ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
>> stage2_map_walk_leaf() -> stage2_map_walk_table_post calls put_page() and
>> get_page() once in our case (valid old mapping). It looks to me like we're missing
>> a put_page() call when the function is called for the anchor. Have you found the
>> call to be unnecessary?
> Before this patch:
> When we find data->anchor == ptep, put_page() has been called once in advance
> for the anchor
> in stage2_map_walk_table_post(). Then we call stage2_map_walk_leaf() ->
> stage2_map_walker_try_leaf()
> to install the block entry, and only get_page() will be called once in
> stage2_map_walker_try_leaf().
> There is a put_page() followed by a get_page() for the anchor, and there will
> not be a problem about
> page_counts.

This is how I'm reading the code before your patch:

- stage2_map_walk_table_post() returns early if there is no anchor.

- stage2_map_walk_table_pre() sets the anchor and marks the entry as invalid. The
entry was a table so the leaf visitor is not called in __kvm_pgtable_visit().

- __kvm_pgtable_visit() visits the next level table.

- stage2_map_walk_table_post() calls put_page(), then calls stage2_map_walk_leaf() ->
stage2_map_walker_try_leaf(). The old entry was invalidated by the pre visitor, so
it only calls get_page() (and not put_page() + get_page()).

I agree with your conclusion; I hadn't realized that, because the pre visitor
marks the entry as invalid, stage2_map_walker_try_leaf() will not call
put_page().
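
For reference, this is roughly the pre-patch stage2_map_walk_table_post(),
reconstructed from the removed lines of the diff (a sketch, not the exact
file contents), with the refcount behaviour at the anchor annotated:

static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
                                      kvm_pte_t *ptep,
                                      struct stage2_map_data *data)
{
        int ret = 0;

        if (!data->anchor)
                return 0;

        /* Runs for every table entry at or below the anchor. */
        free_page((unsigned long)kvm_pte_follow(*ptep));
        put_page(virt_to_page(ptep));

        if (data->anchor == ptep) {
                data->anchor = NULL;
                /*
                 * The anchor was invalidated in the pre visitor, so
                 * stage2_map_walker_try_leaf() only does a get_page(),
                 * which balances the put_page() above.
                 */
                ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
        }

        return ret;
}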

>
> After this patch:
> Before we find data->anchor == ptep and after it, there is not a put_page() call
> for the anchor.
> This is because that we didn't call get_page() either in
> stage2_coalesce_tables_into_block() when
> install the block entry. So I think there will not be a problem too.

I agree, the refcount will be identical.

>
> Is above the right answer for your point?

Yes, thank you for clearing that up for me.

Thanks,

Alex

>>>       }
>>>   -    return ret;
>>> +    return 0;
>> I think it's correct for this function to succeed unconditionally. The error was
>> coming from stage2_map_walk_leaf() -> stage2_map_walker_try_leaf(). The function
>> can return an error code if block mapping is not supported, which we know is
>> supported because we have an anchor, and if only the permissions are different
>> between the old and the new entry, but in our case we've changed both the valid
>> and type bits.
> Agreed. Besides, we will definitely not end up updating an old valid entry for
> the anchor
> in stage2_map_walker_try_leaf(), because *anchor has already been invalidated in
> stage2_map_walk_table_pre() before set the anchor, so it will look like a build
> of new mapping.
>
> Thanks,
>
> Yanan
>> Thanks,
>>
>> Alex
>>
>>>   }
>>>     /*
>> .

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings
  2021-03-03 17:27         ` Alexandru Elisei
  (?)
@ 2021-03-04  7:07           ` wangyanan (Y)
  -1 siblings, 0 replies; 80+ messages in thread
From: wangyanan (Y) @ 2021-03-04  7:07 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: Marc Zyngier, Will Deacon, Catalin Marinas, Julien Thierry,
	James Morse, Suzuki K Poulose, Quentin Perret, Gavin Shan,
	kvmarm, linux-arm-kernel, kvm, linux-kernel

Hi Alex,

On 2021/3/4 1:27, Alexandru Elisei wrote:
> Hi Yanan,
>
> On 3/3/21 11:04 AM, wangyanan (Y) wrote:
>> Hi Alex,
>>
>> On 2021/3/3 1:13, Alexandru Elisei wrote:
>>> Hello,
>>>
>>> On 2/8/21 11:22 AM, Yanan Wang wrote:
>>>> When KVM needs to coalesce the normal page mappings into a block mapping,
>>>> we currently invalidate the old table entry first followed by invalidation
>>>> of TLB, then unmap the page mappings, and install the block entry at last.
>>>>
>>>> It will cost a long time to unmap the numerous page mappings, which means
>>>> there will be a long period when the table entry can be found invalid.
>>>> If other vCPUs access any guest page within the block range and find the
>>>> table entry invalid, they will all exit from guest with a translation fault
>>>> which is not necessary. And KVM will make efforts to handle these faults,
>>>> especially when performing CMOs by block range.
>>>>
>>>> So let's quickly install the block entry at first to ensure uninterrupted
>>>> memory access of the other vCPUs, and then unmap the page mappings after
>>>> installation. This will reduce most of the time when the table entry is
>>>> invalid, and avoid most of the unnecessary translation faults.
>>> I'm not convinced I've fully understood what is going on yet, but it seems to me
>>> that the idea is sound. Some questions and comments below.
>> What I am trying to do in this patch is to adjust the order of rebuilding block
>> mappings from page mappings.
>> Take the rebuilding of 1G block mappings as an example.
>> Before this patch, the order is like:
>> 1) invalidate the table entry of the 1st level(PUD)
>> 2) flush TLB by VMID
>> 3) unmap the old PMD/PTE tables
>> 4) install the new block entry to the 1st level(PUD)
>>
>> So entry in the 1st level can be found invalid by other vcpus in 1), 2), and 3),
>> and it's a long time in 3) to unmap
>> the numerous old PMD/PTE tables, which means the total time of the entry being
>> invalid is long enough to
>> affect the performance.
>>
>> After this patch, the order is like:
>> 1) invalidate the table entry of the 1st level(PUD)
>> 2) flush TLB by VMID
>> 3) install the new block entry to the 1st level(PUD)
>> 4) unmap the old PMD/PTE tables
>>
>> The change ensures that period of entry in the 1st level(PUD) being invalid is
>> only in 1) and 2),
>> so if other vcpus access memory within 1G, there will be less chance to find the
>> entry invalid
>> and as a result trigger an unnecessary translation fault.
> Thank you for the explanation, that was my understanding of it also, and I believe
> your idea is correct. I was more concerned that I got some of the details wrong,
> and you have kindly corrected me below.
>
>>>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>>>> ---
>>>>    arch/arm64/kvm/hyp/pgtable.c | 26 ++++++++++++--------------
>>>>    1 file changed, 12 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>>> index 78a560446f80..308c36b9cd21 100644
>>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>>> @@ -434,6 +434,7 @@ struct stage2_map_data {
>>>>        kvm_pte_t            attr;
>>>>          kvm_pte_t            *anchor;
>>>> +    kvm_pte_t            *follow;
>>>>          struct kvm_s2_mmu        *mmu;
>>>>        struct kvm_mmu_memory_cache    *memcache;
>>>> @@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end,
>>>> u32 level,
>>>>        if (!kvm_block_mapping_supported(addr, end, data->phys, level))
>>>>            return 0;
>>>>    -    kvm_set_invalid_pte(ptep);
>>>> -
>>>>        /*
>>>> -     * Invalidate the whole stage-2, as we may have numerous leaf
>>>> -     * entries below us which would otherwise need invalidating
>>>> -     * individually.
>>>> +     * If we need to coalesce existing table entries into a block here,
>>>> +     * then install the block entry first and the sub-level page mappings
>>>> +     * will be unmapped later.
>>>>         */
>>>> -    kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>>        data->anchor = ptep;
>>>> +    data->follow = kvm_pte_follow(*ptep);
>>>> +    stage2_coalesce_tables_into_block(addr, level, ptep, data);
>>> Here's how stage2_coalesce_tables_into_block() is implemented from the previous
>>> patch (it might be worth merging it with this patch, I found it impossible to
>>> judge if the function is correct without seeing how it is used and what is
>>> replacing):
>> Ok, will do this if v2 is going to be post.
>>> static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
>>>                             kvm_pte_t *ptep,
>>>                             struct stage2_map_data *data)
>>> {
>>>       u64 granule = kvm_granule_size(level), phys = data->phys;
>>>       kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);
>>>
>>>       kvm_set_invalid_pte(ptep);
>>>
>>>       /*
>>>        * Invalidate the whole stage-2, as we may have numerous leaf entries
>>>        * below us which would otherwise need invalidating individually.
>>>        */
>>>       kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>       smp_store_release(ptep, new);
>>>       data->phys += granule;
>>> }
>>>
>>> This works because __kvm_pgtable_visit() saves the *ptep value before calling the
>>> pre callback, and it visits the next level table based on the initial pte value,
>>> not the new value written by stage2_coalesce_tables_into_block().
>> Right. So before replacing the initial pte value with the new value, we have to use
>> *data->follow = kvm_pte_follow(*ptep)* in stage2_map_walk_table_pre() to save
>> the initial pte value in advance. And data->follow will be used when  we start to
>> unmap the old sub-level tables later.
> Right, stage2_map_walk_table_post() will use data->follow to free the table page
> which is no longer needed because we've replaced the entire next level table with
> a block mapping.
>
>>> Assuming the first patch in the series is merged ("KVM: arm64: Move the clean of
>>> dcache to the map handler"), this function is missing the CMOs from
>>> stage2_map_walker_try_leaf().
>> Yes, the CMOs are not performed in stage2_coalesce_tables_into_block() currently,
>> because I thought they were not needed when we rebuild the block mappings from
>> normal page mappings.
> This assumes that the *only* situation when we replace a table entry with a block
> mapping is when the next level table (or tables) is *fully* populated. Is there a
> way to prove that this is true? I think it's important to prove it unequivocally,
> because if there's a corner case where this doesn't happen and we remove the
> dcache maintenance, we can end up with hard to reproduce and hard to diagnose
> errors in a guest.
So there is still one thing left to determine about this patch, and that is
whether we can simply drop the CMOs in stage2_coalesce_tables_into_block() or
whether we should distinguish between different situations.

Now that we know the situation you described won't happen, I think we will
only end up in stage2_coalesce_tables_into_block() in the following scenario:
1) KVM creates a new block mapping in stage2_map_walker_try_leaf() for the
    first time, when the guest accesses memory backed by a THP/HUGETLB huge
    page. CMOs will be performed here.
2) KVM splits this block mapping during dirty logging, and builds only one
    new page mapping.
3) KVM lazily builds other new page mappings during dirty logging, as the
    guest accesses other pages within the block. *In this stage, the pages in
    the block may or may not be fully mapped.*
4) After dirty logging is disabled, KVM decides to rebuild the block mapping.

Do we still have to perform CMOs when rebuilding the block mapping in step 4
if the pages in the block were not fully mapped in step 3? I'm not completely
sure about this.

Thanks,

Yanan
>> At least, they are not needed if we rebuild the block mappings backed by hugetlbfs
>> pages, because we must have built the new block mappings for the first time before
>> and now need to rebuild them after they were split in dirty logging. Can we
>> agree on this?
>> Then let's see the following situation.
>>> I can think of the following situation where they
>>> are needed:
>>>
>>> 1. The 2nd level (PMD) table that will be turned into a block is mapped at stage 2
>>> because one of the pages in the 3rd level (PTE) table it points to is accessed by
>>> the guest.
>>>
>>> 2. The kernel decides to turn the userspace mapping into a transparent huge page
>>> and calls the mmu notifier to remove the mapping from stage 2. The 2nd level table
>>> is still valid.
>> I have a question here. Won't the PMD entry have been invalidated too in this case?
>> If remove of the stage2 mapping by mmu notifier is an unmap operation of a range,
>> then it's correct and reasonable to both invalidate the PMD entry and free the
>> PTE table.
>> As I know, kvm_pgtable_stage2_unmap() does so when unmapping a range.
>>
>> And if I was right about this, we will not end up in
>> stage2_coalesce_tables_into_block()
>> like step 3 describes, but in stage2_map_walker_try_leaf() instead. Because the
>> PMD entry
>> is invalid, so KVM will create the new 2M block mapping.
> Looking at the code for stage2_unmap_walker(), I believe you are correct. After
> the entire PTE table has been unmapped, the function will mark the PMD entry as
> invalid. In the situation I described, at step 3 we would end up in the leaf
> mapper function because the PMD entry is invalid. My example was wrong.
>
>> If I'm wrong about this, then I think this is a valid situation.
>>> 3. Guest accesses a page which is not the page it accessed at step 1, which causes
>>> a translation fault. KVM decides we can use a PMD block mapping to map the address
>>> and we end up in stage2_coalesce_tables_into_block(). We need CMOs in this case
>>> because the guest accesses memory it didn't access before.
>>>
>>> What do you think, is that a valid situation?
>>>>        return 0;
>>>>    }
>>>>    @@ -614,20 +614,18 @@ static int stage2_map_walk_table_post(u64 addr, u64
>>>> end, u32 level,
>>>>                          kvm_pte_t *ptep,
>>>>                          struct stage2_map_data *data)
>>>>    {
>>>> -    int ret = 0;
>>>> -
>>>>        if (!data->anchor)
>>>>            return 0;
>>>>    -    free_page((unsigned long)kvm_pte_follow(*ptep));
>>>> -    put_page(virt_to_page(ptep));
>>>> -
>>>> -    if (data->anchor == ptep) {
>>>> +    if (data->anchor != ptep) {
>>>> +        free_page((unsigned long)kvm_pte_follow(*ptep));
>>>> +        put_page(virt_to_page(ptep));
>>>> +    } else {
>>>> +        free_page((unsigned long)data->follow);
>>>>            data->anchor = NULL;
>>>> -        ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
>>> stage2_map_walk_leaf() -> stage2_map_walk_table_post calls put_page() and
>>> get_page() once in our case (valid old mapping). It looks to me like we're missing
>>> a put_page() call when the function is called for the anchor. Have you found the
>>> call to be unnecessary?
>> Before this patch:
>> When we find data->anchor == ptep, put_page() has been called once in advance
>> for the anchor
>> in stage2_map_walk_table_post(). Then we call stage2_map_walk_leaf() ->
>> stage2_map_walker_try_leaf()
>> to install the block entry, and only get_page() will be called once in
>> stage2_map_walker_try_leaf().
>> There is a put_page() followed by a get_page() for the anchor, so there
>> will not be a problem with the page counts.
> This is how I'm reading the code before your patch:
>
> - stage2_map_walk_table_post() returns early if there is no anchor.
>
> - stage2_map_walk_table_pre() sets the anchor and marks the entry as invalid. The
> entry was a table so the leaf visitor is not called in __kvm_pgtable_visit().
>
> - __kvm_pgtable_visit() visits the next level table.
>
> - stage2_map_walk_table_post() calls put_page(), calls stage2_map_walk_leaf() ->
> stage2_map_walker_try_leaf(). The old entry was invalidated by the pre visitor, so
> it only calls get_page() (and not put_page() + get_page()).
>
> I agree with your conclusion, I didn't realize that because the pre visitor marks
> the entry as invalid, stage2_map_walker_try_leaf() will not call put_page().
>
>> After this patch:
>> Neither before nor after we find data->anchor == ptep is there a
>> put_page() call for the anchor.
>> This is because we didn't call get_page() either in
>> stage2_coalesce_tables_into_block() when installing the block entry.
>> So I think there will not be a problem either.
> I agree, the refcount will be identical.
>
>> Does the above answer your point?
> Yes, thank you for clearing that up for me.
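
To double-check the bookkeeping agreed on above, here is a toy model in
plain userspace C (not kernel code); the integer arithmetic only mimics
the get_page()/put_page() calls made for the anchor's entry:

#include <assert.h>
#include <stdio.h>

int main(void)
{
	/* refcount contribution of the page holding the anchor entry */
	int before = 1, after = 1;

	/*
	 * Before the patch: stage2_map_walk_table_post() does put_page()
	 * for the anchor, then stage2_map_walker_try_leaf() sees an
	 * already-invalidated entry and only does get_page() for the new
	 * block entry.
	 */
	before -= 1;	/* put_page() in the post visitor */
	before += 1;	/* get_page() in stage2_map_walker_try_leaf() */

	/*
	 * After the patch: stage2_coalesce_tables_into_block() installs
	 * the block entry without calling get_page(), and the post visitor
	 * skips put_page() for the anchor, so the count is untouched.
	 */

	assert(before == after);
	printf("anchor refcount: before patch %d, after patch %d\n",
	       before, after);
	return 0;
}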
>
> Thanks,
>
> Alex
>
>>>>        }
>>>>    -    return ret;
>>>> +    return 0;
>>> I think it's correct for this function to succeed unconditionally. The error was
>>> coming from stage2_map_walk_leaf() -> stage2_map_walker_try_leaf(). The function
>>> can return an error code if block mapping is not supported, which we know is
>>> supported because we have an anchor, and if only the permissions are different
>>> between the old and the new entry, but in our case we've changed both the valid
>>> and type bits.
>> Agreed. Besides, we will definitely not end up updating an old valid entry for
>> the anchor
>> in stage2_map_walker_try_leaf(), because *anchor has already been invalidated in
>> stage2_map_walk_table_pre() before setting the anchor, so it will look like
>> building a new mapping.
>>
>> Thanks,
>>
>> Yanan
>>> Thanks,
>>>
>>> Alex
>>>
>>>>    }
>>>>      /*
>>> .
> .

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings
@ 2021-03-04  7:07           ` wangyanan (Y)
  0 siblings, 0 replies; 80+ messages in thread
From: wangyanan (Y) @ 2021-03-04  7:07 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: kvm, Marc Zyngier, linux-kernel, linux-arm-kernel,
	Catalin Marinas, Will Deacon, kvmarm

Hi Alex,

On 2021/3/4 1:27, Alexandru Elisei wrote:
> Hi Yanan,
>
> On 3/3/21 11:04 AM, wangyanan (Y) wrote:
>> Hi Alex,
>>
>> On 2021/3/3 1:13, Alexandru Elisei wrote:
>>> Hello,
>>>
>>> On 2/8/21 11:22 AM, Yanan Wang wrote:
>>>> When KVM needs to coalesce the normal page mappings into a block mapping,
>>>> we currently invalidate the old table entry first followed by invalidation
>>>> of TLB, then unmap the page mappings, and install the block entry at last.
>>>>
>>>> It will cost a long time to unmap the numerous page mappings, which means
>>>> there will be a long period when the table entry can be found invalid.
>>>> If other vCPUs access any guest page within the block range and find the
>>>> table entry invalid, they will all exit from guest with a translation fault
>>>> which is not necessary. And KVM will make efforts to handle these faults,
>>>> especially when performing CMOs by block range.
>>>>
>>>> So let's quickly install the block entry at first to ensure uninterrupted
>>>> memory access of the other vCPUs, and then unmap the page mappings after
>>>> installation. This will reduce most of the time when the table entry is
>>>> invalid, and avoid most of the unnecessary translation faults.
>>> I'm not convinced I've fully understood what is going on yet, but it seems to me
>>> that the idea is sound. Some questions and comments below.
>> What I am trying to do in this patch is to adjust the order of rebuilding block
>> mappings from page mappings.
>> Take the rebuilding of 1G block mappings as an example.
>> Before this patch, the order is like:
>> 1) invalidate the table entry of the 1st level(PUD)
>> 2) flush TLB by VMID
>> 3) unmap the old PMD/PTE tables
>> 4) install the new block entry to the 1st level(PUD)
>>
>> So entry in the 1st level can be found invalid by other vcpus in 1), 2), and 3),
>> and it's a long time in 3) to unmap
>> the numerous old PMD/PTE tables, which means the total time of the entry being
>> invalid is long enough to
>> affect the performance.
>>
>> After this patch, the order is like:
>> 1) invalidate the table entry of the 1st level(PUD)
>> 2) flush TLB by VMID
>> 3) install the new block entry to the 1st level(PUD)
>> 4) unmap the old PMD/PTE tables
>>
>> The change ensures that period of entry in the 1st level(PUD) being invalid is
>> only in 1) and 2),
>> so if other vcpus access memory within 1G, there will be less chance to find the
>> entry invalid
>> and as a result trigger an unnecessary translation fault.
> Thank you for the explanation, that was my understanding of it also, and I believe
> your idea is correct. I was more concerned that I got some of the details wrong,
> and you have kindly corrected me below.
>
>>>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>>>> ---
>>>>    arch/arm64/kvm/hyp/pgtable.c | 26 ++++++++++++--------------
>>>>    1 file changed, 12 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>>> index 78a560446f80..308c36b9cd21 100644
>>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>>> @@ -434,6 +434,7 @@ struct stage2_map_data {
>>>>        kvm_pte_t            attr;
>>>>          kvm_pte_t            *anchor;
>>>> +    kvm_pte_t            *follow;
>>>>          struct kvm_s2_mmu        *mmu;
>>>>        struct kvm_mmu_memory_cache    *memcache;
>>>> @@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end,
>>>> u32 level,
>>>>        if (!kvm_block_mapping_supported(addr, end, data->phys, level))
>>>>            return 0;
>>>>    -    kvm_set_invalid_pte(ptep);
>>>> -
>>>>        /*
>>>> -     * Invalidate the whole stage-2, as we may have numerous leaf
>>>> -     * entries below us which would otherwise need invalidating
>>>> -     * individually.
>>>> +     * If we need to coalesce existing table entries into a block here,
>>>> +     * then install the block entry first and the sub-level page mappings
>>>> +     * will be unmapped later.
>>>>         */
>>>> -    kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>>        data->anchor = ptep;
>>>> +    data->follow = kvm_pte_follow(*ptep);
>>>> +    stage2_coalesce_tables_into_block(addr, level, ptep, data);
>>> Here's how stage2_coalesce_tables_into_block() is implemented from the previous
>>> patch (it might be worth merging it with this patch, I found it impossible to
>>> judge if the function is correct without seeing how it is used and what is
>>> replacing):
>> Ok, will do this if v2 is going to be post.
>>> static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
>>>                             kvm_pte_t *ptep,
>>>                             struct stage2_map_data *data)
>>> {
>>>       u64 granule = kvm_granule_size(level), phys = data->phys;
>>>       kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);
>>>
>>>       kvm_set_invalid_pte(ptep);
>>>
>>>       /*
>>>        * Invalidate the whole stage-2, as we may have numerous leaf entries
>>>        * below us which would otherwise need invalidating individually.
>>>        */
>>>       kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>       smp_store_release(ptep, new);
>>>       data->phys += granule;
>>> }
>>>
>>> This works because __kvm_pgtable_visit() saves the *ptep value before calling the
>>> pre callback, and it visits the next level table based on the initial pte value,
>>> not the new value written by stage2_coalesce_tables_into_block().
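
For reference, the property relied on here is roughly the following. This
is a condensed paraphrase of __kvm_pgtable_visit(), not the exact
pgtable.c code, and visit_table_pre()/walk_next_level() are stand-ins for
the walker's callback dispatch and recursion:

static int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
			       kvm_pte_t *ptep, u32 level)
{
	kvm_pte_t pte = *ptep;			/* snapshot before any callback */
	bool table = kvm_pte_table(pte, level);

	/* The TABLE_PRE callback may rewrite *ptep, e.g. install a block entry. */
	if (table)
		visit_table_pre(data, ptep, level);

	/*
	 * Descend using the snapshot, so the old sub-level table is still
	 * walked even though *ptep may now hold the new block entry.
	 */
	if (table)
		return walk_next_level(data, kvm_pte_follow(pte), level + 1);

	return 0;
}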
>> Right. So before replacing the initial pte value with the new value, we have to use
>> *data->follow = kvm_pte_follow(*ptep)* in stage2_map_walk_table_pre() to save
>> the initial pte value in advance. And data->follow will be used when  we start to
>> unmap the old sub-level tables later.
> Right, stage2_map_walk_table_post() will use data->follow to free the table page
> which is no longer needed because we've replaced the entire next level table with
> a block mapping.
>
>>> Assuming the first patch in the series is merged ("KVM: arm64: Move the clean of
>>> dcache to the map handler"), this function is missing the CMOs from
>>> stage2_map_walker_try_leaf().
>> Yes, the CMOs are not performed in stage2_coalesce_tables_into_block() currently,
>> because I thought they were not needed when we rebuild the block mappings from
>> normal page mappings.
> This assumes that the *only* situation when we replace a table entry with a block
> mapping is when the next level table (or tables) is *fully* populated. Is there a
> way to prove that this is true? I think it's important to prove it unequivocally,
> because if there's a corner case where this doesn't happen and we remove the
> dcache maintenance, we can end up with hard to reproduce and hard to diagnose
> errors in a guest.
So there is still one thing left to determine about this patch: whether
we can simply discard the CMOs in stage2_coalesce_tables_into_block(), or
whether we have to distinguish between different situations.

Now that we know the situation you described won't happen, I think we
will only end up in stage2_coalesce_tables_into_block() through the
following sequence:
1) KVM creates a new block mapping in stage2_map_walker_try_leaf() for
   the first time, when the guest accesses memory backed by a THP/HUGETLB
   huge page. CMOs are performed here.
2) KVM splits this block mapping during dirty logging and builds only one
   new page mapping.
3) KVM lazily builds the other new page mappings during dirty logging, as
   the guest accesses other pages within the block. *At this stage, the
   pages in the block may or may not be fully mapped.*
4) After dirty logging is disabled, KVM decides to rebuild the block
   mapping.

Do we still have to perform CMOs when rebuilding the block mapping in
step 4, if the pages in the block were not fully mapped in step 3? I'm
not completely sure about this.

Thanks,

Yanan
>> At least, they are not needed if we rebuild the block mappings backed by hugetlbfs
>> pages, because we must have built the new block mappings for the first time before
>> and now need to rebuild them after they were split in dirty logging. Can we
>> agree on this?
>> Then let's see the following situation.
>>> I can think of the following situation where they
>>> are needed:
>>>
>>> 1. The 2nd level (PMD) table that will be turned into a block is mapped at stage 2
>>> because one of the pages in the 3rd level (PTE) table it points to is accessed by
>>> the guest.
>>>
>>> 2. The kernel decides to turn the userspace mapping into a transparent huge page
>>> and calls the mmu notifier to remove the mapping from stage 2. The 2nd level table
>>> is still valid.
>> I have a question here. Won't the PMD entry have been invalidated too in this case?
>> If remove of the stage2 mapping by mmu notifier is an unmap operation of a range,
>> then it's correct and reasonable to both invalidate the PMD entry and free the
>> PTE table.
>> As I know, kvm_pgtable_stage2_unmap() does so when unmapping a range.
>>
>> And if I was right about this, we will not end up in
>> stage2_coalesce_tables_into_block()
>> like step 3 describes, but in stage2_map_walker_try_leaf() instead. Because the
>> PMD entry
>> is invalid, so KVM will create the new 2M block mapping.
> Looking at the code for stage2_unmap_walker(), I believe you are correct. After
> the entire PTE table has been unmapped, the function will mark the PMD entry as
> invalid. In the situation I described, at step 3 we would end up in the leaf
> mapper function because the PMD entry is invalid. My example was wrong.
>
>> If I'm wrong about this, then I think this is a valid situation.
>>> 3. Guest accesses a page which is not the page it accessed at step 1, which causes
>>> a translation fault. KVM decides we can use a PMD block mapping to map the address
>>> and we end up in stage2_coalesce_tables_into_block(). We need CMOs in this case
>>> because the guest accesses memory it didn't access before.
>>>
>>> What do you think, is that a valid situation?
>>>>        return 0;
>>>>    }
>>>>    @@ -614,20 +614,18 @@ static int stage2_map_walk_table_post(u64 addr, u64
>>>> end, u32 level,
>>>>                          kvm_pte_t *ptep,
>>>>                          struct stage2_map_data *data)
>>>>    {
>>>> -    int ret = 0;
>>>> -
>>>>        if (!data->anchor)
>>>>            return 0;
>>>>    -    free_page((unsigned long)kvm_pte_follow(*ptep));
>>>> -    put_page(virt_to_page(ptep));
>>>> -
>>>> -    if (data->anchor == ptep) {
>>>> +    if (data->anchor != ptep) {
>>>> +        free_page((unsigned long)kvm_pte_follow(*ptep));
>>>> +        put_page(virt_to_page(ptep));
>>>> +    } else {
>>>> +        free_page((unsigned long)data->follow);
>>>>            data->anchor = NULL;
>>>> -        ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
>>> stage2_map_walk_leaf() -> stage2_map_walk_table_post calls put_page() and
>>> get_page() once in our case (valid old mapping). It looks to me like we're missing
>>> a put_page() call when the function is called for the anchor. Have you found the
>>> call to be unnecessary?
>> Before this patch:
>> When we find data->anchor == ptep, put_page() has been called once in advance
>> for the anchor
>> in stage2_map_walk_table_post(). Then we call stage2_map_walk_leaf() ->
>> stage2_map_walker_try_leaf()
>> to install the block entry, and only get_page() will be called once in
>> stage2_map_walker_try_leaf().
>> There is a put_page() followed by a get_page() for the anchor, so there
>> will not be a problem with the page counts.
> This is how I'm reading the code before your patch:
>
> - stage2_map_walk_table_post() returns early if there is no anchor.
>
> - stage2_map_walk_table_pre() sets the anchor and marks the entry as invalid. The
> entry was a table so the leaf visitor is not called in __kvm_pgtable_visit().
>
> - __kvm_pgtable_visit() visits the next level table.
>
> - stage2_map_walk_table_post() calls put_page(), calls stage2_map_walk_leaf() ->
> stage2_map_walker_try_leaf(). The old entry was invalidated by the pre visitor, so
> it only calls get_page() (and not put_page() + get_page()).
>
> I agree with your conclusion, I didn't realize that because the pre visitor marks
> the entry as invalid, stage2_map_walker_try_leaf() will not call put_page().
>
>> After this patch:
>> Neither before nor after we find data->anchor == ptep is there a
>> put_page() call for the anchor.
>> This is because we didn't call get_page() either in
>> stage2_coalesce_tables_into_block() when installing the block entry.
>> So I think there will not be a problem either.
> I agree, the refcount will be identical.
>
>> Does the above answer your point?
> Yes, thank you for clearing that up for me.
>
> Thanks,
>
> Alex
>
>>>>        }
>>>>    -    return ret;
>>>> +    return 0;
>>> I think it's correct for this function to succeed unconditionally. The error was
>>> coming from stage2_map_walk_leaf() -> stage2_map_walker_try_leaf(). The function
>>> can return an error code if block mapping is not supported, which we know is
>>> supported because we have an anchor, and if only the permissions are different
>>> between the old and the new entry, but in our case we've changed both the valid
>>> and type bits.
>> Agreed. Besides, we will definitely not end up updating an old valid entry for
>> the anchor
>> in stage2_map_walker_try_leaf(), because *anchor has already been invalidated in
>> stage2_map_walk_table_pre() before setting the anchor, so it will look like
>> building a new mapping.
>>
>> Thanks,
>>
>> Yanan
>>> Thanks,
>>>
>>> Alex
>>>
>>>>    }
>>>>      /*
>>> .
> .
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings
  2021-03-04  7:07           ` wangyanan (Y)
  (?)
@ 2021-03-04  7:22             ` wangyanan (Y)
  -1 siblings, 0 replies; 80+ messages in thread
From: wangyanan (Y) @ 2021-03-04  7:22 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Alexandru Elisei, Will Deacon, Catalin Marinas, Julien Thierry,
	James Morse, Suzuki K Poulose, Quentin Perret, Gavin Shan,
	kvmarm, linux-arm-kernel, kvm, linux-kernel


On 2021/3/4 15:07, wangyanan (Y) wrote:
> Hi Alex,
>
> On 2021/3/4 1:27, Alexandru Elisei wrote:
>> Hi Yanan,
>>
>> On 3/3/21 11:04 AM, wangyanan (Y) wrote:
>>> Hi Alex,
>>>
>>> On 2021/3/3 1:13, Alexandru Elisei wrote:
>>>> Hello,
>>>>
>>>> On 2/8/21 11:22 AM, Yanan Wang wrote:
>>>>> When KVM needs to coalesce the normal page mappings into a block 
>>>>> mapping,
>>>>> we currently invalidate the old table entry first followed by 
>>>>> invalidation
>>>>> of TLB, then unmap the page mappings, and install the block entry 
>>>>> at last.
>>>>>
>>>>> It will cost a long time to unmap the numerous page mappings, 
>>>>> which means
>>>>> there will be a long period when the table entry can be found 
>>>>> invalid.
>>>>> If other vCPUs access any guest page within the block range and 
>>>>> find the
>>>>> table entry invalid, they will all exit from guest with a 
>>>>> translation fault
>>>>> which is not necessary. And KVM will make efforts to handle these 
>>>>> faults,
>>>>> especially when performing CMOs by block range.
>>>>>
>>>>> So let's quickly install the block entry at first to ensure 
>>>>> uninterrupted
>>>>> memory access of the other vCPUs, and then unmap the page mappings 
>>>>> after
>>>>> installation. This will reduce most of the time when the table 
>>>>> entry is
>>>>> invalid, and avoid most of the unnecessary translation faults.
>>>> I'm not convinced I've fully understood what is going on yet, but 
>>>> it seems to me
>>>> that the idea is sound. Some questions and comments below.
>>> What I am trying to do in this patch is to adjust the order of 
>>> rebuilding block
>>> mappings from page mappings.
>>> Take the rebuilding of 1G block mappings as an example.
>>> Before this patch, the order is like:
>>> 1) invalidate the table entry of the 1st level(PUD)
>>> 2) flush TLB by VMID
>>> 3) unmap the old PMD/PTE tables
>>> 4) install the new block entry to the 1st level(PUD)
>>>
>>> So entry in the 1st level can be found invalid by other vcpus in 1), 
>>> 2), and 3),
>>> and it's a long time in 3) to unmap
>>> the numerous old PMD/PTE tables, which means the total time of the 
>>> entry being
>>> invalid is long enough to
>>> affect the performance.
>>>
>>> After this patch, the order is like:
>>> 1) invalidate the table ebtry of the 1st level(PUD)
>>> 2) flush TLB by VMID
>>> 3) install the new block entry to the 1st level(PUD)
>>> 4) unmap the old PMD/PTE tables
>>>
>>> The change ensures that period of entry in the 1st level(PUD) being 
>>> invalid is
>>> only in 1) and 2),
>>> so if other vcpus access memory within 1G, there will be less chance 
>>> to find the
>>> entry invalid
>>> and as a result trigger an unnecessary translation fault.
>> Thank you for the explanation, that was my understanding of it also, and 
>> I believe
>> your idea is correct. I was more concerned that I got some of the 
>> details wrong,
>> and you have kindly corrected me below.
>>
>>>>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>>>>> ---
>>>>>    arch/arm64/kvm/hyp/pgtable.c | 26 ++++++++++++--------------
>>>>>    1 file changed, 12 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c 
>>>>> b/arch/arm64/kvm/hyp/pgtable.c
>>>>> index 78a560446f80..308c36b9cd21 100644
>>>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>>>> @@ -434,6 +434,7 @@ struct stage2_map_data {
>>>>>        kvm_pte_t            attr;
>>>>>          kvm_pte_t            *anchor;
>>>>> +    kvm_pte_t            *follow;
>>>>>          struct kvm_s2_mmu        *mmu;
>>>>>        struct kvm_mmu_memory_cache    *memcache;
>>>>> @@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 
>>>>> addr, u64 end,
>>>>> u32 level,
>>>>>        if (!kvm_block_mapping_supported(addr, end, data->phys, 
>>>>> level))
>>>>>            return 0;
>>>>>    -    kvm_set_invalid_pte(ptep);
>>>>> -
>>>>>        /*
>>>>> -     * Invalidate the whole stage-2, as we may have numerous leaf
>>>>> -     * entries below us which would otherwise need invalidating
>>>>> -     * individually.
>>>>> +     * If we need to coalesce existing table entries into a block 
>>>>> here,
>>>>> +     * then install the block entry first and the sub-level page 
>>>>> mappings
>>>>> +     * will be unmapped later.
>>>>>         */
>>>>> -    kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>>>        data->anchor = ptep;
>>>>> +    data->follow = kvm_pte_follow(*ptep);
>>>>> +    stage2_coalesce_tables_into_block(addr, level, ptep, data);
>>>> Here's how stage2_coalesce_tables_into_block() is implemented from 
>>>> the previous
>>>> patch (it might be worth merging it with this patch, I found it 
>>>> impossible to
>>>> judge if the function is correct without seeing how it is used and 
>>>> what is
>>>> replacing):
>>> Ok, will do this if v2 is going to be post.
>>>> static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
>>>>                             kvm_pte_t *ptep,
>>>>                             struct stage2_map_data *data)
>>>> {
>>>>       u64 granule = kvm_granule_size(level), phys = data->phys;
>>>>       kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, 
>>>> level);
>>>>
>>>>       kvm_set_invalid_pte(ptep);
>>>>
>>>>       /*
>>>>        * Invalidate the whole stage-2, as we may have numerous leaf 
>>>> entries
>>>>        * below us which would otherwise need invalidating 
>>>> individually.
>>>>        */
>>>>       kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>>       smp_store_release(ptep, new);
>>>>       data->phys += granule;
>>>> }
>>>>
>>>> This works because __kvm_pgtable_visit() saves the *ptep value 
>>>> before calling the
>>>> pre callback, and it visits the next level table based on the 
>>>> initial pte value,
>>>> not the new value written by stage2_coalesce_tables_into_block().
>>> Right. So before replacing the initial pte value with the new value, 
>>> we have to use
>>> *data->follow = kvm_pte_follow(*ptep)* in 
>>> stage2_map_walk_table_pre() to save
>>> the initial pte value in advance. And data->follow will be used 
>>> when  we start to
>>> unmap the old sub-level tables later.
>> Right, stage2_map_walk_table_post() will use data->follow to free the 
>> table page
>> which is no longer needed because we've replaced the entire next 
>> level table with
>> a block mapping.
>>
>>>> Assuming the first patch in the series is merged ("KVM: arm64: Move 
>>>> the clean of
>>>> dcache to the map handler"), this function is missing the CMOs from
>>>> stage2_map_walker_try_leaf().
>>> Yes, the CMOs are not performed in 
>>> stage2_coalesce_tables_into_block() currently,
>>> because I thought they were not needed when we rebuild the block 
>>> mappings from
>>> normal page mappings.
>> This assumes that the *only* situation when we replace a table entry 
>> with a block
>> mapping is when the next level table (or tables) is *fully* 
>> populated. Is there a
>> way to prove that this is true? I think it's important to prove it 
>> unequivocally,
>> because if there's a corner case where this doesn't happen and we 
>> remove the
>> dcache maintenance, we can end up with hard to reproduce and hard to 
>> diagnose
>> errors in a guest.
> So there is still one thing left to determine about this patch: whether
> we can simply discard the CMOs in stage2_coalesce_tables_into_block(),
> or whether we have to distinguish between different situations.
>
> Now that we know the situation you described won't happen, I think we
> will only end up in stage2_coalesce_tables_into_block() through the
> following sequence:
> 1) KVM creates a new block mapping in stage2_map_walker_try_leaf() for
>    the first time, when the guest accesses memory backed by a
>    THP/HUGETLB huge page. CMOs are performed here.
> 2) KVM splits this block mapping during dirty logging and builds only
>    one new page mapping.
> 3) KVM lazily builds the other new page mappings during dirty logging,
>    as the guest accesses other pages within the block. *At this stage,
>    the pages in the block may or may not be fully mapped.*
> 4) After dirty logging is disabled, KVM decides to rebuild the block
>    mapping.
>
> Do we still have to perform CMOs when rebuilding the block mapping in
> step 4, if the pages in the block were not fully mapped in step 3? I'm
> not completely sure about this.
>
Hi Marc,
Could you please help answer the above question :) ?

Thanks,

Yanan


> Thanks,
>
> Yanan
>>> At least, they are not needed if we rebuild the block mappings 
>>> backed by hugetlbfs
>>> pages, because we must have built the new block mappings for the 
>>> first time before
>>> and now need to rebuild them after they were split in dirty logging. 
>>> Can we
>>> agree on this?
>>> Then let's see the following situation.
>>>> I can think of the following situation where they
>>>> are needed:
>>>>
>>>> 1. The 2nd level (PMD) table that will be turned into a block is 
>>>> mapped at stage 2
>>>> because one of the pages in the 3rd level (PTE) table it points to 
>>>> is accessed by
>>>> the guest.
>>>>
>>>> 2. The kernel decides to turn the userspace mapping into a 
>>>> transparent huge page
>>>> and calls the mmu notifier to remove the mapping from stage 2. The 
>>>> 2nd level table
>>>> is still valid.
>>> I have a question here. Won't the PMD entry have been invalidated too
>>> in this case?
>>> If remove of the stage2 mapping by mmu notifier is an unmap 
>>> operation of a range,
>>> then it's correct and reasonable to both invalidate the PMD entry 
>>> and free the
>>> PTE table.
>>> As I know, kvm_pgtable_stage2_unmap() does so when unmapping a range.
>>>
>>> And if I was right about this, we will not end up in
>>> stage2_coalesce_tables_into_block()
>>> like step 3 describes, but in stage2_map_walker_try_leaf() instead. 
>>> Because the
>>> PMD entry
>>> is invalid, so KVM will create the new 2M block mapping.
>> Looking at the code for stage2_unmap_walker(), I believe you are 
>> correct. After
>> the entire PTE table has been unmapped, the function will mark the 
>> PMD entry as
>> invalid. In the situation I described, at step 3 we would end up in 
>> the leaf
>> mapper function because the PMD entry is invalid. My example was wrong.
>>
>>> If I'm wrong about this, then I think this is a valid situation.
>>>> 3. Guest accesses a page which is not the page it accessed at step 
>>>> 1, which causes
>>>> a translation fault. KVM decides we can use a PMD block mapping to 
>>>> map the address
>>>> and we end up in stage2_coalesce_tables_into_block(). We need CMOs 
>>>> in this case
>>>> because the guest accesses memory it didn't access before.
>>>>
>>>> What do you think, is that a valid situation?
>>>>>        return 0;
>>>>>    }
>>>>>    @@ -614,20 +614,18 @@ static int stage2_map_walk_table_post(u64 
>>>>> addr, u64
>>>>> end, u32 level,
>>>>>                          kvm_pte_t *ptep,
>>>>>                          struct stage2_map_data *data)
>>>>>    {
>>>>> -    int ret = 0;
>>>>> -
>>>>>        if (!data->anchor)
>>>>>            return 0;
>>>>>    -    free_page((unsigned long)kvm_pte_follow(*ptep));
>>>>> -    put_page(virt_to_page(ptep));
>>>>> -
>>>>> -    if (data->anchor == ptep) {
>>>>> +    if (data->anchor != ptep) {
>>>>> +        free_page((unsigned long)kvm_pte_follow(*ptep));
>>>>> +        put_page(virt_to_page(ptep));
>>>>> +    } else {
>>>>> +        free_page((unsigned long)data->follow);
>>>>>            data->anchor = NULL;
>>>>> -        ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
>>>> stage2_map_walk_leaf() -> stage2_map_walk_table_post calls 
>>>> put_page() and
>>>> get_page() once in our case (valid old mapping). It looks to me 
>>>> like we're missing
>>>> a put_page() call when the function is called for the anchor. Have 
>>>> you found the
>>>> call to be unnecessary?
>>> Before this patch:
>>> When we find data->anchor == ptep, put_page() has been called once 
>>> in advance
>>> for the anchor
>>> in stage2_map_walk_table_post(). Then we call stage2_map_walk_leaf() ->
>>> stage2_map_walker_try_leaf()
>>> to install the block entry, and only get_page() will be called once in
>>> stage2_map_walker_try_leaf().
>>> There is a put_page() followed by a get_page() for the anchor, and 
>>> there will
>>> not be a problem about
>>> page_counts.
>> This is how I'm reading the code before your patch:
>>
>> - stage2_map_walk_table_post() returns early if there is no anchor.
>>
>> - stage2_map_walk_table_pre() sets the anchor and marks the entry as 
>> invalid. The
>> entry was a table so the leaf visitor is not called in 
>> __kvm_pgtable_visit().
>>
>> - __kvm_pgtable_visit() visits the next level table.
>>
>> - stage2_map_walk_table_post() calls put_page(), calls 
>> stage2_map_walk_leaf() ->
>> stage2_map_walker_try_leaf(). The old entry was invalidated by the 
>> pre visitor, so
>> it only calls get_page() (and not put_page() + get_page().
>>
>> I agree with your conclusion, I didn't realize that because the pre 
>> visitor marks
>> the entry as invalid, stage2_map_walker_try_leaf() will not call 
>> put_page().
>>
>>> After this patch:
>>> Before we find data->anchor == ptep and after it, there is not a 
>>> put_page() call
>>> for the anchor.
>>> This is because that we didn't call get_page() either in
>>> stage2_coalesce_tables_into_block() when
>>> install the block entry. So I think there will not be a problem too.
>> I agree, the refcount will be identical.
>>
>>> Is above the right answer for your point?
>> Yes, thank you clearing that up for me.
>>
>> Thanks,
>>
>> Alex
>>
>>>>>        }
>>>>>    -    return ret;
>>>>> +    return 0;
>>>> I think it's correct for this function to succeed unconditionally. 
>>>> The error was
>>>> coming from stage2_map_walk_leaf() -> stage2_map_walker_try_leaf(). 
>>>> The function
>>>> can return an error code if block mapping is not supported, which 
>>>> we know is
>>>> supported because we have an anchor, and if only the permissions 
>>>> are different
>>>> between the old and the new entry, but in our case we've changed 
>>>> both the valid
>>>> and type bits.
>>> Agreed. Besides, we will definitely not end up updating an old valid 
>>> entry for
>>> the anchor
>>> in stage2_map_walker_try_leaf(), because *anchor has already been 
>>> invalidated in
>>> stage2_map_walk_table_pre() before set the anchor, so it will look 
>>> like a build
>>> of new mapping.
>>>
>>> Thanks,
>>>
>>> Yanan
>>>> Thanks,
>>>>
>>>> Alex
>>>>
>>>>>    }
>>>>>      /*
>>>> .
>> .

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings
@ 2021-03-04  7:22             ` wangyanan (Y)
  0 siblings, 0 replies; 80+ messages in thread
From: wangyanan (Y) @ 2021-03-04  7:22 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, linux-kernel, linux-arm-kernel, Catalin Marinas,
	Will Deacon, kvmarm


On 2021/3/4 15:07, wangyanan (Y) wrote:
> Hi Alex,
>
> On 2021/3/4 1:27, Alexandru Elisei wrote:
>> Hi Yanan,
>>
>> On 3/3/21 11:04 AM, wangyanan (Y) wrote:
>>> Hi Alex,
>>>
>>> On 2021/3/3 1:13, Alexandru Elisei wrote:
>>>> Hello,
>>>>
>>>> On 2/8/21 11:22 AM, Yanan Wang wrote:
>>>>> When KVM needs to coalesce the normal page mappings into a block 
>>>>> mapping,
>>>>> we currently invalidate the old table entry first followed by 
>>>>> invalidation
>>>>> of TLB, then unmap the page mappings, and install the block entry 
>>>>> at last.
>>>>>
>>>>> It will cost a long time to unmap the numerous page mappings, 
>>>>> which means
>>>>> there will be a long period when the table entry can be found 
>>>>> invalid.
>>>>> If other vCPUs access any guest page within the block range and 
>>>>> find the
>>>>> table entry invalid, they will all exit from guest with a 
>>>>> translation fault
>>>>> which is not necessary. And KVM will make efforts to handle these 
>>>>> faults,
>>>>> especially when performing CMOs by block range.
>>>>>
>>>>> So let's quickly install the block entry at first to ensure 
>>>>> uninterrupted
>>>>> memory access of the other vCPUs, and then unmap the page mappings 
>>>>> after
>>>>> installation. This will reduce most of the time when the table 
>>>>> entry is
>>>>> invalid, and avoid most of the unnecessary translation faults.
>>>> I'm not convinced I've fully understood what is going on yet, but 
>>>> it seems to me
>>>> that the idea is sound. Some questions and comments below.
>>> What I am trying to do in this patch is to adjust the order of 
>>> rebuilding block
>>> mappings from page mappings.
>>> Take the rebuilding of 1G block mappings as an example.
>>> Before this patch, the order is like:
>>> 1) invalidate the table entry of the 1st level(PUD)
>>> 2) flush TLB by VMID
>>> 3) unmap the old PMD/PTE tables
>>> 4) install the new block entry to the 1st level(PUD)
>>>
>>> So entry in the 1st level can be found invalid by other vcpus in 1), 
>>> 2), and 3),
>>> and it's a long time in 3) to unmap
>>> the numerous old PMD/PTE tables, which means the total time of the 
>>> entry being
>>> invalid is long enough to
>>> affect the performance.
>>>
>>> After this patch, the order is like:
>>> 1) invalidate the table entry of the 1st level(PUD)
>>> 2) flush TLB by VMID
>>> 3) install the new block entry to the 1st level(PUD)
>>> 4) unmap the old PMD/PTE tables
>>>
>>> The change ensures that period of entry in the 1st level(PUD) being 
>>> invalid is
>>> only in 1) and 2),
>>> so if other vcpus access memory within 1G, there will be less chance 
>>> to find the
>>> entry invalid
>>> and as a result trigger an unnecessary translation fault.
>> Thank you for the explanation, that was my understanding of it also, and 
>> I believe
>> your idea is correct. I was more concerned that I got some of the 
>> details wrong,
>> and you have kindly corrected me below.
>>
>>>>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>>>>> ---
>>>>>    arch/arm64/kvm/hyp/pgtable.c | 26 ++++++++++++--------------
>>>>>    1 file changed, 12 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c 
>>>>> b/arch/arm64/kvm/hyp/pgtable.c
>>>>> index 78a560446f80..308c36b9cd21 100644
>>>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>>>> @@ -434,6 +434,7 @@ struct stage2_map_data {
>>>>>        kvm_pte_t            attr;
>>>>>          kvm_pte_t            *anchor;
>>>>> +    kvm_pte_t            *follow;
>>>>>          struct kvm_s2_mmu        *mmu;
>>>>>        struct kvm_mmu_memory_cache    *memcache;
>>>>> @@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 
>>>>> addr, u64 end,
>>>>> u32 level,
>>>>>        if (!kvm_block_mapping_supported(addr, end, data->phys, 
>>>>> level))
>>>>>            return 0;
>>>>>    -    kvm_set_invalid_pte(ptep);
>>>>> -
>>>>>        /*
>>>>> -     * Invalidate the whole stage-2, as we may have numerous leaf
>>>>> -     * entries below us which would otherwise need invalidating
>>>>> -     * individually.
>>>>> +     * If we need to coalesce existing table entries into a block 
>>>>> here,
>>>>> +     * then install the block entry first and the sub-level page 
>>>>> mappings
>>>>> +     * will be unmapped later.
>>>>>         */
>>>>> -    kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>>>        data->anchor = ptep;
>>>>> +    data->follow = kvm_pte_follow(*ptep);
>>>>> +    stage2_coalesce_tables_into_block(addr, level, ptep, data);
>>>> Here's how stage2_coalesce_tables_into_block() is implemented from 
>>>> the previous
>>>> patch (it might be worth merging it with this patch, I found it 
>>>> impossible to
>>>> judge if the function is correct without seeing how it is used and 
>>>> what is
>>>> replacing):
>>> Ok, will do this if v2 is going to be post.
>>>> static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
>>>>                             kvm_pte_t *ptep,
>>>>                             struct stage2_map_data *data)
>>>> {
>>>>       u64 granule = kvm_granule_size(level), phys = data->phys;
>>>>       kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, 
>>>> level);
>>>>
>>>>       kvm_set_invalid_pte(ptep);
>>>>
>>>>       /*
>>>>        * Invalidate the whole stage-2, as we may have numerous leaf 
>>>> entries
>>>>        * below us which would otherwise need invalidating 
>>>> individually.
>>>>        */
>>>>       kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>>       smp_store_release(ptep, new);
>>>>       data->phys += granule;
>>>> }
>>>>
>>>> This works because __kvm_pgtable_visit() saves the *ptep value 
>>>> before calling the
>>>> pre callback, and it visits the next level table based on the 
>>>> initial pte value,
>>>> not the new value written by stage2_coalesce_tables_into_block().
>>> Right. So before replacing the initial pte value with the new value, 
>>> we have to use
>>> *data->follow = kvm_pte_follow(*ptep)* in 
>>> stage2_map_walk_table_pre() to save
>>> the initial pte value in advance. And data->follow will be used 
>>> when  we start to
>>> unmap the old sub-level tables later.
>> Right, stage2_map_walk_table_post() will use data->follow to free the 
>> table page
>> which is no longer needed because we've replaced the entire next 
>> level table with
>> a block mapping.
>>
>>>> Assuming the first patch in the series is merged ("KVM: arm64: Move 
>>>> the clean of
>>>> dcache to the map handler"), this function is missing the CMOs from
>>>> stage2_map_walker_try_leaf().
>>> Yes, the CMOs are not performed in 
>>> stage2_coalesce_tables_into_block() currently,
>>> because I thought they were not needed when we rebuild the block 
>>> mappings from
>>> normal page mappings.
>> This assumes that the *only* situation when we replace a table entry 
>> with a block
>> mapping is when the next level table (or tables) is *fully* 
>> populated. Is there a
>> way to prove that this is true? I think it's important to prove it 
>> unequivocally,
>> because if there's a corner case where this doesn't happen and we 
>> remove the
>> dcache maintenance, we can end up with hard to reproduce and hard to 
>> diagnose
>> errors in a guest.
> So there is still one thing left to determine about this patch: whether
> we can simply discard the CMOs in stage2_coalesce_tables_into_block(),
> or whether we have to distinguish between different situations.
>
> Now that we know the situation you described won't happen, I think we
> will only end up in stage2_coalesce_tables_into_block() through the
> following sequence:
> 1) KVM creates a new block mapping in stage2_map_walker_try_leaf() for
>    the first time, when the guest accesses memory backed by a
>    THP/HUGETLB huge page. CMOs are performed here.
> 2) KVM splits this block mapping during dirty logging and builds only
>    one new page mapping.
> 3) KVM lazily builds the other new page mappings during dirty logging,
>    as the guest accesses other pages within the block. *At this stage,
>    the pages in the block may or may not be fully mapped.*
> 4) After dirty logging is disabled, KVM decides to rebuild the block
>    mapping.
>
> Do we still have to perform CMOs when rebuilding the block mapping in
> step 4, if the pages in the block were not fully mapped in step 3? I'm
> not completely sure about this.
>
Hi Marc,
Could you please share your thoughts on the above question :) ?

Thanks,

Yanan


> Thanks,
>
> Yanan
>>> At least, they are not needed if we rebuild the block mappings 
>>> backed by hugetlbfs
>>> pages, because we must have built the new block mappings for the 
>>> first time before
>>> and now need to rebuild them after they were split in dirty logging. 
>>> Can we
>>> agree on this?
>>> Then let's see the following situation.
>>>> I can think of the following situation where they
>>>> are needed:
>>>>
>>>> 1. The 2nd level (PMD) table that will be turned into a block is 
>>>> mapped at stage 2
>>>> because one of the pages in the 3rd level (PTE) table it points to 
>>>> is accessed by
>>>> the guest.
>>>>
>>>> 2. The kernel decides to turn the userspace mapping into a 
>>>> transparent huge page
>>>> and calls the mmu notifier to remove the mapping from stage 2. The 
>>>> 2nd level table
>>>> is still valid.
>>> I have a question here. Won't the PMD entry have been invalidated too in 
>>> this case?
>>> If the removal of the stage 2 mapping by the mmu notifier is an unmap 
>>> operation on a range,
>>> then it's correct and reasonable to both invalidate the PMD entry 
>>> and free the
>>> PTE table.
>>> As far as I know, kvm_pgtable_stage2_unmap() does so when unmapping a range.
>>>
>>> And if I am right about this, we will not end up in
>>> stage2_coalesce_tables_into_block()
>>> as step 3 describes, but in stage2_map_walker_try_leaf() instead, 
>>> because the
>>> PMD entry
>>> is invalid, so KVM will create a new 2M block mapping.
>> Looking at the code for stage2_unmap_walker(), I believe you are 
>> correct. After
>> the entire PTE table has been unmapped, the function will mark the 
>> PMD entry as
>> invalid. In the situation I described, at step 3 we would end up in 
>> the leaf
>> mapper function because the PMD entry is invalid. My example was wrong.
>>
>>> If I'm wrong about this, then I think this is a valid situation.
>>>> 3. Guest accesses a page which is not the page it accessed at step 
>>>> 1, which causes
>>>> a translation fault. KVM decides we can use a PMD block mapping to 
>>>> map the address
>>>> and we end up in stage2_coalesce_tables_into_block(). We need CMOs 
>>>> in this case
>>>> because the guest accesses memory it didn't access before.
>>>>
>>>> What do you think, is that a valid situation?
>>>>>        return 0;
>>>>>    }
>>>>>    @@ -614,20 +614,18 @@ static int stage2_map_walk_table_post(u64 
>>>>> addr, u64
>>>>> end, u32 level,
>>>>>                          kvm_pte_t *ptep,
>>>>>                          struct stage2_map_data *data)
>>>>>    {
>>>>> -    int ret = 0;
>>>>> -
>>>>>        if (!data->anchor)
>>>>>            return 0;
>>>>>    -    free_page((unsigned long)kvm_pte_follow(*ptep));
>>>>> -    put_page(virt_to_page(ptep));
>>>>> -
>>>>> -    if (data->anchor == ptep) {
>>>>> +    if (data->anchor != ptep) {
>>>>> +        free_page((unsigned long)kvm_pte_follow(*ptep));
>>>>> +        put_page(virt_to_page(ptep));
>>>>> +    } else {
>>>>> +        free_page((unsigned long)data->follow);
>>>>>            data->anchor = NULL;
>>>>> -        ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
>>>> stage2_map_walk_leaf() -> stage2_map_walk_table_post calls 
>>>> put_page() and
>>>> get_page() once in our case (valid old mapping). It looks to me 
>>>> like we're missing
>>>> a put_page() call when the function is called for the anchor. Have 
>>>> you found the
>>>> call to be unnecessary?
>>> Before this patch:
>>> When we find data->anchor == ptep, put_page() has been called once 
>>> in advance
>>> for the anchor
>>> in stage2_map_walk_table_post(). Then we call stage2_map_walk_leaf() ->
>>> stage2_map_walker_try_leaf()
>>> to install the block entry, and only get_page() will be called once in
>>> stage2_map_walker_try_leaf().
>>> There is a put_page() followed by a get_page() for the anchor, so 
>>> there will
>>> not be a problem with the
>>> page counts.
>> This is how I'm reading the code before your patch:
>>
>> - stage2_map_walk_table_post() returns early if there is no anchor.
>>
>> - stage2_map_walk_table_pre() sets the anchor and marks the entry as 
>> invalid. The
>> entry was a table so the leaf visitor is not called in 
>> __kvm_pgtable_visit().
>>
>> - __kvm_pgtable_visit() visits the next level table.
>>
>> - stage2_map_walk_table_post() calls put_page(), then calls 
>> stage2_map_walk_leaf() ->
>> stage2_map_walker_try_leaf(). The old entry was invalidated by the 
>> pre visitor, so
>> it only calls get_page() (and not put_page() + get_page()).
>>
>> I agree with your conclusion, I didn't realize that because the pre 
>> visitor marks
>> the entry as invalid, stage2_map_walker_try_leaf() will not call 
>> put_page().
>>
>>> After this patch:
>>> There is no put_page() call for the anchor, either before or after 
>>> we find data->anchor == ptep.
>>> This is because we didn't call get_page() either in
>>> stage2_coalesce_tables_into_block() when
>>> installing the block entry. So I think there will not be a problem here either.
>> I agree, the refcount will be identical.
>>
>>> Is the above the right answer to your point?
>> Yes, thank you for clearing that up for me.
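
As a sanity check on the refcount argument above, a tiny self-contained model
(a plain counter standing in for the struct page refcount; the function names
are placeholders, not the kernel API) shows that both paths leave the anchor's
count unchanged:

#include <assert.h>

static int anchor_refcount = 1;			/* toy counter, not a struct page */
static void get_page_toy(void) { anchor_refcount++; }
static void put_page_toy(void) { anchor_refcount--; }

/*
 * Before the patch: the post visitor drops one reference for the anchor and
 * stage2_map_walker_try_leaf() takes one back when installing the block.
 */
static void before_patch_anchor_path(void)
{
	put_page_toy();
	get_page_toy();
}

/*
 * After the patch: neither stage2_coalesce_tables_into_block() nor the post
 * visitor touches the anchor's refcount.
 */
static void after_patch_anchor_path(void) { }

int main(void)
{
	before_patch_anchor_path();
	assert(anchor_refcount == 1);
	after_patch_anchor_path();
	assert(anchor_refcount == 1);
	return 0;
}
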
>>
>> Thanks,
>>
>> Alex
>>
>>>>>        }
>>>>>    -    return ret;
>>>>> +    return 0;
>>>> I think it's correct for this function to succeed unconditionally. 
>>>> The error was
>>>> coming from stage2_map_walk_leaf() -> stage2_map_walker_try_leaf(). 
>>>> The function
>>>> can return an error code if block mapping is not supported, which 
>>>> we know is
>>>> supported because we have an anchor, and if only the permissions 
>>>> are different
>>>> between the old and the new entry, but in our case we've changed 
>>>> both the valid
>>>> and type bits.
>>> Agreed. Besides, we will definitely not end up updating an old valid 
>>> entry for
>>> the anchor
>>> in stage2_map_walker_try_leaf(), because *anchor has already been 
>>> invalidated in
>>> stage2_map_walk_table_pre() before setting the anchor, so it will look 
>>> like building
>>> a new mapping.
>>>
>>> Thanks,
>>>
>>> Yanan
>>>> Thanks,
>>>>
>>>> Alex
>>>>
>>>>>    }
>>>>>      /*
>>>> .
>> .
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings
  2021-03-04  7:07           ` wangyanan (Y)
  (?)
@ 2021-03-19 15:07             ` Alexandru Elisei
  -1 siblings, 0 replies; 80+ messages in thread
From: Alexandru Elisei @ 2021-03-19 15:07 UTC (permalink / raw)
  To: wangyanan (Y)
  Cc: Marc Zyngier, Will Deacon, Catalin Marinas, Julien Thierry,
	James Morse, Suzuki K Poulose, Quentin Perret, Gavin Shan,
	kvmarm, linux-arm-kernel, kvm, linux-kernel

Hi Yanan,

Sorry for taking so long to reply, been busy with other things unfortunately. I
did notice that you sent a new version of this series, but I would like to
continue our discussion on this patch, since it's easier to get the full context.

On 3/4/21 7:07 AM, wangyanan (Y) wrote:
> Hi Alex,
>
> On 2021/3/4 1:27, Alexandru Elisei wrote:
>> Hi Yanan,
>>
>> On 3/3/21 11:04 AM, wangyanan (Y) wrote:
>>> Hi Alex,
>>>
>>> On 2021/3/3 1:13, Alexandru Elisei wrote:
>>>> Hello,
>>>>
>>>> On 2/8/21 11:22 AM, Yanan Wang wrote:
>>>>> When KVM needs to coalesce the normal page mappings into a block mapping,
>>>>> we currently invalidate the old table entry first followed by invalidation
>>>>> of TLB, then unmap the page mappings, and install the block entry at last.
>>>>>
>>>>> It will cost a long time to unmap the numerous page mappings, which means
>>>>> there will be a long period when the table entry can be found invalid.
>>>>> If other vCPUs access any guest page within the block range and find the
>>>>> table entry invalid, they will all exit from guest with a translation fault
>>>>> which is not necessary. And KVM will make efforts to handle these faults,
>>>>> especially when performing CMOs by block range.
>>>>>
>>>>> So let's quickly install the block entry at first to ensure uninterrupted
>>>>> memory access of the other vCPUs, and then unmap the page mappings after
>>>>> installation. This will reduce most of the time when the table entry is
>>>>> invalid, and avoid most of the unnecessary translation faults.
>>>> I'm not convinced I've fully understood what is going on yet, but it seems to me
>>>> that the idea is sound. Some questions and comments below.
>>> What I am trying to do in this patch is to adjust the order of rebuilding block
>>> mappings from page mappings.
>>> Take the rebuilding of 1G block mappings as an example.
>>> Before this patch, the order is like:
>>> 1) invalidate the table entry at the 1st level (PUD)
>>> 2) flush TLB by VMID
>>> 3) unmap the old PMD/PTE tables
>>> 4) install the new block entry at the 1st level (PUD)
>>>
>>> So the entry at the 1st level can be found invalid by other vCPUs in 1), 2), and 3),
>>> and 3) takes a long time to unmap
>>> the numerous old PMD/PTE tables, which means the total time during which the entry is
>>> invalid is long enough to
>>> affect performance.
>>>
>>> After this patch, the order is like:
>>> 1) invalidate the table entry at the 1st level (PUD)
>>> 2) flush TLB by VMID
>>> 3) install the new block entry at the 1st level (PUD)
>>> 4) unmap the old PMD/PTE tables
>>>
>>> The change ensures that the period when the entry at the 1st level (PUD) is invalid
>>> covers only 1) and 2),
>>> so if other vCPUs access memory within the 1G range, there is less chance of finding the
>>> entry invalid
>>> and, as a result, triggering an unnecessary translation fault.
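
The reordering above can be summarised with a short sketch; the helpers are
empty placeholders (not the real KVM functions), and only the ordering is the
point:

/* Empty placeholders for the operations listed above. */
static void invalidate_table_entry(void) { }
static void flush_tlb_by_vmid(void) { }
static void unmap_old_sub_tables(void) { }	/* the slow part */
static void install_block_entry(void) { }

static void rebuild_block_before_patch(void)
{
	invalidate_table_entry();	/* entry becomes invalid here ...          */
	flush_tlb_by_vmid();
	unmap_old_sub_tables();		/* ... and stays invalid for all of this   */
	install_block_entry();		/* ... until here                          */
}

static void rebuild_block_after_patch(void)
{
	invalidate_table_entry();	/* entry is invalid only briefly           */
	flush_tlb_by_vmid();
	install_block_entry();		/* other vCPUs can translate again         */
	unmap_old_sub_tables();		/* slow work happens after the block is live */
}
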
>> Thank you for the explanation, that was my understanding of it also, and I believe
>> your idea is correct. I was more concerned that I got some of the details wrong,
>> and you have kindly corrected me below.
>>
>>>>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>>>>> ---
>>>>>    arch/arm64/kvm/hyp/pgtable.c | 26 ++++++++++++--------------
>>>>>    1 file changed, 12 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>>>> index 78a560446f80..308c36b9cd21 100644
>>>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>>>> @@ -434,6 +434,7 @@ struct stage2_map_data {
>>>>>        kvm_pte_t            attr;
>>>>>          kvm_pte_t            *anchor;
>>>>> +    kvm_pte_t            *follow;
>>>>>          struct kvm_s2_mmu        *mmu;
>>>>>        struct kvm_mmu_memory_cache    *memcache;
>>>>> @@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end,
>>>>> u32 level,
>>>>>        if (!kvm_block_mapping_supported(addr, end, data->phys, level))
>>>>>            return 0;
>>>>>    -    kvm_set_invalid_pte(ptep);
>>>>> -
>>>>>        /*
>>>>> -     * Invalidate the whole stage-2, as we may have numerous leaf
>>>>> -     * entries below us which would otherwise need invalidating
>>>>> -     * individually.
>>>>> +     * If we need to coalesce existing table entries into a block here,
>>>>> +     * then install the block entry first and the sub-level page mappings
>>>>> +     * will be unmapped later.
>>>>>         */
>>>>> -    kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>>>        data->anchor = ptep;
>>>>> +    data->follow = kvm_pte_follow(*ptep);
>>>>> +    stage2_coalesce_tables_into_block(addr, level, ptep, data);
>>>> Here's how stage2_coalesce_tables_into_block() is implemented from the previous
>>>> patch (it might be worth merging it with this patch, I found it impossible to
>>>> judge if the function is correct without seeing how it is used and what is
>>>> replacing):
>>> Ok, will do this if a v2 is going to be posted.
>>>> static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
>>>>                             kvm_pte_t *ptep,
>>>>                             struct stage2_map_data *data)
>>>> {
>>>>       u64 granule = kvm_granule_size(level), phys = data->phys;
>>>>       kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);
>>>>
>>>>       kvm_set_invalid_pte(ptep);
>>>>
>>>>       /*
>>>>        * Invalidate the whole stage-2, as we may have numerous leaf entries
>>>>        * below us which would otherwise need invalidating individually.
>>>>        */
>>>>       kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>>       smp_store_release(ptep, new);
>>>>       data->phys += granule;
>>>> }
>>>>
>>>> This works because __kvm_pgtable_visit() saves the *ptep value before calling
>>>> the
>>>> pre callback, and it visits the next level table based on the initial pte value,
>>>> not the new value written by stage2_coalesce_tables_into_block().
>>> Right. So before replacing the initial pte value with the new value, we have
>>> to use
>>> *data->follow = kvm_pte_follow(*ptep)* in stage2_map_walk_table_pre() to save
>>> the initial pte value in advance. And data->follow will be used when  we start to
>>> unmap the old sub-level tables later.
>> Right, stage2_map_walk_table_post() will use data->follow to free the table page
>> which is no longer needed because we've replaced the entire next level table with
>> a block mapping.
>>
>>>> Assuming the first patch in the series is merged ("KVM: arm64: Move the clean of
>>>> dcache to the map handler"), this function is missing the CMOs from
>>>> stage2_map_walker_try_leaf().
>>> Yes, the CMOs are not performed in stage2_coalesce_tables_into_block() currently,
>>> because I thought they were not needed when we rebuild the block mappings from
>>> normal page mappings.
>> This assumes that the *only* situation when we replace a table entry with a block
>> mapping is when the next level table (or tables) is *fully* populated. Is there a
>> way to prove that this is true? I think it's important to prove it unequivocally,
>> because if there's a corner case where this doesn't happen and we remove the
>> dcache maintenance, we can end up with hard to reproduce and hard to diagnose
>> errors in a guest.
> So there is still one thing left to determine about this patch, and that is
> whether we can simply
> discard the CMOs in stage2_coalesce_tables_into_block() or whether we should
> distinguish between different situations.
>
> Now that we know the situation you described won't happen, I think we
> will only end up
> in stage2_coalesce_tables_into_block() in the following situation:
> 1) KVM creates a new block mapping in stage2_map_walker_try_leaf() for the first
> time, if the guest accesses
>     memory backed by a THP/HUGETLB huge page. And the CMOs will be performed here.
> 2) KVM splits this block mapping during dirty logging, and builds only one new page
> mapping.
> 3) KVM builds the other new page mappings lazily during dirty logging, if the guest
> accesses any other pages
>     within the block. *At this stage, the pages in this block may be fully mapped,
> or may not be.*
> 4) After dirty logging is disabled, KVM decides to rebuild the block mapping.
>
> Do we still have to perform CMOs when rebuilding the block mapping in step 4, if
> the pages in the block
> were not fully mapped in step 3? I'm not completely sure about this.

Did some digging and this is my understanding of what is happening. Please correct
me if I get something wrong.

When the kernel coalesces the userspace PTEs into a transparent hugepage, KVM will
unmap the old mappings and mark the PMD table as invalidated via the MMU
notifiers. If there is a table at the PMD level while the corresponding entry is a
block mapping in the userspace translation tables, it means that the table was
created *after* the userspace block mapping was created.

user_mem_abort() will create a PAGE_SIZE mapping when the backing userspace
mapping is a block mapping in the following situations:

1. The start of the userspace block mapping is not aligned to the start of the
stage 2 block mapping (see fault_supports_stage2_huge_mapping()).

2. The stage 2 block mapping falls outside the memslot (see
fault_supports_stage2_huge_mapping()).

3. The memslot logs dirty pages.
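
The three cases can be read as a simple fallback check (a sketch of the
decision only, not the actual user_mem_abort() or
fault_supports_stage2_huge_mapping() code; all names below are made up):

#include <stdbool.h>

static unsigned long pick_stage2_map_size(bool hva_aligned_to_block,
					  bool block_fits_in_memslot,
					  bool memslot_logs_dirty_pages,
					  unsigned long page_size,
					  unsigned long block_size)
{
	if (!hva_aligned_to_block)	/* case 1 */
		return page_size;
	if (!block_fits_in_memslot)	/* case 2 */
		return page_size;
	if (memslot_logs_dirty_pages)	/* case 3 */
		return page_size;
	return block_size;
}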

For 1 and 2, the only scenario in which we can use a stage 2 block mapping for the
faulting IPA is if the memslot is modified, and that means the IPA range will have
been unmapped first, which destroys the PMD table entry (kvm_set_memslot() will
call kvm_arch_flush_shadow_memslot because change == KVM_MR_MOVE).

This leaves us with scenario 3. We can get in this scenario if the memslot is
logging and the userspace mapping has been coalesced into a transparent huge page
before dirty logging was set or if the userspace mapping is a hugetlb page. To
allow a block mapping at stage 2, we first need to remove the
KVM_MEM_LOG_DIRTY_PAGES flag from the memslot. Then we need to get a dabt in the
IPA range backed by the userspace block mapping. At this point there's nothing to
guarantee that the *entire* IPA range backed by the userspace block mapping is
mapped at stage 2.

In this case, we definitely need to do dcache maintenance because the guest might
be running with the MMU off and doing loads from PoC (assuming no FWB), and
whatever userspace wrote in the guest memory (like the kernel image) might still
be in the dcache. We also need to do the icache inval after the dcache clean +
inval because instruction fetches can be cached even if the MMU is off.
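
A minimal sketch of that maintenance sequence (the two helpers are empty
placeholders for the arch cache maintenance routines, not the real arm64 API;
what matters is the order and that the dcache is cleaned+invalidated to the PoC):

#include <stddef.h>

/* Placeholder CMO helpers, not the actual arm64 functions. */
static void dcache_clean_inval_poc(void *va, size_t size) { (void)va; (void)size; }
static void icache_inval_range(void *va, size_t size) { (void)va; (void)size; }

/*
 * CMOs for a newly installed block mapping when FWB is not available:
 * clean+invalidate the dcache to PoC first (the guest may be running with
 * the MMU off), then invalidate the icache (instruction fetches can be
 * cached even with the MMU off).
 */
static void block_mapping_cmos(void *va, size_t granule)
{
	dcache_clean_inval_poc(va, granule);
	icache_inval_range(va, granule);
}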

Thanks,

Alex

>
> Thanks,
>
> Yanan
>>> At least, they are not needed if we rebuild the block mappings backed by
>>> hugetlbfs
>>> pages, because we must have built the new block mappings for the first time
>>> before
>>> and now need to rebuild them after they were split in dirty logging. Can we
>>> agree on this?
>>> Then let's see the following situation.
>>>> I can think of the following situation where they
>>>> are needed:
>>>>
>>>> 1. The 2nd level (PMD) table that will be turned into a block is mapped at
>>>> stage 2
>>>> because one of the pages in the 3rd level (PTE) table it points to is
>>>> accessed by
>>>> the guest.
>>>>
>>>> 2. The kernel decides to turn the userspace mapping into a transparent huge page
>>>> and calls the mmu notifier to remove the mapping from stage 2. The 2nd level
>>>> table
>>>> is still valid.
>>> I have a question here. Won't the PMD entry have been invalidated too in this case?
>>> If the removal of the stage 2 mapping by the mmu notifier is an unmap operation on a range,
>>> then it's correct and reasonable to both invalidate the PMD entry and free the
>>> PTE table.
>>> As far as I know, kvm_pgtable_stage2_unmap() does so when unmapping a range.
>>>
>>> And if I am right about this, we will not end up in
>>> stage2_coalesce_tables_into_block()
>>> as step 3 describes, but in stage2_map_walker_try_leaf() instead, because the
>>> PMD entry
>>> is invalid, so KVM will create a new 2M block mapping.
>> Looking at the code for stage2_unmap_walker(), I believe you are correct. After
>> the entire PTE table has been unmapped, the function will mark the PMD entry as
>> invalid. In the situation I described, at step 3 we would end up in the leaf
>> mapper function because the PMD entry is invalid. My example was wrong.
>>
>>> If I'm wrong about this, then I think this is a valid situation.
>>>> 3. Guest accesses a page which is not the page it accessed at step 1, which
>>>> causes
>>>> a translation fault. KVM decides we can use a PMD block mapping to map the
>>>> address
>>>> and we end up in stage2_coalesce_tables_into_block(). We need CMOs in this case
>>>> because the guest accesses memory it didn't access before.
>>>>
>>>> What do you think, is that a valid situation?
>>>>>        return 0;
>>>>>    }
>>>>>    @@ -614,20 +614,18 @@ static int stage2_map_walk_table_post(u64 addr, u64
>>>>> end, u32 level,
>>>>>                          kvm_pte_t *ptep,
>>>>>                          struct stage2_map_data *data)
>>>>>    {
>>>>> -    int ret = 0;
>>>>> -
>>>>>        if (!data->anchor)
>>>>>            return 0;
>>>>>    -    free_page((unsigned long)kvm_pte_follow(*ptep));
>>>>> -    put_page(virt_to_page(ptep));
>>>>> -
>>>>> -    if (data->anchor == ptep) {
>>>>> +    if (data->anchor != ptep) {
>>>>> +        free_page((unsigned long)kvm_pte_follow(*ptep));
>>>>> +        put_page(virt_to_page(ptep));
>>>>> +    } else {
>>>>> +        free_page((unsigned long)data->follow);
>>>>>            data->anchor = NULL;
>>>>> -        ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
>>>> stage2_map_walk_leaf() -> stage2_map_walk_table_post calls put_page() and
>>>> get_page() once in our case (valid old mapping). It looks to me like we're
>>>> missing
>>>> a put_page() call when the function is called for the anchor. Have you found the
>>>> call to be unnecessary?
>>> Before this patch:
>>> When we find data->anchor == ptep, put_page() has been called once in advance
>>> for the anchor
>>> in stage2_map_walk_table_post(). Then we call stage2_map_walk_leaf() ->
>>> stage2_map_walker_try_leaf()
>>> to install the block entry, and only get_page() will be called once in
>>> stage2_map_walker_try_leaf().
>>> There is a put_page() followed by a get_page() for the anchor, so there will
>>> not be a problem with the
>>> page counts.
>> This is how I'm reading the code before your patch:
>>
>> - stage2_map_walk_table_post() returns early if there is no anchor.
>>
>> - stage2_map_walk_table_pre() sets the anchor and marks the entry as invalid. The
>> entry was a table so the leaf visitor is not called in __kvm_pgtable_visit().
>>
>> - __kvm_pgtable_visit() visits the next level table.
>>
>> - stage2_map_walk_table_post() calls put_page(), then calls stage2_map_walk_leaf() ->
>> stage2_map_walker_try_leaf(). The old entry was invalidated by the pre visitor, so
>> it only calls get_page() (and not put_page() + get_page()).
>>
>> I agree with your conclusion, I didn't realize that because the pre visitor marks
>> the entry as invalid, stage2_map_walker_try_leaf() will not call put_page().
>>
>>> After this patch:
>>> There is no put_page() call for the anchor, either before or after we find
>>> data->anchor == ptep.
>>> This is because we didn't call get_page() either in
>>> stage2_coalesce_tables_into_block() when
>>> installing the block entry. So I think there will not be a problem here either.
>> I agree, the refcount will be identical.
>>
>>> Is the above the right answer to your point?
>> Yes, thank you for clearing that up for me.
>>
>> Thanks,
>>
>> Alex
>>
>>>>>        }
>>>>>    -    return ret;
>>>>> +    return 0;
>>>> I think it's correct for this function to succeed unconditionally. The error was
>>>> coming from stage2_map_walk_leaf() -> stage2_map_walker_try_leaf(). The function
>>>> can return an error code if block mapping is not supported, which we know is
>>>> supported because we have an anchor, and if only the permissions are different
>>>> between the old and the new entry, but in our case we've changed both the valid
>>>> and type bits.
>>> Agreed. Besides, we will definitely not end up updating an old valid entry for
>>> the anchor
>>> in stage2_map_walker_try_leaf(), because *anchor has already been invalidated in
>>> stage2_map_walk_table_pre() before setting the anchor, so it will look like building
>>> a new mapping.
>>>
>>> Thanks,
>>>
>>> Yanan
>>>> Thanks,
>>>>
>>>> Alex
>>>>
>>>>>    }
>>>>>      /*
>>>> .
>> .

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings
@ 2021-03-19 15:07             ` Alexandru Elisei
  0 siblings, 0 replies; 80+ messages in thread
From: Alexandru Elisei @ 2021-03-19 15:07 UTC (permalink / raw)
  To: wangyanan (Y)
  Cc: Marc Zyngier, Will Deacon, Catalin Marinas, Julien Thierry,
	James Morse, Suzuki K Poulose, Quentin Perret, Gavin Shan,
	kvmarm, linux-arm-kernel, kvm, linux-kernel

Hi Yanan,

Sorry for taking so long to reply, been busy with other things unfortunately. I
did notice that you sent a new version of this series, but I would like to
continue our discussion on this patch, since it's easier to get the full context.

On 3/4/21 7:07 AM, wangyanan (Y) wrote:
> Hi Alex,
>
> On 2021/3/4 1:27, Alexandru Elisei wrote:
>> Hi Yanan,
>>
>> On 3/3/21 11:04 AM, wangyanan (Y) wrote:
>>> Hi Alex,
>>>
>>> On 2021/3/3 1:13, Alexandru Elisei wrote:
>>>> Hello,
>>>>
>>>> On 2/8/21 11:22 AM, Yanan Wang wrote:
>>>>> When KVM needs to coalesce the normal page mappings into a block mapping,
>>>>> we currently invalidate the old table entry first followed by invalidation
>>>>> of TLB, then unmap the page mappings, and install the block entry at last.
>>>>>
>>>>> It will cost a long time to unmap the numerous page mappings, which means
>>>>> there will be a long period when the table entry can be found invalid.
>>>>> If other vCPUs access any guest page within the block range and find the
>>>>> table entry invalid, they will all exit from guest with a translation fault
>>>>> which is not necessary. And KVM will make efforts to handle these faults,
>>>>> especially when performing CMOs by block range.
>>>>>
>>>>> So let's quickly install the block entry at first to ensure uninterrupted
>>>>> memory access of the other vCPUs, and then unmap the page mappings after
>>>>> installation. This will reduce most of the time when the table entry is
>>>>> invalid, and avoid most of the unnecessary translation faults.
>>>> I'm not convinced I've fully understood what is going on yet, but it seems to me
>>>> that the idea is sound. Some questions and comments below.
>>> What I am trying to do in this patch is to adjust the order of rebuilding block
>>> mappings from page mappings.
>>> Take the rebuilding of 1G block mappings as an example.
>>> Before this patch, the order is like:
>>> 1) invalidate the table entry of the 1st level(PUD)
>>> 2) flush TLB by VMID
>>> 3) unmap the old PMD/PTE tables
>>> 4) install the new block entry to the 1st level(PUD)
>>>
>>> So entry in the 1st level can be found invalid by other vcpus in 1), 2), and 3),
>>> and it's a long time in 3) to unmap
>>> the numerous old PMD/PTE tables, which means the total time of the entry being
>>> invalid is long enough to
>>> affect the performance.
>>>
>>> After this patch, the order is like:
>>> 1) invalidate the table entry of the 1st level(PUD)
>>> 2) flush TLB by VMID
>>> 3) install the new block entry to the 1st level(PUD)
>>> 4) unmap the old PMD/PTE tables
>>>
>>> The change ensures that period of entry in the 1st level(PUD) being invalid is
>>> only in 1) and 2),
>>> so if other vcpus access memory within 1G, there will be less chance to find the
>>> entry invalid
>>> and as a result trigger an unnecessary translation fault.
>> Thank you for the explanation, that was my understanding of it too, and I believe
>> your idea is correct. I was more concerned that I got some of the details wrong,
>> and you have kindly corrected me below.
>>
>>>>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>>>>> ---
>>>>>    arch/arm64/kvm/hyp/pgtable.c | 26 ++++++++++++--------------
>>>>>    1 file changed, 12 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>>>> index 78a560446f80..308c36b9cd21 100644
>>>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>>>> @@ -434,6 +434,7 @@ struct stage2_map_data {
>>>>>        kvm_pte_t            attr;
>>>>>          kvm_pte_t            *anchor;
>>>>> +    kvm_pte_t            *follow;
>>>>>          struct kvm_s2_mmu        *mmu;
>>>>>        struct kvm_mmu_memory_cache    *memcache;
>>>>> @@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end,
>>>>> u32 level,
>>>>>        if (!kvm_block_mapping_supported(addr, end, data->phys, level))
>>>>>            return 0;
>>>>>    -    kvm_set_invalid_pte(ptep);
>>>>> -
>>>>>        /*
>>>>> -     * Invalidate the whole stage-2, as we may have numerous leaf
>>>>> -     * entries below us which would otherwise need invalidating
>>>>> -     * individually.
>>>>> +     * If we need to coalesce existing table entries into a block here,
>>>>> +     * then install the block entry first and the sub-level page mappings
>>>>> +     * will be unmapped later.
>>>>>         */
>>>>> -    kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>>>        data->anchor = ptep;
>>>>> +    data->follow = kvm_pte_follow(*ptep);
>>>>> +    stage2_coalesce_tables_into_block(addr, level, ptep, data);
>>>> Here's how stage2_coalesce_tables_into_block() is implemented from the previous
>>>> patch (it might be worth merging it with this patch, I found it impossible to
>>>> judge if the function is correct without seeing how it is used and what is
>>>> replacing):
>>> Ok, will do this if v2 is going to be post.
>>>> static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
>>>>                             kvm_pte_t *ptep,
>>>>                             struct stage2_map_data *data)
>>>> {
>>>>       u64 granule = kvm_granule_size(level), phys = data->phys;
>>>>       kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);
>>>>
>>>>       kvm_set_invalid_pte(ptep);
>>>>
>>>>       /*
>>>>        * Invalidate the whole stage-2, as we may have numerous leaf entries
>>>>        * below us which would otherwise need invalidating individually.
>>>>        */
>>>>       kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>>       smp_store_release(ptep, new);
>>>>       data->phys += granule;
>>>> }
>>>>
>>>> This works because __kvm_pgtable_visit() saves the *ptep value before calling
>>>> the
>>>> pre callback, and it visits the next level table based on the initial pte value,
>>>> not the new value written by stage2_coalesce_tables_into_block().
>>> Right. So before replacing the initial pte value with the new value, we have
>>> to use
>>> *data->follow = kvm_pte_follow(*ptep)* in stage2_map_walk_table_pre() to save
>>> the initial pte value in advance. And data->follow will be used when  we start to
>>> unmap the old sub-level tables later.
>> Right, stage2_map_walk_table_post() will use data->follow to free the table page
>> which is no longer needed because we've replaced the entire next level table with
>> a block mapping.
>>
>>>> Assuming the first patch in the series is merged ("KVM: arm64: Move the clean of
>>>> dcache to the map handler"), this function is missing the CMOs from
>>>> stage2_map_walker_try_leaf().
>>> Yes, the CMOs are not performed in stage2_coalesce_tables_into_block() currently,
>>> because I thought they were not needed when we rebuild the block mappings from
>>> normal page mappings.
>> This assumes that the *only* situation when we replace a table entry with a block
>> mapping is when the next level table (or tables) is *fully* populated. Is there a
>> way to prove that this is true? I think it's important to prove it unequivocally,
>> because if there's a corner case where this doesn't happen and we remove the
>> dcache maintenance, we can end up with hard to reproduce and hard to diagnose
>> errors in a guest.
> So there is still one thing left to determine about this patch: whether we can
> simply discard the CMOs in stage2_coalesce_tables_into_block(), or whether we
> should distinguish between different situations.
>
> Now that we know the situation you described won't happen, I think we will only
> end up in stage2_coalesce_tables_into_block() in the following situation:
> 1) KVM creates a new block mapping in stage2_map_walker_try_leaf() for the first
>    time, if the guest accesses memory backed by a THP/HUGETLB huge page. CMOs
>    will be performed here.
> 2) KVM splits this block mapping during dirty logging and builds only one new
>    page mapping.
> 3) KVM lazily builds other new page mappings during dirty logging, if the guest
>    accesses any other pages within the block. *At this stage, the pages in this
>    block may or may not be fully mapped.*
> 4) After dirty logging is disabled, KVM decides to rebuild the block mapping.
>
> Do we still have to perform CMOs when rebuilding the block mapping in step 4, if
> the pages in the block were not fully mapped in step 3? I'm not completely sure
> about this.

Did some digging and this is my understanding of what is happening. Please correct
me if I get something wrong.

When the kernel coalesces the userspace PTEs into a transparent hugepage, KVM will
unmap the old mappings and mark the PMD table as invalidated via the MMU
notifiers. To have a table at the PMD level while the corresponding entry is a
block mapping in the userspace translation tables, it means that the table was
created *after* the userspace block mapping was created.

user_mem_abort() will create a PAGE_SIZE mapping when the backing userspace
mapping is a block mapping in the following situations:

1. The start of the userspace block mapping is not aligned to the start of the
stage 2 block mapping (see fault_supports_stage2_huge_mapping()).

2. The stage 2 block mapping falls outside the memslot (see
fault_supports_stage2_huge_mapping()).

3. The memslot logs dirty pages.

For 1 and 2, the only scenario in which we can use a stage 2 block mapping for the
faulting IPA is if the memslot is modified, and that means the IPA range will have
been unmapped first, which destroys the PMD table entry (kvm_set_memslot() will
call kvm_arch_flush_shadow_memslot because change == KVM_MR_MOVE).

This leaves us with scenario 3. We can get in this scenario if the memslot is
logging and the userspace mapping has been coalesced into a transparent huge page
before dirty logging was set or if the userspace mapping is a hugetlb page. To
allow a block mapping at stage 2, we first need to remove the
KVM_MEM_LOG_DIRTY_PAGES flag from the memslot. Then we need to get a dabt in the
IPA range backed by the userspace block mapping. At this point there's nothing to
guarantee that the *entire* IPA range backed by the userspace block mapping is
mapped at stage 2.

In this case, we definitely need to do dcache maintenance because the guest might
be running with the MMU off and doing loads from PoC (assuming no FWB), and
whatever userspace wrote in the guest memory (like the kernel image) might still
be in the dcache. We also need to do the icache inval after the dcache clean +
inval because instruction fetches can be cached even if the MMU is off.
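
As a rough illustration of that ordering (this is not the KVM code itself, it
ignores FWB and the aliasing/VPIPT icache cases, and it assumes the arm64
helpers available around v5.11), the maintenance would look something like:

#include <asm/cacheflush.h>

/*
 * Sketch only: make a range of guest memory safe to expose through a new
 * stage 2 mapping to a guest that may be running with its MMU/caches off.
 */
static void sketch_prepare_guest_range(void *va, unsigned long size)
{
        /*
         * Clean+invalidate the dcache to PoC first, so the guest sees
         * whatever userspace wrote there (e.g. the kernel image).
         */
        __flush_dcache_area(va, size);

        /*
         * Only after the dcache maintenance, invalidate the icache so no
         * stale instructions can be fetched even with the MMU off.
         */
        invalidate_icache_range((unsigned long)va, (unsigned long)va + size);
}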

Thanks,

Alex

>
> Thanks,
>
> Yanan
>>> At least, they are not needed if we rebuild the block mappings backed by
>>> hugetlbfs
>>> pages, because we must have built the new block mappings for the first time
>>> before
>>> and now need to rebuild them after they were split in dirty logging. Can we
>>> agree on this?
>>> Then let's see the following situation.
>>>> I can think of the following situation where they
>>>> are needed:
>>>>
>>>> 1. The 2nd level (PMD) table that will be turned into a block is mapped at
>>>> stage 2
>>>> because one of the pages in the 3rd level (PTE) table it points to is
>>>> accessed by
>>>> the guest.
>>>>
>>>> 2. The kernel decides to turn the userspace mapping into a transparent huge page
>>>> and calls the mmu notifier to remove the mapping from stage 2. The 2nd level
>>>> table
>>>> is still valid.
>>> I have a question here. Won't the PMD entry be invalidated too in this case?
>>> If the removal of the stage 2 mapping by the mmu notifier is an unmap operation
>>> on a range, then it's correct and reasonable to both invalidate the PMD entry
>>> and free the PTE table.
>>> As I know, kvm_pgtable_stage2_unmap() does so when unmapping a range.
>>>
>>> And if I was right about this, we will not end up in
>>> stage2_coalesce_tables_into_block()
>>> like step 3 describes, but in stage2_map_walker_try_leaf() instead. Because the
>>> PMD entry
>>> is invalid, so KVM will create the new 2M block mapping.
>> Looking at the code for stage2_unmap_walker(), I believe you are correct. After
>> the entire PTE table has been unmapped, the function will mark the PMD entry as
>> invalid. In the situation I described, at step 3 we would end up in the leaf
>> mapper function because the PMD entry is invalid. My example was wrong.
>>
>>> If I'm wrong about this, then I think this is a valid situation.
>>>> 3. Guest accesses a page which is not the page it accessed at step 1, which
>>>> causes
>>>> a translation fault. KVM decides we can use a PMD block mapping to map the
>>>> address
>>>> and we end up in stage2_coalesce_tables_into_block(). We need CMOs in this case
>>>> because the guest accesses memory it didn't access before.
>>>>
>>>> What do you think, is that a valid situation?
>>>>>        return 0;
>>>>>    }
>>>>>    @@ -614,20 +614,18 @@ static int stage2_map_walk_table_post(u64 addr, u64
>>>>> end, u32 level,
>>>>>                          kvm_pte_t *ptep,
>>>>>                          struct stage2_map_data *data)
>>>>>    {
>>>>> -    int ret = 0;
>>>>> -
>>>>>        if (!data->anchor)
>>>>>            return 0;
>>>>>    -    free_page((unsigned long)kvm_pte_follow(*ptep));
>>>>> -    put_page(virt_to_page(ptep));
>>>>> -
>>>>> -    if (data->anchor == ptep) {
>>>>> +    if (data->anchor != ptep) {
>>>>> +        free_page((unsigned long)kvm_pte_follow(*ptep));
>>>>> +        put_page(virt_to_page(ptep));
>>>>> +    } else {
>>>>> +        free_page((unsigned long)data->follow);
>>>>>            data->anchor = NULL;
>>>>> -        ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
>>>> stage2_map_walk_leaf() -> stage2_map_walk_table_post calls put_page() and
>>>> get_page() once in our case (valid old mapping). It looks to me like we're
>>>> missing
>>>> a put_page() call when the function is called for the anchor. Have you found the
>>>> call to be unnecessary?
>>> Before this patch:
>>> When we find data->anchor == ptep, put_page() has been called once in advance
>>> for the anchor
>>> in stage2_map_walk_table_post(). Then we call stage2_map_walk_leaf() ->
>>> stage2_map_walker_try_leaf()
>>> to install the block entry, and only get_page() will be called once in
>>> stage2_map_walker_try_leaf().
>>> There is a put_page() followed by a get_page() for the anchor, and there will
>>> not be a problem about
>>> page_counts.
>> This is how I'm reading the code before your patch:
>>
>> - stage2_map_walk_table_post() returns early if there is no anchor.
>>
>> - stage2_map_walk_table_pre() sets the anchor and marks the entry as invalid. The
>> entry was a table so the leaf visitor is not called in __kvm_pgtable_visit().
>>
>> - __kvm_pgtable_visit() visits the next level table.
>>
>> - stage2_map_walk_table_post() calls put_page(), then calls stage2_map_walk_leaf() ->
>> stage2_map_walker_try_leaf(). The old entry was invalidated by the pre visitor, so
>> it only calls get_page() (and not put_page() + get_page()).
>>
>> I agree with your conclusion; I didn't realize that, because the pre visitor marks
>> the entry as invalid, stage2_map_walker_try_leaf() will not call put_page().
>>
>>> After this patch:
>>> Neither before nor after we find data->anchor == ptep is there a put_page()
>>> call for the anchor.
>>> This is because we didn't call get_page() either in
>>> stage2_coalesce_tables_into_block() when installing the block entry, so I
>>> think there will not be a problem either.
>> I agree, the refcount will be identical.
>>
>>> Is the above the right answer to your point?
>> Yes, thank you for clearing that up for me.
>>
>> Thanks,
>>
>> Alex
>>
>>>>>        }
>>>>>    -    return ret;
>>>>> +    return 0;
>>>> I think it's correct for this function to succeed unconditionally. The error was
>>>> coming from stage2_map_walk_leaf() -> stage2_map_walker_try_leaf(). The function
>>>> can return an error code if block mapping is not supported, which we know is
>>>> supported because we have an anchor, and if only the permissions are different
>>>> between the old and the new entry, but in our case we've changed both the valid
>>>> and type bits.
>>> Agreed. Besides, we will definitely not end up updating an old valid entry for
>>> the anchor
>>> in stage2_map_walker_try_leaf(), because *anchor has already been invalidated in
>>> stage2_map_walk_table_pre() before setting the anchor, so it will look like
>>> building a new mapping.
>>>
>>> Thanks,
>>>
>>> Yanan
>>>> Thanks,
>>>>
>>>> Alex
>>>>
>>>>>    }
>>>>>      /*
>>>> .
>> .

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings
  2021-03-19 15:07             ` Alexandru Elisei
  (?)
@ 2021-03-22 13:19               ` wangyanan (Y)
  -1 siblings, 0 replies; 80+ messages in thread
From: wangyanan (Y) @ 2021-03-22 13:19 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: Marc Zyngier, Will Deacon, Catalin Marinas, Julien Thierry,
	James Morse, Suzuki K Poulose, Quentin Perret, Gavin Shan,
	kvmarm, linux-arm-kernel, kvm, linux-kernel

Hi Alex,

On 2021/3/19 23:07, Alexandru Elisei wrote:
> Hi Yanan,
>
> Sorry for taking so long to reply, been busy with other things unfortunately.
Still appreciate your patient reply! :)
> I
> did notice that you sent a new version of this series, but I would like to
> continue our discussion on this patch, since it's easier to get the full context.
>
> On 3/4/21 7:07 AM, wangyanan (Y) wrote:
>> Hi Alex,
>>
>> On 2021/3/4 1:27, Alexandru Elisei wrote:
>>> Hi Yanan,
>>>
>>> On 3/3/21 11:04 AM, wangyanan (Y) wrote:
>>>> Hi Alex,
>>>>
>>>> On 2021/3/3 1:13, Alexandru Elisei wrote:
>>>>> Hello,
>>>>>
>>>>> On 2/8/21 11:22 AM, Yanan Wang wrote:
>>>>>> When KVM needs to coalesce the normal page mappings into a block mapping,
>>>>>> we currently invalidate the old table entry first followed by invalidation
>>>>>> of TLB, then unmap the page mappings, and install the block entry at last.
>>>>>>
>>>>>> It will cost a long time to unmap the numerous page mappings, which means
>>>>>> there will be a long period when the table entry can be found invalid.
>>>>>> If other vCPUs access any guest page within the block range and find the
>>>>>> table entry invalid, they will all exit from guest with a translation fault
>>>>>> which is not necessary. And KVM will make efforts to handle these faults,
>>>>>> especially when performing CMOs by block range.
>>>>>>
>>>>>> So let's quickly install the block entry at first to ensure uninterrupted
>>>>>> memory access of the other vCPUs, and then unmap the page mappings after
>>>>>> installation. This will reduce most of the time when the table entry is
>>>>>> invalid, and avoid most of the unnecessary translation faults.
>>>>> I'm not convinced I've fully understood what is going on yet, but it seems to me
>>>>> that the idea is sound. Some questions and comments below.
>>>> What I am trying to do in this patch is to adjust the order of rebuilding block
>>>> mappings from page mappings.
>>>> Take the rebuilding of 1G block mappings as an example.
>>>> Before this patch, the order is like:
>>>> 1) invalidate the table entry of the 1st level(PUD)
>>>> 2) flush TLB by VMID
>>>> 3) unmap the old PMD/PTE tables
>>>> 4) install the new block entry to the 1st level(PUD)
>>>>
>>>> So entry in the 1st level can be found invalid by other vcpus in 1), 2), and 3),
>>>> and it's a long time in 3) to unmap
>>>> the numerous old PMD/PTE tables, which means the total time of the entry being
>>>> invalid is long enough to
>>>> affect the performance.
>>>>
>>>> After this patch, the order is like:
>>>> 1) invalidate the table entry of the 1st level(PUD)
>>>> 2) flush TLB by VMID
>>>> 3) install the new block entry to the 1st level(PUD)
>>>> 4) unmap the old PMD/PTE tables
>>>>
>>>> The change ensures that period of entry in the 1st level(PUD) being invalid is
>>>> only in 1) and 2),
>>>> so if other vcpus access memory within 1G, there will be less chance to find the
>>>> entry invalid
>>>> and as a result trigger an unnecessary translation fault.
>>> Thank you for the explanation, that was my understanding of it too, and I believe
>>> your idea is correct. I was more concerned that I got some of the details wrong,
>>> and you have kindly corrected me below.
>>>
>>>>>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>>>>>> ---
>>>>>>     arch/arm64/kvm/hyp/pgtable.c | 26 ++++++++++++--------------
>>>>>>     1 file changed, 12 insertions(+), 14 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>>>>> index 78a560446f80..308c36b9cd21 100644
>>>>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>>>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>>>>> @@ -434,6 +434,7 @@ struct stage2_map_data {
>>>>>>         kvm_pte_t            attr;
>>>>>>           kvm_pte_t            *anchor;
>>>>>> +    kvm_pte_t            *follow;
>>>>>>           struct kvm_s2_mmu        *mmu;
>>>>>>         struct kvm_mmu_memory_cache    *memcache;
>>>>>> @@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end,
>>>>>> u32 level,
>>>>>>         if (!kvm_block_mapping_supported(addr, end, data->phys, level))
>>>>>>             return 0;
>>>>>>     -    kvm_set_invalid_pte(ptep);
>>>>>> -
>>>>>>         /*
>>>>>> -     * Invalidate the whole stage-2, as we may have numerous leaf
>>>>>> -     * entries below us which would otherwise need invalidating
>>>>>> -     * individually.
>>>>>> +     * If we need to coalesce existing table entries into a block here,
>>>>>> +     * then install the block entry first and the sub-level page mappings
>>>>>> +     * will be unmapped later.
>>>>>>          */
>>>>>> -    kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>>>>         data->anchor = ptep;
>>>>>> +    data->follow = kvm_pte_follow(*ptep);
>>>>>> +    stage2_coalesce_tables_into_block(addr, level, ptep, data);
>>>>> Here's how stage2_coalesce_tables_into_block() is implemented from the previous
>>>>> patch (it might be worth merging it with this patch, I found it impossible to
>>>>> judge if the function is correct without seeing how it is used and what is
>>>>> replacing):
>>>> Ok, will do this if v2 is going to be post.
>>>>> static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
>>>>>                              kvm_pte_t *ptep,
>>>>>                              struct stage2_map_data *data)
>>>>> {
>>>>>        u64 granule = kvm_granule_size(level), phys = data->phys;
>>>>>        kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);
>>>>>
>>>>>        kvm_set_invalid_pte(ptep);
>>>>>
>>>>>        /*
>>>>>         * Invalidate the whole stage-2, as we may have numerous leaf entries
>>>>>         * below us which would otherwise need invalidating individually.
>>>>>         */
>>>>>        kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>>>        smp_store_release(ptep, new);
>>>>>        data->phys += granule;
>>>>> }
>>>>>
>>>>> This works because __kvm_pgtable_visit() saves the *ptep value before calling
>>>>> the
>>>>> pre callback, and it visits the next level table based on the initial pte value,
>>>>> not the new value written by stage2_coalesce_tables_into_block().
>>>> Right. So before replacing the initial pte value with the new value, we have
>>>> to use
>>>> *data->follow = kvm_pte_follow(*ptep)* in stage2_map_walk_table_pre() to save
>>>> the initial pte value in advance. And data->follow will be used when  we start to
>>>> unmap the old sub-level tables later.
>>> Right, stage2_map_walk_table_post() will use data->follow to free the table page
>>> which is no longer needed because we've replaced the entire next level table with
>>> a block mapping.
>>>
>>>>> Assuming the first patch in the series is merged ("KVM: arm64: Move the clean of
>>>>> dcache to the map handler"), this function is missing the CMOs from
>>>>> stage2_map_walker_try_leaf().
>>>> Yes, the CMOs are not performed in stage2_coalesce_tables_into_block() currently,
>>>> because I thought they were not needed when we rebuild the block mappings from
>>>> normal page mappings.
>>> This assumes that the *only* situation when we replace a table entry with a block
>>> mapping is when the next level table (or tables) is *fully* populated. Is there a
>>> way to prove that this is true? I think it's important to prove it unequivocally,
>>> because if there's a corner case where this doesn't happen and we remove the
>>> dcache maintenance, we can end up with hard to reproduce and hard to diagnose
>>> errors in a guest.
>> So there is still one thing left to determine about this patch: whether we can
>> simply discard the CMOs in stage2_coalesce_tables_into_block(), or whether we
>> should distinguish between different situations.
>>
>> Now that we know the situation you described won't happen, I think we will only
>> end up in stage2_coalesce_tables_into_block() in the following situation:
>> 1) KVM creates a new block mapping in stage2_map_walker_try_leaf() for the first
>>    time, if the guest accesses memory backed by a THP/HUGETLB huge page. CMOs
>>    will be performed here.
>> 2) KVM splits this block mapping during dirty logging and builds only one new
>>    page mapping.
>> 3) KVM lazily builds other new page mappings during dirty logging, if the guest
>>    accesses any other pages within the block. *At this stage, the pages in this
>>    block may or may not be fully mapped.*
>> 4) After dirty logging is disabled, KVM decides to rebuild the block mapping.
>>
>> Do we still have to perform CMOs when rebuilding the block mapping in step 4, if
>> the pages in the block were not fully mapped in step 3? I'm not completely sure
>> about this.
> Did some digging and this is my understanding of what is happening. Please correct
> me if I get something wrong.
>
> When the kernel coalesces the userspace PTEs into a transparent hugepage, KVM will
> unmap the old mappings and mark the PMD table as invalidated via the MMU
> notifiers. To have a table at the PMD level while the corresponding entry is a
> block mapping in the userspace translation tables, it means that the table was
> created *after* the userspace block mapping was created.
>
> user_mem_abort() will create a PAGE_SIZE mapping when the backing userspace
> mapping is a block mapping in the following situations:
>
> 1. The start of the userspace block mapping is not aligned to the start of the
> stage 2 block mapping (see fault_supports_stage2_huge_mapping()).
>
> 2. The stage 2 block mapping falls outside the memslot (see
> fault_supports_stage2_huge_mapping()).
>
> 3. The memslot logs dirty pages.
>
> For 1 and 2, the only scenario in which we can use a stage 2 block mapping for the
> faulting IPA is if the memslot is modified, and that means the IPA range will have
> been unmapped first, which destroys the PMD table entry (kvm_set_memslot() will
> call kvm_arch_flush_shadow_memslot because change == KVM_MR_MOVE).
>
> This leaves us with scenario 3. We can get in this scenario if the memslot is
> logging and the userspace mapping has been coalesced into a transparent huge page
> before dirty logging was set or if the userspace mapping is a hugetlb page. To
> allow a block mapping at stage 2, we first need to remove the
> KVM_MEM_LOG_DIRTY_PAGES flag from the memslot. Then we need to get a dabt in the
> IPA range backed by the userspace block mapping. At this point there's nothing to
> guarantee that the *entire* IPA range backed by the userspace block mapping is
> mapped at stage 2.
I get your point and I think you are correct.
We can't ensure that dirty logging starts only after *all* the stage 2 block
mappings have been created for the first time by user_mem_abort(). So it's
possible that we create a PAGE_SIZE mapping during dirty logging for an IPA
backed by a huge page, while the corresponding IPA range has never been mapped
with a block at stage 2 before. When KVM then coalesces the page mappings into
a block after dirty logging, it actually ends up creating the block mapping for
the first time, and CMOs are needed in this case.

So in summary, the key point for needing CMOs is whether the next level table
(or tables) is *fully* populated, as you mentioned before. But checking whether
the tables are fully populated needs another page table walk over the IPA range,
which adds new complexity.
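
Just to illustrate that complexity (a purely hypothetical helper, not something
this series proposes), even a single-level check would mean another pass over
the table before we could decide to skip the CMOs:

/*
 * Hypothetical sketch: true if every slot of one next-level table already
 * holds a valid entry. A real check would also have to descend into any
 * sub-tables below it.
 */
static bool stage2_table_fully_mapped(kvm_pte_t *childp)
{
        int idx;

        for (idx = 0; idx < PTRS_PER_PTE; idx++) {
                if (!kvm_pte_valid(READ_ONCE(childp[idx])))
                        return false;
        }

        return true;
}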

I think the most concise and straightforward way is to still uniformly perform
CMOs whenever we need to coalesce tables into a block, which is exactly what the
previous code logic does.
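
For illustration only (this is not the code posted in this series, and it
assumes the stage2_pte_cacheable()/stage2_flush_dcache() helpers present in
arch/arm64/kvm/hyp/pgtable.c around v5.11), keeping the dcache maintenance
unconditional in the coalescing path would look roughly like:

static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
                                              kvm_pte_t *ptep,
                                              struct stage2_map_data *data)
{
        u64 granule = kvm_granule_size(level), phys = data->phys;
        kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);

        kvm_set_invalid_pte(ptep);

        /*
         * Invalidate the whole stage-2, as we may have numerous leaf entries
         * below us which would otherwise need invalidating individually.
         */
        kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);

        /*
         * Clean the dcache for the whole block before it becomes reachable,
         * since part of the range may never have been mapped (and therefore
         * never cleaned) at stage 2 before.
         */
        if (stage2_pte_cacheable(new))
                stage2_flush_dcache(kvm_pte_follow(new), granule);

        smp_store_release(ptep, new);
        data->phys += granule;
}

(The icache invalidation discussed earlier would also be needed for executable
mappings; it is left out here only to keep the sketch short.)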

Thanks,

Yanan
> In this case, we definitely need to do dcache maintenance because the guest might
> be running with the MMU off and doing loads from PoC (assuming no FWB), and
> whatever userspace wrote in the guest memory (like the kernel image) might still
> be in the dcache. We also need to do the icache inval after the dcache clean +
> inval because instruction fetches can be cached even if the MMU is off.
>
> Thanks,
>
> Alex
>
>> Thanks,
>>
>> Yanan
>>>> At least, they are not needed if we rebuild the block mappings backed by
>>>> hugetlbfs
>>>> pages, because we must have built the new block mappings for the first time
>>>> before
>>>> and now need to rebuild them after they were split in dirty logging. Can we
>>>> agree on this?
>>>> Then let's see the following situation.
>>>>> I can think of the following situation where they
>>>>> are needed:
>>>>>
>>>>> 1. The 2nd level (PMD) table that will be turned into a block is mapped at
>>>>> stage 2
>>>>> because one of the pages in the 3rd level (PTE) table it points to is
>>>>> accessed by
>>>>> the guest.
>>>>>
>>>>> 2. The kernel decides to turn the userspace mapping into a transparent huge page
>>>>> and calls the mmu notifier to remove the mapping from stage 2. The 2nd level
>>>>> table
>>>>> is still valid.
>>>> I have a question here. Won't the PMD entry be invalidated too in this case?
>>>> If the removal of the stage 2 mapping by the mmu notifier is an unmap operation
>>>> on a range, then it's correct and reasonable to both invalidate the PMD entry
>>>> and free the PTE table.
>>>> As I know, kvm_pgtable_stage2_unmap() does so when unmapping a range.
>>>>
>>>> And if I was right about this, we will not end up in
>>>> stage2_coalesce_tables_into_block()
>>>> like step 3 describes, but in stage2_map_walker_try_leaf() instead. Because the
>>>> PMD entry
>>>> is invalid, so KVM will create the new 2M block mapping.
>>> Looking at the code for stage2_unmap_walker(), I believe you are correct. After
>>> the entire PTE table has been unmapped, the function will mark the PMD entry as
>>> invalid. In the situation I described, at step 3 we would end up in the leaf
>>> mapper function because the PMD entry is invalid. My example was wrong.
>>>
>>>> If I'm wrong about this, then I think this is a valid situation.
>>>>> 3. Guest accesses a page which is not the page it accessed at step 1, which
>>>>> causes
>>>>> a translation fault. KVM decides we can use a PMD block mapping to map the
>>>>> address
>>>>> and we end up in stage2_coalesce_tables_into_block(). We need CMOs in this case
>>>>> because the guest accesses memory it didn't access before.
>>>>>
>>>>> What do you think, is that a valid situation?
>>>>>>         return 0;
>>>>>>     }
>>>>>>     @@ -614,20 +614,18 @@ static int stage2_map_walk_table_post(u64 addr, u64
>>>>>> end, u32 level,
>>>>>>                           kvm_pte_t *ptep,
>>>>>>                           struct stage2_map_data *data)
>>>>>>     {
>>>>>> -    int ret = 0;
>>>>>> -
>>>>>>         if (!data->anchor)
>>>>>>             return 0;
>>>>>>     -    free_page((unsigned long)kvm_pte_follow(*ptep));
>>>>>> -    put_page(virt_to_page(ptep));
>>>>>> -
>>>>>> -    if (data->anchor == ptep) {
>>>>>> +    if (data->anchor != ptep) {
>>>>>> +        free_page((unsigned long)kvm_pte_follow(*ptep));
>>>>>> +        put_page(virt_to_page(ptep));
>>>>>> +    } else {
>>>>>> +        free_page((unsigned long)data->follow);
>>>>>>             data->anchor = NULL;
>>>>>> -        ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
>>>>> stage2_map_walk_leaf() -> stage2_map_walk_table_post calls put_page() and
>>>>> get_page() once in our case (valid old mapping). It looks to me like we're
>>>>> missing
>>>>> a put_page() call when the function is called for the anchor. Have you found the
>>>>> call to be unnecessary?
>>>> Before this patch:
>>>> When we find data->anchor == ptep, put_page() has been called once in advance
>>>> for the anchor
>>>> in stage2_map_walk_table_post(). Then we call stage2_map_walk_leaf() ->
>>>> stage2_map_walker_try_leaf()
>>>> to install the block entry, and only get_page() will be called once in
>>>> stage2_map_walker_try_leaf().
>>>> There is a put_page() followed by a get_page() for the anchor, and there will
>>>> not be a problem about
>>>> page_counts.
>>> This is how I'm reading the code before your patch:
>>>
>>> - stage2_map_walk_table_post() returns early if there is no anchor.
>>>
>>> - stage2_map_walk_table_pre() sets the anchor and marks the entry as invalid. The
>>> entry was a table so the leaf visitor is not called in __kvm_pgtable_visit().
>>>
>>> - __kvm_pgtable_visit() visits the next level table.
>>>
>>> - stage2_map_walk_table_post() calls put_page(), then calls stage2_map_walk_leaf() ->
>>> stage2_map_walker_try_leaf(). The old entry was invalidated by the pre visitor, so
>>> it only calls get_page() (and not put_page() + get_page()).
>>>
>>> I agree with your conclusion; I didn't realize that, because the pre visitor marks
>>> the entry as invalid, stage2_map_walker_try_leaf() will not call put_page().
>>>
>>>> After this patch:
>>>> Neither before nor after we find data->anchor == ptep is there a put_page()
>>>> call for the anchor.
>>>> This is because we didn't call get_page() either in
>>>> stage2_coalesce_tables_into_block() when installing the block entry, so I
>>>> think there will not be a problem either.
>>> I agree, the refcount will be identical.
>>>
>>>> Is the above the right answer to your point?
>>> Yes, thank you for clearing that up for me.
>>>
>>> Thanks,
>>>
>>> Alex
>>>
>>>>>>         }
>>>>>>     -    return ret;
>>>>>> +    return 0;
>>>>> I think it's correct for this function to succeed unconditionally. The error was
>>>>> coming from stage2_map_walk_leaf() -> stage2_map_walker_try_leaf(). The function
>>>>> can return an error code if block mapping is not supported, which we know is
>>>>> supported because we have an anchor, and if only the permissions are different
>>>>> between the old and the new entry, but in our case we've changed both the valid
>>>>> and type bits.
>>>> Agreed. Besides, we will definitely not end up updating an old valid entry for
>>>> the anchor
>>>> in stage2_map_walker_try_leaf(), because *anchor has already been invalidated in
>>>> stage2_map_walk_table_pre() before setting the anchor, so it will look like
>>>> building a new mapping.
>>>>
>>>> Thanks,
>>>>
>>>> Yanan
>>>>> Thanks,
>>>>>
>>>>> Alex
>>>>>
>>>>>>     }
>>>>>>       /*
>>>>> .
>>> .
> .

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings
@ 2021-03-22 13:19               ` wangyanan (Y)
  0 siblings, 0 replies; 80+ messages in thread
From: wangyanan (Y) @ 2021-03-22 13:19 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: kvm, Marc Zyngier, linux-kernel, linux-arm-kernel,
	Catalin Marinas, Will Deacon, kvmarm

Hi Alex,

On 2021/3/19 23:07, Alexandru Elisei wrote:
> Hi Yanan,
>
> Sorry for taking so long to reply, been busy with other things unfortunately.
Still appreciate your patient reply! :)
> I
> did notice that you sent a new version of this series, but I would like to
> continue our discussion on this patch, since it's easier to get the full context.
>
> On 3/4/21 7:07 AM, wangyanan (Y) wrote:
>> Hi Alex,
>>
>> On 2021/3/4 1:27, Alexandru Elisei wrote:
>>> Hi Yanan,
>>>
>>> On 3/3/21 11:04 AM, wangyanan (Y) wrote:
>>>> Hi Alex,
>>>>
>>>> On 2021/3/3 1:13, Alexandru Elisei wrote:
>>>>> Hello,
>>>>>
>>>>> On 2/8/21 11:22 AM, Yanan Wang wrote:
>>>>>> When KVM needs to coalesce the normal page mappings into a block mapping,
>>>>>> we currently invalidate the old table entry first followed by invalidation
>>>>>> of TLB, then unmap the page mappings, and install the block entry at last.
>>>>>>
>>>>>> It will cost a long time to unmap the numerous page mappings, which means
>>>>>> there will be a long period when the table entry can be found invalid.
>>>>>> If other vCPUs access any guest page within the block range and find the
>>>>>> table entry invalid, they will all exit from guest with a translation fault
>>>>>> which is not necessary. And KVM will make efforts to handle these faults,
>>>>>> especially when performing CMOs by block range.
>>>>>>
>>>>>> So let's quickly install the block entry at first to ensure uninterrupted
>>>>>> memory access of the other vCPUs, and then unmap the page mappings after
>>>>>> installation. This will reduce most of the time when the table entry is
>>>>>> invalid, and avoid most of the unnecessary translation faults.
>>>>> I'm not convinced I've fully understood what is going on yet, but it seems to me
>>>>> that the idea is sound. Some questions and comments below.
>>>> What I am trying to do in this patch is to adjust the order of rebuilding block
>>>> mappings from page mappings.
>>>> Take the rebuilding of 1G block mappings as an example.
>>>> Before this patch, the order is like:
>>>> 1) invalidate the table entry of the 1st level(PUD)
>>>> 2) flush TLB by VMID
>>>> 3) unmap the old PMD/PTE tables
>>>> 4) install the new block entry to the 1st level(PUD)
>>>>
>>>> So entry in the 1st level can be found invalid by other vcpus in 1), 2), and 3),
>>>> and it's a long time in 3) to unmap
>>>> the numerous old PMD/PTE tables, which means the total time of the entry being
>>>> invalid is long enough to
>>>> affect the performance.
>>>>
>>>> After this patch, the order is like:
>>>> 1) invalidate the table entry of the 1st level(PUD)
>>>> 2) flush TLB by VMID
>>>> 3) install the new block entry to the 1st level(PUD)
>>>> 4) unmap the old PMD/PTE tables
>>>>
>>>> The change ensures that period of entry in the 1st level(PUD) being invalid is
>>>> only in 1) and 2),
>>>> so if other vcpus access memory within 1G, there will be less chance to find the
>>>> entry invalid
>>>> and as a result trigger an unnecessary translation fault.
>>> Thank you for the explanation, that was my understanding of it too, and I believe
>>> your idea is correct. I was more concerned that I got some of the details wrong,
>>> and you have kindly corrected me below.
>>>
>>>>>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>>>>>> ---
>>>>>>     arch/arm64/kvm/hyp/pgtable.c | 26 ++++++++++++--------------
>>>>>>     1 file changed, 12 insertions(+), 14 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>>>>> index 78a560446f80..308c36b9cd21 100644
>>>>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>>>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>>>>> @@ -434,6 +434,7 @@ struct stage2_map_data {
>>>>>>         kvm_pte_t            attr;
>>>>>>           kvm_pte_t            *anchor;
>>>>>> +    kvm_pte_t            *follow;
>>>>>>           struct kvm_s2_mmu        *mmu;
>>>>>>         struct kvm_mmu_memory_cache    *memcache;
>>>>>> @@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end,
>>>>>> u32 level,
>>>>>>         if (!kvm_block_mapping_supported(addr, end, data->phys, level))
>>>>>>             return 0;
>>>>>>     -    kvm_set_invalid_pte(ptep);
>>>>>> -
>>>>>>         /*
>>>>>> -     * Invalidate the whole stage-2, as we may have numerous leaf
>>>>>> -     * entries below us which would otherwise need invalidating
>>>>>> -     * individually.
>>>>>> +     * If we need to coalesce existing table entries into a block here,
>>>>>> +     * then install the block entry first and the sub-level page mappings
>>>>>> +     * will be unmapped later.
>>>>>>          */
>>>>>> -    kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>>>>         data->anchor = ptep;
>>>>>> +    data->follow = kvm_pte_follow(*ptep);
>>>>>> +    stage2_coalesce_tables_into_block(addr, level, ptep, data);
>>>>> Here's how stage2_coalesce_tables_into_block() is implemented from the previous
>>>>> patch (it might be worth merging it with this patch, I found it impossible to
>>>>> judge if the function is correct without seeing how it is used and what is
>>>>> replacing):
>>>> Ok, will do this if v2 is going to be post.
>>>>> static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
>>>>>                              kvm_pte_t *ptep,
>>>>>                              struct stage2_map_data *data)
>>>>> {
>>>>>        u64 granule = kvm_granule_size(level), phys = data->phys;
>>>>>        kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);
>>>>>
>>>>>        kvm_set_invalid_pte(ptep);
>>>>>
>>>>>        /*
>>>>>         * Invalidate the whole stage-2, as we may have numerous leaf entries
>>>>>         * below us which would otherwise need invalidating individually.
>>>>>         */
>>>>>        kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>>>        smp_store_release(ptep, new);
>>>>>        data->phys += granule;
>>>>> }
>>>>>
>>>>> This works because __kvm_pgtable_visit() saves the *ptep value before calling
>>>>> the
>>>>> pre callback, and it visits the next level table based on the initial pte value,
>>>>> not the new value written by stage2_coalesce_tables_into_block().
>>>> Right. So before replacing the initial pte value with the new value, we have
>>>> to use
>>>> *data->follow = kvm_pte_follow(*ptep)* in stage2_map_walk_table_pre() to save
>>>> the initial pte value in advance. And data->follow will be used when  we start to
>>>> unmap the old sub-level tables later.
>>> Right, stage2_map_walk_table_post() will use data->follow to free the table page
>>> which is no longer needed because we've replaced the entire next level table with
>>> a block mapping.
>>>
>>>>> Assuming the first patch in the series is merged ("KVM: arm64: Move the clean of
>>>>> dcache to the map handler"), this function is missing the CMOs from
>>>>> stage2_map_walker_try_leaf().
>>>> Yes, the CMOs are not performed in stage2_coalesce_tables_into_block() currently,
>>>> because I thought they were not needed when we rebuild the block mappings from
>>>> normal page mappings.
>>> This assumes that the *only* situation when we replace a table entry with a block
>>> mapping is when the next level table (or tables) is *fully* populated. Is there a
>>> way to prove that this is true? I think it's important to prove it unequivocally,
>>> because if there's a corner case where this doesn't happen and we remove the
>>> dcache maintenance, we can end up with hard to reproduce and hard to diagnose
>>> errors in a guest.
>> So there is still one thing left to determine about this patch: whether we can
>> simply discard the CMOs in stage2_coalesce_tables_into_block(), or whether we
>> should distinguish between different situations.
>>
>> Now that we know the situation you described won't happen, I think we will only
>> end up in stage2_coalesce_tables_into_block() in the following situation:
>> 1) KVM creates a new block mapping in stage2_map_walker_try_leaf() for the first
>>    time, if the guest accesses memory backed by a THP/HUGETLB huge page. CMOs
>>    will be performed here.
>> 2) KVM splits this block mapping during dirty logging and builds only one new
>>    page mapping.
>> 3) KVM lazily builds other new page mappings during dirty logging, if the guest
>>    accesses any other pages within the block. *At this stage, the pages in this
>>    block may or may not be fully mapped.*
>> 4) After dirty logging is disabled, KVM decides to rebuild the block mapping.
>>
>> Do we still have to perform CMOs when rebuilding the block mapping in step 4, if
>> the pages in the block were not fully mapped in step 3? I'm not completely sure
>> about this.
> Did some digging and this is my understanding of what is happening. Please correct
> me if I get something wrong.
>
> When the kernel coalesces the userspace PTEs into a transparent hugepage, KVM will
> unmap the old mappings and mark the PMD table as invalidated via the MMU
> notifiers. To have a table at the PMD level while the corresponding entry is a
> block mapping in the userspace translation tables, it means that the table was
> created *after* the userspace block mapping was created.
>
> user_mem_abort() will create a PAGE_SIZE mapping when the backing userspace
> mapping is a block mapping in the following situations:
>
> 1. The start of the userspace block mapping is not aligned to the start of the
> stage 2 block mapping (see fault_supports_stage2_huge_mapping()).
>
> 2. The stage 2 block mapping falls outside the memslot (see
> fault_supports_stage2_huge_mapping()).
>
> 3. The memslot logs dirty pages.
>
> For 1 and 2, the only scenario in which we can use a stage 2 block mapping for the
> faulting IPA is if the memslot is modified, and that means the IPA range will have
> been unmapped first, which destroys the PMD table entry (kvm_set_memslot() will
> call kvm_arch_flush_shadow_memslot because change == KVM_MR_MOVE).
>
> This leaves us with scenario 3. We can get in this scenario if the memslot is
> logging and the userspace mapping has been coalesced into a transparent huge page
> before dirty logging was set or if the userspace mapping is a hugetlb page. To
> allow a block mapping at stage 2, we first need to remove the
> KVM_MEM_LOG_DIRTY_PAGES flag from the memslot. Then we need to get a dabt in the
> IPA range backed by the userspace block mapping. At this point there's nothing to
> guarantee that the *entire* IPA range backed by the userspace block mapping is
> mapped at stage 2.
I get your point and I think you are correct.
We can't ensure that dirty logging starts only after *all* the stage 2 block
mappings have been created for the first time by user_mem_abort(). So it's
possible that we create a PAGE_SIZE mapping during dirty logging for an IPA
backed by a huge page, while the corresponding IPA range has never been mapped
with a block at stage 2 before. When KVM then coalesces the page mappings into
a block after dirty logging, it actually ends up creating the block mapping for
the first time, and CMOs are needed in this case.

So in summary, the key point for needing CMOs is whether the next level table
(or tables) is *fully* populated, as you mentioned before. But checking whether
the tables are fully populated needs another page table walk over the IPA range,
which adds new complexity.

I think the most concise and straightforward way is to still uniformly perform
CMOs whenever we need to coalesce tables into a block, which is exactly what the
previous code logic does.

Thanks,

Yanan
> In this case, we definitely need to do dcache maintenance because the guest might
> be running with the MMU off and doing loads from PoC (assuming no FWB), and
> whatever userspace wrote in the guest memory (like the kernel image) might still
> be in the dcache. We also need to do the icache inval after the dcache clean +
> inval because instruction fetches can be cached even if the MMU is off.
>
> Thanks,
>
> Alex
>
>> Thanks,
>>
>> Yanan
>>>> At least, they are not needed if we rebuild the block mappings backed by
>>>> hugetlbfs
>>>> pages, because we must have built the new block mappings for the first time
>>>> before
>>>> and now need to rebuild them after they were split in dirty logging. Can we
>>>> agree on this?
>>>> Then let's see the following situation.
>>>>> I can think of the following situation where they
>>>>> are needed:
>>>>>
>>>>> 1. The 2nd level (PMD) table that will be turned into a block is mapped at
>>>>> stage 2
>>>>> because one of the pages in the 3rd level (PTE) table it points to is
>>>>> accessed by
>>>>> the guest.
>>>>>
>>>>> 2. The kernel decides to turn the userspace mapping into a transparent huge page
>>>>> and calls the mmu notifier to remove the mapping from stage 2. The 2nd level
>>>>> table
>>>>> is still valid.
>>>> I have a question here. Won't the PMD entry be invalidated too in this case?
>>>> If the removal of the stage 2 mapping by the mmu notifier is an unmap operation
>>>> on a range, then it's correct and reasonable to both invalidate the PMD entry
>>>> and free the PTE table.
>>>> As I know, kvm_pgtable_stage2_unmap() does so when unmapping a range.
>>>>
>>>> And if I was right about this, we will not end up in
>>>> stage2_coalesce_tables_into_block()
>>>> like step 3 describes, but in stage2_map_walker_try_leaf() instead. Because the
>>>> PMD entry
>>>> is invalid, so KVM will create the new 2M block mapping.
>>> Looking at the code for stage2_unmap_walker(), I believe you are correct. After
>>> the entire PTE table has been unmapped, the function will mark the PMD entry as
>>> invalid. In the situation I described, at step 3 we would end up in the leaf
>>> mapper function because the PMD entry is invalid. My example was wrong.
>>>
>>>> If I'm wrong about this, then I think this is a valid situation.
>>>>> 3. Guest accesses a page which is not the page it accessed at step 1, which
>>>>> causes
>>>>> a translation fault. KVM decides we can use a PMD block mapping to map the
>>>>> address
>>>>> and we end up in stage2_coalesce_tables_into_block(). We need CMOs in this case
>>>>> because the guest accesses memory it didn't access before.
>>>>>
>>>>> What do you think, is that a valid situation?
>>>>>>         return 0;
>>>>>>     }
>>>>>>     @@ -614,20 +614,18 @@ static int stage2_map_walk_table_post(u64 addr, u64
>>>>>> end, u32 level,
>>>>>>                           kvm_pte_t *ptep,
>>>>>>                           struct stage2_map_data *data)
>>>>>>     {
>>>>>> -    int ret = 0;
>>>>>> -
>>>>>>         if (!data->anchor)
>>>>>>             return 0;
>>>>>>     -    free_page((unsigned long)kvm_pte_follow(*ptep));
>>>>>> -    put_page(virt_to_page(ptep));
>>>>>> -
>>>>>> -    if (data->anchor == ptep) {
>>>>>> +    if (data->anchor != ptep) {
>>>>>> +        free_page((unsigned long)kvm_pte_follow(*ptep));
>>>>>> +        put_page(virt_to_page(ptep));
>>>>>> +    } else {
>>>>>> +        free_page((unsigned long)data->follow);
>>>>>>             data->anchor = NULL;
>>>>>> -        ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
>>>>> stage2_map_walk_leaf() -> stage2_map_walk_table_post calls put_page() and
>>>>> get_page() once in our case (valid old mapping). It looks to me like we're
>>>>> missing
>>>>> a put_page() call when the function is called for the anchor. Have you found the
>>>>> call to be unnecessary?
>>>> Before this patch:
>>>> When we find data->anchor == ptep, put_page() has been called once in advance
>>>> for the anchor
>>>> in stage2_map_walk_table_post(). Then we call stage2_map_walk_leaf() ->
>>>> stage2_map_walker_try_leaf()
>>>> to install the block entry, and only get_page() will be called once in
>>>> stage2_map_walker_try_leaf().
>>>> There is a put_page() followed by a get_page() for the anchor, and there will
>>>> not be a problem about
>>>> page_counts.
>>> This is how I'm reading the code before your patch:
>>>
>>> - stage2_map_walk_table_post() returns early if there is no anchor.
>>>
>>> - stage2_map_walk_table_pre() sets the anchor and marks the entry as invalid. The
>>> entry was a table so the leaf visitor is not called in __kvm_pgtable_visit().
>>>
>>> - __kvm_pgtable_visit() visits the next level table.
>>>
>>> - stage2_map_walk_table_post() calls put_page(), then calls stage2_map_walk_leaf() ->
>>> stage2_map_walker_try_leaf(). The old entry was invalidated by the pre visitor, so
>>> it only calls get_page() (and not put_page() + get_page()).
>>>
>>> I agree with your conclusion; I didn't realize that, because the pre visitor marks
>>> the entry as invalid, stage2_map_walker_try_leaf() will not call put_page().
>>>
>>>> After this patch:
>>>> Neither before nor after we find data->anchor == ptep is there a put_page()
>>>> call for the anchor.
>>>> This is because we didn't call get_page() either in
>>>> stage2_coalesce_tables_into_block() when installing the block entry, so I
>>>> think there will not be a problem either.
>>> I agree, the refcount will be identical.
>>>
>>>> Is the above the right answer to your point?
>>> Yes, thank you for clearing that up for me.
>>>
>>> Thanks,
>>>
>>> Alex
>>>
>>>>>>         }
>>>>>>     -    return ret;
>>>>>> +    return 0;
>>>>> I think it's correct for this function to succeed unconditionally. The error was
>>>>> coming from stage2_map_walk_leaf() -> stage2_map_walker_try_leaf(). The function
>>>>> can return an error code if block mapping is not supported, which we know is
>>>>> supported because we have an anchor, and if only the permissions are different
>>>>> between the old and the new entry, but in our case we've changed both the valid
>>>>> and type bits.
>>>> Agreed. Besides, we will definitely not end up updating an old valid entry for
>>>> the anchor
>>>> in stage2_map_walker_try_leaf(), because *anchor has already been invalidated in
>>>> stage2_map_walk_table_pre() before setting the anchor, so it will look like
>>>> building a new mapping.
>>>>
>>>> Thanks,
>>>>
>>>> Yanan
>>>>> Thanks,
>>>>>
>>>>> Alex
>>>>>
>>>>>>     }
>>>>>>       /*
>>>>> .
>>> .
> .
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings
@ 2021-03-22 13:19               ` wangyanan (Y)
  0 siblings, 0 replies; 80+ messages in thread
From: wangyanan (Y) @ 2021-03-22 13:19 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: Marc Zyngier, Will Deacon, Catalin Marinas, Julien Thierry,
	James Morse, Suzuki K Poulose, Quentin Perret, Gavin Shan,
	kvmarm, linux-arm-kernel, kvm, linux-kernel

Hi Alex,

On 2021/3/19 23:07, Alexandru Elisei wrote:
> Hi Yanan,
>
> Sorry for taking so long to reply, been busy with other things unfortunately.
Still appreciate your patient reply! :)
> I
> did notice that you sent a new version of this series, but I would like to
> continue our discussion on this patch, since it's easier to get the full context.
>
> On 3/4/21 7:07 AM, wangyanan (Y) wrote:
>> Hi Alex,
>>
>> On 2021/3/4 1:27, Alexandru Elisei wrote:
>>> Hi Yanan,
>>>
>>> On 3/3/21 11:04 AM, wangyanan (Y) wrote:
>>>> Hi Alex,
>>>>
>>>> On 2021/3/3 1:13, Alexandru Elisei wrote:
>>>>> Hello,
>>>>>
>>>>> On 2/8/21 11:22 AM, Yanan Wang wrote:
>>>>>> When KVM needs to coalesce the normal page mappings into a block mapping,
>>>>>> we currently invalidate the old table entry first followed by invalidation
>>>>>> of TLB, then unmap the page mappings, and install the block entry at last.
>>>>>>
>>>>>> It will cost a long time to unmap the numerous page mappings, which means
>>>>>> there will be a long period when the table entry can be found invalid.
>>>>>> If other vCPUs access any guest page within the block range and find the
>>>>>> table entry invalid, they will all exit from guest with a translation fault
>>>>>> which is not necessary. And KVM will make efforts to handle these faults,
>>>>>> especially when performing CMOs by block range.
>>>>>>
>>>>>> So let's quickly install the block entry at first to ensure uninterrupted
>>>>>> memory access of the other vCPUs, and then unmap the page mappings after
>>>>>> installation. This will reduce most of the time when the table entry is
>>>>>> invalid, and avoid most of the unnecessary translation faults.
>>>>> I'm not convinced I've fully understood what is going on yet, but it seems to me
>>>>> that the idea is sound. Some questions and comments below.
>>>> What I am trying to do in this patch is to adjust the order of rebuilding block
>>>> mappings from page mappings.
>>>> Take the rebuilding of 1G block mappings as an example.
>>>> Before this patch, the order is like:
>>>> 1) invalidate the table entry of the 1st level(PUD)
>>>> 2) flush TLB by VMID
>>>> 3) unmap the old PMD/PTE tables
>>>> 4) install the new block entry to the 1st level(PUD)
>>>>
>>>> So entry in the 1st level can be found invalid by other vcpus in 1), 2), and 3),
>>>> and it's a long time in 3) to unmap
>>>> the numerous old PMD/PTE tables, which means the total time of the entry being
>>>> invalid is long enough to
>>>> affect the performance.
>>>>
>>>> After this patch, the order is like:
>>>> 1) invalidate the table entry of the 1st level(PUD)
>>>> 2) flush TLB by VMID
>>>> 3) install the new block entry to the 1st level(PUD)
>>>> 4) unmap the old PMD/PTE tables
>>>>
>>>> The change ensures that the period when the entry in the 1st level(PUD) is invalid
>>>> covers only 1) and 2), so if other vcpus access memory within the 1G range, there is
>>>> less chance of finding the entry invalid and triggering an unnecessary translation fault.
>>> Thank you for the explanation, that was my understand of it also, and I believe
>>> your idea is correct. I was more concerned that I got some of the details wrong,
>>> and you have kindly corrected me below.
>>>
>>>>>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>>>>>> ---
>>>>>>     arch/arm64/kvm/hyp/pgtable.c | 26 ++++++++++++--------------
>>>>>>     1 file changed, 12 insertions(+), 14 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>>>>> index 78a560446f80..308c36b9cd21 100644
>>>>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>>>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>>>>> @@ -434,6 +434,7 @@ struct stage2_map_data {
>>>>>>         kvm_pte_t            attr;
>>>>>>           kvm_pte_t            *anchor;
>>>>>> +    kvm_pte_t            *follow;
>>>>>>           struct kvm_s2_mmu        *mmu;
>>>>>>         struct kvm_mmu_memory_cache    *memcache;
>>>>>> @@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end,
>>>>>> u32 level,
>>>>>>         if (!kvm_block_mapping_supported(addr, end, data->phys, level))
>>>>>>             return 0;
>>>>>>     -    kvm_set_invalid_pte(ptep);
>>>>>> -
>>>>>>         /*
>>>>>> -     * Invalidate the whole stage-2, as we may have numerous leaf
>>>>>> -     * entries below us which would otherwise need invalidating
>>>>>> -     * individually.
>>>>>> +     * If we need to coalesce existing table entries into a block here,
>>>>>> +     * then install the block entry first and the sub-level page mappings
>>>>>> +     * will be unmapped later.
>>>>>>          */
>>>>>> -    kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>>>>         data->anchor = ptep;
>>>>>> +    data->follow = kvm_pte_follow(*ptep);
>>>>>> +    stage2_coalesce_tables_into_block(addr, level, ptep, data);
>>>>> Here's how stage2_coalesce_tables_into_block() is implemented from the previous
>>>>> patch (it might be worth merging it with this patch, I found it impossible to
>>>>> judge if the function is correct without seeing how it is used and what is
>>>>> replacing):
>>>> Ok, will do this if a v2 is going to be posted.
>>>>> static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
>>>>>                              kvm_pte_t *ptep,
>>>>>                              struct stage2_map_data *data)
>>>>> {
>>>>>        u64 granule = kvm_granule_size(level), phys = data->phys;
>>>>>        kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);
>>>>>
>>>>>        kvm_set_invalid_pte(ptep);
>>>>>
>>>>>        /*
>>>>>         * Invalidate the whole stage-2, as we may have numerous leaf entries
>>>>>         * below us which would otherwise need invalidating individually.
>>>>>         */
>>>>>        kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>>>        smp_store_release(ptep, new);
>>>>>        data->phys += granule;
>>>>> }
>>>>>
>>>>> This works because __kvm_pgtable_visit() saves the *ptep value before calling
>>>>> the
>>>>> pre callback, and it visits the next level table based on the initial pte value,
>>>>> not the new value written by stage2_coalesce_tables_into_block().
>>>> Right. So before replacing the initial pte value with the new value, we have to use
>>>> *data->follow = kvm_pte_follow(*ptep)* in stage2_map_walk_table_pre() to save the
>>>> initial pte value in advance. And data->follow will be used when we start to unmap
>>>> the old sub-level tables later.
>>> Right, stage2_map_walk_table_post() will use data->follow to free the table page
>>> which is no longer needed because we've replaced the entire next level table with
>>> a block mapping.
>>>
>>>>> Assuming the first patch in the series is merged ("KVM: arm64: Move the clean of
>>>>> dcache to the map handler"), this function is missing the CMOs from
>>>>> stage2_map_walker_try_leaf().
>>>> Yes, the CMOs are not performed in stage2_coalesce_tables_into_block() currently,
>>>> because I thought they were not needed when we rebuild the block mappings from
>>>> normal page mappings.
>>> This assumes that the *only* situation when we replace a table entry with a block
>>> mapping is when the next level table (or tables) is *fully* populated. Is there a
>>> way to prove that this is true? I think it's important to prove it unequivocally,
>>> because if there's a corner case where this doesn't happen and we remove the
>>> dcache maintenance, we can end up with hard to reproduce and hard to diagnose
>>> errors in a guest.
>> So there is still one thing left to determine about this patch: whether we can simply
>> discard the CMOs in stage2_coalesce_tables_into_block(), or whether we should
>> distinguish between different situations.
>>
>> Now that we know the situation you described won't happen, I think we will only end up
>> in stage2_coalesce_tables_into_block() in the following situation:
>> 1) KVM creates a new block mapping in stage2_map_walker_try_leaf() for the first
>>      time, if the guest accesses memory backed by a THP/HUGETLB huge page. And CMOs
>>      will be performed here.
>> 2) KVM splits this block mapping in dirty logging, and builds only one new page
>>      mapping.
>> 3) KVM builds the other new page mappings lazily during dirty logging, if the guest
>>      accesses any other pages within the block. *In this stage, pages in this block
>>      may or may not be fully mapped.*
>> 4) After dirty logging is disabled, KVM decides to rebuild the block mapping.
>>
>> Do we still have to perform CMOs when rebuilding the block mapping in step 4, if
>> pages in the block were not fully mapped in step 3? I'm not completely sure about this.
> Did some digging and this is my understanding of what is happening. Please correct
> me if I get something wrong.
>
> When the kernel coalesces the userspace PTEs into a transparent hugepage, KVM will
> unmap the old mappings and mark the PMD table as invalidated via the MMU
> notifiers. To have a table at the PMD level while the corresponding entry is a
> block mapping in the userspace translation tables, it means that the table was
> created *after* the userspace block mapping was created.
>
> user_mem_abort() will create a PAGE_SIZE mapping when the backing userspace
> mapping is a block mapping in the following situations:
>
> 1. The start of the userspace block mapping is not aligned to the start of the
> stage 2 block mapping (see fault_supports_stage2_huge_mapping()).
>
> 2. The stage 2 block mapping falls outside the memslot (see
> fault_supports_stage2_huge_mapping()).
>
> 3. The memslot logs dirty pages.
>
> For 1 and 2, the only scenario in which we can use a stage 2 block mapping for the
> faulting IPA is if the memslot is modified, and that means the IPA range will have
> been unmapped first, which destroys the PMD table entry (kvm_set_memslot() will
> call kvm_arch_flush_shadow_memslot because change == KVM_MR_MOVE).
>
> This leaves us with scenario 3. We can get in this scenario if the memslot is
> logging and the userspace mapping has been coalesced into a transparent huge page
> before dirty logging was set or if the userspace mapping is a hugetlb page. To
> allow a block mapping at stage 2, we first need to remove the
> KVM_MEM_LOG_DIRTY_PAGES flag from the memslot. Then we need to get a dabt in the
> IPA range backed by the userspace block mapping. At this point there's nothing to
> guarantee that the *entire* IPA range backed by the userspace block mapping is
> mapped at stage 2.
I get your point and I think you are correct.
We can't ensure that dirty logging is enabled only after *all* the stage 2 block
mappings have been created for the first time by user_mem_abort(). So it's possible
that, during dirty logging, we create a PAGE_SIZE mapping for an IPA backed by a huge
page while the corresponding IPA range has never been mapped by a block at stage 2
before. When KVM then needs to coalesce the page mappings into a block after dirty
logging, it actually ends up creating the block mapping for the first time, and CMOs
are needed in this case.

So in summary, the key point for the need of CMOs is whether the next level table
(or tables) is *fully* populated (as you mentioned before). But checking whether the
tables are fully populated needs another PTW over the IPA range, which adds new
complexity.

I think the most concise and straightforward way is to still uniformly perform CMOs
when we need to coalesce tables into a block. And that's exactly what the previous
code logic does.
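
For completeness, roughly what I have in mind, based on the function quoted above
(only an illustration, not the actual patch; stage2_pte_cacheable() and
stage2_flush_dcache() stand in for whatever CMO helpers patch 1 ends up providing):

static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
					      kvm_pte_t *ptep,
					      struct stage2_map_data *data)
{
	u64 granule = kvm_granule_size(level), phys = data->phys;
	kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);

	kvm_set_invalid_pte(ptep);

	/*
	 * Invalidate the whole stage-2, as we may have numerous leaf entries
	 * below us which would otherwise need invalidating individually.
	 */
	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);

	/*
	 * We can't cheaply prove that every page under the old tables was
	 * already mapped (and therefore already cleaned), so always clean
	 * the whole block range before exposing it to the guest.
	 */
	if (stage2_pte_cacheable(new))
		stage2_flush_dcache(__va(phys), granule);

	smp_store_release(ptep, new);
	data->phys += granule;
}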

Thanks,

Yanan
> In this case, we definitely need to do dcache maintenance because the guest might
> be running with the MMU off and doing loads from PoC (assuming not FWB), and
> whatever userspace wrote in the guest memory (like the kernel image) might still
> be in the dcache. We also need to do the icache inval after the dcache clean +
> inval because instruction fetches can be cached even if the MMU is off.
>
> Thanks,
>
> Alex
>
>> Thanks,
>>
>> Yanan
>>>> At least, they are not needed if we rebuild the block mappings backed by hugetlbfs
>>>> pages, because we must have already built the block mappings for the first time
>>>> before, and now only need to rebuild them after they were split for dirty logging.
>>>> Can we agree on this?
>>>> Then let's see the following situation.
>>>>> I can think of the following situation where they
>>>>> are needed:
>>>>>
>>>>> 1. The 2nd level (PMD) table that will be turned into a block is mapped at
>>>>> stage 2
>>>>> because one of the pages in the 3rd level (PTE) table it points to is
>>>>> accessed by
>>>>> the guest.
>>>>>
>>>>> 2. The kernel decides to turn the userspace mapping into a transparent huge page
>>>>> and calls the mmu notifier to remove the mapping from stage 2. The 2nd level
>>>>> table
>>>>> is still valid.
>>>> I have a question here. Won't the PMD entry have been invalidated too in this case?
>>>> If remove of the stage2 mapping by mmu notifier is an unmap operation of a range,
>>>> then it's correct and reasonable to both invalidate the PMD entry and free the
>>>> PTE table.
>>>> As I know, kvm_pgtable_stage2_unmap() does so when unmapping a range.
>>>>
>>>> And if I was right about this, we will not end up in
>>>> stage2_coalesce_tables_into_block()
>>>> like step 3 describes, but in stage2_map_walker_try_leaf() instead. Because the
>>>> PMD entry
>>>> is invalid, so KVM will create the new 2M block mapping.
>>> Looking at the code for stage2_unmap_walker(), I believe you are correct. After
>>> the entire PTE table has been unmapped, the function will mark the PMD entry as
>>> invalid. In the situation I described, at step 3 we would end up in the leaf
>>> mapper function because the PMD entry is invalid. My example was wrong.
>>>
>>>> If I'm wrong about this, then I think this is a valid situation.
>>>>> 3. Guest accesses a page which is not the page it accessed at step 1, which
>>>>> causes
>>>>> a translation fault. KVM decides we can use a PMD block mapping to map the
>>>>> address
>>>>> and we end up in stage2_coalesce_tables_into_block(). We need CMOs in this case
>>>>> because the guest accesses memory it didn't access before.
>>>>>
>>>>> What do you think, is that a valid situation?
>>>>>>         return 0;
>>>>>>     }
>>>>>>     @@ -614,20 +614,18 @@ static int stage2_map_walk_table_post(u64 addr, u64
>>>>>> end, u32 level,
>>>>>>                           kvm_pte_t *ptep,
>>>>>>                           struct stage2_map_data *data)
>>>>>>     {
>>>>>> -    int ret = 0;
>>>>>> -
>>>>>>         if (!data->anchor)
>>>>>>             return 0;
>>>>>>     -    free_page((unsigned long)kvm_pte_follow(*ptep));
>>>>>> -    put_page(virt_to_page(ptep));
>>>>>> -
>>>>>> -    if (data->anchor == ptep) {
>>>>>> +    if (data->anchor != ptep) {
>>>>>> +        free_page((unsigned long)kvm_pte_follow(*ptep));
>>>>>> +        put_page(virt_to_page(ptep));
>>>>>> +    } else {
>>>>>> +        free_page((unsigned long)data->follow);
>>>>>>             data->anchor = NULL;
>>>>>> -        ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
>>>>> stage2_map_walk_leaf() -> stage2_map_walk_table_post calls put_page() and
>>>>> get_page() once in our case (valid old mapping). It looks to me like we're
>>>>> missing
>>>>> a put_page() call when the function is called for the anchor. Have you found the
>>>>> call to be unnecessary?
>>>> Before this patch:
>>>> When we find data->anchor == ptep, put_page() has already been called once for the
>>>> anchor in stage2_map_walk_table_post(). Then we call stage2_map_walk_leaf() ->
>>>> stage2_map_walker_try_leaf() to install the block entry, and only get_page() will be
>>>> called once in stage2_map_walker_try_leaf().
>>>> So there is a put_page() followed by a get_page() for the anchor, and there will not
>>>> be a problem with the page counts.
>>> This is how I'm reading the code before your patch:
>>>
>>> - stage2_map_walk_table_post() returns early if there is no anchor.
>>>
>>> - stage2_map_walk_table_pre() sets the anchor and marks the entry as invalid. The
>>> entry was a table so the leaf visitor is not called in __kvm_pgtable_visit().
>>>
>>> - __kvm_pgtable_visit() visits the next level table.
>>>
>>> - stage2_map_walk_table_post() calls put_page(), calls stage2_map_walk_leaf() ->
>>> stage2_map_walker_try_leaf(). The old entry was invalidated by the pre visitor, so
>>> it only calls get_page() (and not put_page() + get_page().
>>>
>>> I agree with your conclusion, I didn't realize that because the pre visitor marks
>>> the entry as invalid, stage2_map_walker_try_leaf() will not call put_page().
>>>
>>>> After this patch:
>>>> There is no put_page() call for the anchor, neither before we find
>>>> data->anchor == ptep nor after it. This is because we didn't call get_page() either
>>>> in stage2_coalesce_tables_into_block() when installing the block entry. So I think
>>>> there will not be a problem either.
>>> I agree, the refcount will be identical.
>>>
>>>> Is the above the right answer to your point?
>>> Yes, thank you for clearing that up for me.
>>>
>>> Thanks,
>>>
>>> Alex
>>>
>>>>>>         }
>>>>>>     -    return ret;
>>>>>> +    return 0;
>>>>> I think it's correct for this function to succeed unconditionally. The error was
>>>>> coming from stage2_map_walk_leaf() -> stage2_map_walker_try_leaf(). The function
>>>>> can return an error code if block mapping is not supported, which we know is
>>>>> supported because we have an anchor, and if only the permissions are different
>>>>> between the old and the new entry, but in our case we've changed both the valid
>>>>> and type bits.
>>>> Agreed. Besides, we will definitely not end up updating an old valid entry for the
>>>> anchor in stage2_map_walker_try_leaf(), because *anchor has already been invalidated
>>>> in stage2_map_walk_table_pre() before setting the anchor, so it will look like the
>>>> build of a new mapping.
>>>>
>>>> Thanks,
>>>>
>>>> Yanan
>>>>> Thanks,
>>>>>
>>>>> Alex
>>>>>
>>>>>>     }
>>>>>>       /*
>>>>> .
>>> .
> .

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 4/4] KVM: arm64: Distinguish cases of memcache allocations completely
@ 2021-03-25 17:26     ` Alexandru Elisei
  0 siblings, 0 replies; 80+ messages in thread
From: Alexandru Elisei @ 2021-03-25 17:26 UTC (permalink / raw)
  To: Yanan Wang, Marc Zyngier, Will Deacon, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, kvmarm, linux-arm-kernel, kvm, linux-kernel

Hi Yanan,

On 2/8/21 11:22 AM, Yanan Wang wrote:
> With a guest translation fault, the memcache pages are not needed if KVM
> is only about to install a new leaf entry into the existing page table.
> And with a guest permission fault, the memcache pages are also not needed
> for a write_fault in dirty-logging time if KVM is only about to update
> the existing leaf entry instead of collapsing a block entry into a table.
>
> By comparing fault_granule and vma_pagesize, cases that require allocations
> from memcache and cases that don't can be distinguished completely.
>
> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
> ---
>  arch/arm64/kvm/mmu.c | 25 ++++++++++++-------------
>  1 file changed, 12 insertions(+), 13 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index d151927a7d62..550498a9104e 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -815,19 +815,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	gfn = fault_ipa >> PAGE_SHIFT;
>  	mmap_read_unlock(current->mm);
>  
> -	/*
> -	 * Permission faults just need to update the existing leaf entry,
> -	 * and so normally don't require allocations from the memcache. The
> -	 * only exception to this is when dirty logging is enabled at runtime
> -	 * and a write fault needs to collapse a block entry into a table.
> -	 */
> -	if (fault_status != FSC_PERM || (logging_active && write_fault)) {
> -		ret = kvm_mmu_topup_memory_cache(memcache,
> -						 kvm_mmu_cache_min_pages(kvm));
> -		if (ret)
> -			return ret;
> -	}
> -
>  	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>  	/*
>  	 * Ensure the read of mmu_notifier_seq happens before we call
> @@ -887,6 +874,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
>  		prot |= KVM_PGTABLE_PROT_X;
>  
> +	/*
> +	 * Allocations from the memcache are required only when granule of the
> +	 * lookup level where the guest fault happened exceeds vma_pagesize,
> +	 * which means new page tables will be created in the fault handlers.
> +	 */
> +	if (fault_granule > vma_pagesize) {
> +		ret = kvm_mmu_topup_memory_cache(memcache,
> +						 kvm_mmu_cache_min_pages(kvm));
> +		if (ret)
> +			return ret;
> +	}

I distinguish three situations:

1. fault_granule == vma_pagesize. If the stage 2 fault occurs at the leaf level,
then it means that all the tables that the translation table walker traversed
until the leaf are valid. No need to allocate a new page, as stage 2 will only
change the leaf to point to a valid PA.

2. fault_granule > vma_pagesize. This means that there's a table missing at some
point in the table walk, so we're going to need to allocate at least one table to
hold the leaf entry. We need to topup the memory cache.

3. fault_granule < vma_pagesize. From our discussion in patch #3, this can happen
only if the userspace translation tables use a block mapping, dirty page logging
is enabled, the fault_ipa is mapped as a last level entry, dirty page logging gets
disabled and then we get a fault. In this case, the PTE table will be coalesced
into a PMD block mapping, and the PMD table entry that pointed to the PTE table
will be changed to a block mapping. No table will be allocated.

Looks to me like this patch is valid, but getting it wrong can break a VM and I
would feel a lot more comfortable if someone who is more familiar with the code
would have a look.
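
To make the three cases concrete, a small (purely hypothetical) helper could look like
this; fault_granule and vma_pagesize are the local variables from user_mem_abort(), the
helper itself is not part of the patch:

static bool stage2_fault_needs_memcache(unsigned long fault_granule,
					unsigned long vma_pagesize)
{
	if (fault_granule == vma_pagesize)
		return false;	/* case 1: only the existing leaf entry changes */
	if (fault_granule > vma_pagesize)
		return true;	/* case 2: at least one missing table must be allocated */
	return false;		/* case 3: a table entry is replaced by a block entry */
}

With that, the hunk above boils down to topping up the memcache only when the helper
would return true, i.e. only for fault_granule > vma_pagesize.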

Thanks,

Alex

> +
>  	/*
>  	 * Under the premise of getting a FSC_PERM fault, we just need to relax
>  	 * permissions only if vma_pagesize equals fault_granule. Otherwise,

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC PATCH 4/4] KVM: arm64: Distinguish cases of memcache allocations completely
@ 2021-03-26  1:24       ` wangyanan (Y)
  0 siblings, 0 replies; 80+ messages in thread
From: wangyanan (Y) @ 2021-03-26  1:24 UTC (permalink / raw)
  To: Alexandru Elisei, Marc Zyngier, Will Deacon, Catalin Marinas,
	James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
	Quentin Perret, kvmarm, linux-arm-kernel, kvm, linux-kernel

Hi Alex,

On 2021/3/26 1:26, Alexandru Elisei wrote:
> Hi Yanan,
>
> On 2/8/21 11:22 AM, Yanan Wang wrote:
>> With a guest translation fault, the memcache pages are not needed if KVM
>> is only about to install a new leaf entry into the existing page table.
>> And with a guest permission fault, the memcache pages are also not needed
>> for a write_fault in dirty-logging time if KVM is only about to update
>> the existing leaf entry instead of collapsing a block entry into a table.
>>
>> By comparing fault_granule and vma_pagesize, cases that require allocations
>> from memcache and cases that don't can be distinguished completely.
>>
>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>> ---
>>   arch/arm64/kvm/mmu.c | 25 ++++++++++++-------------
>>   1 file changed, 12 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index d151927a7d62..550498a9104e 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -815,19 +815,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>   	gfn = fault_ipa >> PAGE_SHIFT;
>>   	mmap_read_unlock(current->mm);
>>   
>> -	/*
>> -	 * Permission faults just need to update the existing leaf entry,
>> -	 * and so normally don't require allocations from the memcache. The
>> -	 * only exception to this is when dirty logging is enabled at runtime
>> -	 * and a write fault needs to collapse a block entry into a table.
>> -	 */
>> -	if (fault_status != FSC_PERM || (logging_active && write_fault)) {
>> -		ret = kvm_mmu_topup_memory_cache(memcache,
>> -						 kvm_mmu_cache_min_pages(kvm));
>> -		if (ret)
>> -			return ret;
>> -	}
>> -
>>   	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>>   	/*
>>   	 * Ensure the read of mmu_notifier_seq happens before we call
>> @@ -887,6 +874,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>   	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
>>   		prot |= KVM_PGTABLE_PROT_X;
>>   
>> +	/*
>> +	 * Allocations from the memcache are required only when granule of the
>> +	 * lookup level where the guest fault happened exceeds vma_pagesize,
>> +	 * which means new page tables will be created in the fault handlers.
>> +	 */
>> +	if (fault_granule > vma_pagesize) {
>> +		ret = kvm_mmu_topup_memory_cache(memcache,
>> +						 kvm_mmu_cache_min_pages(kvm));
>> +		if (ret)
>> +			return ret;
>> +	}
> I distinguish three situations:
>
> 1. fault_granule == vma_pagesize. If the stage 2 fault occurs at the leaf level,
> then it means that all the tables that the translation table walker traversed
> until the leaf are valid. No need to allocate a new page, as stage 2 will only
> change the leaf to point to a valid PA.
>
> 2. fault_granule > vma_pagesize. This means that there's a table missing at some
> point in the table walk, so we're going to need to allocate at least one table to
> hold the leaf entry. We need to topup the memory cache.
>
> 3. fault_granule < vma_pagesize. From our discussion in patch #3, this can happen
> only if the userspace translation tables use a block mapping, dirty page logging
> is enabled, the fault_ipa is mapped as a last level entry, dirty page logging gets
> disabled and then we get a fault. In this case, the PTE table will be coalesced
> into a PMD block mapping, and the PMD table entry that pointed to the PTE table
> will be changed to a block mapping. No table will be allocated.
>
> Looks to me like this patch is valid, but getting it wrong can break a VM and I
> would feel a lot more comfortable if someone who is more familiar with the code
> would have a look.
Thanks for your explanation here. The above is also what I thought about this patch.

Thanks,
Yanan
>
> Thanks,
>
> Alex
>
>> +
>>   	/*
>>   	 * Under the premise of getting a FSC_PERM fault, we just need to relax
>>   	 * permissions only if vma_pagesize equals fault_granule. Otherwise,
> .

^ permalink raw reply	[flat|nested] 80+ messages in thread

end of thread, other threads:[~2021-03-26  1:27 UTC | newest]

Thread overview: 80+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-08 11:22 [RFC PATCH 0/4] KVM: arm64: Improve efficiency of stage2 page table Yanan Wang
2021-02-08 11:22 ` [RFC PATCH 1/4] KVM: arm64: Move the clean of dcache to the map handler Yanan Wang
2021-02-24 17:21   ` Alexandru Elisei
2021-02-24 17:39     ` Marc Zyngier
2021-02-25 16:45       ` Alexandru Elisei
2021-02-25  9:55   ` Marc Zyngier
2021-02-25 17:39     ` Alexandru Elisei
2021-02-25 18:30       ` Marc Zyngier
2021-02-26 15:51         ` wangyanan (Y)
2021-02-26 15:58     ` wangyanan (Y)
2021-02-08 11:22 ` [RFC PATCH 2/4] KVM: arm64: Add an independent API for coalescing tables Yanan Wang
2021-02-08 11:22 ` [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings Yanan Wang
2021-02-28 11:11   ` wangyanan (Y)
2021-03-02 17:13   ` Alexandru Elisei
2021-03-03 11:04     ` wangyanan (Y)
2021-03-03 17:27       ` Alexandru Elisei
2021-03-04  7:07         ` wangyanan (Y)
2021-03-04  7:22           ` wangyanan (Y)
2021-03-19 15:07           ` Alexandru Elisei
2021-03-22 13:19             ` wangyanan (Y)
2021-02-08 11:22 ` [RFC PATCH 4/4] KVM: arm64: Distinguish cases of memcache allocations completely Yanan Wang
2021-03-25 17:26   ` Alexandru Elisei
2021-03-26  1:24     ` wangyanan (Y)
2021-02-23 15:55 ` [RFC PATCH 0/4] KVM: arm64: Improve efficiency of stage2 page table Alexandru Elisei
2021-02-24  2:35   ` wangyanan (Y)
2021-02-24 17:20     ` Alexandru Elisei
2021-02-25  6:13       ` wangyanan (Y)
