* [PATCH v7 00/12] Implement Eager Page Splitting for ARM
@ 2023-04-09  6:29 Ricardo Koller
  2023-04-09  6:29 ` [PATCH v7 01/12] KVM: arm64: Rename free_removed to free_unlinked Ricardo Koller
                   ` (12 more replies)
  0 siblings, 13 replies; 40+ messages in thread
From: Ricardo Koller @ 2023-04-09  6:29 UTC (permalink / raw)
  To: pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
	rananta, bgardon, ricarkol, Ricardo Koller

Eager Page Splitting improves the performance of dirty-logging (used
in live migrations) when guest memory is backed by huge-pages.  It's
an optimization used in Google Cloud since 2016 on x86, and for the
last couple of months on ARM.

Background and motivation
=========================
Dirty logging is typically used for live-migration iterative copying.
KVM implements dirty-logging at PAGE_SIZE granularity (referred to as
4K pages from now on).  It does so by faulting on write-protected
4K pages.  Therefore, enabling dirty-logging on a huge-page requires
breaking it into 4K pages in the first place.  KVM does this breaking
on fault, and because it's in the critical path it only maps the 4K
page that faulted; every other 4K page is left unmapped.  This is not
great for performance on ARM for a couple of reasons:

- Splitting on fault can halt vcpus for milliseconds in some
  implementations. Splitting a block PTE requires a broadcast TLB
  invalidation (TLBI) for every huge-page (due to the
  break-before-make requirement); note that x86 doesn't need this. We
  observed implementations that take milliseconds to complete
  broadcast TLBIs when they are issued in parallel from multiple
  vcpus, and that's exactly what happens when splitting on fault:
  multiple vcpus fault at the same time, triggering TLBIs in parallel.

- Read-intensive guest workloads end up paying for dirty-logging.
  Mapping only the faulting 4K page means that all the other pages
  that were part of the huge-page are now unmapped. The effect is
  that any access, including reads, now has to fault.

Eager Page Splitting (on ARM)
=============================
Eager Page Splitting fixes the above two issues by eagerly splitting
huge-pages when enabling dirty logging. The goal is to avoid doing it
while faulting on write-protected pages. This is what the TDP MMU does
for x86 [0], except that x86 does it for different reasons: to avoid
grabbing the MMU lock on fault. Note that taking care of
write-protection faults still requires grabbing the MMU lock on ARM,
but not on x86 (with the fast_page_fault path).

An additional benefit of eagerly splitting huge-pages is that it can
be done in a controlled way (e.g., via an IOCTL). This series provides
two knobs for doing it, just like its x86 counterpart: when enabling
dirty logging, and when using the KVM_CLEAR_DIRTY_LOG ioctl. The
benefit of doing it on KVM_CLEAR_DIRTY_LOG is that this ioctl takes
ranges, and not complete memslots like when enabling dirty logging.
This means that the cost of splitting (mainly broadcasted TLBIs) can
be throttled: split a range, wait for a bit, split another range, etc.
The benefits of this approach were presented by Oliver Upton at KVM
Forum 2022 [1].
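
As an illustration only, here is a minimal userspace sketch (not part
of this series) of how a VMM could throttle the splitting cost by
clearing the dirty log in sub-ranges. It assumes vm_fd, slot_id,
npages, pages_per_chunk (a multiple of 64), and bitmap (a byte pointer
to the slot's dirty bitmap) already exist, that
KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE (added later in this series) has
been enabled, and it omits error handling:

	/* Needs <linux/kvm.h>, <sys/ioctl.h>, <stdint.h>. */
	for (uint64_t first = 0; first < npages; first += pages_per_chunk) {
		uint64_t n = npages - first;

		if (n > pages_per_chunk)
			n = pages_per_chunk;

		struct kvm_clear_dirty_log clear = {
			.slot = slot_id,
			.first_page = first,
			.num_pages = n,
			/* bit 0 of this bitmap corresponds to .first_page */
			.dirty_bitmap = bitmap + first / 8,
		};

		/* Write-protects (and eagerly splits) only this sub-range. */
		ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);

		/* Sleep or do other work here to spread out the TLBIs. */
	}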

Implementation
==============
Patches 3-4 add a pgtable utility function for splitting huge block
PTEs: kvm_pgtable_stage2_split(). Patches 5-11 add support for eagerly
splitting huge-pages when enabling dirty-logging and when using the
KVM_CLEAR_DIRTY_LOG ioctl. Note that this is just like what x86 does,
and the code is actually based on it.  And finally, patch 12:

	KVM: arm64: Use local TLBI on permission relaxation

adds support for using local TLBIs instead of broadcasts when doing
permission relaxation. This last patch is key to achieving good
performance during dirty-logging: once huge-pages have been eagerly
split, write-protection faults are handled by relaxing permissions on
already-mapped pages instead of by mapping new ones. I got this patch
(indirectly) from Marc Z. and took the liberty of adding a commit
message.

Note: this applies on top of 6.3-rc2.

Performance evaluation
======================
The performance benefits were tested using the dirty_log_perf_test
selftest with 2M huge-pages.

The first test uses a write-only sequential workload where the stride
is 2M instead of 4K [2]. The idea with this experiment is to emulate a
random access pattern writing a different huge-page at every access.
Observe that the benefit increases with the number of vcpus: up to
~7.3x for 152 vcpus. This table shows the guest dirtying time when
using the CLEAR ioctl (and KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2):

./dirty_log_perf_test_sparse -s anonymous_hugetlb_2mb -b 1G -v $i -i 3 -m 2

	+-------+----------+------------------+
	| vCPUs | 6.2-rc3  | 6.2-rc3 + series |
	|       |    (ms)  |             (ms) |
	+-------+----------+------------------+
	|    1  |    2.63  |          1.66    |
	|    2  |    2.95  |          1.70    |
	|    4  |    3.21  |          1.71    |
	|    8  |    4.97  |          1.78    |
	|   16  |    9.51  |          1.82    |
	|   32  |   20.15  |          3.03    |
	|   64  |   40.09  |          5.80    |
	|  128  |   80.08  |         12.24    |
	|  152  |  109.81  |         15.14    |
	+-------+----------+------------------+

This second test measures the benefit of eager page splitting on
read-intensive workloads (1 write for every 10 reads). As in the first
test, the benefit increases with the number of vcpus, up to ~10x for
152 vcpus. This table shows the guest dirtying time when using the
CLEAR ioctl (and KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2):

./dirty_log_perf_test -s anonymous_hugetlb_2mb -b 1G -v $i -i 3 -m 2 -w 10

	+-------+----------+------------------+
	| vCPUs | 6.2-rc3  | 6.2-rc3 + series |
	|       |   (sec)  |            (sec) |
	+-------+----------+------------------+
	|    1  |    0.65  |          0.07    |
	|    2  |    0.70  |          0.08    |
	|    4  |    0.71  |          0.08    |
	|    8  |    0.72  |          0.08    |
	|   16  |    0.76  |          0.08    |
	|   32  |    1.61  |          0.14    |
	|   64  |    3.46  |          0.30    |
	|  128  |    5.49  |          0.64    |
	|  152  |    6.44  |          0.63    |
	+-------+----------+------------------+

Changes from v6 (all based on Marc's comments):
https://lore.kernel.org/kvmarm/20230307034555.39733-1-ricarkol@google.com/
- Don't enable eager-splitting by default
- Only accept block sizes as valid CHUNK_SIZE's.
- Export the acceptable block sizes (as a bitmap) in a new capability.
- Renamed KVM_PGTABLE_WALK_SKIP_BBM to KVM_PGTABLE_WALK_SKIP_BBM_TLBI.
- Moved kvm_pgtable_walk_skip_*() from kvm_pgtable.h to pgtable.c.
- Add unlikely() to kvm_pgtable_walk_skip_*().
- Add check for alignment in kvm_pgtable_stage2_create_unlinked().
- kvm_pgtable_stage2_split() does not take a void *mc anymore, and takes
  a kvm_mmu_memory_cache struct instead.
- kvm_pgtable_stage2_split() does not take an mc_capacity argument anymore.
- Renamed need_topup_split_page_cache_or_resched().
- Moved lockdep_assert_held_write() to the beginning of their respective
  functions in mmu.c.
- Removed the cache_capacity arg from need_split_memcache_topup_or_resched()
- Rewording for:
	- Replaced all instances of PMD and PUD
	- KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE description in api.rst
	- multiple function descriptions and comments in mmu.c
	- description of kvm_pgtable_stage2_split() and create_unlinked().
	- removed "no functional change expected" from all commits.
	- improved commit messages for the first 2 and last one
	  (Marc's commit).

Changes from v5:
https://lore.kernel.org/kvmarm/20230301210928.565562-1-ricarkol@google.com/
- fixed message in "Use local TLBI on permission relaxation". (Vladimir)
- s/removed/unlinked in first commit message. (Shaoqin)
- rebased series
- collected r-b's from Shaoqin

Changes from v4:
https://lore.kernel.org/kvmarm/20230218032314.635829-1-ricarkol@google.com/
- nits on some comments (s/removed/unlinked and remove @new).
  (Shaoqin)

Changes from v3:
https://lore.kernel.org/kvmarm/20230215174046.2201432-1-ricarkol@google.com/
- KVM_PGTABLE_WALK_SKIP_CMO to use BIT(5). (Shaoqin)
- Rewritten commit message for "Rename free_unlinked to free_removed"
  using Oliver's suggestion. (Oliver)
- "un" -> "an" typo. (Shaoqin)
- kvm_pgtable_stage2_create_unlinked() to return a "kvm_pte_t *". (Oliver)
- refactored stage2_block_get_nr_page_tables(). (Oliver)
- /s/bock/block. (Shaoqin)

Changes from v2:
https://lore.kernel.org/kvmarm/20230206165851.3106338-1-ricarkol@google.com/
- removed redundant kvm_pte_table() check from split walker function. (Gavin)
- fix compilation of patch 8 by moving some definitions from patch 9. (Gavin)
- add comment for kvm_mmu_split_nr_page_tables(). (Gavin)

Changes from v1:
https://lore.kernel.org/kvmarm/20230113035000.480021-1-ricarkol@google.com/
- added a capability to set the eager splitting chunk size. This
  indirectly sets the number of pages in the cache. It also allows for
  opting out of this feature. (Oliver, Marc)
- changed kvm_pgtable_stage2_split() to split 1g huge-pages
  using either 513 or 1 at a time (with a cache of 1). (Oliver, Marc)
- added force_pte arg to kvm_pgtable_stage2_create_removed().
- renamed free_removed to free_unlinked. (Ben and Oliver)
- added KVM_PGTABLE_WALK ctx->flags for skipping BBM and CMO, instead
  of KVM_PGTABLE_WALK_REMOVED. (Oliver)

Changes from the RFC:
https://lore.kernel.org/kvmarm/20221112081714.2169495-1-ricarkol@google.com/
- dropped the changes to split on POST visits. No visible perf
  benefit.
- changed the kvm_pgtable_stage2_free_removed() implementation to
  reuse the stage2 mapper.
- dropped the FEAT_BBM changes and optimization. Will send this on a
  different series.

Thanks,
Ricardo

Marc Zyngier (1):
  KVM: arm64: Use local TLBI on permission relaxation

Ricardo Koller (11):
  KVM: arm64: Rename free_removed to free_unlinked
  KVM: arm64: Add KVM_PGTABLE_WALK flags for skipping CMOs and BBM TLBIs
  KVM: arm64: Add helper for creating unlinked stage2 subtrees
  KVM: arm64: Add kvm_pgtable_stage2_split()
  KVM: arm64: Refactor kvm_arch_commit_memory_region()
  KVM: arm64: Add kvm_uninit_stage2_mmu()
  KVM: arm64: Export kvm_are_all_memslots_empty()
  KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
  KVM: arm64: Split huge pages when dirty logging is enabled
  KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked()
  KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG

 Documentation/virt/kvm/api.rst        |  28 ++++
 arch/arm64/include/asm/kvm_asm.h      |   4 +
 arch/arm64/include/asm/kvm_host.h     |  15 ++
 arch/arm64/include/asm/kvm_mmu.h      |   1 +
 arch/arm64/include/asm/kvm_pgtable.h  |  79 ++++++++++-
 arch/arm64/kvm/arm.c                  |  30 ++++
 arch/arm64/kvm/hyp/nvhe/hyp-main.c    |  10 ++
 arch/arm64/kvm/hyp/nvhe/mem_protect.c |   6 +-
 arch/arm64/kvm/hyp/nvhe/tlb.c         |  54 +++++++
 arch/arm64/kvm/hyp/pgtable.c          | 197 ++++++++++++++++++++++++--
 arch/arm64/kvm/hyp/vhe/tlb.c          |  32 +++++
 arch/arm64/kvm/mmu.c                  | 195 +++++++++++++++++++++----
 include/linux/kvm_host.h              |   2 +
 include/uapi/linux/kvm.h              |   2 +
 virt/kvm/kvm_main.c                   |   2 +-
 15 files changed, 603 insertions(+), 54 deletions(-)

-- 
2.40.0.577.gac1e443424-goog



* [PATCH v7 01/12] KVM: arm64: Rename free_removed to free_unlinked
  2023-04-09  6:29 [PATCH v7 00/12] Implement Eager Page Splitting for ARM Ricardo Koller
@ 2023-04-09  6:29 ` Ricardo Koller
  2023-04-17  6:08   ` Gavin Shan
  2023-04-09  6:29 ` [PATCH v7 02/12] KVM: arm64: Add KVM_PGTABLE_WALK ctx->flags for skipping BBM and CMO Ricardo Koller
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Ricardo Koller @ 2023-04-09  6:29 UTC (permalink / raw)
  To: pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
	rananta, bgardon, ricarkol, Ricardo Koller, Oliver Upton,
	Shaoqin Huang

Normalize on referring to tables outside of an active paging structure
as 'unlinked'.

A subsequent change to KVM will add support for building page tables
that are not part of an active paging structure. The existing
'removed_table' terminology is quite clunky when applied in this
context.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
---
 arch/arm64/include/asm/kvm_pgtable.h  |  8 ++++----
 arch/arm64/kvm/hyp/nvhe/mem_protect.c |  6 +++---
 arch/arm64/kvm/hyp/pgtable.c          |  6 +++---
 arch/arm64/kvm/mmu.c                  | 10 +++++-----
 4 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 4cd6762bda805..26a4293726c14 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -104,7 +104,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
  *				allocation is physically contiguous.
  * @free_pages_exact:		Free an exact number of memory pages previously
  *				allocated by zalloc_pages_exact.
- * @free_removed_table:		Free a removed paging structure by unlinking and
+ * @free_unlinked_table:	Free an unlinked paging structure by unlinking and
  *				dropping references.
  * @get_page:			Increment the refcount on a page.
  * @put_page:			Decrement the refcount on a page. When the
@@ -124,7 +124,7 @@ struct kvm_pgtable_mm_ops {
 	void*		(*zalloc_page)(void *arg);
 	void*		(*zalloc_pages_exact)(size_t size);
 	void		(*free_pages_exact)(void *addr, size_t size);
-	void		(*free_removed_table)(void *addr, u32 level);
+	void		(*free_unlinked_table)(void *addr, u32 level);
 	void		(*get_page)(void *addr);
 	void		(*put_page)(void *addr);
 	int		(*page_count)(void *addr);
@@ -440,7 +440,7 @@ int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
 void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
 
 /**
- * kvm_pgtable_stage2_free_removed() - Free a removed stage-2 paging structure.
+ * kvm_pgtable_stage2_free_unlinked() - Free an unlinked stage-2 paging structure.
  * @mm_ops:	Memory management callbacks.
  * @pgtable:	Unlinked stage-2 paging structure to be freed.
  * @level:	Level of the stage-2 paging structure to be freed.
@@ -448,7 +448,7 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
  * The page-table is assumed to be unreachable by any hardware walkers prior to
  * freeing and therefore no TLB invalidation is performed.
  */
-void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level);
+void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level);
 
 /**
  * kvm_pgtable_stage2_map() - Install a mapping in a guest stage-2 page-table.
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 552653fa18be3..b030170d803b6 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -91,9 +91,9 @@ static void host_s2_put_page(void *addr)
 	hyp_put_page(&host_s2_pool, addr);
 }
 
-static void host_s2_free_removed_table(void *addr, u32 level)
+static void host_s2_free_unlinked_table(void *addr, u32 level)
 {
-	kvm_pgtable_stage2_free_removed(&host_mmu.mm_ops, addr, level);
+	kvm_pgtable_stage2_free_unlinked(&host_mmu.mm_ops, addr, level);
 }
 
 static int prepare_s2_pool(void *pgt_pool_base)
@@ -110,7 +110,7 @@ static int prepare_s2_pool(void *pgt_pool_base)
 	host_mmu.mm_ops = (struct kvm_pgtable_mm_ops) {
 		.zalloc_pages_exact = host_s2_zalloc_pages_exact,
 		.zalloc_page = host_s2_zalloc_page,
-		.free_removed_table = host_s2_free_removed_table,
+		.free_unlinked_table = host_s2_free_unlinked_table,
 		.phys_to_virt = hyp_phys_to_virt,
 		.virt_to_phys = hyp_virt_to_phys,
 		.page_count = hyp_page_count,
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 3d61bd3e591d2..a3246d6cddec7 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -860,7 +860,7 @@ static int stage2_map_walk_table_pre(const struct kvm_pgtable_visit_ctx *ctx,
 	if (ret)
 		return ret;
 
-	mm_ops->free_removed_table(childp, ctx->level);
+	mm_ops->free_unlinked_table(childp, ctx->level);
 	return 0;
 }
 
@@ -905,7 +905,7 @@ static int stage2_map_walk_leaf(const struct kvm_pgtable_visit_ctx *ctx,
  * The TABLE_PRE callback runs for table entries on the way down, looking
  * for table entries which we could conceivably replace with a block entry
  * for this mapping. If it finds one it replaces the entry and calls
- * kvm_pgtable_mm_ops::free_removed_table() to tear down the detached table.
+ * kvm_pgtable_mm_ops::free_unlinked_table() to tear down the detached table.
  *
  * Otherwise, the LEAF callback performs the mapping at the existing leaves
  * instead.
@@ -1276,7 +1276,7 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
 	pgt->pgd = NULL;
 }
 
-void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level)
+void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level)
 {
 	kvm_pteref_t ptep = (kvm_pteref_t)pgtable;
 	struct kvm_pgtable_walker walker = {
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7113587222ffe..efdaab3f154de 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -131,21 +131,21 @@ static void kvm_s2_free_pages_exact(void *virt, size_t size)
 
 static struct kvm_pgtable_mm_ops kvm_s2_mm_ops;
 
-static void stage2_free_removed_table_rcu_cb(struct rcu_head *head)
+static void stage2_free_unlinked_table_rcu_cb(struct rcu_head *head)
 {
 	struct page *page = container_of(head, struct page, rcu_head);
 	void *pgtable = page_to_virt(page);
 	u32 level = page_private(page);
 
-	kvm_pgtable_stage2_free_removed(&kvm_s2_mm_ops, pgtable, level);
+	kvm_pgtable_stage2_free_unlinked(&kvm_s2_mm_ops, pgtable, level);
 }
 
-static void stage2_free_removed_table(void *addr, u32 level)
+static void stage2_free_unlinked_table(void *addr, u32 level)
 {
 	struct page *page = virt_to_page(addr);
 
 	set_page_private(page, (unsigned long)level);
-	call_rcu(&page->rcu_head, stage2_free_removed_table_rcu_cb);
+	call_rcu(&page->rcu_head, stage2_free_unlinked_table_rcu_cb);
 }
 
 static void kvm_host_get_page(void *addr)
@@ -682,7 +682,7 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
 	.zalloc_page		= stage2_memcache_zalloc_page,
 	.zalloc_pages_exact	= kvm_s2_zalloc_pages_exact,
 	.free_pages_exact	= kvm_s2_free_pages_exact,
-	.free_removed_table	= stage2_free_removed_table,
+	.free_unlinked_table	= stage2_free_unlinked_table,
 	.get_page		= kvm_host_get_page,
 	.put_page		= kvm_s2_put_page,
 	.page_count		= kvm_host_page_count,
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v7 02/12] KVM: arm64: Add KVM_PGTABLE_WALK ctx->flags for skipping BBM and CMO
  2023-04-09  6:29 [PATCH v7 00/12] Implement Eager Page Splitting for ARM Ricardo Koller
  2023-04-09  6:29 ` [PATCH v7 01/12] KVM: arm64: Rename free_removed to free_unlinked Ricardo Koller
@ 2023-04-09  6:29 ` Ricardo Koller
  2023-04-17  6:10   ` Gavin Shan
  2023-04-09  6:29 ` [PATCH v7 02/12] KVM: arm64: Add KVM_PGTABLE_WALK flags for skipping CMOs and BBM TLBIs Ricardo Koller
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Ricardo Koller @ 2023-04-09  6:29 UTC (permalink / raw)
  To: pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
	rananta, bgardon, ricarkol, Ricardo Koller, Shaoqin Huang

Add two flags to kvm_pgtable_visit_ctx, KVM_PGTABLE_WALK_SKIP_BBM_TLBI
and KVM_PGTABLE_WALK_SKIP_CMO, to indicate that the walk should not
perform TLB invalidations (TLBIs) in break-before-make (BBM) nor cache
maintenance operations (CMO). This will be used by a future commit to
create unlinked tables not accessible to the HW page-table walker.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
---
 arch/arm64/include/asm/kvm_pgtable.h |  8 ++++++
 arch/arm64/kvm/hyp/pgtable.c         | 37 +++++++++++++++++++---------
 2 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 26a4293726c14..3f2d43ba2b628 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -195,6 +195,12 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
  *					with other software walkers.
  * @KVM_PGTABLE_WALK_HANDLE_FAULT:	Indicates the page-table walk was
  *					invoked from a fault handler.
+ * @KVM_PGTABLE_WALK_SKIP_BBM_TLBI:	Visit and update table entries
+ *					without Break-before-make's
+ *					TLB invalidation.
+ * @KVM_PGTABLE_WALK_SKIP_CMO:		Visit and update table entries
+ *					without Cache maintenance
+ *					operations required.
  */
 enum kvm_pgtable_walk_flags {
 	KVM_PGTABLE_WALK_LEAF			= BIT(0),
@@ -202,6 +208,8 @@ enum kvm_pgtable_walk_flags {
 	KVM_PGTABLE_WALK_TABLE_POST		= BIT(2),
 	KVM_PGTABLE_WALK_SHARED			= BIT(3),
 	KVM_PGTABLE_WALK_HANDLE_FAULT		= BIT(4),
+	KVM_PGTABLE_WALK_SKIP_BBM_TLBI		= BIT(5),
+	KVM_PGTABLE_WALK_SKIP_CMO		= BIT(6),
 };
 
 struct kvm_pgtable_visit_ctx {
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index a3246d6cddec7..633679ee3c49a 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -62,6 +62,16 @@ struct kvm_pgtable_walk_data {
 	u64				end;
 };
 
+static bool kvm_pgtable_walk_skip_bbm_tlbi(const struct kvm_pgtable_visit_ctx *ctx)
+{
+	return unlikely(ctx->flags & KVM_PGTABLE_WALK_SKIP_BBM_TLBI);
+}
+
+static bool kvm_pgtable_walk_skip_cmo(const struct kvm_pgtable_visit_ctx *ctx)
+{
+	return unlikely(ctx->flags & KVM_PGTABLE_WALK_SKIP_CMO);
+}
+
 static bool kvm_phys_is_valid(u64 phys)
 {
 	return phys < BIT(id_aa64mmfr0_parange_to_phys_shift(ID_AA64MMFR0_EL1_PARANGE_MAX));
@@ -741,14 +751,17 @@ static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx,
 	if (!stage2_try_set_pte(ctx, KVM_INVALID_PTE_LOCKED))
 		return false;
 
-	/*
-	 * Perform the appropriate TLB invalidation based on the evicted pte
-	 * value (if any).
-	 */
-	if (kvm_pte_table(ctx->old, ctx->level))
-		kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
-	else if (kvm_pte_valid(ctx->old))
-		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);
+	if (!kvm_pgtable_walk_skip_bbm_tlbi(ctx)) {
+		/*
+		 * Perform the appropriate TLB invalidation based on the
+		 * evicted pte value (if any).
+		 */
+		if (kvm_pte_table(ctx->old, ctx->level))
+			kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
+		else if (kvm_pte_valid(ctx->old))
+			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
+				     ctx->addr, ctx->level);
+	}
 
 	if (stage2_pte_is_counted(ctx->old))
 		mm_ops->put_page(ctx->ptep);
@@ -832,11 +845,13 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
 		return -EAGAIN;
 
 	/* Perform CMOs before installation of the guest stage-2 PTE */
-	if (mm_ops->dcache_clean_inval_poc && stage2_pte_cacheable(pgt, new))
+	if (!kvm_pgtable_walk_skip_cmo(ctx) && mm_ops->dcache_clean_inval_poc &&
+	    stage2_pte_cacheable(pgt, new))
 		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(new, mm_ops),
-						granule);
+					       granule);
 
-	if (mm_ops->icache_inval_pou && stage2_pte_executable(new))
+	if (!kvm_pgtable_walk_skip_cmo(ctx) && mm_ops->icache_inval_pou &&
+	    stage2_pte_executable(new))
 		mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule);
 
 	stage2_make_pte(ctx, new);
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v7 02/12] KVM: arm64: Add KVM_PGTABLE_WALK flags for skipping CMOs and BBM TLBIs
  2023-04-09  6:29 [PATCH v7 00/12] Implement Eager Page Splitting for ARM Ricardo Koller
  2023-04-09  6:29 ` [PATCH v7 01/12] KVM: arm64: Rename free_removed to free_unlinked Ricardo Koller
  2023-04-09  6:29 ` [PATCH v7 02/12] KVM: arm64: Add KVM_PGTABLE_WALK ctx->flags for skipping BBM and CMO Ricardo Koller
@ 2023-04-09  6:29 ` Ricardo Koller
  2023-04-17  6:13   ` Gavin Shan
  2023-04-09  6:29 ` [PATCH v7 03/12] KVM: arm64: Add helper for creating unlinked stage2 subtrees Ricardo Koller
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Ricardo Koller @ 2023-04-09  6:29 UTC (permalink / raw)
  To: pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
	rananta, bgardon, ricarkol, Ricardo Koller, Shaoqin Huang

Add two flags to kvm_pgtable_visit_ctx, KVM_PGTABLE_WALK_SKIP_BBM_TLBI
and KVM_PGTABLE_WALK_SKIP_CMO, to indicate that the walk should not
perform TLB invalidations (TLBIs) in break-before-make (BBM) nor cache
maintenance operations (CMO). This will be used by a future commit to
create unlinked tables not accessible to the HW page-table walker.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
---
 arch/arm64/include/asm/kvm_pgtable.h |  8 ++++++
 arch/arm64/kvm/hyp/pgtable.c         | 37 +++++++++++++++++++---------
 2 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 26a4293726c14..3f2d43ba2b628 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -195,6 +195,12 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
  *					with other software walkers.
  * @KVM_PGTABLE_WALK_HANDLE_FAULT:	Indicates the page-table walk was
  *					invoked from a fault handler.
+ * @KVM_PGTABLE_WALK_SKIP_BBM_TLBI:	Visit and update table entries
+ *					without Break-before-make's
+ *					TLB invalidation.
+ * @KVM_PGTABLE_WALK_SKIP_CMO:		Visit and update table entries
+ *					without Cache maintenance
+ *					operations required.
  */
 enum kvm_pgtable_walk_flags {
 	KVM_PGTABLE_WALK_LEAF			= BIT(0),
@@ -202,6 +208,8 @@ enum kvm_pgtable_walk_flags {
 	KVM_PGTABLE_WALK_TABLE_POST		= BIT(2),
 	KVM_PGTABLE_WALK_SHARED			= BIT(3),
 	KVM_PGTABLE_WALK_HANDLE_FAULT		= BIT(4),
+	KVM_PGTABLE_WALK_SKIP_BBM_TLBI		= BIT(5),
+	KVM_PGTABLE_WALK_SKIP_CMO		= BIT(6),
 };
 
 struct kvm_pgtable_visit_ctx {
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index a3246d6cddec7..633679ee3c49a 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -62,6 +62,16 @@ struct kvm_pgtable_walk_data {
 	u64				end;
 };
 
+static bool kvm_pgtable_walk_skip_bbm_tlbi(const struct kvm_pgtable_visit_ctx *ctx)
+{
+	return unlikely(ctx->flags & KVM_PGTABLE_WALK_SKIP_BBM_TLBI);
+}
+
+static bool kvm_pgtable_walk_skip_cmo(const struct kvm_pgtable_visit_ctx *ctx)
+{
+	return unlikely(ctx->flags & KVM_PGTABLE_WALK_SKIP_CMO);
+}
+
 static bool kvm_phys_is_valid(u64 phys)
 {
 	return phys < BIT(id_aa64mmfr0_parange_to_phys_shift(ID_AA64MMFR0_EL1_PARANGE_MAX));
@@ -741,14 +751,17 @@ static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx,
 	if (!stage2_try_set_pte(ctx, KVM_INVALID_PTE_LOCKED))
 		return false;
 
-	/*
-	 * Perform the appropriate TLB invalidation based on the evicted pte
-	 * value (if any).
-	 */
-	if (kvm_pte_table(ctx->old, ctx->level))
-		kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
-	else if (kvm_pte_valid(ctx->old))
-		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);
+	if (!kvm_pgtable_walk_skip_bbm_tlbi(ctx)) {
+		/*
+		 * Perform the appropriate TLB invalidation based on the
+		 * evicted pte value (if any).
+		 */
+		if (kvm_pte_table(ctx->old, ctx->level))
+			kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
+		else if (kvm_pte_valid(ctx->old))
+			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
+				     ctx->addr, ctx->level);
+	}
 
 	if (stage2_pte_is_counted(ctx->old))
 		mm_ops->put_page(ctx->ptep);
@@ -832,11 +845,13 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
 		return -EAGAIN;
 
 	/* Perform CMOs before installation of the guest stage-2 PTE */
-	if (mm_ops->dcache_clean_inval_poc && stage2_pte_cacheable(pgt, new))
+	if (!kvm_pgtable_walk_skip_cmo(ctx) && mm_ops->dcache_clean_inval_poc &&
+	    stage2_pte_cacheable(pgt, new))
 		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(new, mm_ops),
-						granule);
+					       granule);
 
-	if (mm_ops->icache_inval_pou && stage2_pte_executable(new))
+	if (!kvm_pgtable_walk_skip_cmo(ctx) && mm_ops->icache_inval_pou &&
+	    stage2_pte_executable(new))
 		mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule);
 
 	stage2_make_pte(ctx, new);
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v7 03/12] KVM: arm64: Add helper for creating unlinked stage2 subtrees
  2023-04-09  6:29 [PATCH v7 00/12] Implement Eager Page Splitting for ARM Ricardo Koller
                   ` (2 preceding siblings ...)
  2023-04-09  6:29 ` [PATCH v7 02/12] KVM: arm64: Add KVM_PGTABLE_WALK flags for skipping CMOs and BBM TLBIs Ricardo Koller
@ 2023-04-09  6:29 ` Ricardo Koller
  2023-04-17  6:18   ` Gavin Shan
  2023-04-09  6:29 ` [PATCH v7 04/12] KVM: arm64: Add kvm_pgtable_stage2_split() Ricardo Koller
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Ricardo Koller @ 2023-04-09  6:29 UTC (permalink / raw)
  To: pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
	rananta, bgardon, ricarkol, Ricardo Koller, Shaoqin Huang

Add a stage2 helper, kvm_pgtable_stage2_create_unlinked(), for
creating unlinked tables (which is the opposite of
kvm_pgtable_stage2_free_unlinked()).  Creating an unlinked table is
useful for splitting level 1 and 2 entries into subtrees of PAGE_SIZE
PTEs.  For example, a level 1 entry can be split into PAGE_SIZE PTEs
by first creating a fully populated tree, and then using it to replace
the level 1 entry in a single step.  This will be used in a subsequent
commit for eager huge-page splitting (a dirty-logging optimization).
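
As a rough sketch of the intended call pattern (simplified from the
split walker added later in this series, not a standalone snippet),
splitting one valid block entry boils down to:

	/* Simplified sketch; the real code also frees the subtree on failure. */
	phys = kvm_pte_to_phys(ctx->old);
	prot = kvm_pgtable_stage2_pte_prot(ctx->old);

	/* Build the fully populated subtree off to the side... */
	childp = kvm_pgtable_stage2_create_unlinked(pgt, phys, ctx->level,
						    prot, mc, true);
	if (IS_ERR(childp))
		return PTR_ERR(childp);

	/* ...then break the old block entry and install the subtree. */
	if (!stage2_try_break_pte(ctx, mmu))
		return -EAGAIN;
	stage2_make_pte(ctx, kvm_init_table_pte(childp, mm_ops));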

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
---
 arch/arm64/include/asm/kvm_pgtable.h | 26 +++++++++++++++
 arch/arm64/kvm/hyp/pgtable.c         | 49 ++++++++++++++++++++++++++++
 2 files changed, 75 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 3f2d43ba2b628..c8e0e7d9303b2 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -458,6 +458,32 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
  */
 void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level);
 
+/**
+ * kvm_pgtable_stage2_create_unlinked() - Create an unlinked stage-2 paging structure.
+ * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
+ * @phys:	Physical address of the memory to map.
+ * @level:	Starting level of the stage-2 paging structure to be created.
+ * @prot:	Permissions and attributes for the mapping.
+ * @mc:		Cache of pre-allocated and zeroed memory from which to allocate
+ *		page-table pages.
+ * @force_pte:  Force mappings to PAGE_SIZE granularity.
+ *
+ * Returns an unlinked page-table tree.  This new page-table tree is
+ * not reachable (i.e., it is unlinked) from the root pgd and it's
+ * therefore unreachable by the hardware page-table walker. No TLB
+ * invalidation or CMOs are performed.
+ *
+ * If device attributes are not explicitly requested in @prot, then the
+ * mapping will be normal, cacheable.
+ *
+ * Return: The fully populated (unlinked) stage-2 paging structure, or
+ * an ERR_PTR(error) on failure.
+ */
+kvm_pte_t *kvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt,
+					      u64 phys, u32 level,
+					      enum kvm_pgtable_prot prot,
+					      void *mc, bool force_pte);
+
 /**
  * kvm_pgtable_stage2_map() - Install a mapping in a guest stage-2 page-table.
  * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 633679ee3c49a..477d2be67d401 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1222,6 +1222,55 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
 	return kvm_pgtable_walk(pgt, addr, size, &walker);
 }
 
+kvm_pte_t *kvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt,
+					      u64 phys, u32 level,
+					      enum kvm_pgtable_prot prot,
+					      void *mc, bool force_pte)
+{
+	struct stage2_map_data map_data = {
+		.phys		= phys,
+		.mmu		= pgt->mmu,
+		.memcache	= mc,
+		.force_pte	= force_pte,
+	};
+	struct kvm_pgtable_walker walker = {
+		.cb		= stage2_map_walker,
+		.flags		= KVM_PGTABLE_WALK_LEAF |
+				  KVM_PGTABLE_WALK_SKIP_BBM_TLBI |
+				  KVM_PGTABLE_WALK_SKIP_CMO,
+		.arg		= &map_data,
+	};
+	/* .addr (the IPA) is irrelevant for an unlinked table */
+	struct kvm_pgtable_walk_data data = {
+		.walker	= &walker,
+		.addr	= 0,
+		.end	= kvm_granule_size(level),
+	};
+	struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
+	kvm_pte_t *pgtable;
+	int ret;
+
+	if (!IS_ALIGNED(phys, kvm_granule_size(level)))
+		return ERR_PTR(-EINVAL);
+
+	ret = stage2_set_prot_attr(pgt, prot, &map_data.attr);
+	if (ret)
+		return ERR_PTR(ret);
+
+	pgtable = mm_ops->zalloc_page(mc);
+	if (!pgtable)
+		return ERR_PTR(-ENOMEM);
+
+	ret = __kvm_pgtable_walk(&data, mm_ops, (kvm_pteref_t)pgtable,
+				 level + 1);
+	if (ret) {
+		kvm_pgtable_stage2_free_unlinked(mm_ops, pgtable, level);
+		mm_ops->put_page(pgtable);
+		return ERR_PTR(ret);
+	}
+
+	return pgtable;
+}
 
 int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
 			      struct kvm_pgtable_mm_ops *mm_ops,
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v7 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()
  2023-04-09  6:29 [PATCH v7 00/12] Implement Eager Page Splitting for ARM Ricardo Koller
                   ` (3 preceding siblings ...)
  2023-04-09  6:29 ` [PATCH v7 03/12] KVM: arm64: Add helper for creating unlinked stage2 subtrees Ricardo Koller
@ 2023-04-09  6:29 ` Ricardo Koller
  2023-04-09  9:36   ` kernel test robot
  2023-04-17  6:38   ` Gavin Shan
  2023-04-09  6:29 ` [PATCH v7 05/12] KVM: arm64: Refactor kvm_arch_commit_memory_region() Ricardo Koller
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 40+ messages in thread
From: Ricardo Koller @ 2023-04-09  6:29 UTC (permalink / raw)
  To: pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
	rananta, bgardon, ricarkol, Ricardo Koller, Shaoqin Huang

Add a new stage2 function, kvm_pgtable_stage2_split(), for splitting a
range of huge pages. This will be used for eager-splitting huge pages
into PAGE_SIZE pages. The goal is to avoid having to split huge pages
on write-protection faults, and instead use this function to do it
ahead of time for large ranges (e.g., all guest memory in 1G chunks at
a time).
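
For a rough sense of the page-table memory this implies (a
back-of-the-envelope sketch assuming 4K pages, mirroring the sizing
math used later in this series), splitting a 1G chunk all the way down
to PTEs needs at most:

	n  = DIV_ROUND_UP(SZ_1G, PUD_SIZE);	/*   1 page: the level-2 table   */
	n += DIV_ROUND_UP(SZ_1G, PMD_SIZE);	/* 512 pages: level-3 PTE tables */
	/* n == 513 pages, i.e. ~2M of page-table memory per 1G chunk. */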

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
---
 arch/arm64/include/asm/kvm_pgtable.h |  19 +++++
 arch/arm64/kvm/hyp/pgtable.c         | 103 +++++++++++++++++++++++++++
 2 files changed, 122 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index c8e0e7d9303b2..32e5d42bf020f 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -653,6 +653,25 @@ bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr);
  */
 int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);
 
+/**
+ * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
+ *				to PAGE_SIZE guest pages.
+ * @pgt:	 Page-table structure initialised by kvm_pgtable_stage2_init().
+ * @addr:	 Intermediate physical address from which to split.
+ * @size:	 Size of the range.
+ * @mc:		 Cache of pre-allocated and zeroed memory from which to allocate
+ *		 page-table pages.
+ *
+ * The function tries to split any level 1 or 2 entry that overlaps
+ * with the input range (given by @addr and @size).
+ *
+ * Return: 0 on success, negative error code on failure. Note that
+ * kvm_pgtable_stage2_split() is best effort: it tries to break as many
+ * blocks in the input range as allowed by the pages available in @mc.
+ */
+int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
+			     struct kvm_mmu_memory_cache *mc);
+
 /**
  * kvm_pgtable_walk() - Walk a page-table.
  * @pgt:	Page-table structure initialised by kvm_pgtable_*_init().
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 477d2be67d401..48c5a95c6e8cd 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1272,6 +1272,109 @@ kvm_pte_t *kvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt,
 	return pgtable;
 }
 
+/*
+ * Get the number of page-tables needed to replace a block with a
+ * fully populated tree up to the PTE entries. Note that @level is
+ * interpreted as in "level @level entry".
+ */
+static int stage2_block_get_nr_page_tables(u32 level)
+{
+	switch (level) {
+	case 1:
+		return PTRS_PER_PTE + 1;
+	case 2:
+		return 1;
+	case 3:
+		return 0;
+	default:
+		WARN_ON_ONCE(level < KVM_PGTABLE_MIN_BLOCK_LEVEL ||
+			     level >= KVM_PGTABLE_MAX_LEVELS);
+		return -EINVAL;
+	};
+}
+
+static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
+			       enum kvm_pgtable_walk_flags visit)
+{
+	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
+	struct kvm_mmu_memory_cache *mc = ctx->arg;
+	struct kvm_s2_mmu *mmu;
+	kvm_pte_t pte = ctx->old, new, *childp;
+	enum kvm_pgtable_prot prot;
+	u32 level = ctx->level;
+	bool force_pte;
+	int nr_pages;
+	u64 phys;
+
+	/* No huge-pages exist at the last level */
+	if (level == KVM_PGTABLE_MAX_LEVELS - 1)
+		return 0;
+
+	/* We only split valid block mappings */
+	if (!kvm_pte_valid(pte))
+		return 0;
+
+	nr_pages = stage2_block_get_nr_page_tables(level);
+	if (nr_pages < 0)
+		return nr_pages;
+
+	if (mc->nobjs >= nr_pages) {
+		/* Build a tree mapped down to the PTE granularity. */
+		force_pte = true;
+	} else {
+		/*
+		 * Don't force PTEs, so create_unlinked() below does
+		 * not populate the tree up to the PTE level. The
+		 * consequence is that the call will require a single
+		 * page of level 2 entries at level 1, or a single
+		 * page of PTEs at level 2. If we are at level 1, the
+		 * PTEs will be created recursively.
+		 */
+		force_pte = false;
+		nr_pages = 1;
+	}
+
+	if (mc->nobjs < nr_pages)
+		return -ENOMEM;
+
+	mmu = container_of(mc, struct kvm_s2_mmu, split_page_cache);
+	phys = kvm_pte_to_phys(pte);
+	prot = kvm_pgtable_stage2_pte_prot(pte);
+
+	childp = kvm_pgtable_stage2_create_unlinked(mmu->pgt, phys,
+						    level, prot, mc, force_pte);
+	if (IS_ERR(childp))
+		return PTR_ERR(childp);
+
+	if (!stage2_try_break_pte(ctx, mmu)) {
+		kvm_pgtable_stage2_free_unlinked(mm_ops, childp, level);
+		mm_ops->put_page(childp);
+		return -EAGAIN;
+	}
+
+	/*
+	 * Note, the contents of the page table are guaranteed to be made
+	 * visible before the new PTE is assigned because stage2_make_pte()
+	 * writes the PTE using smp_store_release().
+	 */
+	new = kvm_init_table_pte(childp, mm_ops);
+	stage2_make_pte(ctx, new);
+	dsb(ishst);
+	return 0;
+}
+
+int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
+			     struct kvm_mmu_memory_cache *mc)
+{
+	struct kvm_pgtable_walker walker = {
+		.cb	= stage2_split_walker,
+		.flags	= KVM_PGTABLE_WALK_LEAF,
+		.arg	= mc,
+	};
+
+	return kvm_pgtable_walk(pgt, addr, size, &walker);
+}
+
 int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
 			      struct kvm_pgtable_mm_ops *mm_ops,
 			      enum kvm_pgtable_stage2_flags flags,
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v7 05/12] KVM: arm64: Refactor kvm_arch_commit_memory_region()
  2023-04-09  6:29 [PATCH v7 00/12] Implement Eager Page Splitting for ARM Ricardo Koller
                   ` (4 preceding siblings ...)
  2023-04-09  6:29 ` [PATCH v7 04/12] KVM: arm64: Add kvm_pgtable_stage2_split() Ricardo Koller
@ 2023-04-09  6:29 ` Ricardo Koller
  2023-04-17  6:41   ` Gavin Shan
  2023-04-17  6:42   ` Gavin Shan
  2023-04-09  6:29 ` [PATCH v7 06/12] KVM: arm64: Add kvm_uninit_stage2_mmu() Ricardo Koller
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 40+ messages in thread
From: Ricardo Koller @ 2023-04-09  6:29 UTC (permalink / raw)
  To: pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
	rananta, bgardon, ricarkol, Ricardo Koller, Shaoqin Huang

Refactor kvm_arch_commit_memory_region() in preparation for a future
commit, so that it looks cleaner and is easier to follow. It also ends
up looking more like its x86 counterpart, kvm_mmu_slot_apply_flags().

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
---
 arch/arm64/kvm/mmu.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index efdaab3f154de..37d7d2aa472ab 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1761,20 +1761,27 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 				   const struct kvm_memory_slot *new,
 				   enum kvm_mr_change change)
 {
+	bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
+
 	/*
 	 * At this point memslot has been committed and there is an
 	 * allocated dirty_bitmap[], dirty pages will be tracked while the
 	 * memory slot is write protected.
 	 */
-	if (change != KVM_MR_DELETE && new->flags & KVM_MEM_LOG_DIRTY_PAGES) {
+	if (log_dirty_pages) {
+
+		if (change == KVM_MR_DELETE)
+			return;
+
 		/*
 		 * If we're with initial-all-set, we don't need to write
 		 * protect any pages because they're all reported as dirty.
 		 * Huge pages and normal pages will be write protect gradually.
 		 */
-		if (!kvm_dirty_log_manual_protect_and_init_set(kvm)) {
-			kvm_mmu_wp_memory_region(kvm, new->id);
-		}
+		if (kvm_dirty_log_manual_protect_and_init_set(kvm))
+			return;
+
+		kvm_mmu_wp_memory_region(kvm, new->id);
 	}
 }
 
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v7 06/12] KVM: arm64: Add kvm_uninit_stage2_mmu()
  2023-04-09  6:29 [PATCH v7 00/12] Implement Eager Page Splitting for ARM Ricardo Koller
                   ` (5 preceding siblings ...)
  2023-04-09  6:29 ` [PATCH v7 05/12] KVM: arm64: Refactor kvm_arch_commit_memory_region() Ricardo Koller
@ 2023-04-09  6:29 ` Ricardo Koller
  2023-04-17  6:44   ` Gavin Shan
  2023-04-09  6:29 ` [PATCH v7 07/12] KVM: arm64: Export kvm_are_all_memslots_empty() Ricardo Koller
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Ricardo Koller @ 2023-04-09  6:29 UTC (permalink / raw)
  To: pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
	rananta, bgardon, ricarkol, Ricardo Koller, Shaoqin Huang

Add kvm_uninit_stage2_mmu() and move kvm_free_stage2_pgd() into it. A
future commit will add more teardown work inside kvm_uninit_stage2_mmu().

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
---
 arch/arm64/include/asm/kvm_mmu.h | 1 +
 arch/arm64/kvm/mmu.c             | 7 ++++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 083cc47dca086..7d173da5bd51c 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -168,6 +168,7 @@ void __init free_hyp_pgds(void);
 
 void stage2_unmap_vm(struct kvm *kvm);
 int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long type);
+void kvm_uninit_stage2_mmu(struct kvm *kvm);
 void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu);
 int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 			  phys_addr_t pa, unsigned long size, bool writable);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 37d7d2aa472ab..a2800e5c42712 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -767,6 +767,11 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
 	return err;
 }
 
+void kvm_uninit_stage2_mmu(struct kvm *kvm)
+{
+	kvm_free_stage2_pgd(&kvm->arch.mmu);
+}
+
 static void stage2_unmap_memslot(struct kvm *kvm,
 				 struct kvm_memory_slot *memslot)
 {
@@ -1855,7 +1860,7 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
-	kvm_free_stage2_pgd(&kvm->arch.mmu);
+	kvm_uninit_stage2_mmu(kvm);
 }
 
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v7 07/12] KVM: arm64: Export kvm_are_all_memslots_empty()
  2023-04-09  6:29 [PATCH v7 00/12] Implement Eager Page Splitting for ARM Ricardo Koller
                   ` (6 preceding siblings ...)
  2023-04-09  6:29 ` [PATCH v7 06/12] KVM: arm64: Add kvm_uninit_stage2_mmu() Ricardo Koller
@ 2023-04-09  6:29 ` Ricardo Koller
  2023-04-17  6:47   ` Gavin Shan
  2023-04-09  6:29 ` [PATCH v7 08/12] KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE Ricardo Koller
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Ricardo Koller @ 2023-04-09  6:29 UTC (permalink / raw)
  To: pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
	rananta, bgardon, ricarkol, Ricardo Koller, Shaoqin Huang

Export kvm_are_all_memslots_empty(). A future commit will use it to
check that no memslots have been created before setting a capability.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
---
 include/linux/kvm_host.h | 2 ++
 virt/kvm/kvm_main.c      | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8ada23756b0ec..c6fa634f236d9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -990,6 +990,8 @@ static inline bool kvm_memslots_empty(struct kvm_memslots *slots)
 	return RB_EMPTY_ROOT(&slots->gfn_tree);
 }
 
+bool kvm_are_all_memslots_empty(struct kvm *kvm);
+
 #define kvm_for_each_memslot(memslot, bkt, slots)			      \
 	hash_for_each(slots->id_hash, bkt, memslot, id_node[slots->node_idx]) \
 		if (WARN_ON_ONCE(!memslot->npages)) {			      \
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d255964ec331e..897b000787beb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4596,7 +4596,7 @@ int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 	return -EINVAL;
 }
 
-static bool kvm_are_all_memslots_empty(struct kvm *kvm)
+bool kvm_are_all_memslots_empty(struct kvm *kvm)
 {
 	int i;
 
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v7 08/12] KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
  2023-04-09  6:29 [PATCH v7 00/12] Implement Eager Page Splitting for ARM Ricardo Koller
                   ` (7 preceding siblings ...)
  2023-04-09  6:29 ` [PATCH v7 07/12] KVM: arm64: Export kvm_are_all_memslots_empty() Ricardo Koller
@ 2023-04-09  6:29 ` Ricardo Koller
  2023-04-17  7:04   ` Gavin Shan
  2023-04-09  6:29 ` [PATCH v7 09/12] KVM: arm64: Split huge pages when dirty logging is enabled Ricardo Koller
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Ricardo Koller @ 2023-04-09  6:29 UTC (permalink / raw)
  To: pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
	rananta, bgardon, ricarkol, Ricardo Koller, Oliver Upton

Add a capability for userspace to specify the eager split chunk size.
The chunk size specifies how many pages to break at a time, using a
single allocation. The bigger the chunk size, the more pages need to
be allocated ahead of time.
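
As an illustration only (not from this series), here is a minimal
userspace sketch of how a VMM might pick and enable a chunk size right
after creating the VM; vm_fd is assumed to exist and error handling is
omitted:

	/* Needs <linux/kvm.h>, <sys/ioctl.h>, <stdint.h>. */
	long sizes = ioctl(vm_fd, KVM_CHECK_EXTENSION,
			   KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES);
	uint64_t chunk = 2UL * 1024 * 1024;	/* try 2M chunks */

	if (sizes > 0 && (sizes & chunk)) {
		struct kvm_enable_cap cap = {
			.cap = KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE,
			.args[0] = chunk,
		};

		/* Must happen before any memslot is created. */
		ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
	}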

Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 Documentation/virt/kvm/api.rst       | 28 ++++++++++++++++++++++++++
 arch/arm64/include/asm/kvm_host.h    | 15 ++++++++++++++
 arch/arm64/include/asm/kvm_pgtable.h | 18 +++++++++++++++++
 arch/arm64/kvm/arm.c                 | 30 ++++++++++++++++++++++++++++
 arch/arm64/kvm/mmu.c                 |  3 +++
 include/uapi/linux/kvm.h             |  2 ++
 6 files changed, 96 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 62de0768d6aa5..f8faa80d87057 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8380,6 +8380,34 @@ structure.
 When getting the Modified Change Topology Report value, the attr->addr
 must point to a byte where the value will be stored or retrieved from.
 
+8.40 KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
+---------------------------------------
+
+:Capability: KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
+:Architectures: arm64
+:Type: vm
+:Parameters: arg[0] is the new split chunk size.
+:Returns: 0 on success, -EINVAL if any memslot was already created.
+
+This capability sets the chunk size used in Eager Page Splitting.
+
+Eager Page Splitting improves the performance of dirty-logging (used
+in live migrations) when guest memory is backed by huge-pages.  It
+avoids splitting huge-pages (into PAGE_SIZE pages) on fault, by doing
+it eagerly when enabling dirty logging (with the
+KVM_MEM_LOG_DIRTY_PAGES flag for a memory region), or when using
+KVM_CLEAR_DIRTY_LOG.
+
+The chunk size specifies how many pages to break at a time, using a
+single allocation for each chunk. The bigger the chunk size, the more
+pages need to be allocated ahead of time.
+
+The chunk size needs to be a valid block size. The list of acceptable
+block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a 64bit
+bitmap (each bit describing a block size). Setting
+KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE to 0 disables Eager Page Splitting;
+this is the default value.
+
 9. Known KVM API problems
 =========================
 
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index a1892a8f60323..b87da1ebc3454 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -158,6 +158,21 @@ struct kvm_s2_mmu {
 	/* The last vcpu id that ran on each physical CPU */
 	int __percpu *last_vcpu_ran;
 
+#define KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT 0
+	/*
+	 * Memory cache used to split
+	 * KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE worth of huge pages. It
+	 * is used to allocate stage2 page tables while splitting huge
+	 * pages. Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE
+	 * influences both the capacity of the split page cache, and
+	 * how often KVM reschedules. Be wary of raising CHUNK_SIZE
+	 * too high.
+	 *
+	 * Protected by kvm->slots_lock.
+	 */
+	struct kvm_mmu_memory_cache split_page_cache;
+	uint64_t split_page_chunk_size;
+
 	struct kvm_arch *arch;
 };
 
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 32e5d42bf020f..889bd7afeb355 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -92,6 +92,24 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
 	return level >= KVM_PGTABLE_MIN_BLOCK_LEVEL;
 }
 
+static inline u64 kvm_supported_block_sizes(void)
+{
+	u32 level = KVM_PGTABLE_MIN_BLOCK_LEVEL;
+	u64 res = 0;
+
+	for (; level < KVM_PGTABLE_MAX_LEVELS; level++)
+		res |= BIT(kvm_granule_shift(level));
+
+	return res;
+}
+
+static inline bool kvm_is_block_size_supported(u64 size)
+{
+	bool is_power_of_two = !((size) & ((size)-1));
+
+	return is_power_of_two && (size & kvm_supported_block_sizes());
+}
+
 /**
  * struct kvm_pgtable_mm_ops - Memory management callbacks.
  * @zalloc_page:		Allocate a single zeroed memory page.
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 3bd732eaf0872..34fd3c59a9b82 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -67,6 +67,7 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 			    struct kvm_enable_cap *cap)
 {
 	int r;
+	u64 new_cap;
 
 	if (cap->flags)
 		return -EINVAL;
@@ -91,6 +92,26 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		r = 0;
 		set_bit(KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED, &kvm->arch.flags);
 		break;
+	case KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE:
+		new_cap = cap->args[0];
+
+		mutex_lock(&kvm->lock);
+		mutex_lock(&kvm->slots_lock);
+		/*
+		 * To keep things simple, allow changing the chunk
+		 * size only if there are no memslots already created.
+		 */
+		if (!kvm_are_all_memslots_empty(kvm)) {
+			r = -EINVAL;
+		} else if (new_cap && !kvm_is_block_size_supported(new_cap)) {
+			r = -EINVAL;
+		} else {
+			r = 0;
+			kvm->arch.mmu.split_page_chunk_size = new_cap;
+		}
+		mutex_unlock(&kvm->slots_lock);
+		mutex_unlock(&kvm->lock);
+		break;
 	default:
 		r = -EINVAL;
 		break;
@@ -288,6 +309,15 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ARM_PTRAUTH_GENERIC:
 		r = system_has_full_ptr_auth();
 		break;
+	case KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE:
+		if (kvm)
+			r = kvm->arch.mmu.split_page_chunk_size;
+		else
+			r = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
+		break;
+	case KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES:
+		r = kvm_supported_block_sizes();
+		break;
 	default:
 		r = 0;
 	}
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index a2800e5c42712..898985b09321a 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -756,6 +756,9 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
 	for_each_possible_cpu(cpu)
 		*per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;
 
+	mmu->split_page_cache.gfp_zero = __GFP_ZERO;
+	mmu->split_page_chunk_size = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
+
 	mmu->pgt = pgt;
 	mmu->pgd_phys = __pa(pgt->pgd);
 	return 0;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d77aef872a0a0..f18b48fcd25ba 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1184,6 +1184,8 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
 #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
 #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
+#define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 227
+#define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 228
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v7 09/12] KVM: arm64: Split huge pages when dirty logging is enabled
  2023-04-09  6:29 [PATCH v7 00/12] Implement Eager Page Splitting for ARM Ricardo Koller
                   ` (8 preceding siblings ...)
  2023-04-09  6:29 ` [PATCH v7 08/12] KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE Ricardo Koller
@ 2023-04-09  6:29 ` Ricardo Koller
  2023-04-17  7:11   ` Gavin Shan
  2023-04-09  6:29 ` [PATCH v7 10/12] KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked() Ricardo Koller
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Ricardo Koller @ 2023-04-09  6:29 UTC (permalink / raw)
  To: pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
	rananta, bgardon, ricarkol, Ricardo Koller, Shaoqin Huang

Split huge pages eagerly when enabling dirty logging. The goal is to
avoid doing it while faulting on write-protected pages, which
negatively impacts guest performance.

A memslot marked for dirty logging is split one 1GB piece at a time:
this allows releasing the mmu_lock between pieces, giving other kernel
threads the opportunity to run, and bounds the allocation to the pages
needed to split a 1GB range worth of huge pages (or a single 1GB huge
page).
Note that these page allocations can fail, so eager page splitting is
best-effort.  This is not a correctness issue though, as huge pages
can still be split on write-faults.

The benefits of eager page splitting are the same as in x86, added
with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
the TDP MMU when dirty logging is enabled"). For example, when running
dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
all of their memory after dirty logging is enabled decreased by 44%
from 2.58s to 1.42s.
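
As a back-of-the-envelope check of the cache sizing (assuming a 4K
granule and that 1GB block mappings are possible, i.e.
KVM_PGTABLE_MIN_BLOCK_LEVEL < 2): kvm_mmu_split_nr_page_tables(),
added below, asks for DIV_ROUND_UP(1GB, PUD_SIZE) = 1 table to hold
the 512 PMD-level entries, plus DIV_ROUND_UP(1GB, PMD_SIZE) = 512
tables of PTEs; at most 513 page-table pages (about 2MiB) get topped
up into split_page_cache per 1GB chunk.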

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
---
 arch/arm64/kvm/mmu.c | 123 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 121 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 898985b09321a..aaefabd8de89d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -31,14 +31,21 @@ static phys_addr_t __ro_after_init hyp_idmap_vector;
 
 static unsigned long __ro_after_init io_map_base;
 
-static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
+static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
+					   phys_addr_t size)
 {
-	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
 	phys_addr_t boundary = ALIGN_DOWN(addr + size, size);
 
 	return (boundary - 1 < end - 1) ? boundary : end;
 }
 
+static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
+{
+	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
+
+	return __stage2_range_addr_end(addr, end, size);
+}
+
 /*
  * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
  * we may see kernel panics with CONFIG_DETECT_HUNG_TASK,
@@ -75,6 +82,79 @@ static int stage2_apply_range(struct kvm_s2_mmu *mmu, phys_addr_t addr,
 #define stage2_apply_range_resched(mmu, addr, end, fn)			\
 	stage2_apply_range(mmu, addr, end, fn, true)
 
+/*
+ * Get the maximum number of page-table pages needed to split a range
+ * of blocks into PAGE_SIZE PTEs. It assumes the range is already
+ * mapped at level 2, or at level 1 if allowed.
+ */
+static int kvm_mmu_split_nr_page_tables(u64 range)
+{
+	int n = 0;
+
+	if (KVM_PGTABLE_MIN_BLOCK_LEVEL < 2)
+		n += DIV_ROUND_UP_ULL(range, PUD_SIZE);
+	n += DIV_ROUND_UP_ULL(range, PMD_SIZE);
+	return n;
+}
+
+static bool need_split_memcache_topup_or_resched(struct kvm *kvm)
+{
+	struct kvm_mmu_memory_cache *cache;
+	u64 chunk_size, min;
+
+	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
+		return true;
+
+	chunk_size = kvm->arch.mmu.split_page_chunk_size;
+	min = kvm_mmu_split_nr_page_tables(chunk_size);
+	cache = &kvm->arch.mmu.split_page_cache;
+	return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
+}
+
+static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
+				    phys_addr_t end)
+{
+	struct kvm_mmu_memory_cache *cache;
+	struct kvm_pgtable *pgt;
+	int ret, cache_capacity;
+	u64 next, chunk_size;
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	chunk_size = kvm->arch.mmu.split_page_chunk_size;
+	cache_capacity = kvm_mmu_split_nr_page_tables(chunk_size);
+
+	if (chunk_size == 0)
+		return 0;
+
+	cache = &kvm->arch.mmu.split_page_cache;
+
+	do {
+		if (need_split_memcache_topup_or_resched(kvm)) {
+			write_unlock(&kvm->mmu_lock);
+			cond_resched();
+			/* Eager page splitting is best-effort. */
+			ret = __kvm_mmu_topup_memory_cache(cache,
+							   cache_capacity,
+							   cache_capacity);
+			write_lock(&kvm->mmu_lock);
+			if (ret)
+				break;
+		}
+
+		pgt = kvm->arch.mmu.pgt;
+		if (!pgt)
+			return -EINVAL;
+
+		next = __stage2_range_addr_end(addr, end, chunk_size);
+		ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache);
+		if (ret)
+			break;
+	} while (addr = next, addr != end);
+
+	return ret;
+}
+
 static bool memslot_is_logging(struct kvm_memory_slot *memslot)
 {
 	return memslot->dirty_bitmap && !(memslot->flags & KVM_MEM_READONLY);
@@ -773,6 +853,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
 void kvm_uninit_stage2_mmu(struct kvm *kvm)
 {
 	kvm_free_stage2_pgd(&kvm->arch.mmu);
+	kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
 }
 
 static void stage2_unmap_memslot(struct kvm *kvm,
@@ -999,6 +1080,34 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 	stage2_wp_range(&kvm->arch.mmu, start, end);
 }
 
+/**
+ * kvm_mmu_split_memory_region() - split the stage 2 blocks into PAGE_SIZE
+ *				   pages for memory slot
+ * @kvm:	The KVM pointer
+ * @slot:	The memory slot to split
+ *
+ * Acquires kvm->mmu_lock. Called with kvm->slots_lock mutex acquired,
+ * serializing operations for VM memory regions.
+ */
+static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot)
+{
+	struct kvm_memslots *slots;
+	struct kvm_memory_slot *memslot;
+	phys_addr_t start, end;
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	slots = kvm_memslots(kvm);
+	memslot = id_to_memslot(slots, slot);
+
+	start = memslot->base_gfn << PAGE_SHIFT;
+	end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
+
+	write_lock(&kvm->mmu_lock);
+	kvm_mmu_split_huge_pages(kvm, start, end);
+	write_unlock(&kvm->mmu_lock);
+}
+
 /*
  * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
  * dirty pages.
@@ -1790,6 +1899,16 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 			return;
 
 		kvm_mmu_wp_memory_region(kvm, new->id);
+		kvm_mmu_split_memory_region(kvm, new->id);
+	} else {
+		/*
+		 * Free any leftovers from the eager page splitting cache. Do
+		 * this when deleting, moving, disabling dirty logging, or
+		 * creating the memslot (a nop). Doing it for deletes makes
+		 * sure we don't leak memory, and there's no need to keep the
+		 * cache around for any of the other cases.
+		 */
+		kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
 	}
 }
 
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v7 10/12] KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked()
  2023-04-09  6:29 [PATCH v7 00/12] Implement Eager Page Splitting for ARM Ricardo Koller
                   ` (9 preceding siblings ...)
  2023-04-09  6:29 ` [PATCH v7 09/12] KVM: arm64: Split huge pages when dirty logging is enabled Ricardo Koller
@ 2023-04-09  6:29 ` Ricardo Koller
  2023-04-17  7:14   ` Gavin Shan
  2023-04-09  6:29 ` [PATCH v7 11/12] KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG Ricardo Koller
  2023-04-09  6:30 ` [PATCH v7 12/12] KVM: arm64: Use local TLBI on permission relaxation Ricardo Koller
  12 siblings, 1 reply; 40+ messages in thread
From: Ricardo Koller @ 2023-04-09  6:29 UTC (permalink / raw)
  To: pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
	rananta, bgardon, ricarkol, Ricardo Koller

Move the functionality of kvm_mmu_write_protect_pt_masked() into its
caller, kvm_arch_mmu_enable_log_dirty_pt_masked().  This will be used
in a subsequent commit in order to share some of the code in
kvm_arch_mmu_enable_log_dirty_pt_masked().

Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/arm64/kvm/mmu.c | 42 +++++++++++++++---------------------------
 1 file changed, 15 insertions(+), 27 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index aaefabd8de89d..16fa24f761152 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1058,28 +1058,6 @@ static void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot)
 	kvm_flush_remote_tlbs(kvm);
 }
 
-/**
- * kvm_mmu_write_protect_pt_masked() - write protect dirty pages
- * @kvm:	The KVM pointer
- * @slot:	The memory slot associated with mask
- * @gfn_offset:	The gfn offset in memory slot
- * @mask:	The mask of dirty pages at offset 'gfn_offset' in this memory
- *		slot to be write protected
- *
- * Walks bits set in mask write protects the associated pte's. Caller must
- * acquire kvm_mmu_lock.
- */
-static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
-		struct kvm_memory_slot *slot,
-		gfn_t gfn_offset, unsigned long mask)
-{
-	phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
-	phys_addr_t start = (base_gfn +  __ffs(mask)) << PAGE_SHIFT;
-	phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
-
-	stage2_wp_range(&kvm->arch.mmu, start, end);
-}
-
 /**
  * kvm_mmu_split_memory_region() - split the stage 2 blocks into PAGE_SIZE
  *				   pages for memory slot
@@ -1109,17 +1087,27 @@ static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot)
 }
 
 /*
- * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
- * dirty pages.
+ * kvm_arch_mmu_enable_log_dirty_pt_masked() - enable dirty logging for selected pages.
+ * @kvm:	The KVM pointer
+ * @slot:	The memory slot associated with mask
+ * @gfn_offset:	The gfn offset in memory slot
+ * @mask:	The mask of pages at offset 'gfn_offset' in this memory
+ *		slot to enable dirty logging on
  *
- * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
- * enable dirty logging for them.
+ * Writes protect selected pages to enable dirty logging for them. Caller must
+ * acquire kvm->mmu_lock.
  */
 void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 		struct kvm_memory_slot *slot,
 		gfn_t gfn_offset, unsigned long mask)
 {
-	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
+	phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
+	phys_addr_t start = (base_gfn +  __ffs(mask)) << PAGE_SHIFT;
+	phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	stage2_wp_range(&kvm->arch.mmu, start, end);
 }
 
 static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v7 11/12] KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG
  2023-04-09  6:29 [PATCH v7 00/12] Implement Eager Page Splitting for ARM Ricardo Koller
                   ` (10 preceding siblings ...)
  2023-04-09  6:29 ` [PATCH v7 10/12] KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked() Ricardo Koller
@ 2023-04-09  6:29 ` Ricardo Koller
  2023-04-17  7:18   ` Gavin Shan
  2023-04-09  6:30 ` [PATCH v7 12/12] KVM: arm64: Use local TLBI on permission relaxation Ricardo Koller
  12 siblings, 1 reply; 40+ messages in thread
From: Ricardo Koller @ 2023-04-09  6:29 UTC (permalink / raw)
  To: pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
	rananta, bgardon, ricarkol, Ricardo Koller

This is the arm64 counterpart of commit cb00a70bd4b7 ("KVM: x86/mmu:
Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG"),
which has the benefit of spreading the cost of splitting a memslot
across multiple ioctls.

Split huge pages on the range specified via KVM_CLEAR_DIRTY_LOG, and
do not split when enabling dirty logging if
KVM_DIRTY_LOG_INITIALLY_SET is set.
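
[Not part of the series: a rough userspace sketch of throttling the
split work through the range-based ioctl.  The pages-per-call chunk,
the sleep, and the assumption that the memslot has dirty logging with
KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE and KVM_DIRTY_LOG_INITIALLY_SET
enabled are illustrative choices, not something this patch mandates.]

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define CLEAR_CHUNK_PAGES	(256 * 1024)	/* 1GiB of 4KiB pages per call */

/* Reprotect (and eagerly split) one memslot in chunks, pausing in between. */
void clear_in_chunks(int vm_fd, uint32_t slot, uint64_t npages)
{
	static uint64_t bitmap[CLEAR_CHUNK_PAGES / 64];
	uint64_t first;

	/* All-ones bitmap: ask KVM to reprotect every page in the chunk. */
	memset(bitmap, 0xff, sizeof(bitmap));

	for (first = 0; first < npages; first += CLEAR_CHUNK_PAGES) {
		struct kvm_clear_dirty_log clear = {
			.slot = slot,
			.first_page = first,
			.num_pages = (npages - first < CLEAR_CHUNK_PAGES) ?
				     (npages - first) : CLEAR_CHUNK_PAGES,
			.dirty_bitmap = bitmap,
		};

		ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
		usleep(1000);	/* arbitrary pause to throttle the TLBIs */
	}
}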

Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/arm64/kvm/mmu.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 16fa24f761152..50488daab0f4d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1094,8 +1094,8 @@ static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot)
  * @mask:	The mask of pages at offset 'gfn_offset' in this memory
  *		slot to enable dirty logging on
  *
- * Writes protect selected pages to enable dirty logging for them. Caller must
- * acquire kvm->mmu_lock.
+ * Writes protect selected pages to enable dirty logging, and then
+ * splits them to PAGE_SIZE. Caller must acquire kvm->mmu_lock.
  */
 void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 		struct kvm_memory_slot *slot,
@@ -1108,6 +1108,17 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
 	stage2_wp_range(&kvm->arch.mmu, start, end);
+
+	/*
+	 * Eager-splitting is done when manual-protect is set.  We
+	 * also check for initially-all-set because we can avoid
+	 * eager-splitting if initially-all-set is false.
+	 * Initially-all-set equal false implies that huge-pages were
+	 * already split when enabling dirty logging: no need to do it
+	 * again.
+	 */
+	if (kvm_dirty_log_manual_protect_and_init_set(kvm))
+		kvm_mmu_split_huge_pages(kvm, start, end);
 }
 
 static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v7 12/12] KVM: arm64: Use local TLBI on permission relaxation
  2023-04-09  6:29 [PATCH v7 00/12] Implement Eager Page Splitting for ARM Ricardo Koller
                   ` (11 preceding siblings ...)
  2023-04-09  6:29 ` [PATCH v7 11/12] KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG Ricardo Koller
@ 2023-04-09  6:30 ` Ricardo Koller
  2023-04-17  7:20   ` Gavin Shan
  12 siblings, 1 reply; 40+ messages in thread
From: Ricardo Koller @ 2023-04-09  6:30 UTC (permalink / raw)
  To: pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
	rananta, bgardon, ricarkol, Ricardo Koller

From: Marc Zyngier <maz@kernel.org>

Broadcast TLB invalidations (TLBIs) targeting the Inner Shareable
Domain are usually less performant than their non-shareable variant.
In particular, we observed some implementations that take
milliseconds to complete parallel broadcasted TLBIs.

It's safe to use non-shareable TLBIs when relaxing permissions on a
PTE in the KVM case.  According to the ARM ARM (0487I.a) section
D8.13.1 "Using break-before-make when updating translation table
entries", permission relaxation does not need break-before-make.
Specifically, R_WHZWS states that these are the only changes that
require a break-before-make sequence: changes of memory type
(Shareability or Cacheability), address changes, or changing the block
size.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/arm64/include/asm/kvm_asm.h   |  4 +++
 arch/arm64/kvm/hyp/nvhe/hyp-main.c | 10 ++++++
 arch/arm64/kvm/hyp/nvhe/tlb.c      | 54 ++++++++++++++++++++++++++++++
 arch/arm64/kvm/hyp/pgtable.c       |  2 +-
 arch/arm64/kvm/hyp/vhe/tlb.c       | 32 ++++++++++++++++++
 5 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 43c3bc0f9544d..bb17b2ead4c71 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -68,6 +68,7 @@ enum __kvm_host_smccc_func {
 	__KVM_HOST_SMCCC_FUNC___kvm_vcpu_run,
 	__KVM_HOST_SMCCC_FUNC___kvm_flush_vm_context,
 	__KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_ipa,
+	__KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_ipa_nsh,
 	__KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid,
 	__KVM_HOST_SMCCC_FUNC___kvm_flush_cpu_context,
 	__KVM_HOST_SMCCC_FUNC___kvm_timer_set_cntvoff,
@@ -225,6 +226,9 @@ extern void __kvm_flush_vm_context(void);
 extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu);
 extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa,
 				     int level);
+extern void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
+					 phys_addr_t ipa,
+					 int level);
 extern void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu);
 
 extern void __kvm_timer_set_cntvoff(u64 cntvoff);
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 728e01d4536b0..c6bf1e49ca934 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -125,6 +125,15 @@ static void handle___kvm_tlb_flush_vmid_ipa(struct kvm_cpu_context *host_ctxt)
 	__kvm_tlb_flush_vmid_ipa(kern_hyp_va(mmu), ipa, level);
 }
 
+static void handle___kvm_tlb_flush_vmid_ipa_nsh(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
+	DECLARE_REG(phys_addr_t, ipa, host_ctxt, 2);
+	DECLARE_REG(int, level, host_ctxt, 3);
+
+	__kvm_tlb_flush_vmid_ipa_nsh(kern_hyp_va(mmu), ipa, level);
+}
+
 static void handle___kvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt)
 {
 	DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
@@ -315,6 +324,7 @@ static const hcall_t host_hcall[] = {
 	HANDLE_FUNC(__kvm_vcpu_run),
 	HANDLE_FUNC(__kvm_flush_vm_context),
 	HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa),
+	HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa_nsh),
 	HANDLE_FUNC(__kvm_tlb_flush_vmid),
 	HANDLE_FUNC(__kvm_flush_cpu_context),
 	HANDLE_FUNC(__kvm_timer_set_cntvoff),
diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
index d296d617f5896..ef2b70587f933 100644
--- a/arch/arm64/kvm/hyp/nvhe/tlb.c
+++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
@@ -109,6 +109,60 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
 	__tlb_switch_to_host(&cxt);
 }
 
+void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
+				  phys_addr_t ipa, int level)
+{
+	struct tlb_inv_context cxt;
+
+	dsb(nshst);
+
+	/* Switch to requested VMID */
+	__tlb_switch_to_guest(mmu, &cxt);
+
+	/*
+	 * We could do so much better if we had the VA as well.
+	 * Instead, we invalidate Stage-2 for this IPA, and the
+	 * whole of Stage-1. Weep...
+	 */
+	ipa >>= 12;
+	__tlbi_level(ipas2e1, ipa, level);
+
+	/*
+	 * We have to ensure completion of the invalidation at Stage-2,
+	 * since a table walk on another CPU could refill a TLB with a
+	 * complete (S1 + S2) walk based on the old Stage-2 mapping if
+	 * the Stage-1 invalidation happened first.
+	 */
+	dsb(nsh);
+	__tlbi(vmalle1);
+	dsb(nsh);
+	isb();
+
+	/*
+	 * If the host is running at EL1 and we have a VPIPT I-cache,
+	 * then we must perform I-cache maintenance at EL2 in order for
+	 * it to have an effect on the guest. Since the guest cannot hit
+	 * I-cache lines allocated with a different VMID, we don't need
+	 * to worry about junk out of guest reset (we nuke the I-cache on
+	 * VMID rollover), but we do need to be careful when remapping
+	 * executable pages for the same guest. This can happen when KSM
+	 * takes a CoW fault on an executable page, copies the page into
+	 * a page that was previously mapped in the guest and then needs
+	 * to invalidate the guest view of the I-cache for that page
+	 * from EL1. To solve this, we invalidate the entire I-cache when
+	 * unmapping a page from a guest if we have a VPIPT I-cache but
+	 * the host is running at EL1. As above, we could do better if
+	 * we had the VA.
+	 *
+	 * The moral of this story is: if you have a VPIPT I-cache, then
+	 * you should be running with VHE enabled.
+	 */
+	if (icache_is_vpipt())
+		icache_inval_all_pou();
+
+	__tlb_switch_to_host(&cxt);
+}
+
 void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
 {
 	struct tlb_inv_context cxt;
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 48c5a95c6e8cd..023269dd84f76 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1189,7 +1189,7 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
 				       KVM_PGTABLE_WALK_HANDLE_FAULT |
 				       KVM_PGTABLE_WALK_SHARED);
 	if (!ret)
-		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, pgt->mmu, addr, level);
+		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa_nsh, pgt->mmu, addr, level);
 	return ret;
 }
 
diff --git a/arch/arm64/kvm/hyp/vhe/tlb.c b/arch/arm64/kvm/hyp/vhe/tlb.c
index 24cef9b87f9e9..e69da550cdc5b 100644
--- a/arch/arm64/kvm/hyp/vhe/tlb.c
+++ b/arch/arm64/kvm/hyp/vhe/tlb.c
@@ -111,6 +111,38 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
 	__tlb_switch_to_host(&cxt);
 }
 
+void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
+				  phys_addr_t ipa, int level)
+{
+	struct tlb_inv_context cxt;
+
+	dsb(nshst);
+
+	/* Switch to requested VMID */
+	__tlb_switch_to_guest(mmu, &cxt);
+
+	/*
+	 * We could do so much better if we had the VA as well.
+	 * Instead, we invalidate Stage-2 for this IPA, and the
+	 * whole of Stage-1. Weep...
+	 */
+	ipa >>= 12;
+	__tlbi_level(ipas2e1, ipa, level);
+
+	/*
+	 * We have to ensure completion of the invalidation at Stage-2,
+	 * since a table walk on another CPU could refill a TLB with a
+	 * complete (S1 + S2) walk based on the old Stage-2 mapping if
+	 * the Stage-1 invalidation happened first.
+	 */
+	dsb(nsh);
+	__tlbi(vmalle1);
+	dsb(nsh);
+	isb();
+
+	__tlb_switch_to_host(&cxt);
+}
+
 void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
 {
 	struct tlb_inv_context cxt;
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()
  2023-04-09  6:29 ` [PATCH v7 04/12] KVM: arm64: Add kvm_pgtable_stage2_split() Ricardo Koller
@ 2023-04-09  9:36   ` kernel test robot
  2023-04-10 17:40     ` Ricardo Koller
  2023-04-17  6:38   ` Gavin Shan
  1 sibling, 1 reply; 40+ messages in thread
From: kernel test robot @ 2023-04-09  9:36 UTC (permalink / raw)
  To: Ricardo Koller, pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: oe-kbuild-all, kvm, kvmarm, qperret, catalin.marinas,
	andrew.jones, seanjc, alexandru.elisei, suzuki.poulose,
	eric.auger, gshan, reijiw, rananta, bgardon, ricarkol,
	Ricardo Koller, Shaoqin Huang

Hi Ricardo,

kernel test robot noticed the following build errors:

[auto build test ERROR on kvm/queue]
[also build test ERROR on mst-vhost/linux-next linus/master v6.3-rc5 next-20230406]
[cannot apply to kvmarm/next kvm/linux-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ricardo-Koller/KVM-arm64-Rename-free_removed-to-free_unlinked/20230409-143229
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
patch link:    https://lore.kernel.org/r/20230409063000.3559991-6-ricarkol%40google.com
patch subject: [PATCH v7 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()
config: arm64-defconfig (https://download.01.org/0day-ci/archive/20230409/202304091707.ALABRVCG-lkp@intel.com/config)
compiler: aarch64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/c94328e3e8b2d2d873503360ea730c87f4a03301
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Ricardo-Koller/KVM-arm64-Rename-free_removed-to-free_unlinked/20230409-143229
        git checkout c94328e3e8b2d2d873503360ea730c87f4a03301
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm64 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm64 SHELL=/bin/bash arch/arm64/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304091707.ALABRVCG-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from include/linux/bitfield.h:10,
                    from arch/arm64/kvm/hyp/pgtable.c:10:
   arch/arm64/kvm/hyp/pgtable.c: In function 'stage2_split_walker':
>> include/linux/container_of.h:20:54: error: 'struct kvm_s2_mmu' has no member named 'split_page_cache'
      20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
         |                                                      ^~
   include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
      78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
         |                                                        ^~~~
   include/linux/container_of.h:20:9: note: in expansion of macro 'static_assert'
      20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
         |         ^~~~~~~~~~~~~
   include/linux/container_of.h:20:23: note: in expansion of macro '__same_type'
      20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
         |                       ^~~~~~~~~~~
   arch/arm64/kvm/hyp/pgtable.c:1340:15: note: in expansion of macro 'container_of'
    1340 |         mmu = container_of(mc, struct kvm_s2_mmu, split_page_cache);
         |               ^~~~~~~~~~~~
   include/linux/compiler_types.h:338:27: error: expression in static assertion is not an integer
     338 | #define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
         |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
      78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
         |                                                        ^~~~
   include/linux/container_of.h:20:9: note: in expansion of macro 'static_assert'
      20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
         |         ^~~~~~~~~~~~~
   include/linux/container_of.h:20:23: note: in expansion of macro '__same_type'
      20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
         |                       ^~~~~~~~~~~
   arch/arm64/kvm/hyp/pgtable.c:1340:15: note: in expansion of macro 'container_of'
    1340 |         mmu = container_of(mc, struct kvm_s2_mmu, split_page_cache);
         |               ^~~~~~~~~~~~
   In file included from include/uapi/linux/posix_types.h:5,
                    from include/uapi/linux/types.h:14,
                    from include/linux/types.h:6,
                    from include/linux/kasan-checks.h:5,
                    from include/asm-generic/rwonce.h:26,
                    from arch/arm64/include/asm/rwonce.h:71,
                    from include/linux/compiler.h:247,
                    from include/linux/build_bug.h:5:
>> include/linux/stddef.h:16:33: error: 'struct kvm_s2_mmu' has no member named 'split_page_cache'
      16 | #define offsetof(TYPE, MEMBER)  __builtin_offsetof(TYPE, MEMBER)
         |                                 ^~~~~~~~~~~~~~~~~~
   include/linux/container_of.h:23:28: note: in expansion of macro 'offsetof'
      23 |         ((type *)(__mptr - offsetof(type, member))); })
         |                            ^~~~~~~~
   arch/arm64/kvm/hyp/pgtable.c:1340:15: note: in expansion of macro 'container_of'
    1340 |         mmu = container_of(mc, struct kvm_s2_mmu, split_page_cache);
         |               ^~~~~~~~~~~~
--
   In file included from include/linux/bitfield.h:10,
                    from arch/arm64/kvm/hyp/nvhe/../pgtable.c:10:
   arch/arm64/kvm/hyp/nvhe/../pgtable.c: In function 'stage2_split_walker':
>> include/linux/container_of.h:20:54: error: 'struct kvm_s2_mmu' has no member named 'split_page_cache'
      20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
         |                                                      ^~
   include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
      78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
         |                                                        ^~~~
   include/linux/container_of.h:20:9: note: in expansion of macro 'static_assert'
      20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
         |         ^~~~~~~~~~~~~
   include/linux/container_of.h:20:23: note: in expansion of macro '__same_type'
      20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
         |                       ^~~~~~~~~~~
   arch/arm64/kvm/hyp/nvhe/../pgtable.c:1340:15: note: in expansion of macro 'container_of'
    1340 |         mmu = container_of(mc, struct kvm_s2_mmu, split_page_cache);
         |               ^~~~~~~~~~~~
   include/linux/compiler_types.h:338:27: error: expression in static assertion is not an integer
     338 | #define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
         |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
      78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
         |                                                        ^~~~
   include/linux/container_of.h:20:9: note: in expansion of macro 'static_assert'
      20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
         |         ^~~~~~~~~~~~~
   include/linux/container_of.h:20:23: note: in expansion of macro '__same_type'
      20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
         |                       ^~~~~~~~~~~
   arch/arm64/kvm/hyp/nvhe/../pgtable.c:1340:15: note: in expansion of macro 'container_of'
    1340 |         mmu = container_of(mc, struct kvm_s2_mmu, split_page_cache);
         |               ^~~~~~~~~~~~
   In file included from include/uapi/linux/posix_types.h:5,
                    from include/uapi/linux/types.h:14,
                    from include/linux/types.h:6,
                    from include/linux/kasan-checks.h:5,
                    from include/asm-generic/rwonce.h:26,
                    from arch/arm64/include/asm/rwonce.h:71,
                    from include/linux/compiler.h:247,
                    from include/linux/build_bug.h:5:
>> include/linux/stddef.h:16:33: error: 'struct kvm_s2_mmu' has no member named 'split_page_cache'
      16 | #define offsetof(TYPE, MEMBER)  __builtin_offsetof(TYPE, MEMBER)
         |                                 ^~~~~~~~~~~~~~~~~~
   include/linux/container_of.h:23:28: note: in expansion of macro 'offsetof'
      23 |         ((type *)(__mptr - offsetof(type, member))); })
         |                            ^~~~~~~~
   arch/arm64/kvm/hyp/nvhe/../pgtable.c:1340:15: note: in expansion of macro 'container_of'
    1340 |         mmu = container_of(mc, struct kvm_s2_mmu, split_page_cache);
         |               ^~~~~~~~~~~~


vim +20 include/linux/container_of.h

d2a8ebbf8192b8 Andy Shevchenko  2021-11-08   9  
d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  10  /**
d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  11   * container_of - cast a member of a structure out to the containing structure
d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  12   * @ptr:	the pointer to the member.
d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  13   * @type:	the type of the container struct this is embedded in.
d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  14   * @member:	the name of the member within the struct.
d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  15   *
7376e561fd2e01 Sakari Ailus     2022-10-24  16   * WARNING: any const qualifier of @ptr is lost.
d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  17   */
d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  18  #define container_of(ptr, type, member) ({				\
d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  19  	void *__mptr = (void *)(ptr);					\
e1edc277e6f6df Rasmus Villemoes 2021-11-08 @20  	static_assert(__same_type(*(ptr), ((type *)0)->member) ||	\
e1edc277e6f6df Rasmus Villemoes 2021-11-08  21  		      __same_type(*(ptr), void),			\
d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  22  		      "pointer type mismatch in container_of()");	\
d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  23  	((type *)(__mptr - offsetof(type, member))); })
d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  24  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()
  2023-04-09  9:36   ` kernel test robot
@ 2023-04-10 17:40     ` Ricardo Koller
  0 siblings, 0 replies; 40+ messages in thread
From: Ricardo Koller @ 2023-04-10 17:40 UTC (permalink / raw)
  To: kernel test robot
  Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, oe-kbuild-all, kvm,
	kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
	rananta, bgardon, ricarkol, Shaoqin Huang

On Sun, Apr 09, 2023 at 05:36:02PM +0800, kernel test robot wrote:
> Hi Ricardo,
> 
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on kvm/queue]
> [also build test ERROR on mst-vhost/linux-next linus/master v6.3-rc5 next-20230406]
> [cannot apply to kvmarm/next kvm/linux-next]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Ricardo-Koller/KVM-arm64-Rename-free_removed-to-free_unlinked/20230409-143229
> base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
> patch link:    https://lore.kernel.org/r/20230409063000.3559991-6-ricarkol%40google.com
> patch subject: [PATCH v7 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()
> config: arm64-defconfig (https://download.01.org/0day-ci/archive/20230409/202304091707.ALABRVCG-lkp@intel.com/config)
> compiler: aarch64-linux-gcc (GCC) 12.1.0
> reproduce (this is a W=1 build):
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # https://github.com/intel-lab-lkp/linux/commit/c94328e3e8b2d2d873503360ea730c87f4a03301
>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>         git fetch --no-tags linux-review Ricardo-Koller/KVM-arm64-Rename-free_removed-to-free_unlinked/20230409-143229
>         git checkout c94328e3e8b2d2d873503360ea730c87f4a03301
>         # save the config file
>         mkdir build_dir && cp config build_dir/.config
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm64 olddefconfig
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm64 SHELL=/bin/bash arch/arm64/
> 
> If you fix the issue, kindly add following tag where applicable
> | Reported-by: kernel test robot <lkp@intel.com>
> | Link: https://lore.kernel.org/oe-kbuild-all/202304091707.ALABRVCG-lkp@intel.com/
> 
> All errors (new ones prefixed by >>):
> 
>    In file included from include/linux/bitfield.h:10,
>                     from arch/arm64/kvm/hyp/pgtable.c:10:
>    arch/arm64/kvm/hyp/pgtable.c: In function 'stage2_split_walker':
> >> include/linux/container_of.h:20:54: error: 'struct kvm_s2_mmu' has no member named 'split_page_cache'
>       20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
>          |                                                      ^~
>    include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
>       78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
>          |                                                        ^~~~
>    include/linux/container_of.h:20:9: note: in expansion of macro 'static_assert'
>       20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
>          |         ^~~~~~~~~~~~~
>    include/linux/container_of.h:20:23: note: in expansion of macro '__same_type'
>       20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
>          |                       ^~~~~~~~~~~
>    arch/arm64/kvm/hyp/pgtable.c:1340:15: note: in expansion of macro 'container_of'
>     1340 |         mmu = container_of(mc, struct kvm_s2_mmu, split_page_cache);
>          |               ^~~~~~~~~~~~
>    include/linux/compiler_types.h:338:27: error: expression in static assertion is not an integer
>      338 | #define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
>          |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
>       78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
>          |                                                        ^~~~
>    include/linux/container_of.h:20:9: note: in expansion of macro 'static_assert'
>       20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
>          |         ^~~~~~~~~~~~~
>    include/linux/container_of.h:20:23: note: in expansion of macro '__same_type'
>       20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
>          |                       ^~~~~~~~~~~
>    arch/arm64/kvm/hyp/pgtable.c:1340:15: note: in expansion of macro 'container_of'
>     1340 |         mmu = container_of(mc, struct kvm_s2_mmu, split_page_cache);
>          |               ^~~~~~~~~~~~
>    In file included from include/uapi/linux/posix_types.h:5,
>                     from include/uapi/linux/types.h:14,
>                     from include/linux/types.h:6,
>                     from include/linux/kasan-checks.h:5,
>                     from include/asm-generic/rwonce.h:26,
>                     from arch/arm64/include/asm/rwonce.h:71,
>                     from include/linux/compiler.h:247,
>                     from include/linux/build_bug.h:5:
> >> include/linux/stddef.h:16:33: error: 'struct kvm_s2_mmu' has no member named 'split_page_cache'
>       16 | #define offsetof(TYPE, MEMBER)  __builtin_offsetof(TYPE, MEMBER)
>          |                                 ^~~~~~~~~~~~~~~~~~
>    include/linux/container_of.h:23:28: note: in expansion of macro 'offsetof'
>       23 |         ((type *)(__mptr - offsetof(type, member))); })
>          |                            ^~~~~~~~
>    arch/arm64/kvm/hyp/pgtable.c:1340:15: note: in expansion of macro 'container_of'
>     1340 |         mmu = container_of(mc, struct kvm_s2_mmu, split_page_cache);
>          |               ^~~~~~~~~~~~
> --
>    In file included from include/linux/bitfield.h:10,
>                     from arch/arm64/kvm/hyp/nvhe/../pgtable.c:10:
>    arch/arm64/kvm/hyp/nvhe/../pgtable.c: In function 'stage2_split_walker':
> >> include/linux/container_of.h:20:54: error: 'struct kvm_s2_mmu' has no member named 'split_page_cache'
>       20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
>          |                                                      ^~
>    include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
>       78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
>          |                                                        ^~~~
>    include/linux/container_of.h:20:9: note: in expansion of macro 'static_assert'
>       20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
>          |         ^~~~~~~~~~~~~
>    include/linux/container_of.h:20:23: note: in expansion of macro '__same_type'
>       20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
>          |                       ^~~~~~~~~~~
>    arch/arm64/kvm/hyp/nvhe/../pgtable.c:1340:15: note: in expansion of macro 'container_of'
>     1340 |         mmu = container_of(mc, struct kvm_s2_mmu, split_page_cache);
>          |               ^~~~~~~~~~~~
>    include/linux/compiler_types.h:338:27: error: expression in static assertion is not an integer
>      338 | #define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
>          |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
>       78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
>          |                                                        ^~~~
>    include/linux/container_of.h:20:9: note: in expansion of macro 'static_assert'
>       20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
>          |         ^~~~~~~~~~~~~
>    include/linux/container_of.h:20:23: note: in expansion of macro '__same_type'
>       20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
>          |                       ^~~~~~~~~~~
>    arch/arm64/kvm/hyp/nvhe/../pgtable.c:1340:15: note: in expansion of macro 'container_of'
>     1340 |         mmu = container_of(mc, struct kvm_s2_mmu, split_page_cache);
>          |               ^~~~~~~~~~~~
>    In file included from include/uapi/linux/posix_types.h:5,
>                     from include/uapi/linux/types.h:14,
>                     from include/linux/types.h:6,
>                     from include/linux/kasan-checks.h:5,
>                     from include/asm-generic/rwonce.h:26,
>                     from arch/arm64/include/asm/rwonce.h:71,
>                     from include/linux/compiler.h:247,
>                     from include/linux/build_bug.h:5:
> >> include/linux/stddef.h:16:33: error: 'struct kvm_s2_mmu' has no member named 'split_page_cache'
>       16 | #define offsetof(TYPE, MEMBER)  __builtin_offsetof(TYPE, MEMBER)
>          |                                 ^~~~~~~~~~~~~~~~~~
>    include/linux/container_of.h:23:28: note: in expansion of macro 'offsetof'
>       23 |         ((type *)(__mptr - offsetof(type, member))); })
>          |                            ^~~~~~~~
>    arch/arm64/kvm/hyp/nvhe/../pgtable.c:1340:15: note: in expansion of macro 'container_of'
>     1340 |         mmu = container_of(mc, struct kvm_s2_mmu, split_page_cache);
>          |               ^~~~~~~~~~~~
> 
> 
> vim +20 include/linux/container_of.h
> 
> d2a8ebbf8192b8 Andy Shevchenko  2021-11-08   9  
> d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  10  /**
> d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  11   * container_of - cast a member of a structure out to the containing structure
> d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  12   * @ptr:	the pointer to the member.
> d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  13   * @type:	the type of the container struct this is embedded in.
> d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  14   * @member:	the name of the member within the struct.
> d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  15   *
> 7376e561fd2e01 Sakari Ailus     2022-10-24  16   * WARNING: any const qualifier of @ptr is lost.
> d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  17   */
> d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  18  #define container_of(ptr, type, member) ({				\
> d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  19  	void *__mptr = (void *)(ptr);					\
> e1edc277e6f6df Rasmus Villemoes 2021-11-08 @20  	static_assert(__same_type(*(ptr), ((type *)0)->member) ||	\
> e1edc277e6f6df Rasmus Villemoes 2021-11-08  21  		      __same_type(*(ptr), void),			\
> d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  22  		      "pointer type mismatch in container_of()");	\
> d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  23  	((type *)(__mptr - offsetof(type, member))); })
> d2a8ebbf8192b8 Andy Shevchenko  2021-11-08  24  
> 
> -- 
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests
> 

Hi,

The fix is to move the commit introducing KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
right before this one, like this:

	KVM: arm64: Rename free_removed to free_unlinked
	KVM: arm64: Add KVM_PGTABLE_WALK flags for skipping CMOs and BBM TLBIs
	KVM: arm64: Add helper for creating unlinked stage2 subtrees
KVM: arm64: Export kvm_are_all_memslots_empty()
KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
	KVM: arm64: Add kvm_pgtable_stage2_split()
	KVM: arm64: Refactor kvm_arch_commit_memory_region()
	KVM: arm64: Add kvm_uninit_stage2_mmu()
	KVM: arm64: Split huge pages when dirty logging is enabled
	KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked()
	KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG
	KVM: arm64: Use local TLBI on permission relaxation

Thanks,
Ricardo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 01/12] KVM: arm64: Rename free_removed to free_unlinked
  2023-04-09  6:29 ` [PATCH v7 01/12] KVM: arm64: Rename free_removed to free_unlinked Ricardo Koller
@ 2023-04-17  6:08   ` Gavin Shan
  0 siblings, 0 replies; 40+ messages in thread
From: Gavin Shan @ 2023-04-17  6:08 UTC (permalink / raw)
  To: Ricardo Koller, pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, reijiw, rananta,
	bgardon, ricarkol, Oliver Upton, Shaoqin Huang

On 4/9/23 2:29 PM, Ricardo Koller wrote:
> Normalize on referring to tables outside of an active paging structure
> as 'unlinked'.
> 
> A subsequent change to KVM will add support for building page tables
> that are not part of an active paging structure. The existing
> 'removed_table' terminology is quite clunky when applied in this
> context.
> 
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
> Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
> ---
>   arch/arm64/include/asm/kvm_pgtable.h  |  8 ++++----
>   arch/arm64/kvm/hyp/nvhe/mem_protect.c |  6 +++---
>   arch/arm64/kvm/hyp/pgtable.c          |  6 +++---
>   arch/arm64/kvm/mmu.c                  | 10 +++++-----
>   4 files changed, 15 insertions(+), 15 deletions(-)
> 

Reviewed-by: Gavin Shan <gshan@redhat.com>

> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 4cd6762bda805..26a4293726c14 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -104,7 +104,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
>    *				allocation is physically contiguous.
>    * @free_pages_exact:		Free an exact number of memory pages previously
>    *				allocated by zalloc_pages_exact.
> - * @free_removed_table:		Free a removed paging structure by unlinking and
> + * @free_unlinked_table:	Free an unlinked paging structure by unlinking and
>    *				dropping references.
>    * @get_page:			Increment the refcount on a page.
>    * @put_page:			Decrement the refcount on a page. When the
> @@ -124,7 +124,7 @@ struct kvm_pgtable_mm_ops {
>   	void*		(*zalloc_page)(void *arg);
>   	void*		(*zalloc_pages_exact)(size_t size);
>   	void		(*free_pages_exact)(void *addr, size_t size);
> -	void		(*free_removed_table)(void *addr, u32 level);
> +	void		(*free_unlinked_table)(void *addr, u32 level);
>   	void		(*get_page)(void *addr);
>   	void		(*put_page)(void *addr);
>   	int		(*page_count)(void *addr);
> @@ -440,7 +440,7 @@ int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
>   void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
>   
>   /**
> - * kvm_pgtable_stage2_free_removed() - Free a removed stage-2 paging structure.
> + * kvm_pgtable_stage2_free_unlinked() - Free an unlinked stage-2 paging structure.
>    * @mm_ops:	Memory management callbacks.
>    * @pgtable:	Unlinked stage-2 paging structure to be freed.
>    * @level:	Level of the stage-2 paging structure to be freed.
> @@ -448,7 +448,7 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
>    * The page-table is assumed to be unreachable by any hardware walkers prior to
>    * freeing and therefore no TLB invalidation is performed.
>    */
> -void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level);
> +void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level);
>   
>   /**
>    * kvm_pgtable_stage2_map() - Install a mapping in a guest stage-2 page-table.
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 552653fa18be3..b030170d803b6 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -91,9 +91,9 @@ static void host_s2_put_page(void *addr)
>   	hyp_put_page(&host_s2_pool, addr);
>   }
>   
> -static void host_s2_free_removed_table(void *addr, u32 level)
> +static void host_s2_free_unlinked_table(void *addr, u32 level)
>   {
> -	kvm_pgtable_stage2_free_removed(&host_mmu.mm_ops, addr, level);
> +	kvm_pgtable_stage2_free_unlinked(&host_mmu.mm_ops, addr, level);
>   }
>   
>   static int prepare_s2_pool(void *pgt_pool_base)
> @@ -110,7 +110,7 @@ static int prepare_s2_pool(void *pgt_pool_base)
>   	host_mmu.mm_ops = (struct kvm_pgtable_mm_ops) {
>   		.zalloc_pages_exact = host_s2_zalloc_pages_exact,
>   		.zalloc_page = host_s2_zalloc_page,
> -		.free_removed_table = host_s2_free_removed_table,
> +		.free_unlinked_table = host_s2_free_unlinked_table,
>   		.phys_to_virt = hyp_phys_to_virt,
>   		.virt_to_phys = hyp_virt_to_phys,
>   		.page_count = hyp_page_count,
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 3d61bd3e591d2..a3246d6cddec7 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -860,7 +860,7 @@ static int stage2_map_walk_table_pre(const struct kvm_pgtable_visit_ctx *ctx,
>   	if (ret)
>   		return ret;
>   
> -	mm_ops->free_removed_table(childp, ctx->level);
> +	mm_ops->free_unlinked_table(childp, ctx->level);
>   	return 0;
>   }
>   
> @@ -905,7 +905,7 @@ static int stage2_map_walk_leaf(const struct kvm_pgtable_visit_ctx *ctx,
>    * The TABLE_PRE callback runs for table entries on the way down, looking
>    * for table entries which we could conceivably replace with a block entry
>    * for this mapping. If it finds one it replaces the entry and calls
> - * kvm_pgtable_mm_ops::free_removed_table() to tear down the detached table.
> + * kvm_pgtable_mm_ops::free_unlinked_table() to tear down the detached table.
>    *
>    * Otherwise, the LEAF callback performs the mapping at the existing leaves
>    * instead.
> @@ -1276,7 +1276,7 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
>   	pgt->pgd = NULL;
>   }
>   
> -void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level)
> +void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level)
>   {
>   	kvm_pteref_t ptep = (kvm_pteref_t)pgtable;
>   	struct kvm_pgtable_walker walker = {
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 7113587222ffe..efdaab3f154de 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -131,21 +131,21 @@ static void kvm_s2_free_pages_exact(void *virt, size_t size)
>   
>   static struct kvm_pgtable_mm_ops kvm_s2_mm_ops;
>   
> -static void stage2_free_removed_table_rcu_cb(struct rcu_head *head)
> +static void stage2_free_unlinked_table_rcu_cb(struct rcu_head *head)
>   {
>   	struct page *page = container_of(head, struct page, rcu_head);
>   	void *pgtable = page_to_virt(page);
>   	u32 level = page_private(page);
>   
> -	kvm_pgtable_stage2_free_removed(&kvm_s2_mm_ops, pgtable, level);
> +	kvm_pgtable_stage2_free_unlinked(&kvm_s2_mm_ops, pgtable, level);
>   }
>   
> -static void stage2_free_removed_table(void *addr, u32 level)
> +static void stage2_free_unlinked_table(void *addr, u32 level)
>   {
>   	struct page *page = virt_to_page(addr);
>   
>   	set_page_private(page, (unsigned long)level);
> -	call_rcu(&page->rcu_head, stage2_free_removed_table_rcu_cb);
> +	call_rcu(&page->rcu_head, stage2_free_unlinked_table_rcu_cb);
>   }
>   
>   static void kvm_host_get_page(void *addr)
> @@ -682,7 +682,7 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
>   	.zalloc_page		= stage2_memcache_zalloc_page,
>   	.zalloc_pages_exact	= kvm_s2_zalloc_pages_exact,
>   	.free_pages_exact	= kvm_s2_free_pages_exact,
> -	.free_removed_table	= stage2_free_removed_table,
> +	.free_unlinked_table	= stage2_free_unlinked_table,
>   	.get_page		= kvm_host_get_page,
>   	.put_page		= kvm_s2_put_page,
>   	.page_count		= kvm_host_page_count,
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 02/12] KVM: arm64: Add KVM_PGTABLE_WALK ctx->flags for skipping BBM and CMO
  2023-04-09  6:29 ` [PATCH v7 02/12] KVM: arm64: Add KVM_PGTABLE_WALK ctx->flags for skipping BBM and CMO Ricardo Koller
@ 2023-04-17  6:10   ` Gavin Shan
  0 siblings, 0 replies; 40+ messages in thread
From: Gavin Shan @ 2023-04-17  6:10 UTC (permalink / raw)
  To: Ricardo Koller, pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, reijiw, rananta,
	bgardon, ricarkol, Shaoqin Huang

On 4/9/23 2:29 PM, Ricardo Koller wrote:
> Add two flags to kvm_pgtable_visit_ctx, KVM_PGTABLE_WALK_SKIP_BBM and
> KVM_PGTABLE_WALK_SKIP_CMO, to indicate that the walk should not
> perform break-before-make (BBM) nor cache maintenance operations
> (CMO). This will be used by a future commit to create unlinked tables
> not accessible to the HW page-table walker.
> 
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
> ---
>   arch/arm64/include/asm/kvm_pgtable.h |  8 ++++++
>   arch/arm64/kvm/hyp/pgtable.c         | 37 +++++++++++++++++++---------
>   2 files changed, 34 insertions(+), 11 deletions(-)
> 

Reviewed-by: Gavin Shan <gshan@redhat.com>

> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 26a4293726c14..3f2d43ba2b628 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -195,6 +195,12 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
>    *					with other software walkers.
>    * @KVM_PGTABLE_WALK_HANDLE_FAULT:	Indicates the page-table walk was
>    *					invoked from a fault handler.
> + * @KVM_PGTABLE_WALK_SKIP_BBM_TLBI:	Visit and update table entries
> + *					without Break-before-make's
> + *					TLB invalidation.
> + * @KVM_PGTABLE_WALK_SKIP_CMO:		Visit and update table entries
> + *					without Cache maintenance
> + *					operations required.
>    */
>   enum kvm_pgtable_walk_flags {
>   	KVM_PGTABLE_WALK_LEAF			= BIT(0),
> @@ -202,6 +208,8 @@ enum kvm_pgtable_walk_flags {
>   	KVM_PGTABLE_WALK_TABLE_POST		= BIT(2),
>   	KVM_PGTABLE_WALK_SHARED			= BIT(3),
>   	KVM_PGTABLE_WALK_HANDLE_FAULT		= BIT(4),
> +	KVM_PGTABLE_WALK_SKIP_BBM_TLBI		= BIT(5),
> +	KVM_PGTABLE_WALK_SKIP_CMO		= BIT(6),
>   };
>   
>   struct kvm_pgtable_visit_ctx {
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index a3246d6cddec7..633679ee3c49a 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -62,6 +62,16 @@ struct kvm_pgtable_walk_data {
>   	u64				end;
>   };
>   
> +static bool kvm_pgtable_walk_skip_bbm_tlbi(const struct kvm_pgtable_visit_ctx *ctx)
> +{
> +	return unlikely(ctx->flags & KVM_PGTABLE_WALK_SKIP_BBM_TLBI);
> +}
> +
> +static bool kvm_pgtable_walk_skip_cmo(const struct kvm_pgtable_visit_ctx *ctx)
> +{
> +	return unlikely(ctx->flags & KVM_PGTABLE_WALK_SKIP_CMO);
> +}
> +
>   static bool kvm_phys_is_valid(u64 phys)
>   {
>   	return phys < BIT(id_aa64mmfr0_parange_to_phys_shift(ID_AA64MMFR0_EL1_PARANGE_MAX));
> @@ -741,14 +751,17 @@ static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx,
>   	if (!stage2_try_set_pte(ctx, KVM_INVALID_PTE_LOCKED))
>   		return false;
>   
> -	/*
> -	 * Perform the appropriate TLB invalidation based on the evicted pte
> -	 * value (if any).
> -	 */
> -	if (kvm_pte_table(ctx->old, ctx->level))
> -		kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
> -	else if (kvm_pte_valid(ctx->old))
> -		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);
> +	if (!kvm_pgtable_walk_skip_bbm_tlbi(ctx)) {
> +		/*
> +		 * Perform the appropriate TLB invalidation based on the
> +		 * evicted pte value (if any).
> +		 */
> +		if (kvm_pte_table(ctx->old, ctx->level))
> +			kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
> +		else if (kvm_pte_valid(ctx->old))
> +			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
> +				     ctx->addr, ctx->level);
> +	}
>   
>   	if (stage2_pte_is_counted(ctx->old))
>   		mm_ops->put_page(ctx->ptep);
> @@ -832,11 +845,13 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
>   		return -EAGAIN;
>   
>   	/* Perform CMOs before installation of the guest stage-2 PTE */
> -	if (mm_ops->dcache_clean_inval_poc && stage2_pte_cacheable(pgt, new))
> +	if (!kvm_pgtable_walk_skip_cmo(ctx) && mm_ops->dcache_clean_inval_poc &&
> +	    stage2_pte_cacheable(pgt, new))
>   		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(new, mm_ops),
> -						granule);
> +					       granule);
>   
> -	if (mm_ops->icache_inval_pou && stage2_pte_executable(new))
> +	if (!kvm_pgtable_walk_skip_cmo(ctx) && mm_ops->icache_inval_pou &&
> +	    stage2_pte_executable(new))
>   		mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule);
>   
>   	stage2_make_pte(ctx, new);
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 02/12] KVM: arm64: Add KVM_PGTABLE_WALK flags for skipping CMOs and BBM TLBIs
  2023-04-09  6:29 ` [PATCH v7 02/12] KVM: arm64: Add KVM_PGTABLE_WALK flags for skipping CMOs and BBM TLBIs Ricardo Koller
@ 2023-04-17  6:13   ` Gavin Shan
  0 siblings, 0 replies; 40+ messages in thread
From: Gavin Shan @ 2023-04-17  6:13 UTC (permalink / raw)
  To: Ricardo Koller, pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, reijiw, rananta,
	bgardon, ricarkol, Shaoqin Huang

On 4/9/23 2:29 PM, Ricardo Koller wrote:
> Add two flags to kvm_pgtable_visit_ctx, KVM_PGTABLE_WALK_SKIP_BBM_TLBI
> and KVM_PGTABLE_WALK_SKIP_CMO, to indicate that the walk should not
> perform TLB invalidations (TLBIs) in break-before-make (BBM) nor cache
> maintenance operations (CMO). This will be used by a future commit to
> create unlinked tables not accessible to the HW page-table walker.
> 
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
> ---
>   arch/arm64/include/asm/kvm_pgtable.h |  8 ++++++
>   arch/arm64/kvm/hyp/pgtable.c         | 37 +++++++++++++++++++---------
>   2 files changed, 34 insertions(+), 11 deletions(-)
> 

This patch seems to have been posted twice, since it was also sent as the following one:

[PATCH v7 02/12] KVM: arm64: Add KVM_PGTABLE_WALK ctx->flags for skipping BBM and CMO

The code changes look good to me:

Reviewed-by: Gavin Shan <gshan@redhat.com>
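
As a side note, the intended use of the two flags together (building tables
that the hardware walker cannot yet see) looks roughly like the sketch below.
This is illustrative only and not part of the patch; stage2_map_walker() and
map_data are borrowed from the later kvm_pgtable_stage2_create_unlinked()
patch:

	struct kvm_pgtable_walker walker = {
		.cb	= stage2_map_walker,
		.flags	= KVM_PGTABLE_WALK_LEAF |
			  KVM_PGTABLE_WALK_SKIP_BBM_TLBI |
			  KVM_PGTABLE_WALK_SKIP_CMO,
		.arg	= &map_data,
	};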

> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 26a4293726c14..3f2d43ba2b628 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -195,6 +195,12 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
>    *					with other software walkers.
>    * @KVM_PGTABLE_WALK_HANDLE_FAULT:	Indicates the page-table walk was
>    *					invoked from a fault handler.
> + * @KVM_PGTABLE_WALK_SKIP_BBM_TLBI:	Visit and update table entries
> + *					without Break-before-make's
> + *					TLB invalidation.
> + * @KVM_PGTABLE_WALK_SKIP_CMO:		Visit and update table entries
> + *					without Cache maintenance
> + *					operations required.
>    */
>   enum kvm_pgtable_walk_flags {
>   	KVM_PGTABLE_WALK_LEAF			= BIT(0),
> @@ -202,6 +208,8 @@ enum kvm_pgtable_walk_flags {
>   	KVM_PGTABLE_WALK_TABLE_POST		= BIT(2),
>   	KVM_PGTABLE_WALK_SHARED			= BIT(3),
>   	KVM_PGTABLE_WALK_HANDLE_FAULT		= BIT(4),
> +	KVM_PGTABLE_WALK_SKIP_BBM_TLBI		= BIT(5),
> +	KVM_PGTABLE_WALK_SKIP_CMO		= BIT(6),
>   };
>   
>   struct kvm_pgtable_visit_ctx {
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index a3246d6cddec7..633679ee3c49a 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -62,6 +62,16 @@ struct kvm_pgtable_walk_data {
>   	u64				end;
>   };
>   
> +static bool kvm_pgtable_walk_skip_bbm_tlbi(const struct kvm_pgtable_visit_ctx *ctx)
> +{
> +	return unlikely(ctx->flags & KVM_PGTABLE_WALK_SKIP_BBM_TLBI);
> +}
> +
> +static bool kvm_pgtable_walk_skip_cmo(const struct kvm_pgtable_visit_ctx *ctx)
> +{
> +	return unlikely(ctx->flags & KVM_PGTABLE_WALK_SKIP_CMO);
> +}
> +
>   static bool kvm_phys_is_valid(u64 phys)
>   {
>   	return phys < BIT(id_aa64mmfr0_parange_to_phys_shift(ID_AA64MMFR0_EL1_PARANGE_MAX));
> @@ -741,14 +751,17 @@ static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx,
>   	if (!stage2_try_set_pte(ctx, KVM_INVALID_PTE_LOCKED))
>   		return false;
>   
> -	/*
> -	 * Perform the appropriate TLB invalidation based on the evicted pte
> -	 * value (if any).
> -	 */
> -	if (kvm_pte_table(ctx->old, ctx->level))
> -		kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
> -	else if (kvm_pte_valid(ctx->old))
> -		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);
> +	if (!kvm_pgtable_walk_skip_bbm_tlbi(ctx)) {
> +		/*
> +		 * Perform the appropriate TLB invalidation based on the
> +		 * evicted pte value (if any).
> +		 */
> +		if (kvm_pte_table(ctx->old, ctx->level))
> +			kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
> +		else if (kvm_pte_valid(ctx->old))
> +			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
> +				     ctx->addr, ctx->level);
> +	}
>   
>   	if (stage2_pte_is_counted(ctx->old))
>   		mm_ops->put_page(ctx->ptep);
> @@ -832,11 +845,13 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
>   		return -EAGAIN;
>   
>   	/* Perform CMOs before installation of the guest stage-2 PTE */
> -	if (mm_ops->dcache_clean_inval_poc && stage2_pte_cacheable(pgt, new))
> +	if (!kvm_pgtable_walk_skip_cmo(ctx) && mm_ops->dcache_clean_inval_poc &&
> +	    stage2_pte_cacheable(pgt, new))
>   		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(new, mm_ops),
> -						granule);
> +					       granule);
>   
> -	if (mm_ops->icache_inval_pou && stage2_pte_executable(new))
> +	if (!kvm_pgtable_walk_skip_cmo(ctx) && mm_ops->icache_inval_pou &&
> +	    stage2_pte_executable(new))
>   		mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule);
>   
>   	stage2_make_pte(ctx, new);
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 03/12] KVM: arm64: Add helper for creating unlinked stage2 subtrees
  2023-04-09  6:29 ` [PATCH v7 03/12] KVM: arm64: Add helper for creating unlinked stage2 subtrees Ricardo Koller
@ 2023-04-17  6:18   ` Gavin Shan
  2023-04-22 20:09     ` Ricardo Koller
  0 siblings, 1 reply; 40+ messages in thread
From: Gavin Shan @ 2023-04-17  6:18 UTC (permalink / raw)
  To: Ricardo Koller, pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, reijiw, rananta,
	bgardon, ricarkol, Shaoqin Huang

On 4/9/23 2:29 PM, Ricardo Koller wrote:
> Add a stage2 helper, kvm_pgtable_stage2_create_unlinked(), for
> creating unlinked tables (which is the opposite of
> kvm_pgtable_stage2_free_unlinked()).  Creating an unlinked table is
> useful for splitting level 1 and 2 entries into subtrees of PAGE_SIZE
> PTEs.  For example, a level 1 entry can be split into PAGE_SIZE PTEs
> by first creating a fully populated tree, and then using it to replace
> the level 1 entry in a single step.  This will be used in a subsequent
> commit for eager huge-page splitting (a dirty-logging optimization).
> 
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
> ---
>   arch/arm64/include/asm/kvm_pgtable.h | 26 +++++++++++++++
>   arch/arm64/kvm/hyp/pgtable.c         | 49 ++++++++++++++++++++++++++++
>   2 files changed, 75 insertions(+)
> 

With the following nits addressed:

Reviewed-by: Gavin Shan <gshan@redhat.com>
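
As a side note, the create-then-replace pattern described in the commit
message looks roughly like the sketch below. This is illustrative only and
not part of the patch; 'pgt', 'phys', 'level', 'prot', 'mc', 'ctx', 'mmu' and
'mm_ops' are assumed to come from the surrounding walker context, as in the
later kvm_pgtable_stage2_split() patch:

	kvm_pte_t *childp;
	kvm_pte_t new;

	/* Build a fully populated subtree the HW walker cannot see yet. */
	childp = kvm_pgtable_stage2_create_unlinked(pgt, phys, level,
						    prot, mc, true);
	if (IS_ERR(childp))
		return PTR_ERR(childp);

	/* Retire the old block entry and install the subtree in one step. */
	if (!stage2_try_break_pte(ctx, mmu)) {
		kvm_pgtable_stage2_free_unlinked(mm_ops, childp, level);
		mm_ops->put_page(childp);
		return -EAGAIN;
	}

	new = kvm_init_table_pte(childp, mm_ops);
	stage2_make_pte(ctx, new);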

> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 3f2d43ba2b628..c8e0e7d9303b2 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -458,6 +458,32 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
>    */
>   void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level);
>   
> +/**
> + * kvm_pgtable_stage2_create_unlinked() - Create an unlinked stage-2 paging structure.
> + * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
> + * @phys:	Physical address of the memory to map.
> + * @level:	Starting level of the stage-2 paging structure to be created.
> + * @prot:	Permissions and attributes for the mapping.
> + * @mc:		Cache of pre-allocated and zeroed memory from which to allocate
                 ^^^^^^^^
Alignment.

> + *		page-table pages.
> + * @force_pte:  Force mappings to PAGE_SIZE granularity.
> + *
> + * Returns an unlinked page-table tree.  This new page-table tree is
> + * not reachable (i.e., it is unlinked) from the root pgd and it's
> + * therefore unreachable by the hardware page-table walker. No TLB
> + * invalidation or CMOs are performed.
> + *
> + * If device attributes are not explicitly requested in @prot, then the
> + * mapping will be normal, cacheable.
> + *
> + * Return: The fully populated (unlinked) stage-2 paging structure, or
> + * an ERR_PTR(error) on failure.
> + */
> +kvm_pte_t *kvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt,
> +					      u64 phys, u32 level,
> +					      enum kvm_pgtable_prot prot,
> +					      void *mc, bool force_pte);
> +
>   /**
>    * kvm_pgtable_stage2_map() - Install a mapping in a guest stage-2 page-table.
>    * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 633679ee3c49a..477d2be67d401 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1222,6 +1222,55 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
>   	return kvm_pgtable_walk(pgt, addr, size, &walker);
>   }
>   
> +kvm_pte_t *kvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt,
> +					      u64 phys, u32 level,
> +					      enum kvm_pgtable_prot prot,
> +					      void *mc, bool force_pte)
> +{
> +	struct stage2_map_data map_data = {
> +		.phys		= phys,
> +		.mmu		= pgt->mmu,
> +		.memcache	= mc,
> +		.force_pte	= force_pte,
> +	};
> +	struct kvm_pgtable_walker walker = {
> +		.cb		= stage2_map_walker,
> +		.flags		= KVM_PGTABLE_WALK_LEAF |
> +				  KVM_PGTABLE_WALK_SKIP_BBM_TLBI |
> +				  KVM_PGTABLE_WALK_SKIP_CMO,
> +		.arg		= &map_data,
> +	};
> +	/* .addr (the IPA) is irrelevant for an unlinked table */
> +	struct kvm_pgtable_walk_data data = {
> +		.walker	= &walker,
> +		.addr	= 0,
> +		.end	= kvm_granule_size(level),
> +	};

The comment about '.addr' seems incorrect. The IPA address is still
used to locate the page table entry, so I think it would be something
like below:

	/* The IPA address (.addr) is relative to zero */

> +	struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
> +	kvm_pte_t *pgtable;
> +	int ret;
> +
> +	if (!IS_ALIGNED(phys, kvm_granule_size(level)))
> +		return ERR_PTR(-EINVAL);
> +
> +	ret = stage2_set_prot_attr(pgt, prot, &map_data.attr);
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	pgtable = mm_ops->zalloc_page(mc);
> +	if (!pgtable)
> +		return ERR_PTR(-ENOMEM);
> +
> +	ret = __kvm_pgtable_walk(&data, mm_ops, (kvm_pteref_t)pgtable,
> +				 level + 1);
> +	if (ret) {
> +		kvm_pgtable_stage2_free_unlinked(mm_ops, pgtable, level);
> +		mm_ops->put_page(pgtable);
> +		return ERR_PTR(ret);
> +	}
> +
> +	return pgtable;
> +}
>   
>   int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
>   			      struct kvm_pgtable_mm_ops *mm_ops,
> 

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()
  2023-04-09  6:29 ` [PATCH v7 04/12] KVM: arm64: Add kvm_pgtable_stage2_split() Ricardo Koller
  2023-04-09  9:36   ` kernel test robot
@ 2023-04-17  6:38   ` Gavin Shan
  2023-04-22 20:32     ` Ricardo Koller
  1 sibling, 1 reply; 40+ messages in thread
From: Gavin Shan @ 2023-04-17  6:38 UTC (permalink / raw)
  To: Ricardo Koller, pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, reijiw, rananta,
	bgardon, ricarkol, Shaoqin Huang

On 4/9/23 2:29 PM, Ricardo Koller wrote:
> Add a new stage2 function, kvm_pgtable_stage2_split(), for splitting a
> range of huge pages. This will be used for eager-splitting huge pages
> into PAGE_SIZE pages. The goal is to avoid having to split huge pages
> on write-protection faults, and instead use this function to do it
> ahead of time for large ranges (e.g., all guest memory in 1G chunks at
> a time).
> 
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
> ---
>   arch/arm64/include/asm/kvm_pgtable.h |  19 +++++
>   arch/arm64/kvm/hyp/pgtable.c         | 103 +++++++++++++++++++++++++++
>   2 files changed, 122 insertions(+)
> 

With the following nits addressed:

Reviewed-by: Gavin Shan <gshan@redhat.com>
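
For context, the intended calling pattern (splitting a large range one chunk
at a time, e.g. all guest memory in 1GB chunks as the commit message
suggests) looks roughly like the sketch below. This is illustrative only;
'pgt', 'start', 'end', 'chunk_size' and 'cache' are assumed to be set up by
the caller, much as the later kvm_mmu_split_huge_pages() patch does:

	u64 addr, next;
	int ret = 0;

	for (addr = start; addr < end; addr = next) {
		next = min(ALIGN_DOWN(addr + chunk_size, chunk_size), end);
		ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache);
		if (ret)
			break;
	}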

> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index c8e0e7d9303b2..32e5d42bf020f 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -653,6 +653,25 @@ bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr);
>    */
>   int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);
>   
> +/**
> + * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
> + *				to PAGE_SIZE guest pages.
> + * @pgt:	 Page-table structure initialised by kvm_pgtable_stage2_init().
> + * @addr:	 Intermediate physical address from which to split.
> + * @size:	 Size of the range.
> + * @mc:		 Cache of pre-allocated and zeroed memory from which to allocate
                  ^^^^^^^^
Alignment.

> + *		 page-table pages.
> + *
> + * The function tries to split any level 1 or 2 entry that overlaps
> + * with the input range (given by @addr and @size).
> + *
> + * Return: 0 on success, negative error code on failure. Note that
> + * kvm_pgtable_stage2_split() is best effort: it tries to break as many
> + * blocks in the input range as allowed by the capacity of @mc.
> + */
> +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
> +			     struct kvm_mmu_memory_cache *mc);
> +
>   /**
>    * kvm_pgtable_walk() - Walk a page-table.
>    * @pgt:	Page-table structure initialised by kvm_pgtable_*_init().
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 477d2be67d401..48c5a95c6e8cd 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1272,6 +1272,109 @@ kvm_pte_t *kvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt,
>   	return pgtable;
>   }
>   
> +/*
> + * Get the number of page-tables needed to replace a block with a
> + * fully populated tree up to the PTE entries. Note that @level is
> + * interpreted as in "level @level entry".
> + */
> +static int stage2_block_get_nr_page_tables(u32 level)
> +{
> +	switch (level) {
> +	case 1:
> +		return PTRS_PER_PTE + 1;
> +	case 2:
> +		return 1;
> +	case 3:
> +		return 0;
> +	default:
> +		WARN_ON_ONCE(level < KVM_PGTABLE_MIN_BLOCK_LEVEL ||
> +			     level >= KVM_PGTABLE_MAX_LEVELS);
> +		return -EINVAL;
> +	};
> +}
> +

When the starting level is 3, it's not a block mapping, if I'm correct. Besides,
the caller (stage2_split_walker()) bails out when the starting level is 3. In that
case, the changes could be integrated into stage2_split_walker(), which is the only
caller; otherwise, it's worth marking the helper 'inline'.

	nr_pages = (kvm_granule_shift(level) == PUD_SHIFT &&
		    kvm_granule_shift(level) != PMD_SHIFT) ?
		   (PTRS_PER_PTE + 1) : 1;

> +static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
> +			       enum kvm_pgtable_walk_flags visit)
> +{
> +	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> +	struct kvm_mmu_memory_cache *mc = ctx->arg;
> +	struct kvm_s2_mmu *mmu;
> +	kvm_pte_t pte = ctx->old, new, *childp;
> +	enum kvm_pgtable_prot prot;
> +	u32 level = ctx->level;
> +	bool force_pte;
> +	int nr_pages;
> +	u64 phys;
> +
> +	/* No huge-pages exist at the last level */
> +	if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> +		return 0;
> +
> +	/* We only split valid block mappings */
> +	if (!kvm_pte_valid(pte))
> +		return 0;
> +
> +	nr_pages = stage2_block_get_nr_page_tables(level);
> +	if (nr_pages < 0)
> +		return nr_pages;
> +
> +	if (mc->nobjs >= nr_pages) {
> +		/* Build a tree mapped down to the PTE granularity. */
> +		force_pte = true;
> +	} else {
> +		/*
> +		 * Don't force PTEs, so create_unlinked() below does
> +		 * not populate the tree up to the PTE level. The
> +		 * consequence is that the call will require a single
> +		 * page of level 2 entries at level 1, or a single
> +		 * page of PTEs at level 2. If we are at level 1, the
> +		 * PTEs will be created recursively.
> +		 */
> +		force_pte = false;
> +		nr_pages = 1;
> +	}
> +
> +	if (mc->nobjs < nr_pages)
> +		return -ENOMEM;
> +
> +	mmu = container_of(mc, struct kvm_s2_mmu, split_page_cache);
> +	phys = kvm_pte_to_phys(pte);
> +	prot = kvm_pgtable_stage2_pte_prot(pte);
> +
> +	childp = kvm_pgtable_stage2_create_unlinked(mmu->pgt, phys,
> +						    level, prot, mc, force_pte);
> +	if (IS_ERR(childp))
> +		return PTR_ERR(childp);
> +
> +	if (!stage2_try_break_pte(ctx, mmu)) {
> +		kvm_pgtable_stage2_free_unlinked(mm_ops, childp, level);
> +		mm_ops->put_page(childp);
> +		return -EAGAIN;
> +	}
> +
> +	/*
> +	 * Note, the contents of the page table are guaranteed to be made
> +	 * visible before the new PTE is assigned because stage2_make_pte()
> +	 * writes the PTE using smp_store_release().
> +	 */
> +	new = kvm_init_table_pte(childp, mm_ops);
> +	stage2_make_pte(ctx, new);
> +	dsb(ishst);
> +	return 0;
> +}
> +
> +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
> +			     struct kvm_mmu_memory_cache *mc)
> +{
> +	struct kvm_pgtable_walker walker = {
> +		.cb	= stage2_split_walker,
> +		.flags	= KVM_PGTABLE_WALK_LEAF,
> +		.arg	= mc,
> +	};
> +
> +	return kvm_pgtable_walk(pgt, addr, size, &walker);
> +}
> +
>   int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
>   			      struct kvm_pgtable_mm_ops *mm_ops,
>   			      enum kvm_pgtable_stage2_flags flags,
> 

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 05/12] KVM: arm64: Refactor kvm_arch_commit_memory_region()
  2023-04-09  6:29 ` [PATCH v7 05/12] KVM: arm64: Refactor kvm_arch_commit_memory_region() Ricardo Koller
@ 2023-04-17  6:41   ` Gavin Shan
  2023-04-23 19:47     ` Ricardo Koller
  2023-04-17  6:42   ` Gavin Shan
  1 sibling, 1 reply; 40+ messages in thread
From: Gavin Shan @ 2023-04-17  6:41 UTC (permalink / raw)
  To: Ricardo Koller, pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, reijiw, rananta,
	bgardon, ricarkol, Shaoqin Huang

On 4/9/23 2:29 PM, Ricardo Koller wrote:
> Refactor kvm_arch_commit_memory_region() as a preparation for a future
> commit, so that it looks cleaner and more understandable. Also, it looks
> more like its x86 counterpart (in kvm_mmu_slot_apply_flags()).
> 
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
> ---
>   arch/arm64/kvm/mmu.c | 15 +++++++++++----
>   1 file changed, 11 insertions(+), 4 deletions(-)
> 

With the following nits addressed:

Reviewed-by: Gavin Shan <gshan@redhat.com>

> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index efdaab3f154de..37d7d2aa472ab 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1761,20 +1761,27 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>   				   const struct kvm_memory_slot *new,
>   				   enum kvm_mr_change change)
>   {
> +	bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
> +
>   	/*
>   	 * At this point memslot has been committed and there is an
>   	 * allocated dirty_bitmap[], dirty pages will be tracked while the
>   	 * memory slot is write protected.
>   	 */
> -	if (change != KVM_MR_DELETE && new->flags & KVM_MEM_LOG_DIRTY_PAGES) {
> +	if (log_dirty_pages) {
> +
> +		if (change == KVM_MR_DELETE)
> +			return;
> +
>   		/*
>   		 * If we're with initial-all-set, we don't need to write
>   		 * protect any pages because they're all reported as dirty.
>   		 * Huge pages and normal pages will be write protect gradually.
>   		 */

The comment needs to be adjusted after this series is applied. The huge pages
won't be write protected gradually. Instead, the huge pages will be split and
write protected in one shot.
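
For example (illustrative wording only), the adjusted comment could read:

		/*
		 * If we're with initial-all-set, we don't need to write
		 * protect any pages because they're all reported as dirty.
		 * Pages will be write protected (and huge pages split)
		 * when userspace clears the dirty log.
		 */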

> -		if (!kvm_dirty_log_manual_protect_and_init_set(kvm)) {
> -			kvm_mmu_wp_memory_region(kvm, new->id);
> -		}
> +		if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> +			return;
> +
> +		kvm_mmu_wp_memory_region(kvm, new->id);
>   	}
>   }
>   
> 

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 05/12] KVM: arm64: Refactor kvm_arch_commit_memory_region()
  2023-04-09  6:29 ` [PATCH v7 05/12] KVM: arm64: Refactor kvm_arch_commit_memory_region() Ricardo Koller
  2023-04-17  6:41   ` Gavin Shan
@ 2023-04-17  6:42   ` Gavin Shan
  1 sibling, 0 replies; 40+ messages in thread
From: Gavin Shan @ 2023-04-17  6:42 UTC (permalink / raw)
  To: Ricardo Koller, pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, reijiw, rananta,
	bgardon, ricarkol, Shaoqin Huang

On 4/9/23 2:29 PM, Ricardo Koller wrote:
> Refactor kvm_arch_commit_memory_region() as a preparation for a future
> commit, so that it looks cleaner and more understandable. Also, it looks
> more like its x86 counterpart (in kvm_mmu_slot_apply_flags()).
> 
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
> ---
>   arch/arm64/kvm/mmu.c | 15 +++++++++++----
>   1 file changed, 11 insertions(+), 4 deletions(-)
> 

Reviewed-by: Gavin Shan <gshan@redhat.com>

> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index efdaab3f154de..37d7d2aa472ab 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1761,20 +1761,27 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>   				   const struct kvm_memory_slot *new,
>   				   enum kvm_mr_change change)
>   {
> +	bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
> +
>   	/*
>   	 * At this point memslot has been committed and there is an
>   	 * allocated dirty_bitmap[], dirty pages will be tracked while the
>   	 * memory slot is write protected.
>   	 */
> -	if (change != KVM_MR_DELETE && new->flags & KVM_MEM_LOG_DIRTY_PAGES) {
> +	if (log_dirty_pages) {
> +
> +		if (change == KVM_MR_DELETE)
> +			return;
> +
>   		/*
>   		 * If we're with initial-all-set, we don't need to write
>   		 * protect any pages because they're all reported as dirty.
>   		 * Huge pages and normal pages will be write protect gradually.
>   		 */
> -		if (!kvm_dirty_log_manual_protect_and_init_set(kvm)) {
> -			kvm_mmu_wp_memory_region(kvm, new->id);
> -		}
> +		if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> +			return;
> +
> +		kvm_mmu_wp_memory_region(kvm, new->id);
>   	}
>   }
>   
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 06/12] KVM: arm64: Add kvm_uninit_stage2_mmu()
  2023-04-09  6:29 ` [PATCH v7 06/12] KVM: arm64: Add kvm_uninit_stage2_mmu() Ricardo Koller
@ 2023-04-17  6:44   ` Gavin Shan
  0 siblings, 0 replies; 40+ messages in thread
From: Gavin Shan @ 2023-04-17  6:44 UTC (permalink / raw)
  To: Ricardo Koller, pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, reijiw, rananta,
	bgardon, ricarkol, Shaoqin Huang

On 4/9/23 2:29 PM, Ricardo Koller wrote:
> Add kvm_uninit_stage2_mmu() and move kvm_free_stage2_pgd() into it. A
> future commit will add some more things to do inside of
> kvm_uninit_stage2_mmu().
> 
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
> ---
>   arch/arm64/include/asm/kvm_mmu.h | 1 +
>   arch/arm64/kvm/mmu.c             | 7 ++++++-
>   2 files changed, 7 insertions(+), 1 deletion(-)
> 

Reviewed-by: Gavin Shan <gshan@redhat.com>

> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 083cc47dca086..7d173da5bd51c 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -168,6 +168,7 @@ void __init free_hyp_pgds(void);
>   
>   void stage2_unmap_vm(struct kvm *kvm);
>   int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long type);
> +void kvm_uninit_stage2_mmu(struct kvm *kvm);
>   void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu);
>   int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>   			  phys_addr_t pa, unsigned long size, bool writable);
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 37d7d2aa472ab..a2800e5c42712 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -767,6 +767,11 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>   	return err;
>   }
>   
> +void kvm_uninit_stage2_mmu(struct kvm *kvm)
> +{
> +	kvm_free_stage2_pgd(&kvm->arch.mmu);
> +}
> +
>   static void stage2_unmap_memslot(struct kvm *kvm,
>   				 struct kvm_memory_slot *memslot)
>   {
> @@ -1855,7 +1860,7 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
>   
>   void kvm_arch_flush_shadow_all(struct kvm *kvm)
>   {
> -	kvm_free_stage2_pgd(&kvm->arch.mmu);
> +	kvm_uninit_stage2_mmu(kvm);
>   }
>   
>   void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 07/12] KVM: arm64: Export kvm_are_all_memslots_empty()
  2023-04-09  6:29 ` [PATCH v7 07/12] KVM: arm64: Export kvm_are_all_memslots_empty() Ricardo Koller
@ 2023-04-17  6:47   ` Gavin Shan
  0 siblings, 0 replies; 40+ messages in thread
From: Gavin Shan @ 2023-04-17  6:47 UTC (permalink / raw)
  To: Ricardo Koller, pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, reijiw, rananta,
	bgardon, ricarkol, Shaoqin Huang

On 4/9/23 2:29 PM, Ricardo Koller wrote:
> Export kvm_are_all_memslots_empty(). This will be used by a future
> commit when checking before setting a capability.
> 
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
> ---
>   include/linux/kvm_host.h | 2 ++
>   virt/kvm/kvm_main.c      | 2 +-
>   2 files changed, 3 insertions(+), 1 deletion(-)
> 

With the following nits addressed:

Reviewed-by: Gavin Shan <gshan@redhat.com>

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8ada23756b0ec..c6fa634f236d9 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -990,6 +990,8 @@ static inline bool kvm_memslots_empty(struct kvm_memslots *slots)
>   	return RB_EMPTY_ROOT(&slots->gfn_tree);
>   }
>   
> +bool kvm_are_all_memslots_empty(struct kvm *kvm);
> +
>   #define kvm_for_each_memslot(memslot, bkt, slots)			      \
>   	hash_for_each(slots->id_hash, bkt, memslot, id_node[slots->node_idx]) \
>   		if (WARN_ON_ONCE(!memslot->npages)) {			      \
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d255964ec331e..897b000787beb 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4596,7 +4596,7 @@ int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>   	return -EINVAL;
>   }
>   
> -static bool kvm_are_all_memslots_empty(struct kvm *kvm)
> +bool kvm_are_all_memslots_empty(struct kvm *kvm)
>   {
>   	int i;
>   
> 

We may need EXPORT_SYMBOL_GPL() to export it, to be consistent with the other
exported APIs. KVM may be built as a standalone module on architectures other
than ARM64.
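
Concretely, that would amount to something like the following (illustrative
only; the function body is unchanged):

	bool kvm_are_all_memslots_empty(struct kvm *kvm)
	{
		...
	}
	EXPORT_SYMBOL_GPL(kvm_are_all_memslots_empty);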

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 08/12] KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
  2023-04-09  6:29 ` [PATCH v7 08/12] KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE Ricardo Koller
@ 2023-04-17  7:04   ` Gavin Shan
  2023-04-23 20:27     ` Ricardo Koller
  0 siblings, 1 reply; 40+ messages in thread
From: Gavin Shan @ 2023-04-17  7:04 UTC (permalink / raw)
  To: Ricardo Koller, pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, reijiw, rananta,
	bgardon, ricarkol, Oliver Upton

On 4/9/23 2:29 PM, Ricardo Koller wrote:
> Add a capability for userspace to specify the eager split chunk size.
> The chunk size specifies how many pages to break at a time, using a
> single allocation. The bigger the chunk size, the more pages need to be
> allocated ahead of time.
> 
> Suggested-by: Oliver Upton <oliver.upton@linux.dev>
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> ---
>   Documentation/virt/kvm/api.rst       | 28 ++++++++++++++++++++++++++
>   arch/arm64/include/asm/kvm_host.h    | 15 ++++++++++++++
>   arch/arm64/include/asm/kvm_pgtable.h | 18 +++++++++++++++++
>   arch/arm64/kvm/arm.c                 | 30 ++++++++++++++++++++++++++++
>   arch/arm64/kvm/mmu.c                 |  3 +++
>   include/uapi/linux/kvm.h             |  2 ++
>   6 files changed, 96 insertions(+)
> 

With the following comments addressed:

Reviewed-by: Gavin Shan <gshan@redhat.com>
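
For context, the userspace flow this documents would look roughly like the
sketch below. This is illustrative only; 'vm_fd' is assumed to be a VM file
descriptor, error handling is omitted, and the chunk size must be one of the
block sizes reported by KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES and be set before
any memslot is created:

	struct kvm_enable_cap cap = {
		.cap	 = KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE,
		.args[0] = 0x200000,	/* e.g. a 2MB chunk */
	};

	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);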

> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 62de0768d6aa5..f8faa80d87057 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -8380,6 +8380,34 @@ structure.
>   When getting the Modified Change Topology Report value, the attr->addr
>   must point to a byte where the value will be stored or retrieved from.
>   
> +8.40 KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
> +---------------------------------------
> +
> +:Capability: KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
> +:Architectures: arm64
> +:Type: vm
> +:Parameters: arg[0] is the new split chunk size.
> +:Returns: 0 on success, -EINVAL if any memslot was already created.
                                                   ^^^^^^^^^^^^^^^^^^^

Maybe s/was already created/has been created

> +
> +This capability sets the chunk size used in Eager Page Splitting.
> +
> +Eager Page Splitting improves the performance of dirty-logging (used
> +in live migrations) when guest memory is backed by huge-pages.  It
> +avoids splitting huge-pages (into PAGE_SIZE pages) on fault, by doing
> +it eagerly when enabling dirty logging (with the
> +KVM_MEM_LOG_DIRTY_PAGES flag for a memory region), or when using
> +KVM_CLEAR_DIRTY_LOG.
> +
> +The chunk size specifies how many pages to break at a time, using a
> +single allocation for each chunk. The bigger the chunk size, the more
> +pages need to be allocated ahead of time.
> +
> +The chunk size needs to be a valid block size. The list of acceptable
> +block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a 64bit
> +bitmap (each bit describing a block size). Setting
> +KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE to 0 disables Eager Page Splitting;
> +this is the default value.
> +

s/a 64bit bitmap/a 64-bit bitmap

For the last sentence, maybe:

The default value is 0, to disable the eager page splitting.

>   9. Known KVM API problems
>   =========================
>   
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index a1892a8f60323..b87da1ebc3454 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -158,6 +158,21 @@ struct kvm_s2_mmu {
>   	/* The last vcpu id that ran on each physical CPU */
>   	int __percpu *last_vcpu_ran;
>   
> +#define KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT 0
> +	/*
> +	 * Memory cache used to split
> +	 * KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE worth of huge pages. It
> +	 * is used to allocate stage2 page tables while splitting huge
> +	 * pages. Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE
> +	 * influences both the capacity of the split page cache, and
> +	 * how often KVM reschedules. Be wary of raising CHUNK_SIZE
> +	 * too high.
> +	 *
> +	 * Protected by kvm->slots_lock.
> +	 */
> +	struct kvm_mmu_memory_cache split_page_cache;
> +	uint64_t split_page_chunk_size;
> +
>   	struct kvm_arch *arch;
>   };
>   
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 32e5d42bf020f..889bd7afeb355 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -92,6 +92,24 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
>   	return level >= KVM_PGTABLE_MIN_BLOCK_LEVEL;
>   }
>   
> +static inline u64 kvm_supported_block_sizes(void)
> +{
> +	u32 level = KVM_PGTABLE_MIN_BLOCK_LEVEL;
> +	u64 res = 0;
> +
> +	for (; level < KVM_PGTABLE_MAX_LEVELS; level++)
> +		res |= BIT(kvm_granule_shift(level));
> +
> +	return res;
> +}
> +

maybe s/@res/@r

> +static inline bool kvm_is_block_size_supported(u64 size)
> +{
> +	bool is_power_of_two = !((size) & ((size)-1));
> +
> +	return is_power_of_two && (size & kvm_supported_block_sizes());
> +}
> +

IS_ALIGNED() may be used here.

>   /**
>    * struct kvm_pgtable_mm_ops - Memory management callbacks.
>    * @zalloc_page:		Allocate a single zeroed memory page.
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 3bd732eaf0872..34fd3c59a9b82 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -67,6 +67,7 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>   			    struct kvm_enable_cap *cap)
>   {
>   	int r;
> +	u64 new_cap;
>   
>   	if (cap->flags)
>   		return -EINVAL;
> @@ -91,6 +92,26 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>   		r = 0;
>   		set_bit(KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED, &kvm->arch.flags);
>   		break;
> +	case KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE:
> +		new_cap = cap->args[0];
> +
> +		mutex_lock(&kvm->lock);
> +		mutex_lock(&kvm->slots_lock);
> +		/*
> +		 * To keep things simple, allow changing the chunk
> +		 * size only if there are no memslots already created.
> +		 */

		/*
		 * To keep things simple, allow changing the chunk size
		 * only when no memory slots have been created.
		 */

> +		if (!kvm_are_all_memslots_empty(kvm)) {
> +			r = -EINVAL;
> +		} else if (new_cap && !kvm_is_block_size_supported(new_cap)) {
> +			r = -EINVAL;
> +		} else {
> +			r = 0;
> +			kvm->arch.mmu.split_page_chunk_size = new_cap;
> +		}
> +		mutex_unlock(&kvm->slots_lock);
> +		mutex_unlock(&kvm->lock);
> +		break;
>   	default:
>   		r = -EINVAL;
>   		break;
> @@ -288,6 +309,15 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>   	case KVM_CAP_ARM_PTRAUTH_GENERIC:
>   		r = system_has_full_ptr_auth();
>   		break;
> +	case KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE:
> +		if (kvm)
> +			r = kvm->arch.mmu.split_page_chunk_size;
> +		else
> +			r = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
> +		break;
> +	case KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES:
> +		r = kvm_supported_block_sizes();
> +		break;

kvm_supported_block_sizes() returns u64, but @r is 32 bits wide. It may be
worth making the return value of kvm_supported_block_sizes() u32.

>   	default:
>   		r = 0;
>   	}
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index a2800e5c42712..898985b09321a 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -756,6 +756,9 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>   	for_each_possible_cpu(cpu)
>   		*per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;
>  

It may be worth having a comment like below:

	/* The eager page splitting is disabled by default */
  
> +	mmu->split_page_cache.gfp_zero = __GFP_ZERO;
> +	mmu->split_page_chunk_size = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
> +
>   	mmu->pgt = pgt;
>   	mmu->pgd_phys = __pa(pgt->pgd);
>   	return 0;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index d77aef872a0a0..f18b48fcd25ba 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1184,6 +1184,8 @@ struct kvm_ppc_resize_hpt {
>   #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
>   #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
>   #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
> +#define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 227
> +#define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 228
>   
>   #ifdef KVM_CAP_IRQ_ROUTING
>   
> 

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 09/12] KVM: arm64: Split huge pages when dirty logging is enabled
  2023-04-09  6:29 ` [PATCH v7 09/12] KVM: arm64: Split huge pages when dirty logging is enabled Ricardo Koller
@ 2023-04-17  7:11   ` Gavin Shan
  0 siblings, 0 replies; 40+ messages in thread
From: Gavin Shan @ 2023-04-17  7:11 UTC (permalink / raw)
  To: Ricardo Koller, pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, reijiw, rananta,
	bgardon, ricarkol, Shaoqin Huang

On 4/9/23 2:29 PM, Ricardo Koller wrote:
> Split huge pages eagerly when enabling dirty logging. The goal is to
> avoid doing it while faulting on write-protected pages, which
> negatively impacts guest performance.
> 
> A memslot marked for dirty logging is split in 1GB pieces at a time.
> This is in order to release the mmu_lock and give other kernel threads
> the opportunity to run, and also in order to allocate enough pages to
> split a 1GB range worth of huge pages (or a single 1GB huge page).
> Note that these page allocations can fail, so eager page splitting is
> best-effort.  This is not a correctness issue though, as huge pages
> can still be split on write-faults.
> 
> The benefits of eager page splitting are the same as in x86, added
> with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
> the TDP MMU when dirty logging is enabled"). For example, when running
> dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
> 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
> all of their memory after dirty logging is enabled decreased by 44%
> from 2.58s to 1.42s.
> 
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
> ---
>   arch/arm64/kvm/mmu.c | 123 ++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 121 insertions(+), 2 deletions(-)
> 

In the changelog, it may be worth mentioning:

Eager page splitting only takes effect when a huge-page mapping already
exists in the stage-2 page table. Otherwise, the huge page will be mapped
to multiple base pages when the page fault is handled.

The code changes look good to me:

Reviewed-by: Gavin Shan <gshan@redhat.com>
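
As a rough worked example of the cache sizing in the quoted hunk (assuming a
4K granule, where PUD_SIZE is 1GB, PMD_SIZE is 2MB and
KVM_PGTABLE_MIN_BLOCK_LEVEL is 1): splitting a 1GB chunk needs at most
DIV_ROUND_UP(SZ_1G, PUD_SIZE) + DIV_ROUND_UP(SZ_1G, PMD_SIZE) = 1 + 512 = 513
page-table pages, which is the capacity the split cache is topped up to.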

> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 898985b09321a..aaefabd8de89d 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -31,14 +31,21 @@ static phys_addr_t __ro_after_init hyp_idmap_vector;
>   
>   static unsigned long __ro_after_init io_map_base;
>   
> -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> +static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
> +					   phys_addr_t size)
>   {
> -	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
>   	phys_addr_t boundary = ALIGN_DOWN(addr + size, size);
>   
>   	return (boundary - 1 < end - 1) ? boundary : end;
>   }
>   
> +static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> +{
> +	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> +
> +	return __stage2_range_addr_end(addr, end, size);
> +}
> +
>   /*
>    * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
>    * we may see kernel panics with CONFIG_DETECT_HUNG_TASK,
> @@ -75,6 +82,79 @@ static int stage2_apply_range(struct kvm_s2_mmu *mmu, phys_addr_t addr,
>   #define stage2_apply_range_resched(mmu, addr, end, fn)			\
>   	stage2_apply_range(mmu, addr, end, fn, true)
>   
> +/*
> + * Get the maximum number of page-tables pages needed to split a range
> + * of blocks into PAGE_SIZE PTEs. It assumes the range is already
> + * mapped at level 2, or at level 1 if allowed.
> + */
> +static int kvm_mmu_split_nr_page_tables(u64 range)
> +{
> +	int n = 0;
> +
> +	if (KVM_PGTABLE_MIN_BLOCK_LEVEL < 2)
> +		n += DIV_ROUND_UP_ULL(range, PUD_SIZE);
> +	n += DIV_ROUND_UP_ULL(range, PMD_SIZE);
> +	return n;
> +}
> +
> +static bool need_split_memcache_topup_or_resched(struct kvm *kvm)
> +{
> +	struct kvm_mmu_memory_cache *cache;
> +	u64 chunk_size, min;
> +
> +	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> +		return true;
> +
> +	chunk_size = kvm->arch.mmu.split_page_chunk_size;
> +	min = kvm_mmu_split_nr_page_tables(chunk_size);
> +	cache = &kvm->arch.mmu.split_page_cache;
> +	return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
> +}
> +
> +static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
> +				    phys_addr_t end)
> +{
> +	struct kvm_mmu_memory_cache *cache;
> +	struct kvm_pgtable *pgt;
> +	int ret, cache_capacity;
> +	u64 next, chunk_size;
> +
> +	lockdep_assert_held_write(&kvm->mmu_lock);
> +
> +	chunk_size = kvm->arch.mmu.split_page_chunk_size;
> +	cache_capacity = kvm_mmu_split_nr_page_tables(chunk_size);
> +
> +	if (chunk_size == 0)
> +		return 0;
> +
> +	cache = &kvm->arch.mmu.split_page_cache;
> +
> +	do {
> +		if (need_split_memcache_topup_or_resched(kvm)) {
> +			write_unlock(&kvm->mmu_lock);
> +			cond_resched();
> +			/* Eager page splitting is best-effort. */
> +			ret = __kvm_mmu_topup_memory_cache(cache,
> +							   cache_capacity,
> +							   cache_capacity);
> +			write_lock(&kvm->mmu_lock);
> +			if (ret)
> +				break;
> +		}
> +
> +		pgt = kvm->arch.mmu.pgt;
> +		if (!pgt)
> +			return -EINVAL;
> +
> +		next = __stage2_range_addr_end(addr, end, chunk_size);
> +		ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache);
> +		if (ret)
> +			break;
> +	} while (addr = next, addr != end);
> +
> +	return ret;
> +}
> +
>   static bool memslot_is_logging(struct kvm_memory_slot *memslot)
>   {
>   	return memslot->dirty_bitmap && !(memslot->flags & KVM_MEM_READONLY);
> @@ -773,6 +853,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>   void kvm_uninit_stage2_mmu(struct kvm *kvm)
>   {
>   	kvm_free_stage2_pgd(&kvm->arch.mmu);
> +	kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
>   }
>   
>   static void stage2_unmap_memslot(struct kvm *kvm,
> @@ -999,6 +1080,34 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
>   	stage2_wp_range(&kvm->arch.mmu, start, end);
>   }
>   
> +/**
> + * kvm_mmu_split_memory_region() - split the stage 2 blocks into PAGE_SIZE
> + *				   pages for memory slot
> + * @kvm:	The KVM pointer
> + * @slot:	The memory slot to split
> + *
> + * Acquires kvm->mmu_lock. Called with kvm->slots_lock mutex acquired,
> + * serializing operations for VM memory regions.
> + */
> +static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot)
> +{
> +	struct kvm_memslots *slots;
> +	struct kvm_memory_slot *memslot;
> +	phys_addr_t start, end;
> +
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	slots = kvm_memslots(kvm);
> +	memslot = id_to_memslot(slots, slot);
> +
> +	start = memslot->base_gfn << PAGE_SHIFT;
> +	end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
> +
> +	write_lock(&kvm->mmu_lock);
> +	kvm_mmu_split_huge_pages(kvm, start, end);
> +	write_unlock(&kvm->mmu_lock);
> +}
> +
>   /*
>    * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
>    * dirty pages.
> @@ -1790,6 +1899,16 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>   			return;
>   
>   		kvm_mmu_wp_memory_region(kvm, new->id);
> +		kvm_mmu_split_memory_region(kvm, new->id);
> +	} else {
> +		/*
> +		 * Free any leftovers from the eager page splitting cache. Do
> +		 * this when deleting, moving, disabling dirty logging, or
> +		 * creating the memslot (a nop). Doing it for deletes makes
> +		 * sure we don't leak memory, and there's no need to keep the
> +		 * cache around for any of the other cases.
> +		 */
> +		kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
>   	}
>   }
>   
> 

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 10/12] KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked()
  2023-04-09  6:29 ` [PATCH v7 10/12] KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked() Ricardo Koller
@ 2023-04-17  7:14   ` Gavin Shan
  0 siblings, 0 replies; 40+ messages in thread
From: Gavin Shan @ 2023-04-17  7:14 UTC (permalink / raw)
  To: Ricardo Koller, pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, reijiw, rananta,
	bgardon, ricarkol

On 4/9/23 2:29 PM, Ricardo Koller wrote:
> Move the functionality of kvm_mmu_write_protect_pt_masked() into its
> caller, kvm_arch_mmu_enable_log_dirty_pt_masked().  This will be used
> in a subsequent commit in order to share some of the code in
> kvm_arch_mmu_enable_log_dirty_pt_masked().
> 
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> ---
>   arch/arm64/kvm/mmu.c | 42 +++++++++++++++---------------------------
>   1 file changed, 15 insertions(+), 27 deletions(-)
> 

Reviewed-by: Gavin Shan <gshan@redhat.com>

> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index aaefabd8de89d..16fa24f761152 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1058,28 +1058,6 @@ static void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot)
>   	kvm_flush_remote_tlbs(kvm);
>   }
>   
> -/**
> - * kvm_mmu_write_protect_pt_masked() - write protect dirty pages
> - * @kvm:	The KVM pointer
> - * @slot:	The memory slot associated with mask
> - * @gfn_offset:	The gfn offset in memory slot
> - * @mask:	The mask of dirty pages at offset 'gfn_offset' in this memory
> - *		slot to be write protected
> - *
> - * Walks bits set in mask write protects the associated pte's. Caller must
> - * acquire kvm_mmu_lock.
> - */
> -static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> -		struct kvm_memory_slot *slot,
> -		gfn_t gfn_offset, unsigned long mask)
> -{
> -	phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
> -	phys_addr_t start = (base_gfn +  __ffs(mask)) << PAGE_SHIFT;
> -	phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
> -
> -	stage2_wp_range(&kvm->arch.mmu, start, end);
> -}
> -
>   /**
>    * kvm_mmu_split_memory_region() - split the stage 2 blocks into PAGE_SIZE
>    *				   pages for memory slot
> @@ -1109,17 +1087,27 @@ static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot)
>   }
>   
>   /*
> - * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
> - * dirty pages.
> + * kvm_arch_mmu_enable_log_dirty_pt_masked() - enable dirty logging for selected pages.
> + * @kvm:	The KVM pointer
> + * @slot:	The memory slot associated with mask
> + * @gfn_offset:	The gfn offset in memory slot
> + * @mask:	The mask of pages at offset 'gfn_offset' in this memory
> + *		slot to enable dirty logging on
>    *
> - * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
> - * enable dirty logging for them.
> + * Writes protect selected pages to enable dirty logging for them. Caller must
> + * acquire kvm->mmu_lock.
>    */
>   void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>   		struct kvm_memory_slot *slot,
>   		gfn_t gfn_offset, unsigned long mask)
>   {
> -	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
> +	phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
> +	phys_addr_t start = (base_gfn +  __ffs(mask)) << PAGE_SHIFT;
> +	phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
> +
> +	lockdep_assert_held_write(&kvm->mmu_lock);
> +
> +	stage2_wp_range(&kvm->arch.mmu, start, end);
>   }
>   
>   static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 11/12] KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG
  2023-04-09  6:29 ` [PATCH v7 11/12] KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG Ricardo Koller
@ 2023-04-17  7:18   ` Gavin Shan
  0 siblings, 0 replies; 40+ messages in thread
From: Gavin Shan @ 2023-04-17  7:18 UTC (permalink / raw)
  To: Ricardo Koller, pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, reijiw, rananta,
	bgardon, ricarkol

On 4/9/23 2:29 PM, Ricardo Koller wrote:
> This is the arm64 counterpart of commit cb00a70bd4b7 ("KVM: x86/mmu:
> Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG"),
> which has the benefit of splitting the cost of splitting a memslot
> across multiple ioctls.
> 
> Split huge pages on the range specified using KVM_CLEAR_DIRTY_LOG.
> And do not split when enabling dirty logging if
> KVM_DIRTY_LOG_INITIALLY_SET is set.
> 
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> ---
>   arch/arm64/kvm/mmu.c | 15 +++++++++++++--
>   1 file changed, 13 insertions(+), 2 deletions(-)
> 

Reviewed-by: Gavin Shan <gshan@redhat.com>
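
For context, the ioctl that now drives the splitting can be issued on
sub-ranges of a slot, roughly as in the sketch below. This is illustrative
only; 'vm_fd', 'slot', 'first_gfn', 'npages' and 'bitmap' are assumed to be
set up by the VMM, and KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 must be enabled:

	struct kvm_clear_dirty_log clr = {
		.slot		= slot,
		.first_page	= first_gfn,
		.num_pages	= npages,
		.dirty_bitmap	= bitmap,
	};

	ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clr);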

> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 16fa24f761152..50488daab0f4d 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1094,8 +1094,8 @@ static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot)
>    * @mask:	The mask of pages at offset 'gfn_offset' in this memory
>    *		slot to enable dirty logging on
>    *
> - * Writes protect selected pages to enable dirty logging for them. Caller must
> - * acquire kvm->mmu_lock.
> + * Writes protect selected pages to enable dirty logging, and then
> + * splits them to PAGE_SIZE. Caller must acquire kvm->mmu_lock.
>    */
>   void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>   		struct kvm_memory_slot *slot,
> @@ -1108,6 +1108,17 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>   	lockdep_assert_held_write(&kvm->mmu_lock);
>   
>   	stage2_wp_range(&kvm->arch.mmu, start, end);
> +
> +	/*
> +	 * Eager-splitting is done when manual-protect is set.  We
> +	 * also check for initially-all-set because we can avoid
> +	 * eager-splitting if initially-all-set is false.
> +	 * Initially-all-set equal false implies that huge-pages were
> +	 * already split when enabling dirty logging: no need to do it
> +	 * again.
> +	 */
> +	if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> +		kvm_mmu_split_huge_pages(kvm, start, end);
>   }
>  
>   static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 12/12] KVM: arm64: Use local TLBI on permission relaxation
  2023-04-09  6:30 ` [PATCH v7 12/12] KVM: arm64: Use local TLBI on permission relaxation Ricardo Koller
@ 2023-04-17  7:20   ` Gavin Shan
  0 siblings, 0 replies; 40+ messages in thread
From: Gavin Shan @ 2023-04-17  7:20 UTC (permalink / raw)
  To: Ricardo Koller, pbonzini, maz, oupton, yuzenghui, dmatlack
  Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, reijiw, rananta,
	bgardon, ricarkol

On 4/9/23 2:30 PM, Ricardo Koller wrote:
> From: Marc Zyngier <maz@kernel.org>
> 
> Broadcast TLB invalidations (TLBIs) targeting the Inner Shareable
> Domain are usually less performant than their non-shareable variant.
> In particular, we observed some implementations that take
> milliseconds to complete parallel broadcasted TLBIs.
> 
> It's safe to use non-shareable TLBIs when relaxing permissions on a
> PTE in the KVM case.  According to the ARM ARM (0487I.a) section
> D8.13.1 "Using break-before-make when updating translation table
> entries", permission relaxation does not need break-before-make.
> Specifically, R_WHZWS states that these are the only changes that
> require a break-before-make sequence: changes of memory type
> (Shareability or Cacheability), address changes, or changing the block
> size.
> 
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> ---
>   arch/arm64/include/asm/kvm_asm.h   |  4 +++
>   arch/arm64/kvm/hyp/nvhe/hyp-main.c | 10 ++++++
>   arch/arm64/kvm/hyp/nvhe/tlb.c      | 54 ++++++++++++++++++++++++++++++
>   arch/arm64/kvm/hyp/pgtable.c       |  2 +-
>   arch/arm64/kvm/hyp/vhe/tlb.c       | 32 ++++++++++++++++++
>   5 files changed, 101 insertions(+), 1 deletion(-)
> 

Reviewed-by: Gavin Shan <gshan@redhat.com>

> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> index 43c3bc0f9544d..bb17b2ead4c71 100644
> --- a/arch/arm64/include/asm/kvm_asm.h
> +++ b/arch/arm64/include/asm/kvm_asm.h
> @@ -68,6 +68,7 @@ enum __kvm_host_smccc_func {
>   	__KVM_HOST_SMCCC_FUNC___kvm_vcpu_run,
>   	__KVM_HOST_SMCCC_FUNC___kvm_flush_vm_context,
>   	__KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_ipa,
> +	__KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_ipa_nsh,
>   	__KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid,
>   	__KVM_HOST_SMCCC_FUNC___kvm_flush_cpu_context,
>   	__KVM_HOST_SMCCC_FUNC___kvm_timer_set_cntvoff,
> @@ -225,6 +226,9 @@ extern void __kvm_flush_vm_context(void);
>   extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu);
>   extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa,
>   				     int level);
> +extern void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
> +					 phys_addr_t ipa,
> +					 int level);
>   extern void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu);
>   
>   extern void __kvm_timer_set_cntvoff(u64 cntvoff);
> diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> index 728e01d4536b0..c6bf1e49ca934 100644
> --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> @@ -125,6 +125,15 @@ static void handle___kvm_tlb_flush_vmid_ipa(struct kvm_cpu_context *host_ctxt)
>   	__kvm_tlb_flush_vmid_ipa(kern_hyp_va(mmu), ipa, level);
>   }
>   
> +static void handle___kvm_tlb_flush_vmid_ipa_nsh(struct kvm_cpu_context *host_ctxt)
> +{
> +	DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
> +	DECLARE_REG(phys_addr_t, ipa, host_ctxt, 2);
> +	DECLARE_REG(int, level, host_ctxt, 3);
> +
> +	__kvm_tlb_flush_vmid_ipa_nsh(kern_hyp_va(mmu), ipa, level);
> +}
> +
>   static void handle___kvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt)
>   {
>   	DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
> @@ -315,6 +324,7 @@ static const hcall_t host_hcall[] = {
>   	HANDLE_FUNC(__kvm_vcpu_run),
>   	HANDLE_FUNC(__kvm_flush_vm_context),
>   	HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa),
> +	HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa_nsh),
>   	HANDLE_FUNC(__kvm_tlb_flush_vmid),
>   	HANDLE_FUNC(__kvm_flush_cpu_context),
>   	HANDLE_FUNC(__kvm_timer_set_cntvoff),
> diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
> index d296d617f5896..ef2b70587f933 100644
> --- a/arch/arm64/kvm/hyp/nvhe/tlb.c
> +++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
> @@ -109,6 +109,60 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
>   	__tlb_switch_to_host(&cxt);
>   }
>   
> +void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
> +				  phys_addr_t ipa, int level)
> +{
> +	struct tlb_inv_context cxt;
> +
> +	dsb(nshst);
> +
> +	/* Switch to requested VMID */
> +	__tlb_switch_to_guest(mmu, &cxt);
> +
> +	/*
> +	 * We could do so much better if we had the VA as well.
> +	 * Instead, we invalidate Stage-2 for this IPA, and the
> +	 * whole of Stage-1. Weep...
> +	 */
> +	ipa >>= 12;
> +	__tlbi_level(ipas2e1, ipa, level);
> +
> +	/*
> +	 * We have to ensure completion of the invalidation at Stage-2,
> +	 * since a table walk on another CPU could refill a TLB with a
> +	 * complete (S1 + S2) walk based on the old Stage-2 mapping if
> +	 * the Stage-1 invalidation happened first.
> +	 */
> +	dsb(nsh);
> +	__tlbi(vmalle1);
> +	dsb(nsh);
> +	isb();
> +
> +	/*
> +	 * If the host is running at EL1 and we have a VPIPT I-cache,
> +	 * then we must perform I-cache maintenance at EL2 in order for
> +	 * it to have an effect on the guest. Since the guest cannot hit
> +	 * I-cache lines allocated with a different VMID, we don't need
> +	 * to worry about junk out of guest reset (we nuke the I-cache on
> +	 * VMID rollover), but we do need to be careful when remapping
> +	 * executable pages for the same guest. This can happen when KSM
> +	 * takes a CoW fault on an executable page, copies the page into
> +	 * a page that was previously mapped in the guest and then needs
> +	 * to invalidate the guest view of the I-cache for that page
> +	 * from EL1. To solve this, we invalidate the entire I-cache when
> +	 * unmapping a page from a guest if we have a VPIPT I-cache but
> +	 * the host is running at EL1. As above, we could do better if
> +	 * we had the VA.
> +	 *
> +	 * The moral of this story is: if you have a VPIPT I-cache, then
> +	 * you should be running with VHE enabled.
> +	 */
> +	if (icache_is_vpipt())
> +		icache_inval_all_pou();
> +
> +	__tlb_switch_to_host(&cxt);
> +}
> +
>   void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
>   {
>   	struct tlb_inv_context cxt;
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 48c5a95c6e8cd..023269dd84f76 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1189,7 +1189,7 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
>   				       KVM_PGTABLE_WALK_HANDLE_FAULT |
>   				       KVM_PGTABLE_WALK_SHARED);
>   	if (!ret)
> -		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, pgt->mmu, addr, level);
> +		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa_nsh, pgt->mmu, addr, level);
>   	return ret;
>   }
>   
> diff --git a/arch/arm64/kvm/hyp/vhe/tlb.c b/arch/arm64/kvm/hyp/vhe/tlb.c
> index 24cef9b87f9e9..e69da550cdc5b 100644
> --- a/arch/arm64/kvm/hyp/vhe/tlb.c
> +++ b/arch/arm64/kvm/hyp/vhe/tlb.c
> @@ -111,6 +111,38 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
>   	__tlb_switch_to_host(&cxt);
>   }
>   
> +void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
> +				  phys_addr_t ipa, int level)
> +{
> +	struct tlb_inv_context cxt;
> +
> +	dsb(nshst);
> +
> +	/* Switch to requested VMID */
> +	__tlb_switch_to_guest(mmu, &cxt);
> +
> +	/*
> +	 * We could do so much better if we had the VA as well.
> +	 * Instead, we invalidate Stage-2 for this IPA, and the
> +	 * whole of Stage-1. Weep...
> +	 */
> +	ipa >>= 12;
> +	__tlbi_level(ipas2e1, ipa, level);
> +
> +	/*
> +	 * We have to ensure completion of the invalidation at Stage-2,
> +	 * since a table walk on another CPU could refill a TLB with a
> +	 * complete (S1 + S2) walk based on the old Stage-2 mapping if
> +	 * the Stage-1 invalidation happened first.
> +	 */
> +	dsb(nsh);
> +	__tlbi(vmalle1);
> +	dsb(nsh);
> +	isb();
> +
> +	__tlb_switch_to_host(&cxt);
> +}
> +
>   void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
>   {
>   	struct tlb_inv_context cxt;
> 



* Re: [PATCH v7 03/12] KVM: arm64: Add helper for creating unlinked stage2 subtrees
  2023-04-17  6:18   ` Gavin Shan
@ 2023-04-22 20:09     ` Ricardo Koller
  2023-04-22 20:32       ` Oliver Upton
  0 siblings, 1 reply; 40+ messages in thread
From: Ricardo Koller @ 2023-04-22 20:09 UTC (permalink / raw)
  To: Gavin Shan
  Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
	catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
	suzuki.poulose, eric.auger, reijiw, rananta, bgardon, ricarkol,
	Shaoqin Huang

On Mon, Apr 17, 2023 at 02:18:26PM +0800, Gavin Shan wrote:
> On 4/9/23 2:29 PM, Ricardo Koller wrote:
> > Add a stage2 helper, kvm_pgtable_stage2_create_unlinked(), for
> > creating unlinked tables (which is the opposite of
> > kvm_pgtable_stage2_free_unlinked()).  Creating an unlinked table is
> > useful for splitting level 1 and 2 entries into subtrees of PAGE_SIZE
> > PTEs.  For example, a level 1 entry can be split into PAGE_SIZE PTEs
> > by first creating a fully populated tree, and then use it to replace
> > the level 1 entry in a single step.  This will be used in a subsequent
> > commit for eager huge-page splitting (a dirty-logging optimization).
> > 
> > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> > Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
> > ---
> >   arch/arm64/include/asm/kvm_pgtable.h | 26 +++++++++++++++
> >   arch/arm64/kvm/hyp/pgtable.c         | 49 ++++++++++++++++++++++++++++
> >   2 files changed, 75 insertions(+)
> > 
> 
> With the following nits addressed:
> 
> Reviewed-by: Gavin Shan <gshan@redhat.com>
> 
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index 3f2d43ba2b628..c8e0e7d9303b2 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -458,6 +458,32 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
> >    */
> >   void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level);
> > +/**
> > + * kvm_pgtable_stage2_create_unlinked() - Create an unlinked stage-2 paging structure.
> > + * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
> > + * @phys:	Physical address of the memory to map.
> > + * @level:	Starting level of the stage-2 paging structure to be created.
> > + * @prot:	Permissions and attributes for the mapping.
> > + * @mc:		Cache of pre-allocated and zeroed memory from which to allocate
>                 ^^^^^^^^
> Alignment.

This seems to be due to the "+ ". It looks like this without it:

 * kvm_pgtable_stage2_create_unlinked() - Create an unlinked stage-2 paging structure.
 * @pgt:        Page-table structure initialised by kvm_pgtable_stage2_init*().
 * @phys:       Physical address of the memory to map.
 * @level:      Starting level of the stage-2 paging structure to be created.
 * @prot:       Permissions and attributes for the mapping.
 * @mc:         Cache of pre-allocated and zeroed memory from which to allocate
 *              page-table pages.

> 
> > + *		page-table pages.
> > + * @force_pte:  Force mappings to PAGE_SIZE granularity.
> > + *
> > + * Returns an unlinked page-table tree.  This new page-table tree is
> > + * not reachable (i.e., it is unlinked) from the root pgd and it's
> > + * therefore unreachable by the hardware page-table walker. No TLB
> > + * invalidation or CMOs are performed.
> > + *
> > + * If device attributes are not explicitly requested in @prot, then the
> > + * mapping will be normal, cacheable.
> > + *
> > + * Return: The fully populated (unlinked) stage-2 paging structure, or
> > + * an ERR_PTR(error) on failure.
> > + */
> > +kvm_pte_t *kvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt,
> > +					      u64 phys, u32 level,
> > +					      enum kvm_pgtable_prot prot,
> > +					      void *mc, bool force_pte);
> > +
> >   /**
> >    * kvm_pgtable_stage2_map() - Install a mapping in a guest stage-2 page-table.
> >    * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 633679ee3c49a..477d2be67d401 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1222,6 +1222,55 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
> >   	return kvm_pgtable_walk(pgt, addr, size, &walker);
> >   }
> > +kvm_pte_t *kvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt,
> > +					      u64 phys, u32 level,
> > +					      enum kvm_pgtable_prot prot,
> > +					      void *mc, bool force_pte)
> > +{
> > +	struct stage2_map_data map_data = {
> > +		.phys		= phys,
> > +		.mmu		= pgt->mmu,
> > +		.memcache	= mc,
> > +		.force_pte	= force_pte,
> > +	};
> > +	struct kvm_pgtable_walker walker = {
> > +		.cb		= stage2_map_walker,
> > +		.flags		= KVM_PGTABLE_WALK_LEAF |
> > +				  KVM_PGTABLE_WALK_SKIP_BBM_TLBI |
> > +				  KVM_PGTABLE_WALK_SKIP_CMO,
> > +		.arg		= &map_data,
> > +	};
> > +	/* .addr (the IPA) is irrelevant for an unlinked table */
> > +	struct kvm_pgtable_walk_data data = {
> > +		.walker	= &walker,
> > +		.addr	= 0,
> > +		.end	= kvm_granule_size(level),
> > +	};
> 
> The comment about '.addr' seems incorrect. The IPA address is still
> used to locate the page table entry, so I think it would be something
> like below:
> 
> 	/* The IPA address (.addr) is relative to zero */
> 

Extended it to say this:

         * The IPA address (.addr) is relative to zero. The goal is to
         * map "kvm_granule_size(level) - 0" worth of pages.

> > +	struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
> > +	kvm_pte_t *pgtable;
> > +	int ret;
> > +
> > +	if (!IS_ALIGNED(phys, kvm_granule_size(level)))
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	ret = stage2_set_prot_attr(pgt, prot, &map_data.attr);
> > +	if (ret)
> > +		return ERR_PTR(ret);
> > +
> > +	pgtable = mm_ops->zalloc_page(mc);
> > +	if (!pgtable)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	ret = __kvm_pgtable_walk(&data, mm_ops, (kvm_pteref_t)pgtable,
> > +				 level + 1);
> > +	if (ret) {
> > +		kvm_pgtable_stage2_free_unlinked(mm_ops, pgtable, level);
> > +		mm_ops->put_page(pgtable);
> > +		return ERR_PTR(ret);
> > +	}
> > +
> > +	return pgtable;
> > +}
> >   int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
> >   			      struct kvm_pgtable_mm_ops *mm_ops,
> > 
> 
> Thanks,
> Gavin
> 


* Re: [PATCH v7 03/12] KVM: arm64: Add helper for creating unlinked stage2 subtrees
  2023-04-22 20:09     ` Ricardo Koller
@ 2023-04-22 20:32       ` Oliver Upton
  2023-04-22 20:37         ` Ricardo Koller
  0 siblings, 1 reply; 40+ messages in thread
From: Oliver Upton @ 2023-04-22 20:32 UTC (permalink / raw)
  To: Ricardo Koller
  Cc: Gavin Shan, pbonzini, maz, oupton, yuzenghui, dmatlack, kvm,
	kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, reijiw, rananta,
	bgardon, ricarkol, Shaoqin Huang

On Sat, Apr 22, 2023 at 01:09:26PM -0700, Ricardo Koller wrote:
> On Mon, Apr 17, 2023 at 02:18:26PM +0800, Gavin Shan wrote:
> > > +	/* .addr (the IPA) is irrelevant for an unlinked table */
> > > +	struct kvm_pgtable_walk_data data = {
> > > +		.walker	= &walker,
> > > +		.addr	= 0,
> > > +		.end	= kvm_granule_size(level),
> > > +	};
> > 
> > The comment about '.addr' seems incorrect. The IPA address is still
> > used to locate the page table entry, so I think it would be something
> > like below:
> > 
> > 	/* The IPA address (.addr) is relative to zero */
> > 
> 
> Extended it to say this:
> 
>          * The IPA address (.addr) is relative to zero. The goal is to
>	   * map "kvm_granule_size(level) - 0" worth of pages.

I actually prefer the original wording, as Gavin's suggestion makes this
comment read as though the IPA of the walk bears some degree of
validity, which it does not.

The intent of the code is to create some *ambiguous* input address
range, so maybe:

	/*
	 * The input address (.addr) is irrelevant for walking an
	 * unlinked table. Construct an ambiguous IA range to map
	 * kvm_granule_size(level) worth of memory.
	 */

-- 
Thanks,
Oliver


* Re: [PATCH v7 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()
  2023-04-17  6:38   ` Gavin Shan
@ 2023-04-22 20:32     ` Ricardo Koller
  2023-04-23  6:58       ` Gavin Shan
  0 siblings, 1 reply; 40+ messages in thread
From: Ricardo Koller @ 2023-04-22 20:32 UTC (permalink / raw)
  To: Gavin Shan
  Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
	catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
	suzuki.poulose, eric.auger, reijiw, rananta, bgardon, ricarkol,
	Shaoqin Huang

On Mon, Apr 17, 2023 at 02:38:06PM +0800, Gavin Shan wrote:
> On 4/9/23 2:29 PM, Ricardo Koller wrote:
> > Add a new stage2 function, kvm_pgtable_stage2_split(), for splitting a
> > range of huge pages. This will be used for eager-splitting huge pages
> > into PAGE_SIZE pages. The goal is to avoid having to split huge pages
> > on write-protection faults, and instead use this function to do it
> > ahead of time for large ranges (e.g., all guest memory in 1G chunks at
> > a time).
> > 
> > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> > Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
> > ---
> >   arch/arm64/include/asm/kvm_pgtable.h |  19 +++++
> >   arch/arm64/kvm/hyp/pgtable.c         | 103 +++++++++++++++++++++++++++
> >   2 files changed, 122 insertions(+)
> > 
> 
> With the following nits addressed:
> 
> Reviewed-by: Gavin Shan <gshan@redhat.com>
> 
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index c8e0e7d9303b2..32e5d42bf020f 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -653,6 +653,25 @@ bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr);
> >    */
> >   int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);
> > +/**
> > + * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
> > + *				to PAGE_SIZE guest pages.
> > + * @pgt:	 Page-table structure initialised by kvm_pgtable_stage2_init().
> > + * @addr:	 Intermediate physical address from which to split.
> > + * @size:	 Size of the range.
> > + * @mc:		 Cache of pre-allocated and zeroed memory from which to allocate
>                  ^^^^^^^^
> Alignment.
>

Same as in the previous commit. This is due to the added "+ " in the
diff. The line looks aligned without it.

> > + *		 page-table pages.
> > + *
> > + * The function tries to split any level 1 or 2 entry that overlaps
> > + * with the input range (given by @addr and @size).
> > + *
> > + * Return: 0 on success, negative error code on failure. Note that
> > + * kvm_pgtable_stage2_split() is best effort: it tries to break as many
> > + * blocks in the input range as allowed by the capacity of @mc.
> > + */
> > +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
> > +			     struct kvm_mmu_memory_cache *mc);
> > +
> >   /**
> >    * kvm_pgtable_walk() - Walk a page-table.
> >    * @pgt:	Page-table structure initialised by kvm_pgtable_*_init().
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 477d2be67d401..48c5a95c6e8cd 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1272,6 +1272,109 @@ kvm_pte_t *kvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt,
> >   	return pgtable;
> >   }
> > +/*
> > + * Get the number of page-tables needed to replace a block with a
> > + * fully populated tree up to the PTE entries. Note that @level is
> > + * interpreted as in "level @level entry".
> > + */
> > +static int stage2_block_get_nr_page_tables(u32 level)
> > +{
> > +	switch (level) {
> > +	case 1:
> > +		return PTRS_PER_PTE + 1;
> > +	case 2:
> > +		return 1;
> > +	case 3:
> > +		return 0;
> > +	default:
> > +		WARN_ON_ONCE(level < KVM_PGTABLE_MIN_BLOCK_LEVEL ||
> > +			     level >= KVM_PGTABLE_MAX_LEVELS);
> > +		return -EINVAL;
> > +	};
> > +}
> > +
> 
> When the starting level is 3, it's not a block mapping if I'm correct. Besides,
> the caller (stage2_split_walker()) bails when the starting level is 3. In this
> case, the changes may be integrated to stage2_split_walker(), which is the only
> caller. Otherwise, 'inline' is worthy to have.
> 
> 	nr_pages = (kvm_granule_shift(level) == PUD_SHIFT && kvm_granule_shift(level) != PMD_SHIFT) ?
>                    (PTRS_PER_PTE + 1) : 1;
> 

Mind if I keep the function? It helps explaining what's going on: we
need to calculate the number of pages needed to replace a block (and how
it's done). Regarding the "inline", Marc suggested removing it as the
compiler will figure it out.
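
(As a reading aid, the arithmetic behind those counts, assuming a 4K granule
where PTRS_PER_PTE == 512; a standalone illustration, not the kernel helper
itself:)

#include <assert.h>

#define PTRS_PER_PTE_ASSUMED	512	/* assumption: 4K granule */

/*
 * Pages needed to replace one level-@level block with a tree fully
 * populated down to PTEs:
 *   level 1 (1G block): 1 level-2 table + 512 level-3 tables = 513
 *   level 2 (2M block): 1 level-3 table                      = 1
 *   level 3 (4K page):  already a PTE                        = 0
 */
static int nr_pages_to_split(int level)
{
	switch (level) {
	case 1:
		return PTRS_PER_PTE_ASSUMED + 1;
	case 2:
		return 1;
	default:
		return 0;
	}
}

int main(void)
{
	assert(nr_pages_to_split(1) == PTRS_PER_PTE_ASSUMED + 1);	/* 513 */
	assert(nr_pages_to_split(2) == 1);
	return 0;
}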

> > +static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > +			       enum kvm_pgtable_walk_flags visit)
> > +{
> > +	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> > +	struct kvm_mmu_memory_cache *mc = ctx->arg;
> > +	struct kvm_s2_mmu *mmu;
> > +	kvm_pte_t pte = ctx->old, new, *childp;
> > +	enum kvm_pgtable_prot prot;
> > +	u32 level = ctx->level;
> > +	bool force_pte;
> > +	int nr_pages;
> > +	u64 phys;
> > +
> > +	/* No huge-pages exist at the last level */
> > +	if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> > +		return 0;
> > +
> > +	/* We only split valid block mappings */
> > +	if (!kvm_pte_valid(pte))
> > +		return 0;
> > +
> > +	nr_pages = stage2_block_get_nr_page_tables(level);
> > +	if (nr_pages < 0)
> > +		return nr_pages;
> > +
> > +	if (mc->nobjs >= nr_pages) {
> > +		/* Build a tree mapped down to the PTE granularity. */
> > +		force_pte = true;
> > +	} else {
> > +		/*
> > +		 * Don't force PTEs, so create_unlinked() below does
> > +		 * not populate the tree up to the PTE level. The
> > +		 * consequence is that the call will require a single
> > +		 * page of level 2 entries at level 1, or a single
> > +		 * page of PTEs at level 2. If we are at level 1, the
> > +		 * PTEs will be created recursively.
> > +		 */
> > +		force_pte = false;
> > +		nr_pages = 1;
> > +	}
> > +
> > +	if (mc->nobjs < nr_pages)
> > +		return -ENOMEM;
> > +
> > +	mmu = container_of(mc, struct kvm_s2_mmu, split_page_cache);
> > +	phys = kvm_pte_to_phys(pte);
> > +	prot = kvm_pgtable_stage2_pte_prot(pte);
> > +
> > +	childp = kvm_pgtable_stage2_create_unlinked(mmu->pgt, phys,
> > +						    level, prot, mc, force_pte);
> > +	if (IS_ERR(childp))
> > +		return PTR_ERR(childp);
> > +
> > +	if (!stage2_try_break_pte(ctx, mmu)) {
> > +		kvm_pgtable_stage2_free_unlinked(mm_ops, childp, level);
> > +		mm_ops->put_page(childp);
> > +		return -EAGAIN;
> > +	}
> > +
> > +	/*
> > +	 * Note, the contents of the page table are guaranteed to be made
> > +	 * visible before the new PTE is assigned because stage2_make_pte()
> > +	 * writes the PTE using smp_store_release().
> > +	 */
> > +	new = kvm_init_table_pte(childp, mm_ops);
> > +	stage2_make_pte(ctx, new);
> > +	dsb(ishst);
> > +	return 0;
> > +}
> > +
> > +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
> > +			     struct kvm_mmu_memory_cache *mc)
> > +{
> > +	struct kvm_pgtable_walker walker = {
> > +		.cb	= stage2_split_walker,
> > +		.flags	= KVM_PGTABLE_WALK_LEAF,
> > +		.arg	= mc,
> > +	};
> > +
> > +	return kvm_pgtable_walk(pgt, addr, size, &walker);
> > +}
> > +
> >   int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
> >   			      struct kvm_pgtable_mm_ops *mm_ops,
> >   			      enum kvm_pgtable_stage2_flags flags,
> > 
> 
> Thanks,
> Gavin
> 


* Re: [PATCH v7 03/12] KVM: arm64: Add helper for creating unlinked stage2 subtrees
  2023-04-22 20:32       ` Oliver Upton
@ 2023-04-22 20:37         ` Ricardo Koller
  2023-04-23  6:55           ` Gavin Shan
  0 siblings, 1 reply; 40+ messages in thread
From: Ricardo Koller @ 2023-04-22 20:37 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Gavin Shan, pbonzini, maz, oupton, yuzenghui, dmatlack, kvm,
	kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
	alexandru.elisei, suzuki.poulose, eric.auger, reijiw, rananta,
	bgardon, ricarkol, Shaoqin Huang

On Sat, Apr 22, 2023 at 08:32:02PM +0000, Oliver Upton wrote:
> On Sat, Apr 22, 2023 at 01:09:26PM -0700, Ricardo Koller wrote:
> > On Mon, Apr 17, 2023 at 02:18:26PM +0800, Gavin Shan wrote:
> > > > +	/* .addr (the IPA) is irrelevant for an unlinked table */
> > > > +	struct kvm_pgtable_walk_data data = {
> > > > +		.walker	= &walker,
> > > > +		.addr	= 0,
> > > > +		.end	= kvm_granule_size(level),
> > > > +	};
> > > 
> > > The comment about '.addr' seems incorrect. The IPA address is still
> > > used to locate the page table entry, so I think it would be something
> > > like below:
> > > 
> > > 	/* The IPA address (.addr) is relative to zero */
> > > 
> > 
> > Extended it to say this:
> > 
> >          * The IPA address (.addr) is relative to zero. The goal is to
> >	   * map "kvm_granule_size(level) - 0" worth of pages.
> 
> I actually prefer the original wording, as Gavin's suggestion makes this
> comment read as though the IPA of the walk bears some degree of
> validity, which it does not.
> 
> The intent of the code is to create some *ambiguous* input address
> range, so maybe:
> 
> 	/*
> 	 * The input address (.addr) is irrelevant for walking an
> 	 * unlinked table. Construct an ambiguous IA range to map
> 	 * kvm_granule_size(level) worth of memory.
> 	 */
> 

OK, this is the winner. Will go with this one in v8. Gavin, let me know
if you are not OK with this.

Thank you both,
Ricardo

> -- 
> Thanks,
> Oliver


* Re: [PATCH v7 03/12] KVM: arm64: Add helper for creating unlinked stage2 subtrees
  2023-04-22 20:37         ` Ricardo Koller
@ 2023-04-23  6:55           ` Gavin Shan
  0 siblings, 0 replies; 40+ messages in thread
From: Gavin Shan @ 2023-04-23  6:55 UTC (permalink / raw)
  To: Ricardo Koller, Oliver Upton
  Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
	catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
	suzuki.poulose, eric.auger, reijiw, rananta, bgardon, ricarkol,
	Shaoqin Huang

On 4/23/23 4:37 AM, Ricardo Koller wrote:
> On Sat, Apr 22, 2023 at 08:32:02PM +0000, Oliver Upton wrote:
>> On Sat, Apr 22, 2023 at 01:09:26PM -0700, Ricardo Koller wrote:
>>> On Mon, Apr 17, 2023 at 02:18:26PM +0800, Gavin Shan wrote:
>>>>> +	/* .addr (the IPA) is irrelevant for an unlinked table */
>>>>> +	struct kvm_pgtable_walk_data data = {
>>>>> +		.walker	= &walker,
>>>>> +		.addr	= 0,
>>>>> +		.end	= kvm_granule_size(level),
>>>>> +	};
>>>>
>>>> The comment about '.addr' seems incorrect. The IPA address is still
>>>> used to locate the page table entry, so I think it would be something
>>>> like below:
>>>>
>>>> 	/* The IPA address (.addr) is relative to zero */
>>>>
>>>
>>> Extended it to say this:
>>>
>>>           * The IPA address (.addr) is relative to zero. The goal is to
>>> 	   * map "kvm_granule_size(level) - 0" worth of pages.
>>
>> I actually prefer the original wording, as Gavin's suggestion makes this
>> comment read as though the IPA of the walk bears some degree of
>> validity, which it does not.
>>
>> The intent of the code is to create some *ambiguous* input address
>> range, so maybe:
>>
>> 	/*
>> 	 * The input address (.addr) is irrelevant for walking an
>> 	 * unlinked table. Construct an ambiguous IA range to map
>> 	 * kvm_granule_size(level) worth of memory.
>> 	 */
>>
> 
> OK, this is the winner. Will go with this one in v8. Gavin, let me know
> if you are not OK with this.
> 

Looks good to me either.

Thanks,
Gavin



* Re: [PATCH v7 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()
  2023-04-22 20:32     ` Ricardo Koller
@ 2023-04-23  6:58       ` Gavin Shan
  0 siblings, 0 replies; 40+ messages in thread
From: Gavin Shan @ 2023-04-23  6:58 UTC (permalink / raw)
  To: Ricardo Koller
  Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
	catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
	suzuki.poulose, eric.auger, reijiw, rananta, bgardon, ricarkol,
	Shaoqin Huang

On 4/23/23 4:32 AM, Ricardo Koller wrote:
> On Mon, Apr 17, 2023 at 02:38:06PM +0800, Gavin Shan wrote:
>> On 4/9/23 2:29 PM, Ricardo Koller wrote:
>>> Add a new stage2 function, kvm_pgtable_stage2_split(), for splitting a
>>> range of huge pages. This will be used for eager-splitting huge pages
>>> into PAGE_SIZE pages. The goal is to avoid having to split huge pages
>>> on write-protection faults, and instead use this function to do it
>>> ahead of time for large ranges (e.g., all guest memory in 1G chunks at
>>> a time).
>>>
>>> Signed-off-by: Ricardo Koller <ricarkol@google.com>
>>> Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
>>> ---
>>>    arch/arm64/include/asm/kvm_pgtable.h |  19 +++++
>>>    arch/arm64/kvm/hyp/pgtable.c         | 103 +++++++++++++++++++++++++++
>>>    2 files changed, 122 insertions(+)
>>>
>>
>> With the following nits addressed:
>>
>> Reviewed-by: Gavin Shan <gshan@redhat.com>
>>
>>> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
>>> index c8e0e7d9303b2..32e5d42bf020f 100644
>>> --- a/arch/arm64/include/asm/kvm_pgtable.h
>>> +++ b/arch/arm64/include/asm/kvm_pgtable.h
>>> @@ -653,6 +653,25 @@ bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr);
>>>     */
>>>    int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);
>>> +/**
>>> + * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
>>> + *				to PAGE_SIZE guest pages.
>>> + * @pgt:	 Page-table structure initialised by kvm_pgtable_stage2_init().
>>> + * @addr:	 Intermediate physical address from which to split.
>>> + * @size:	 Size of the range.
>>> + * @mc:		 Cache of pre-allocated and zeroed memory from which to allocate
>>                   ^^^^^^^^
>> Alignment.
>>
> 
> Same as in the previous commit. This is due to the added "+ " in the
> diff. The line looks aligned without it.
> 
>>> + *		 page-table pages.
>>> + *
>>> + * The function tries to split any level 1 or 2 entry that overlaps
>>> + * with the input range (given by @addr and @size).
>>> + *
>>> + * Return: 0 on success, negative error code on failure. Note that
>>> + * kvm_pgtable_stage2_split() is best effort: it tries to break as many
>>> + * blocks in the input range as allowed by the capacity of @mc.
>>> + */
>>> +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
>>> +			     struct kvm_mmu_memory_cache *mc);
>>> +
>>>    /**
>>>     * kvm_pgtable_walk() - Walk a page-table.
>>>     * @pgt:	Page-table structure initialised by kvm_pgtable_*_init().
>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>> index 477d2be67d401..48c5a95c6e8cd 100644
>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>> @@ -1272,6 +1272,109 @@ kvm_pte_t *kvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt,
>>>    	return pgtable;
>>>    }
>>> +/*
>>> + * Get the number of page-tables needed to replace a block with a
>>> + * fully populated tree up to the PTE entries. Note that @level is
>>> + * interpreted as in "level @level entry".
>>> + */
>>> +static int stage2_block_get_nr_page_tables(u32 level)
>>> +{
>>> +	switch (level) {
>>> +	case 1:
>>> +		return PTRS_PER_PTE + 1;
>>> +	case 2:
>>> +		return 1;
>>> +	case 3:
>>> +		return 0;
>>> +	default:
>>> +		WARN_ON_ONCE(level < KVM_PGTABLE_MIN_BLOCK_LEVEL ||
>>> +			     level >= KVM_PGTABLE_MAX_LEVELS);
>>> +		return -EINVAL;
>>> +	};
>>> +}
>>> +
>>
>> When the starting level is 3, it's not a block mapping if I'm correct. Besides,
>> the caller (stage2_split_walker()) bails when the starting level is 3. In this
>> case, the changes may be integrated to stage2_split_walker(), which is the only
>> caller. Otherwise, 'inline' is worthy to have.
>>
>> 	nr_pages = (kvm_granule_shift(level) == PUD_SHIFT && kvm_granule_shift(level) != PMD_SHIFT) ?
>>                     (PTRS_PER_PTE + 1) : 1;
>>
> 
> Mind if I keep the function? It helps explaining what's going on: we
> need to calculate the number of pages needed to replace a block (and how
> it's done). Regarding the "inline", Marc suggested removing it as the
> compiler will figure it out.
> 

Ok. Let's keep it. The original code looks obvious at least. The "inline"
isn't necessary if gcc is smart enough.

>>> +static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
>>> +			       enum kvm_pgtable_walk_flags visit)
>>> +{
>>> +	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
>>> +	struct kvm_mmu_memory_cache *mc = ctx->arg;
>>> +	struct kvm_s2_mmu *mmu;
>>> +	kvm_pte_t pte = ctx->old, new, *childp;
>>> +	enum kvm_pgtable_prot prot;
>>> +	u32 level = ctx->level;
>>> +	bool force_pte;
>>> +	int nr_pages;
>>> +	u64 phys;
>>> +
>>> +	/* No huge-pages exist at the last level */
>>> +	if (level == KVM_PGTABLE_MAX_LEVELS - 1)
>>> +		return 0;
>>> +
>>> +	/* We only split valid block mappings */
>>> +	if (!kvm_pte_valid(pte))
>>> +		return 0;
>>> +
>>> +	nr_pages = stage2_block_get_nr_page_tables(level);
>>> +	if (nr_pages < 0)
>>> +		return nr_pages;
>>> +
>>> +	if (mc->nobjs >= nr_pages) {
>>> +		/* Build a tree mapped down to the PTE granularity. */
>>> +		force_pte = true;
>>> +	} else {
>>> +		/*
>>> +		 * Don't force PTEs, so create_unlinked() below does
>>> +		 * not populate the tree up to the PTE level. The
>>> +		 * consequence is that the call will require a single
>>> +		 * page of level 2 entries at level 1, or a single
>>> +		 * page of PTEs at level 2. If we are at level 1, the
>>> +		 * PTEs will be created recursively.
>>> +		 */
>>> +		force_pte = false;
>>> +		nr_pages = 1;
>>> +	}
>>> +
>>> +	if (mc->nobjs < nr_pages)
>>> +		return -ENOMEM;
>>> +
>>> +	mmu = container_of(mc, struct kvm_s2_mmu, split_page_cache);
>>> +	phys = kvm_pte_to_phys(pte);
>>> +	prot = kvm_pgtable_stage2_pte_prot(pte);
>>> +
>>> +	childp = kvm_pgtable_stage2_create_unlinked(mmu->pgt, phys,
>>> +						    level, prot, mc, force_pte);
>>> +	if (IS_ERR(childp))
>>> +		return PTR_ERR(childp);
>>> +
>>> +	if (!stage2_try_break_pte(ctx, mmu)) {
>>> +		kvm_pgtable_stage2_free_unlinked(mm_ops, childp, level);
>>> +		mm_ops->put_page(childp);
>>> +		return -EAGAIN;
>>> +	}
>>> +
>>> +	/*
>>> +	 * Note, the contents of the page table are guaranteed to be made
>>> +	 * visible before the new PTE is assigned because stage2_make_pte()
>>> +	 * writes the PTE using smp_store_release().
>>> +	 */
>>> +	new = kvm_init_table_pte(childp, mm_ops);
>>> +	stage2_make_pte(ctx, new);
>>> +	dsb(ishst);
>>> +	return 0;
>>> +}
>>> +
>>> +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
>>> +			     struct kvm_mmu_memory_cache *mc)
>>> +{
>>> +	struct kvm_pgtable_walker walker = {
>>> +		.cb	= stage2_split_walker,
>>> +		.flags	= KVM_PGTABLE_WALK_LEAF,
>>> +		.arg	= mc,
>>> +	};
>>> +
>>> +	return kvm_pgtable_walk(pgt, addr, size, &walker);
>>> +}
>>> +
>>>    int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
>>>    			      struct kvm_pgtable_mm_ops *mm_ops,
>>>    			      enum kvm_pgtable_stage2_flags flags,
>>>

Thanks,
Gavin



* Re: [PATCH v7 05/12] KVM: arm64: Refactor kvm_arch_commit_memory_region()
  2023-04-17  6:41   ` Gavin Shan
@ 2023-04-23 19:47     ` Ricardo Koller
  0 siblings, 0 replies; 40+ messages in thread
From: Ricardo Koller @ 2023-04-23 19:47 UTC (permalink / raw)
  To: Gavin Shan
  Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
	catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
	suzuki.poulose, eric.auger, reijiw, rananta, bgardon, ricarkol,
	Shaoqin Huang

On Mon, Apr 17, 2023 at 02:41:39PM +0800, Gavin Shan wrote:
> On 4/9/23 2:29 PM, Ricardo Koller wrote:
> > Refactor kvm_arch_commit_memory_region() as a preparation for a future
> > commit to look cleaner and more understandable. Also, it looks more
> > like its x86 counterpart (in kvm_mmu_slot_apply_flags()).
> > 
> > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> > Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
> > ---
> >   arch/arm64/kvm/mmu.c | 15 +++++++++++----
> >   1 file changed, 11 insertions(+), 4 deletions(-)
> > 
> 
> With the following nits addressed:
> 
> Reviewed-by: Gavin Shan <gshan@redhat.com>
> 
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index efdaab3f154de..37d7d2aa472ab 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1761,20 +1761,27 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
> >   				   const struct kvm_memory_slot *new,
> >   				   enum kvm_mr_change change)
> >   {
> > +	bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
> > +
> >   	/*
> >   	 * At this point memslot has been committed and there is an
> >   	 * allocated dirty_bitmap[], dirty pages will be tracked while the
> >   	 * memory slot is write protected.
> >   	 */
> > -	if (change != KVM_MR_DELETE && new->flags & KVM_MEM_LOG_DIRTY_PAGES) {
> > +	if (log_dirty_pages) {
> > +
> > +		if (change == KVM_MR_DELETE)
> > +			return;
> > +
> >   		/*
> >   		 * If we're with initial-all-set, we don't need to write
> >   		 * protect any pages because they're all reported as dirty.
> >   		 * Huge pages and normal pages will be write protect gradually.
> >   		 */
> 
> The comments need to be adjusted after this series is applied. The huge pages
> won't be write protected gradually. Instead, the huge pages will be split and
> write protected in one shoot.
>

I see, this comment is a bit confusing. Will update it to this:

                /*
                 * Pages are write-protected in either of these two
                 * cases:
                 *
                 * 1. with initial-all-set: gradually with CLEAR ioctls,
                 */
                if (kvm_dirty_log_manual_protect_and_init_set(kvm))
                        return;
                /*
                 * or
                 * 2. without initial-all-set: all in one shot when
                 *    enabling dirty logging.
                 */
                kvm_mmu_wp_memory_region(kvm, new->id);

Will update the comment to include splitting when introducing eager-splitting
on the CLEAR ioctl (case 1.): "KVM: arm64: Split huge pages during
KVM_CLEAR_DIRTY_LOG".

> > -		if (!kvm_dirty_log_manual_protect_and_init_set(kvm)) {
> > -			kvm_mmu_wp_memory_region(kvm, new->id);
> > -		}
> > +		if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> > +			return;
> > +
> > +		kvm_mmu_wp_memory_region(kvm, new->id);
> >   	}
> >   }
> > 
> 
> Thanks,
> Gavin
> 


* Re: [PATCH v7 08/12] KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
  2023-04-17  7:04   ` Gavin Shan
@ 2023-04-23 20:27     ` Ricardo Koller
  2023-04-24 11:14       ` Gavin Shan
  0 siblings, 1 reply; 40+ messages in thread
From: Ricardo Koller @ 2023-04-23 20:27 UTC (permalink / raw)
  To: Gavin Shan
  Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
	catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
	suzuki.poulose, eric.auger, reijiw, rananta, bgardon, ricarkol,
	Oliver Upton

On Mon, Apr 17, 2023 at 03:04:47PM +0800, Gavin Shan wrote:
> On 4/9/23 2:29 PM, Ricardo Koller wrote:
> > Add a capability for userspace to specify the eager split chunk size.
> > The chunk size specifies how many pages to break at a time, using a
> > single allocation. The bigger the chunk size, the more pages need to be
> > allocated ahead of time.
> > 
> > Suggested-by: Oliver Upton <oliver.upton@linux.dev>
> > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> > ---
> >   Documentation/virt/kvm/api.rst       | 28 ++++++++++++++++++++++++++
> >   arch/arm64/include/asm/kvm_host.h    | 15 ++++++++++++++
> >   arch/arm64/include/asm/kvm_pgtable.h | 18 +++++++++++++++++
> >   arch/arm64/kvm/arm.c                 | 30 ++++++++++++++++++++++++++++
> >   arch/arm64/kvm/mmu.c                 |  3 +++
> >   include/uapi/linux/kvm.h             |  2 ++
> >   6 files changed, 96 insertions(+)
> > 
> 
> With the following comments addressed:
> 
> Reviewed-by: Gavin Shan <gshan@redhat.com>
> 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 62de0768d6aa5..f8faa80d87057 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -8380,6 +8380,34 @@ structure.
> >   When getting the Modified Change Topology Report value, the attr->addr
> >   must point to a byte where the value will be stored or retrieved from.
> > +8.40 KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
> > +---------------------------------------
> > +
> > +:Capability: KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
> > +:Architectures: arm64
> > +:Type: vm
> > +:Parameters: arg[0] is the new split chunk size.
> > +:Returns: 0 on success, -EINVAL if any memslot was already created.
>                                                   ^^^^^^^^^^^^^^^^^^^
> 
> Maybe s/was already created/has been created
> 
> > +
> > +This capability sets the chunk size used in Eager Page Splitting.
> > +
> > +Eager Page Splitting improves the performance of dirty-logging (used
> > +in live migrations) when guest memory is backed by huge-pages.  It
> > +avoids splitting huge-pages (into PAGE_SIZE pages) on fault, by doing
> > +it eagerly when enabling dirty logging (with the
> > +KVM_MEM_LOG_DIRTY_PAGES flag for a memory region), or when using
> > +KVM_CLEAR_DIRTY_LOG.
> > +
> > +The chunk size specifies how many pages to break at a time, using a
> > +single allocation for each chunk. The bigger the chunk size, the more pages
> > +need to be allocated ahead of time.
> > +
> > +The chunk size needs to be a valid block size. The list of acceptable
> > +block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a 64bit
> > +bitmap (each bit describing a block size). Setting
> > +KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE to 0 disables Eager Page Splitting;
> > +this is the default value.
> > +
> 
> s/a 64bit bitmap/a 64-bit bitmap
> 
> For the last sentence, maybe:
> 
> The default value is 0, to disable the eager page splitting.

Fixed, much better.
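
(For anyone reading the documentation above, a minimal userspace sketch of how
the knob is exercised; vm_fd, the 2M value and the error handling are
illustrative assumptions, and KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE needs uapi
headers that carry this series:)

#include <err.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Must run before any memslot is created, per the documentation above. */
static void enable_eager_split(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE,
		/* 0 disables it; otherwise one bit from KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES */
		.args[0] = 2UL * 1024 * 1024,
	};

	if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) < 0)
		err(1, "KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE");
}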

> 
> >   9. Known KVM API problems
> >   =========================
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index a1892a8f60323..b87da1ebc3454 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -158,6 +158,21 @@ struct kvm_s2_mmu {
> >   	/* The last vcpu id that ran on each physical CPU */
> >   	int __percpu *last_vcpu_ran;
> > +#define KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT 0
> > +	/*
> > +	 * Memory cache used to split
> > +	 * KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE worth of huge pages. It
> > +	 * is used to allocate stage2 page tables while splitting huge
> > +	 * pages. Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE
> > +	 * influences both the capacity of the split page cache, and
> > +	 * how often KVM reschedules. Be wary of raising CHUNK_SIZE
> > +	 * too high.
> > +	 *
> > +	 * Protected by kvm->slots_lock.
> > +	 */
> > +	struct kvm_mmu_memory_cache split_page_cache;
> > +	uint64_t split_page_chunk_size;
> > +
> >   	struct kvm_arch *arch;
> >   };
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index 32e5d42bf020f..889bd7afeb355 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -92,6 +92,24 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
> >   	return level >= KVM_PGTABLE_MIN_BLOCK_LEVEL;
> >   }
> > +static inline u64 kvm_supported_block_sizes(void)
> > +{
> > +	u32 level = KVM_PGTABLE_MIN_BLOCK_LEVEL;
> > +	u64 res = 0;
> > +
> > +	for (; level < KVM_PGTABLE_MAX_LEVELS; level++)
> > +		res |= BIT(kvm_granule_shift(level));
> > +
> > +	return res;
> > +}
> > +
> 
> maybe s/@res/@r

changed

> 
> > +static inline bool kvm_is_block_size_supported(u64 size)
> > +{
> > +	bool is_power_of_two = !((size) & ((size)-1));
> > +
> > +	return is_power_of_two && (size & kvm_supported_block_sizes());
> > +}
> > +
> 
> IS_ALIGNED() maybe used here.

I've been trying to reuse some bitmap related function in the kernel,
like IS_ALIGNED(), but can't find anything. Or at least it doesn't occur
to me how.

kvm_is_block_size_supported() returns true if @size matches only one of
the bits set in kvm_supported_block_sizes(). For example, given these
supported sizes: 10000100001000.

kvm_is_block_size_supported(100000000) => true
kvm_is_block_size_supported(1100) => false
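
(To make that concrete, a standalone sketch, not the kernel helper; the
bitmap literal is just the binary string above:)

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static const uint64_t supported_sizes = 0x2108;	/* 0b10000100001000 */

static bool block_size_supported(uint64_t size)
{
	bool is_power_of_two = size && !(size & (size - 1));

	/* true only when @size is exactly one of the supported bits */
	return is_power_of_two && (size & supported_sizes);
}

int main(void)
{
	assert(block_size_supported(0x100));	/* 0b100000000 -> true */
	assert(!block_size_supported(0xc));	/* 0b1100      -> false */
	return 0;
}
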

> 
> >   /**
> >    * struct kvm_pgtable_mm_ops - Memory management callbacks.
> >    * @zalloc_page:		Allocate a single zeroed memory page.
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 3bd732eaf0872..34fd3c59a9b82 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -67,6 +67,7 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> >   			    struct kvm_enable_cap *cap)
> >   {
> >   	int r;
> > +	u64 new_cap;
> >   	if (cap->flags)
> >   		return -EINVAL;
> > @@ -91,6 +92,26 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> >   		r = 0;
> >   		set_bit(KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED, &kvm->arch.flags);
> >   		break;
> > +	case KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE:
> > +		new_cap = cap->args[0];
> > +
> > +		mutex_lock(&kvm->lock);
> > +		mutex_lock(&kvm->slots_lock);
> > +		/*
> > +		 * To keep things simple, allow changing the chunk
> > +		 * size only if there are no memslots already created.
> > +		 */
> 
> 		/*
> 		 * To keep things simple, allow changing the chunk size
> 		 * only when no memory slots have been created.
> 		 */
> 
> > +		if (!kvm_are_all_memslots_empty(kvm)) {
> > +			r = -EINVAL;
> > +		} else if (new_cap && !kvm_is_block_size_supported(new_cap)) {
> > +			r = -EINVAL;
> > +		} else {
> > +			r = 0;
> > +			kvm->arch.mmu.split_page_chunk_size = new_cap;
> > +		}
> > +		mutex_unlock(&kvm->slots_lock);
> > +		mutex_unlock(&kvm->lock);
> > +		break;
> >   	default:
> >   		r = -EINVAL;
> >   		break;
> > @@ -288,6 +309,15 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >   	case KVM_CAP_ARM_PTRAUTH_GENERIC:
> >   		r = system_has_full_ptr_auth();
> >   		break;
> > +	case KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE:
> > +		if (kvm)
> > +			r = kvm->arch.mmu.split_page_chunk_size;
> > +		else
> > +			r = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
> > +		break;
> > +	case KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES:
> > +		r = kvm_supported_block_sizes();
> > +		break;
> 
> kvm_supported_block_sizes() returns u64, but @r is 32-bits in width. It may be
> worthy to make the return value from kvm_supported_block_sizes() as u32.
> 
> >   	default:
> >   		r = 0;
> >   	}
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index a2800e5c42712..898985b09321a 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -756,6 +756,9 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
> >   	for_each_possible_cpu(cpu)
> >   		*per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;
> 
> It may be worthy to have comments like below:
> 
> 	/* The eager page splitting is disabled by default */
> > +	mmu->split_page_cache.gfp_zero = __GFP_ZERO;
> > +	mmu->split_page_chunk_size = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
> > +
> >   	mmu->pgt = pgt;
> >   	mmu->pgd_phys = __pa(pgt->pgd);
> >   	return 0;
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index d77aef872a0a0..f18b48fcd25ba 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -1184,6 +1184,8 @@ struct kvm_ppc_resize_hpt {
> >   #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
> >   #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
> >   #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
> > +#define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 227
> > +#define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 228
> >   #ifdef KVM_CAP_IRQ_ROUTING
> > 
> 
> Thanks,
> Gavin
> 


* Re: [PATCH v7 08/12] KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
  2023-04-23 20:27     ` Ricardo Koller
@ 2023-04-24 11:14       ` Gavin Shan
  2023-04-24 18:48         ` Ricardo Koller
  0 siblings, 1 reply; 40+ messages in thread
From: Gavin Shan @ 2023-04-24 11:14 UTC (permalink / raw)
  To: Ricardo Koller
  Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
	catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
	suzuki.poulose, eric.auger, reijiw, rananta, bgardon, ricarkol,
	Oliver Upton

On 4/24/23 4:27 AM, Ricardo Koller wrote:
> On Mon, Apr 17, 2023 at 03:04:47PM +0800, Gavin Shan wrote:
>> On 4/9/23 2:29 PM, Ricardo Koller wrote:
>>> Add a capability for userspace to specify the eager split chunk size.
>>> The chunk size specifies how many pages to break at a time, using a
>>> single allocation. The bigger the chunk size, the more pages need to be
>>> allocated ahead of time.
>>>
>>> Suggested-by: Oliver Upton <oliver.upton@linux.dev>
>>> Signed-off-by: Ricardo Koller <ricarkol@google.com>
>>> ---
>>>    Documentation/virt/kvm/api.rst       | 28 ++++++++++++++++++++++++++
>>>    arch/arm64/include/asm/kvm_host.h    | 15 ++++++++++++++
>>>    arch/arm64/include/asm/kvm_pgtable.h | 18 +++++++++++++++++
>>>    arch/arm64/kvm/arm.c                 | 30 ++++++++++++++++++++++++++++
>>>    arch/arm64/kvm/mmu.c                 |  3 +++
>>>    include/uapi/linux/kvm.h             |  2 ++
>>>    6 files changed, 96 insertions(+)
>>>
>>
>> With the following comments addressed:
>>
>> Reviewed-by: Gavin Shan <gshan@redhat.com>
>>

[...]

>>
>>> +static inline bool kvm_is_block_size_supported(u64 size)
>>> +{
>>> +	bool is_power_of_two = !((size) & ((size)-1));
>>> +
>>> +	return is_power_of_two && (size & kvm_supported_block_sizes());
>>> +}
>>> +
>>
>> IS_ALIGNED() maybe used here.
> 
> I've been trying to reuse some bitmap related function in the kernel,
> like IS_ALIGNED(), but can't find anything. Or at least it doesn't occur
> to me how.
> 
> kvm_is_block_size_supported() returns true if @size matches only one of
> the bits set in kvm_supported_block_sizes(). For example, given these
> supported sizes: 10000100001000.
> 
> kvm_is_block_size_supported(100000000) => true
> kvm_is_block_size_supported(1100) => false
> 

I was actually thinking of @is_power_of_two being replaced by IS_ALIGNED().
For example:

static inline bool kvm_is_block_size_supported(u64 size)
{
     return IS_ALIGNED(size, size) && (size & kvm_supported_block_sizes());
}

IS_ALIGNED() is defined in include/linux/align.h, as below. It's almost
similar to '((size) & ((size)-1))'

#define IS_ALIGNED(x, a)                (((x) & ((typeof(x))(a) - 1)) == 0)

Thanks,
Gavin



* Re: [PATCH v7 08/12] KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
  2023-04-24 11:14       ` Gavin Shan
@ 2023-04-24 18:48         ` Ricardo Koller
  0 siblings, 0 replies; 40+ messages in thread
From: Ricardo Koller @ 2023-04-24 18:48 UTC (permalink / raw)
  To: Gavin Shan
  Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
	catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
	suzuki.poulose, eric.auger, reijiw, rananta, bgardon, ricarkol,
	Oliver Upton

On Mon, Apr 24, 2023 at 07:14:21PM +0800, Gavin Shan wrote:
> On 4/24/23 4:27 AM, Ricardo Koller wrote:
> > On Mon, Apr 17, 2023 at 03:04:47PM +0800, Gavin Shan wrote:
> > > On 4/9/23 2:29 PM, Ricardo Koller wrote:
> > > > Add a capability for userspace to specify the eager split chunk size.
> > > > The chunk size specifies how many pages to break at a time, using a
> > > > single allocation. The bigger the chunk size, the more pages need to be
> > > > allocated ahead of time.
> > > > 
> > > > Suggested-by: Oliver Upton <oliver.upton@linux.dev>
> > > > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> > > > ---
> > > >    Documentation/virt/kvm/api.rst       | 28 ++++++++++++++++++++++++++
> > > >    arch/arm64/include/asm/kvm_host.h    | 15 ++++++++++++++
> > > >    arch/arm64/include/asm/kvm_pgtable.h | 18 +++++++++++++++++
> > > >    arch/arm64/kvm/arm.c                 | 30 ++++++++++++++++++++++++++++
> > > >    arch/arm64/kvm/mmu.c                 |  3 +++
> > > >    include/uapi/linux/kvm.h             |  2 ++
> > > >    6 files changed, 96 insertions(+)
> > > > 
> > > 
> > > With the following comments addressed:
> > > 
> > > Reviewed-by: Gavin Shan <gshan@redhat.com>
> > > 
> 
> [...]
> 
> > > 
> > > > +static inline bool kvm_is_block_size_supported(u64 size)
> > > > +{
> > > > +	bool is_power_of_two = !((size) & ((size)-1));
> > > > +
> > > > +	return is_power_of_two && (size & kvm_supported_block_sizes());
> > > > +}
> > > > +
> > > 
> > > IS_ALIGNED() maybe used here.
> > 
> > I've been trying to reuse some bitmap related function in the kernel,
> > like IS_ALIGNED(), but can't find anything. Or at least it doesn't occur
> > to me how.
> > 
> > kvm_is_block_size_supported() returns true if @size matches only one of
> > the bits set in kvm_supported_block_sizes(). For example, given these
> > supported sizes: 10000100001000.
> > 
> > kvm_is_block_size_supported(100000000) => true
> > kvm_is_block_size_supported(1100) => false
> > 
> 
> I was actually thinking of @is_power_of_two being replaced by IS_ALIGNED().
> For example:
> 
> static inline bool kvm_is_block_size_supported(u64 size)
> {
>     return IS_ALIGNED(size, size) && (size & kvm_supported_block_sizes());
> }
> 
> IS_ALIGNED() is defined in include/linux/align.h, as below. It's almost
> similar to '((size) & ((size)-1))'
> 
> #define IS_ALIGNED(x, a)                (((x) & ((typeof(x))(a) - 1)) == 0)

Ah! you are right, yes, will use this instead.

Thanks,
Ricardo

> 
> Thanks,
> Gavin
> 


end of thread, other threads:[~2023-04-24 18:50 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
2023-04-09  6:29 [PATCH v7 00/12] Implement Eager Page Splitting for ARM Ricardo Koller
2023-04-09  6:29 ` [PATCH v7 01/12] KVM: arm64: Rename free_removed to free_unlinked Ricardo Koller
2023-04-17  6:08   ` Gavin Shan
2023-04-09  6:29 ` [PATCH v7 02/12] KVM: arm64: Add KVM_PGTABLE_WALK ctx->flags for skipping BBM and CMO Ricardo Koller
2023-04-17  6:10   ` Gavin Shan
2023-04-09  6:29 ` [PATCH v7 02/12] KVM: arm64: Add KVM_PGTABLE_WALK flags for skipping CMOs and BBM TLBIs Ricardo Koller
2023-04-17  6:13   ` Gavin Shan
2023-04-09  6:29 ` [PATCH v7 03/12] KVM: arm64: Add helper for creating unlinked stage2 subtrees Ricardo Koller
2023-04-17  6:18   ` Gavin Shan
2023-04-22 20:09     ` Ricardo Koller
2023-04-22 20:32       ` Oliver Upton
2023-04-22 20:37         ` Ricardo Koller
2023-04-23  6:55           ` Gavin Shan
2023-04-09  6:29 ` [PATCH v7 04/12] KVM: arm64: Add kvm_pgtable_stage2_split() Ricardo Koller
2023-04-09  9:36   ` kernel test robot
2023-04-10 17:40     ` Ricardo Koller
2023-04-17  6:38   ` Gavin Shan
2023-04-22 20:32     ` Ricardo Koller
2023-04-23  6:58       ` Gavin Shan
2023-04-09  6:29 ` [PATCH v7 05/12] KVM: arm64: Refactor kvm_arch_commit_memory_region() Ricardo Koller
2023-04-17  6:41   ` Gavin Shan
2023-04-23 19:47     ` Ricardo Koller
2023-04-17  6:42   ` Gavin Shan
2023-04-09  6:29 ` [PATCH v7 06/12] KVM: arm64: Add kvm_uninit_stage2_mmu() Ricardo Koller
2023-04-17  6:44   ` Gavin Shan
2023-04-09  6:29 ` [PATCH v7 07/12] KVM: arm64: Export kvm_are_all_memslots_empty() Ricardo Koller
2023-04-17  6:47   ` Gavin Shan
2023-04-09  6:29 ` [PATCH v7 08/12] KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE Ricardo Koller
2023-04-17  7:04   ` Gavin Shan
2023-04-23 20:27     ` Ricardo Koller
2023-04-24 11:14       ` Gavin Shan
2023-04-24 18:48         ` Ricardo Koller
2023-04-09  6:29 ` [PATCH v7 09/12] KVM: arm64: Split huge pages when dirty logging is enabled Ricardo Koller
2023-04-17  7:11   ` Gavin Shan
2023-04-09  6:29 ` [PATCH v7 10/12] KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked() Ricardo Koller
2023-04-17  7:14   ` Gavin Shan
2023-04-09  6:29 ` [PATCH v7 11/12] KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG Ricardo Koller
2023-04-17  7:18   ` Gavin Shan
2023-04-09  6:30 ` [PATCH v7 12/12] KVM: arm64: Use local TLBI on permission relaxation Ricardo Koller
2023-04-17  7:20   ` Gavin Shan
