* [RFC PATCH 00/12] KVM: arm64: Eager huge-page splitting for dirty-logging
@ 2022-11-12  8:17 ` Ricardo Koller
  0 siblings, 0 replies; 46+ messages in thread
From: Ricardo Koller @ 2022-11-12  8:17 UTC (permalink / raw)
  To: pbonzini, maz, oupton, dmatlack, qperret, catalin.marinas,
	andrew.jones, seanjc, alexandru.elisei, suzuki.poulose,
	eric.auger, gshan, reijiw, rananta, bgardon
  Cc: kvm, kvmarm, kvmarm, ricarkol, Ricardo Koller

Hi,

I'm sending this RFC mainly to get some early feedback on the approach used
for implementing "Eager Page Splitting" on ARM.  "Eager Page Splitting"
improves the performance of dirty-logging (used in live migrations) when
guest memory is backed by huge-pages.  It's an optimization that has been
used in Google Cloud since 2016 on x86, and for the last couple of months
on ARM.

I tried multiple ways of implementing this optimization on ARM: from
completely reusing the stage2 mapper, to implementing a new walker from
scratch, and some versions in between. This RFC is one of those in
between. They all have similar performance benefits, based on some light
performance testing (mainly dirty_log_perf_test).

Background and motivation
=========================
Dirty logging is typically used for live-migration iterative copying.  KVM
implements dirty-logging at the PAGE_SIZE granularity (I'll refer to 4K
pages from now on).  It does so by faulting on write-protected 4K pages; a
minimal sketch of the userspace side of this flow is shown after the list
below.  Therefore, enabling dirty-logging on a huge-page requires breaking
it into 4K pages in the first place.  KVM does this breaking on fault, and
because it's in the critical path it only maps the 4K page that faulted;
every other 4K page is left unmapped.  This is not great for performance on
ARM for a couple of reasons:

- Splitting on fault can halt vcpus for milliseconds in some
  implementations. Splitting a block PTE requires using a broadcasted TLB
  invalidation (TLBI) for every huge-page (due to the break-before-make
  requirement). Note that x86 doesn't need this. We observed some
  implementations that take milliseconds to complete broadcasted TLBIs
  when done in parallel from multiple vcpus.  And that's exactly what
  happens when doing it on fault: multiple vcpus fault at the same time
  triggering TLBIs in parallel.

- Read intensive guest workloads end up paying for dirty-logging.  Only
  mapping the faulting 4K page means that all the other pages that were
  part of the huge-page will now be unmapped. The effect is that any
  access, including reads, now has to fault.
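
For reference, here is a minimal (untested) sketch of the userspace side of
this dirty-logging flow. vm_fd, the memslot layout, and the bitmap sizing
are made up for the example:

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <stdlib.h>

static void *start_dirty_logging(int vm_fd,
				 struct kvm_userspace_memory_region *slot)
{
	uint64_t npages = slot->memory_size / 4096;	/* 4K granularity */
	void *bitmap = calloc((npages + 63) / 64, sizeof(uint64_t));
	struct kvm_dirty_log log = { .slot = slot->slot };

	/* Re-registering the memslot with this flag enables dirty logging. */
	slot->flags |= KVM_MEM_LOG_DIRTY_PAGES;
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, slot);

	/* One iteration: fetch the bitmap of pages dirtied since the last call. */
	log.dirty_bitmap = bitmap;
	ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
	return bitmap;
}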

Eager Page Splitting (on ARM)
=============================
Eager Page Splitting fixes the above two issues by eagerly splitting
huge-pages when enabling dirty logging. The goal is to avoid doing it while
faulting on write-protected pages. This is what the TDP MMU does for x86
[0], except that x86 does it for different reasons: to avoid grabbing the
MMU lock on fault. Note that taking care of write-protection faults still
requires grabbing the MMU lock on ARM, but not on x86 (with the
fast_page_fault path).

An additional benefit of eagerly splitting huge-pages is that it can be
done in a controlled way (e.g., via an ioctl). This series provides two
knobs for doing it, just like its x86 counterpart: when enabling dirty
logging, and when using the KVM_CLEAR_DIRTY_LOG ioctl. The benefit of doing
it on KVM_CLEAR_DIRTY_LOG is that this ioctl takes ranges, and not complete
memslots like when enabling dirty logging. This means that the cost of
splitting (mainly broadcasted TLBIs) can be throttled: split a range, wait
for a bit, split another range, etc. The benefits of this approach were
presented by Oliver Upton at KVM Forum 2022 [1].
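
For illustration, a rough (untested) sketch of what that throttling could
look like from userspace. It assumes KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 has
been enabled, and vm_fd, the chunk size, and the pause are made up:

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define CHUNK_PAGES	(512 * 64)	/* 128M of 4K pages; multiple of 64 as required */

static void clear_dirty_log_throttled(int vm_fd, __u32 slot, __u64 npages,
				      void *dirty_bitmap)
{
	struct kvm_clear_dirty_log clr = { .slot = slot };
	__u64 first;

	for (first = 0; first < npages; first += CHUNK_PAGES) {
		clr.first_page = first;
		clr.num_pages = first + CHUNK_PAGES <= npages ?
				CHUNK_PAGES : npages - first;
		/* Bit 0 of the bitmap passed here corresponds to first_page. */
		clr.dirty_bitmap = (char *)dirty_bitmap + first / 8;
		/* Write-protects (and, with this series, splits) one range. */
		ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clr);
		usleep(1000);	/* wait for a bit before the next range */
	}
}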

Implementation
==============
Patches 1-4 add a pgtable utility function for splitting huge block PTEs:
kvm_pgtable_stage2_split(). Patches 5-6 add support for not doing
break-before-make on huge-page breaking when FEAT_BBM level 2 is supported.
Patches 7-11 add support for eagerly splitting huge-pages when enabling
dirty-logging and when using the KVM_CLEAR_DIRTY_LOG ioctl. Note that this
is just like what x86 does, and the code is actually based on it.  And
finally, patch 12:

	KVM: arm64: Use local TLBI on permission relaxation

adds support for using local TLBIs instead of broadcasts when doing
permission relaxation. This last patch is key to achieving good performance
during dirty-logging: with huge-pages eagerly broken ahead of time,
write-protection faults are handled with permission relaxation instead of
by mapping new pages. I got this patch (indirectly) from Marc Zyngier and
took the liberty of adding a commit message.
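
For context, this is a very rough sketch (not the actual mmu.c code from
this series; the split_page_cache field is made up for illustration) of how
the new helper ends up being driven when splitting a whole memslot in 1G
chunks:

static void eager_split_memslot(struct kvm *kvm,
				struct kvm_memory_slot *memslot)
{
	struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
	u64 addr = memslot->base_gfn << PAGE_SHIFT;
	u64 end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;

	for (; addr < end; addr += SZ_1G) {
		/* Split one 1G chunk at a time, holding the MMU lock. */
		write_lock(&kvm->mmu_lock);
		kvm_pgtable_stage2_split(pgt, addr, min_t(u64, SZ_1G, end - addr),
					 &kvm->arch.mmu.split_page_cache);
		write_unlock(&kvm->mmu_lock);
	}
}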

Note: this applies on top of kvmarm/next at 3ba3e7266ab6, although most of
the tests were done using [2] (v6.1-rc1 + [3]).

Performance evaluation
======================
The performance benefits were tested on an Ampere AmpereOne using the
dirty_log_perf_test selftest with 2M huge-pages.

The first test uses a write-only sequential workload where the stride is 2M
instead of 4K [2]. The idea with this experiment is to emulate a random
access pattern writing a different huge-page at every access. Observe that
the benefit increases with the number of vcpus: up to 5.76x for 152 vcpus.

./dirty_log_perf_test_sparse -s anonymous_hugetlb_2mb -b 1G -v $i -i 3 -m 2

	+-------+----------+---------------+
	| vCPUs |    next  | next + series |
	|       |    (ms)  |          (ms) |
	+-------+----------+---------------+
	|    1  |    3.34  |       1.85    |
	|    2  |    3.51  |       1.92    |
	|    4  |    3.93  |       1.99    |
	|    8  |    5.78  |       1.97    |
	|   16  |   10.06  |       2.08    |
	|   32  |   21.07  |       3.06    |
	|   64  |   41.75  |       6.92    |
	|  128  |   86.09  |      12.07    |
	|  152  |  109.72  |      18.94    |
	+-------+----------+---------------+

The second test measures the benefit of eager page splitting on
read-intensive workloads (1 write for every 10 reads). As in the first
test, the benefit increases with the number of vcpus, up to 8.82x for 152
vcpus.

./dirty_log_perf_test -s anonymous_hugetlb_2mb -b 1G -v $i -i 3 -m 2 -f 10

	+-------+----------+---------------+
	| vCPUs |   next   | next + series |
	|       |  (sec)   |         (sec) |
	+-------+----------+---------------+
	|    1  |    0.65  |       0.09    |
	|    2  |    0.70  |       0.09    |
	|    4  |    0.71  |       0.09    |
	|    8  |    0.72  |       0.10    |
	|   16  |    0.76  |       0.10    |
	|   32  |    1.61  |       0.15    |
	|   64  |    3.46  |       0.36    |
	|  128  |    5.49  |       0.74    |
	|  152  |    6.44  |       0.73    |
	+-------+----------+---------------+

Thanks,
Ricardo

[0] https://lore.kernel.org/kvm/20220119230739.2234394-1-dmatlack@google.com/
[1] https://kvmforum2022.sched.com/event/15jJq/kvmarm-at-scale-improvements-to-the-mmu-in-the-face-of-hardware-growing-pains-oliver-upton-google
[2] https://github.com/ricarkol/linux/commit/f78e9102b2bff4fb7f30bee810d7d611a537b46d
[3] https://lore.kernel.org/kvmarm/20221107215644.1895162-1-oliver.upton@linux.dev/

Marc Zyngier (1):
  KVM: arm64: Use local TLBI on permission relaxation

Ricardo Koller (11):
  KVM: arm64: Relax WARN check in stage2_make_pte()
  KVM: arm64: Allow visiting block PTEs in post-order
  KVM: arm64: Add stage2_create_removed()
  KVM: arm64: Add kvm_pgtable_stage2_split()
  arm64: Add a capability for FEAT_BBM level 2
  KVM: arm64: Split block PTEs without using break-before-make
  KVM: arm64: Refactor kvm_arch_commit_memory_region()
  KVM: arm64: Add kvm_uninit_stage2_mmu()
  KVM: arm64: Split huge pages when dirty logging is enabled
  KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked()
  KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG

 arch/arm64/include/asm/esr.h         |   1 +
 arch/arm64/include/asm/kvm_arm.h     |   1 +
 arch/arm64/include/asm/kvm_asm.h     |   4 +
 arch/arm64/include/asm/kvm_host.h    |  30 ++++
 arch/arm64/include/asm/kvm_mmu.h     |   1 +
 arch/arm64/include/asm/kvm_pgtable.h |  33 ++++-
 arch/arm64/kernel/cpufeature.c       |  11 ++
 arch/arm64/kvm/hyp/nvhe/hyp-main.c   |  10 ++
 arch/arm64/kvm/hyp/nvhe/setup.c      |   2 +-
 arch/arm64/kvm/hyp/nvhe/tlb.c        |  54 +++++++
 arch/arm64/kvm/hyp/pgtable.c         | 205 +++++++++++++++++++++++++--
 arch/arm64/kvm/hyp/vhe/tlb.c         |  32 +++++
 arch/arm64/kvm/mmu.c                 | 177 +++++++++++++++++++----
 arch/arm64/tools/cpucaps             |   1 +
 14 files changed, 517 insertions(+), 45 deletions(-)

-- 
2.38.1.431.g37b22c650d-goog


* [RFC PATCH 01/12] KVM: arm64: Relax WARN check in stage2_make_pte()
  2022-11-12  8:17 ` Ricardo Koller
@ 2022-11-12  8:17   ` Ricardo Koller
  -1 siblings, 0 replies; 46+ messages in thread
From: Ricardo Koller @ 2022-11-12  8:17 UTC (permalink / raw)
  To: pbonzini, maz, oupton, dmatlack, qperret, catalin.marinas,
	andrew.jones, seanjc, alexandru.elisei, suzuki.poulose,
	eric.auger, gshan, reijiw, rananta, bgardon
  Cc: kvm, kvmarm, kvmarm, ricarkol, Ricardo Koller

stage2_make_pte() triggers a warning when used in a non-shared walk, as
PTEs are not "locked" when walking non-shared. Add a check so that the WARN
only fires for shared walks, allowing the function to also be used in
non-shared walks.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index c12462439e70..b16107bf917c 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -733,7 +733,8 @@ static void stage2_make_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t n
 {
 	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
 
-	WARN_ON(!stage2_pte_is_locked(*ctx->ptep));
+	if (kvm_pgtable_walk_shared(ctx))
+		WARN_ON(!stage2_pte_is_locked(*ctx->ptep));
 
 	if (stage2_pte_is_counted(new))
 		mm_ops->get_page(ctx->ptep);
-- 
2.38.1.431.g37b22c650d-goog


* [RFC PATCH 02/12] KVM: arm64: Allow visiting block PTEs in post-order
  2022-11-12  8:17 ` Ricardo Koller
@ 2022-11-12  8:17   ` Ricardo Koller
  -1 siblings, 0 replies; 46+ messages in thread
From: Ricardo Koller @ 2022-11-12  8:17 UTC (permalink / raw)
  To: pbonzini, maz, oupton, dmatlack, qperret, catalin.marinas,
	andrew.jones, seanjc, alexandru.elisei, suzuki.poulose,
	eric.auger, gshan, reijiw, rananta, bgardon
  Cc: kvm, kvmarm, kvmarm, ricarkol, Ricardo Koller

The page table walker does not visit block PTEs in post-order. But there
are some cases where doing so would be beneficial, for example: breaking a
1G block PTE into a full tree in post-order avoids visiting the new tree.

Allow post order visits of block PTEs. This will be used in a subsequent
commit for eagerly breaking huge pages.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/arm64/include/asm/kvm_pgtable.h |  4 ++--
 arch/arm64/kvm/hyp/nvhe/setup.c      |  2 +-
 arch/arm64/kvm/hyp/pgtable.c         | 25 ++++++++++++-------------
 3 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index e2edeed462e8..d2e4a5032146 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -255,7 +255,7 @@ struct kvm_pgtable {
  *					entries.
  * @KVM_PGTABLE_WALK_TABLE_PRE:		Visit table entries before their
  *					children.
- * @KVM_PGTABLE_WALK_TABLE_POST:	Visit table entries after their
+ * @KVM_PGTABLE_WALK_POST:		Visit leaf or table entries after their
  *					children.
  * @KVM_PGTABLE_WALK_SHARED:		Indicates the page-tables may be shared
  *					with other software walkers.
@@ -263,7 +263,7 @@ struct kvm_pgtable {
 enum kvm_pgtable_walk_flags {
 	KVM_PGTABLE_WALK_LEAF			= BIT(0),
 	KVM_PGTABLE_WALK_TABLE_PRE		= BIT(1),
-	KVM_PGTABLE_WALK_TABLE_POST		= BIT(2),
+	KVM_PGTABLE_WALK_POST			= BIT(2),
 	KVM_PGTABLE_WALK_SHARED			= BIT(3),
 };
 
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index b47d969ae4d3..b0c1618d053b 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -265,7 +265,7 @@ static int fix_hyp_pgtable_refcnt(void)
 {
 	struct kvm_pgtable_walker walker = {
 		.cb	= fix_hyp_pgtable_refcnt_walker,
-		.flags	= KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
+		.flags	= KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_POST,
 		.arg	= pkvm_pgtable.mm_ops,
 	};
 
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index b16107bf917c..1b371f6dbac2 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -206,16 +206,15 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
 	if (!table) {
 		data->addr = ALIGN_DOWN(data->addr, kvm_granule_size(level));
 		data->addr += kvm_granule_size(level);
-		goto out;
+	} else {
+		childp = (kvm_pteref_t)kvm_pte_follow(ctx.old, mm_ops);
+		ret = __kvm_pgtable_walk(data, mm_ops, childp, level + 1);
+		if (ret)
+			goto out;
 	}
 
-	childp = (kvm_pteref_t)kvm_pte_follow(ctx.old, mm_ops);
-	ret = __kvm_pgtable_walk(data, mm_ops, childp, level + 1);
-	if (ret)
-		goto out;
-
-	if (ctx.flags & KVM_PGTABLE_WALK_TABLE_POST)
-		ret = kvm_pgtable_visitor_cb(data, &ctx, KVM_PGTABLE_WALK_TABLE_POST);
+	if (ctx.flags & KVM_PGTABLE_WALK_POST)
+		ret = kvm_pgtable_visitor_cb(data, &ctx, KVM_PGTABLE_WALK_POST);
 
 out:
 	return ret;
@@ -494,7 +493,7 @@ u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
 	struct kvm_pgtable_walker walker = {
 		.cb	= hyp_unmap_walker,
 		.arg	= &unmapped,
-		.flags	= KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
+		.flags	= KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_POST,
 	};
 
 	if (!pgt->mm_ops->page_count)
@@ -542,7 +541,7 @@ void kvm_pgtable_hyp_destroy(struct kvm_pgtable *pgt)
 {
 	struct kvm_pgtable_walker walker = {
 		.cb	= hyp_free_walker,
-		.flags	= KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
+		.flags	= KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_POST,
 	};
 
 	WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker));
@@ -1003,7 +1002,7 @@ int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
 	struct kvm_pgtable_walker walker = {
 		.cb	= stage2_unmap_walker,
 		.arg	= pgt,
-		.flags	= KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
+		.flags	= KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_POST,
 	};
 
 	return kvm_pgtable_walk(pgt, addr, size, &walker);
@@ -1234,7 +1233,7 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
 	struct kvm_pgtable_walker walker = {
 		.cb	= stage2_free_walker,
 		.flags	= KVM_PGTABLE_WALK_LEAF |
-			  KVM_PGTABLE_WALK_TABLE_POST,
+			  KVM_PGTABLE_WALK_POST,
 	};
 
 	WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker));
@@ -1249,7 +1248,7 @@ void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pg
 	struct kvm_pgtable_walker walker = {
 		.cb	= stage2_free_walker,
 		.flags	= KVM_PGTABLE_WALK_LEAF |
-			  KVM_PGTABLE_WALK_TABLE_POST,
+			  KVM_PGTABLE_WALK_POST,
 	};
 	struct kvm_pgtable_walk_data data = {
 		.walker	= &walker,
-- 
2.38.1.431.g37b22c650d-goog


* [RFC PATCH 03/12] KVM: arm64: Add stage2_create_removed()
  2022-11-12  8:17 ` Ricardo Koller
@ 2022-11-12  8:17   ` Ricardo Koller
  -1 siblings, 0 replies; 46+ messages in thread
From: Ricardo Koller @ 2022-11-12  8:17 UTC (permalink / raw)
  To: pbonzini, maz, oupton, dmatlack, qperret, catalin.marinas,
	andrew.jones, seanjc, alexandru.elisei, suzuki.poulose,
	eric.auger, gshan, reijiw, rananta, bgardon
  Cc: kvm, kvmarm, kvmarm, ricarkol, Ricardo Koller

Add a new stage2 function, stage2_create_removed(), for creating removed
tables (the opposite of kvm_pgtable_stage2_free_removed()).  Creating a
removed table is useful for splitting block PTEs into tables.  For example,
a 1G block PTE can be split into 4K PTEs by first creating a fully
populated tree, and then using it to replace the 1G PTE in a single step.
This will be used in a subsequent commit for eager huge page splitting.

No functional change intended. This new function will be used in a
subsequent commit.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 93 ++++++++++++++++++++++++++++++++++++
 1 file changed, 93 insertions(+)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 1b371f6dbac2..d1f309128118 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1173,6 +1173,99 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
 	return kvm_pgtable_walk(pgt, addr, size, &walker);
 }
 
+struct stage2_create_removed_data {
+	void				*memcache;
+	struct kvm_pgtable_mm_ops	*mm_ops;
+	u64				phys;
+	kvm_pte_t			attr;
+};
+
+/*
+ * This flag should only be used by the create_removed walker, as it would
+ * be misinterpreted in an installed PTE.
+ */
+#define KVM_INVALID_PTE_NO_PAGE		BIT(9)
+
+/*
+ * Failure to allocate a table results in setting the respective PTE with a
+ * valid block PTE instead of a table PTE.
+ */
+static int stage2_create_removed_walker(const struct kvm_pgtable_visit_ctx *ctx,
+					enum kvm_pgtable_walk_flags visit)
+{
+	struct stage2_create_removed_data *data = ctx->arg;
+	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
+	u64 granule = kvm_granule_size(ctx->level);
+	kvm_pte_t attr = data->attr;
+	kvm_pte_t *childp = NULL;
+	u32 level = ctx->level;
+	int ret = 0;
+
+	if (level < KVM_PGTABLE_MAX_LEVELS - 1) {
+		childp = mm_ops->zalloc_page(data->memcache);
+		ret = childp ? 0 : -ENOMEM;
+	}
+
+	if (childp)
+		*ctx->ptep = kvm_init_table_pte(childp, mm_ops);
+
+	/*
+	 * Create a block PTE if we are at the max level, or if we failed
+	 * to create a table (we are not at max level).
+	 */
+	if (level == KVM_PGTABLE_MAX_LEVELS - 1 || !childp) {
+		*ctx->ptep = kvm_init_valid_leaf_pte(data->phys, attr, level);
+		data->phys += granule;
+	}
+
+	if (ctx->old != KVM_INVALID_PTE_NO_PAGE)
+		mm_ops->get_page(ctx->ptep);
+
+	return ret;
+}
+
+/*
+ * Create a removed page-table tree of PAGE_SIZE leaf PTEs under *ptep.
+ * This new page-table tree is not reachable (i.e., it is removed) from the
+ * root (the pgd).
+ *
+ * This function will try to create as many entries in the tree as allowed
+ * by the memcache capacity. It always writes a valid PTE into *ptep. In
+ * the best case, it returns 0 and a fully populated tree under *ptep. In
+ * the worst case, it returns -ENOMEM and *ptep will contain a valid block
+ * PTE covering the expected level, or any other valid combination (e.g., a
+ * 1G table PTE pointing to half 2M block PTEs and half 2M table PTEs).
+ */
+static int stage2_create_removed(kvm_pte_t *ptep, u64 phys, u32 level,
+				 kvm_pte_t attr, void *memcache,
+				 struct kvm_pgtable_mm_ops *mm_ops)
+{
+	struct stage2_create_removed_data alloc_data = {
+		.phys		= phys,
+		.memcache	= memcache,
+		.mm_ops		= mm_ops,
+		.attr		= attr,
+	};
+	struct kvm_pgtable_walker walker = {
+		.cb	= stage2_create_removed_walker,
+		.flags	= KVM_PGTABLE_WALK_LEAF,
+		.arg	= &alloc_data,
+	};
+	struct kvm_pgtable_walk_data data = {
+		.walker	= &walker,
+
+		/* The IPA is irrelevant for a removed table. */
+		.addr	= 0,
+		.end	= kvm_granule_size(level),
+	};
+
+	/*
+	 * The walker should not try to get a reference to the memory
+	 * holding this ptep (it's not a page).
+	 */
+	*ptep = KVM_INVALID_PTE_NO_PAGE;
+	return __kvm_pgtable_visit(&data, mm_ops, ptep, level);
+}
 
 int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
 			      struct kvm_pgtable_mm_ops *mm_ops,
-- 
2.38.1.431.g37b22c650d-goog


* [RFC PATCH 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()
  2022-11-12  8:17 ` Ricardo Koller
@ 2022-11-12  8:17   ` Ricardo Koller
  -1 siblings, 0 replies; 46+ messages in thread
From: Ricardo Koller @ 2022-11-12  8:17 UTC (permalink / raw)
  To: pbonzini, maz, oupton, dmatlack, qperret, catalin.marinas,
	andrew.jones, seanjc, alexandru.elisei, suzuki.poulose,
	eric.auger, gshan, reijiw, rananta, bgardon
  Cc: kvm, kvmarm, kvmarm, ricarkol, Ricardo Koller

Add a new stage2 function, kvm_pgtable_stage2_split(), for splitting a
range of huge pages. This will be used for eager-splitting huge pages into
PAGE_SIZE pages. The goal is to avoid having to split huge pages on
write-protection faults, and instead use this function to do it ahead of
time for large ranges (e.g., all guest memory in 1G chunks at a time).

No functional change intended. This new function will be used in a
subsequent commit.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/arm64/include/asm/kvm_pgtable.h | 29 +++++++++++
 arch/arm64/kvm/hyp/pgtable.c         | 74 ++++++++++++++++++++++++++++
 2 files changed, 103 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index d2e4a5032146..396ebb0949fb 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -594,6 +594,35 @@ bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr);
  */
 int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);
 
+/**
+ * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
+ *				to PAGE_SIZE guest pages.
+ * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
+ * @addr:	Intermediate physical address from which to split.
+ * @size:	Size of the range.
+ * @mc:		Cache of pre-allocated and zeroed memory from which to allocate
+ *		page-table pages.
+ *
+ * @addr and the end (@addr + @size) are effectively aligned down and up to
+ * the top level huge-page block size. This is an example using 1GB
+ * huge-pages and 4KB granules.
+ *
+ *                          [---input range---]
+ *                          :                 :
+ * [--1G block pte--][--1G block pte--][--1G block pte--][--1G block pte--]
+ *                          :                 :
+ *                   [--2MB--][--2MB--][--2MB--][--2MB--]
+ *                          :                 :
+ *                   [ ][ ][:][ ][ ][ ][ ][ ][:][ ][ ][ ]
+ *                          :                 :
+ *
+ * Return: 0 on success, negative error code on failure. Note that
+ * kvm_pgtable_stage2_split() is best effort: it tries to break as many
+ * blocks in the input range as allowed by the size of the memcache. It
+ * will fail if it wasn't able to break any block.
+ */
+int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size, void *mc);
+
 /**
  * kvm_pgtable_walk() - Walk a page-table.
  * @pgt:	Page-table structure initialised by kvm_pgtable_*_init().
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index d1f309128118..9c42eff6d42e 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1267,6 +1267,80 @@ static int stage2_create_removed(kvm_pte_t *ptep, u64 phys, u32 level,
 	return __kvm_pgtable_visit(&data, mm_ops, ptep, level);
 }
 
+struct stage2_split_data {
+	struct kvm_s2_mmu		*mmu;
+	void				*memcache;
+	struct kvm_pgtable_mm_ops	*mm_ops;
+};
+
+static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
+			       enum kvm_pgtable_walk_flags visit)
+{
+	struct stage2_split_data *data = ctx->arg;
+	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
+	kvm_pte_t pte = ctx->old, attr, new;
+	enum kvm_pgtable_prot prot;
+	void *mc = data->memcache;
+	u32 level = ctx->level;
+	u64 phys;
+
+	if (WARN_ON_ONCE(kvm_pgtable_walk_shared(ctx)))
+		return -EINVAL;
+
+	/* Nothing to split at the last level */
+	if (level == KVM_PGTABLE_MAX_LEVELS - 1)
+		return 0;
+
+	/* We only split valid block mappings */
+	if (!kvm_pte_valid(pte) || kvm_pte_table(pte, ctx->level))
+		return 0;
+
+	phys = kvm_pte_to_phys(pte);
+	prot = kvm_pgtable_stage2_pte_prot(pte);
+	stage2_set_prot_attr(data->mmu->pgt, prot, &attr);
+
+	/*
+	 * Eager page splitting is best-effort, so we can ignore the error.
+	 * The returned PTE (new) will be valid even if this call returns
+	 * error: new will be a single (big) block PTE.  The only issue is
+	 * that it will affect dirty logging performance, as the huge-pages
+	 * will have to be split on fault, and so we WARN.
+	 */
+	WARN_ON(stage2_create_removed(&new, phys, level, attr, mc, mm_ops));
+
+	stage2_put_pte(ctx, data->mmu, mm_ops);
+
+	/*
+	 * Note, the contents of the page table are guaranteed to be made
+	 * visible before the new PTE is assigned because
+	 * stage2_make_pte() writes the PTE using smp_store_release().
+	 */
+	stage2_make_pte(ctx, new);
+	dsb(ishst);
+	return 0;
+}
+
+int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt,
+			     u64 addr, u64 size, void *mc)
+{
+	int ret;
+
+	struct stage2_split_data split_data = {
+		.mmu		= pgt->mmu,
+		.memcache	= mc,
+		.mm_ops		= pgt->mm_ops,
+	};
+
+	struct kvm_pgtable_walker walker = {
+		.cb	= stage2_split_walker,
+		.flags	= KVM_PGTABLE_WALK_POST,
+		.arg	= &split_data,
+	};
+
+	ret = kvm_pgtable_walk(pgt, addr, size, &walker);
+	return ret;
+}
+
 int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
 			      struct kvm_pgtable_mm_ops *mm_ops,
 			      enum kvm_pgtable_stage2_flags flags,
-- 
2.38.1.431.g37b22c650d-goog


* [RFC PATCH 05/12] arm64: Add a capability for FEAT_BBM level 2
  2022-11-12  8:17 ` Ricardo Koller
@ 2022-11-12  8:17   ` Ricardo Koller
  -1 siblings, 0 replies; 46+ messages in thread
From: Ricardo Koller @ 2022-11-12  8:17 UTC (permalink / raw)
  To: pbonzini, maz, oupton, dmatlack, qperret, catalin.marinas,
	andrew.jones, seanjc, alexandru.elisei, suzuki.poulose,
	eric.auger, gshan, reijiw, rananta, bgardon
  Cc: kvm, kvmarm, kvmarm, ricarkol, Ricardo Koller

Add a new capability to detect "Stage-2 Translation table
break-before-make" (FEAT_BBM) level 2.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/arm64/kernel/cpufeature.c | 11 +++++++++++
 arch/arm64/tools/cpucaps       |  1 +
 2 files changed, 12 insertions(+)

diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 6062454a9067..ff97fb05d430 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2339,6 +2339,17 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
 		.min_field_value = 1,
 		.matches = has_cpuid_feature,
 	},
+	{
+		.desc = "Stage-2 Translation table break-before-make level 2",
+		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
+		.capability = ARM64_HAS_STAGE2_BBM2,
+		.sys_reg = SYS_ID_AA64MMFR2_EL1,
+		.sign = FTR_UNSIGNED,
+		.field_pos = ID_AA64MMFR2_EL1_BBM_SHIFT,
+		.field_width = 4,
+		.min_field_value = 2,
+		.matches = has_cpuid_feature,
+	},
 	{
 		.desc = "TLB range maintenance instructions",
 		.capability = ARM64_HAS_TLB_RANGE,
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index f1c0347ec31a..f421adbdb08b 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -36,6 +36,7 @@ HAS_PAN
 HAS_RAS_EXTN
 HAS_RNG
 HAS_SB
+HAS_STAGE2_BBM2
 HAS_STAGE2_FWB
 HAS_SYSREG_GIC_CPUIF
 HAS_TIDCP1
-- 
2.38.1.431.g37b22c650d-goog


* [RFC PATCH 06/12] KVM: arm64: Split block PTEs without using break-before-make
  2022-11-12  8:17 ` Ricardo Koller
@ 2022-11-12  8:17   ` Ricardo Koller
  -1 siblings, 0 replies; 46+ messages in thread
From: Ricardo Koller @ 2022-11-12  8:17 UTC (permalink / raw)
  To: pbonzini, maz, oupton, dmatlack, qperret, catalin.marinas,
	andrew.jones, seanjc, alexandru.elisei, suzuki.poulose,
	eric.auger, gshan, reijiw, rananta, bgardon
  Cc: kvm, kvmarm, kvmarm, ricarkol, Ricardo Koller

Breaking a huge-page block PTE into an equivalent table of smaller PTEs
does not require using break-before-make (BBM) when FEAT_BBM level 2 is
implemented. Add the respective check for eager page splitting and avoid
using BBM.

Also take care of possible Conflict aborts.  According to the rules
specified in the Arm ARM (DDI 0487H.a) section D5.10.1, "Support levels for
changing block size", splitting a block without break-before-make can
result in a Conflict abort. Handle it by invalidating all of the VM's TLB
entries.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/arm64/include/asm/esr.h     |  1 +
 arch/arm64/include/asm/kvm_arm.h |  1 +
 arch/arm64/kvm/hyp/pgtable.c     | 10 +++++++++-
 arch/arm64/kvm/mmu.c             |  6 ++++++
 4 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
index 15b34fbfca66..6f5b976396e7 100644
--- a/arch/arm64/include/asm/esr.h
+++ b/arch/arm64/include/asm/esr.h
@@ -114,6 +114,7 @@
 #define ESR_ELx_FSC_ACCESS	(0x08)
 #define ESR_ELx_FSC_FAULT	(0x04)
 #define ESR_ELx_FSC_PERM	(0x0C)
+#define ESR_ELx_FSC_CONFLICT	(0x30)
 
 /* ISS field definitions for Data Aborts */
 #define ESR_ELx_ISV_SHIFT	(24)
diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 0df3fc3a0173..58e7cbe3c250 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -333,6 +333,7 @@
 #define FSC_SECC_TTW1	(0x1d)
 #define FSC_SECC_TTW2	(0x1e)
 #define FSC_SECC_TTW3	(0x1f)
+#define FSC_CONFLICT	ESR_ELx_FSC_CONFLICT
 
 /* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
 #define HPFAR_MASK	(~UL(0xf))
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 9c42eff6d42e..36b81df5687e 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1267,6 +1267,11 @@ static int stage2_create_removed(kvm_pte_t *ptep, u64 phys, u32 level,
 	return __kvm_pgtable_visit(&data, mm_ops, ptep, level);
 }
 
+static bool stage2_has_bbm_level2(void)
+{
+	return cpus_have_const_cap(ARM64_HAS_STAGE2_BBM2);
+}
+
 struct stage2_split_data {
 	struct kvm_s2_mmu		*mmu;
 	void				*memcache;
@@ -1308,7 +1313,10 @@ static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
 	 */
 	WARN_ON(stage2_create_removed(&new, phys, level, attr, mc, mm_ops));
 
-	stage2_put_pte(ctx, data->mmu, mm_ops);
+	if (stage2_has_bbm_level2())
+		mm_ops->put_page(ctx->ptep);
+	else
+		stage2_put_pte(ctx, data->mmu, mm_ops);
 
 	/*
 	 * Note, the contents of the page table are guaranteed to be made
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8f26c65693a9..318f7b0aa20b 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1481,6 +1481,12 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
 		return 1;
 	}
 
+	/* Conflict abort? */
+	if (fault_status == FSC_CONFLICT) {
+		kvm_flush_remote_tlbs(vcpu->kvm);
+		return 1;
+	}
+
 	trace_kvm_guest_fault(*vcpu_pc(vcpu), kvm_vcpu_get_esr(vcpu),
 			      kvm_vcpu_get_hfar(vcpu), fault_ipa);
 
-- 
2.38.1.431.g37b22c650d-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 07/12] KVM: arm64: Refactor kvm_arch_commit_memory_region()
  2022-11-12  8:17 ` Ricardo Koller
@ 2022-11-12  8:17   ` Ricardo Koller
  -1 siblings, 0 replies; 46+ messages in thread
From: Ricardo Koller @ 2022-11-12  8:17 UTC (permalink / raw)
  To: pbonzini, maz, oupton, dmatlack, qperret, catalin.marinas,
	andrew.jones, seanjc, alexandru.elisei, suzuki.poulose,
	eric.auger, gshan, reijiw, rananta, bgardon
  Cc: kvm, kvmarm, kvmarm, ricarkol, Ricardo Koller

Refactor kvm_arch_commit_memory_region() in preparation for a future
commit, so that it reads cleaner and is easier to follow. It also ends up
looking more like its x86 counterpart, kvm_mmu_slot_apply_flags().

No functional change intended.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/arm64/kvm/mmu.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 318f7b0aa20b..6599a45eebf5 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1770,20 +1770,27 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 				   const struct kvm_memory_slot *new,
 				   enum kvm_mr_change change)
 {
+	bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
+
 	/*
 	 * At this point memslot has been committed and there is an
 	 * allocated dirty_bitmap[], dirty pages will be tracked while the
 	 * memory slot is write protected.
 	 */
-	if (change != KVM_MR_DELETE && new->flags & KVM_MEM_LOG_DIRTY_PAGES) {
+	if (log_dirty_pages) {
+
+		if (change == KVM_MR_DELETE)
+			return;
+
 		/*
 		 * If we're with initial-all-set, we don't need to write
 		 * protect any pages because they're all reported as dirty.
 		 * Huge pages and normal pages will be write protect gradually.
 		 */
-		if (!kvm_dirty_log_manual_protect_and_init_set(kvm)) {
-			kvm_mmu_wp_memory_region(kvm, new->id);
-		}
+		if (kvm_dirty_log_manual_protect_and_init_set(kvm))
+			return;
+
+		kvm_mmu_wp_memory_region(kvm, new->id);
 	}
 }
 
-- 
2.38.1.431.g37b22c650d-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 08/12] KVM: arm64: Add kvm_uninit_stage2_mmu()
  2022-11-12  8:17 ` Ricardo Koller
@ 2022-11-12  8:17   ` Ricardo Koller
  -1 siblings, 0 replies; 46+ messages in thread
From: Ricardo Koller @ 2022-11-12  8:17 UTC (permalink / raw)
  To: pbonzini, maz, oupton, dmatlack, qperret, catalin.marinas,
	andrew.jones, seanjc, alexandru.elisei, suzuki.poulose,
	eric.auger, gshan, reijiw, rananta, bgardon
  Cc: kvm, kvmarm, kvmarm, ricarkol, Ricardo Koller

Add kvm_uninit_stage2_mmu() and move the kvm_free_stage2_pgd()
call into it. A future commit will add more teardown work to
kvm_uninit_stage2_mmu().

No functional change intended.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/arm64/include/asm/kvm_mmu.h | 1 +
 arch/arm64/kvm/mmu.c             | 7 ++++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index e4a7e6369499..058f3ae5bc26 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -167,6 +167,7 @@ void free_hyp_pgds(void);
 
 void stage2_unmap_vm(struct kvm *kvm);
 int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long type);
+void kvm_uninit_stage2_mmu(struct kvm *kvm);
 void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu);
 int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 			  phys_addr_t pa, unsigned long size, bool writable);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6599a45eebf5..94865c5ce181 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -766,6 +766,11 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
 	return err;
 }
 
+void kvm_uninit_stage2_mmu(struct kvm *kvm)
+{
+	kvm_free_stage2_pgd(&kvm->arch.mmu);
+}
+
 static void stage2_unmap_memslot(struct kvm *kvm,
 				 struct kvm_memory_slot *memslot)
 {
@@ -1869,7 +1874,7 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
-	kvm_free_stage2_pgd(&kvm->arch.mmu);
+	kvm_uninit_stage2_mmu(kvm);
 }
 
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
-- 
2.38.1.431.g37b22c650d-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 09/12] KVM: arm64: Split huge pages when dirty logging is enabled
  2022-11-12  8:17 ` Ricardo Koller
@ 2022-11-12  8:17   ` Ricardo Koller
  -1 siblings, 0 replies; 46+ messages in thread
From: Ricardo Koller @ 2022-11-12  8:17 UTC (permalink / raw)
  To: pbonzini, maz, oupton, dmatlack, qperret, catalin.marinas,
	andrew.jones, seanjc, alexandru.elisei, suzuki.poulose,
	eric.auger, gshan, reijiw, rananta, bgardon
  Cc: kvm, kvmarm, kvmarm, ricarkol, Ricardo Koller

Split huge pages eagerly when enabling dirty logging. The goal is to
avoid doing it while faulting on write-protected pages, which negatively
impacts guest performance.

A memslot marked for dirty logging is split one 1GB chunk at a time.
This is done both to periodically release the mmu_lock and give other
kernel threads a chance to run, and to bound the number of pages that
need to be allocated to split a 1GB range worth of huge pages (or a
single 1GB huge page).  Note that these page allocations can fail, so
eager page splitting is best-effort.  This is not a correctness issue
though, as huge pages can still be split on write-faults.

The benefits of eager page splitting are the same as in x86, added with
commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by the TDP MMU
when dirty logging is enabled"). For example, when running
dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU, 50%
reads, and 2MB HugeTLB memory, the time it takes vCPUs to access all of
their memory after dirty logging is enabled decreased by 44% from 2.58s to
1.42s.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
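For reference, a worked example of the cache sizing introduced below,
using the macros from this patch and assuming the 4KB-granule worst case
of a 1GB chunk:

	EAGER_PAGE_SPLIT_CACHE_CAPACITY
		= DIV_ROUND_UP_ULL(SZ_1G, SZ_1G) + DIV_ROUND_UP_ULL(SZ_1G, SZ_2M)
		= 1 + 512
		= 513 pages

that is, one page for the level-2 table that replaces a 1GB block, plus
512 pages for the level-3 tables that replace its 2MB blocks; roughly 2MB
of page-table memory held in the cache per VM while splitting.
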
 arch/arm64/include/asm/kvm_host.h |  30 ++++++++
 arch/arm64/kvm/mmu.c              | 110 +++++++++++++++++++++++++++++-
 2 files changed, 138 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 63307e7dc9c5..d43f133518cf 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -153,6 +153,36 @@ struct kvm_s2_mmu {
 	/* The last vcpu id that ran on each physical CPU */
 	int __percpu *last_vcpu_ran;
 
+	/*
+	 * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
+	 * pages. It is used to allocate stage2 page tables while splitting
+	 * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
+	 * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
+	 * the capacity of the split page cache (CACHE_CAPACITY), and how often
+	 * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
+	 *
+	 * A good heuristic to pick CHUNK_SIZE is that it should be larger than
+	 * all the available huge-page sizes, and be a multiple of all the
+	 * other ones; for example, 1GB when all the available huge-page sizes
+	 * are (1GB, 2MB, 32MB, 512MB).
+	 *
+	 * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for
+	 * example, 1GB requires the following number of PAGE_SIZE-pages:
+	 * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB).
+	 * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)).
+	 * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB).
+	 * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB).
+	 * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB
+	 * granules.
+	 *
+	 * Protected by kvm->slots_lock.
+	 */
+#define EAGER_PAGE_SPLIT_CHUNK_SIZE		       SZ_1G
+#define EAGER_PAGE_SPLIT_CACHE_CAPACITY					\
+	(DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) +		\
+	 DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M))
+	struct kvm_mmu_memory_cache split_page_cache;
+
 	struct kvm_arch *arch;
 };
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 94865c5ce181..f2753d9deb19 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector;
 
 static unsigned long io_map_base;
 
-static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
+bool __read_mostly eager_page_split = true;
+module_param(eager_page_split, bool, 0644);
+
+static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
+					   phys_addr_t size)
 {
-	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
 	phys_addr_t boundary = ALIGN_DOWN(addr + size, size);
 
 	return (boundary - 1 < end - 1) ? boundary : end;
 }
 
+static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
+{
+	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
+
+	return __stage2_range_addr_end(addr, end, size);
+}
+
 /*
  * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
  * we may see kernel panics with CONFIG_DETECT_HUNG_TASK,
@@ -71,6 +81,64 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
 	return ret;
 }
 
+static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
+{
+	return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
+}
+
+static bool need_topup_split_page_cache_or_resched(struct kvm *kvm)
+{
+	struct kvm_mmu_memory_cache *cache;
+
+	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
+		return true;
+
+	cache = &kvm->arch.mmu.split_page_cache;
+	return need_topup(cache, EAGER_PAGE_SPLIT_CACHE_CAPACITY);
+}
+
+static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
+			      phys_addr_t end)
+{
+	struct kvm_mmu_memory_cache *cache;
+	struct kvm_pgtable *pgt;
+	int ret;
+	u64 next;
+	int cache_capacity = EAGER_PAGE_SPLIT_CACHE_CAPACITY;
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	cache = &kvm->arch.mmu.split_page_cache;
+
+	do {
+		if (need_topup_split_page_cache_or_resched(kvm)) {
+			write_unlock(&kvm->mmu_lock);
+			cond_resched();
+			/* Eager page splitting is best-effort. */
+			ret = __kvm_mmu_topup_memory_cache(cache,
+							   cache_capacity,
+							   cache_capacity);
+			write_lock(&kvm->mmu_lock);
+			if (ret)
+				break;
+		}
+
+		pgt = kvm->arch.mmu.pgt;
+		if (!pgt)
+			return -EINVAL;
+
+		next = __stage2_range_addr_end(addr, end,
+					       EAGER_PAGE_SPLIT_CHUNK_SIZE);
+		ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache);
+		if (ret)
+			break;
+	} while (addr = next, addr != end);
+
+	return ret;
+}
+
 #define stage2_apply_range_resched(kvm, addr, end, fn)			\
 	stage2_apply_range(kvm, addr, end, fn, true)
 
@@ -755,6 +823,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
 	for_each_possible_cpu(cpu)
 		*per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;
 
+	mmu->split_page_cache.gfp_zero = __GFP_ZERO;
+
 	mmu->pgt = pgt;
 	mmu->pgd_phys = __pa(pgt->pgd);
 	return 0;
@@ -769,6 +839,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
 void kvm_uninit_stage2_mmu(struct kvm *kvm)
 {
 	kvm_free_stage2_pgd(&kvm->arch.mmu);
+	kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
 }
 
 static void stage2_unmap_memslot(struct kvm *kvm,
@@ -996,6 +1067,29 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 	stage2_wp_range(&kvm->arch.mmu, start, end);
 }
 
+/**
+ * kvm_mmu_split_memory_region() - split the stage 2 blocks into PAGE_SIZE
+ *				   pages for memory slot
+ * @kvm:	The KVM pointer
+ * @slot:	The memory slot to split
+ *
+ * Acquires kvm->mmu_lock. Called with kvm->slots_lock mutex acquired,
+ * serializing operations for VM memory regions.
+ */
+static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	struct kvm_memory_slot *memslot = id_to_memslot(slots, slot);
+	phys_addr_t start, end;
+
+	start = memslot->base_gfn << PAGE_SHIFT;
+	end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
+
+	write_lock(&kvm->mmu_lock);
+	kvm_mmu_split_huge_pages(kvm, start, end);
+	write_unlock(&kvm->mmu_lock);
+}
+
 /*
  * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
  * dirty pages.
@@ -1795,7 +1889,19 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 		if (kvm_dirty_log_manual_protect_and_init_set(kvm))
 			return;
 
+		if (READ_ONCE(eager_page_split))
+			kvm_mmu_split_memory_region(kvm, new->id);
+
 		kvm_mmu_wp_memory_region(kvm, new->id);
+	} else {
+		/*
+		 * Free any leftovers from the eager page splitting cache. Do
+		 * this when deleting, moving, disabling dirty logging, or
+		 * creating the memslot (a nop). Doing it for deletes makes
+		 * sure we don't leak memory, and there's no need to keep the
+		 * cache around for any of the other cases.
+		 */
+		kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
 	}
 }
 
-- 
2.38.1.431.g37b22c650d-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 10/12] KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked()
  2022-11-12  8:17 ` Ricardo Koller
@ 2022-11-12  8:17   ` Ricardo Koller
  -1 siblings, 0 replies; 46+ messages in thread
From: Ricardo Koller @ 2022-11-12  8:17 UTC (permalink / raw)
  To: pbonzini, maz, oupton, dmatlack, qperret, catalin.marinas,
	andrew.jones, seanjc, alexandru.elisei, suzuki.poulose,
	eric.auger, gshan, reijiw, rananta, bgardon
  Cc: kvm, kvmarm, kvmarm, ricarkol, Ricardo Koller

Move the functionality of kvm_mmu_write_protect_pt_masked() into its
caller, kvm_arch_mmu_enable_log_dirty_pt_masked().  A subsequent commit
will build on this in order to share some of the code in
kvm_arch_mmu_enable_log_dirty_pt_masked().

No functional change intended.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/arm64/kvm/mmu.c | 42 +++++++++++++++---------------------------
 1 file changed, 15 insertions(+), 27 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index f2753d9deb19..7881df411643 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1045,28 +1045,6 @@ static void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot)
 	kvm_flush_remote_tlbs(kvm);
 }
 
-/**
- * kvm_mmu_write_protect_pt_masked() - write protect dirty pages
- * @kvm:	The KVM pointer
- * @slot:	The memory slot associated with mask
- * @gfn_offset:	The gfn offset in memory slot
- * @mask:	The mask of dirty pages at offset 'gfn_offset' in this memory
- *		slot to be write protected
- *
- * Walks bits set in mask write protects the associated pte's. Caller must
- * acquire kvm_mmu_lock.
- */
-static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
-		struct kvm_memory_slot *slot,
-		gfn_t gfn_offset, unsigned long mask)
-{
-	phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
-	phys_addr_t start = (base_gfn +  __ffs(mask)) << PAGE_SHIFT;
-	phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
-
-	stage2_wp_range(&kvm->arch.mmu, start, end);
-}
-
 /**
  * kvm_mmu_split_memory_region() - split the stage 2 blocks into PAGE_SIZE
  *				   pages for memory slot
@@ -1091,17 +1069,27 @@ static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot)
 }
 
 /*
- * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
- * dirty pages.
+ * kvm_arch_mmu_enable_log_dirty_pt_masked() - enable dirty logging for selected pages.
+ * @kvm:	The KVM pointer
+ * @slot:	The memory slot associated with mask
+ * @gfn_offset:	The gfn offset in memory slot
+ * @mask:	The mask of pages at offset 'gfn_offset' in this memory
+ *		slot to enable dirty logging on
  *
- * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
- * enable dirty logging for them.
+ * Writes protect selected pages to enable dirty logging for them. Caller must
+ * acquire kvm->mmu_lock.
  */
 void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 		struct kvm_memory_slot *slot,
 		gfn_t gfn_offset, unsigned long mask)
 {
-	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
+	phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
+	phys_addr_t start = (base_gfn +  __ffs(mask)) << PAGE_SHIFT;
+	phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	stage2_wp_range(&kvm->arch.mmu, start, end);
 }
 
 static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
-- 
2.38.1.431.g37b22c650d-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 11/12] KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG
  2022-11-12  8:17 ` Ricardo Koller
@ 2022-11-12  8:17   ` Ricardo Koller
  -1 siblings, 0 replies; 46+ messages in thread
From: Ricardo Koller @ 2022-11-12  8:17 UTC (permalink / raw)
  To: pbonzini, maz, oupton, dmatlack, qperret, catalin.marinas,
	andrew.jones, seanjc, alexandru.elisei, suzuki.poulose,
	eric.auger, gshan, reijiw, rananta, bgardon
  Cc: kvm, kvmarm, kvmarm, ricarkol, Ricardo Koller

This is the arm64 counterpart of commit cb00a70bd4b7 ("KVM: x86/mmu:
Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG"),
which has the benefit of spreading the cost of splitting a memslot
across multiple ioctls.

Split huge pages in the range specified via KVM_CLEAR_DIRTY_LOG, and do
not split when enabling dirty logging if KVM_DIRTY_LOG_INITIALLY_SET
is set.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
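As a hypothetical userspace sketch (the names vm_fd, slot_id, slot_npages,
step and bitmap are assumptions, not from this series): with
KVM_DIRTY_LOG_INITIALLY_SET, the splitting cost can be throttled by
clearing the dirty log of a slot in sub-ranges rather than all at once:

	/*
	 * Clear (and therefore split) 'step' pages at a time; 'step' is a
	 * multiple of 64, as required by KVM_CLEAR_DIRTY_LOG.
	 */
	for (uint64_t off = 0; off < slot_npages; off += step) {
		uint64_t n = slot_npages - off < step ? slot_npages - off : step;
		struct kvm_clear_dirty_log clr = {
			.slot		= slot_id,
			.first_page	= off,
			.num_pages	= n,
			.dirty_bitmap	= bitmap + off / 64,	/* uint64_t *bitmap */
		};

		if (ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clr))
			err(1, "KVM_CLEAR_DIRTY_LOG");
	}
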
 arch/arm64/kvm/mmu.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7881df411643..b8211d833cc1 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1076,8 +1076,8 @@ static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot)
  * @mask:	The mask of pages at offset 'gfn_offset' in this memory
  *		slot to enable dirty logging on
  *
- * Writes protect selected pages to enable dirty logging for them. Caller must
- * acquire kvm->mmu_lock.
+ * Splits selected pages to PAGE_SIZE and then writes protect them to enable
+ * dirty logging for them. Caller must acquire kvm->mmu_lock.
  */
 void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 		struct kvm_memory_slot *slot,
@@ -1090,6 +1090,14 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
 	stage2_wp_range(&kvm->arch.mmu, start, end);
+
+	/*
+	 * If initially-all-set mode is not set, then huge-pages were already
+	 * split when enabling dirty logging: no need to do it again.
+	 */
+	if (kvm_dirty_log_manual_protect_and_init_set(kvm) &&
+	    READ_ONCE(eager_page_split))
+		kvm_mmu_split_huge_pages(kvm, start, end);
 }
 
 static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
@@ -1474,10 +1482,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 */
 	if (fault_status == FSC_PERM && vma_pagesize == fault_granule)
 		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
-	else
+	else {
 		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
 					     __pfn_to_phys(pfn), prot,
 					     memcache, KVM_PGTABLE_WALK_SHARED);
+	}
 
 	/* Mark the page dirty only if the fault is handled successfully */
 	if (writable && !ret) {
@@ -1887,7 +1896,9 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 		 * this when deleting, moving, disabling dirty logging, or
 		 * creating the memslot (a nop). Doing it for deletes makes
 		 * sure we don't leak memory, and there's no need to keep the
-		 * cache around for any of the other cases.
+		 * cache around for any of the other cases. Keeping the cache
+		 * is useful for successive KVM_CLEAR_DIRTY_LOG calls, which is
+		 * not handled in this function.
 		 */
 		kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
 	}
-- 
2.38.1.431.g37b22c650d-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 12/12] KVM: arm64: Use local TLBI on permission relaxation
  2022-11-12  8:17 ` Ricardo Koller
@ 2022-11-12  8:17   ` Ricardo Koller
  -1 siblings, 0 replies; 46+ messages in thread
From: Ricardo Koller @ 2022-11-12  8:17 UTC (permalink / raw)
  To: pbonzini, maz, oupton, dmatlack, qperret, catalin.marinas,
	andrew.jones, seanjc, alexandru.elisei, suzuki.poulose,
	eric.auger, gshan, reijiw, rananta, bgardon
  Cc: kvm, kvmarm, kvmarm, ricarkol, Ricardo Koller

From: Marc Zyngier <maz@kernel.org>

Broadcasted TLB invalidations (TLBI) are usually less performant than their
local variant. In particular, we observed some implementations that take
milliseconds to complete parallel broadcasted TLBIs.

It's safe to use local, non-shareable TLBIs when relaxing permissions on a
PTE in the KVM case for a couple of reasons. First, according to the Arm
ARM (DDI 0487H.a D5-4913), permission relaxation does not need
break-before-make.  Second, KVM does not set the VTTBR_EL2.CnP bit, so each
PE has its own TLB entry for the same page. KVM could tolerate that when
doing permission relaxation (i.e., not having changes broadcasted to all
PEs).

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
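For comparison (the existing broadcast helper is not part of this diff),
the functional difference from __kvm_tlb_flush_vmid_ipa() boils down to
the shareability domain of the barriers and TLBIs, roughly:

	broadcast (inner-shareable)		local (non-shareable)
	---------------------------		---------------------
	dsb(ishst);				dsb(nshst);
	__tlbi_level(ipas2e1is, ipa, level);	__tlbi_level(ipas2e1, ipa, level);
	dsb(ish);				dsb(nsh);
	__tlbi(vmalle1is);			__tlbi(vmalle1);
	dsb(ish);				dsb(nsh);
	isb();					isb();
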
 arch/arm64/include/asm/kvm_asm.h   |  4 +++
 arch/arm64/kvm/hyp/nvhe/hyp-main.c | 10 ++++++
 arch/arm64/kvm/hyp/nvhe/tlb.c      | 54 ++++++++++++++++++++++++++++++
 arch/arm64/kvm/hyp/pgtable.c       |  2 +-
 arch/arm64/kvm/hyp/vhe/tlb.c       | 32 ++++++++++++++++++
 5 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 43c3bc0f9544..bb17b2ead4c7 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -68,6 +68,7 @@ enum __kvm_host_smccc_func {
 	__KVM_HOST_SMCCC_FUNC___kvm_vcpu_run,
 	__KVM_HOST_SMCCC_FUNC___kvm_flush_vm_context,
 	__KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_ipa,
+	__KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_ipa_nsh,
 	__KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid,
 	__KVM_HOST_SMCCC_FUNC___kvm_flush_cpu_context,
 	__KVM_HOST_SMCCC_FUNC___kvm_timer_set_cntvoff,
@@ -225,6 +226,9 @@ extern void __kvm_flush_vm_context(void);
 extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu);
 extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa,
 				     int level);
+extern void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
+					 phys_addr_t ipa,
+					 int level);
 extern void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu);
 
 extern void __kvm_timer_set_cntvoff(u64 cntvoff);
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 728e01d4536b..c6bf1e49ca93 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -125,6 +125,15 @@ static void handle___kvm_tlb_flush_vmid_ipa(struct kvm_cpu_context *host_ctxt)
 	__kvm_tlb_flush_vmid_ipa(kern_hyp_va(mmu), ipa, level);
 }
 
+static void handle___kvm_tlb_flush_vmid_ipa_nsh(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
+	DECLARE_REG(phys_addr_t, ipa, host_ctxt, 2);
+	DECLARE_REG(int, level, host_ctxt, 3);
+
+	__kvm_tlb_flush_vmid_ipa_nsh(kern_hyp_va(mmu), ipa, level);
+}
+
 static void handle___kvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt)
 {
 	DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
@@ -315,6 +324,7 @@ static const hcall_t host_hcall[] = {
 	HANDLE_FUNC(__kvm_vcpu_run),
 	HANDLE_FUNC(__kvm_flush_vm_context),
 	HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa),
+	HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa_nsh),
 	HANDLE_FUNC(__kvm_tlb_flush_vmid),
 	HANDLE_FUNC(__kvm_flush_cpu_context),
 	HANDLE_FUNC(__kvm_timer_set_cntvoff),
diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
index d296d617f589..ef2b70587f93 100644
--- a/arch/arm64/kvm/hyp/nvhe/tlb.c
+++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
@@ -109,6 +109,60 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
 	__tlb_switch_to_host(&cxt);
 }
 
+void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
+				  phys_addr_t ipa, int level)
+{
+	struct tlb_inv_context cxt;
+
+	dsb(nshst);
+
+	/* Switch to requested VMID */
+	__tlb_switch_to_guest(mmu, &cxt);
+
+	/*
+	 * We could do so much better if we had the VA as well.
+	 * Instead, we invalidate Stage-2 for this IPA, and the
+	 * whole of Stage-1. Weep...
+	 */
+	ipa >>= 12;
+	__tlbi_level(ipas2e1, ipa, level);
+
+	/*
+	 * We have to ensure completion of the invalidation at Stage-2,
+	 * since a table walk on another CPU could refill a TLB with a
+	 * complete (S1 + S2) walk based on the old Stage-2 mapping if
+	 * the Stage-1 invalidation happened first.
+	 */
+	dsb(nsh);
+	__tlbi(vmalle1);
+	dsb(nsh);
+	isb();
+
+	/*
+	 * If the host is running at EL1 and we have a VPIPT I-cache,
+	 * then we must perform I-cache maintenance at EL2 in order for
+	 * it to have an effect on the guest. Since the guest cannot hit
+	 * I-cache lines allocated with a different VMID, we don't need
+	 * to worry about junk out of guest reset (we nuke the I-cache on
+	 * VMID rollover), but we do need to be careful when remapping
+	 * executable pages for the same guest. This can happen when KSM
+	 * takes a CoW fault on an executable page, copies the page into
+	 * a page that was previously mapped in the guest and then needs
+	 * to invalidate the guest view of the I-cache for that page
+	 * from EL1. To solve this, we invalidate the entire I-cache when
+	 * unmapping a page from a guest if we have a VPIPT I-cache but
+	 * the host is running at EL1. As above, we could do better if
+	 * we had the VA.
+	 *
+	 * The moral of this story is: if you have a VPIPT I-cache, then
+	 * you should be running with VHE enabled.
+	 */
+	if (icache_is_vpipt())
+		icache_inval_all_pou();
+
+	__tlb_switch_to_host(&cxt);
+}
+
 void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
 {
 	struct tlb_inv_context cxt;
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 36b81df5687e..4f8b610316ed 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1140,7 +1140,7 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
 	ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, NULL, &level,
 				       KVM_PGTABLE_WALK_SHARED);
 	if (!ret)
-		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, pgt->mmu, addr, level);
+		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa_nsh, pgt->mmu, addr, level);
 	return ret;
 }
 
diff --git a/arch/arm64/kvm/hyp/vhe/tlb.c b/arch/arm64/kvm/hyp/vhe/tlb.c
index 24cef9b87f9e..e69da550cdc5 100644
--- a/arch/arm64/kvm/hyp/vhe/tlb.c
+++ b/arch/arm64/kvm/hyp/vhe/tlb.c
@@ -111,6 +111,38 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
 	__tlb_switch_to_host(&cxt);
 }
 
+void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
+				  phys_addr_t ipa, int level)
+{
+	struct tlb_inv_context cxt;
+
+	dsb(nshst);
+
+	/* Switch to requested VMID */
+	__tlb_switch_to_guest(mmu, &cxt);
+
+	/*
+	 * We could do so much better if we had the VA as well.
+	 * Instead, we invalidate Stage-2 for this IPA, and the
+	 * whole of Stage-1. Weep...
+	 */
+	ipa >>= 12;
+	__tlbi_level(ipas2e1, ipa, level);
+
+	/*
+	 * We have to ensure completion of the invalidation at Stage-2,
+	 * since a table walk on another CPU could refill a TLB with a
+	 * complete (S1 + S2) walk based on the old Stage-2 mapping if
+	 * the Stage-1 invalidation happened first.
+	 */
+	dsb(nsh);
+	__tlbi(vmalle1);
+	dsb(nsh);
+	isb();
+
+	__tlb_switch_to_host(&cxt);
+}
+
 void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
 {
 	struct tlb_inv_context cxt;
-- 
2.38.1.431.g37b22c650d-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 12/12] KVM: arm64: Use local TLBI on permission relaxation
@ 2022-11-12  8:17   ` Ricardo Koller
  0 siblings, 0 replies; 46+ messages in thread
From: Ricardo Koller @ 2022-11-12  8:17 UTC (permalink / raw)
  To: pbonzini, maz, oupton, dmatlack, qperret, catalin.marinas,
	andrew.jones, seanjc, alexandru.elisei, suzuki.poulose,
	eric.auger, gshan, reijiw, rananta, bgardon
  Cc: kvmarm, ricarkol, kvmarm, kvm

From: Marc Zyngier <maz@kernel.org>

Broadcasted TLB invalidations (TLBI) are usually less performant than their
local variant. In particular, we observed some implementations that take
milliseconds to complete parallel broadcasted TLBIs.

It's safe to use local, non-shareable TLBIs when relaxing permissions on a
PTE in the KVM case for a couple of reasons. First, according to the Arm
ARM (DDI 0487H.a D5-4913), permission relaxation does not need
break-before-make.  Second, KVM does not set the VTTBR_EL2.CnP bit, so each
PE has its own TLB entry for the same page. KVM could tolerate that when
doing permission relaxation (i.e., not having changes broadcasted to all
PEs).

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/arm64/include/asm/kvm_asm.h   |  4 +++
 arch/arm64/kvm/hyp/nvhe/hyp-main.c | 10 ++++++
 arch/arm64/kvm/hyp/nvhe/tlb.c      | 54 ++++++++++++++++++++++++++++++
 arch/arm64/kvm/hyp/pgtable.c       |  2 +-
 arch/arm64/kvm/hyp/vhe/tlb.c       | 32 ++++++++++++++++++
 5 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 43c3bc0f9544..bb17b2ead4c7 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -68,6 +68,7 @@ enum __kvm_host_smccc_func {
 	__KVM_HOST_SMCCC_FUNC___kvm_vcpu_run,
 	__KVM_HOST_SMCCC_FUNC___kvm_flush_vm_context,
 	__KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_ipa,
+	__KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_ipa_nsh,
 	__KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid,
 	__KVM_HOST_SMCCC_FUNC___kvm_flush_cpu_context,
 	__KVM_HOST_SMCCC_FUNC___kvm_timer_set_cntvoff,
@@ -225,6 +226,9 @@ extern void __kvm_flush_vm_context(void);
 extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu);
 extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa,
 				     int level);
+extern void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
+					 phys_addr_t ipa,
+					 int level);
 extern void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu);
 
 extern void __kvm_timer_set_cntvoff(u64 cntvoff);
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 728e01d4536b..c6bf1e49ca93 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -125,6 +125,15 @@ static void handle___kvm_tlb_flush_vmid_ipa(struct kvm_cpu_context *host_ctxt)
 	__kvm_tlb_flush_vmid_ipa(kern_hyp_va(mmu), ipa, level);
 }
 
+static void handle___kvm_tlb_flush_vmid_ipa_nsh(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
+	DECLARE_REG(phys_addr_t, ipa, host_ctxt, 2);
+	DECLARE_REG(int, level, host_ctxt, 3);
+
+	__kvm_tlb_flush_vmid_ipa_nsh(kern_hyp_va(mmu), ipa, level);
+}
+
 static void handle___kvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt)
 {
 	DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
@@ -315,6 +324,7 @@ static const hcall_t host_hcall[] = {
 	HANDLE_FUNC(__kvm_vcpu_run),
 	HANDLE_FUNC(__kvm_flush_vm_context),
 	HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa),
+	HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa_nsh),
 	HANDLE_FUNC(__kvm_tlb_flush_vmid),
 	HANDLE_FUNC(__kvm_flush_cpu_context),
 	HANDLE_FUNC(__kvm_timer_set_cntvoff),
diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
index d296d617f589..ef2b70587f93 100644
--- a/arch/arm64/kvm/hyp/nvhe/tlb.c
+++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
@@ -109,6 +109,60 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
 	__tlb_switch_to_host(&cxt);
 }
 
+void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
+				  phys_addr_t ipa, int level)
+{
+	struct tlb_inv_context cxt;
+
+	dsb(nshst);
+
+	/* Switch to requested VMID */
+	__tlb_switch_to_guest(mmu, &cxt);
+
+	/*
+	 * We could do so much better if we had the VA as well.
+	 * Instead, we invalidate Stage-2 for this IPA, and the
+	 * whole of Stage-1. Weep...
+	 */
+	ipa >>= 12;
+	__tlbi_level(ipas2e1, ipa, level);
+
+	/*
+	 * We have to ensure completion of the invalidation at Stage-2,
+	 * since a table walk on another CPU could refill a TLB with a
+	 * complete (S1 + S2) walk based on the old Stage-2 mapping if
+	 * the Stage-1 invalidation happened first.
+	 */
+	dsb(nsh);
+	__tlbi(vmalle1);
+	dsb(nsh);
+	isb();
+
+	/*
+	 * If the host is running at EL1 and we have a VPIPT I-cache,
+	 * then we must perform I-cache maintenance at EL2 in order for
+	 * it to have an effect on the guest. Since the guest cannot hit
+	 * I-cache lines allocated with a different VMID, we don't need
+	 * to worry about junk out of guest reset (we nuke the I-cache on
+	 * VMID rollover), but we do need to be careful when remapping
+	 * executable pages for the same guest. This can happen when KSM
+	 * takes a CoW fault on an executable page, copies the page into
+	 * a page that was previously mapped in the guest and then needs
+	 * to invalidate the guest view of the I-cache for that page
+	 * from EL1. To solve this, we invalidate the entire I-cache when
+	 * unmapping a page from a guest if we have a VPIPT I-cache but
+	 * the host is running at EL1. As above, we could do better if
+	 * we had the VA.
+	 *
+	 * The moral of this story is: if you have a VPIPT I-cache, then
+	 * you should be running with VHE enabled.
+	 */
+	if (icache_is_vpipt())
+		icache_inval_all_pou();
+
+	__tlb_switch_to_host(&cxt);
+}
+
 void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
 {
 	struct tlb_inv_context cxt;
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 36b81df5687e..4f8b610316ed 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1140,7 +1140,7 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
 	ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, NULL, &level,
 				       KVM_PGTABLE_WALK_SHARED);
 	if (!ret)
-		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, pgt->mmu, addr, level);
+		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa_nsh, pgt->mmu, addr, level);
 	return ret;
 }
 
diff --git a/arch/arm64/kvm/hyp/vhe/tlb.c b/arch/arm64/kvm/hyp/vhe/tlb.c
index 24cef9b87f9e..e69da550cdc5 100644
--- a/arch/arm64/kvm/hyp/vhe/tlb.c
+++ b/arch/arm64/kvm/hyp/vhe/tlb.c
@@ -111,6 +111,38 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
 	__tlb_switch_to_host(&cxt);
 }
 
+void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
+				  phys_addr_t ipa, int level)
+{
+	struct tlb_inv_context cxt;
+
+	dsb(nshst);
+
+	/* Switch to requested VMID */
+	__tlb_switch_to_guest(mmu, &cxt);
+
+	/*
+	 * We could do so much better if we had the VA as well.
+	 * Instead, we invalidate Stage-2 for this IPA, and the
+	 * whole of Stage-1. Weep...
+	 */
+	ipa >>= 12;
+	__tlbi_level(ipas2e1, ipa, level);
+
+	/*
+	 * We have to ensure completion of the invalidation at Stage-2,
+	 * since a table walk on another CPU could refill a TLB with a
+	 * complete (S1 + S2) walk based on the old Stage-2 mapping if
+	 * the Stage-1 invalidation happened first.
+	 */
+	dsb(nsh);
+	__tlbi(vmalle1);
+	dsb(nsh);
+	isb();
+
+	__tlb_switch_to_host(&cxt);
+}
+
 void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
 {
 	struct tlb_inv_context cxt;
-- 
2.38.1.431.g37b22c650d-goog

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 00/12] KVM: arm64: Eager huge-page splitting for dirty-logging
  2022-11-12  8:17 ` Ricardo Koller
@ 2022-11-14 18:42   ` Oliver Upton
  -1 siblings, 0 replies; 46+ messages in thread
From: Oliver Upton @ 2022-11-14 18:42 UTC (permalink / raw)
  To: Ricardo Koller
  Cc: pbonzini, maz, dmatlack, qperret, catalin.marinas, andrew.jones,
	seanjc, alexandru.elisei, suzuki.poulose, eric.auger, gshan,
	reijiw, rananta, bgardon, kvmarm, ricarkol, kvmarm, kvm

Hi Ricardo,

On Sat, Nov 12, 2022 at 08:17:02AM +0000, Ricardo Koller wrote:
> Hi,
> 
> I'm sending this RFC mainly to get some early feedback on the approach used
> for implementing "Eager Page Splitting" on ARM.  "Eager Page Splitting"
> improves the performance of dirty-logging (used in live migrations) when
> guest memory is backed by huge-pages.  It's an optimization used in Google
> Cloud since 2016 on x86, and for the last couple of months on ARM.
> 
> I tried multiple ways of implementing this optimization on ARM: from
> completely reusing the stage2 mapper, to implementing a new walker from
> scratch, and some versions in between. This RFC is one of those in
> between. They all have similar performance benefits, based on some light
> performance testing (mainly dirty_log_perf_test).
> 
> Background and motivation
> =========================
> Dirty logging is typically used for live-migration iterative copying.  KVM
> implements dirty-logging at the PAGE_SIZE granularity (will refer to 4K
> pages from now on).  It does it by faulting on write-protected 4K pages.
> Therefore, enabling dirty-logging on a huge-page requires breaking it into
> 4K pages in the first place.  KVM does this breaking on fault, and because
> it's in the critical path it only maps the 4K page that faulted; every
> other 4K page is left unmapped.  This is not great for performance on ARM
> for a couple of reasons:
> 
> - Splitting on fault can halt vcpus for milliseconds in some
>   implementations. Splitting a block PTE requires using a broadcasted TLB
>   invalidation (TLBI) for every huge-page (due to the break-before-make
>   requirement). Note that x86 doesn't need this. We observed some
>   implementations that take millliseconds to complete broadcasted TLBIs
>   when done in parallel from multiple vcpus.  And that's exactly what
>   happens when doing it on fault: multiple vcpus fault at the same time
>   triggering TLBIs in parallel.
> 
> - Read intensive guest workloads end up paying for dirty-logging.  Only
>   mapping the faulting 4K page means that all the other pages that were
>   part of the huge-page will now be unmapped. The effect is that any
>   access, including reads, now has to fault.
> 
> Eager Page Splitting (on ARM)
> =============================
> Eager Page Splitting fixes the above two issues by eagerly splitting
> huge-pages when enabling dirty logging. The goal is to avoid doing it while
> faulting on write-protected pages. This is what the TDP MMU does for x86
> [0], except that x86 does it for different reasons: to avoid grabbing the
> MMU lock on fault. Note that taking care of write-protection faults still
> requires grabbing the MMU lock on ARM, but not on x86 (with the
> fast_page_fault path).
> 
> An additional benefit of eagerly splitting huge-pages is that it can be
> done in a controlled way (e.g., via an IOCTL). This series provides two
> knobs for doing it, just like its x86 counterpart: when enabling dirty
> logging, and when using the KVM_CLEAR_DIRTY_LOG ioctl. The benefit of doing
> it on KVM_CLEAR_DIRTY_LOG is that this ioctl takes ranges, and not complete
> memslots like when enabling dirty logging. This means that the cost of
> splitting (mainly broadcasted TLBIs) can be throttled: split a range, wait
> for a bit, split another range, etc. The benefits of this approach were
> presented by Oliver Upton at KVM Forum 2022 [1].
> 
> Implementation
> ==============
> Patches 1-4 add a pgtable utility function for splitting huge block PTEs:
> kvm_pgtable_stage2_split(). Patches 5-6 add support for not doing
> break-before-make on huge-page breaking when FEAT_BBM level 2 is supported.

I would suggest you split up FEAT_BBM=2 and eager page splitting into
two separate series, if possible. IMO, the eager page split is easier to
reason about if it follows the existing pattern of break-before-make.

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 02/12] KVM: arm64: Allow visiting block PTEs in post-order
  2022-11-12  8:17   ` Ricardo Koller
@ 2022-11-14 18:48     ` Oliver Upton
  -1 siblings, 0 replies; 46+ messages in thread
From: Oliver Upton @ 2022-11-14 18:48 UTC (permalink / raw)
  To: Ricardo Koller
  Cc: pbonzini, maz, dmatlack, qperret, catalin.marinas, andrew.jones,
	seanjc, alexandru.elisei, suzuki.poulose, eric.auger, gshan,
	reijiw, rananta, bgardon, kvm, kvmarm, kvmarm, ricarkol

On Sat, Nov 12, 2022 at 08:17:04AM +0000, Ricardo Koller wrote:
> The page table walker does not visit block PTEs in post-order. But there
> are some cases where doing so would be beneficial, for example: breaking a
> 1G block PTE into a full tree in post-order avoids visiting the new tree.
> 
> Allow post order visits of block PTEs. This will be used in a subsequent
> commit for eagerly breaking huge pages.
> 
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h |  4 ++--
>  arch/arm64/kvm/hyp/nvhe/setup.c      |  2 +-
>  arch/arm64/kvm/hyp/pgtable.c         | 25 ++++++++++++-------------
>  3 files changed, 15 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index e2edeed462e8..d2e4a5032146 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -255,7 +255,7 @@ struct kvm_pgtable {
>   *					entries.
>   * @KVM_PGTABLE_WALK_TABLE_PRE:		Visit table entries before their
>   *					children.
> - * @KVM_PGTABLE_WALK_TABLE_POST:	Visit table entries after their
> + * @KVM_PGTABLE_WALK_POST:		Visit leaf or table entries after their
>   *					children.

It is not immediately obvious from this change alone that promoting the
post-order traversal of every walker to cover leaf + table PTEs is safe.

Have you considered using a flag for just leaf post-order visits?
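
Something along these lines is what I had in mind (untested sketch, and
KVM_PGTABLE_WALK_LEAF_POST is a made-up name): a dedicated flag, so that
existing TABLE_POST walkers keep their current semantics and the split
walker opts into leaf visits explicitly:

	/* hypothetical addition to enum kvm_pgtable_walk_flags */
	KVM_PGTABLE_WALK_LEAF_POST	= BIT(4),

	/* ...which the split walker would then request with: */
	struct kvm_pgtable_walker walker = {
		.cb	= stage2_split_walker,
		.arg	= &split_data,
		.flags	= KVM_PGTABLE_WALK_LEAF_POST,
	};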

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 06/12] KVM: arm64: Split block PTEs without using break-before-make
  2022-11-12  8:17   ` Ricardo Koller
@ 2022-11-14 18:56     ` Oliver Upton
  -1 siblings, 0 replies; 46+ messages in thread
From: Oliver Upton @ 2022-11-14 18:56 UTC (permalink / raw)
  To: Ricardo Koller
  Cc: pbonzini, maz, dmatlack, qperret, catalin.marinas, andrew.jones,
	seanjc, alexandru.elisei, suzuki.poulose, eric.auger, gshan,
	reijiw, rananta, bgardon, kvm, kvmarm, kvmarm, ricarkol

On Sat, Nov 12, 2022 at 08:17:08AM +0000, Ricardo Koller wrote:
> Breaking a huge-page block PTE into an equivalent table of smaller PTEs
> does not require using break-before-make (BBM) when FEAT_BBM level 2 is
> implemented. Add the respective check for eager page splitting and avoid
> using BBM.
> 
> Also take care of possible Conflict aborts.  According to the rules
> specified in the Arm ARM (DDI 0487H.a) section "Support levels for changing
> block size" D5.10.1, this can result in a Conflict abort. So, handle it by
> clearing all VM TLB entries.
> 
> Signed-off-by: Ricardo Koller <ricarkol@google.com>

I'd suggest adding the TLB conflict abort handler as a separate commit
prior to actually relaxing break-before-make requirements.

> ---
>  arch/arm64/include/asm/esr.h     |  1 +
>  arch/arm64/include/asm/kvm_arm.h |  1 +
>  arch/arm64/kvm/hyp/pgtable.c     | 10 +++++++++-
>  arch/arm64/kvm/mmu.c             |  6 ++++++
>  4 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
> index 15b34fbfca66..6f5b976396e7 100644
> --- a/arch/arm64/include/asm/esr.h
> +++ b/arch/arm64/include/asm/esr.h
> @@ -114,6 +114,7 @@
>  #define ESR_ELx_FSC_ACCESS	(0x08)
>  #define ESR_ELx_FSC_FAULT	(0x04)
>  #define ESR_ELx_FSC_PERM	(0x0C)
> +#define ESR_ELx_FSC_CONFLICT	(0x30)
>  
>  /* ISS field definitions for Data Aborts */
>  #define ESR_ELx_ISV_SHIFT	(24)
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index 0df3fc3a0173..58e7cbe3c250 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -333,6 +333,7 @@
>  #define FSC_SECC_TTW1	(0x1d)
>  #define FSC_SECC_TTW2	(0x1e)
>  #define FSC_SECC_TTW3	(0x1f)
> +#define FSC_CONFLICT	ESR_ELx_FSC_CONFLICT
>  
>  /* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
>  #define HPFAR_MASK	(~UL(0xf))
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 9c42eff6d42e..36b81df5687e 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1267,6 +1267,11 @@ static int stage2_create_removed(kvm_pte_t *ptep, u64 phys, u32 level,
>  	return __kvm_pgtable_visit(&data, mm_ops, ptep, level);
>  }
>  
> +static bool stage2_has_bbm_level2(void)
> +{
> +	return cpus_have_const_cap(ARM64_HAS_STAGE2_BBM2);
> +}
> +
>  struct stage2_split_data {
>  	struct kvm_s2_mmu		*mmu;
>  	void				*memcache;
> @@ -1308,7 +1313,10 @@ static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
>  	 */
>  	WARN_ON(stage2_create_removed(&new, phys, level, attr, mc, mm_ops));
>  
> -	stage2_put_pte(ctx, data->mmu, mm_ops);
> +	if (stage2_has_bbm_level2())
> +		mm_ops->put_page(ctx->ptep);
> +	else
> +		stage2_put_pte(ctx, data->mmu, mm_ops);
>  
>  	/*
>  	 * Note, the contents of the page table are guaranteed to be made
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 8f26c65693a9..318f7b0aa20b 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1481,6 +1481,12 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>  		return 1;
>  	}
>  
> +	/* Conflict abort? */
> +	if (fault_status == FSC_CONFLICT) {
> +		kvm_flush_remote_tlbs(vcpu->kvm);

You don't need to perform a broadcasted invalidation in this case. A
local invalidation using the guest's VMID should suffice.
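
Something like the below, modeled on the existing handlers in
hyp/nvhe/tlb.c (untested sketch; __kvm_tlb_flush_vmid_local is a made-up
name), reachable from kvm_handle_guest_abort() via kvm_call_hyp():

	void __kvm_tlb_flush_vmid_local(struct kvm_s2_mmu *mmu)
	{
		struct tlb_inv_context cxt;

		dsb(nshst);

		/* Switch to the faulting guest's VMID */
		__tlb_switch_to_guest(mmu, &cxt);

		/* Local, non-shareable S1+S2 invalidation for this VMID */
		__tlbi(vmalls12e1);
		dsb(nsh);
		isb();

		__tlb_switch_to_host(&cxt);
	}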

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()
  2022-11-12  8:17   ` Ricardo Koller
@ 2022-11-14 20:54     ` Oliver Upton
  -1 siblings, 0 replies; 46+ messages in thread
From: Oliver Upton @ 2022-11-14 20:54 UTC (permalink / raw)
  To: Ricardo Koller
  Cc: pbonzini, maz, dmatlack, qperret, catalin.marinas, andrew.jones,
	seanjc, alexandru.elisei, suzuki.poulose, eric.auger, gshan,
	reijiw, rananta, bgardon, kvm, kvmarm, kvmarm, ricarkol

Hi Ricardo,

On Sat, Nov 12, 2022 at 08:17:06AM +0000, Ricardo Koller wrote:

[...]

> +/**
> + * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
> + *				to PAGE_SIZE guest pages.
> + * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
> + * @addr:	Intermediate physical address from which to split.
> + * @size:	Size of the range.
> + * @mc:		Cache of pre-allocated and zeroed memory from which to allocate
> + *		page-table pages.
> + *
> + * @addr and the end (@addr + @size) are effectively aligned down and up to
> + * the top level huge-page block size. This is an exampe using 1GB
> + * huge-pages and 4KB granules.
> + *
> + *                          [---input range---]
> + *                          :                 :
> + * [--1G block pte--][--1G block pte--][--1G block pte--][--1G block pte--]
> + *                          :                 :
> + *                   [--2MB--][--2MB--][--2MB--][--2MB--]
> + *                          :                 :
> + *                   [ ][ ][:][ ][ ][ ][ ][ ][:][ ][ ][ ]
> + *                          :                 :
> + *
> + * Return: 0 on success, negative error code on failure. Note that
> + * kvm_pgtable_stage2_split() is best effort: it tries to break as many
> + * blocks in the input range as allowed by the size of the memcache. It
> + * will fail it wasn't able to break any block.
> + */
> +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size, void *mc);
> +
>  /**
>   * kvm_pgtable_walk() - Walk a page-table.
>   * @pgt:	Page-table structure initialised by kvm_pgtable_*_init().
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index d1f309128118..9c42eff6d42e 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1267,6 +1267,80 @@ static int stage2_create_removed(kvm_pte_t *ptep, u64 phys, u32 level,
>  	return __kvm_pgtable_visit(&data, mm_ops, ptep, level);
>  }
>  
> +struct stage2_split_data {
> +	struct kvm_s2_mmu		*mmu;
> +	void				*memcache;
> +	struct kvm_pgtable_mm_ops	*mm_ops;

You can also get at mm_ops through kvm_pgtable_visit_ctx
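
i.e. drop the field from stage2_split_data and, inside the walker, just
do:

	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;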

> +};
> +
> +static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
> +			       enum kvm_pgtable_walk_flags visit)
> +{
> +	struct stage2_split_data *data = ctx->arg;
> +	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
> +	kvm_pte_t pte = ctx->old, attr, new;
> +	enum kvm_pgtable_prot prot;
> +	void *mc = data->memcache;
> +	u32 level = ctx->level;
> +	u64 phys;
> +
> +	if (WARN_ON_ONCE(kvm_pgtable_walk_shared(ctx)))
> +		return -EINVAL;
> +
> +	/* Nothing to split at the last level */
> +	if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> +		return 0;
> +
> +	/* We only split valid block mappings */
> +	if (!kvm_pte_valid(pte) || kvm_pte_table(pte, ctx->level))
> +		return 0;
> +
> +	phys = kvm_pte_to_phys(pte);
> +	prot = kvm_pgtable_stage2_pte_prot(pte);
> +	stage2_set_prot_attr(data->mmu->pgt, prot, &attr);
> +
> +	/*
> +	 * Eager page splitting is best-effort, so we can ignore the error.
> +	 * The returned PTE (new) will be valid even if this call returns
> +	 * error: new will be a single (big) block PTE.  The only issue is
> +	 * that it will affect dirty logging performance, as the huge-pages
> +	 * will have to be split on fault, and so we WARN.
> +	 */
> +	WARN_ON(stage2_create_removed(&new, phys, level, attr, mc, mm_ops));

I don't believe we should warn in this case, at least not
unconditionally. ENOMEM is an expected outcome, for example.

Additionally, I believe you'll want to bail out at this point to avoid
installing a potentially garbage PTE as well.
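
i.e. roughly (sketch):

	ret = stage2_create_removed(&new, phys, level, attr, mc, mm_ops);
	if (ret)
		return ret;	/* e.g. -ENOMEM; nothing was installed yet */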

> +	stage2_put_pte(ctx, data->mmu, mm_ops);

Ah, I see why you've relaxed the WARN in patch 1 now.

I would recommend you follow the break-before-make pattern and use the
helpers here as well. stage2_try_break_pte() will demote the store to
WRITE_ONCE() if called from a non-shared context.

Then the WARN will behave as expected in stage2_make_pte().
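
Roughly (untested sketch, error handling elided):

	/* Demotes to WRITE_ONCE() in a non-shared walk */
	if (!stage2_try_break_pte(ctx, data->mmu))
		return -EAGAIN;

	stage2_make_pte(ctx, new);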

> +	/*
> +	 * Note, the contents of the page table are guaranteed to be made
> +	 * visible before the new PTE is assigned because
> +	 * stage2_make__pte() writes the PTE using smp_store_release().

typo: stage2_make_pte()

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 01/12] KVM: arm64: Relax WARN check in stage2_make_pte()
  2022-11-12  8:17   ` Ricardo Koller
@ 2022-11-14 20:59     ` Oliver Upton
  -1 siblings, 0 replies; 46+ messages in thread
From: Oliver Upton @ 2022-11-14 20:59 UTC (permalink / raw)
  To: Ricardo Koller
  Cc: pbonzini, maz, dmatlack, qperret, catalin.marinas, andrew.jones,
	seanjc, alexandru.elisei, suzuki.poulose, eric.auger, gshan,
	reijiw, rananta, bgardon, kvm, kvmarm, kvmarm, ricarkol

Hi Ricardo,

On Sat, Nov 12, 2022 at 08:17:03AM +0000, Ricardo Koller wrote:
> stage2_make_pte() throws a warning when used in a non-shared walk, as PTEs
> are not "locked" when walking non-shared. Add a check so it can be used
> non-shared.
> 
> Signed-off-by: Ricardo Koller <ricarkol@google.com>

I would very much prefer to leave this WARN as-is. Correct me if I am
wrong, but I do not believe this warning is firing with the existing
code.

While the locking portion doesn't make a whole lot of sense for a
non-shared walk, it is also a magic value that indicates we've already
done the break side of break-before-make. If the warning fires then that
would suggest our break-before-make implementation isn't working as
expected.
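
To spell out the invariant (sketch, using the helpers this series
already touches):

	/* break: stores the 'locked' magic value in place of the PTE */
	if (!stage2_try_break_pte(ctx, mmu))
		return -EAGAIN;
	...
	/* make: WARNs if the break above never happened */
	stage2_make_pte(ctx, new);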

--
Thanks,
Oliver

> ---
>  arch/arm64/kvm/hyp/pgtable.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index c12462439e70..b16107bf917c 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -733,7 +733,8 @@ static void stage2_make_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t n
>  {
>  	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
>  
> -	WARN_ON(!stage2_pte_is_locked(*ctx->ptep));
> +	if (kvm_pgtable_walk_shared(ctx))
> +		WARN_ON(!stage2_pte_is_locked(*ctx->ptep));
>  
>  	if (stage2_pte_is_counted(new))
>  		mm_ops->get_page(ctx->ptep);
> -- 
> 2.38.1.431.g37b22c650d-goog
> 
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()
  2022-11-14 20:54     ` Oliver Upton
@ 2022-11-15 23:03       ` Ricardo Koller
  -1 siblings, 0 replies; 46+ messages in thread
From: Ricardo Koller @ 2022-11-15 23:03 UTC (permalink / raw)
  To: Oliver Upton
  Cc: pbonzini, maz, dmatlack, qperret, catalin.marinas, andrew.jones,
	seanjc, alexandru.elisei, suzuki.poulose, eric.auger, gshan,
	reijiw, rananta, bgardon, kvm, kvmarm, kvmarm, ricarkol

On Mon, Nov 14, 2022 at 08:54:52PM +0000, Oliver Upton wrote:
> Hi Ricardo,
> 
> On Sat, Nov 12, 2022 at 08:17:06AM +0000, Ricardo Koller wrote:
> 
> [...]
> 
> > +/**
> > + * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
> > + *				to PAGE_SIZE guest pages.
> > + * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
> > + * @addr:	Intermediate physical address from which to split.
> > + * @size:	Size of the range.
> > + * @mc:		Cache of pre-allocated and zeroed memory from which to allocate
> > + *		page-table pages.
> > + *
> > + * @addr and the end (@addr + @size) are effectively aligned down and up to
> > + * the top level huge-page block size. This is an exampe using 1GB
> > + * huge-pages and 4KB granules.
> > + *
> > + *                          [---input range---]
> > + *                          :                 :
> > + * [--1G block pte--][--1G block pte--][--1G block pte--][--1G block pte--]
> > + *                          :                 :
> > + *                   [--2MB--][--2MB--][--2MB--][--2MB--]
> > + *                          :                 :
> > + *                   [ ][ ][:][ ][ ][ ][ ][ ][:][ ][ ][ ]
> > + *                          :                 :
> > + *
> > + * Return: 0 on success, negative error code on failure. Note that
> > + * kvm_pgtable_stage2_split() is best effort: it tries to break as many
> > + * blocks in the input range as allowed by the size of the memcache. It
> > + * will fail it wasn't able to break any block.
> > + */
> > +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size, void *mc);
> > +
> >  /**
> >   * kvm_pgtable_walk() - Walk a page-table.
> >   * @pgt:	Page-table structure initialised by kvm_pgtable_*_init().
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index d1f309128118..9c42eff6d42e 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1267,6 +1267,80 @@ static int stage2_create_removed(kvm_pte_t *ptep, u64 phys, u32 level,
> >  	return __kvm_pgtable_visit(&data, mm_ops, ptep, level);
> >  }
> >  
> > +struct stage2_split_data {
> > +	struct kvm_s2_mmu		*mmu;
> > +	void				*memcache;
> > +	struct kvm_pgtable_mm_ops	*mm_ops;
> 
> You can also get at mm_ops through kvm_pgtable_visit_ctx
> 
> > +};
> > +
> > +static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > +			       enum kvm_pgtable_walk_flags visit)
> > +{
> > +	struct stage2_split_data *data = ctx->arg;
> > +	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
> > +	kvm_pte_t pte = ctx->old, attr, new;
> > +	enum kvm_pgtable_prot prot;
> > +	void *mc = data->memcache;
> > +	u32 level = ctx->level;
> > +	u64 phys;
> > +
> > +	if (WARN_ON_ONCE(kvm_pgtable_walk_shared(ctx)))
> > +		return -EINVAL;
> > +
> > +	/* Nothing to split at the last level */
> > +	if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> > +		return 0;
> > +
> > +	/* We only split valid block mappings */
> > +	if (!kvm_pte_valid(pte) || kvm_pte_table(pte, ctx->level))
> > +		return 0;
> > +
> > +	phys = kvm_pte_to_phys(pte);
> > +	prot = kvm_pgtable_stage2_pte_prot(pte);
> > +	stage2_set_prot_attr(data->mmu->pgt, prot, &attr);
> > +
> > +	/*
> > +	 * Eager page splitting is best-effort, so we can ignore the error.
> > +	 * The returned PTE (new) will be valid even if this call returns
> > +	 * error: new will be a single (big) block PTE.  The only issue is
> > +	 * that it will affect dirty logging performance, as the huge-pages
> > +	 * will have to be split on fault, and so we WARN.
> > +	 */
> > +	WARN_ON(stage2_create_removed(&new, phys, level, attr, mc, mm_ops));
> 
> I don't believe we should warn in this case, at least not
> unconditionally. ENOMEM is an expected outcome, for example.

Given that "eager page splitting" is best-effort, the error must be
ignored somewhere: either here or by the caller (in mmu.c). It seems
that ignoring the error here is not a very good idea.

> 
> Additionally, I believe you'll want to bail out at this point to avoid
> installing a potentially garbage PTE as well.

It should be fine as stage2_create_removed() is also best-effort. The
returned PTE is valid even when it fails; it just returns a big block
PTE.

> 
> > +	stage2_put_pte(ctx, data->mmu, mm_ops);
> 
> Ah, I see why you've relaxed the WARN in patch 1 now.
> 
> I would recommend you follow the break-before-make pattern and use the
> helpers here as well. stage2_try_break_pte() will demote the store to
> WRITE_ONCE() if called from a non-shared context.
> 

ACK, I can do that. The only reason why I didn't is because I would have
to handle the potential error from stage2_try_break_pte(). It would feel
wrong not to, even if it's !shared. On the other hand, I would like to
easily experiment with both the !shared and the shared approaches.

> Then the WARN will behave as expected in stage2_make_pte().
> 
> > +	/*
> > +	 * Note, the contents of the page table are guaranteed to be made
> > +	 * visible before the new PTE is assigned because
> > +	 * stage2_make__pte() writes the PTE using smp_store_release().
> 
> typo: stage2_make_pte()
> 
> --
> Thanks,
> Oliver

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()
  2022-11-15 23:03       ` Ricardo Koller
@ 2022-11-15 23:27         ` Ricardo Koller
  -1 siblings, 0 replies; 46+ messages in thread
From: Ricardo Koller @ 2022-11-15 23:27 UTC (permalink / raw)
  To: Oliver Upton
  Cc: pbonzini, maz, dmatlack, qperret, catalin.marinas, andrew.jones,
	seanjc, alexandru.elisei, suzuki.poulose, eric.auger, gshan,
	reijiw, rananta, bgardon, kvm, kvmarm, kvmarm, ricarkol

On Tue, Nov 15, 2022 at 03:03:42PM -0800, Ricardo Koller wrote:
> On Mon, Nov 14, 2022 at 08:54:52PM +0000, Oliver Upton wrote:
> > Hi Ricardo,
> > 
> > On Sat, Nov 12, 2022 at 08:17:06AM +0000, Ricardo Koller wrote:
> > 
> > [...]
> > 
> > > +/**
> > > + * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
> > > + *				to PAGE_SIZE guest pages.
> > > + * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
> > > + * @addr:	Intermediate physical address from which to split.
> > > + * @size:	Size of the range.
> > > + * @mc:		Cache of pre-allocated and zeroed memory from which to allocate
> > > + *		page-table pages.
> > > + *
> > > + * @addr and the end (@addr + @size) are effectively aligned down and up to
> > > + * the top level huge-page block size. This is an exampe using 1GB
> > > + * huge-pages and 4KB granules.
> > > + *
> > > + *                          [---input range---]
> > > + *                          :                 :
> > > + * [--1G block pte--][--1G block pte--][--1G block pte--][--1G block pte--]
> > > + *                          :                 :
> > > + *                   [--2MB--][--2MB--][--2MB--][--2MB--]
> > > + *                          :                 :
> > > + *                   [ ][ ][:][ ][ ][ ][ ][ ][:][ ][ ][ ]
> > > + *                          :                 :
> > > + *
> > > + * Return: 0 on success, negative error code on failure. Note that
> > > + * kvm_pgtable_stage2_split() is best effort: it tries to break as many
> > > + * blocks in the input range as allowed by the size of the memcache. It
> > > + * will fail it wasn't able to break any block.
> > > + */
> > > +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size, void *mc);
> > > +
> > >  /**
> > >   * kvm_pgtable_walk() - Walk a page-table.
> > >   * @pgt:	Page-table structure initialised by kvm_pgtable_*_init().
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index d1f309128118..9c42eff6d42e 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -1267,6 +1267,80 @@ static int stage2_create_removed(kvm_pte_t *ptep, u64 phys, u32 level,
> > >  	return __kvm_pgtable_visit(&data, mm_ops, ptep, level);
> > >  }
> > >  
> > > +struct stage2_split_data {
> > > +	struct kvm_s2_mmu		*mmu;
> > > +	void				*memcache;
> > > +	struct kvm_pgtable_mm_ops	*mm_ops;
> > 
> > You can also get at mm_ops through kvm_pgtable_visit_ctx
> > 
> > > +};
> > > +
> > > +static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > +			       enum kvm_pgtable_walk_flags visit)
> > > +{
> > > +	struct stage2_split_data *data = ctx->arg;
> > > +	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
> > > +	kvm_pte_t pte = ctx->old, attr, new;
> > > +	enum kvm_pgtable_prot prot;
> > > +	void *mc = data->memcache;
> > > +	u32 level = ctx->level;
> > > +	u64 phys;
> > > +
> > > +	if (WARN_ON_ONCE(kvm_pgtable_walk_shared(ctx)))
> > > +		return -EINVAL;
> > > +
> > > +	/* Nothing to split at the last level */
> > > +	if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> > > +		return 0;
> > > +
> > > +	/* We only split valid block mappings */
> > > +	if (!kvm_pte_valid(pte) || kvm_pte_table(pte, ctx->level))
> > > +		return 0;
> > > +
> > > +	phys = kvm_pte_to_phys(pte);
> > > +	prot = kvm_pgtable_stage2_pte_prot(pte);
> > > +	stage2_set_prot_attr(data->mmu->pgt, prot, &attr);
> > > +
> > > +	/*
> > > +	 * Eager page splitting is best-effort, so we can ignore the error.
> > > +	 * The returned PTE (new) will be valid even if this call returns
> > > +	 * error: new will be a single (big) block PTE.  The only issue is
> > > +	 * that it will affect dirty logging performance, as the huge-pages
> > > +	 * will have to be split on fault, and so we WARN.
> > > +	 */
> > > +	WARN_ON(stage2_create_removed(&new, phys, level, attr, mc, mm_ops));
> > 
> > I don't believe we should warn in this case, at least not
> > unconditionally. ENOMEM is an expected outcome, for example.
> 
> Given that "eager page splitting" is best-effort, the error must be
> ignored somewhere: either here or by the caller (in mmu.c). It seems
> that ignoring the error here is not a very good idea.

Actually, ignoring the error here simplifies the error handling.
stage2_create_removed() is best-effort; here's an example.  If
stage2_create_removed() was called to split a 1G block PTE, and it
wasn't able to split all 2MB blocks, it would return ENOMEM and a valid
PTE pointing to a tree like this:

		[---------1GB-------------]
		:                         :
		[--2MB--][--2MB--][--2MB--]
		:       :
		[ ][ ][ ]

If we returned ENOMEM instead of ignoring the error, we would have to
clean all the intermediate state.  But stage2_create_removed() is
designed to always return a valid PTE, even if the tree is not fully
split (as above). So, there's really no need to clean it: it's a valid
tree. Moreover, this valid tree would result in better dirty logging
performance as it already has some 2M blocks split into 4K pages.
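
In other words, the walker can keep dropping the return value; roughly
the shape it has (sketch):

	/*
	 * Best effort: even on -ENOMEM, 'new' points at a valid (possibly
	 * partially split) tree, so install it rather than unwinding the
	 * intermediate state.
	 */
	stage2_create_removed(&new, phys, level, attr, mc, mm_ops);
	stage2_put_pte(ctx, data->mmu, mm_ops);
	stage2_make_pte(ctx, new);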

> 
> > 
> > Additionally, I believe you'll want to bail out at this point to avoid
> > installing a potentially garbage PTE as well.
> 
> It should be fine as stage2_create_removed() is also best-effort. The
> returned PTE is valid even when it fails; it just returns a big block
> PTE.
> 
> > 
> > > +	stage2_put_pte(ctx, data->mmu, mm_ops);
> > 
> > Ah, I see why you've relaxed the WARN in patch 1 now.
> > 
> > I would recommend you follow the break-before-make pattern and use the
> > helpers here as well. stage2_try_break_pte() will demote the store to
> > WRITE_ONCE() if called from a non-shared context.
> > 
> 
> ACK, I can do that. The only reason why I didn't is that I would have
> to handle the potential error from stage2_try_break_pte(). It would feel
> wrong not to, even if it's !shared. On the other hand, I would like to
> easily experiment with both the !shared and the shared approaches.
> 
> > Then the WARN will behave as expected in stage2_make_pte().
> > 
> > > +	/*
> > > +	 * Note, the contents of the page table are guaranteed to be made
> > > +	 * visible before the new PTE is assigned because
> > > +	 * stage2_make__pte() writes the PTE using smp_store_release().
> > 
> > typo: stage2_make_pte()
> > 
> > --
> > Thanks,
> > Oliver
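
For reference, a minimal sketch of the break-before-make variant suggested
above. This is illustrative only: it assumes stage2_try_break_pte() takes
the visitor context plus the mmu and returns whether the break succeeded,
and that stage2_make_pte() takes the context plus the new PTE, as in the
parallel-faults helpers; the wrapper and the -EAGAIN choice are made up
for the example.

static int stage2_split_replace_pte(const struct kvm_pgtable_visit_ctx *ctx,
				    struct kvm_s2_mmu *mmu, kvm_pte_t new)
{
	/*
	 * Break-before-make via the shared helpers. In a !shared walk the
	 * break should not fail, but returning an error keeps the helper
	 * honest instead of silently ignoring it.
	 */
	if (!stage2_try_break_pte(ctx, mmu))
		return -EAGAIN;

	/* Publishes the new table: smp_store_release(), or WRITE_ONCE() if !shared. */
	stage2_make_pte(ctx, new);

	return 0;
}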

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()
  2022-11-15 23:27         ` Ricardo Koller
@ 2022-11-15 23:54           ` Oliver Upton
  -1 siblings, 0 replies; 46+ messages in thread
From: Oliver Upton @ 2022-11-15 23:54 UTC (permalink / raw)
  To: Ricardo Koller
  Cc: pbonzini, maz, dmatlack, qperret, catalin.marinas, andrew.jones,
	seanjc, alexandru.elisei, suzuki.poulose, eric.auger, gshan,
	reijiw, rananta, bgardon, kvm, kvmarm, kvmarm, ricarkol

On Tue, Nov 15, 2022 at 03:27:18PM -0800, Ricardo Koller wrote:
> On Tue, Nov 15, 2022 at 03:03:42PM -0800, Ricardo Koller wrote:
> > On Mon, Nov 14, 2022 at 08:54:52PM +0000, Oliver Upton wrote:

[...]

> > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > index d1f309128118..9c42eff6d42e 100644
> > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > @@ -1267,6 +1267,80 @@ static int stage2_create_removed(kvm_pte_t *ptep, u64 phys, u32 level,
> > > >  	return __kvm_pgtable_visit(&data, mm_ops, ptep, level);
> > > >  }
> > > >  
> > > > +struct stage2_split_data {
> > > > +	struct kvm_s2_mmu		*mmu;
> > > > +	void				*memcache;
> > > > +	struct kvm_pgtable_mm_ops	*mm_ops;
> > > 
> > > You can also get at mm_ops through kvm_pgtable_visit_ctx
> > > 
> > > > +};
> > > > +
> > > > +static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > +			       enum kvm_pgtable_walk_flags visit)
> > > > +{
> > > > +	struct stage2_split_data *data = ctx->arg;
> > > > +	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
> > > > +	kvm_pte_t pte = ctx->old, attr, new;
> > > > +	enum kvm_pgtable_prot prot;
> > > > +	void *mc = data->memcache;
> > > > +	u32 level = ctx->level;
> > > > +	u64 phys;
> > > > +
> > > > +	if (WARN_ON_ONCE(kvm_pgtable_walk_shared(ctx)))
> > > > +		return -EINVAL;
> > > > +
> > > > +	/* Nothing to split at the last level */
> > > > +	if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> > > > +		return 0;
> > > > +
> > > > +	/* We only split valid block mappings */
> > > > +	if (!kvm_pte_valid(pte) || kvm_pte_table(pte, ctx->level))
> > > > +		return 0;
> > > > +
> > > > +	phys = kvm_pte_to_phys(pte);
> > > > +	prot = kvm_pgtable_stage2_pte_prot(pte);
> > > > +	stage2_set_prot_attr(data->mmu->pgt, prot, &attr);
> > > > +
> > > > +	/*
> > > > +	 * Eager page splitting is best-effort, so we can ignore the error.
> > > > +	 * The returned PTE (new) will be valid even if this call returns
> > > > +	 * error: new will be a single (big) block PTE.  The only issue is
> > > > +	 * that it will affect dirty logging performance, as the huge-pages
> > > > +	 * will have to be split on fault, and so we WARN.
> > > > +	 */
> > > > +	WARN_ON(stage2_create_removed(&new, phys, level, attr, mc, mm_ops));
> > > 
> > > I don't believe we should warn in this case, at least not
> > > unconditionally. ENOMEM is an expected outcome, for example.
> > 
> > Given that "eager page splitting" is best-effort, the error must be
> > ignored somewhere: either here or by the caller (in mmu.c). It seems
> > that ignoring the error here is not a very good idea.
> 
> Actually, ignoring the error here simplifies the error handling.
> stage2_create_removed() is best-effort; here's an example.  If
> stage2_create_removed() was called to split a 1G block PTE, and it
> wasn't able to split all 2MB blocks, it would return ENOMEM and a valid
> PTE pointing to a tree like this:
> 
> 		[---------1GB-------------]
> 		:                         :
> 		[--2MB--][--2MB--][--2MB--]
> 		:       :
> 		[ ][ ][ ]
> 
> If we returned ENOMEM instead of ignoring the error, we would have to
> clean all the intermediate state.  But stage2_create_removed() is
> designed to always return a valid PTE, even if the tree is not fully
> > split (as above).  So, there's really no need to clean it: it's a valid
> tree. Moreover, this valid tree would result in better dirty logging
> performance as it already has some 2M blocks split into 4K pages.

I have no issue with installing a partially-populated table, but
unconditionally ignoring the return code and marching onwards seems
dangerous. If you document the behavior of -ENOMEM on
stage2_create_removed() and return early for anything else it may read a
bit better.
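
To make that concrete, something along these lines (a sketch only: the
stage2_create_removed() call and its arguments are the ones from your
patch, the wrapper itself is made up for illustration):

static int stage2_split_create_checked(kvm_pte_t *new, u64 phys, u32 level,
				       kvm_pte_t attr, void *mc,
				       struct kvm_pgtable_mm_ops *mm_ops)
{
	int ret = stage2_create_removed(new, phys, level, attr, mc, mm_ops);

	/*
	 * -ENOMEM is an expected outcome: the memcache ran out, but *new
	 * still points to a valid, partially split tree, so the caller can
	 * go ahead and install it without a WARN.
	 */
	if (ret == -ENOMEM)
		return 0;

	/* Anything else is unexpected: return early and install nothing. */
	return ret;
}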

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()
  2022-11-15 23:54           ` Oliver Upton
@ 2022-11-17 21:50             ` Ricardo Koller
  -1 siblings, 0 replies; 46+ messages in thread
From: Ricardo Koller @ 2022-11-17 21:50 UTC (permalink / raw)
  To: Oliver Upton
  Cc: pbonzini, maz, dmatlack, qperret, catalin.marinas, andrew.jones,
	seanjc, alexandru.elisei, suzuki.poulose, eric.auger, gshan,
	reijiw, rananta, bgardon, kvm, kvmarm, kvmarm, ricarkol

On Tue, Nov 15, 2022 at 11:54:27PM +0000, Oliver Upton wrote:
> On Tue, Nov 15, 2022 at 03:27:18PM -0800, Ricardo Koller wrote:
> > On Tue, Nov 15, 2022 at 03:03:42PM -0800, Ricardo Koller wrote:
> > > On Mon, Nov 14, 2022 at 08:54:52PM +0000, Oliver Upton wrote:
> 
> [...]
> 
> > > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > > index d1f309128118..9c42eff6d42e 100644
> > > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > > @@ -1267,6 +1267,80 @@ static int stage2_create_removed(kvm_pte_t *ptep, u64 phys, u32 level,
> > > > >  	return __kvm_pgtable_visit(&data, mm_ops, ptep, level);
> > > > >  }
> > > > >  
> > > > > +struct stage2_split_data {
> > > > > +	struct kvm_s2_mmu		*mmu;
> > > > > +	void				*memcache;
> > > > > +	struct kvm_pgtable_mm_ops	*mm_ops;
> > > > 
> > > > You can also get at mm_ops through kvm_pgtable_visit_ctx
> > > > 
> > > > > +};
> > > > > +
> > > > > +static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > > +			       enum kvm_pgtable_walk_flags visit)
> > > > > +{
> > > > > +	struct stage2_split_data *data = ctx->arg;
> > > > > +	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
> > > > > +	kvm_pte_t pte = ctx->old, attr, new;
> > > > > +	enum kvm_pgtable_prot prot;
> > > > > +	void *mc = data->memcache;
> > > > > +	u32 level = ctx->level;
> > > > > +	u64 phys;
> > > > > +
> > > > > +	if (WARN_ON_ONCE(kvm_pgtable_walk_shared(ctx)))
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	/* Nothing to split at the last level */
> > > > > +	if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> > > > > +		return 0;
> > > > > +
> > > > > +	/* We only split valid block mappings */
> > > > > +	if (!kvm_pte_valid(pte) || kvm_pte_table(pte, ctx->level))
> > > > > +		return 0;
> > > > > +
> > > > > +	phys = kvm_pte_to_phys(pte);
> > > > > +	prot = kvm_pgtable_stage2_pte_prot(pte);
> > > > > +	stage2_set_prot_attr(data->mmu->pgt, prot, &attr);
> > > > > +
> > > > > +	/*
> > > > > +	 * Eager page splitting is best-effort, so we can ignore the error.
> > > > > +	 * The returned PTE (new) will be valid even if this call returns
> > > > > +	 * error: new will be a single (big) block PTE.  The only issue is
> > > > > +	 * that it will affect dirty logging performance, as the huge-pages
> > > > > +	 * will have to be split on fault, and so we WARN.
> > > > > +	 */
> > > > > +	WARN_ON(stage2_create_removed(&new, phys, level, attr, mc, mm_ops));
> > > > 
> > > > I don't believe we should warn in this case, at least not
> > > > unconditionally. ENOMEM is an expected outcome, for example.
> > > 
> > > Given that "eager page splitting" is best-effort, the error must be
> > > ignored somewhere: either here or by the caller (in mmu.c). It seems
> > > that ignoring the error here is not a very good idea.
> > 
> > Actually, ignoring the error here simplifies the error handling.
> > stage2_create_removed() is best-effort; here's an example.  If
> > stage2_create_removed() was called to split a 1G block PTE, and it
> > wasn't able to split all 2MB blocks, it would return ENOMEM and a valid
> > PTE pointing to a tree like this:
> > 
> > 		[---------1GB-------------]
> > 		:                         :
> > 		[--2MB--][--2MB--][--2MB--]
> > 		:       :
> > 		[ ][ ][ ]
> > 
> > If we returned ENOMEM instead of ignoring the error, we would have to
> > clean all the intermediate state.  But stage2_create_removed() is
> > designed to always return a valid PTE, even if the tree is not fully
> > split (as above).  So, there's really no need to clean it: it's a valid
> > tree. Moreover, this valid tree would result in better dirty logging
> > performance as it already has some 2M blocks split into 4K pages.
> 
> I have no issue with installing a partially-populated table, but
> unconditionally ignoring the return code and marching onwards seems
> dangerous. If you document the behavior of -ENOMEM on
> stage2_create_removed() and return early for anything else it may read a
> bit better.

This got me thinking. These partially-populated tables are complicating
things too much for no good reason. They should be very rare, as the
caller will ensure there's enough memory in the memcache. So what do you
think of this other option?

https://github.com/ricarkol/linux/commit/54ba44f7d00d93cbaecebd148c102ca9d7c0e711

used here:

https://github.com/ricarkol/linux/commit/ff63a8744c18d5e4589911831e20ccb41712bda2#

It reuses the stage2 mapper to implement create_removed(), and makes
things simpler by only returning success when it builds a fully populated
tree. The caller doesn't need to clean anything up in case of failure:
partially populated trees are cleaned up by create_removed() itself before
returning failure.
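
With that contract the walker side collapses to a single early return.
Roughly (a sketch, not the linked commits verbatim; stage2_make_pte() is
assumed to take the visitor context and the new PTE, as discussed earlier
in the thread):

static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
			       enum kvm_pgtable_walk_flags visit)
{
	/* Illustrative sketch, not the code in the linked commits. */
	struct stage2_split_data *data = ctx->arg;
	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
	enum kvm_pgtable_prot prot;
	kvm_pte_t attr, new;
	int ret;

	/* Nothing to split at the last level. */
	if (ctx->level == KVM_PGTABLE_MAX_LEVELS - 1)
		return 0;

	/* Only split valid block mappings. */
	if (!kvm_pte_valid(ctx->old) || kvm_pte_table(ctx->old, ctx->level))
		return 0;

	prot = kvm_pgtable_stage2_pte_prot(ctx->old);
	stage2_set_prot_attr(data->mmu->pgt, prot, &attr);

	/*
	 * All-or-nothing: on failure create_removed() has already freed
	 * anything it allocated, the old block PTE is untouched, and the
	 * walk just stops early. No partial trees to track or clean up.
	 */
	ret = stage2_create_removed(&new, kvm_pte_to_phys(ctx->old),
				    ctx->level, attr, data->memcache, mm_ops);
	if (ret)
		return ret;

	/* Same replacement sequence as in the RFC. */
	stage2_put_pte(ctx, data->mmu, mm_ops);
	stage2_make_pte(ctx, new);	/* signature assumed */

	return 0;
}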

> 
> --
> Thanks,
> Oliver

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 00/12] KVM: arm64: Eager huge-page splitting for dirty-logging
  2022-11-14 18:42   ` Oliver Upton
  (?)
@ 2023-01-13  3:42   ` Ricardo Koller
  -1 siblings, 0 replies; 46+ messages in thread
From: Ricardo Koller @ 2023-01-13  3:42 UTC (permalink / raw)
  To: Oliver Upton
  Cc: pbonzini, maz, dmatlack, qperret, catalin.marinas, andrew.jones,
	seanjc, alexandru.elisei, suzuki.poulose, eric.auger, gshan,
	reijiw, rananta, bgardon, kvmarm, ricarkol, kvmarm, kvm

On Mon, Nov 14, 2022 at 06:42:36PM +0000, Oliver Upton wrote:
> Hi Ricardo,
> 
> On Sat, Nov 12, 2022 at 08:17:02AM +0000, Ricardo Koller wrote:
> > Hi,
> > 
> > I'm sending this RFC mainly to get some early feedback on the approach used
> > for implementing "Eager Page Splitting" on ARM.  "Eager Page Splitting"
> > improves the performance of dirty-logging (used in live migrations) when
> > guest memory is backed by huge-pages.  It's an optimization used in Google
> > Cloud since 2016 on x86, and for the last couple of months on ARM.
> > 
> > I tried multiple ways of implementing this optimization on ARM: from
> > completely reusing the stage2 mapper, to implementing a new walker from
> > scratch, and some versions in between. This RFC is one of those in
> > between. They all have similar performance benefits, based on some light
> > performance testing (mainly dirty_log_perf_test).
> > 
> > Background and motivation
> > =========================
> > Dirty logging is typically used for live-migration iterative copying.  KVM
> > implements dirty-logging at the PAGE_SIZE granularity (will refer to 4K
> > pages from now on).  It does it by faulting on write-protected 4K pages.
> > Therefore, enabling dirty-logging on a huge-page requires breaking it into
> > 4K pages in the first place.  KVM does this breaking on fault, and because
> > it's in the critical path it only maps the 4K page that faulted; every
> > other 4K page is left unmapped.  This is not great for performance on ARM
> > for a couple of reasons:
> > 
> > - Splitting on fault can halt vcpus for milliseconds in some
> >   implementations. Splitting a block PTE requires using a broadcasted TLB
> >   invalidation (TLBI) for every huge-page (due to the break-before-make
> >   requirement). Note that x86 doesn't need this. We observed some
> >   implementations that take millliseconds to complete broadcasted TLBIs
> >   when done in parallel from multiple vcpus.  And that's exactly what
> >   happens when doing it on fault: multiple vcpus fault at the same time
> >   triggering TLBIs in parallel.
> > 
> > - Read intensive guest workloads end up paying for dirty-logging.  Only
> >   mapping the faulting 4K page means that all the other pages that were
> >   part of the huge-page will now be unmapped. The effect is that any
> >   access, including reads, now has to fault.
> > 
> > Eager Page Splitting (on ARM)
> > =============================
> > Eager Page Splitting fixes the above two issues by eagerly splitting
> > huge-pages when enabling dirty logging. The goal is to avoid doing it while
> > faulting on write-protected pages. This is what the TDP MMU does for x86
> > [0], except that x86 does it for different reasons: to avoid grabbing the
> > MMU lock on fault. Note that taking care of write-protection faults still
> > requires grabbing the MMU lock on ARM, but not on x86 (with the
> > fast_page_fault path).
> > 
> > An additional benefit of eagerly splitting huge-pages is that it can be
> > done in a controlled way (e.g., via an IOCTL). This series provides two
> > knobs for doing it, just like its x86 counterpart: when enabling dirty
> > logging, and when using the KVM_CLEAR_DIRTY_LOG ioctl. The benefit of doing
> > it on KVM_CLEAR_DIRTY_LOG is that this ioctl takes ranges, and not complete
> > memslots like when enabling dirty logging. This means that the cost of
> > splitting (mainly broadcasted TLBIs) can be throttled: split a range, wait
> > for a bit, split another range, etc. The benefits of this approach were
> > presented by Oliver Upton at KVM Forum 2022 [1].
> > 
> > Implementation
> > ==============
> > Patches 1-4 add a pgtable utility function for splitting huge block PTEs:
> > kvm_pgtable_stage2_split(). Patches 5-6 add support for not doing
> > break-before-make on huge-page breaking when FEAT_BBM level 2 is supported.
> 
> I would suggest you split up FEAT_BBM=2 and eager page splitting into
> two separate series, if possible. IMO, the eager page split is easier to
> reason about if it follows the existing pattern of break-before-make.

Dropping these changes in v1.

> 
> --
> Thanks,
> Oliver

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 02/12] KVM: arm64: Allow visiting block PTEs in post-order
  2022-11-14 18:48     ` Oliver Upton
  (?)
@ 2023-01-13  3:44     ` Ricardo Koller
  -1 siblings, 0 replies; 46+ messages in thread
From: Ricardo Koller @ 2023-01-13  3:44 UTC (permalink / raw)
  To: Oliver Upton
  Cc: pbonzini, maz, dmatlack, qperret, catalin.marinas, andrew.jones,
	seanjc, alexandru.elisei, suzuki.poulose, eric.auger, gshan,
	reijiw, rananta, bgardon, kvm, kvmarm, kvmarm, ricarkol

On Mon, Nov 14, 2022 at 06:48:04PM +0000, Oliver Upton wrote:
> On Sat, Nov 12, 2022 at 08:17:04AM +0000, Ricardo Koller wrote:
> > The page table walker does not visit block PTEs in post-order. But there
> > are some cases where doing so would be beneficial, for example: breaking a
> > 1G block PTE into a full tree in post-order avoids visiting the new tree.
> > 
> > Allow post order visits of block PTEs. This will be used in a subsequent
> > commit for eagerly breaking huge pages.
> > 
> > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_pgtable.h |  4 ++--
> >  arch/arm64/kvm/hyp/nvhe/setup.c      |  2 +-
> >  arch/arm64/kvm/hyp/pgtable.c         | 25 ++++++++++++-------------
> >  3 files changed, 15 insertions(+), 16 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index e2edeed462e8..d2e4a5032146 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -255,7 +255,7 @@ struct kvm_pgtable {
> >   *					entries.
> >   * @KVM_PGTABLE_WALK_TABLE_PRE:		Visit table entries before their
> >   *					children.
> > - * @KVM_PGTABLE_WALK_TABLE_POST:	Visit table entries after their
> > + * @KVM_PGTABLE_WALK_POST:		Visit leaf or table entries after their
> >   *					children.
> 
> It is not immediately obvious from this change alone that promoting the
> post-order traversal of every walker to cover leaf + table PTEs is safe.
> 
> Have you considered using a flag for just leaf post-order visits?
> 

Not using this commit in v1. There's no (noticeable) perf benefit from
avoiding visiting the new split tree.

Thanks,
Ricardo

> --
> Thanks,
> Oliver

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2023-01-13  3:44 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-12  8:17 [RFC PATCH 00/12] KVM: arm64: Eager huge-page splitting for dirty-logging Ricardo Koller
2022-11-12  8:17 ` Ricardo Koller
2022-11-12  8:17 ` [RFC PATCH 01/12] KVM: arm64: Relax WARN check in stage2_make_pte() Ricardo Koller
2022-11-12  8:17   ` Ricardo Koller
2022-11-14 20:59   ` Oliver Upton
2022-11-14 20:59     ` Oliver Upton
2022-11-12  8:17 ` [RFC PATCH 02/12] KVM: arm64: Allow visiting block PTEs in post-order Ricardo Koller
2022-11-12  8:17   ` Ricardo Koller
2022-11-14 18:48   ` Oliver Upton
2022-11-14 18:48     ` Oliver Upton
2023-01-13  3:44     ` Ricardo Koller
2022-11-12  8:17 ` [RFC PATCH 03/12] KVM: arm64: Add stage2_create_removed() Ricardo Koller
2022-11-12  8:17   ` Ricardo Koller
2022-11-12  8:17 ` [RFC PATCH 04/12] KVM: arm64: Add kvm_pgtable_stage2_split() Ricardo Koller
2022-11-12  8:17   ` Ricardo Koller
2022-11-14 20:54   ` Oliver Upton
2022-11-14 20:54     ` Oliver Upton
2022-11-15 23:03     ` Ricardo Koller
2022-11-15 23:03       ` Ricardo Koller
2022-11-15 23:27       ` Ricardo Koller
2022-11-15 23:27         ` Ricardo Koller
2022-11-15 23:54         ` Oliver Upton
2022-11-15 23:54           ` Oliver Upton
2022-11-17 21:50           ` Ricardo Koller
2022-11-17 21:50             ` Ricardo Koller
2022-11-12  8:17 ` [RFC PATCH 05/12] arm64: Add a capability for FEAT_BBM level 2 Ricardo Koller
2022-11-12  8:17   ` Ricardo Koller
2022-11-12  8:17 ` [RFC PATCH 06/12] KVM: arm64: Split block PTEs without using break-before-make Ricardo Koller
2022-11-12  8:17   ` Ricardo Koller
2022-11-14 18:56   ` Oliver Upton
2022-11-14 18:56     ` Oliver Upton
2022-11-12  8:17 ` [RFC PATCH 07/12] KVM: arm64: Refactor kvm_arch_commit_memory_region() Ricardo Koller
2022-11-12  8:17   ` Ricardo Koller
2022-11-12  8:17 ` [RFC PATCH 08/12] KVM: arm64: Add kvm_uninit_stage2_mmu() Ricardo Koller
2022-11-12  8:17   ` Ricardo Koller
2022-11-12  8:17 ` [RFC PATCH 09/12] KVM: arm64: Split huge pages when dirty logging is enabled Ricardo Koller
2022-11-12  8:17   ` Ricardo Koller
2022-11-12  8:17 ` [RFC PATCH 10/12] KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked() Ricardo Koller
2022-11-12  8:17   ` Ricardo Koller
2022-11-12  8:17 ` [RFC PATCH 11/12] KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG Ricardo Koller
2022-11-12  8:17   ` Ricardo Koller
2022-11-12  8:17 ` [RFC PATCH 12/12] KVM: arm64: Use local TLBI on permission relaxation Ricardo Koller
2022-11-12  8:17   ` Ricardo Koller
2022-11-14 18:42 ` [RFC PATCH 00/12] KVM: arm64: Eager huge-page splitting for dirty-logging Oliver Upton
2022-11-14 18:42   ` Oliver Upton
2023-01-13  3:42   ` Ricardo Koller
