* [RFC PATCH 00/17] KVM: arm64: Parallelize stage 2 fault handling
From: Oliver Upton @ 2022-04-15 21:58 UTC
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

Presently KVM only takes a read lock for stage 2 faults if it believes
the fault can be fixed by relaxing permissions on a PTE (write unprotect
for dirty logging). Otherwise, stage 2 faults grab the write lock, which
predictably can pile up all the vCPUs in a sufficiently large VM.

The x86 port of KVM has what it calls the TDP MMU. Basically, it is an
MMU protected by the combination of a read-write lock and RCU, allowing
page walkers to traverse in parallel.

This series is strongly inspired by the mechanics of the TDP MMU,
making use of RCU to protect parallel walks. Note that the TLB
invalidation mechanics are a bit different between x86 and ARM, so we
need to use the 'break-before-make' sequence to split a block mapping
or collapse a table back into a block mapping.

Nonetheless, using atomics on the break side allows fault handlers to
acquire exclusive access to a PTE (let's just call it 'locking' the
PTE). Once the PTE lock is acquired, a walker can safely assume it has
exclusive access to that mapping.
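
To illustrate the idea outside the actual patches, here is a rough,
self-contained user-space sketch of the break/make sides (the names and
the locked encoding are made up for illustration, not the helpers this
series adds):

/*
 * Illustrative only: a reserved invalid encoding stands in for a
 * 'locked' PTE. Hardware never treats it as valid, and software
 * walkers that observe it back off (typically by retrying the fault).
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t kvm_pte_t;

#define PTE_LOCKED	((kvm_pte_t)1 << 10)	/* hypothetical encoding */

/* 'break': only the walker that wins the cmpxchg owns the PTE. */
static bool pte_try_lock(_Atomic kvm_pte_t *ptep, kvm_pte_t old)
{
	return atomic_compare_exchange_strong(ptep, &old, PTE_LOCKED);
}

/* 'make': publish the new mapping once TLB invalidation is complete. */
static void pte_make(_Atomic kvm_pte_t *ptep, kvm_pte_t new)
{
	atomic_store_explicit(ptep, new, memory_order_release);
}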

Special consideration is required when pruning the page tables in
parallel. Suppose we are collapsing a table into a block. Allowing
parallel faults means that a software walker could be in the middle of
a lower level traversal when the table is unlinked. Table
walkers that prune the paging structures must now 'lock' all descendant
PTEs, effectively asserting exclusive ownership of the substructure
(no other walker can install anything at an already locked PTE).
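
Continuing the toy model from the sketch above (again, hypothetical
names rather than the series' actual walker), pruning an unlinked
subtree could look roughly like this:

/*
 * Every entry of the unlinked subtree is atomically swapped to the
 * locked value so nothing can be installed underneath, and we only
 * recurse where the old value really pointed at a lower-level table.
 */
#define PTES_PER_TABLE	512
#define PTE_TABLE_BIT	((kvm_pte_t)1 << 1)	/* hypothetical encoding */

struct toy_table {
	_Atomic kvm_pte_t	pte[PTES_PER_TABLE];
	struct toy_table	*child[PTES_PER_TABLE];
};

static void lock_unlinked_table(struct toy_table *table)
{
	for (int i = 0; i < PTES_PER_TABLE; i++) {
		kvm_pte_t old = atomic_exchange(&table->pte[i], PTE_LOCKED);

		if ((old & PTE_TABLE_BIT) && table->child[i])
			lock_unlinked_table(table->child[i]);
	}
}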

Additionally, for parallel walks we need to punt the freeing of table
pages to the next RCU sync, as there could be multiple observers of the
table until all walkers exit the RCU critical section. For this I
decided to cram an rcu_head into the page private data for every table
page. We wind up spending a bit more memory on table pages now, but
lazily allocating rcu callbacks probably doesn't make a lot of sense.
Not only would we need a large cache of them (think about installing a
level 1 block) to wire up callbacks on all descendant tables, but we
would also then need to spend memory to actually free memory.
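
The rough shape of that (not the exact layout in the series, just an
illustration of the idea using stock kernel APIs) is a small per-page
header reachable from the page's private field, whose rcu_head lets the
last reference be dropped from any walker without allocating on the
free path:

/*
 * Illustration only: each table page gets a header carrying an
 * rcu_head, found through the page's private data, so freeing can be
 * deferred to call_rcu() without allocating anything at free time.
 */
#include <linux/mm.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct stage2_page_header {
	struct rcu_head	rcu;
	struct page	*page;
};

static void stage2_free_table_rcu(struct rcu_head *rcu)
{
	struct stage2_page_header *hdr =
		container_of(rcu, struct stage2_page_header, rcu);

	__free_page(hdr->page);
	kfree(hdr);
}

/* Last reference on an unlinked table page: punt to the next grace period. */
static void stage2_defer_free_table(void *pgtable)
{
	struct page *page = virt_to_page(pgtable);
	struct stage2_page_header *hdr =
		(struct stage2_page_header *)page_private(page);

	call_rcu(&hdr->rcu, stage2_free_table_rcu);
}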

I tried to organize these patches as best I could w/o introducing
intermediate breakage.

The first 5 patches are meant mostly as preparatory reworks and, in the
case of RCU, a no-op.

Patch 6 is quite large, but I had a hard time deciding how to change the
way we link/unlink tables to use atomics without breaking things along
the way.

Patch 7 probably should come before patch 6, as it informs the other
read-side fault (perm relax) about when a map is in progress so it'll
back off.

Patches 8-10 take care of the pruning case, actually locking the child
PTEs instead of simply dropping table page references along the way.
Note that we cannot assume a PTE points to a table/page at this point,
hence the same helper is called for both pre-order and leaf traversal,
and the recursion is guided by what got yanked from the PTE.

Patches 11-14 wire up everything to schedule rcu callbacks on
to-be-freed table pages. rcu_barrier() is called on the way out from
tearing down a stage 2 page table to guarantee all memory associated
with the VM has actually been cleaned up.

Patches 15-16 loop in the fault handler to the new table traversal game.

Lastly, patch 17 is a nasty bit of debugging residue to spot possible
table page leaks. Please don't laugh ;-)

Smoke tested with KVM selftests + kvm_page_table_test w/ 2M hugetlb to
exercise the table pruning code. Haven't done anything beyond this,
sending as an RFC now to get eyes on the code.

Applies to commit fb649bda6f56 ("Merge tag 'block-5.18-2022-04-15' of
git://git.kernel.dk/linux-block")

Oliver Upton (17):
  KVM: arm64: Directly read owner id field in stage2_pte_is_counted()
  KVM: arm64: Only read the pte once per visit
  KVM: arm64: Return the next table from map callbacks
  KVM: arm64: Protect page table traversal with RCU
  KVM: arm64: Take an argument to indicate parallel walk
  KVM: arm64: Implement break-before-make sequence for parallel walks
  KVM: arm64: Enlighten perm relax path about parallel walks
  KVM: arm64: Spin off helper for initializing table pte
  KVM: arm64: Tear down unlinked page tables in parallel walk
  KVM: arm64: Assume a table pte is already owned in post-order
    traversal
  KVM: arm64: Move MMU cache init/destroy into helpers
  KVM: arm64: Stuff mmu page cache in sub struct
  KVM: arm64: Setup cache for stage2 page headers
  KVM: arm64: Punt last page reference to rcu callback for parallel walk
  KVM: arm64: Allow parallel calls to kvm_pgtable_stage2_map()
  KVM: arm64: Enable parallel stage 2 MMU faults
  TESTONLY: KVM: arm64: Add super lazy accounting of stage 2 table pages

 arch/arm64/include/asm/kvm_host.h     |   5 +-
 arch/arm64/include/asm/kvm_mmu.h      |   2 +
 arch/arm64/include/asm/kvm_pgtable.h  |  14 +-
 arch/arm64/kvm/arm.c                  |   4 +-
 arch/arm64/kvm/hyp/nvhe/mem_protect.c |  13 +-
 arch/arm64/kvm/hyp/nvhe/setup.c       |  13 +-
 arch/arm64/kvm/hyp/pgtable.c          | 518 +++++++++++++++++++-------
 arch/arm64/kvm/mmu.c                  | 120 ++++--
 8 files changed, 503 insertions(+), 186 deletions(-)

-- 
2.36.0.rc0.470.gd361397f0d-goog


* [RFC PATCH 01/17] KVM: arm64: Directly read owner id field in stage2_pte_is_counted()
From: Oliver Upton @ 2022-04-15 21:58 UTC
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

A subsequent change to KVM will make use of additional bits in invalid
ptes. Prepare for said change by explicitly checking the valid bit and
owner fields in stage2_pte_is_counted().

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 2cb3867eb7c2..e1506da3e2fb 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -172,6 +172,11 @@ static kvm_pte_t kvm_init_invalid_leaf_owner(u8 owner_id)
 	return FIELD_PREP(KVM_INVALID_PTE_OWNER_MASK, owner_id);
 }
 
+static u8 kvm_invalid_pte_owner(kvm_pte_t pte)
+{
+	return FIELD_GET(KVM_INVALID_PTE_OWNER_MASK, pte);
+}
+
 static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
 				  u32 level, kvm_pte_t *ptep,
 				  enum kvm_pgtable_walk_flags flag)
@@ -679,7 +684,7 @@ static bool stage2_pte_is_counted(kvm_pte_t pte)
 	 * encode ownership of a page to another entity than the page-table
 	 * owner, whose id is 0.
 	 */
-	return !!pte;
+	return kvm_pte_valid(pte) || kvm_invalid_pte_owner(pte);
 }
 
 static void stage2_put_pte(kvm_pte_t *ptep, struct kvm_s2_mmu *mmu, u64 addr,
-- 
2.36.0.rc0.470.gd361397f0d-goog


* [RFC PATCH 02/17] KVM: arm64: Only read the pte once per visit
From: Oliver Upton @ 2022-04-15 21:58 UTC
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

A subsequent change to KVM will parallelize modifications to the stage-2
page tables. The various page table walkers read the ptep multiple
times, which could lead to a visitor seeing multiple values during the
visit.

Pass through the observed pte to the visitor callbacks. Promote reads of
the ptep to a full READ_ONCE(), which will matter more when we start
tweaking ptes atomically. Note that a pointer to the old pte is given to
visitors, as parallel visitors will need to steer the page table
traversal as they adjust the page tables.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/include/asm/kvm_pgtable.h  |   2 +-
 arch/arm64/kvm/hyp/nvhe/mem_protect.c |   7 +-
 arch/arm64/kvm/hyp/nvhe/setup.c       |   9 +-
 arch/arm64/kvm/hyp/pgtable.c          | 113 +++++++++++++-------------
 4 files changed, 63 insertions(+), 68 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 9f339dffbc1a..ea818a5f7408 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -192,7 +192,7 @@ enum kvm_pgtable_walk_flags {
 };
 
 typedef int (*kvm_pgtable_visitor_fn_t)(u64 addr, u64 end, u32 level,
-					kvm_pte_t *ptep,
+					kvm_pte_t *ptep, kvm_pte_t *old,
 					enum kvm_pgtable_walk_flags flag,
 					void * const arg);
 
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 78edf077fa3b..601a586581d8 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -422,17 +422,16 @@ struct check_walk_data {
 };
 
 static int __check_page_state_visitor(u64 addr, u64 end, u32 level,
-				      kvm_pte_t *ptep,
+				      kvm_pte_t *ptep, kvm_pte_t *old,
 				      enum kvm_pgtable_walk_flags flag,
 				      void * const arg)
 {
 	struct check_walk_data *d = arg;
-	kvm_pte_t pte = *ptep;
 
-	if (kvm_pte_valid(pte) && !addr_is_memory(kvm_pte_to_phys(pte)))
+	if (kvm_pte_valid(*old) && !addr_is_memory(kvm_pte_to_phys(*old)))
 		return -EINVAL;
 
-	return d->get_page_state(pte) == d->desired ? 0 : -EPERM;
+	return d->get_page_state(*old) == d->desired ? 0 : -EPERM;
 }
 
 static int check_page_state_range(struct kvm_pgtable *pgt, u64 addr, u64 size,
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index 27af337f9fea..ecab7a4049d6 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -162,17 +162,16 @@ static void hpool_put_page(void *addr)
 }
 
 static int finalize_host_mappings_walker(u64 addr, u64 end, u32 level,
-					 kvm_pte_t *ptep,
+					 kvm_pte_t *ptep, kvm_pte_t *old,
 					 enum kvm_pgtable_walk_flags flag,
 					 void * const arg)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = arg;
 	enum kvm_pgtable_prot prot;
 	enum pkvm_page_state state;
-	kvm_pte_t pte = *ptep;
 	phys_addr_t phys;
 
-	if (!kvm_pte_valid(pte))
+	if (!kvm_pte_valid(*old))
 		return 0;
 
 	/*
@@ -187,7 +186,7 @@ static int finalize_host_mappings_walker(u64 addr, u64 end, u32 level,
 	if (level != (KVM_PGTABLE_MAX_LEVELS - 1))
 		return -EINVAL;
 
-	phys = kvm_pte_to_phys(pte);
+	phys = kvm_pte_to_phys(*old);
 	if (!addr_is_memory(phys))
 		return -EINVAL;
 
@@ -195,7 +194,7 @@ static int finalize_host_mappings_walker(u64 addr, u64 end, u32 level,
 	 * Adjust the host stage-2 mappings to match the ownership attributes
 	 * configured in the hypervisor stage-1.
 	 */
-	state = pkvm_getstate(kvm_pgtable_hyp_pte_prot(pte));
+	state = pkvm_getstate(kvm_pgtable_hyp_pte_prot(*old));
 	switch (state) {
 	case PKVM_PAGE_OWNED:
 		return host_stage2_set_owner_locked(phys, PAGE_SIZE, pkvm_hyp_id);
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index e1506da3e2fb..ad911cd44425 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -178,11 +178,11 @@ static u8 kvm_invalid_pte_owner(kvm_pte_t pte)
 }
 
 static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
-				  u32 level, kvm_pte_t *ptep,
+				  u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
 				  enum kvm_pgtable_walk_flags flag)
 {
 	struct kvm_pgtable_walker *walker = data->walker;
-	return walker->cb(addr, data->end, level, ptep, flag, walker->arg);
+	return walker->cb(addr, data->end, level, ptep, old, flag, walker->arg);
 }
 
 static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data,
@@ -193,17 +193,17 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
 {
 	int ret = 0;
 	u64 addr = data->addr;
-	kvm_pte_t *childp, pte = *ptep;
+	kvm_pte_t *childp, pte = READ_ONCE(*ptep);
 	bool table = kvm_pte_table(pte, level);
 	enum kvm_pgtable_walk_flags flags = data->walker->flags;
 
 	if (table && (flags & KVM_PGTABLE_WALK_TABLE_PRE)) {
-		ret = kvm_pgtable_visitor_cb(data, addr, level, ptep,
+		ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
 					     KVM_PGTABLE_WALK_TABLE_PRE);
 	}
 
 	if (!table && (flags & KVM_PGTABLE_WALK_LEAF)) {
-		ret = kvm_pgtable_visitor_cb(data, addr, level, ptep,
+		ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
 					     KVM_PGTABLE_WALK_LEAF);
 		pte = *ptep;
 		table = kvm_pte_table(pte, level);
@@ -224,7 +224,7 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
 		goto out;
 
 	if (flags & KVM_PGTABLE_WALK_TABLE_POST) {
-		ret = kvm_pgtable_visitor_cb(data, addr, level, ptep,
+		ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
 					     KVM_PGTABLE_WALK_TABLE_POST);
 	}
 
@@ -297,12 +297,12 @@ struct leaf_walk_data {
 	u32		level;
 };
 
-static int leaf_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
+static int leaf_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
 		       enum kvm_pgtable_walk_flags flag, void * const arg)
 {
 	struct leaf_walk_data *data = arg;
 
-	data->pte   = *ptep;
+	data->pte   = *old;
 	data->level = level;
 
 	return 0;
@@ -388,10 +388,10 @@ enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte)
 	return prot;
 }
 
-static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
-				    kvm_pte_t *ptep, struct hyp_map_data *data)
+static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
+				    kvm_pte_t old, struct hyp_map_data *data)
 {
-	kvm_pte_t new, old = *ptep;
+	kvm_pte_t new;
 	u64 granule = kvm_granule_size(level), phys = data->phys;
 
 	if (!kvm_block_mapping_supported(addr, end, phys, level))
@@ -410,14 +410,14 @@ static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 	return true;
 }
 
-static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
+static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
 			  enum kvm_pgtable_walk_flags flag, void * const arg)
 {
 	kvm_pte_t *childp;
 	struct hyp_map_data *data = arg;
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
 
-	if (hyp_map_walker_try_leaf(addr, end, level, ptep, arg))
+	if (hyp_map_walker_try_leaf(addr, end, level, ptep, *old, arg))
 		return 0;
 
 	if (WARN_ON(level == KVM_PGTABLE_MAX_LEVELS - 1))
@@ -461,19 +461,19 @@ struct hyp_unmap_data {
 	struct kvm_pgtable_mm_ops	*mm_ops;
 };
 
-static int hyp_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
+static int hyp_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
 			    enum kvm_pgtable_walk_flags flag, void * const arg)
 {
-	kvm_pte_t pte = *ptep, *childp = NULL;
+	kvm_pte_t *childp = NULL;
 	u64 granule = kvm_granule_size(level);
 	struct hyp_unmap_data *data = arg;
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
 
-	if (!kvm_pte_valid(pte))
+	if (!kvm_pte_valid(*old))
 		return -EINVAL;
 
-	if (kvm_pte_table(pte, level)) {
-		childp = kvm_pte_follow(pte, mm_ops);
+	if (kvm_pte_table(*old, level)) {
+		childp = kvm_pte_follow(*old, mm_ops);
 
 		if (mm_ops->page_count(childp) != 1)
 			return 0;
@@ -537,19 +537,18 @@ int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits,
 	return 0;
 }
 
-static int hyp_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
+static int hyp_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
 			   enum kvm_pgtable_walk_flags flag, void * const arg)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = arg;
-	kvm_pte_t pte = *ptep;
 
-	if (!kvm_pte_valid(pte))
+	if (!kvm_pte_valid(*old))
 		return 0;
 
 	mm_ops->put_page(ptep);
 
-	if (kvm_pte_table(pte, level))
-		mm_ops->put_page(kvm_pte_follow(pte, mm_ops));
+	if (kvm_pte_table(*old, level))
+		mm_ops->put_page(kvm_pte_follow(*old, mm_ops));
 
 	return 0;
 }
@@ -723,10 +722,10 @@ static bool stage2_leaf_mapping_allowed(u64 addr, u64 end, u32 level,
 }
 
 static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
-				      kvm_pte_t *ptep,
+				      kvm_pte_t *ptep, kvm_pte_t old,
 				      struct stage2_map_data *data)
 {
-	kvm_pte_t new, old = *ptep;
+	kvm_pte_t new;
 	u64 granule = kvm_granule_size(level), phys = data->phys;
 	struct kvm_pgtable *pgt = data->mmu->pgt;
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
@@ -769,7 +768,7 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 }
 
 static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
-				     kvm_pte_t *ptep,
+				     kvm_pte_t *ptep, kvm_pte_t *old,
 				     struct stage2_map_data *data)
 {
 	if (data->anchor)
@@ -778,7 +777,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 	if (!stage2_leaf_mapping_allowed(addr, end, level, data))
 		return 0;
 
-	data->childp = kvm_pte_follow(*ptep, data->mm_ops);
+	data->childp = kvm_pte_follow(*old, data->mm_ops);
 	kvm_clear_pte(ptep);
 
 	/*
@@ -792,20 +791,20 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 }
 
 static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
-				struct stage2_map_data *data)
+				kvm_pte_t *old, struct stage2_map_data *data)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
-	kvm_pte_t *childp, pte = *ptep;
+	kvm_pte_t *childp;
 	int ret;
 
 	if (data->anchor) {
-		if (stage2_pte_is_counted(pte))
+		if (stage2_pte_is_counted(*old))
 			mm_ops->put_page(ptep);
 
 		return 0;
 	}
 
-	ret = stage2_map_walker_try_leaf(addr, end, level, ptep, data);
+	ret = stage2_map_walker_try_leaf(addr, end, level, ptep, *old, data);
 	if (ret != -E2BIG)
 		return ret;
 
@@ -824,7 +823,7 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	 * a table. Accesses beyond 'end' that fall within the new table
 	 * will be mapped lazily.
 	 */
-	if (stage2_pte_is_counted(pte))
+	if (stage2_pte_is_counted(*old))
 		stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
 
 	kvm_set_table_pte(ptep, childp, mm_ops);
@@ -834,7 +833,7 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 }
 
 static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
-				      kvm_pte_t *ptep,
+				      kvm_pte_t *ptep, kvm_pte_t *old,
 				      struct stage2_map_data *data)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
@@ -848,9 +847,9 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
 		childp = data->childp;
 		data->anchor = NULL;
 		data->childp = NULL;
-		ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
+		ret = stage2_map_walk_leaf(addr, end, level, ptep, old, data);
 	} else {
-		childp = kvm_pte_follow(*ptep, mm_ops);
+		childp = kvm_pte_follow(*old, mm_ops);
 	}
 
 	mm_ops->put_page(childp);
@@ -878,18 +877,18 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
  * the page-table, installing the block entry when it revisits the anchor
  * pointer and clearing the anchor to NULL.
  */
-static int stage2_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
+static int stage2_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
 			     enum kvm_pgtable_walk_flags flag, void * const arg)
 {
 	struct stage2_map_data *data = arg;
 
 	switch (flag) {
 	case KVM_PGTABLE_WALK_TABLE_PRE:
-		return stage2_map_walk_table_pre(addr, end, level, ptep, data);
+		return stage2_map_walk_table_pre(addr, end, level, ptep, old, data);
 	case KVM_PGTABLE_WALK_LEAF:
-		return stage2_map_walk_leaf(addr, end, level, ptep, data);
+		return stage2_map_walk_leaf(addr, end, level, ptep, old, data);
 	case KVM_PGTABLE_WALK_TABLE_POST:
-		return stage2_map_walk_table_post(addr, end, level, ptep, data);
+		return stage2_map_walk_table_post(addr, end, level, ptep, old, data);
 	}
 
 	return -EINVAL;
@@ -955,29 +954,29 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
 }
 
 static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
-			       enum kvm_pgtable_walk_flags flag,
+			       kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
 			       void * const arg)
 {
 	struct kvm_pgtable *pgt = arg;
 	struct kvm_s2_mmu *mmu = pgt->mmu;
 	struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
-	kvm_pte_t pte = *ptep, *childp = NULL;
+	kvm_pte_t *childp = NULL;
 	bool need_flush = false;
 
-	if (!kvm_pte_valid(pte)) {
-		if (stage2_pte_is_counted(pte)) {
+	if (!kvm_pte_valid(*old)) {
+		if (stage2_pte_is_counted(*old)) {
 			kvm_clear_pte(ptep);
 			mm_ops->put_page(ptep);
 		}
 		return 0;
 	}
 
-	if (kvm_pte_table(pte, level)) {
-		childp = kvm_pte_follow(pte, mm_ops);
+	if (kvm_pte_table(*old, level)) {
+		childp = kvm_pte_follow(*old, mm_ops);
 
 		if (mm_ops->page_count(childp) != 1)
 			return 0;
-	} else if (stage2_pte_cacheable(pgt, pte)) {
+	} else if (stage2_pte_cacheable(pgt, *old)) {
 		need_flush = !stage2_has_fwb(pgt);
 	}
 
@@ -989,7 +988,7 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	stage2_put_pte(ptep, mmu, addr, level, mm_ops);
 
 	if (need_flush && mm_ops->dcache_clean_inval_poc)
-		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(pte, mm_ops),
+		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(*old, mm_ops),
 					       kvm_granule_size(level));
 
 	if (childp)
@@ -1018,10 +1017,10 @@ struct stage2_attr_data {
 };
 
 static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
-			      enum kvm_pgtable_walk_flags flag,
+			      kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
 			      void * const arg)
 {
-	kvm_pte_t pte = *ptep;
+	kvm_pte_t pte = *old;
 	struct stage2_attr_data *data = arg;
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
 
@@ -1146,18 +1145,17 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
 }
 
 static int stage2_flush_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
-			       enum kvm_pgtable_walk_flags flag,
+			       kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
 			       void * const arg)
 {
 	struct kvm_pgtable *pgt = arg;
 	struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
-	kvm_pte_t pte = *ptep;
 
-	if (!kvm_pte_valid(pte) || !stage2_pte_cacheable(pgt, pte))
+	if (!kvm_pte_valid(*old) || !stage2_pte_cacheable(pgt, *old))
 		return 0;
 
 	if (mm_ops->dcache_clean_inval_poc)
-		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(pte, mm_ops),
+		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(*old, mm_ops),
 					       kvm_granule_size(level));
 	return 0;
 }
@@ -1206,19 +1204,18 @@ int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
 }
 
 static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
-			      enum kvm_pgtable_walk_flags flag,
+			      kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
 			      void * const arg)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = arg;
-	kvm_pte_t pte = *ptep;
 
-	if (!stage2_pte_is_counted(pte))
+	if (!stage2_pte_is_counted(*old))
 		return 0;
 
 	mm_ops->put_page(ptep);
 
-	if (kvm_pte_table(pte, level))
-		mm_ops->put_page(kvm_pte_follow(pte, mm_ops));
+	if (kvm_pte_table(*old, level))
+		mm_ops->put_page(kvm_pte_follow(*old, mm_ops));
 
 	return 0;
 }
-- 
2.36.0.rc0.470.gd361397f0d-goog


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [RFC PATCH 02/17] KVM: arm64: Only read the pte once per visit
@ 2022-04-15 21:58   ` Oliver Upton
  0 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:58 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, Ben Gardon, Peter Shier, David Matlack,
	Paolo Bonzini, linux-arm-kernel

A subsequent change to KVM will parallize modifications to the stage-2
page tables. The various page table walkers read the ptep multiple
times, which could lead to a visitor seeing multiple values during the
visit.

Pass through the observed pte to the visitor callbacks. Promote reads of
the ptep to a full READ_ONCE(), which will matter more when we start
tweaking ptes atomically. Note that a pointer to the old pte is given to
visitors, as parallel visitors will need to steer the page table
traversal as they adjust the page tables.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/include/asm/kvm_pgtable.h  |   2 +-
 arch/arm64/kvm/hyp/nvhe/mem_protect.c |   7 +-
 arch/arm64/kvm/hyp/nvhe/setup.c       |   9 +-
 arch/arm64/kvm/hyp/pgtable.c          | 113 +++++++++++++-------------
 4 files changed, 63 insertions(+), 68 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 9f339dffbc1a..ea818a5f7408 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -192,7 +192,7 @@ enum kvm_pgtable_walk_flags {
 };
 
 typedef int (*kvm_pgtable_visitor_fn_t)(u64 addr, u64 end, u32 level,
-					kvm_pte_t *ptep,
+					kvm_pte_t *ptep, kvm_pte_t *old,
 					enum kvm_pgtable_walk_flags flag,
 					void * const arg);
 
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 78edf077fa3b..601a586581d8 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -422,17 +422,16 @@ struct check_walk_data {
 };
 
 static int __check_page_state_visitor(u64 addr, u64 end, u32 level,
-				      kvm_pte_t *ptep,
+				      kvm_pte_t *ptep, kvm_pte_t *old,
 				      enum kvm_pgtable_walk_flags flag,
 				      void * const arg)
 {
 	struct check_walk_data *d = arg;
-	kvm_pte_t pte = *ptep;
 
-	if (kvm_pte_valid(pte) && !addr_is_memory(kvm_pte_to_phys(pte)))
+	if (kvm_pte_valid(*old) && !addr_is_memory(kvm_pte_to_phys(*old)))
 		return -EINVAL;
 
-	return d->get_page_state(pte) == d->desired ? 0 : -EPERM;
+	return d->get_page_state(*old) == d->desired ? 0 : -EPERM;
 }
 
 static int check_page_state_range(struct kvm_pgtable *pgt, u64 addr, u64 size,
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index 27af337f9fea..ecab7a4049d6 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -162,17 +162,16 @@ static void hpool_put_page(void *addr)
 }
 
 static int finalize_host_mappings_walker(u64 addr, u64 end, u32 level,
-					 kvm_pte_t *ptep,
+					 kvm_pte_t *ptep, kvm_pte_t *old,
 					 enum kvm_pgtable_walk_flags flag,
 					 void * const arg)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = arg;
 	enum kvm_pgtable_prot prot;
 	enum pkvm_page_state state;
-	kvm_pte_t pte = *ptep;
 	phys_addr_t phys;
 
-	if (!kvm_pte_valid(pte))
+	if (!kvm_pte_valid(*old))
 		return 0;
 
 	/*
@@ -187,7 +186,7 @@ static int finalize_host_mappings_walker(u64 addr, u64 end, u32 level,
 	if (level != (KVM_PGTABLE_MAX_LEVELS - 1))
 		return -EINVAL;
 
-	phys = kvm_pte_to_phys(pte);
+	phys = kvm_pte_to_phys(*old);
 	if (!addr_is_memory(phys))
 		return -EINVAL;
 
@@ -195,7 +194,7 @@ static int finalize_host_mappings_walker(u64 addr, u64 end, u32 level,
 	 * Adjust the host stage-2 mappings to match the ownership attributes
 	 * configured in the hypervisor stage-1.
 	 */
-	state = pkvm_getstate(kvm_pgtable_hyp_pte_prot(pte));
+	state = pkvm_getstate(kvm_pgtable_hyp_pte_prot(*old));
 	switch (state) {
 	case PKVM_PAGE_OWNED:
 		return host_stage2_set_owner_locked(phys, PAGE_SIZE, pkvm_hyp_id);
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index e1506da3e2fb..ad911cd44425 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -178,11 +178,11 @@ static u8 kvm_invalid_pte_owner(kvm_pte_t pte)
 }
 
 static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
-				  u32 level, kvm_pte_t *ptep,
+				  u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
 				  enum kvm_pgtable_walk_flags flag)
 {
 	struct kvm_pgtable_walker *walker = data->walker;
-	return walker->cb(addr, data->end, level, ptep, flag, walker->arg);
+	return walker->cb(addr, data->end, level, ptep, old, flag, walker->arg);
 }
 
 static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data,
@@ -193,17 +193,17 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
 {
 	int ret = 0;
 	u64 addr = data->addr;
-	kvm_pte_t *childp, pte = *ptep;
+	kvm_pte_t *childp, pte = READ_ONCE(*ptep);
 	bool table = kvm_pte_table(pte, level);
 	enum kvm_pgtable_walk_flags flags = data->walker->flags;
 
 	if (table && (flags & KVM_PGTABLE_WALK_TABLE_PRE)) {
-		ret = kvm_pgtable_visitor_cb(data, addr, level, ptep,
+		ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
 					     KVM_PGTABLE_WALK_TABLE_PRE);
 	}
 
 	if (!table && (flags & KVM_PGTABLE_WALK_LEAF)) {
-		ret = kvm_pgtable_visitor_cb(data, addr, level, ptep,
+		ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
 					     KVM_PGTABLE_WALK_LEAF);
 		pte = *ptep;
 		table = kvm_pte_table(pte, level);
@@ -224,7 +224,7 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
 		goto out;
 
 	if (flags & KVM_PGTABLE_WALK_TABLE_POST) {
-		ret = kvm_pgtable_visitor_cb(data, addr, level, ptep,
+		ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
 					     KVM_PGTABLE_WALK_TABLE_POST);
 	}
 
@@ -297,12 +297,12 @@ struct leaf_walk_data {
 	u32		level;
 };
 
-static int leaf_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
+static int leaf_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
 		       enum kvm_pgtable_walk_flags flag, void * const arg)
 {
 	struct leaf_walk_data *data = arg;
 
-	data->pte   = *ptep;
+	data->pte   = *old;
 	data->level = level;
 
 	return 0;
@@ -388,10 +388,10 @@ enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte)
 	return prot;
 }
 
-static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
-				    kvm_pte_t *ptep, struct hyp_map_data *data)
+static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
+				    kvm_pte_t old, struct hyp_map_data *data)
 {
-	kvm_pte_t new, old = *ptep;
+	kvm_pte_t new;
 	u64 granule = kvm_granule_size(level), phys = data->phys;
 
 	if (!kvm_block_mapping_supported(addr, end, phys, level))
@@ -410,14 +410,14 @@ static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 	return true;
 }
 
-static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
+static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
 			  enum kvm_pgtable_walk_flags flag, void * const arg)
 {
 	kvm_pte_t *childp;
 	struct hyp_map_data *data = arg;
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
 
-	if (hyp_map_walker_try_leaf(addr, end, level, ptep, arg))
+	if (hyp_map_walker_try_leaf(addr, end, level, ptep, *old, arg))
 		return 0;
 
 	if (WARN_ON(level == KVM_PGTABLE_MAX_LEVELS - 1))
@@ -461,19 +461,19 @@ struct hyp_unmap_data {
 	struct kvm_pgtable_mm_ops	*mm_ops;
 };
 
-static int hyp_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
+static int hyp_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
 			    enum kvm_pgtable_walk_flags flag, void * const arg)
 {
-	kvm_pte_t pte = *ptep, *childp = NULL;
+	kvm_pte_t *childp = NULL;
 	u64 granule = kvm_granule_size(level);
 	struct hyp_unmap_data *data = arg;
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
 
-	if (!kvm_pte_valid(pte))
+	if (!kvm_pte_valid(*old))
 		return -EINVAL;
 
-	if (kvm_pte_table(pte, level)) {
-		childp = kvm_pte_follow(pte, mm_ops);
+	if (kvm_pte_table(*old, level)) {
+		childp = kvm_pte_follow(*old, mm_ops);
 
 		if (mm_ops->page_count(childp) != 1)
 			return 0;
@@ -537,19 +537,18 @@ int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits,
 	return 0;
 }
 
-static int hyp_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
+static int hyp_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
 			   enum kvm_pgtable_walk_flags flag, void * const arg)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = arg;
-	kvm_pte_t pte = *ptep;
 
-	if (!kvm_pte_valid(pte))
+	if (!kvm_pte_valid(*old))
 		return 0;
 
 	mm_ops->put_page(ptep);
 
-	if (kvm_pte_table(pte, level))
-		mm_ops->put_page(kvm_pte_follow(pte, mm_ops));
+	if (kvm_pte_table(*old, level))
+		mm_ops->put_page(kvm_pte_follow(*old, mm_ops));
 
 	return 0;
 }
@@ -723,10 +722,10 @@ static bool stage2_leaf_mapping_allowed(u64 addr, u64 end, u32 level,
 }
 
 static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
-				      kvm_pte_t *ptep,
+				      kvm_pte_t *ptep, kvm_pte_t old,
 				      struct stage2_map_data *data)
 {
-	kvm_pte_t new, old = *ptep;
+	kvm_pte_t new;
 	u64 granule = kvm_granule_size(level), phys = data->phys;
 	struct kvm_pgtable *pgt = data->mmu->pgt;
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
@@ -769,7 +768,7 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 }
 
 static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
-				     kvm_pte_t *ptep,
+				     kvm_pte_t *ptep, kvm_pte_t *old,
 				     struct stage2_map_data *data)
 {
 	if (data->anchor)
@@ -778,7 +777,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 	if (!stage2_leaf_mapping_allowed(addr, end, level, data))
 		return 0;
 
-	data->childp = kvm_pte_follow(*ptep, data->mm_ops);
+	data->childp = kvm_pte_follow(*old, data->mm_ops);
 	kvm_clear_pte(ptep);
 
 	/*
@@ -792,20 +791,20 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 }
 
 static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
-				struct stage2_map_data *data)
+				kvm_pte_t *old, struct stage2_map_data *data)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
-	kvm_pte_t *childp, pte = *ptep;
+	kvm_pte_t *childp;
 	int ret;
 
 	if (data->anchor) {
-		if (stage2_pte_is_counted(pte))
+		if (stage2_pte_is_counted(*old))
 			mm_ops->put_page(ptep);
 
 		return 0;
 	}
 
-	ret = stage2_map_walker_try_leaf(addr, end, level, ptep, data);
+	ret = stage2_map_walker_try_leaf(addr, end, level, ptep, *old, data);
 	if (ret != -E2BIG)
 		return ret;
 
@@ -824,7 +823,7 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	 * a table. Accesses beyond 'end' that fall within the new table
 	 * will be mapped lazily.
 	 */
-	if (stage2_pte_is_counted(pte))
+	if (stage2_pte_is_counted(*old))
 		stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
 
 	kvm_set_table_pte(ptep, childp, mm_ops);
@@ -834,7 +833,7 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 }
 
 static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
-				      kvm_pte_t *ptep,
+				      kvm_pte_t *ptep, kvm_pte_t *old,
 				      struct stage2_map_data *data)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
@@ -848,9 +847,9 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
 		childp = data->childp;
 		data->anchor = NULL;
 		data->childp = NULL;
-		ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
+		ret = stage2_map_walk_leaf(addr, end, level, ptep, old, data);
 	} else {
-		childp = kvm_pte_follow(*ptep, mm_ops);
+		childp = kvm_pte_follow(*old, mm_ops);
 	}
 
 	mm_ops->put_page(childp);
@@ -878,18 +877,18 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
  * the page-table, installing the block entry when it revisits the anchor
  * pointer and clearing the anchor to NULL.
  */
-static int stage2_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
+static int stage2_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
 			     enum kvm_pgtable_walk_flags flag, void * const arg)
 {
 	struct stage2_map_data *data = arg;
 
 	switch (flag) {
 	case KVM_PGTABLE_WALK_TABLE_PRE:
-		return stage2_map_walk_table_pre(addr, end, level, ptep, data);
+		return stage2_map_walk_table_pre(addr, end, level, ptep, old, data);
 	case KVM_PGTABLE_WALK_LEAF:
-		return stage2_map_walk_leaf(addr, end, level, ptep, data);
+		return stage2_map_walk_leaf(addr, end, level, ptep, old, data);
 	case KVM_PGTABLE_WALK_TABLE_POST:
-		return stage2_map_walk_table_post(addr, end, level, ptep, data);
+		return stage2_map_walk_table_post(addr, end, level, ptep, old, data);
 	}
 
 	return -EINVAL;
@@ -955,29 +954,29 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
 }
 
 static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
-			       enum kvm_pgtable_walk_flags flag,
+			       kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
 			       void * const arg)
 {
 	struct kvm_pgtable *pgt = arg;
 	struct kvm_s2_mmu *mmu = pgt->mmu;
 	struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
-	kvm_pte_t pte = *ptep, *childp = NULL;
+	kvm_pte_t *childp = NULL;
 	bool need_flush = false;
 
-	if (!kvm_pte_valid(pte)) {
-		if (stage2_pte_is_counted(pte)) {
+	if (!kvm_pte_valid(*old)) {
+		if (stage2_pte_is_counted(*old)) {
 			kvm_clear_pte(ptep);
 			mm_ops->put_page(ptep);
 		}
 		return 0;
 	}
 
-	if (kvm_pte_table(pte, level)) {
-		childp = kvm_pte_follow(pte, mm_ops);
+	if (kvm_pte_table(*old, level)) {
+		childp = kvm_pte_follow(*old, mm_ops);
 
 		if (mm_ops->page_count(childp) != 1)
 			return 0;
-	} else if (stage2_pte_cacheable(pgt, pte)) {
+	} else if (stage2_pte_cacheable(pgt, *old)) {
 		need_flush = !stage2_has_fwb(pgt);
 	}
 
@@ -989,7 +988,7 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	stage2_put_pte(ptep, mmu, addr, level, mm_ops);
 
 	if (need_flush && mm_ops->dcache_clean_inval_poc)
-		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(pte, mm_ops),
+		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(*old, mm_ops),
 					       kvm_granule_size(level));
 
 	if (childp)
@@ -1018,10 +1017,10 @@ struct stage2_attr_data {
 };
 
 static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
-			      enum kvm_pgtable_walk_flags flag,
+			      kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
 			      void * const arg)
 {
-	kvm_pte_t pte = *ptep;
+	kvm_pte_t pte = *old;
 	struct stage2_attr_data *data = arg;
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
 
@@ -1146,18 +1145,17 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
 }
 
 static int stage2_flush_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
-			       enum kvm_pgtable_walk_flags flag,
+			       kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
 			       void * const arg)
 {
 	struct kvm_pgtable *pgt = arg;
 	struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
-	kvm_pte_t pte = *ptep;
 
-	if (!kvm_pte_valid(pte) || !stage2_pte_cacheable(pgt, pte))
+	if (!kvm_pte_valid(*old) || !stage2_pte_cacheable(pgt, *old))
 		return 0;
 
 	if (mm_ops->dcache_clean_inval_poc)
-		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(pte, mm_ops),
+		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(*old, mm_ops),
 					       kvm_granule_size(level));
 	return 0;
 }
@@ -1206,19 +1204,18 @@ int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
 }
 
 static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
-			      enum kvm_pgtable_walk_flags flag,
+			      kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
 			      void * const arg)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = arg;
-	kvm_pte_t pte = *ptep;
 
-	if (!stage2_pte_is_counted(pte))
+	if (!stage2_pte_is_counted(*old))
 		return 0;
 
 	mm_ops->put_page(ptep);
 
-	if (kvm_pte_table(pte, level))
-		mm_ops->put_page(kvm_pte_follow(pte, mm_ops));
+	if (kvm_pte_table(*old, level))
+		mm_ops->put_page(kvm_pte_follow(*old, mm_ops));
 
 	return 0;
 }
-- 
2.36.0.rc0.470.gd361397f0d-goog

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [RFC PATCH 03/17] KVM: arm64: Return the next table from map callbacks
  2022-04-15 21:58 ` Oliver Upton
@ 2022-04-15 21:58   ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:58 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

The stage-2 and hyp stage-1 map walkers install new page tables during
their traversal. In order to support parallel table walks, make the map
callbacks return the next table to traverse.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index ad911cd44425..5b64fbca8a93 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -205,13 +205,12 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
 	if (!table && (flags & KVM_PGTABLE_WALK_LEAF)) {
 		ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
 					     KVM_PGTABLE_WALK_LEAF);
-		pte = *ptep;
-		table = kvm_pte_table(pte, level);
 	}
 
 	if (ret)
 		goto out;
 
+	table = kvm_pte_table(pte, level);
 	if (!table) {
 		data->addr = ALIGN_DOWN(data->addr, kvm_granule_size(level));
 		data->addr += kvm_granule_size(level);
@@ -429,6 +428,7 @@ static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte
 
 	kvm_set_table_pte(ptep, childp, mm_ops);
 	mm_ops->get_page(ptep);
+	*old = *ptep;
 	return 0;
 }
 
@@ -828,7 +828,7 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 
 	kvm_set_table_pte(ptep, childp, mm_ops);
 	mm_ops->get_page(ptep);
-
+	*old = *ptep;
 	return 0;
 }
 
-- 
2.36.0.rc0.470.gd361397f0d-goog


^ permalink raw reply related	[flat|nested] 165+ messages in thread
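
A sketch of the pattern the map walkers follow after this change: install
the new table, then report the freshly written PTE back through 'old' so
the generic walker descends into it. The function below is hypothetical
(the allocation call mirrors how the existing map walkers obtain a table
page, but the surrounding details are elided):

static int example_install_table(u64 addr, u64 end, u32 level,
				 kvm_pte_t *ptep, kvm_pte_t *old,
				 enum kvm_pgtable_walk_flags flag,
				 void * const arg)
{
	struct kvm_pgtable_mm_ops *mm_ops = arg;
	kvm_pte_t *childp;

	childp = mm_ops->zalloc_page(NULL);
	if (!childp)
		return -ENOMEM;

	kvm_set_table_pte(ptep, childp, mm_ops);
	mm_ops->get_page(ptep);

	/* Let the walker see the new table so it recurses into childp. */
	*old = *ptep;
	return 0;
}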

* [RFC PATCH 04/17] KVM: arm64: Protect page table traversal with RCU
  2022-04-15 21:58 ` Oliver Upton
@ 2022-04-15 21:58   ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:58 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

Use RCU to safely traverse the page tables in parallel; the tables
themselves will only be freed from an RCU-synchronized context. Don't
bother adding support to hyp; instead, just assume exclusive access to
the page tables there.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 5b64fbca8a93..d4699f698d6e 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -132,9 +132,28 @@ static kvm_pte_t kvm_phys_to_pte(u64 pa)
 	return pte;
 }
 
+
+#if defined(__KVM_NVHE_HYPERVISOR__)
+static inline void kvm_pgtable_walk_begin(void)
+{}
+
+static inline void kvm_pgtable_walk_end(void)
+{}
+
+#define kvm_dereference_ptep	rcu_dereference_raw
+#else
+#define kvm_pgtable_walk_begin	rcu_read_lock
+
+#define kvm_pgtable_walk_end	rcu_read_unlock
+
+#define kvm_dereference_ptep	rcu_dereference
+#endif
+
 static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte, struct kvm_pgtable_mm_ops *mm_ops)
 {
-	return mm_ops->phys_to_virt(kvm_pte_to_phys(pte));
+	kvm_pte_t __rcu *ptep = mm_ops->phys_to_virt(kvm_pte_to_phys(pte));
+
+	return kvm_dereference_ptep(ptep);
 }
 
 static void kvm_clear_pte(kvm_pte_t *ptep)
@@ -288,7 +307,9 @@ int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
 		.walker	= walker,
 	};
 
+	kvm_pgtable_walk_begin();
 	return _kvm_pgtable_walk(&walk_data);
+	kvm_pgtable_walk_end();
 }
 
 struct leaf_walk_data {
-- 
2.36.0.rc0.470.gd361397f0d-goog


^ permalink raw reply related	[flat|nested] 165+ messages in thread
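
One detail worth noting in the kvm_pgtable_walk() hunk above: the early
'return' sits between the new begin/end calls, so kvm_pgtable_walk_end()
is never reached and, outside of nVHE hyp, the rcu_read_lock() taken by
kvm_pgtable_walk_begin() would not be dropped. A sketch of the presumably
intended bracketing (the 'ret' local is added here only for illustration):

	int ret;

	kvm_pgtable_walk_begin();	/* rcu_read_lock() outside nVHE hyp */
	ret = _kvm_pgtable_walk(&walk_data);
	kvm_pgtable_walk_end();		/* rcu_read_unlock() */

	return ret;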

* [RFC PATCH 05/17] KVM: arm64: Take an argument to indicate parallel walk
  2022-04-15 21:58 ` Oliver Upton
@ 2022-04-15 21:58   ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:58 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

It is desirable to reuse the same page walkers for serial and parallel
faults. Take an argument to kvm_pgtable_walk() (and throughout) to
indicate whether or not a walk might happen in parallel with another.

No functional change intended.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/include/asm/kvm_pgtable.h  |  5 +-
 arch/arm64/kvm/hyp/nvhe/mem_protect.c |  4 +-
 arch/arm64/kvm/hyp/nvhe/setup.c       |  4 +-
 arch/arm64/kvm/hyp/pgtable.c          | 91 ++++++++++++++-------------
 4 files changed, 54 insertions(+), 50 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index ea818a5f7408..74955aba5918 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -194,7 +194,7 @@ enum kvm_pgtable_walk_flags {
 typedef int (*kvm_pgtable_visitor_fn_t)(u64 addr, u64 end, u32 level,
 					kvm_pte_t *ptep, kvm_pte_t *old,
 					enum kvm_pgtable_walk_flags flag,
-					void * const arg);
+					void * const arg, bool shared);
 
 /**
  * struct kvm_pgtable_walker - Hook into a page-table walk.
@@ -490,6 +490,7 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);
  * @addr:	Input address for the start of the walk.
  * @size:	Size of the range to walk.
  * @walker:	Walker callback description.
+ * @shared:	Indicates if the page table walk can be done in parallel
  *
  * The offset of @addr within a page is ignored and @size is rounded-up to
  * the next page boundary.
@@ -506,7 +507,7 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);
  * Return: 0 on success, negative error code on failure.
  */
 int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
-		     struct kvm_pgtable_walker *walker);
+		     struct kvm_pgtable_walker *walker, bool shared);
 
 /**
  * kvm_pgtable_get_leaf() - Walk a page-table and retrieve the leaf entry
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 601a586581d8..42a5f35cd819 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -424,7 +424,7 @@ struct check_walk_data {
 static int __check_page_state_visitor(u64 addr, u64 end, u32 level,
 				      kvm_pte_t *ptep, kvm_pte_t *old,
 				      enum kvm_pgtable_walk_flags flag,
-				      void * const arg)
+				      void * const arg, bool shared)
 {
 	struct check_walk_data *d = arg;
 
@@ -443,7 +443,7 @@ static int check_page_state_range(struct kvm_pgtable *pgt, u64 addr, u64 size,
 		.flags	= KVM_PGTABLE_WALK_LEAF,
 	};
 
-	return kvm_pgtable_walk(pgt, addr, size, &walker);
+	return kvm_pgtable_walk(pgt, addr, size, &walker, false);
 }
 
 static enum pkvm_page_state host_get_page_state(kvm_pte_t pte)
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index ecab7a4049d6..178a5539fe7c 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -164,7 +164,7 @@ static void hpool_put_page(void *addr)
 static int finalize_host_mappings_walker(u64 addr, u64 end, u32 level,
 					 kvm_pte_t *ptep, kvm_pte_t *old,
 					 enum kvm_pgtable_walk_flags flag,
-					 void * const arg)
+					 void * const arg, bool shared)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = arg;
 	enum kvm_pgtable_prot prot;
@@ -224,7 +224,7 @@ static int finalize_host_mappings(void)
 		struct memblock_region *reg = &hyp_memory[i];
 		u64 start = (u64)hyp_phys_to_virt(reg->base);
 
-		ret = kvm_pgtable_walk(&pkvm_pgtable, start, reg->size, &walker);
+		ret = kvm_pgtable_walk(&pkvm_pgtable, start, reg->size, &walker, false);
 		if (ret)
 			return ret;
 	}
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index d4699f698d6e..bf46d6d24951 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -198,17 +198,17 @@ static u8 kvm_invalid_pte_owner(kvm_pte_t pte)
 
 static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
 				  u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
-				  enum kvm_pgtable_walk_flags flag)
+				  enum kvm_pgtable_walk_flags flag, bool shared)
 {
 	struct kvm_pgtable_walker *walker = data->walker;
-	return walker->cb(addr, data->end, level, ptep, old, flag, walker->arg);
+	return walker->cb(addr, data->end, level, ptep, old, flag, walker->arg, shared);
 }
 
 static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data,
-			      kvm_pte_t *pgtable, u32 level);
+			      kvm_pte_t *pgtable, u32 level, bool shared);
 
 static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
-				      kvm_pte_t *ptep, u32 level)
+				      kvm_pte_t *ptep, u32 level, bool shared)
 {
 	int ret = 0;
 	u64 addr = data->addr;
@@ -218,12 +218,12 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
 
 	if (table && (flags & KVM_PGTABLE_WALK_TABLE_PRE)) {
 		ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
-					     KVM_PGTABLE_WALK_TABLE_PRE);
+					     KVM_PGTABLE_WALK_TABLE_PRE, shared);
 	}
 
 	if (!table && (flags & KVM_PGTABLE_WALK_LEAF)) {
 		ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
-					     KVM_PGTABLE_WALK_LEAF);
+					     KVM_PGTABLE_WALK_LEAF, shared);
 	}
 
 	if (ret)
@@ -237,13 +237,13 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
 	}
 
 	childp = kvm_pte_follow(pte, data->pgt->mm_ops);
-	ret = __kvm_pgtable_walk(data, childp, level + 1);
+	ret = __kvm_pgtable_walk(data, childp, level + 1, shared);
 	if (ret)
 		goto out;
 
 	if (flags & KVM_PGTABLE_WALK_TABLE_POST) {
 		ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
-					     KVM_PGTABLE_WALK_TABLE_POST);
+					     KVM_PGTABLE_WALK_TABLE_POST, shared);
 	}
 
 out:
@@ -251,7 +251,7 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
 }
 
 static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data,
-			      kvm_pte_t *pgtable, u32 level)
+			      kvm_pte_t *pgtable, u32 level, bool shared)
 {
 	u32 idx;
 	int ret = 0;
@@ -265,7 +265,7 @@ static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data,
 		if (data->addr >= data->end)
 			break;
 
-		ret = __kvm_pgtable_visit(data, ptep, level);
+		ret = __kvm_pgtable_visit(data, ptep, level, shared);
 		if (ret)
 			break;
 	}
@@ -273,7 +273,7 @@ static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data,
 	return ret;
 }
 
-static int _kvm_pgtable_walk(struct kvm_pgtable_walk_data *data)
+static int _kvm_pgtable_walk(struct kvm_pgtable_walk_data *data, bool shared)
 {
 	u32 idx;
 	int ret = 0;
@@ -289,7 +289,7 @@ static int _kvm_pgtable_walk(struct kvm_pgtable_walk_data *data)
 	for (idx = kvm_pgd_page_idx(data); data->addr < data->end; ++idx) {
 		kvm_pte_t *ptep = &pgt->pgd[idx * PTRS_PER_PTE];
 
-		ret = __kvm_pgtable_walk(data, ptep, pgt->start_level);
+		ret = __kvm_pgtable_walk(data, ptep, pgt->start_level, shared);
 		if (ret)
 			break;
 	}
@@ -298,7 +298,7 @@ static int _kvm_pgtable_walk(struct kvm_pgtable_walk_data *data)
 }
 
 int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
-		     struct kvm_pgtable_walker *walker)
+		     struct kvm_pgtable_walker *walker, bool shared)
 {
 	struct kvm_pgtable_walk_data walk_data = {
 		.pgt	= pgt,
@@ -308,7 +308,7 @@ int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
 	};
 
 	kvm_pgtable_walk_begin();
-	return _kvm_pgtable_walk(&walk_data);
+	return _kvm_pgtable_walk(&walk_data, shared);
 	kvm_pgtable_walk_end();
 }
 
@@ -318,7 +318,7 @@ struct leaf_walk_data {
 };
 
 static int leaf_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
-		       enum kvm_pgtable_walk_flags flag, void * const arg)
+		       enum kvm_pgtable_walk_flags flag, void * const arg, bool shared)
 {
 	struct leaf_walk_data *data = arg;
 
@@ -340,7 +340,7 @@ int kvm_pgtable_get_leaf(struct kvm_pgtable *pgt, u64 addr,
 	int ret;
 
 	ret = kvm_pgtable_walk(pgt, ALIGN_DOWN(addr, PAGE_SIZE),
-			       PAGE_SIZE, &walker);
+			       PAGE_SIZE, &walker, false);
 	if (!ret) {
 		if (ptep)
 			*ptep  = data.pte;
@@ -409,7 +409,7 @@ enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte)
 }
 
 static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
-				    kvm_pte_t old, struct hyp_map_data *data)
+				    kvm_pte_t old, struct hyp_map_data *data, bool shared)
 {
 	kvm_pte_t new;
 	u64 granule = kvm_granule_size(level), phys = data->phys;
@@ -431,13 +431,13 @@ static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *pte
 }
 
 static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
-			  enum kvm_pgtable_walk_flags flag, void * const arg)
+			  enum kvm_pgtable_walk_flags flag, void * const arg, bool shared)
 {
 	kvm_pte_t *childp;
 	struct hyp_map_data *data = arg;
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
 
-	if (hyp_map_walker_try_leaf(addr, end, level, ptep, *old, arg))
+	if (hyp_map_walker_try_leaf(addr, end, level, ptep, *old, arg, shared))
 		return 0;
 
 	if (WARN_ON(level == KVM_PGTABLE_MAX_LEVELS - 1))
@@ -471,7 +471,7 @@ int kvm_pgtable_hyp_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
 	if (ret)
 		return ret;
 
-	ret = kvm_pgtable_walk(pgt, addr, size, &walker);
+	ret = kvm_pgtable_walk(pgt, addr, size, &walker, false);
 	dsb(ishst);
 	isb();
 	return ret;
@@ -483,7 +483,7 @@ struct hyp_unmap_data {
 };
 
 static int hyp_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
-			    enum kvm_pgtable_walk_flags flag, void * const arg)
+			    enum kvm_pgtable_walk_flags flag, void * const arg, bool shared)
 {
 	kvm_pte_t *childp = NULL;
 	u64 granule = kvm_granule_size(level);
@@ -536,7 +536,7 @@ u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
 	if (!pgt->mm_ops->page_count)
 		return 0;
 
-	kvm_pgtable_walk(pgt, addr, size, &walker);
+	kvm_pgtable_walk(pgt, addr, size, &walker, false);
 	return unmap_data.unmapped;
 }
 
@@ -559,7 +559,7 @@ int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits,
 }
 
 static int hyp_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
-			   enum kvm_pgtable_walk_flags flag, void * const arg)
+			   enum kvm_pgtable_walk_flags flag, void * const arg, bool shared)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = arg;
 
@@ -582,7 +582,7 @@ void kvm_pgtable_hyp_destroy(struct kvm_pgtable *pgt)
 		.arg	= pgt->mm_ops,
 	};
 
-	WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker));
+	WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker, false));
 	pgt->mm_ops->put_page(pgt->pgd);
 	pgt->pgd = NULL;
 }
@@ -744,7 +744,8 @@ static bool stage2_leaf_mapping_allowed(u64 addr, u64 end, u32 level,
 
 static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 				      kvm_pte_t *ptep, kvm_pte_t old,
-				      struct stage2_map_data *data)
+				      struct stage2_map_data *data,
+				      bool shared)
 {
 	kvm_pte_t new;
 	u64 granule = kvm_granule_size(level), phys = data->phys;
@@ -790,7 +791,8 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 
 static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 				     kvm_pte_t *ptep, kvm_pte_t *old,
-				     struct stage2_map_data *data)
+				     struct stage2_map_data *data,
+				     bool shared)
 {
 	if (data->anchor)
 		return 0;
@@ -812,7 +814,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 }
 
 static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
-				kvm_pte_t *old, struct stage2_map_data *data)
+				kvm_pte_t *old, struct stage2_map_data *data, bool shared)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
 	kvm_pte_t *childp;
@@ -825,7 +827,7 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 		return 0;
 	}
 
-	ret = stage2_map_walker_try_leaf(addr, end, level, ptep, *old, data);
+	ret = stage2_map_walker_try_leaf(addr, end, level, ptep, *old, data, shared);
 	if (ret != -E2BIG)
 		return ret;
 
@@ -855,7 +857,8 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 
 static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
 				      kvm_pte_t *ptep, kvm_pte_t *old,
-				      struct stage2_map_data *data)
+				      struct stage2_map_data *data,
+				      bool shared)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
 	kvm_pte_t *childp;
@@ -868,7 +871,7 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
 		childp = data->childp;
 		data->anchor = NULL;
 		data->childp = NULL;
-		ret = stage2_map_walk_leaf(addr, end, level, ptep, old, data);
+		ret = stage2_map_walk_leaf(addr, end, level, ptep, old, data, shared);
 	} else {
 		childp = kvm_pte_follow(*old, mm_ops);
 	}
@@ -899,17 +902,17 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
  * pointer and clearing the anchor to NULL.
  */
 static int stage2_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
-			     enum kvm_pgtable_walk_flags flag, void * const arg)
+			     enum kvm_pgtable_walk_flags flag, void * const arg, bool shared)
 {
 	struct stage2_map_data *data = arg;
 
 	switch (flag) {
 	case KVM_PGTABLE_WALK_TABLE_PRE:
-		return stage2_map_walk_table_pre(addr, end, level, ptep, old, data);
+		return stage2_map_walk_table_pre(addr, end, level, ptep, old, data, shared);
 	case KVM_PGTABLE_WALK_LEAF:
-		return stage2_map_walk_leaf(addr, end, level, ptep, old, data);
+		return stage2_map_walk_leaf(addr, end, level, ptep, old, data, shared);
 	case KVM_PGTABLE_WALK_TABLE_POST:
-		return stage2_map_walk_table_post(addr, end, level, ptep, old, data);
+		return stage2_map_walk_table_post(addr, end, level, ptep, old, data, shared);
 	}
 
 	return -EINVAL;
@@ -942,7 +945,7 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
 	if (ret)
 		return ret;
 
-	ret = kvm_pgtable_walk(pgt, addr, size, &walker);
+	ret = kvm_pgtable_walk(pgt, addr, size, &walker, false);
 	dsb(ishst);
 	return ret;
 }
@@ -970,13 +973,13 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
 	if (owner_id > KVM_MAX_OWNER_ID)
 		return -EINVAL;
 
-	ret = kvm_pgtable_walk(pgt, addr, size, &walker);
+	ret = kvm_pgtable_walk(pgt, addr, size, &walker, false);
 	return ret;
 }
 
 static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 			       kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
-			       void * const arg)
+			       void * const arg, bool shared)
 {
 	struct kvm_pgtable *pgt = arg;
 	struct kvm_s2_mmu *mmu = pgt->mmu;
@@ -1026,7 +1029,7 @@ int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
 		.flags	= KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
 	};
 
-	return kvm_pgtable_walk(pgt, addr, size, &walker);
+	return kvm_pgtable_walk(pgt, addr, size, &walker, false);
 }
 
 struct stage2_attr_data {
@@ -1039,7 +1042,7 @@ struct stage2_attr_data {
 
 static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 			      kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
-			      void * const arg)
+			      void * const arg, bool shared)
 {
 	kvm_pte_t pte = *old;
 	struct stage2_attr_data *data = arg;
@@ -1091,7 +1094,7 @@ static int stage2_update_leaf_attrs(struct kvm_pgtable *pgt, u64 addr,
 		.flags		= KVM_PGTABLE_WALK_LEAF,
 	};
 
-	ret = kvm_pgtable_walk(pgt, addr, size, &walker);
+	ret = kvm_pgtable_walk(pgt, addr, size, &walker, false);
 	if (ret)
 		return ret;
 
@@ -1167,7 +1170,7 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
 
 static int stage2_flush_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 			       kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
-			       void * const arg)
+			       void * const arg, bool shared)
 {
 	struct kvm_pgtable *pgt = arg;
 	struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
@@ -1192,7 +1195,7 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
 	if (stage2_has_fwb(pgt))
 		return 0;
 
-	return kvm_pgtable_walk(pgt, addr, size, &walker);
+	return kvm_pgtable_walk(pgt, addr, size, &walker, false);
 }
 
 
@@ -1226,7 +1229,7 @@ int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
 
 static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 			      kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
-			      void * const arg)
+			      void * const arg, bool shared)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = arg;
 
@@ -1251,7 +1254,7 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
 		.arg	= pgt->mm_ops,
 	};
 
-	WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker));
+	WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker, false));
 	pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level) * PAGE_SIZE;
 	pgt->mm_ops->free_pages_exact(pgt->pgd, pgd_sz);
 	pgt->pgd = NULL;
-- 
2.36.0.rc0.470.gd361397f0d-goog


^ permalink raw reply related	[flat|nested] 165+ messages in thread
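
To make the new parameter concrete, a sketch of the caller side. In this
patch every existing caller passes false; switching fault handlers over to
shared walks is left to later patches, so the second call below is only an
assumption about how a parallel walker would be invoked:

	struct kvm_pgtable_walker walker = {
		.cb	= stage2_attr_walker,
		.arg	= &data,
		.flags	= KVM_PGTABLE_WALK_LEAF,
	};

	/* Exclusive walk: the caller guarantees no concurrent walkers. */
	ret = kvm_pgtable_walk(pgt, addr, size, &walker, false);

	/* Shared walk: other walkers may run in parallel, protected by RCU. */
	ret = kvm_pgtable_walk(pgt, addr, size, &walker, true);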

* [RFC PATCH 05/17] KVM: arm64: Take an argument to indicate parallel walk
@ 2022-04-15 21:58   ` Oliver Upton
  0 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:58 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, Ben Gardon, Peter Shier, David Matlack,
	Paolo Bonzini, linux-arm-kernel

It is desirable to reuse the same page walkers for serial and parallel
faults. Take an argument to kvm_pgtable_walk() (and throughout) to
indicate whether or not a walk might happen in parallel with another.

No functional change intended.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/include/asm/kvm_pgtable.h  |  5 +-
 arch/arm64/kvm/hyp/nvhe/mem_protect.c |  4 +-
 arch/arm64/kvm/hyp/nvhe/setup.c       |  4 +-
 arch/arm64/kvm/hyp/pgtable.c          | 91 ++++++++++++++-------------
 4 files changed, 54 insertions(+), 50 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index ea818a5f7408..74955aba5918 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -194,7 +194,7 @@ enum kvm_pgtable_walk_flags {
 typedef int (*kvm_pgtable_visitor_fn_t)(u64 addr, u64 end, u32 level,
 					kvm_pte_t *ptep, kvm_pte_t *old,
 					enum kvm_pgtable_walk_flags flag,
-					void * const arg);
+					void * const arg, bool shared);
 
 /**
  * struct kvm_pgtable_walker - Hook into a page-table walk.
@@ -490,6 +490,7 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);
  * @addr:	Input address for the start of the walk.
  * @size:	Size of the range to walk.
  * @walker:	Walker callback description.
+ * @shared:	Indicates if the page table walk can be done in parallel
  *
  * The offset of @addr within a page is ignored and @size is rounded-up to
  * the next page boundary.
@@ -506,7 +507,7 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);
  * Return: 0 on success, negative error code on failure.
  */
 int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
-		     struct kvm_pgtable_walker *walker);
+		     struct kvm_pgtable_walker *walker, bool shared);
 
 /**
  * kvm_pgtable_get_leaf() - Walk a page-table and retrieve the leaf entry
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 601a586581d8..42a5f35cd819 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -424,7 +424,7 @@ struct check_walk_data {
 static int __check_page_state_visitor(u64 addr, u64 end, u32 level,
 				      kvm_pte_t *ptep, kvm_pte_t *old,
 				      enum kvm_pgtable_walk_flags flag,
-				      void * const arg)
+				      void * const arg, bool shared)
 {
 	struct check_walk_data *d = arg;
 
@@ -443,7 +443,7 @@ static int check_page_state_range(struct kvm_pgtable *pgt, u64 addr, u64 size,
 		.flags	= KVM_PGTABLE_WALK_LEAF,
 	};
 
-	return kvm_pgtable_walk(pgt, addr, size, &walker);
+	return kvm_pgtable_walk(pgt, addr, size, &walker, false);
 }
 
 static enum pkvm_page_state host_get_page_state(kvm_pte_t pte)
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index ecab7a4049d6..178a5539fe7c 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -164,7 +164,7 @@ static void hpool_put_page(void *addr)
 static int finalize_host_mappings_walker(u64 addr, u64 end, u32 level,
 					 kvm_pte_t *ptep, kvm_pte_t *old,
 					 enum kvm_pgtable_walk_flags flag,
-					 void * const arg)
+					 void * const arg, bool shared)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = arg;
 	enum kvm_pgtable_prot prot;
@@ -224,7 +224,7 @@ static int finalize_host_mappings(void)
 		struct memblock_region *reg = &hyp_memory[i];
 		u64 start = (u64)hyp_phys_to_virt(reg->base);
 
-		ret = kvm_pgtable_walk(&pkvm_pgtable, start, reg->size, &walker);
+		ret = kvm_pgtable_walk(&pkvm_pgtable, start, reg->size, &walker, false);
 		if (ret)
 			return ret;
 	}
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index d4699f698d6e..bf46d6d24951 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -198,17 +198,17 @@ static u8 kvm_invalid_pte_owner(kvm_pte_t pte)
 
 static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
 				  u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
-				  enum kvm_pgtable_walk_flags flag)
+				  enum kvm_pgtable_walk_flags flag, bool shared)
 {
 	struct kvm_pgtable_walker *walker = data->walker;
-	return walker->cb(addr, data->end, level, ptep, old, flag, walker->arg);
+	return walker->cb(addr, data->end, level, ptep, old, flag, walker->arg, shared);
 }
 
 static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data,
-			      kvm_pte_t *pgtable, u32 level);
+			      kvm_pte_t *pgtable, u32 level, bool shared);
 
 static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
-				      kvm_pte_t *ptep, u32 level)
+				      kvm_pte_t *ptep, u32 level, bool shared)
 {
 	int ret = 0;
 	u64 addr = data->addr;
@@ -218,12 +218,12 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
 
 	if (table && (flags & KVM_PGTABLE_WALK_TABLE_PRE)) {
 		ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
-					     KVM_PGTABLE_WALK_TABLE_PRE);
+					     KVM_PGTABLE_WALK_TABLE_PRE, shared);
 	}
 
 	if (!table && (flags & KVM_PGTABLE_WALK_LEAF)) {
 		ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
-					     KVM_PGTABLE_WALK_LEAF);
+					     KVM_PGTABLE_WALK_LEAF, shared);
 	}
 
 	if (ret)
@@ -237,13 +237,13 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
 	}
 
 	childp = kvm_pte_follow(pte, data->pgt->mm_ops);
-	ret = __kvm_pgtable_walk(data, childp, level + 1);
+	ret = __kvm_pgtable_walk(data, childp, level + 1, shared);
 	if (ret)
 		goto out;
 
 	if (flags & KVM_PGTABLE_WALK_TABLE_POST) {
 		ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
-					     KVM_PGTABLE_WALK_TABLE_POST);
+					     KVM_PGTABLE_WALK_TABLE_POST, shared);
 	}
 
 out:
@@ -251,7 +251,7 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
 }
 
 static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data,
-			      kvm_pte_t *pgtable, u32 level)
+			      kvm_pte_t *pgtable, u32 level, bool shared)
 {
 	u32 idx;
 	int ret = 0;
@@ -265,7 +265,7 @@ static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data,
 		if (data->addr >= data->end)
 			break;
 
-		ret = __kvm_pgtable_visit(data, ptep, level);
+		ret = __kvm_pgtable_visit(data, ptep, level, shared);
 		if (ret)
 			break;
 	}
@@ -273,7 +273,7 @@ static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data,
 	return ret;
 }
 
-static int _kvm_pgtable_walk(struct kvm_pgtable_walk_data *data)
+static int _kvm_pgtable_walk(struct kvm_pgtable_walk_data *data, bool shared)
 {
 	u32 idx;
 	int ret = 0;
@@ -289,7 +289,7 @@ static int _kvm_pgtable_walk(struct kvm_pgtable_walk_data *data)
 	for (idx = kvm_pgd_page_idx(data); data->addr < data->end; ++idx) {
 		kvm_pte_t *ptep = &pgt->pgd[idx * PTRS_PER_PTE];
 
-		ret = __kvm_pgtable_walk(data, ptep, pgt->start_level);
+		ret = __kvm_pgtable_walk(data, ptep, pgt->start_level, shared);
 		if (ret)
 			break;
 	}
@@ -298,7 +298,7 @@ static int _kvm_pgtable_walk(struct kvm_pgtable_walk_data *data)
 }
 
 int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
-		     struct kvm_pgtable_walker *walker)
+		     struct kvm_pgtable_walker *walker, bool shared)
 {
 	struct kvm_pgtable_walk_data walk_data = {
 		.pgt	= pgt,
@@ -308,7 +308,7 @@ int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
 	};
 
 	kvm_pgtable_walk_begin();
-	return _kvm_pgtable_walk(&walk_data);
+	return _kvm_pgtable_walk(&walk_data, shared);
 	kvm_pgtable_walk_end();
 }
 
@@ -318,7 +318,7 @@ struct leaf_walk_data {
 };
 
 static int leaf_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
-		       enum kvm_pgtable_walk_flags flag, void * const arg)
+		       enum kvm_pgtable_walk_flags flag, void * const arg, bool shared)
 {
 	struct leaf_walk_data *data = arg;
 
@@ -340,7 +340,7 @@ int kvm_pgtable_get_leaf(struct kvm_pgtable *pgt, u64 addr,
 	int ret;
 
 	ret = kvm_pgtable_walk(pgt, ALIGN_DOWN(addr, PAGE_SIZE),
-			       PAGE_SIZE, &walker);
+			       PAGE_SIZE, &walker, false);
 	if (!ret) {
 		if (ptep)
 			*ptep  = data.pte;
@@ -409,7 +409,7 @@ enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte)
 }
 
 static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
-				    kvm_pte_t old, struct hyp_map_data *data)
+				    kvm_pte_t old, struct hyp_map_data *data, bool shared)
 {
 	kvm_pte_t new;
 	u64 granule = kvm_granule_size(level), phys = data->phys;
@@ -431,13 +431,13 @@ static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *pte
 }
 
 static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
-			  enum kvm_pgtable_walk_flags flag, void * const arg)
+			  enum kvm_pgtable_walk_flags flag, void * const arg, bool shared)
 {
 	kvm_pte_t *childp;
 	struct hyp_map_data *data = arg;
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
 
-	if (hyp_map_walker_try_leaf(addr, end, level, ptep, *old, arg))
+	if (hyp_map_walker_try_leaf(addr, end, level, ptep, *old, arg, shared))
 		return 0;
 
 	if (WARN_ON(level == KVM_PGTABLE_MAX_LEVELS - 1))
@@ -471,7 +471,7 @@ int kvm_pgtable_hyp_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
 	if (ret)
 		return ret;
 
-	ret = kvm_pgtable_walk(pgt, addr, size, &walker);
+	ret = kvm_pgtable_walk(pgt, addr, size, &walker, false);
 	dsb(ishst);
 	isb();
 	return ret;
@@ -483,7 +483,7 @@ struct hyp_unmap_data {
 };
 
 static int hyp_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
-			    enum kvm_pgtable_walk_flags flag, void * const arg)
+			    enum kvm_pgtable_walk_flags flag, void * const arg, bool shared)
 {
 	kvm_pte_t *childp = NULL;
 	u64 granule = kvm_granule_size(level);
@@ -536,7 +536,7 @@ u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
 	if (!pgt->mm_ops->page_count)
 		return 0;
 
-	kvm_pgtable_walk(pgt, addr, size, &walker);
+	kvm_pgtable_walk(pgt, addr, size, &walker, false);
 	return unmap_data.unmapped;
 }
 
@@ -559,7 +559,7 @@ int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits,
 }
 
 static int hyp_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
-			   enum kvm_pgtable_walk_flags flag, void * const arg)
+			   enum kvm_pgtable_walk_flags flag, void * const arg, bool shared)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = arg;
 
@@ -582,7 +582,7 @@ void kvm_pgtable_hyp_destroy(struct kvm_pgtable *pgt)
 		.arg	= pgt->mm_ops,
 	};
 
-	WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker));
+	WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker, false));
 	pgt->mm_ops->put_page(pgt->pgd);
 	pgt->pgd = NULL;
 }
@@ -744,7 +744,8 @@ static bool stage2_leaf_mapping_allowed(u64 addr, u64 end, u32 level,
 
 static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 				      kvm_pte_t *ptep, kvm_pte_t old,
-				      struct stage2_map_data *data)
+				      struct stage2_map_data *data,
+				      bool shared)
 {
 	kvm_pte_t new;
 	u64 granule = kvm_granule_size(level), phys = data->phys;
@@ -790,7 +791,8 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 
 static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 				     kvm_pte_t *ptep, kvm_pte_t *old,
-				     struct stage2_map_data *data)
+				     struct stage2_map_data *data,
+				     bool shared)
 {
 	if (data->anchor)
 		return 0;
@@ -812,7 +814,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 }
 
 static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
-				kvm_pte_t *old, struct stage2_map_data *data)
+				kvm_pte_t *old, struct stage2_map_data *data, bool shared)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
 	kvm_pte_t *childp;
@@ -825,7 +827,7 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 		return 0;
 	}
 
-	ret = stage2_map_walker_try_leaf(addr, end, level, ptep, *old, data);
+	ret = stage2_map_walker_try_leaf(addr, end, level, ptep, *old, data, shared);
 	if (ret != -E2BIG)
 		return ret;
 
@@ -855,7 +857,8 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 
 static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
 				      kvm_pte_t *ptep, kvm_pte_t *old,
-				      struct stage2_map_data *data)
+				      struct stage2_map_data *data,
+				      bool shared)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
 	kvm_pte_t *childp;
@@ -868,7 +871,7 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
 		childp = data->childp;
 		data->anchor = NULL;
 		data->childp = NULL;
-		ret = stage2_map_walk_leaf(addr, end, level, ptep, old, data);
+		ret = stage2_map_walk_leaf(addr, end, level, ptep, old, data, shared);
 	} else {
 		childp = kvm_pte_follow(*old, mm_ops);
 	}
@@ -899,17 +902,17 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
  * pointer and clearing the anchor to NULL.
  */
 static int stage2_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
-			     enum kvm_pgtable_walk_flags flag, void * const arg)
+			     enum kvm_pgtable_walk_flags flag, void * const arg, bool shared)
 {
 	struct stage2_map_data *data = arg;
 
 	switch (flag) {
 	case KVM_PGTABLE_WALK_TABLE_PRE:
-		return stage2_map_walk_table_pre(addr, end, level, ptep, old, data);
+		return stage2_map_walk_table_pre(addr, end, level, ptep, old, data, shared);
 	case KVM_PGTABLE_WALK_LEAF:
-		return stage2_map_walk_leaf(addr, end, level, ptep, old, data);
+		return stage2_map_walk_leaf(addr, end, level, ptep, old, data, shared);
 	case KVM_PGTABLE_WALK_TABLE_POST:
-		return stage2_map_walk_table_post(addr, end, level, ptep, old, data);
+		return stage2_map_walk_table_post(addr, end, level, ptep, old, data, shared);
 	}
 
 	return -EINVAL;
@@ -942,7 +945,7 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
 	if (ret)
 		return ret;
 
-	ret = kvm_pgtable_walk(pgt, addr, size, &walker);
+	ret = kvm_pgtable_walk(pgt, addr, size, &walker, false);
 	dsb(ishst);
 	return ret;
 }
@@ -970,13 +973,13 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
 	if (owner_id > KVM_MAX_OWNER_ID)
 		return -EINVAL;
 
-	ret = kvm_pgtable_walk(pgt, addr, size, &walker);
+	ret = kvm_pgtable_walk(pgt, addr, size, &walker, false);
 	return ret;
 }
 
 static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 			       kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
-			       void * const arg)
+			       void * const arg, bool shared)
 {
 	struct kvm_pgtable *pgt = arg;
 	struct kvm_s2_mmu *mmu = pgt->mmu;
@@ -1026,7 +1029,7 @@ int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
 		.flags	= KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
 	};
 
-	return kvm_pgtable_walk(pgt, addr, size, &walker);
+	return kvm_pgtable_walk(pgt, addr, size, &walker, false);
 }
 
 struct stage2_attr_data {
@@ -1039,7 +1042,7 @@ struct stage2_attr_data {
 
 static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 			      kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
-			      void * const arg)
+			      void * const arg, bool shared)
 {
 	kvm_pte_t pte = *old;
 	struct stage2_attr_data *data = arg;
@@ -1091,7 +1094,7 @@ static int stage2_update_leaf_attrs(struct kvm_pgtable *pgt, u64 addr,
 		.flags		= KVM_PGTABLE_WALK_LEAF,
 	};
 
-	ret = kvm_pgtable_walk(pgt, addr, size, &walker);
+	ret = kvm_pgtable_walk(pgt, addr, size, &walker, false);
 	if (ret)
 		return ret;
 
@@ -1167,7 +1170,7 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
 
 static int stage2_flush_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 			       kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
-			       void * const arg)
+			       void * const arg, bool shared)
 {
 	struct kvm_pgtable *pgt = arg;
 	struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
@@ -1192,7 +1195,7 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
 	if (stage2_has_fwb(pgt))
 		return 0;
 
-	return kvm_pgtable_walk(pgt, addr, size, &walker);
+	return kvm_pgtable_walk(pgt, addr, size, &walker, false);
 }
 
 
@@ -1226,7 +1229,7 @@ int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
 
 static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 			      kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
-			      void * const arg)
+			      void * const arg, bool shared)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = arg;
 
@@ -1251,7 +1254,7 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
 		.arg	= pgt->mm_ops,
 	};
 
-	WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker));
+	WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker, false));
 	pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level) * PAGE_SIZE;
 	pgt->mm_ops->free_pages_exact(pgt->pgd, pgd_sz);
 	pgt->pgd = NULL;
-- 
2.36.0.rc0.470.gd361397f0d-goog

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [RFC PATCH 06/17] KVM: arm64: Implement break-before-make sequence for parallel walks
@ 2022-04-15 21:58   ` Oliver Upton
  0 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:58 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

The ARM architecture requires that software use the 'break-before-make'
sequence whenever memory is being remapped. An additional requirement of
parallel page walks is a mechanism to ensure exclusive access to a pte,
thereby avoiding two threads changing the pte and invariably stomping on
one another.

Roll the two concepts together into a new helper to implement the
'break' sequence. Use a special invalid pte value to indicate that the
pte is under the exclusive control of a thread. If software walkers are
traversing the tables in parallel, use an atomic compare-exchange to
break the pte. Retry execution on a failed attempt to break the pte, in
the hopes that either the instruction will succeed or the pte lock will
be successfully acquired.
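
For illustration only, here is a minimal user-space C model of that
locking protocol. It is a sketch under assumptions: kvm_pte_t, the bit
positions, the helper names and the (simplified) memory ordering below
are stand-ins, not the kernel code. An exclusive walker may store to
the entry directly, while a shared walker must win a compare-and-swap
against the value it last observed before it owns the entry:

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  typedef uint64_t kvm_pte_t;

  #define PTE_VALID	(1ULL << 0)
  #define PTE_LOCKED	(1ULL << 10)	/* invalid entry + lock bit */

  /* An entry is "locked" when it is invalid and carries the lock bit. */
  static bool pte_is_locked(kvm_pte_t pte)
  {
  	return !(pte & PTE_VALID) && (pte & PTE_LOCKED);
  }

  /* Plain store with exclusive access, compare-and-swap when shared. */
  static bool try_set_pte(_Atomic kvm_pte_t *ptep, kvm_pte_t old,
  			kvm_pte_t new, bool shared)
  {
  	if (!shared) {
  		atomic_store_explicit(ptep, new, memory_order_relaxed);
  		return true;
  	}

  	return atomic_compare_exchange_strong(ptep, &old, new);
  }

  /* 'break': take exclusive ownership of the entry before changing it. */
  static bool try_break_pte(_Atomic kvm_pte_t *ptep, kvm_pte_t old, bool shared)
  {
  	if (pte_is_locked(old))
  		return false;	/* another walker owns it; caller retries */

  	return try_set_pte(ptep, old, PTE_LOCKED, shared);
  }

  int main(void)
  {
  	_Atomic kvm_pte_t pte = PTE_VALID;
  	kvm_pte_t old = atomic_load(&pte);

  	if (try_break_pte(&pte, old, true))
  		/* 'make': publish the new entry, releasing ownership. */
  		atomic_store_explicit(&pte, PTE_VALID, memory_order_release);

  	printf("pte = %#llx\n", (unsigned long long)atomic_load(&pte));
  	return 0;
  }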

Avoid unnecessary DSBs and TLBIs by completing the sequence only if the
evicted pte was valid. For counted non-table ptes, drop the reference
immediately. Otherwise, references on tables are dropped in post-order
traversal, as the walker must recurse on the pruned subtree.

All of the new atomics do nothing (for now), as there are a few other
bits of the map walker that need to be addressed before actually walking
in parallel.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 172 +++++++++++++++++++++++++++++------
 1 file changed, 146 insertions(+), 26 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index bf46d6d24951..059ebb921125 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -49,6 +49,12 @@
 #define KVM_INVALID_PTE_OWNER_MASK	GENMASK(9, 2)
 #define KVM_MAX_OWNER_ID		1
 
+/*
+ * Used to indicate a pte for which a 'break-before-make' sequence is in
+ * progress.
+ */
+#define KVM_INVALID_PTE_LOCKED		BIT(10)
+
 struct kvm_pgtable_walk_data {
 	struct kvm_pgtable		*pgt;
 	struct kvm_pgtable_walker	*walker;
@@ -707,6 +713,122 @@ static bool stage2_pte_is_counted(kvm_pte_t pte)
 	return kvm_pte_valid(pte) || kvm_invalid_pte_owner(pte);
 }
 
+static bool stage2_pte_is_locked(kvm_pte_t pte)
+{
+	return !kvm_pte_valid(pte) && (pte & KVM_INVALID_PTE_LOCKED);
+}
+
+static inline bool kvm_try_set_pte(kvm_pte_t *ptep, kvm_pte_t old, kvm_pte_t new, bool shared)
+{
+	if (!shared) {
+		WRITE_ONCE(*ptep, new);
+		return true;
+	}
+
+	return cmpxchg(ptep, old, new) == old;
+}
+
+/**
+ * stage2_try_break_pte() - Invalidates a pte according to the
+ *			    'break-before-make' sequence.
+ *
+ * @ptep: Pointer to the pte to break
+ * @old: The previously observed value of the pte; used for compare-exchange in
+ *	 a parallel walk
+ * @addr: IPA corresponding to the pte
+ * @level: Table level of the pte
+ * @shared: true if the tables are shared by multiple software walkers
+ * @data: pointer to the map walker data
+ *
+ * Returns: true if the pte was successfully broken.
+ *
+ * If the removed pte was valid, performs the necessary DSB and TLB flush for
+ * the old value. Drops references to the page table if a non-table entry was
+ * removed. Otherwise, the table reference is preserved as the walker must also
+ * recurse through the child tables.
+ *
+ * See ARM DDI0487G.a D5.10.1 "General TLB maintenance requirements" for details
+ * on the 'break-before-make' sequence.
+ */
+static bool stage2_try_break_pte(kvm_pte_t *ptep, kvm_pte_t old, u64 addr, u32 level, bool shared,
+				 struct stage2_map_data *data)
+{
+	/*
+	 * Another thread could have already visited this pte and taken
+	 * ownership.
+	 */
+	if (stage2_pte_is_locked(old)) {
+		/*
+		 * If the table walker has exclusive access to the page tables
+		 * then no other software walkers should have locked the pte.
+		 */
+		WARN_ON(!shared);
+		return false;
+	}
+
+	if (!kvm_try_set_pte(ptep, old, KVM_INVALID_PTE_LOCKED, shared))
+		return false;
+
+	/*
+	 * If we removed a valid pte, break-before-make rules are in effect as a
+	 * translation may have been cached that traversed this entry.
+	 */
+	if (kvm_pte_valid(old)) {
+		dsb(ishst);
+
+		if (kvm_pte_table(old, level))
+			/*
+			 * Invalidate the whole stage-2, as we may have numerous leaf
+			 * entries below us which would otherwise need invalidating
+			 * individually.
+			 */
+			kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
+		else
+			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
+	}
+
+	/*
+	 * Don't drop the reference on table entries yet, as the walker must
+	 * first recurse on the unlinked subtree to unlink and drop references
+	 * to child tables.
+	 */
+	if (!kvm_pte_table(old, level) && stage2_pte_is_counted(old))
+		data->mm_ops->put_page(ptep);
+
+	return true;
+}
+
+/**
+ * stage2_make_pte() - Installs a new pte according to the 'break-before-make'
+ *		       sequence.
+ *
+ * @ptep: pointer to the pte to make
+ * @new: new pte value to install
+ *
+ * Assumes that the pte addressed by ptep has already been broken and is under
+ * the ownership of the table walker. If the new pte to be installed is a valid
+ * entry, perform a DSB to make the write visible. Raise the reference count on
+ * the table if the new pte requires a reference.
+ *
+ * See ARM DDI0487G.a D5.10.1 "General TLB maintenance requirements" for details
+ * on the 'break-before-make' sequence.
+ */
+static void stage2_make_pte(kvm_pte_t *ptep, kvm_pte_t new, struct kvm_pgtable_mm_ops *mm_ops)
+{
+	/* Yikes! We really shouldn't install to an entry we don't own. */
+	WARN_ON(!stage2_pte_is_locked(*ptep));
+
+	if (stage2_pte_is_counted(new))
+		mm_ops->get_page(ptep);
+
+	if (kvm_pte_valid(new)) {
+		WRITE_ONCE(*ptep, new);
+		dsb(ishst);
+	} else {
+		smp_store_release(ptep, new);
+	}
+}
+
 static void stage2_put_pte(kvm_pte_t *ptep, struct kvm_s2_mmu *mmu, u64 addr,
 			   u32 level, struct kvm_pgtable_mm_ops *mm_ops)
 {
@@ -760,18 +882,17 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 	else
 		new = kvm_init_invalid_leaf_owner(data->owner_id);
 
-	if (stage2_pte_is_counted(old)) {
-		/*
-		 * Skip updating the PTE if we are trying to recreate the exact
-		 * same mapping or only change the access permissions. Instead,
-		 * the vCPU will exit one more time from guest if still needed
-		 * and then go through the path of relaxing permissions.
-		 */
-		if (!stage2_pte_needs_update(old, new))
-			return -EAGAIN;
+	/*
+	 * Skip updating the PTE if we are trying to recreate the exact same
+	 * mapping or only change the access permissions. Instead, the vCPU will
+	 * exit one more time from the guest if still needed and then go through
+	 * the path of relaxing permissions.
+	 */
+	if (!stage2_pte_needs_update(old, new))
+		return -EAGAIN;
 
-		stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
-	}
+	if (!stage2_try_break_pte(ptep, old, addr, level, shared, data))
+		return -EAGAIN;
 
 	/* Perform CMOs before installation of the guest stage-2 PTE */
 	if (mm_ops->dcache_clean_inval_poc && stage2_pte_cacheable(pgt, new))
@@ -781,9 +902,7 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 	if (mm_ops->icache_inval_pou && stage2_pte_executable(new))
 		mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule);
 
-	smp_store_release(ptep, new);
-	if (stage2_pte_is_counted(new))
-		mm_ops->get_page(ptep);
+	stage2_make_pte(ptep, new, data->mm_ops);
 	if (kvm_phys_is_valid(phys))
 		data->phys += granule;
 	return 0;
@@ -800,15 +919,10 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 	if (!stage2_leaf_mapping_allowed(addr, end, level, data))
 		return 0;
 
-	data->childp = kvm_pte_follow(*old, data->mm_ops);
-	kvm_clear_pte(ptep);
+	if (!stage2_try_break_pte(ptep, *old, addr, level, shared, data))
+		return -EAGAIN;
 
-	/*
-	 * Invalidate the whole stage-2, as we may have numerous leaf
-	 * entries below us which would otherwise need invalidating
-	 * individually.
-	 */
-	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
+	data->childp = kvm_pte_follow(*old, data->mm_ops);
 	data->anchor = ptep;
 	return 0;
 }
@@ -837,18 +951,24 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	if (!data->memcache)
 		return -ENOMEM;
 
+	if (!stage2_try_break_pte(ptep, *old, addr, level, shared, data))
+		return -EAGAIN;
+
 	childp = mm_ops->zalloc_page(data->memcache);
-	if (!childp)
+	if (!childp) {
+		/*
+		 * Release the pte if we were unable to install a table to allow
+		 * another thread to make an attempt.
+		 */
+		stage2_make_pte(ptep, 0, data->mm_ops);
 		return -ENOMEM;
+	}
 
 	/*
 	 * If we've run into an existing block mapping then replace it with
 	 * a table. Accesses beyond 'end' that fall within the new table
 	 * will be mapped lazily.
 	 */
-	if (stage2_pte_is_counted(*old))
-		stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
-
 	kvm_set_table_pte(ptep, childp, mm_ops);
 	mm_ops->get_page(ptep);
 	*old = *ptep;
-- 
2.36.0.rc0.470.gd361397f0d-goog


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [RFC PATCH 07/17] KVM: arm64: Enlighten perm relax path about parallel walks
@ 2022-04-15 21:58   ` Oliver Upton
  0 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:58 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

To date the permission relax path of the stage-2 fault handler hasn't
had to worry about the paging structures changing under its nose, as map
operations acquire the write lock. That's about to change, which means a
permission relaxation walker could traverse in parallel with a map
operation.

If at any point during traversal the permission relax walker finds a
locked pte, bail immediately. Either the instruction will succeed or the
vCPU will fault once more and (hopefully) walk the tables successfully.
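
For illustration only, a small self-contained C sketch of that
read-side policy, again a model under assumptions (the names, bit
positions and attribute handling below are placeholders rather than the
kernel helpers). A walker that finds a locked entry, or loses the
compare-and-swap race, reports -EAGAIN so the vCPU simply retries:

  #include <errno.h>
  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  typedef uint64_t kvm_pte_t;

  #define PTE_VALID	(1ULL << 0)
  #define PTE_WRITE	(1ULL << 7)	/* placeholder permission bit */
  #define PTE_LOCKED	(1ULL << 10)	/* placeholder lock encoding */

  static bool pte_is_locked(kvm_pte_t pte)
  {
  	return !(pte & PTE_VALID) && (pte & PTE_LOCKED);
  }

  /* Relax permissions on one entry without ever waiting on the lock. */
  static int relax_perms(_Atomic kvm_pte_t *ptep, kvm_pte_t set, kvm_pte_t clr)
  {
  	kvm_pte_t old = atomic_load(ptep);
  	kvm_pte_t new;

  	if (pte_is_locked(old))
  		return -EAGAIN;	/* a map walker owns the entry; let the vCPU retry */

  	if (!(old & PTE_VALID))
  		return 0;

  	new = (old & ~clr) | set;

  	/* The entry changed underneath us: back off rather than spin. */
  	if (!atomic_compare_exchange_strong(ptep, &old, new))
  		return -EAGAIN;

  	return 0;
  }

  int main(void)
  {
  	_Atomic kvm_pte_t pte = PTE_VALID;

  	printf("relax_perms() = %d\n", relax_perms(&pte, PTE_WRITE, 0));
  	return 0;
  }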

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 059ebb921125..ff6f14755d0c 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1168,6 +1168,11 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	struct stage2_attr_data *data = arg;
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
 
+	if (stage2_pte_is_locked(pte)) {
+		WARN_ON(!shared);
+		return -EAGAIN;
+	}
+
 	if (!kvm_pte_valid(pte))
 		return 0;
 
@@ -1190,7 +1195,9 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 		    stage2_pte_executable(pte) && !stage2_pte_executable(*ptep))
 			mm_ops->icache_inval_pou(kvm_pte_follow(pte, mm_ops),
 						  kvm_granule_size(level));
-		WRITE_ONCE(*ptep, pte);
+
+		if (!kvm_try_set_pte(ptep, data->pte, pte, shared))
+			return -EAGAIN;
 	}
 
 	return 0;
@@ -1199,7 +1206,7 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 static int stage2_update_leaf_attrs(struct kvm_pgtable *pgt, u64 addr,
 				    u64 size, kvm_pte_t attr_set,
 				    kvm_pte_t attr_clr, kvm_pte_t *orig_pte,
-				    u32 *level)
+				    u32 *level, bool shared)
 {
 	int ret;
 	kvm_pte_t attr_mask = KVM_PTE_LEAF_ATTR_LO | KVM_PTE_LEAF_ATTR_HI;
@@ -1214,7 +1221,7 @@ static int stage2_update_leaf_attrs(struct kvm_pgtable *pgt, u64 addr,
 		.flags		= KVM_PGTABLE_WALK_LEAF,
 	};
 
-	ret = kvm_pgtable_walk(pgt, addr, size, &walker, false);
+	ret = kvm_pgtable_walk(pgt, addr, size, &walker, shared);
 	if (ret)
 		return ret;
 
@@ -1230,14 +1237,14 @@ int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size)
 {
 	return stage2_update_leaf_attrs(pgt, addr, size, 0,
 					KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W,
-					NULL, NULL);
+					NULL, NULL, false);
 }
 
 kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr)
 {
 	kvm_pte_t pte = 0;
 	stage2_update_leaf_attrs(pgt, addr, 1, KVM_PTE_LEAF_ATTR_LO_S2_AF, 0,
-				 &pte, NULL);
+				 &pte, NULL, false);
 	dsb(ishst);
 	return pte;
 }
@@ -1246,7 +1253,7 @@ kvm_pte_t kvm_pgtable_stage2_mkold(struct kvm_pgtable *pgt, u64 addr)
 {
 	kvm_pte_t pte = 0;
 	stage2_update_leaf_attrs(pgt, addr, 1, 0, KVM_PTE_LEAF_ATTR_LO_S2_AF,
-				 &pte, NULL);
+				 &pte, NULL, false);
 	/*
 	 * "But where's the TLBI?!", you scream.
 	 * "Over in the core code", I sigh.
@@ -1259,7 +1266,7 @@ kvm_pte_t kvm_pgtable_stage2_mkold(struct kvm_pgtable *pgt, u64 addr)
 bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr)
 {
 	kvm_pte_t pte = 0;
-	stage2_update_leaf_attrs(pgt, addr, 1, 0, 0, &pte, NULL);
+	stage2_update_leaf_attrs(pgt, addr, 1, 0, 0, &pte, NULL, false);
 	return pte & KVM_PTE_LEAF_ATTR_LO_S2_AF;
 }
 
@@ -1282,7 +1289,7 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
 	if (prot & KVM_PGTABLE_PROT_X)
 		clr |= KVM_PTE_LEAF_ATTR_HI_S2_XN;
 
-	ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, NULL, &level);
+	ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, NULL, &level, true);
 	if (!ret)
 		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, pgt->mmu, addr, level);
 	return ret;
-- 
2.36.0.rc0.470.gd361397f0d-goog


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [RFC PATCH 08/17] KVM: arm64: Spin off helper for initializing table pte
  2022-04-15 21:58 ` Oliver Upton
  (?)
@ 2022-04-15 21:58   ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:58 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

With parallel table walks there is no guarantee that KVM reads back the
same pte that was written. Spin off a helper that creates a pte value,
thereby allowing the visitor callback to return the next table without
reading the ptep again.
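
As a rough sketch of the intended usage (stage2_make_pte() is introduced
earlier in this series and is assumed here), the leaf visitor now builds
the table pte once and hands that same value back to the walker:

	/* build the value once, install it, and report it via *old
	 * rather than re-reading *ptep, which may already have been
	 * changed by a parallel walker */
	pte = kvm_init_table_pte(childp, mm_ops);
	stage2_make_pte(ptep, pte, mm_ops);
	*old = pte;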

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index ff6f14755d0c..ffdfd5ee9642 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -167,14 +167,23 @@ static void kvm_clear_pte(kvm_pte_t *ptep)
 	WRITE_ONCE(*ptep, 0);
 }
 
-static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t *childp,
-			      struct kvm_pgtable_mm_ops *mm_ops)
+static kvm_pte_t kvm_init_table_pte(kvm_pte_t *childp, struct kvm_pgtable_mm_ops *mm_ops)
 {
-	kvm_pte_t old = *ptep, pte = kvm_phys_to_pte(mm_ops->virt_to_phys(childp));
+	kvm_pte_t pte = kvm_phys_to_pte(mm_ops->virt_to_phys(childp));
 
 	pte |= FIELD_PREP(KVM_PTE_TYPE, KVM_PTE_TYPE_TABLE);
 	pte |= KVM_PTE_VALID;
 
+	return pte;
+}
+
+static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t *childp,
+			      struct kvm_pgtable_mm_ops *mm_ops)
+{
+	kvm_pte_t pte, old = *ptep;
+
+	pte = kvm_init_table_pte(childp, mm_ops);
+
 	WARN_ON(kvm_pte_valid(old));
 	smp_store_release(ptep, pte);
 }
@@ -931,7 +940,7 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 				kvm_pte_t *old, struct stage2_map_data *data, bool shared)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
-	kvm_pte_t *childp;
+	kvm_pte_t *childp, pte;
 	int ret;
 
 	if (data->anchor) {
@@ -969,9 +978,9 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	 * a table. Accesses beyond 'end' that fall within the new table
 	 * will be mapped lazily.
 	 */
-	kvm_set_table_pte(ptep, childp, mm_ops);
-	mm_ops->get_page(ptep);
-	*old = *ptep;
+	pte = kvm_init_table_pte(childp, mm_ops);
+	stage2_make_pte(ptep, pte, data->mm_ops);
+	*old = pte;
 	return 0;
 }
 
-- 
2.36.0.rc0.470.gd361397f0d-goog


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [RFC PATCH 09/17] KVM: arm64: Tear down unlinked page tables in parallel walk
  2022-04-15 21:58 ` Oliver Upton
  (?)
@ 2022-04-15 21:58   ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:58 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

Breaking a table pte is insufficient to guarantee ownership of an
unlinked subtree. Parallel software walkers could be traversing
substructures and changing their mappings.

Recurse through the unlinked subtree and lock all descendent ptes
to take ownership of the subtree. Since the ptes are actually being
evicted, return table ptes back to the table walker to ensure child
tables are also traversed. Note that this is done in both the
pre-order and leaf visitors as the underlying pte remains volatile until
it is unlinked.
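
To make the flow concrete, a hedged sketch of the pruning visitors once
data->anchor is set (this mirrors the hunks below; the generic walker
descending into any table pte reported via *old is behaviour assumed
from the earlier walker rework in this series):

	if (data->anchor) {
		/* lock the pte; if it was a table, report it back so
		 * the walker also recurses into the unreachable child */
		*old = stage2_unlink_pte(ptep, level, shared, data->mm_ops);
		return 0;
	}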

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 56 +++++++++++++++++++++++++++++++++---
 1 file changed, 52 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index ffdfd5ee9642..146fc44acf31 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -838,6 +838,54 @@ static void stage2_make_pte(kvm_pte_t *ptep, kvm_pte_t new, struct kvm_pgtable_m
 	}
 }
 
+static kvm_pte_t stage2_unlink_pte_shared(kvm_pte_t *ptep)
+{
+	kvm_pte_t old;
+
+	while (true) {
+		old = xchg(ptep, KVM_INVALID_PTE_LOCKED);
+		if (old != KVM_INVALID_PTE_LOCKED)
+			return old;
+
+		cpu_relax();
+	}
+}
+
+
+/**
+ * stage2_unlink_pte() - Tears down an unreachable pte, returning the next pte
+ *			 to visit (if any).
+ *
+ * @ptep: pointer to the pte to unlink
+ * @level: page table level of the pte
+ * @shared: true if the tables are shared by multiple software walkers
+ * @mm_ops: pointer to the mm ops table
+ *
+ * Return: a table pte if another level of recursion is necessary, 0 otherwise.
+ */
+static kvm_pte_t stage2_unlink_pte(kvm_pte_t *ptep, u32 level, bool shared,
+				   struct kvm_pgtable_mm_ops *mm_ops)
+{
+	kvm_pte_t old;
+
+	if (shared) {
+		old = stage2_unlink_pte_shared(ptep);
+	} else {
+		old = *ptep;
+		WRITE_ONCE(*ptep, KVM_INVALID_PTE_LOCKED);
+	}
+
+	WARN_ON(stage2_pte_is_locked(old));
+
+	if (kvm_pte_table(old, level))
+		return old;
+
+	if (stage2_pte_is_counted(old))
+		mm_ops->put_page(ptep);
+
+	return 0;
+}
+
 static void stage2_put_pte(kvm_pte_t *ptep, struct kvm_s2_mmu *mmu, u64 addr,
 			   u32 level, struct kvm_pgtable_mm_ops *mm_ops)
 {
@@ -922,8 +970,10 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 				     struct stage2_map_data *data,
 				     bool shared)
 {
-	if (data->anchor)
+	if (data->anchor) {
+		*old = stage2_unlink_pte(ptep, level, shared, data->mm_ops);
 		return 0;
+	}
 
 	if (!stage2_leaf_mapping_allowed(addr, end, level, data))
 		return 0;
@@ -944,9 +994,7 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	int ret;
 
 	if (data->anchor) {
-		if (stage2_pte_is_counted(*old))
-			mm_ops->put_page(ptep);
-
+		*old = stage2_unlink_pte(ptep, level, shared, data->mm_ops);
 		return 0;
 	}
 
-- 
2.36.0.rc0.470.gd361397f0d-goog


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [RFC PATCH 10/17] KVM: arm64: Assume a table pte is already owned in post-order traversal
  2022-04-15 21:58 ` Oliver Upton
  (?)
@ 2022-04-15 21:58   ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:58 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

For parallel walks that collapse a table into a block, KVM ensures a
locked invalid pte is visible to all observers in pre-order traversal.
As such, there is no need to try breaking the pte again.

Directly set the pte if it has already been broken.
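
A minimal sketch of the resulting try_leaf flow (the break/make helpers
come from earlier in the series; 'locked' is the argument added by this
patch):

	/* the pre-order visitor already published the locked marker,
	 * so only break the pte when we do not own it yet */
	if (!locked && !stage2_try_break_pte(ptep, old, addr, level, shared, data))
		return -EAGAIN;

	/* ... CMOs before installation, as in the existing path ... */
	stage2_make_pte(ptep, new, data->mm_ops);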

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/kvm/hyp/pgtable.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 146fc44acf31..121818d4c33e 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -924,7 +924,7 @@ static bool stage2_leaf_mapping_allowed(u64 addr, u64 end, u32 level,
 static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 				      kvm_pte_t *ptep, kvm_pte_t old,
 				      struct stage2_map_data *data,
-				      bool shared)
+				      bool shared, bool locked)
 {
 	kvm_pte_t new;
 	u64 granule = kvm_granule_size(level), phys = data->phys;
@@ -948,7 +948,7 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
 	if (!stage2_pte_needs_update(old, new))
 		return -EAGAIN;
 
-	if (!stage2_try_break_pte(ptep, old, addr, level, shared, data))
+	if (!locked && !stage2_try_break_pte(ptep, old, addr, level, shared, data))
 		return -EAGAIN;
 
 	/* Perform CMOs before installation of the guest stage-2 PTE */
@@ -987,7 +987,8 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 }
 
 static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
-				kvm_pte_t *old, struct stage2_map_data *data, bool shared)
+				kvm_pte_t *old, struct stage2_map_data *data, bool shared,
+				bool locked)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
 	kvm_pte_t *childp, pte;
@@ -998,10 +999,13 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 		return 0;
 	}
 
-	ret = stage2_map_walker_try_leaf(addr, end, level, ptep, *old, data, shared);
+	ret = stage2_map_walker_try_leaf(addr, end, level, ptep, *old, data, shared, locked);
 	if (ret != -E2BIG)
 		return ret;
 
+	/* We should never attempt installing a table in post-order */
+	WARN_ON(locked);
+
 	if (WARN_ON(level == KVM_PGTABLE_MAX_LEVELS - 1))
 		return -EINVAL;
 
@@ -1048,7 +1052,13 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
 		childp = data->childp;
 		data->anchor = NULL;
 		data->childp = NULL;
-		ret = stage2_map_walk_leaf(addr, end, level, ptep, old, data, shared);
+
+		/*
+		 * We are guaranteed exclusive access to the pte in post-order
+		 * traversal since the locked value was made visible to all
+		 * observers in stage2_map_walk_table_pre.
+		 */
+		ret = stage2_map_walk_leaf(addr, end, level, ptep, old, data, shared, true);
 	} else {
 		childp = kvm_pte_follow(*old, mm_ops);
 	}
@@ -1087,7 +1097,7 @@ static int stage2_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_
 	case KVM_PGTABLE_WALK_TABLE_PRE:
 		return stage2_map_walk_table_pre(addr, end, level, ptep, old, data, shared);
 	case KVM_PGTABLE_WALK_LEAF:
-		return stage2_map_walk_leaf(addr, end, level, ptep, old, data, shared);
+		return stage2_map_walk_leaf(addr, end, level, ptep, old, data, shared, false);
 	case KVM_PGTABLE_WALK_TABLE_POST:
 		return stage2_map_walk_table_post(addr, end, level, ptep, old, data, shared);
 	}
-- 
2.36.0.rc0.470.gd361397f0d-goog


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [RFC PATCH 11/17] KVM: arm64: Move MMU cache init/destroy into helpers
  2022-04-15 21:58 ` Oliver Upton
  (?)
@ 2022-04-15 21:58   ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:58 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

No functional change intended.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/include/asm/kvm_mmu.h |  2 ++
 arch/arm64/kvm/arm.c             |  4 ++--
 arch/arm64/kvm/mmu.c             | 10 ++++++++++
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 74735a864eee..3bb7b678a7e7 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -172,6 +172,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
 phys_addr_t kvm_mmu_get_httbr(void);
 phys_addr_t kvm_get_idmap_vector(void);
 int kvm_mmu_init(u32 *hyp_va_bits);
+void kvm_mmu_vcpu_init(struct kvm_vcpu *vcpu);
+void kvm_mmu_vcpu_destroy(struct kvm_vcpu *vcpu);
 
 static inline void *__kvm_vector_slot2addr(void *base,
 					   enum arm64_hyp_spectre_vector slot)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 523bc934fe2f..f7862fec1595 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -320,7 +320,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.target = -1;
 	bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
 
-	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
+	kvm_mmu_vcpu_init(vcpu);
 
 	/* Set up the timer */
 	kvm_timer_vcpu_init(vcpu);
@@ -349,7 +349,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 	if (vcpu_has_run_once(vcpu) && unlikely(!irqchip_in_kernel(vcpu->kvm)))
 		static_branch_dec(&userspace_irqchip_in_use);
 
-	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
+	kvm_mmu_vcpu_destroy(vcpu);
 	kvm_timer_vcpu_terminate(vcpu);
 	kvm_pmu_vcpu_destroy(vcpu);
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 53ae2c0640bc..f29d5179196b 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1601,6 +1601,16 @@ int kvm_mmu_init(u32 *hyp_va_bits)
 	return err;
 }
 
+void kvm_mmu_vcpu_init(struct kvm_vcpu *vcpu)
+{
+	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
+}
+
+void kvm_mmu_vcpu_destroy(struct kvm_vcpu *vcpu)
+{
+	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
+}
+
 void kvm_arch_commit_memory_region(struct kvm *kvm,
 				   struct kvm_memory_slot *old,
 				   const struct kvm_memory_slot *new,
-- 
2.36.0.rc0.470.gd361397f0d-goog


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [RFC PATCH 12/17] KVM: arm64: Stuff mmu page cache in sub struct
  2022-04-15 21:58 ` Oliver Upton
  (?)
@ 2022-04-15 21:58   ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:58 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

We're about to add another mmu cache. Stuff the current one in a sub
struct so it's easier to pass them all to ->zalloc_page().

No functional change intended.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  4 +++-
 arch/arm64/kvm/mmu.c              | 14 +++++++-------
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 94a27a7520f4..c8947597a619 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -372,7 +372,9 @@ struct kvm_vcpu_arch {
 	bool pause;
 
 	/* Cache some mmu pages needed inside spinlock regions */
-	struct kvm_mmu_memory_cache mmu_page_cache;
+	struct kvm_mmu_caches {
+		struct kvm_mmu_memory_cache page_cache;
+	} mmu_caches;
 
 	/* Target CPU and feature flags */
 	int target;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index f29d5179196b..7a588928740a 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -91,10 +91,10 @@ static bool kvm_is_device_pfn(unsigned long pfn)
 
 static void *stage2_memcache_zalloc_page(void *arg)
 {
-	struct kvm_mmu_memory_cache *mc = arg;
+	struct kvm_mmu_caches *mmu_caches = arg;
 
 	/* Allocated with __GFP_ZERO, so no need to zero */
-	return kvm_mmu_memory_cache_alloc(mc);
+	return kvm_mmu_memory_cache_alloc(&mmu_caches->page_cache);
 }
 
 static void *kvm_host_zalloc_pages_exact(size_t size)
@@ -1073,7 +1073,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	bool shared;
 	unsigned long mmu_seq;
 	struct kvm *kvm = vcpu->kvm;
-	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
+	struct kvm_mmu_caches *mmu_caches = &vcpu->arch.mmu_caches;
 	struct vm_area_struct *vma;
 	short vma_shift;
 	gfn_t gfn;
@@ -1160,7 +1160,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * and a write fault needs to collapse a block entry into a table.
 	 */
 	if (fault_status != FSC_PERM || (logging_active && write_fault)) {
-		ret = kvm_mmu_topup_memory_cache(memcache,
+		ret = kvm_mmu_topup_memory_cache(&mmu_caches->page_cache,
 						 kvm_mmu_cache_min_pages(kvm));
 		if (ret)
 			return ret;
@@ -1273,7 +1273,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
 		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
 					     __pfn_to_phys(pfn), prot,
-					     memcache);
+					     mmu_caches);
 	}
 
 	/* Mark the page dirty only if the fault is handled successfully */
@@ -1603,12 +1603,12 @@ int kvm_mmu_init(u32 *hyp_va_bits)
 
 void kvm_mmu_vcpu_init(struct kvm_vcpu *vcpu)
 {
-	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
+	vcpu->arch.mmu_caches.page_cache.gfp_zero = __GFP_ZERO;
 }
 
 void kvm_mmu_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
-	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
+	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_caches.page_cache);
 }
 
 void kvm_arch_commit_memory_region(struct kvm *kvm,
-- 
2.36.0.rc0.470.gd361397f0d-goog


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [RFC PATCH 12/17] KVM: arm64: Stuff mmu page cache in sub struct
@ 2022-04-15 21:58   ` Oliver Upton
  0 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:58 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, Ben Gardon, Peter Shier, David Matlack,
	Paolo Bonzini, linux-arm-kernel

We're about to add another mmu cache. Stuff the current one in a sub
struct so its easier to pass them all to ->zalloc_page().

No functional change intended.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  4 +++-
 arch/arm64/kvm/mmu.c              | 14 +++++++-------
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 94a27a7520f4..c8947597a619 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -372,7 +372,9 @@ struct kvm_vcpu_arch {
 	bool pause;
 
 	/* Cache some mmu pages needed inside spinlock regions */
-	struct kvm_mmu_memory_cache mmu_page_cache;
+	struct kvm_mmu_caches {
+		struct kvm_mmu_memory_cache page_cache;
+	} mmu_caches;
 
 	/* Target CPU and feature flags */
 	int target;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index f29d5179196b..7a588928740a 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -91,10 +91,10 @@ static bool kvm_is_device_pfn(unsigned long pfn)
 
 static void *stage2_memcache_zalloc_page(void *arg)
 {
-	struct kvm_mmu_memory_cache *mc = arg;
+	struct kvm_mmu_caches *mmu_caches = arg;
 
 	/* Allocated with __GFP_ZERO, so no need to zero */
-	return kvm_mmu_memory_cache_alloc(mc);
+	return kvm_mmu_memory_cache_alloc(&mmu_caches->page_cache);
 }
 
 static void *kvm_host_zalloc_pages_exact(size_t size)
@@ -1073,7 +1073,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	bool shared;
 	unsigned long mmu_seq;
 	struct kvm *kvm = vcpu->kvm;
-	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
+	struct kvm_mmu_caches *mmu_caches = &vcpu->arch.mmu_caches;
 	struct vm_area_struct *vma;
 	short vma_shift;
 	gfn_t gfn;
@@ -1160,7 +1160,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * and a write fault needs to collapse a block entry into a table.
 	 */
 	if (fault_status != FSC_PERM || (logging_active && write_fault)) {
-		ret = kvm_mmu_topup_memory_cache(memcache,
+		ret = kvm_mmu_topup_memory_cache(&mmu_caches->page_cache,
 						 kvm_mmu_cache_min_pages(kvm));
 		if (ret)
 			return ret;
@@ -1273,7 +1273,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
 		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
 					     __pfn_to_phys(pfn), prot,
-					     memcache);
+					     mmu_caches);
 	}
 
 	/* Mark the page dirty only if the fault is handled successfully */
@@ -1603,12 +1603,12 @@ int kvm_mmu_init(u32 *hyp_va_bits)
 
 void kvm_mmu_vcpu_init(struct kvm_vcpu *vcpu)
 {
-	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
+	vcpu->arch.mmu_caches.page_cache.gfp_zero = __GFP_ZERO;
 }
 
 void kvm_mmu_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
-	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
+	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_caches.page_cache);
 }
 
 void kvm_arch_commit_memory_region(struct kvm *kvm,
-- 
2.36.0.rc0.470.gd361397f0d-goog

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [RFC PATCH 12/17] KVM: arm64: Stuff mmu page cache in sub struct
@ 2022-04-15 21:58   ` Oliver Upton
  0 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:58 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

We're about to add another mmu cache. Stuff the current one in a sub
struct so its easier to pass them all to ->zalloc_page().

No functional change intended.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  4 +++-
 arch/arm64/kvm/mmu.c              | 14 +++++++-------
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 94a27a7520f4..c8947597a619 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -372,7 +372,9 @@ struct kvm_vcpu_arch {
 	bool pause;
 
 	/* Cache some mmu pages needed inside spinlock regions */
-	struct kvm_mmu_memory_cache mmu_page_cache;
+	struct kvm_mmu_caches {
+		struct kvm_mmu_memory_cache page_cache;
+	} mmu_caches;
 
 	/* Target CPU and feature flags */
 	int target;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index f29d5179196b..7a588928740a 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -91,10 +91,10 @@ static bool kvm_is_device_pfn(unsigned long pfn)
 
 static void *stage2_memcache_zalloc_page(void *arg)
 {
-	struct kvm_mmu_memory_cache *mc = arg;
+	struct kvm_mmu_caches *mmu_caches = arg;
 
 	/* Allocated with __GFP_ZERO, so no need to zero */
-	return kvm_mmu_memory_cache_alloc(mc);
+	return kvm_mmu_memory_cache_alloc(&mmu_caches->page_cache);
 }
 
 static void *kvm_host_zalloc_pages_exact(size_t size)
@@ -1073,7 +1073,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	bool shared;
 	unsigned long mmu_seq;
 	struct kvm *kvm = vcpu->kvm;
-	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
+	struct kvm_mmu_caches *mmu_caches = &vcpu->arch.mmu_caches;
 	struct vm_area_struct *vma;
 	short vma_shift;
 	gfn_t gfn;
@@ -1160,7 +1160,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * and a write fault needs to collapse a block entry into a table.
 	 */
 	if (fault_status != FSC_PERM || (logging_active && write_fault)) {
-		ret = kvm_mmu_topup_memory_cache(memcache,
+		ret = kvm_mmu_topup_memory_cache(&mmu_caches->page_cache,
 						 kvm_mmu_cache_min_pages(kvm));
 		if (ret)
 			return ret;
@@ -1273,7 +1273,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
 		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
 					     __pfn_to_phys(pfn), prot,
-					     memcache);
+					     mmu_caches);
 	}
 
 	/* Mark the page dirty only if the fault is handled successfully */
@@ -1603,12 +1603,12 @@ int kvm_mmu_init(u32 *hyp_va_bits)
 
 void kvm_mmu_vcpu_init(struct kvm_vcpu *vcpu)
 {
-	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
+	vcpu->arch.mmu_caches.page_cache.gfp_zero = __GFP_ZERO;
 }
 
 void kvm_mmu_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
-	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
+	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_caches.page_cache);
 }
 
 void kvm_arch_commit_memory_region(struct kvm *kvm,
-- 
2.36.0.rc0.470.gd361397f0d-goog



^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [RFC PATCH 13/17] KVM: arm64: Setup cache for stage2 page headers
@ 2022-04-15 21:58   ` Oliver Upton
  0 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:58 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

In order to punt the last reference drop on a page to an RCU
synchronization we need to get a pointer to the page to handle the
callback.

Set up a memcache for stage2 page headers, but do nothing with it for
now. Note that the kmem_cache is never destroyed as it is currently not
possible to build KVM/arm64 as a module.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  1 +
 arch/arm64/kvm/mmu.c              | 20 ++++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index c8947597a619..a640d015790e 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -374,6 +374,7 @@ struct kvm_vcpu_arch {
 	/* Cache some mmu pages needed inside spinlock regions */
 	struct kvm_mmu_caches {
 		struct kvm_mmu_memory_cache page_cache;
+		struct kvm_mmu_memory_cache header_cache;
 	} mmu_caches;
 
 	/* Target CPU and feature flags */
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7a588928740a..cc6ed6b06ec2 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -31,6 +31,12 @@ static phys_addr_t hyp_idmap_vector;
 
 static unsigned long io_map_base;
 
+static struct kmem_cache *stage2_page_header_cache;
+
+struct stage2_page_header {
+	struct rcu_head rcu_head;
+	struct page *page;
+};
 
 /*
  * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
@@ -1164,6 +1170,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 						 kvm_mmu_cache_min_pages(kvm));
 		if (ret)
 			return ret;
+
+		ret = kvm_mmu_topup_memory_cache(&mmu_caches->header_cache,
+						 kvm_mmu_cache_min_pages(kvm));
+		if (ret)
+			return ret;
 	}
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
@@ -1589,6 +1600,13 @@ int kvm_mmu_init(u32 *hyp_va_bits)
 	if (err)
 		goto out_destroy_pgtable;
 
+	stage2_page_header_cache = kmem_cache_create("stage2_page_header",
+						     sizeof(struct stage2_page_header),
+						     0, SLAB_ACCOUNT, NULL);
+
+	if (!stage2_page_header_cache)
+		goto out_destroy_pgtable;
+
 	io_map_base = hyp_idmap_start;
 	return 0;
 
@@ -1604,11 +1622,13 @@ int kvm_mmu_init(u32 *hyp_va_bits)
 void kvm_mmu_vcpu_init(struct kvm_vcpu *vcpu)
 {
 	vcpu->arch.mmu_caches.page_cache.gfp_zero = __GFP_ZERO;
+	vcpu->arch.mmu_caches.header_cache.kmem_cache = stage2_page_header_cache;
 }
 
 void kvm_mmu_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_caches.page_cache);
+	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_caches.header_cache);
 }
 
 void kvm_arch_commit_memory_region(struct kvm *kvm,
-- 
2.36.0.rc0.470.gd361397f0d-goog
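
As a sketch of how this header is meant to be tied to its table page in the
following patch (the two helpers below are illustrative, not part of the
series), the page and its header point at each other so the RCU callback can
recover both:

	/*
	 * Illustrative only: the allocation side stores the linkage with
	 *
	 *	hdr->page = virt_to_page(addr);
	 *	set_page_private(hdr->page, (unsigned long)hdr);
	 *
	 * which can then be walked back from either handle.
	 */
	static inline struct stage2_page_header *stage2_page_to_header(struct page *page)
	{
		return (struct stage2_page_header *)page_private(page);
	}

	static inline struct stage2_page_header *stage2_rcu_to_header(struct rcu_head *head)
	{
		return container_of(head, struct stage2_page_header, rcu_head);
	}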


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [RFC PATCH 14/17] KVM: arm64: Punt last page reference to rcu callback for parallel walk
@ 2022-04-15 21:58   ` Oliver Upton
  0 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:58 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

It is possible that a table page remains visible to another thread until
the next rcu synchronization event. As such, we cannot drop the last
page reference synchronously with the post-order traversal for a parallel
table walk.

Schedule an rcu callback to clean up the child table page for parallel
walks.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/include/asm/kvm_pgtable.h |  3 ++
 arch/arm64/kvm/hyp/pgtable.c         | 24 +++++++++++++--
 arch/arm64/kvm/mmu.c                 | 44 +++++++++++++++++++++++++++-
 3 files changed, 67 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 74955aba5918..52e55e00f0ca 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -81,6 +81,8 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
  * @put_page:			Decrement the refcount on a page. When the
  *				refcount reaches 0 the page is automatically
  *				freed.
+ * @free_table:			Drop the last page reference, possibly in the
+ *				next RCU sync if doing a shared walk.
  * @page_count:			Return the refcount of a page.
  * @phys_to_virt:		Convert a physical address into a virtual
  *				address	mapped in the current context.
@@ -98,6 +100,7 @@ struct kvm_pgtable_mm_ops {
 	void		(*get_page)(void *addr);
 	void		(*put_page)(void *addr);
 	int		(*page_count)(void *addr);
+	void		(*free_table)(void *addr, bool shared);
 	void*		(*phys_to_virt)(phys_addr_t phys);
 	phys_addr_t	(*virt_to_phys)(void *addr);
 	void		(*dcache_clean_inval_poc)(void *addr, size_t size);
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 121818d4c33e..a9a48edba63b 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -147,12 +147,19 @@ static inline void kvm_pgtable_walk_end(void)
 {}
 
 #define kvm_dereference_ptep	rcu_dereference_raw
+
+static inline void kvm_pgtable_destroy_barrier(void)
+{}
+
 #else
 #define kvm_pgtable_walk_begin	rcu_read_lock
 
 #define kvm_pgtable_walk_end	rcu_read_unlock
 
 #define kvm_dereference_ptep	rcu_dereference
+
+#define kvm_pgtable_destroy_barrier	rcu_barrier
+
 #endif
 
 static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte, struct kvm_pgtable_mm_ops *mm_ops)
@@ -1063,7 +1070,12 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
 		childp = kvm_pte_follow(*old, mm_ops);
 	}
 
-	mm_ops->put_page(childp);
+	/*
+	 * If we do not have exclusive access to the page tables it is possible
+	 * the unlinked table remains visible to another thread until the next
+	 * rcu synchronization.
+	 */
+	mm_ops->free_table(childp, shared);
 	mm_ops->put_page(ptep);
 
 	return ret;
@@ -1203,7 +1215,7 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 					       kvm_granule_size(level));
 
 	if (childp)
-		mm_ops->put_page(childp);
+		mm_ops->free_table(childp, shared);
 
 	return 0;
 }
@@ -1433,7 +1445,7 @@ static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	mm_ops->put_page(ptep);
 
 	if (kvm_pte_table(*old, level))
-		mm_ops->put_page(kvm_pte_follow(*old, mm_ops));
+		mm_ops->free_table(kvm_pte_follow(*old, mm_ops), shared);
 
 	return 0;
 }
@@ -1452,4 +1464,10 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
 	pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level) * PAGE_SIZE;
 	pgt->mm_ops->free_pages_exact(pgt->pgd, pgd_sz);
 	pgt->pgd = NULL;
+
+	/*
+	 * Guarantee that all unlinked subtrees associated with the stage2 page
+	 * table have also been freed before returning.
+	 */
+	kvm_pgtable_destroy_barrier();
 }
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index cc6ed6b06ec2..6ecf37009c21 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -98,9 +98,50 @@ static bool kvm_is_device_pfn(unsigned long pfn)
 static void *stage2_memcache_zalloc_page(void *arg)
 {
 	struct kvm_mmu_caches *mmu_caches = arg;
+	struct stage2_page_header *hdr;
+	void *addr;
 
 	/* Allocated with __GFP_ZERO, so no need to zero */
-	return kvm_mmu_memory_cache_alloc(&mmu_caches->page_cache);
+	addr = kvm_mmu_memory_cache_alloc(&mmu_caches->page_cache);
+	if (!addr)
+		return NULL;
+
+	hdr = kvm_mmu_memory_cache_alloc(&mmu_caches->header_cache);
+	if (!hdr) {
+		free_page((unsigned long)addr);
+		return NULL;
+	}
+
+	hdr->page = virt_to_page(addr);
+	set_page_private(hdr->page, (unsigned long)hdr);
+	return addr;
+}
+
+static void stage2_free_page_now(struct stage2_page_header *hdr)
+{
+	WARN_ON(page_ref_count(hdr->page) != 1);
+
+	__free_page(hdr->page);
+	kmem_cache_free(stage2_page_header_cache, hdr);
+}
+
+static void stage2_free_page_rcu_cb(struct rcu_head *head)
+{
+	struct stage2_page_header *hdr = container_of(head, struct stage2_page_header,
+						      rcu_head);
+
+	stage2_free_page_now(hdr);
+}
+
+static void stage2_free_table(void *addr, bool shared)
+{
+	struct page *page = virt_to_page(addr);
+	struct stage2_page_header *hdr = (struct stage2_page_header *)page_private(page);
+
+	if (shared)
+		call_rcu(&hdr->rcu_head, stage2_free_page_rcu_cb);
+	else
+		stage2_free_page_now(hdr);
 }
 
 static void *kvm_host_zalloc_pages_exact(size_t size)
@@ -613,6 +654,7 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
 	.free_pages_exact	= free_pages_exact,
 	.get_page		= kvm_host_get_page,
 	.put_page		= kvm_host_put_page,
+	.free_table		= stage2_free_table,
 	.page_count		= kvm_host_page_count,
 	.phys_to_virt		= kvm_host_va,
 	.virt_to_phys		= kvm_host_pa,
-- 
2.36.0.rc0.470.gd361397f0d-goog
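
To illustrate why the free must be deferred, here is a minimal sketch of the
shared walk the deferral protects. It only uses helpers visible in the hunks
above; the function itself is illustrative and not part of the series:

	/* Illustrative only: a table unlinked by another vCPU must stay valid
	 * until every walker that might still see it has left its RCU
	 * read-side critical section, which is what call_rcu() guarantees. */
	static void example_shared_walk(kvm_pte_t *ptep, u32 level,
					struct kvm_pgtable_mm_ops *mm_ops)
	{
		kvm_pte_t pte;

		kvm_pgtable_walk_begin();		/* rcu_read_lock() */
		pte = READ_ONCE(*ptep);
		if (kvm_pte_table(pte, level)) {
			kvm_pte_t *childp = kvm_pte_follow(pte, mm_ops);

			/* Safe to walk childp here, even if another walker has
			 * already unlinked it and scheduled the page for
			 * freeing. */
			WARN_ON(!childp);
		}
		kvm_pgtable_walk_end();			/* rcu_read_unlock() */
	}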


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [RFC PATCH 15/17] KVM: arm64: Allow parallel calls to kvm_pgtable_stage2_map()
@ 2022-04-15 21:58   ` Oliver Upton
  0 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:58 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

The map walker is now apprised of how to walk the tables in parallel
with another table walker. Take a parameter indicating whether or not a
walk is done in parallel so that the atomicity/locking requirements on
ptes can be relaxed when it is not.

Defer actually using parallel walks to a later change.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/include/asm/kvm_pgtable.h  | 4 +++-
 arch/arm64/kvm/hyp/nvhe/mem_protect.c | 2 +-
 arch/arm64/kvm/hyp/pgtable.c          | 4 ++--
 arch/arm64/kvm/mmu.c                  | 6 +++---
 4 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 52e55e00f0ca..9830eea19de4 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -328,6 +328,8 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
  * @prot:	Permissions and attributes for the mapping.
  * @mc:		Cache of pre-allocated and zeroed memory from which to allocate
  *		page-table pages.
+ * @shared:	true if multiple software walkers could be traversing the tables
+ *		in parallel
  *
  * The offset of @addr within a page is ignored, @size is rounded-up to
  * the next page boundary and @phys is rounded-down to the previous page
@@ -349,7 +351,7 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
  */
 int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
 			   u64 phys, enum kvm_pgtable_prot prot,
-			   void *mc);
+			   void *mc, bool shared);
 
 /**
  * kvm_pgtable_stage2_set_owner() - Unmap and annotate pages in the IPA space to
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 42a5f35cd819..53b172036c2a 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -251,7 +251,7 @@ static inline int __host_stage2_idmap(u64 start, u64 end,
 				      enum kvm_pgtable_prot prot)
 {
 	return kvm_pgtable_stage2_map(&host_kvm.pgt, start, end - start, start,
-				      prot, &host_s2_pool);
+				      prot, &host_s2_pool, false);
 }
 
 /*
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index a9a48edba63b..20ff198ebef7 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1119,7 +1119,7 @@ static int stage2_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_
 
 int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
 			   u64 phys, enum kvm_pgtable_prot prot,
-			   void *mc)
+			   void *mc, bool shared)
 {
 	int ret;
 	struct stage2_map_data map_data = {
@@ -1144,7 +1144,7 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
 	if (ret)
 		return ret;
 
-	ret = kvm_pgtable_walk(pgt, addr, size, &walker, false);
+	ret = kvm_pgtable_walk(pgt, addr, size, &walker, shared);
 	dsb(ishst);
 	return ret;
 }
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6ecf37009c21..63cf18cdb978 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -832,7 +832,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 
 		write_lock(&kvm->mmu_lock);
 		ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
-					     &cache);
+					     &cache, false);
 		write_unlock(&kvm->mmu_lock);
 		if (ret)
 			break;
@@ -1326,7 +1326,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
 		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
 					     __pfn_to_phys(pfn), prot,
-					     mmu_caches);
+					     mmu_caches, true);
 	}
 
 	/* Mark the page dirty only if the fault is handled successfully */
@@ -1526,7 +1526,7 @@ bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	 */
 	kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT,
 			       PAGE_SIZE, __pfn_to_phys(pfn),
-			       KVM_PGTABLE_PROT_R, NULL);
+			       KVM_PGTABLE_PROT_R, NULL, false);
 
 	return false;
 }
-- 
2.36.0.rc0.470.gd361397f0d-goog
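
In other words, each call site now states whether it can race with other
walkers. Both calls below are lifted from the hunks above, only condensed for
illustration:

	/* Paths that still run with exclusive access to the tables (e.g. the
	 * host stage 2 idmap or kvm_phys_addr_ioremap() under the write lock)
	 * keep the old behaviour: */
	ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
				     &cache, false);

	/* The stage 2 fault handler may race with other vCPUs, so it asks for
	 * the shared flavour of the walk (actually relied upon in a later
	 * patch): */
	ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
				     __pfn_to_phys(pfn), prot,
				     mmu_caches, true);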


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [RFC PATCH 16/17] KVM: arm64: Enable parallel stage 2 MMU faults
@ 2022-04-15 21:59   ` Oliver Upton
  0 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:59 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

Voila! Since the map walkers are able to work in parallel, there is no
need to take the write lock on a stage 2 memory abort. Relax the locking
on map operations and cross our fingers that we got it right.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/kvm/mmu.c | 21 +++------------------
 1 file changed, 3 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 63cf18cdb978..2881051c3743 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1127,7 +1127,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	gfn_t gfn;
 	kvm_pfn_t pfn;
 	bool logging_active = memslot_is_logging(memslot);
-	bool use_read_lock = false;
 	unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
 	unsigned long vma_pagesize, fault_granule;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
@@ -1162,8 +1161,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (logging_active) {
 		force_pte = true;
 		vma_shift = PAGE_SHIFT;
-		use_read_lock = (fault_status == FSC_PERM && write_fault &&
-				 fault_granule == PAGE_SIZE);
 	} else {
 		vma_shift = get_vma_page_shift(vma, hva);
 	}
@@ -1267,15 +1264,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (exec_fault && device)
 		return -ENOEXEC;
 
-	/*
-	 * To reduce MMU contentions and enhance concurrency during dirty
-	 * logging dirty logging, only acquire read lock for permission
-	 * relaxation.
-	 */
-	if (use_read_lock)
-		read_lock(&kvm->mmu_lock);
-	else
-		write_lock(&kvm->mmu_lock);
+	read_lock(&kvm->mmu_lock);
+
 	pgt = vcpu->arch.hw_mmu->pgt;
 	if (mmu_notifier_retry(kvm, mmu_seq))
 		goto out_unlock;
@@ -1322,8 +1312,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (fault_status == FSC_PERM && vma_pagesize == fault_granule) {
 		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
 	} else {
-		WARN_ONCE(use_read_lock, "Attempted stage-2 map outside of write lock\n");
-
 		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
 					     __pfn_to_phys(pfn), prot,
 					     mmu_caches, true);
@@ -1336,10 +1324,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	}
 
 out_unlock:
-	if (use_read_lock)
-		read_unlock(&kvm->mmu_lock);
-	else
-		write_unlock(&kvm->mmu_lock);
+	read_unlock(&kvm->mmu_lock);
 	kvm_set_pfn_accessed(pfn);
 	kvm_release_pfn_clean(pfn);
 	return ret != -EAGAIN ? ret : 0;
-- 
2.36.0.rc0.470.gd361397f0d-goog
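
A rough sketch of the situation this is expected to handle, written as a
comment rather than code. The -EAGAIN behaviour is an assumption based on the
existing retry handling at the end of user_mem_abort(), not something spelled
out in this patch:

	/*
	 * Two vCPUs faulting on the same IPA may now both take the read lock:
	 *
	 *   vCPU A                          vCPU B
	 *   read_lock(&kvm->mmu_lock)       read_lock(&kvm->mmu_lock)
	 *   kvm_pgtable_stage2_map()        kvm_pgtable_stage2_map()
	 *     wins the race, installs PTE     loses the race, backs off
	 *   ret = 0                         ret = -EAGAIN (assumed)
	 *
	 * user_mem_abort() already treats -EAGAIN as "handled", so the losing
	 * vCPU simply returns to the guest and retries the access.
	 */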


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [RFC PATCH 17/17] TESTONLY: KVM: arm64: Add super lazy accounting of stage 2 table pages
@ 2022-04-15 21:59   ` Oliver Upton
  0 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:59 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

Don't use this please. I was just being lazy but wanted to make sure
tables are all accounted for.

There's a race here too, do you see it? :)

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/kvm/mmu.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 2881051c3743..68ea7f0244fe 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -95,6 +95,8 @@ static bool kvm_is_device_pfn(unsigned long pfn)
 	return !pfn_is_map_memory(pfn);
 }
 
+static atomic_t stage2_pages = ATOMIC_INIT(0);
+
 static void *stage2_memcache_zalloc_page(void *arg)
 {
 	struct kvm_mmu_caches *mmu_caches = arg;
@@ -112,6 +114,8 @@ static void *stage2_memcache_zalloc_page(void *arg)
 		return NULL;
 	}
 
+	atomic_inc(&stage2_pages);
+
 	hdr->page = virt_to_page(addr);
 	set_page_private(hdr->page, (unsigned long)hdr);
 	return addr;
@@ -121,6 +125,8 @@ static void stage2_free_page_now(struct stage2_page_header *hdr)
 {
 	WARN_ON(page_ref_count(hdr->page) != 1);
 
+	atomic_dec(&stage2_pages);
+
 	__free_page(hdr->page);
 	kmem_cache_free(stage2_page_header_cache, hdr);
 }
@@ -662,6 +668,8 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
 	.icache_inval_pou	= invalidate_icache_guest_page,
 };
 
+static atomic_t stage2_mmus = ATOMIC_INIT(0);
+
 /**
  * kvm_init_stage2_mmu - Initialise a S2 MMU structure
  * @kvm:	The pointer to the KVM structure
@@ -699,6 +707,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu)
 	for_each_possible_cpu(cpu)
 		*per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;
 
+	atomic_inc(&stage2_mmus);
+
 	mmu->pgt = pgt;
 	mmu->pgd_phys = __pa(pgt->pgd);
 	return 0;
@@ -796,6 +806,9 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
 		kvm_pgtable_stage2_destroy(pgt);
 		kfree(pgt);
 	}
+
+	if (atomic_dec_and_test(&stage2_mmus))
+		WARN_ON(atomic_read(&stage2_pages));
 }
 
 /**
-- 
2.36.0.rc0.470.gd361397f0d-goog


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [RFC PATCH 17/17] TESTONLY: KVM: arm64: Add super lazy accounting of stage 2 table pages
@ 2022-04-15 21:59   ` Oliver Upton
  0 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-15 21:59 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack, Oliver Upton

Don't use this please. I was just being lazy but wanted to make sure
tables are all accounted for.

There's a race here too, do you see it? :)

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/kvm/mmu.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 2881051c3743..68ea7f0244fe 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -95,6 +95,8 @@ static bool kvm_is_device_pfn(unsigned long pfn)
 	return !pfn_is_map_memory(pfn);
 }
 
+static atomic_t stage2_pages = ATOMIC_INIT(0);
+
 static void *stage2_memcache_zalloc_page(void *arg)
 {
 	struct kvm_mmu_caches *mmu_caches = arg;
@@ -112,6 +114,8 @@ static void *stage2_memcache_zalloc_page(void *arg)
 		return NULL;
 	}
 
+	atomic_inc(&stage2_pages);
+
 	hdr->page = virt_to_page(addr);
 	set_page_private(hdr->page, (unsigned long)hdr);
 	return addr;
@@ -121,6 +125,8 @@ static void stage2_free_page_now(struct stage2_page_header *hdr)
 {
 	WARN_ON(page_ref_count(hdr->page) != 1);
 
+	atomic_dec(&stage2_pages);
+
 	__free_page(hdr->page);
 	kmem_cache_free(stage2_page_header_cache, hdr);
 }
@@ -662,6 +668,8 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
 	.icache_inval_pou	= invalidate_icache_guest_page,
 };
 
+static atomic_t stage2_mmus = ATOMIC_INIT(0);
+
 /**
  * kvm_init_stage2_mmu - Initialise a S2 MMU structure
  * @kvm:	The pointer to the KVM structure
@@ -699,6 +707,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu)
 	for_each_possible_cpu(cpu)
 		*per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;
 
+	atomic_inc(&stage2_mmus);
+
 	mmu->pgt = pgt;
 	mmu->pgd_phys = __pa(pgt->pgd);
 	return 0;
@@ -796,6 +806,9 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
 		kvm_pgtable_stage2_destroy(pgt);
 		kfree(pgt);
 	}
+
+	if (atomic_dec_and_test(&stage2_mmus))
+		WARN_ON(atomic_read(&stage2_pages));
 }
 
 /**
-- 
2.36.0.rc0.470.gd361397f0d-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 00/17] KVM: arm64: Parallelize stage 2 fault handling
  2022-04-15 21:58 ` Oliver Upton
  (?)
@ 2022-04-15 23:35   ` David Matlack
  -1 siblings, 0 replies; 165+ messages in thread
From: David Matlack @ 2022-04-15 23:35 UTC (permalink / raw)
  To: Oliver Upton
  Cc: KVMARM, kvm list, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon

On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
>
> Presently KVM only takes a read lock for stage 2 faults if it believes
> the fault can be fixed by relaxing permissions on a PTE (write unprotect
> for dirty logging). Otherwise, stage 2 faults grab the write lock, which
> predictably can pile up all the vCPUs in a sufficiently large VM.
>
> The x86 port of KVM has what it calls the TDP MMU. Basically, it is an
> MMU protected by the combination of a read-write lock and RCU, allowing
> page walkers to traverse in parallel.
>
> This series is strongly inspired by the mechanics of the TDP MMU,
> making use of RCU to protect parallel walks. Note that the TLB
> invalidation mechanics are a bit different between x86 and ARM, so we
> need to use the 'break-before-make' sequence to split/collapse a
> block/table mapping, respectively.

An alternative (or perhaps "v2" [1]) is to make x86's TDP MMU
arch-neutral and port it to support ARM's stage-2 MMU. This is based
on a few observations:

- The problems that motivated the development of the TDP MMU are not
x86-specific (e.g. parallelizing faults during the post-copy phase of
Live Migration).
- The synchronization in the TDP MMU (read/write lock, RCU for PT
freeing, atomic compare-exchanges for modifying PTEs; a sketch of the
latter follows this list) is complex, but would be equivalent across
architectures.
- Eventually RISC-V is going to want similar performance (my
understanding is RISC-V MMU is already a copy-paste of the ARM MMU),
and it'd be a shame to re-implement TDP MMU synchronization a third
time.
- The TDP MMU includes support for various performance features that
would benefit other architectures, such as eager page splitting,
deferred zapping, lockless write-protection resolution, and (coming
soon) in-place huge page promotion.
- And then there's the obvious wins from less code duplication in KVM
(e.g. get rid of the RISC-V MMU copy, increased code test coverage,
...).
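
To make the compare-exchange point concrete, the lockless PTE update
boils down to something like the sketch below. Purely illustrative (the
helper name is made up; this is not the actual TDP MMU code): the
walker whose cmpxchg wins owns the change, everyone else just retries
the fault.

  /*
   * Illustrative only: publish a new PTE without holding the write
   * lock. The cmpxchg both detects a racing walker and installs the
   * new entry atomically.
   */
  static bool try_set_pte(u64 *ptep, u64 old, u64 new)
  {
          /* Fails if another vCPU changed the PTE since 'old' was read. */
          return cmpxchg64(ptep, old, new) == old;
  }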

The side of this I haven't really looked into yet is ARM's stage-2
MMU, and how amenable it would be to being managed by the TDP MMU. But
I assume it's a conventional page table structure mapping GPAs to
HPAs, which is the most important overlap.

That all being said, an arch-neutral TDP MMU would be a larger, more
complex code change than something like this series (hence my "v2"
caveat above). But I wanted to get this idea out there since the
rubber is starting to hit the road on improving ARM MMU scalability.

[1] "v2" as in the "next evolution" sense, not the "PATCH v2" sense :)





>
> Nonetheless, using atomics on the break side allows fault handlers to
> acquire exclusive access to a PTE (lets just call it locked). Once the
> PTE lock is acquired it is then safe to assume exclusive access.
>
> Special consideration is required when pruning the page tables in
> parallel. Suppose we are collapsing a table into a block. Allowing
> parallel faults means that a software walker could be in the middle of
> a lower level traversal when the table is unlinked. Table
> walkers that prune the paging structures must now 'lock' all descendent
> PTEs, effectively asserting exclusive ownership of the substructure
> (no other walker can install something to an already locked pte).
>
> Additionally, for parallel walks we need to punt the freeing of table
> pages to the next RCU sync, as there could be multiple observers of the
> table until all walkers exit the RCU critical section. For this I
> decided to cram an rcu_head into page private data for every table page.
> We wind up spending a bit more on table pages now, but lazily allocating
> for rcu callbacks probably doesn't make a lot of sense. Not only would
> we need a large cache of them (think about installing a level 1 block)
> to wire up callbacks on all descendent tables, but we also then need to
> spend memory to actually free memory.
>
> I tried to organize these patches as best I could w/o introducing
> intermediate breakage.
>
> The first 5 patches are meant mostly as prepatory reworks, and, in the
> case of RCU a nop.
>
> Patch 6 is quite large, but I had a hard time deciding how to change the
> way we link/unlink tables to use atomics without breaking things along
> the way.
>
> Patch 7 probably should come before patch 6, as it informs the other
> read-side fault (perm relax) about when a map is in progress so it'll
> back off.
>
> Patches 8-10 take care of the pruning case, actually locking the child ptes
> instead of simply dropping table page references along the way. Note
> that we cannot assume a pte points to a table/page at this point, hence
> the same helper is called for pre- and leaf-traversal. Guide the
> recursion based on what got yanked from the PTE.
>
> Patches 11-14 wire up everything to schedule rcu callbacks on
> to-be-freed table pages. rcu_barrier() is called on the way out from
> tearing down a stage 2 page table to guarantee all memory associated
> with the VM has actually been cleaned up.
>
> Patches 15-16 loop in the fault handler to the new table traversal game.
>
> Lastly, patch 17 is a nasty bit of debugging residue to spot possible
> table page leaks. Please don't laugh ;-)
>
> Smoke tested with KVM selftests + kvm_page_table_test w/ 2M hugetlb to
> exercise the table pruning code. Haven't done anything beyond this,
> sending as an RFC now to get eyes on the code.
>
> Applies to commit fb649bda6f56 ("Merge tag 'block-5.18-2022-04-15' of
> git://git.kernel.dk/linux-block")
>
> Oliver Upton (17):
>   KVM: arm64: Directly read owner id field in stage2_pte_is_counted()
>   KVM: arm64: Only read the pte once per visit
>   KVM: arm64: Return the next table from map callbacks
>   KVM: arm64: Protect page table traversal with RCU
>   KVM: arm64: Take an argument to indicate parallel walk
>   KVM: arm64: Implement break-before-make sequence for parallel walks
>   KVM: arm64: Enlighten perm relax path about parallel walks
>   KVM: arm64: Spin off helper for initializing table pte
>   KVM: arm64: Tear down unlinked page tables in parallel walk
>   KVM: arm64: Assume a table pte is already owned in post-order
>     traversal
>   KVM: arm64: Move MMU cache init/destroy into helpers
>   KVM: arm64: Stuff mmu page cache in sub struct
>   KVM: arm64: Setup cache for stage2 page headers
>   KVM: arm64: Punt last page reference to rcu callback for parallel walk
>   KVM: arm64: Allow parallel calls to kvm_pgtable_stage2_map()
>   KVM: arm64: Enable parallel stage 2 MMU faults
>   TESTONLY: KVM: arm64: Add super lazy accounting of stage 2 table pages
>
>  arch/arm64/include/asm/kvm_host.h     |   5 +-
>  arch/arm64/include/asm/kvm_mmu.h      |   2 +
>  arch/arm64/include/asm/kvm_pgtable.h  |  14 +-
>  arch/arm64/kvm/arm.c                  |   4 +-
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c |  13 +-
>  arch/arm64/kvm/hyp/nvhe/setup.c       |  13 +-
>  arch/arm64/kvm/hyp/pgtable.c          | 518 +++++++++++++++++++-------
>  arch/arm64/kvm/mmu.c                  | 120 ++++--
>  8 files changed, 503 insertions(+), 186 deletions(-)
>
> --
> 2.36.0.rc0.470.gd361397f0d-goog
>

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 00/17] KVM: arm64: Parallelize stage 2 fault handling
  2022-04-15 23:35   ` David Matlack
  (?)
@ 2022-04-16  0:04     ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-16  0:04 UTC (permalink / raw)
  To: David Matlack
  Cc: KVMARM, kvm list, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon

On Fri, Apr 15, 2022 at 04:35:24PM -0700, David Matlack wrote:
> On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
> >
> > Presently KVM only takes a read lock for stage 2 faults if it believes
> > the fault can be fixed by relaxing permissions on a PTE (write unprotect
> > for dirty logging). Otherwise, stage 2 faults grab the write lock, which
> > predictably can pile up all the vCPUs in a sufficiently large VM.
> >
> > The x86 port of KVM has what it calls the TDP MMU. Basically, it is an
> > MMU protected by the combination of a read-write lock and RCU, allowing
> > page walkers to traverse in parallel.
> >
> > This series is strongly inspired by the mechanics of the TDP MMU,
> > making use of RCU to protect parallel walks. Note that the TLB
> > invalidation mechanics are a bit different between x86 and ARM, so we
> > need to use the 'break-before-make' sequence to split/collapse a
> > block/table mapping, respectively.
> 
> An alternative (or perhaps "v2" [1]) is to make x86's TDP MMU
> arch-neutral and port it to support ARM's stage-2 MMU. This is based
> on a few observations:
> 
> - The problems that motivated the development of the TDP MMU are not
> x86-specific (e.g. parallelizing faults during the post-copy phase of
> Live Migration).
> - The synchronization in the TDP MMU (read/write lock, RCU for PT
> freeing, atomic compare-exchanges for modifying PTEs) is complex, but
> would be equivalent across architectures.
> - Eventually RISC-V is going to want similar performance (my
> understanding is RISC-V MMU is already a copy-paste of the ARM MMU),
> and it'd be a shame to re-implement TDP MMU synchronization a third
> time.
> - The TDP MMU includes support for various performance features that
> would benefit other architectures, such as eager page splitting,
> deferred zapping, lockless write-protection resolution, and (coming
> soon) in-place huge page promotion.
> - And then there's the obvious wins from less code duplication in KVM
> (e.g. get rid of the RISC-V MMU copy, increased code test coverage,
> ...).

I definitely agree with the observation -- we're all trying to solve the
same set of issues. And I completely agree that a good long term goal
would be to create some common parts for all architectures. Less work
for us ARM folks it would seem ;-)

What's top of mind is how we paper over the architectural differences
between all of the architectures, especially when we need to do entirely
different things because of the arch.

For example, I whine a lot throughout this series about
break-before-make, which is somewhat unique to ARM. I don't think we can do eager
page splitting on the base architecture w/o doing the TLBI for every
block. Not only that, we can't do a direct valid->valid change without
first making an invalid PTE visible to hardware. Things get even more
exciting when hardware revisions relax break-before-make requirements.
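
For anyone less familiar with the architecture, the sequence being
referred to looks roughly like the sketch below (helper and parameter
names are illustrative rather than the series' actual code):

  /*
   * Rough sketch of break-before-make for one stage-2 PTE. Names are
   * illustrative; the series folds this into the pgtable walker.
   */
  static void stage2_replace_pte(kvm_pte_t *ptep, kvm_pte_t new_pte,
                                 struct kvm_s2_mmu *mmu, u64 addr, u32 level)
  {
          /* Break: zap the entry so no new TLB entries can be formed. */
          WRITE_ONCE(*ptep, 0);
          dsb(ishst);

          /* Invalidate any TLB entries covering the old mapping. */
          kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, addr, level);

          /* Make: only now may a different valid mapping be installed. */
          WRITE_ONCE(*ptep, new_pte);
  }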

There's also significant architectural differences between KVM on x86
and KVM for ARM. Our paging code runs both in the host kernel and the
hyp/lowvisor, and does:

 - VM two dimensional paging (stage 2 MMU)
 - Hyp's own MMU (stage 1 MMU)
 - Host kernel isolation (stage 2 MMU)

each with its own quirks. The 'not exactly in the kernel' part will make
instrumentation a bit of a hassle too.

None of this is meant to disagree with you in the slightest. I firmly
agree we need to share as many parts between the architectures as
possible. I'm just trying to call out a few of the things relating to
ARM that will make this annoying, so that whoever embarks on the
adventure will see it.

> The side of this I haven't really looked into yet is ARM's stage-2
> MMU, and how amenable it would be to being managed by the TDP MMU. But
> I assume it's a conventional page table structure mapping GPAs to
> HPAs, which is the most important overlap.
> 
> That all being said, an arch-neutral TDP MMU would be a larger, more
> complex code change than something like this series (hence my "v2"
> caveat above). But I wanted to get this idea out there since the
> rubber is starting to hit the road on improving ARM MMU scalability.

All for it. I cc'ed you on the series for this exact reason, I wanted to
grab your attention to spark the conversation :)

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 00/17] KVM: arm64: Parallelize stage 2 fault handling
  2022-04-15 21:58 ` Oliver Upton
  (?)
@ 2022-04-16  6:23   ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-16  6:23 UTC (permalink / raw)
  To: kvmarm
  Cc: kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon,
	David Matlack

On Fri, Apr 15, 2022 at 09:58:44PM +0000, Oliver Upton wrote:

[...]

> 
> Smoke tested with KVM selftests + kvm_page_table_test w/ 2M hugetlb to
> exercise the table pruning code. Haven't done anything beyond this,
> sending as an RFC now to get eyes on the code.

Ok, got around to testing this thing a bit harder. Keep in mind that
permission faults at PAGE_SIZE granularity already go on the read side
of the lock. I used the dirty_log_perf_test with 4G/vCPU and anonymous
THP all the way up to 48 vCPUs. Here is the data as it compares to
5.18-rc2.
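
For reference, the runs were along the lines of the invocation below
(flags reconstructed from the description above rather than pasted
from a shell history, so treat it as approximate):

  ./dirty_log_perf_test -v <nr_vcpus> -b 4G -s anonymous_thp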

Dirty log time (split 2M -> 4K):

+-------+----------+-------------------+
| vCPUs | 5.18-rc2 | 5.18-rc2 + series |
+-------+----------+-------------------+
|     1 | 0.83s    | 0.85s             |
|     2 | 0.95s    | 1.07s             |
|     4 | 2.65s    | 1.13s             |
|     8 | 4.88s    | 1.33s             |
|    16 | 9.71s    | 1.73s             |
|    32 | 20.43s   | 3.99s             |
|    48 | 29.15s   | 6.28s             |
+-------+----------+-------------------+

The scaling of prefaulting pass looks better too (same config):

+-------+----------+-------------------+
| vCPUs | 5.18-rc2 | 5.18-rc2 + series |
+-------+----------+-------------------+
|     1 | 0.42s    | 0.18s             |
|     2 | 0.55s    | 0.19s             |
|     4 | 0.79s    | 0.27s             |
|     8 | 1.29s    | 0.35s             |
|    16 | 2.03s    | 0.53s             |
|    32 | 4.03s    | 1.01s             |
|    48 | 6.10s    | 1.51s             |
+-------+----------+-------------------+

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 05/17] KVM: arm64: Take an argument to indicate parallel walk
  2022-04-15 21:58   ` Oliver Upton
  (?)
@ 2022-04-16 11:30     ` Marc Zyngier
  -1 siblings, 0 replies; 165+ messages in thread
From: Marc Zyngier @ 2022-04-16 11:30 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, kvm, James Morse, Alexandru Elisei, Suzuki K Poulose,
	linux-arm-kernel, Peter Shier, Ricardo Koller, Reiji Watanabe,
	Paolo Bonzini, Sean Christopherson, Ben Gardon, David Matlack

Hi Oliver,

On Fri, 15 Apr 2022 22:58:49 +0100,
Oliver Upton <oupton@google.com> wrote:
> 
> It is desirable to reuse the same page walkers for serial and parallel
> faults. Take an argument to kvm_pgtable_walk() (and throughout) to
> indicate whether or not a walk might happen in parallel with another.
>
> No functional change intended.
> 
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h  |  5 +-
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c |  4 +-
>  arch/arm64/kvm/hyp/nvhe/setup.c       |  4 +-
>  arch/arm64/kvm/hyp/pgtable.c          | 91 ++++++++++++++-------------
>  4 files changed, 54 insertions(+), 50 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index ea818a5f7408..74955aba5918 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -194,7 +194,7 @@ enum kvm_pgtable_walk_flags {
>  typedef int (*kvm_pgtable_visitor_fn_t)(u64 addr, u64 end, u32 level,
>  					kvm_pte_t *ptep, kvm_pte_t *old,
>  					enum kvm_pgtable_walk_flags flag,
> -					void * const arg);
> +					void * const arg, bool shared);

Am I the only one who finds this really ugly? Sprinkling this all over
the shop makes the code rather unreadable. It seems to me that having
some sort of more general context would make more sense.

For example, I would fully expect the walk context to tell us whether
this walker is willing to share its walk. Add a predicate to that,
which would conveniently expand to 'false' for contexts where we don't
have RCU (such as the pKVM HYP PT management), and you should get
something that is more manageable.
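
Something along these lines, purely as an illustration (the flag and
helper names are made up for the sake of argument):

  /*
   * The existing struct kvm_pgtable_walker already carries 'flags',
   * so a new bit plus a predicate would be enough.
   */
  #define KVM_PGTABLE_WALK_SHARED         BIT(3)

  static bool kvm_pgtable_walk_shared(const struct kvm_pgtable_walker *walker)
  {
          /* Never set by the pKVM hyp walkers, so this is false there. */
          return walker->flags & KVM_PGTABLE_WALK_SHARED;
  }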

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 05/17] KVM: arm64: Take an argument to indicate parallel walk
  2022-04-16 11:30     ` Marc Zyngier
  (?)
@ 2022-04-16 16:03       ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-16 16:03 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, kvm, James Morse, Alexandru Elisei, Suzuki K Poulose,
	linux-arm-kernel, Peter Shier, Ricardo Koller, Reiji Watanabe,
	Paolo Bonzini, Sean Christopherson, Ben Gardon, David Matlack

On Sat, Apr 16, 2022 at 12:30:23PM +0100, Marc Zyngier wrote:
> Hi Oliver,
> 
> On Fri, 15 Apr 2022 22:58:49 +0100,
> Oliver Upton <oupton@google.com> wrote:
> > 
> > It is desirable to reuse the same page walkers for serial and parallel
> > faults. Take an argument to kvm_pgtable_walk() (and throughout) to
> > indicate whether or not a walk might happen in parallel with another.
> >
> > No functional change intended.
> > 
> > Signed-off-by: Oliver Upton <oupton@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_pgtable.h  |  5 +-
> >  arch/arm64/kvm/hyp/nvhe/mem_protect.c |  4 +-
> >  arch/arm64/kvm/hyp/nvhe/setup.c       |  4 +-
> >  arch/arm64/kvm/hyp/pgtable.c          | 91 ++++++++++++++-------------
> >  4 files changed, 54 insertions(+), 50 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index ea818a5f7408..74955aba5918 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -194,7 +194,7 @@ enum kvm_pgtable_walk_flags {
> >  typedef int (*kvm_pgtable_visitor_fn_t)(u64 addr, u64 end, u32 level,
> >  					kvm_pte_t *ptep, kvm_pte_t *old,
> >  					enum kvm_pgtable_walk_flags flag,
> > -					void * const arg);
> > +					void * const arg, bool shared);
> 
> Am I the only one who finds this really ugly? Sprinkling this all over
> the shop makes the code rather unreadable. It seems to me that having
> some sort of more general context would make more sense.

You certainly are not. This is a bit sloppy: a previous spin of this
series needed to know about parallelism in the generic page walker
context, and I had opted to just poke the bool through rather than
hitch it to kvm_pgtable_walker. That scheme required the churn either
way, but that is no longer the case now.

> For example, I would fully expect the walk context to tell us whether
> this walker is willing to share its walk. Add a predicate to that,
> which would conveniently expand to 'false' for contexts where we don't
> have RCU (such as the pKVM HYP PT management), and you should get
> something that is more manageable.

I think the blast radius is now limited to just the stage2 visitors, so
the flag can probably get crammed into the callback arg instead. Limiting
the changes to stage2 was intentional: the hyp walkers seem to be working
fine and I'd rather not come under fire for breaking them somehow ;)
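
Concretely, something like the below is what I have in mind. Sketch
only: the 'shared' member is hypothetical and the struct is trimmed to
a few fields for illustration.

struct stage2_map_data {
	u64				phys;
	kvm_pte_t			attr;
	struct kvm_pgtable_mm_ops	*mm_ops;
	bool				shared;	/* fault may race with other vCPUs */
};

static bool stage2_walk_is_shared(const struct stage2_map_data *data)
{
	return data->shared;
}

That keeps the generic walker signature untouched; only the stage2
visitors ever look at the flag.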

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 04/17] KVM: arm64: Protect page table traversal with RCU
  2022-04-15 21:58   ` Oliver Upton
  (?)
@ 2022-04-19  2:55     ` Ricardo Koller
  -1 siblings, 0 replies; 165+ messages in thread
From: Ricardo Koller @ 2022-04-19  2:55 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Reiji Watanabe,
	Paolo Bonzini, Sean Christopherson, Ben Gardon, David Matlack

On Fri, Apr 15, 2022 at 09:58:48PM +0000, Oliver Upton wrote:
> Use RCU to safely traverse the page tables in parallel; the tables
> themselves will only be freed from an RCU synchronized context. Don't
> even bother with adding support to hyp, and instead just assume
> exclusive access of the page tables.
> 
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>  arch/arm64/kvm/hyp/pgtable.c | 23 ++++++++++++++++++++++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 5b64fbca8a93..d4699f698d6e 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -132,9 +132,28 @@ static kvm_pte_t kvm_phys_to_pte(u64 pa)
>  	return pte;
>  }
>  
> +
> +#if defined(__KVM_NVHE_HYPERVISOR__)
> +static inline void kvm_pgtable_walk_begin(void)
> +{}
> +
> +static inline void kvm_pgtable_walk_end(void)
> +{}
> +
> +#define kvm_dereference_ptep	rcu_dereference_raw
> +#else
> +#define kvm_pgtable_walk_begin	rcu_read_lock
> +
> +#define kvm_pgtable_walk_end	rcu_read_unlock
> +
> +#define kvm_dereference_ptep	rcu_dereference
> +#endif
> +
>  static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte, struct kvm_pgtable_mm_ops *mm_ops)
>  {
> -	return mm_ops->phys_to_virt(kvm_pte_to_phys(pte));
> +	kvm_pte_t __rcu *ptep = mm_ops->phys_to_virt(kvm_pte_to_phys(pte));
> +
> +	return kvm_dereference_ptep(ptep);
>  }
>  
>  static void kvm_clear_pte(kvm_pte_t *ptep)
> @@ -288,7 +307,9 @@ int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
>  		.walker	= walker,
>  	};
>  
> +	kvm_pgtable_walk_begin();
>  	return _kvm_pgtable_walk(&walk_data);
> +	kvm_pgtable_walk_end();

This might be fixed later in the series, but at this point
rcu_read_unlock() is never called: the walk returns before reaching
kvm_pgtable_walk_end().
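
i.e. presumably the intent was something along these lines (untested),
with an 'int ret' declared next to walk_data:

 	kvm_pgtable_walk_begin();
-	return _kvm_pgtable_walk(&walk_data);
+	ret = _kvm_pgtable_walk(&walk_data);
 	kvm_pgtable_walk_end();
+
+	return ret;
 }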

>  }
>  
>  struct leaf_walk_data {
> -- 
> 2.36.0.rc0.470.gd361397f0d-goog
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 14/17] KVM: arm64: Punt last page reference to rcu callback for parallel walk
  2022-04-15 21:58   ` Oliver Upton
  (?)
@ 2022-04-19  2:59     ` Ricardo Koller
  -1 siblings, 0 replies; 165+ messages in thread
From: Ricardo Koller @ 2022-04-19  2:59 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Reiji Watanabe,
	Paolo Bonzini, Sean Christopherson, Ben Gardon, David Matlack

On Fri, Apr 15, 2022 at 09:58:58PM +0000, Oliver Upton wrote:
> It is possible that a table page remains visible to another thread until
> the next rcu synchronization event. To that end, we cannot drop the last
> page reference synchronous with post-order traversal for a parallel
> table walk.
> 
> Schedule an rcu callback to clean up the child table page for parallel
> walks.
> 
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h |  3 ++
>  arch/arm64/kvm/hyp/pgtable.c         | 24 +++++++++++++--
>  arch/arm64/kvm/mmu.c                 | 44 +++++++++++++++++++++++++++-
>  3 files changed, 67 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 74955aba5918..52e55e00f0ca 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -81,6 +81,8 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
>   * @put_page:			Decrement the refcount on a page. When the
>   *				refcount reaches 0 the page is automatically
>   *				freed.
> + * @free_table:			Drop the last page reference, possibly in the
> + *				next RCU sync if doing a shared walk.
>   * @page_count:			Return the refcount of a page.
>   * @phys_to_virt:		Convert a physical address into a virtual
>   *				address	mapped in the current context.
> @@ -98,6 +100,7 @@ struct kvm_pgtable_mm_ops {
>  	void		(*get_page)(void *addr);
>  	void		(*put_page)(void *addr);
>  	int		(*page_count)(void *addr);
> +	void		(*free_table)(void *addr, bool shared);
>  	void*		(*phys_to_virt)(phys_addr_t phys);
>  	phys_addr_t	(*virt_to_phys)(void *addr);
>  	void		(*dcache_clean_inval_poc)(void *addr, size_t size);
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 121818d4c33e..a9a48edba63b 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -147,12 +147,19 @@ static inline void kvm_pgtable_walk_end(void)
>  {}
>  
>  #define kvm_dereference_ptep	rcu_dereference_raw
> +
> +static inline void kvm_pgtable_destroy_barrier(void)
> +{}
> +
>  #else
>  #define kvm_pgtable_walk_begin	rcu_read_lock
>  
>  #define kvm_pgtable_walk_end	rcu_read_unlock
>  
>  #define kvm_dereference_ptep	rcu_dereference
> +
> +#define kvm_pgtable_destroy_barrier	rcu_barrier
> +
>  #endif
>  
>  static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte, struct kvm_pgtable_mm_ops *mm_ops)
> @@ -1063,7 +1070,12 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
>  		childp = kvm_pte_follow(*old, mm_ops);
>  	}
>  
> -	mm_ops->put_page(childp);
> +	/*
> +	 * If we do not have exclusive access to the page tables it is possible
> +	 * the unlinked table remains visible to another thread until the next
> +	 * rcu synchronization.
> +	 */
> +	mm_ops->free_table(childp, shared);
>  	mm_ops->put_page(ptep);
>  
>  	return ret;
> @@ -1203,7 +1215,7 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>  					       kvm_granule_size(level));
>  
>  	if (childp)
> -		mm_ops->put_page(childp);
> +		mm_ops->free_table(childp, shared);
>  
>  	return 0;
>  }
> @@ -1433,7 +1445,7 @@ static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>  	mm_ops->put_page(ptep);
>  
>  	if (kvm_pte_table(*old, level))
> -		mm_ops->put_page(kvm_pte_follow(*old, mm_ops));
> +		mm_ops->free_table(kvm_pte_follow(*old, mm_ops), shared);
>  
>  	return 0;
>  }
> @@ -1452,4 +1464,10 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
>  	pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level) * PAGE_SIZE;
>  	pgt->mm_ops->free_pages_exact(pgt->pgd, pgd_sz);
>  	pgt->pgd = NULL;
> +
> +	/*
> +	 * Guarantee that all unlinked subtrees associated with the stage2 page
> +	 * table have also been freed before returning.
> +	 */
> +	kvm_pgtable_destroy_barrier();
>  }
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index cc6ed6b06ec2..6ecf37009c21 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -98,9 +98,50 @@ static bool kvm_is_device_pfn(unsigned long pfn)
>  static void *stage2_memcache_zalloc_page(void *arg)
>  {
>  	struct kvm_mmu_caches *mmu_caches = arg;
> +	struct stage2_page_header *hdr;
> +	void *addr;
>  
>  	/* Allocated with __GFP_ZERO, so no need to zero */
> -	return kvm_mmu_memory_cache_alloc(&mmu_caches->page_cache);
> +	addr = kvm_mmu_memory_cache_alloc(&mmu_caches->page_cache);
> +	if (!addr)
> +		return NULL;
> +
> +	hdr = kvm_mmu_memory_cache_alloc(&mmu_caches->header_cache);
> +	if (!hdr) {
> +		free_page((unsigned long)addr);
> +		return NULL;
> +	}
> +
> +	hdr->page = virt_to_page(addr);
> +	set_page_private(hdr->page, (unsigned long)hdr);
> +	return addr;
> +}
> +
> +static void stage2_free_page_now(struct stage2_page_header *hdr)
> +{
> +	WARN_ON(page_ref_count(hdr->page) != 1);
> +
> +	__free_page(hdr->page);
> +	kmem_cache_free(stage2_page_header_cache, hdr);
> +}
> +
> +static void stage2_free_page_rcu_cb(struct rcu_head *head)
> +{
> +	struct stage2_page_header *hdr = container_of(head, struct stage2_page_header,
> +						      rcu_head);
> +
> +	stage2_free_page_now(hdr);
> +}
> +
> +static void stage2_free_table(void *addr, bool shared)
> +{
> +	struct page *page = virt_to_page(addr);
> +	struct stage2_page_header *hdr = (struct stage2_page_header *)page_private(page);
> +
> +	if (shared)
> +		call_rcu(&hdr->rcu_head, stage2_free_page_rcu_cb);

Can the number of callbacks grow to "dangerous" numbers? Can it be
bounded with something like the following?

if number of readers is really high:
	synchronize_rcu() 
else
	call_rcu()

Maybe the RCU API has an option for that.
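
Just to make the bookkeeping side concrete, a rough sketch (the counter
is an invented name and this only tracks the backlog rather than capping
it; note that blocking in stage2_free_table() itself wouldn't work for
the shared path, since that walk runs under rcu_read_lock() via
kvm_pgtable_walk_begin()):

static atomic_t stage2_pending_table_frees = ATOMIC_INIT(0);

static void stage2_free_page_rcu_cb(struct rcu_head *head)
{
	struct stage2_page_header *hdr = container_of(head, struct stage2_page_header,
						      rcu_head);

	stage2_free_page_now(hdr);
	atomic_dec(&stage2_pending_table_frees);
}

static void stage2_free_table(void *addr, bool shared)
{
	struct stage2_page_header *hdr;

	hdr = (struct stage2_page_header *)page_private(virt_to_page(addr));
	if (!shared) {
		stage2_free_page_now(hdr);
		return;
	}

	atomic_inc(&stage2_pending_table_frees);
	call_rcu(&hdr->rcu_head, stage2_free_page_rcu_cb);
}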

> +	else
> +		stage2_free_page_now(hdr);
>  }
>  
>  static void *kvm_host_zalloc_pages_exact(size_t size)
> @@ -613,6 +654,7 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
>  	.free_pages_exact	= free_pages_exact,
>  	.get_page		= kvm_host_get_page,
>  	.put_page		= kvm_host_put_page,
> +	.free_table		= stage2_free_table,
>  	.page_count		= kvm_host_page_count,
>  	.phys_to_virt		= kvm_host_va,
>  	.virt_to_phys		= kvm_host_pa,
> -- 
> 2.36.0.rc0.470.gd361397f0d-goog
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 04/17] KVM: arm64: Protect page table traversal with RCU
  2022-04-19  2:55     ` Ricardo Koller
  (?)
@ 2022-04-19  3:01       ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-19  3:01 UTC (permalink / raw)
  To: Ricardo Koller
  Cc: kvmarm, kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Reiji Watanabe,
	Paolo Bonzini, Sean Christopherson, Ben Gardon, David Matlack

On Mon, Apr 18, 2022 at 7:55 PM Ricardo Koller <ricarkol@google.com> wrote:
>
> On Fri, Apr 15, 2022 at 09:58:48PM +0000, Oliver Upton wrote:
> > Use RCU to safely traverse the page tables in parallel; the tables
> > themselves will only be freed from an RCU synchronized context. Don't
> > even bother with adding support to hyp, and instead just assume
> > exclusive access of the page tables.
> >
> > Signed-off-by: Oliver Upton <oupton@google.com>
> > ---
> >  arch/arm64/kvm/hyp/pgtable.c | 23 ++++++++++++++++++++++-
> >  1 file changed, 22 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 5b64fbca8a93..d4699f698d6e 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -132,9 +132,28 @@ static kvm_pte_t kvm_phys_to_pte(u64 pa)
> >       return pte;
> >  }
> >
> > +
> > +#if defined(__KVM_NVHE_HYPERVISOR__)
> > +static inline void kvm_pgtable_walk_begin(void)
> > +{}
> > +
> > +static inline void kvm_pgtable_walk_end(void)
> > +{}
> > +
> > +#define kvm_dereference_ptep rcu_dereference_raw
> > +#else
> > +#define kvm_pgtable_walk_begin       rcu_read_lock
> > +
> > +#define kvm_pgtable_walk_end rcu_read_unlock
> > +
> > +#define kvm_dereference_ptep rcu_dereference
> > +#endif
> > +
> >  static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte, struct kvm_pgtable_mm_ops *mm_ops)
> >  {
> > -     return mm_ops->phys_to_virt(kvm_pte_to_phys(pte));
> > +     kvm_pte_t __rcu *ptep = mm_ops->phys_to_virt(kvm_pte_to_phys(pte));
> > +
> > +     return kvm_dereference_ptep(ptep);
> >  }
> >
> >  static void kvm_clear_pte(kvm_pte_t *ptep)
> > @@ -288,7 +307,9 @@ int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
> >               .walker = walker,
> >       };
> >
> > +     kvm_pgtable_walk_begin();
> >       return _kvm_pgtable_walk(&walk_data);
> > +     kvm_pgtable_walk_end();
>
> This might be fixed later in the series, but at this point
> rcu_read_unlock() is never called: the walk returns before reaching
> kvm_pgtable_walk_end().

Well that's embarrassing!

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 14/17] KVM: arm64: Punt last page reference to rcu callback for parallel walk
  2022-04-19  2:59     ` Ricardo Koller
  (?)
@ 2022-04-19  3:09       ` Ricardo Koller
  -1 siblings, 0 replies; 165+ messages in thread
From: Ricardo Koller @ 2022-04-19  3:09 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Reiji Watanabe,
	Paolo Bonzini, Sean Christopherson, Ben Gardon, David Matlack

On Mon, Apr 18, 2022 at 07:59:04PM -0700, Ricardo Koller wrote:
> On Fri, Apr 15, 2022 at 09:58:58PM +0000, Oliver Upton wrote:
> > It is possible that a table page remains visible to another thread until
> > the next rcu synchronization event. To that end, we cannot drop the last
> > page reference synchronous with post-order traversal for a parallel
> > table walk.
> > 
> > Schedule an rcu callback to clean up the child table page for parallel
> > walks.
> > 
> > Signed-off-by: Oliver Upton <oupton@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_pgtable.h |  3 ++
> >  arch/arm64/kvm/hyp/pgtable.c         | 24 +++++++++++++--
> >  arch/arm64/kvm/mmu.c                 | 44 +++++++++++++++++++++++++++-
> >  3 files changed, 67 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index 74955aba5918..52e55e00f0ca 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -81,6 +81,8 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
> >   * @put_page:			Decrement the refcount on a page. When the
> >   *				refcount reaches 0 the page is automatically
> >   *				freed.
> > + * @free_table:			Drop the last page reference, possibly in the
> > + *				next RCU sync if doing a shared walk.
> >   * @page_count:			Return the refcount of a page.
> >   * @phys_to_virt:		Convert a physical address into a virtual
> >   *				address	mapped in the current context.
> > @@ -98,6 +100,7 @@ struct kvm_pgtable_mm_ops {
> >  	void		(*get_page)(void *addr);
> >  	void		(*put_page)(void *addr);
> >  	int		(*page_count)(void *addr);
> > +	void		(*free_table)(void *addr, bool shared);
> >  	void*		(*phys_to_virt)(phys_addr_t phys);
> >  	phys_addr_t	(*virt_to_phys)(void *addr);
> >  	void		(*dcache_clean_inval_poc)(void *addr, size_t size);
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 121818d4c33e..a9a48edba63b 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -147,12 +147,19 @@ static inline void kvm_pgtable_walk_end(void)
> >  {}
> >  
> >  #define kvm_dereference_ptep	rcu_dereference_raw
> > +
> > +static inline void kvm_pgtable_destroy_barrier(void)
> > +{}
> > +
> >  #else
> >  #define kvm_pgtable_walk_begin	rcu_read_lock
> >  
> >  #define kvm_pgtable_walk_end	rcu_read_unlock
> >  
> >  #define kvm_dereference_ptep	rcu_dereference
> > +
> > +#define kvm_pgtable_destroy_barrier	rcu_barrier
> > +
> >  #endif
> >  
> >  static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte, struct kvm_pgtable_mm_ops *mm_ops)
> > @@ -1063,7 +1070,12 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
> >  		childp = kvm_pte_follow(*old, mm_ops);
> >  	}
> >  
> > -	mm_ops->put_page(childp);
> > +	/*
> > +	 * If we do not have exclusive access to the page tables it is possible
> > +	 * the unlinked table remains visible to another thread until the next
> > +	 * rcu synchronization.
> > +	 */
> > +	mm_ops->free_table(childp, shared);
> >  	mm_ops->put_page(ptep);
> >  
> >  	return ret;
> > @@ -1203,7 +1215,7 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> >  					       kvm_granule_size(level));
> >  
> >  	if (childp)
> > -		mm_ops->put_page(childp);
> > +		mm_ops->free_table(childp, shared);
> >  
> >  	return 0;
> >  }
> > @@ -1433,7 +1445,7 @@ static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> >  	mm_ops->put_page(ptep);
> >  
> >  	if (kvm_pte_table(*old, level))
> > -		mm_ops->put_page(kvm_pte_follow(*old, mm_ops));
> > +		mm_ops->free_table(kvm_pte_follow(*old, mm_ops), shared);
> >  
> >  	return 0;
> >  }
> > @@ -1452,4 +1464,10 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
> >  	pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level) * PAGE_SIZE;
> >  	pgt->mm_ops->free_pages_exact(pgt->pgd, pgd_sz);
> >  	pgt->pgd = NULL;
> > +
> > +	/*
> > +	 * Guarantee that all unlinked subtrees associated with the stage2 page
> > +	 * table have also been freed before returning.
> > +	 */
> > +	kvm_pgtable_destroy_barrier();
> >  }
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index cc6ed6b06ec2..6ecf37009c21 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -98,9 +98,50 @@ static bool kvm_is_device_pfn(unsigned long pfn)
> >  static void *stage2_memcache_zalloc_page(void *arg)
> >  {
> >  	struct kvm_mmu_caches *mmu_caches = arg;
> > +	struct stage2_page_header *hdr;
> > +	void *addr;
> >  
> >  	/* Allocated with __GFP_ZERO, so no need to zero */
> > -	return kvm_mmu_memory_cache_alloc(&mmu_caches->page_cache);
> > +	addr = kvm_mmu_memory_cache_alloc(&mmu_caches->page_cache);
> > +	if (!addr)
> > +		return NULL;
> > +
> > +	hdr = kvm_mmu_memory_cache_alloc(&mmu_caches->header_cache);
> > +	if (!hdr) {
> > +		free_page((unsigned long)addr);
> > +		return NULL;
> > +	}
> > +
> > +	hdr->page = virt_to_page(addr);
> > +	set_page_private(hdr->page, (unsigned long)hdr);
> > +	return addr;
> > +}
> > +
> > +static void stage2_free_page_now(struct stage2_page_header *hdr)
> > +{
> > +	WARN_ON(page_ref_count(hdr->page) != 1);
> > +
> > +	__free_page(hdr->page);
> > +	kmem_cache_free(stage2_page_header_cache, hdr);
> > +}
> > +
> > +static void stage2_free_page_rcu_cb(struct rcu_head *head)
> > +{
> > +	struct stage2_page_header *hdr = container_of(head, struct stage2_page_header,
> > +						      rcu_head);
> > +
> > +	stage2_free_page_now(hdr);
> > +}
> > +
> > +static void stage2_free_table(void *addr, bool shared)
> > +{
> > +	struct page *page = virt_to_page(addr);
> > +	struct stage2_page_header *hdr = (struct stage2_page_header *)page_private(page);
> > +
> > +	if (shared)
> > +		call_rcu(&hdr->rcu_head, stage2_free_page_rcu_cb);
> 
> Can the number of callbacks grow to "dangerous" numbers? Can it be
> bounded with something like the following?
> 
> if number of readers is really high:
> 	synchronize_rcu() 
> else
> 	call_rcu()

Sorry, I meant to say "number of callbacks".
> 
> Maybe the RCU API has an option for that.
> 
> > +	else
> > +		stage2_free_page_now(hdr);
> >  }
> >  
> >  static void *kvm_host_zalloc_pages_exact(size_t size)
> > @@ -613,6 +654,7 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
> >  	.free_pages_exact	= free_pages_exact,
> >  	.get_page		= kvm_host_get_page,
> >  	.put_page		= kvm_host_put_page,
> > +	.free_table		= stage2_free_table,
> >  	.page_count		= kvm_host_page_count,
> >  	.phys_to_virt		= kvm_host_va,
> >  	.virt_to_phys		= kvm_host_pa,
> > -- 
> > 2.36.0.rc0.470.gd361397f0d-goog
> > 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 14/17] KVM: arm64: Punt last page reference to rcu callback for parallel walk
@ 2022-04-19  3:09       ` Ricardo Koller
  0 siblings, 0 replies; 165+ messages in thread
From: Ricardo Koller @ 2022-04-19  3:09 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvm, Marc Zyngier, Ben Gardon, Peter Shier, David Matlack,
	Paolo Bonzini, kvmarm, linux-arm-kernel

On Mon, Apr 18, 2022 at 07:59:04PM -0700, Ricardo Koller wrote:
> On Fri, Apr 15, 2022 at 09:58:58PM +0000, Oliver Upton wrote:
> > It is possible that a table page remains visible to another thread until
> > the next rcu synchronization event. To that end, we cannot drop the last
> > page reference synchronous with post-order traversal for a parallel
> > table walk.
> > 
> > Schedule an rcu callback to clean up the child table page for parallel
> > walks.
> > 
> > Signed-off-by: Oliver Upton <oupton@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_pgtable.h |  3 ++
> >  arch/arm64/kvm/hyp/pgtable.c         | 24 +++++++++++++--
> >  arch/arm64/kvm/mmu.c                 | 44 +++++++++++++++++++++++++++-
> >  3 files changed, 67 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index 74955aba5918..52e55e00f0ca 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -81,6 +81,8 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
> >   * @put_page:			Decrement the refcount on a page. When the
> >   *				refcount reaches 0 the page is automatically
> >   *				freed.
> > + * @free_table:			Drop the last page reference, possibly in the
> > + *				next RCU sync if doing a shared walk.
> >   * @page_count:			Return the refcount of a page.
> >   * @phys_to_virt:		Convert a physical address into a virtual
> >   *				address	mapped in the current context.
> > @@ -98,6 +100,7 @@ struct kvm_pgtable_mm_ops {
> >  	void		(*get_page)(void *addr);
> >  	void		(*put_page)(void *addr);
> >  	int		(*page_count)(void *addr);
> > +	void		(*free_table)(void *addr, bool shared);
> >  	void*		(*phys_to_virt)(phys_addr_t phys);
> >  	phys_addr_t	(*virt_to_phys)(void *addr);
> >  	void		(*dcache_clean_inval_poc)(void *addr, size_t size);
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 121818d4c33e..a9a48edba63b 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -147,12 +147,19 @@ static inline void kvm_pgtable_walk_end(void)
> >  {}
> >  
> >  #define kvm_dereference_ptep	rcu_dereference_raw
> > +
> > +static inline void kvm_pgtable_destroy_barrier(void)
> > +{}
> > +
> >  #else
> >  #define kvm_pgtable_walk_begin	rcu_read_lock
> >  
> >  #define kvm_pgtable_walk_end	rcu_read_unlock
> >  
> >  #define kvm_dereference_ptep	rcu_dereference
> > +
> > +#define kvm_pgtable_destroy_barrier	rcu_barrier
> > +
> >  #endif
> >  
> >  static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte, struct kvm_pgtable_mm_ops *mm_ops)
> > @@ -1063,7 +1070,12 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
> >  		childp = kvm_pte_follow(*old, mm_ops);
> >  	}
> >  
> > -	mm_ops->put_page(childp);
> > +	/*
> > +	 * If we do not have exclusive access to the page tables it is possible
> > +	 * the unlinked table remains visible to another thread until the next
> > +	 * rcu synchronization.
> > +	 */
> > +	mm_ops->free_table(childp, shared);
> >  	mm_ops->put_page(ptep);
> >  
> >  	return ret;
> > @@ -1203,7 +1215,7 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> >  					       kvm_granule_size(level));
> >  
> >  	if (childp)
> > -		mm_ops->put_page(childp);
> > +		mm_ops->free_table(childp, shared);
> >  
> >  	return 0;
> >  }
> > @@ -1433,7 +1445,7 @@ static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> >  	mm_ops->put_page(ptep);
> >  
> >  	if (kvm_pte_table(*old, level))
> > -		mm_ops->put_page(kvm_pte_follow(*old, mm_ops));
> > +		mm_ops->free_table(kvm_pte_follow(*old, mm_ops), shared);
> >  
> >  	return 0;
> >  }
> > @@ -1452,4 +1464,10 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
> >  	pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level) * PAGE_SIZE;
> >  	pgt->mm_ops->free_pages_exact(pgt->pgd, pgd_sz);
> >  	pgt->pgd = NULL;
> > +
> > +	/*
> > +	 * Guarantee that all unlinked subtrees associated with the stage2 page
> > +	 * table have also been freed before returning.
> > +	 */
> > +	kvm_pgtable_destroy_barrier();
> >  }
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index cc6ed6b06ec2..6ecf37009c21 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -98,9 +98,50 @@ static bool kvm_is_device_pfn(unsigned long pfn)
> >  static void *stage2_memcache_zalloc_page(void *arg)
> >  {
> >  	struct kvm_mmu_caches *mmu_caches = arg;
> > +	struct stage2_page_header *hdr;
> > +	void *addr;
> >  
> >  	/* Allocated with __GFP_ZERO, so no need to zero */
> > -	return kvm_mmu_memory_cache_alloc(&mmu_caches->page_cache);
> > +	addr = kvm_mmu_memory_cache_alloc(&mmu_caches->page_cache);
> > +	if (!addr)
> > +		return NULL;
> > +
> > +	hdr = kvm_mmu_memory_cache_alloc(&mmu_caches->header_cache);
> > +	if (!hdr) {
> > +		free_page((unsigned long)addr);
> > +		return NULL;
> > +	}
> > +
> > +	hdr->page = virt_to_page(addr);
> > +	set_page_private(hdr->page, (unsigned long)hdr);
> > +	return addr;
> > +}
> > +
> > +static void stage2_free_page_now(struct stage2_page_header *hdr)
> > +{
> > +	WARN_ON(page_ref_count(hdr->page) != 1);
> > +
> > +	__free_page(hdr->page);
> > +	kmem_cache_free(stage2_page_header_cache, hdr);
> > +}
> > +
> > +static void stage2_free_page_rcu_cb(struct rcu_head *head)
> > +{
> > +	struct stage2_page_header *hdr = container_of(head, struct stage2_page_header,
> > +						      rcu_head);
> > +
> > +	stage2_free_page_now(hdr);
> > +}
> > +
> > +static void stage2_free_table(void *addr, bool shared)
> > +{
> > +	struct page *page = virt_to_page(addr);
> > +	struct stage2_page_header *hdr = (struct stage2_page_header *)page_private(page);
> > +
> > +	if (shared)
> > +		call_rcu(&hdr->rcu_head, stage2_free_page_rcu_cb);
> 
> Can the number of callbacks grow to "dangerous" numbers? can it be
> bounded with something like the following?
> 
> if number of readers is really high:
> 	synchronize_rcu() 
> else
> 	call_rcu()

Sorry, in the pseudocode above I meant to say "number of callbacks",
not "number of readers".
> 
> maybe the rcu API has an option for that.
> 
> > +	else
> > +		stage2_free_page_now(hdr);
> >  }
> >  
> >  static void *kvm_host_zalloc_pages_exact(size_t size)
> > @@ -613,6 +654,7 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
> >  	.free_pages_exact	= free_pages_exact,
> >  	.get_page		= kvm_host_get_page,
> >  	.put_page		= kvm_host_put_page,
> > +	.free_table		= stage2_free_table,
> >  	.page_count		= kvm_host_page_count,
> >  	.phys_to_virt		= kvm_host_va,
> >  	.virt_to_phys		= kvm_host_pa,
> > -- 
> > 2.36.0.rc0.470.gd361397f0d-goog
> > 
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 00/17] KVM: arm64: Parallelize stage 2 fault handling
  2022-04-15 21:58 ` Oliver Upton
  (?)
@ 2022-04-19 17:57   ` Ben Gardon
  -1 siblings, 0 replies; 165+ messages in thread
From: Ben Gardon @ 2022-04-19 17:57 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson,
	David Matlack

On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
>
> Presently KVM only takes a read lock for stage 2 faults if it believes
> the fault can be fixed by relaxing permissions on a PTE (write unprotect
> for dirty logging). Otherwise, stage 2 faults grab the write lock, which
> predictably can pile up all the vCPUs in a sufficiently large VM.
>
> The x86 port of KVM has what it calls the TDP MMU. Basically, it is an
> MMU protected by the combination of a read-write lock and RCU, allowing
> page walkers to traverse in parallel.
>
> This series is strongly inspired by the mechanics of the TDP MMU,
> making use of RCU to protect parallel walks. Note that the TLB
> invalidation mechanics are a bit different between x86 and ARM, so we
> need to use the 'break-before-make' sequence to split/collapse a
> block/table mapping, respectively.
>
> Nonetheless, using atomics on the break side allows fault handlers to
> acquire exclusive access to a PTE (lets just call it locked). Once the
> PTE lock is acquired it is then safe to assume exclusive access.
>
> Special consideration is required when pruning the page tables in
> parallel. Suppose we are collapsing a table into a block. Allowing
> parallel faults means that a software walker could be in the middle of
> a lower level traversal when the table is unlinked. Table
> walkers that prune the paging structures must now 'lock' all descendent
> PTEs, effectively asserting exclusive ownership of the substructure
> (no other walker can install something to an already locked pte).
>
> Additionally, for parallel walks we need to punt the freeing of table
> pages to the next RCU sync, as there could be multiple observers of the
> table until all walkers exit the RCU critical section. For this I
> decided to cram an rcu_head into page private data for every table page.
> We wind up spending a bit more on table pages now, but lazily allocating
> for rcu callbacks probably doesn't make a lot of sense. Not only would
> we need a large cache of them (think about installing a level 1 block)
> to wire up callbacks on all descendent tables, but we also then need to
> spend memory to actually free memory.

FWIW we used a similar approach in early versions of the TDP MMU, but
we used page->lru for the linkage instead of page->private, so that
more metadata could be stored in page->private.
Ultimately that ended up being too limiting and we decided to switch
to just using the associated struct kvm_mmu_page as the list element.
I don't know if ARM has an equivalent construct though.
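
Concretely (just a sketch, nothing that exists today on either
architecture; all of the names below are made up), the equivalent
construct would be a small per-table metadata struct that owns the
list/RCU linkage, much like what the stage2_page_header in this series
could grow into:

struct stage2_table_info {
	struct list_head link;		/* e.g. a per-VM list of table pages */
	struct rcu_head rcu_head;	/* deferred free */
	struct page *page;		/* backing table page */
	/* room for more per-table metadata later */
};

static void stage2_table_info_free_cb(struct rcu_head *head)
{
	struct stage2_table_info *info =
		container_of(head, struct stage2_table_info, rcu_head);

	__free_page(info->page);
	kfree(info);	/* assumes a plain kmalloc()'d info */
}

That keeps page->private/page->lru free for whatever else wants them.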

>
> I tried to organize these patches as best I could w/o introducing
> intermediate breakage.
>
> The first 5 patches are meant mostly as prepatory reworks, and, in the
> case of RCU a nop.
>
> Patch 6 is quite large, but I had a hard time deciding how to change the
> way we link/unlink tables to use atomics without breaking things along
> the way.
>
> Patch 7 probably should come before patch 6, as it informs the other
> read-side fault (perm relax) about when a map is in progress so it'll
> back off.
>
> Patches 8-10 take care of the pruning case, actually locking the child ptes
> instead of simply dropping table page references along the way. Note
> that we cannot assume a pte points to a table/page at this point, hence
> the same helper is called for pre- and leaf-traversal. Guide the
> recursion based on what got yanked from the PTE.
>
> Patches 11-14 wire up everything to schedule rcu callbacks on
> to-be-freed table pages. rcu_barrier() is called on the way out from
> tearing down a stage 2 page table to guarantee all memory associated
> with the VM has actually been cleaned up.
>
> Patches 15-16 loop in the fault handler to the new table traversal game.
>
> Lastly, patch 17 is a nasty bit of debugging residue to spot possible
> table page leaks. Please don't laugh ;-)
>
> Smoke tested with KVM selftests + kvm_page_table_test w/ 2M hugetlb to
> exercise the table pruning code. Haven't done anything beyond this,
> sending as an RFC now to get eyes on the code.
>
> Applies to commit fb649bda6f56 ("Merge tag 'block-5.18-2022-04-15' of
> git://git.kernel.dk/linux-block")
>
> Oliver Upton (17):
>   KVM: arm64: Directly read owner id field in stage2_pte_is_counted()
>   KVM: arm64: Only read the pte once per visit
>   KVM: arm64: Return the next table from map callbacks
>   KVM: arm64: Protect page table traversal with RCU
>   KVM: arm64: Take an argument to indicate parallel walk
>   KVM: arm64: Implement break-before-make sequence for parallel walks
>   KVM: arm64: Enlighten perm relax path about parallel walks
>   KVM: arm64: Spin off helper for initializing table pte
>   KVM: arm64: Tear down unlinked page tables in parallel walk
>   KVM: arm64: Assume a table pte is already owned in post-order
>     traversal
>   KVM: arm64: Move MMU cache init/destroy into helpers
>   KVM: arm64: Stuff mmu page cache in sub struct
>   KVM: arm64: Setup cache for stage2 page headers
>   KVM: arm64: Punt last page reference to rcu callback for parallel walk
>   KVM: arm64: Allow parallel calls to kvm_pgtable_stage2_map()
>   KVM: arm64: Enable parallel stage 2 MMU faults
>   TESTONLY: KVM: arm64: Add super lazy accounting of stage 2 table pages
>
>  arch/arm64/include/asm/kvm_host.h     |   5 +-
>  arch/arm64/include/asm/kvm_mmu.h      |   2 +
>  arch/arm64/include/asm/kvm_pgtable.h  |  14 +-
>  arch/arm64/kvm/arm.c                  |   4 +-
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c |  13 +-
>  arch/arm64/kvm/hyp/nvhe/setup.c       |  13 +-
>  arch/arm64/kvm/hyp/pgtable.c          | 518 +++++++++++++++++++-------
>  arch/arm64/kvm/mmu.c                  | 120 ++++--
>  8 files changed, 503 insertions(+), 186 deletions(-)
>
> --
> 2.36.0.rc0.470.gd361397f0d-goog
>

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 00/17] KVM: arm64: Parallelize stage 2 fault handling
  2022-04-19 17:57   ` Ben Gardon
  (?)
@ 2022-04-19 18:36     ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-19 18:36 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Marc Zyngier, Peter Shier, David Matlack, Paolo Bonzini,
	kvmarm, linux-arm-kernel

On Tue, Apr 19, 2022 at 10:57 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
> >
> > Presently KVM only takes a read lock for stage 2 faults if it believes
> > the fault can be fixed by relaxing permissions on a PTE (write unprotect
> > for dirty logging). Otherwise, stage 2 faults grab the write lock, which
> > predictably can pile up all the vCPUs in a sufficiently large VM.
> >
> > The x86 port of KVM has what it calls the TDP MMU. Basically, it is an
> > MMU protected by the combination of a read-write lock and RCU, allowing
> > page walkers to traverse in parallel.
> >
> > This series is strongly inspired by the mechanics of the TDP MMU,
> > making use of RCU to protect parallel walks. Note that the TLB
> > invalidation mechanics are a bit different between x86 and ARM, so we
> > need to use the 'break-before-make' sequence to split/collapse a
> > block/table mapping, respectively.
> >
> > Nonetheless, using atomics on the break side allows fault handlers to
> > acquire exclusive access to a PTE (lets just call it locked). Once the
> > PTE lock is acquired it is then safe to assume exclusive access.
> >
> > Special consideration is required when pruning the page tables in
> > parallel. Suppose we are collapsing a table into a block. Allowing
> > parallel faults means that a software walker could be in the middle of
> > a lower level traversal when the table is unlinked. Table
> > walkers that prune the paging structures must now 'lock' all descendent
> > PTEs, effectively asserting exclusive ownership of the substructure
> > (no other walker can install something to an already locked pte).
> >
> > Additionally, for parallel walks we need to punt the freeing of table
> > pages to the next RCU sync, as there could be multiple observers of the
> > table until all walkers exit the RCU critical section. For this I
> > decided to cram an rcu_head into page private data for every table page.
> > We wind up spending a bit more on table pages now, but lazily allocating
> > for rcu callbacks probably doesn't make a lot of sense. Not only would
> > we need a large cache of them (think about installing a level 1 block)
> > to wire up callbacks on all descendent tables, but we also then need to
> > spend memory to actually free memory.
>
> FWIW we used a similar approach in early versions of the TDP MMU, but
> we used page->lru for the linkage instead of page->private, so that
> more metadata could be stored in page->private.
> Ultimately that ended up being too limiting and we decided to switch
> to just using the associated struct kvm_mmu_page as the list element.
> I don't know if ARM has an equivalent construct though.

ARM currently doesn't have any metadata it needs to tie to the table
pages; we just do very basic page reference counting for every valid
PTE. I was going to link the table pages together (hence the page
header), but we don't actually have a functional need for that at the
moment. In fact, struct page::rcu_head would probably fit the bill and
we can avoid the extra metadata/memory for the time being.

Perhaps best to keep it simple and do the rest when we have a genuine
need for it.
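
Just to sketch what that could look like (untested, and assuming the
rest of the series stays as-is), stage2_free_table() could hang the
deferred free off the rcu_head embedded in struct page and drop the
separate header cache entirely:

static void stage2_free_table_rcu_cb(struct rcu_head *head)
{
	struct page *page = container_of(head, struct page, rcu_head);

	WARN_ON(page_ref_count(page) != 1);
	__free_page(page);
}

static void stage2_free_table(void *addr, bool shared)
{
	struct page *page = virt_to_page(addr);

	if (shared) {
		/* table may still be visible to other walkers */
		call_rcu(&page->rcu_head, stage2_free_table_rcu_cb);
	} else {
		/* exclusive access, free immediately */
		WARN_ON(page_ref_count(page) != 1);
		__free_page(page);
	}
}

stage2_memcache_zalloc_page() would then go back to being a plain
allocation, with no header cache to manage.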

--
Thanks,
Oliver
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 14/17] KVM: arm64: Punt last page reference to rcu callback for parallel walk
  2022-04-19  3:09       ` Ricardo Koller
  (?)
@ 2022-04-20  0:53         ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-20  0:53 UTC (permalink / raw)
  To: Ricardo Koller
  Cc: kvmarm, kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Reiji Watanabe,
	Paolo Bonzini, Sean Christopherson, Ben Gardon, David Matlack

Hi Ricardo,

On Mon, Apr 18, 2022 at 8:09 PM Ricardo Koller <ricarkol@google.com> wrote:
>
> On Mon, Apr 18, 2022 at 07:59:04PM -0700, Ricardo Koller wrote:
> > On Fri, Apr 15, 2022 at 09:58:58PM +0000, Oliver Upton wrote:
> > > It is possible that a table page remains visible to another thread until
> > > the next rcu synchronization event. To that end, we cannot drop the last
> > > page reference synchronous with post-order traversal for a parallel
> > > table walk.
> > >
> > > Schedule an rcu callback to clean up the child table page for parallel
> > > walks.
> > >
> > > Signed-off-by: Oliver Upton <oupton@google.com>
> > > ---
> > >  arch/arm64/include/asm/kvm_pgtable.h |  3 ++
> > >  arch/arm64/kvm/hyp/pgtable.c         | 24 +++++++++++++--
> > >  arch/arm64/kvm/mmu.c                 | 44 +++++++++++++++++++++++++++-
> > >  3 files changed, 67 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > > index 74955aba5918..52e55e00f0ca 100644
> > > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > > @@ -81,6 +81,8 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
> > >   * @put_page:                      Decrement the refcount on a page. When the
> > >   *                         refcount reaches 0 the page is automatically
> > >   *                         freed.
> > > + * @free_table:                    Drop the last page reference, possibly in the
> > > + *                         next RCU sync if doing a shared walk.
> > >   * @page_count:                    Return the refcount of a page.
> > >   * @phys_to_virt:          Convert a physical address into a virtual
> > >   *                         address mapped in the current context.
> > > @@ -98,6 +100,7 @@ struct kvm_pgtable_mm_ops {
> > >     void            (*get_page)(void *addr);
> > >     void            (*put_page)(void *addr);
> > >     int             (*page_count)(void *addr);
> > > +   void            (*free_table)(void *addr, bool shared);
> > >     void*           (*phys_to_virt)(phys_addr_t phys);
> > >     phys_addr_t     (*virt_to_phys)(void *addr);
> > >     void            (*dcache_clean_inval_poc)(void *addr, size_t size);
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index 121818d4c33e..a9a48edba63b 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -147,12 +147,19 @@ static inline void kvm_pgtable_walk_end(void)
> > >  {}
> > >
> > >  #define kvm_dereference_ptep       rcu_dereference_raw
> > > +
> > > +static inline void kvm_pgtable_destroy_barrier(void)
> > > +{}
> > > +
> > >  #else
> > >  #define kvm_pgtable_walk_begin     rcu_read_lock
> > >
> > >  #define kvm_pgtable_walk_end       rcu_read_unlock
> > >
> > >  #define kvm_dereference_ptep       rcu_dereference
> > > +
> > > +#define kvm_pgtable_destroy_barrier        rcu_barrier
> > > +
> > >  #endif
> > >
> > >  static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte, struct kvm_pgtable_mm_ops *mm_ops)
> > > @@ -1063,7 +1070,12 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
> > >             childp = kvm_pte_follow(*old, mm_ops);
> > >     }
> > >
> > > -   mm_ops->put_page(childp);
> > > +   /*
> > > +    * If we do not have exclusive access to the page tables it is possible
> > > +    * the unlinked table remains visible to another thread until the next
> > > +    * rcu synchronization.
> > > +    */
> > > +   mm_ops->free_table(childp, shared);
> > >     mm_ops->put_page(ptep);
> > >
> > >     return ret;
> > > @@ -1203,7 +1215,7 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> > >                                            kvm_granule_size(level));
> > >
> > >     if (childp)
> > > -           mm_ops->put_page(childp);
> > > +           mm_ops->free_table(childp, shared);
> > >
> > >     return 0;
> > >  }
> > > @@ -1433,7 +1445,7 @@ static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> > >     mm_ops->put_page(ptep);
> > >
> > >     if (kvm_pte_table(*old, level))
> > > -           mm_ops->put_page(kvm_pte_follow(*old, mm_ops));
> > > +           mm_ops->free_table(kvm_pte_follow(*old, mm_ops), shared);
> > >
> > >     return 0;
> > >  }
> > > @@ -1452,4 +1464,10 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
> > >     pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level) * PAGE_SIZE;
> > >     pgt->mm_ops->free_pages_exact(pgt->pgd, pgd_sz);
> > >     pgt->pgd = NULL;
> > > +
> > > +   /*
> > > +    * Guarantee that all unlinked subtrees associated with the stage2 page
> > > +    * table have also been freed before returning.
> > > +    */
> > > +   kvm_pgtable_destroy_barrier();
> > >  }
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index cc6ed6b06ec2..6ecf37009c21 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -98,9 +98,50 @@ static bool kvm_is_device_pfn(unsigned long pfn)
> > >  static void *stage2_memcache_zalloc_page(void *arg)
> > >  {
> > >     struct kvm_mmu_caches *mmu_caches = arg;
> > > +   struct stage2_page_header *hdr;
> > > +   void *addr;
> > >
> > >     /* Allocated with __GFP_ZERO, so no need to zero */
> > > -   return kvm_mmu_memory_cache_alloc(&mmu_caches->page_cache);
> > > +   addr = kvm_mmu_memory_cache_alloc(&mmu_caches->page_cache);
> > > +   if (!addr)
> > > +           return NULL;
> > > +
> > > +   hdr = kvm_mmu_memory_cache_alloc(&mmu_caches->header_cache);
> > > +   if (!hdr) {
> > > +           free_page((unsigned long)addr);
> > > +           return NULL;
> > > +   }
> > > +
> > > +   hdr->page = virt_to_page(addr);
> > > +   set_page_private(hdr->page, (unsigned long)hdr);
> > > +   return addr;
> > > +}
> > > +
> > > +static void stage2_free_page_now(struct stage2_page_header *hdr)
> > > +{
> > > +   WARN_ON(page_ref_count(hdr->page) != 1);
> > > +
> > > +   __free_page(hdr->page);
> > > +   kmem_cache_free(stage2_page_header_cache, hdr);
> > > +}
> > > +
> > > +static void stage2_free_page_rcu_cb(struct rcu_head *head)
> > > +{
> > > +   struct stage2_page_header *hdr = container_of(head, struct stage2_page_header,
> > > +                                                 rcu_head);
> > > +
> > > +   stage2_free_page_now(hdr);
> > > +}
> > > +
> > > +static void stage2_free_table(void *addr, bool shared)
> > > +{
> > > +   struct page *page = virt_to_page(addr);
> > > +   struct stage2_page_header *hdr = (struct stage2_page_header *)page_private(page);
> > > +
> > > +   if (shared)
> > > +           call_rcu(&hdr->rcu_head, stage2_free_page_rcu_cb);
> >
> > Can the number of callbacks grow to "dangerous" numbers? can it be
> > bounded with something like the following?
> >
> > if number of readers is really high:
> >       synchronize_rcu()
> > else
> >       call_rcu()
>
> sorry, meant to say "number of callbacks"

Good point. I don't have data for this, but generally speaking I don't
believe we need to enqueue a callback for every page. In fact, since we
already make the invalid PTE visible in the pre-order traversal, we
could theoretically free all of the unlinked tables from a single RCU
callback (per fault).

I think if we used synchronize_rcu() we would need to drop the mmu lock
first, since it blocks the calling thread for a full grace period.
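
Roughly what I have in mind (untested sketch; the batch structure and
helpers are invented for illustration, and plumbing a batch pointer
through the walker context is not shown): thread the unlinked table
pages through page->lru onto a per-fault list and queue one callback
for the lot:

struct stage2_free_batch {
	struct list_head pages;		/* unlinked table pages, via page->lru */
	struct rcu_head rcu_head;
};

static void stage2_free_batch_cb(struct rcu_head *head)
{
	struct stage2_free_batch *batch =
		container_of(head, struct stage2_free_batch, rcu_head);
	struct page *page, *tmp;

	list_for_each_entry_safe(page, tmp, &batch->pages, lru) {
		list_del(&page->lru);
		__free_page(page);
	}
	kfree(batch);
}

/* called once at the end of a shared walk */
static void stage2_free_batch_queue(struct stage2_free_batch *batch)
{
	if (list_empty(&batch->pages))
		kfree(batch);
	else
		call_rcu(&batch->rcu_head, stage2_free_batch_cb);
}

That bounds things at one callback per fault regardless of how many
tables get pruned, at the cost of an extra allocation per walk.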

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 06/17] KVM: arm64: Implement break-before-make sequence for parallel walks
  2022-04-15 21:58   ` Oliver Upton
  (?)
@ 2022-04-20 16:55     ` Quentin Perret
  -1 siblings, 0 replies; 165+ messages in thread
From: Quentin Perret @ 2022-04-20 16:55 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, kvm, Marc Zyngier, Ben Gardon, Peter Shier,
	David Matlack, Paolo Bonzini, linux-arm-kernel

On Friday 15 Apr 2022 at 21:58:50 (+0000), Oliver Upton wrote:
> +/*
> + * Used to indicate a pte for which a 'make-before-break' sequence is in

'break-before-make' presumably :-) ?

<snip>
> +static void stage2_make_pte(kvm_pte_t *ptep, kvm_pte_t new, struct kvm_pgtable_mm_ops *mm_ops)
> +{
> +	/* Yikes! We really shouldn't install to an entry we don't own. */
> +	WARN_ON(!stage2_pte_is_locked(*ptep));
> +
> +	if (stage2_pte_is_counted(new))
> +		mm_ops->get_page(ptep);
> +
> +	if (kvm_pte_valid(new)) {
> +		WRITE_ONCE(*ptep, new);
> +		dsb(ishst);
> +	} else {
> +		smp_store_release(ptep, new);
> +	}
> +}

I'm struggling a bit to understand this pattern. We currently use
smp_store_release() to install valid mappings, which this patch seems
to change. Is the behaviour change intentional? If so, could you please
share some details about the reasoning that applies here?

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 06/17] KVM: arm64: Implement break-before-make sequence for parallel walks
  2022-04-20 16:55     ` Quentin Perret
  (?)
@ 2022-04-20 17:06       ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-20 17:06 UTC (permalink / raw)
  To: Quentin Perret
  Cc: kvmarm, kvm, Marc Zyngier, Ben Gardon, Peter Shier,
	David Matlack, Paolo Bonzini, linux-arm-kernel

On Wed, Apr 20, 2022 at 9:55 AM Quentin Perret <qperret@google.com> wrote:
>
> On Friday 15 Apr 2022 at 21:58:50 (+0000), Oliver Upton wrote:
> > +/*
> > + * Used to indicate a pte for which a 'make-before-break' sequence is in
>
> 'break-before-make' presumably :-) ?

Gosh, I'd certainly hope so! ;)

> <snip>
> > +static void stage2_make_pte(kvm_pte_t *ptep, kvm_pte_t new, struct kvm_pgtable_mm_ops *mm_ops)
> > +{
> > +     /* Yikes! We really shouldn't install to an entry we don't own. */
> > +     WARN_ON(!stage2_pte_is_locked(*ptep));
> > +
> > +     if (stage2_pte_is_counted(new))
> > +             mm_ops->get_page(ptep);
> > +
> > +     if (kvm_pte_valid(new)) {
> > +             WRITE_ONCE(*ptep, new);
> > +             dsb(ishst);
> > +     } else {
> > +             smp_store_release(ptep, new);
> > +     }
> > +}
>
> I'm struggling a bit to understand this pattern. We currently use
> smp_store_release() to install valid mappings, which this patch seems
> to change. Is the behaviour change intentional? If so, could you please
> share some details about the reasoning that applies here?

This is unintentional. We still need to do smp_store_release(),
especially considering we acquire a lock on the PTE in the break
pattern. In fact, I believe the compare-exchange could be loosened to
have only acquire semantics. What I had really meant to do here (but
goofed) is to avoid the DSB when changing between invalid PTEs.
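
Roughly what I had in mind (illustration only, not the code that was
posted): keep the release semantics for every install and skip the DSB
when the new PTE is invalid, i.e.:

  static void stage2_make_pte(kvm_pte_t *ptep, kvm_pte_t new,
                              struct kvm_pgtable_mm_ops *mm_ops)
  {
          /* We should only ever install to an entry we own (locked). */
          WARN_ON(!stage2_pte_is_locked(*ptep));

          if (stage2_pte_is_counted(new))
                  mm_ops->get_page(ptep);

          /*
           * Publish with release semantics so the new table/leaf contents
           * are visible before the PTE that points at them.
           */
          smp_store_release(ptep, new);

          /* Only a valid PTE needs to be visible to the hardware walker. */
          if (kvm_pte_valid(new))
                  dsb(ishst);
  }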

Thanks for the review!

--
Best,
Oliver

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 09/17] KVM: arm64: Tear down unlinked page tables in parallel walk
  2022-04-15 21:58   ` Oliver Upton
  (?)
@ 2022-04-21 13:21     ` Quentin Perret
  -1 siblings, 0 replies; 165+ messages in thread
From: Quentin Perret @ 2022-04-21 13:21 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, kvm, Marc Zyngier, Ben Gardon, Peter Shier,
	David Matlack, Paolo Bonzini, linux-arm-kernel

Hi Oliver,

On Friday 15 Apr 2022 at 21:58:53 (+0000), Oliver Upton wrote:
> Breaking a table pte is insufficient to guarantee ownership of an
> unlinked subtree. Parallel software walkers could be traversing
> substructures and changing their mappings.
> 
> Recurse through the unlinked subtree and lock all descendent ptes
> to take ownership of the subtree. Since the ptes are actually being
> evicted, return table ptes back to the table walker to ensure child
> tables are also traversed. Note that this is done in both the
> pre-order and leaf visitors as the underlying pte remains volatile until
> it is unlinked.

Still trying to get the full picture of the series so bear with me. IIUC
the case you're dealing with here is when we're coallescing a table into
a block with concurrent walkers making changes in the sub-tree. I
believe this should happen when turning dirty logging off?

Why do we need to recursively lock the entire sub-tree at all in this
case? As long as the table is turned into a locked invalid PTE, what
concurrent walkers are doing in the sub-tree should be irrelevant no?
None of the changes they do will be made visible to the hardware anyway.
So as long as the sub-tree isn't freed under their feet (which should be
the point of the RCU protection) this should be all fine? Is there a
case where this is not actually true?
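
To make the question concrete, the alternative I have in mind is roughly
the below (untested sketch reusing the helpers from this series, with the
refcount bookkeeping elided):

  static int stage2_collapse_table(u64 addr, u32 level, kvm_pte_t *ptep,
                                   kvm_pte_t old, kvm_pte_t new,
                                   struct stage2_map_data *data, bool shared)
  {
          struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;

          /* Break only the table PTE; the whole subtree becomes unreachable. */
          if (!stage2_try_break_pte(ptep, old, addr, level, shared, data))
                  return -EAGAIN;

          /*
           * Concurrent software walkers may still read the old subtree, but
           * the hardware can no longer reach it, and RCU keeps the pages
           * alive until those walkers leave their read-side sections.
           */
          mm_ops->free_table(kvm_pte_follow(old, mm_ops), shared);

          /* 'new' is the block PTE, computed as in stage2_map_walker_try_leaf(). */
          stage2_make_pte(ptep, new, mm_ops);
          return 0;
  }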

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 10/17] KVM: arm64: Assume a table pte is already owned in post-order traversal
  2022-04-15 21:58   ` Oliver Upton
  (?)
@ 2022-04-21 16:11     ` Ben Gardon
  -1 siblings, 0 replies; 165+ messages in thread
From: Ben Gardon @ 2022-04-21 16:11 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson,
	David Matlack

On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
>
> For parallel walks that collapse a table into a block KVM ensures a
> locked invalid pte is visible to all observers in pre-order traversal.
> As such, there is no need to try breaking the pte again.

When you're doing the pre and post-order traversals, are they
implemented as separate traversals from the root, or is it a kind of
pre and post-order where non-leaf nodes are visited on the way down
and on the way up?
I assume either could be made to work, but the re-traversal from the
root probably minimizes TLB flushes, whereas the pre-and-post-order
would be a more efficient walk?
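
From a quick look at __kvm_pgtable_visit, the existing walk already seems
to be a single descent, roughly (error handling elided):

          if (table && (flags & KVM_PGTABLE_WALK_TABLE_PRE))
                  ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
                                               KVM_PGTABLE_WALK_TABLE_PRE);

          /* ... recurse into the child table ... */
          childp = kvm_pte_follow(pte, data->pgt->mm_ops);
          ret = __kvm_pgtable_walk(data, childp, level + 1);

          if (flags & KVM_PGTABLE_WALK_TABLE_POST)
                  ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
                                               KVM_PGTABLE_WALK_TABLE_POST);

so non-leaf entries get the PRE visit on the way down and the POST visit on
the way back up, rather than two separate walks from the root. Is that the
shape kept here?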

>
> Directly set the pte if it has already been broken.
>
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>  arch/arm64/kvm/hyp/pgtable.c | 22 ++++++++++++++++------
>  1 file changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 146fc44acf31..121818d4c33e 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -924,7 +924,7 @@ static bool stage2_leaf_mapping_allowed(u64 addr, u64 end, u32 level,
>  static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>                                       kvm_pte_t *ptep, kvm_pte_t old,
>                                       struct stage2_map_data *data,
> -                                     bool shared)
> +                                     bool shared, bool locked)
>  {
>         kvm_pte_t new;
>         u64 granule = kvm_granule_size(level), phys = data->phys;
> @@ -948,7 +948,7 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>         if (!stage2_pte_needs_update(old, new))
>                 return -EAGAIN;
>
> -       if (!stage2_try_break_pte(ptep, old, addr, level, shared, data))
> +       if (!locked && !stage2_try_break_pte(ptep, old, addr, level, shared, data))
>                 return -EAGAIN;
>
>         /* Perform CMOs before installation of the guest stage-2 PTE */
> @@ -987,7 +987,8 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>  }
>
>  static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> -                               kvm_pte_t *old, struct stage2_map_data *data, bool shared)
> +                               kvm_pte_t *old, struct stage2_map_data *data, bool shared,
> +                               bool locked)
>  {
>         struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
>         kvm_pte_t *childp, pte;
> @@ -998,10 +999,13 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>                 return 0;
>         }
>
> -       ret = stage2_map_walker_try_leaf(addr, end, level, ptep, *old, data, shared);
> +       ret = stage2_map_walker_try_leaf(addr, end, level, ptep, *old, data, shared, locked);
>         if (ret != -E2BIG)
>                 return ret;
>
> +       /* We should never attempt installing a table in post-order */
> +       WARN_ON(locked);
> +
>         if (WARN_ON(level == KVM_PGTABLE_MAX_LEVELS - 1))
>                 return -EINVAL;
>
> @@ -1048,7 +1052,13 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
>                 childp = data->childp;
>                 data->anchor = NULL;
>                 data->childp = NULL;
> -               ret = stage2_map_walk_leaf(addr, end, level, ptep, old, data, shared);
> +
> +               /*
> +                * We are guaranteed exclusive access to the pte in post-order
> +                * traversal since the locked value was made visible to all
> +                * observers in stage2_map_walk_table_pre.
> +                */
> +               ret = stage2_map_walk_leaf(addr, end, level, ptep, old, data, shared, true);
>         } else {
>                 childp = kvm_pte_follow(*old, mm_ops);
>         }
> @@ -1087,7 +1097,7 @@ static int stage2_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_
>         case KVM_PGTABLE_WALK_TABLE_PRE:
>                 return stage2_map_walk_table_pre(addr, end, level, ptep, old, data, shared);
>         case KVM_PGTABLE_WALK_LEAF:
> -               return stage2_map_walk_leaf(addr, end, level, ptep, old, data, shared);
> +               return stage2_map_walk_leaf(addr, end, level, ptep, old, data, shared, false);
>         case KVM_PGTABLE_WALK_TABLE_POST:
>                 return stage2_map_walk_table_post(addr, end, level, ptep, old, data, shared);
>         }
> --
> 2.36.0.rc0.470.gd361397f0d-goog
>

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 02/17] KVM: arm64: Only read the pte once per visit
  2022-04-15 21:58   ` Oliver Upton
  (?)
@ 2022-04-21 16:12     ` Ben Gardon
  -1 siblings, 0 replies; 165+ messages in thread
From: Ben Gardon @ 2022-04-21 16:12 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson,
	David Matlack

On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
>
> A subsequent change to KVM will parallelize modifications to the stage-2
> page tables. The various page table walkers read the ptep multiple
> times, which could lead to a visitor seeing multiple values during the
> visit.
>
> Pass through the observed pte to the visitor callbacks. Promote reads of
> the ptep to a full READ_ONCE(), which will matter more when we start
> tweaking ptes atomically. Note that a pointer to the old pte is given to
> visitors, as parallel visitors will need to steer the page table
> traversal as they adjust the page tables.
>
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h  |   2 +-
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c |   7 +-
>  arch/arm64/kvm/hyp/nvhe/setup.c       |   9 +-
>  arch/arm64/kvm/hyp/pgtable.c          | 113 +++++++++++++-------------
>  4 files changed, 63 insertions(+), 68 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 9f339dffbc1a..ea818a5f7408 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -192,7 +192,7 @@ enum kvm_pgtable_walk_flags {
>  };
>
>  typedef int (*kvm_pgtable_visitor_fn_t)(u64 addr, u64 end, u32 level,
> -                                       kvm_pte_t *ptep,
> +                                       kvm_pte_t *ptep, kvm_pte_t *old,
>                                         enum kvm_pgtable_walk_flags flag,
>                                         void * const arg);
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 78edf077fa3b..601a586581d8 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -422,17 +422,16 @@ struct check_walk_data {
>  };
>
>  static int __check_page_state_visitor(u64 addr, u64 end, u32 level,
> -                                     kvm_pte_t *ptep,
> +                                     kvm_pte_t *ptep, kvm_pte_t *old,
>                                       enum kvm_pgtable_walk_flags flag,
>                                       void * const arg)

David mentioned combining the ARM and x86 TDP MMUs, and I wonder if a
first step in that direction could be to adopt the TDP iter here. The
signatures of most of these functions are very similar to the fields
in the TDP iter and the TDP MMU might benefit from adopting some
version of kvm_pgtable_walk_flags.
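
For instance, one could imagine folding the visitor arguments into a small
iterator along the lines of tdp_iter (hypothetical sketch, names invented):

  struct kvm_pgtable_iter {
          u64                             addr;   /* IPA covered by the current entry */
          u32                             level;  /* table level of the entry */
          kvm_pte_t                       *ptep;  /* pointer to the entry itself */
          kvm_pte_t                       old;    /* value observed when it was read */
          enum kvm_pgtable_walk_flags     flags;  /* PRE/LEAF/POST visit type */
  };

which lines up fairly directly with tdp_iter's gfn/level/sptep/old_spte.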


>  {
>         struct check_walk_data *d = arg;
> -       kvm_pte_t pte = *ptep;
>
> -       if (kvm_pte_valid(pte) && !addr_is_memory(kvm_pte_to_phys(pte)))
> +       if (kvm_pte_valid(*old) && !addr_is_memory(kvm_pte_to_phys(*old)))
>                 return -EINVAL;
>
> -       return d->get_page_state(pte) == d->desired ? 0 : -EPERM;
> +       return d->get_page_state(*old) == d->desired ? 0 : -EPERM;
>  }
>
>  static int check_page_state_range(struct kvm_pgtable *pgt, u64 addr, u64 size,
> diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
> index 27af337f9fea..ecab7a4049d6 100644
> --- a/arch/arm64/kvm/hyp/nvhe/setup.c
> +++ b/arch/arm64/kvm/hyp/nvhe/setup.c
> @@ -162,17 +162,16 @@ static void hpool_put_page(void *addr)
>  }
>
>  static int finalize_host_mappings_walker(u64 addr, u64 end, u32 level,
> -                                        kvm_pte_t *ptep,
> +                                        kvm_pte_t *ptep, kvm_pte_t *old,
>                                          enum kvm_pgtable_walk_flags flag,
>                                          void * const arg)
>  {
>         struct kvm_pgtable_mm_ops *mm_ops = arg;
>         enum kvm_pgtable_prot prot;
>         enum pkvm_page_state state;
> -       kvm_pte_t pte = *ptep;
>         phys_addr_t phys;
>
> -       if (!kvm_pte_valid(pte))
> +       if (!kvm_pte_valid(*old))
>                 return 0;
>
>         /*
> @@ -187,7 +186,7 @@ static int finalize_host_mappings_walker(u64 addr, u64 end, u32 level,
>         if (level != (KVM_PGTABLE_MAX_LEVELS - 1))
>                 return -EINVAL;
>
> -       phys = kvm_pte_to_phys(pte);
> +       phys = kvm_pte_to_phys(*old);
>         if (!addr_is_memory(phys))
>                 return -EINVAL;
>
> @@ -195,7 +194,7 @@ static int finalize_host_mappings_walker(u64 addr, u64 end, u32 level,
>          * Adjust the host stage-2 mappings to match the ownership attributes
>          * configured in the hypervisor stage-1.
>          */
> -       state = pkvm_getstate(kvm_pgtable_hyp_pte_prot(pte));
> +       state = pkvm_getstate(kvm_pgtable_hyp_pte_prot(*old));
>         switch (state) {
>         case PKVM_PAGE_OWNED:
>                 return host_stage2_set_owner_locked(phys, PAGE_SIZE, pkvm_hyp_id);
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index e1506da3e2fb..ad911cd44425 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -178,11 +178,11 @@ static u8 kvm_invalid_pte_owner(kvm_pte_t pte)
>  }
>
>  static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
> -                                 u32 level, kvm_pte_t *ptep,
> +                                 u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
>                                   enum kvm_pgtable_walk_flags flag)
>  {
>         struct kvm_pgtable_walker *walker = data->walker;
> -       return walker->cb(addr, data->end, level, ptep, flag, walker->arg);
> +       return walker->cb(addr, data->end, level, ptep, old, flag, walker->arg);
>  }
>
>  static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data,
> @@ -193,17 +193,17 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
>  {
>         int ret = 0;
>         u64 addr = data->addr;
> -       kvm_pte_t *childp, pte = *ptep;
> +       kvm_pte_t *childp, pte = READ_ONCE(*ptep);
>         bool table = kvm_pte_table(pte, level);
>         enum kvm_pgtable_walk_flags flags = data->walker->flags;
>
>         if (table && (flags & KVM_PGTABLE_WALK_TABLE_PRE)) {
> -               ret = kvm_pgtable_visitor_cb(data, addr, level, ptep,
> +               ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
>                                              KVM_PGTABLE_WALK_TABLE_PRE);
>         }
>
>         if (!table && (flags & KVM_PGTABLE_WALK_LEAF)) {
> -               ret = kvm_pgtable_visitor_cb(data, addr, level, ptep,
> +               ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
>                                              KVM_PGTABLE_WALK_LEAF);
>                 pte = *ptep;
>                 table = kvm_pte_table(pte, level);
> @@ -224,7 +224,7 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
>                 goto out;
>
>         if (flags & KVM_PGTABLE_WALK_TABLE_POST) {
> -               ret = kvm_pgtable_visitor_cb(data, addr, level, ptep,
> +               ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
>                                              KVM_PGTABLE_WALK_TABLE_POST);
>         }
>
> @@ -297,12 +297,12 @@ struct leaf_walk_data {
>         u32             level;
>  };
>
> -static int leaf_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> +static int leaf_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
>                        enum kvm_pgtable_walk_flags flag, void * const arg)
>  {
>         struct leaf_walk_data *data = arg;
>
> -       data->pte   = *ptep;
> +       data->pte   = *old;
>         data->level = level;
>
>         return 0;
> @@ -388,10 +388,10 @@ enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte)
>         return prot;
>  }
>
> -static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> -                                   kvm_pte_t *ptep, struct hyp_map_data *data)
> +static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> +                                   kvm_pte_t old, struct hyp_map_data *data)
>  {
> -       kvm_pte_t new, old = *ptep;
> +       kvm_pte_t new;
>         u64 granule = kvm_granule_size(level), phys = data->phys;
>
>         if (!kvm_block_mapping_supported(addr, end, phys, level))
> @@ -410,14 +410,14 @@ static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>         return true;
>  }
>
> -static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> +static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
>                           enum kvm_pgtable_walk_flags flag, void * const arg)
>  {
>         kvm_pte_t *childp;
>         struct hyp_map_data *data = arg;
>         struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
>
> -       if (hyp_map_walker_try_leaf(addr, end, level, ptep, arg))
> +       if (hyp_map_walker_try_leaf(addr, end, level, ptep, *old, arg))
>                 return 0;
>
>         if (WARN_ON(level == KVM_PGTABLE_MAX_LEVELS - 1))
> @@ -461,19 +461,19 @@ struct hyp_unmap_data {
>         struct kvm_pgtable_mm_ops       *mm_ops;
>  };
>
> -static int hyp_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> +static int hyp_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
>                             enum kvm_pgtable_walk_flags flag, void * const arg)
>  {
> -       kvm_pte_t pte = *ptep, *childp = NULL;
> +       kvm_pte_t *childp = NULL;
>         u64 granule = kvm_granule_size(level);
>         struct hyp_unmap_data *data = arg;
>         struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
>
> -       if (!kvm_pte_valid(pte))
> +       if (!kvm_pte_valid(*old))
>                 return -EINVAL;
>
> -       if (kvm_pte_table(pte, level)) {
> -               childp = kvm_pte_follow(pte, mm_ops);
> +       if (kvm_pte_table(*old, level)) {
> +               childp = kvm_pte_follow(*old, mm_ops);
>
>                 if (mm_ops->page_count(childp) != 1)
>                         return 0;
> @@ -537,19 +537,18 @@ int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits,
>         return 0;
>  }
>
> -static int hyp_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> +static int hyp_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
>                            enum kvm_pgtable_walk_flags flag, void * const arg)
>  {
>         struct kvm_pgtable_mm_ops *mm_ops = arg;
> -       kvm_pte_t pte = *ptep;
>
> -       if (!kvm_pte_valid(pte))
> +       if (!kvm_pte_valid(*old))
>                 return 0;
>
>         mm_ops->put_page(ptep);
>
> -       if (kvm_pte_table(pte, level))
> -               mm_ops->put_page(kvm_pte_follow(pte, mm_ops));
> +       if (kvm_pte_table(*old, level))
> +               mm_ops->put_page(kvm_pte_follow(*old, mm_ops));
>
>         return 0;
>  }
> @@ -723,10 +722,10 @@ static bool stage2_leaf_mapping_allowed(u64 addr, u64 end, u32 level,
>  }
>
>  static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> -                                     kvm_pte_t *ptep,
> +                                     kvm_pte_t *ptep, kvm_pte_t old,
>                                       struct stage2_map_data *data)
>  {
> -       kvm_pte_t new, old = *ptep;
> +       kvm_pte_t new;
>         u64 granule = kvm_granule_size(level), phys = data->phys;
>         struct kvm_pgtable *pgt = data->mmu->pgt;
>         struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
> @@ -769,7 +768,7 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>  }
>
>  static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
> -                                    kvm_pte_t *ptep,
> +                                    kvm_pte_t *ptep, kvm_pte_t *old,
>                                      struct stage2_map_data *data)
>  {
>         if (data->anchor)
> @@ -778,7 +777,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>         if (!stage2_leaf_mapping_allowed(addr, end, level, data))
>                 return 0;
>
> -       data->childp = kvm_pte_follow(*ptep, data->mm_ops);
> +       data->childp = kvm_pte_follow(*old, data->mm_ops);
>         kvm_clear_pte(ptep);
>
>         /*
> @@ -792,20 +791,20 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>  }
>
>  static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> -                               struct stage2_map_data *data)
> +                               kvm_pte_t *old, struct stage2_map_data *data)
>  {
>         struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
> -       kvm_pte_t *childp, pte = *ptep;
> +       kvm_pte_t *childp;
>         int ret;
>
>         if (data->anchor) {
> -               if (stage2_pte_is_counted(pte))
> +               if (stage2_pte_is_counted(*old))
>                         mm_ops->put_page(ptep);
>
>                 return 0;
>         }
>
> -       ret = stage2_map_walker_try_leaf(addr, end, level, ptep, data);
> +       ret = stage2_map_walker_try_leaf(addr, end, level, ptep, *old, data);
>         if (ret != -E2BIG)
>                 return ret;
>
> @@ -824,7 +823,7 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>          * a table. Accesses beyond 'end' that fall within the new table
>          * will be mapped lazily.
>          */
> -       if (stage2_pte_is_counted(pte))
> +       if (stage2_pte_is_counted(*old))
>                 stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
>
>         kvm_set_table_pte(ptep, childp, mm_ops);
> @@ -834,7 +833,7 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>  }
>
>  static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
> -                                     kvm_pte_t *ptep,
> +                                     kvm_pte_t *ptep, kvm_pte_t *old,
>                                       struct stage2_map_data *data)
>  {
>         struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
> @@ -848,9 +847,9 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
>                 childp = data->childp;
>                 data->anchor = NULL;
>                 data->childp = NULL;
> -               ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
> +               ret = stage2_map_walk_leaf(addr, end, level, ptep, old, data);
>         } else {
> -               childp = kvm_pte_follow(*ptep, mm_ops);
> +               childp = kvm_pte_follow(*old, mm_ops);
>         }
>
>         mm_ops->put_page(childp);
> @@ -878,18 +877,18 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
>   * the page-table, installing the block entry when it revisits the anchor
>   * pointer and clearing the anchor to NULL.
>   */
> -static int stage2_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> +static int stage2_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
>                              enum kvm_pgtable_walk_flags flag, void * const arg)
>  {
>         struct stage2_map_data *data = arg;
>
>         switch (flag) {
>         case KVM_PGTABLE_WALK_TABLE_PRE:
> -               return stage2_map_walk_table_pre(addr, end, level, ptep, data);
> +               return stage2_map_walk_table_pre(addr, end, level, ptep, old, data);
>         case KVM_PGTABLE_WALK_LEAF:
> -               return stage2_map_walk_leaf(addr, end, level, ptep, data);
> +               return stage2_map_walk_leaf(addr, end, level, ptep, old, data);
>         case KVM_PGTABLE_WALK_TABLE_POST:
> -               return stage2_map_walk_table_post(addr, end, level, ptep, data);
> +               return stage2_map_walk_table_post(addr, end, level, ptep, old, data);
>         }
>
>         return -EINVAL;
> @@ -955,29 +954,29 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
>  }
>
>  static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> -                              enum kvm_pgtable_walk_flags flag,
> +                              kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
>                                void * const arg)
>  {
>         struct kvm_pgtable *pgt = arg;
>         struct kvm_s2_mmu *mmu = pgt->mmu;
>         struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
> -       kvm_pte_t pte = *ptep, *childp = NULL;
> +       kvm_pte_t *childp = NULL;
>         bool need_flush = false;
>
> -       if (!kvm_pte_valid(pte)) {
> -               if (stage2_pte_is_counted(pte)) {
> +       if (!kvm_pte_valid(*old)) {
> +               if (stage2_pte_is_counted(*old)) {
>                         kvm_clear_pte(ptep);
>                         mm_ops->put_page(ptep);
>                 }
>                 return 0;
>         }
>
> -       if (kvm_pte_table(pte, level)) {
> -               childp = kvm_pte_follow(pte, mm_ops);
> +       if (kvm_pte_table(*old, level)) {
> +               childp = kvm_pte_follow(*old, mm_ops);
>
>                 if (mm_ops->page_count(childp) != 1)
>                         return 0;
> -       } else if (stage2_pte_cacheable(pgt, pte)) {
> +       } else if (stage2_pte_cacheable(pgt, *old)) {
>                 need_flush = !stage2_has_fwb(pgt);
>         }
>
> @@ -989,7 +988,7 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>         stage2_put_pte(ptep, mmu, addr, level, mm_ops);
>
>         if (need_flush && mm_ops->dcache_clean_inval_poc)
> -               mm_ops->dcache_clean_inval_poc(kvm_pte_follow(pte, mm_ops),
> +               mm_ops->dcache_clean_inval_poc(kvm_pte_follow(*old, mm_ops),
>                                                kvm_granule_size(level));
>
>         if (childp)
> @@ -1018,10 +1017,10 @@ struct stage2_attr_data {
>  };
>
>  static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> -                             enum kvm_pgtable_walk_flags flag,
> +                             kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
>                               void * const arg)
>  {
> -       kvm_pte_t pte = *ptep;
> +       kvm_pte_t pte = *old;
>         struct stage2_attr_data *data = arg;
>         struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
>
> @@ -1146,18 +1145,17 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
>  }
>
>  static int stage2_flush_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> -                              enum kvm_pgtable_walk_flags flag,
> +                              kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
>                                void * const arg)
>  {
>         struct kvm_pgtable *pgt = arg;
>         struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
> -       kvm_pte_t pte = *ptep;
>
> -       if (!kvm_pte_valid(pte) || !stage2_pte_cacheable(pgt, pte))
> +       if (!kvm_pte_valid(*old) || !stage2_pte_cacheable(pgt, *old))
>                 return 0;
>
>         if (mm_ops->dcache_clean_inval_poc)
> -               mm_ops->dcache_clean_inval_poc(kvm_pte_follow(pte, mm_ops),
> +               mm_ops->dcache_clean_inval_poc(kvm_pte_follow(*old, mm_ops),
>                                                kvm_granule_size(level));
>         return 0;
>  }
> @@ -1206,19 +1204,18 @@ int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
>  }
>
>  static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> -                             enum kvm_pgtable_walk_flags flag,
> +                             kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
>                               void * const arg)
>  {
>         struct kvm_pgtable_mm_ops *mm_ops = arg;
> -       kvm_pte_t pte = *ptep;
>
> -       if (!stage2_pte_is_counted(pte))
> +       if (!stage2_pte_is_counted(*old))
>                 return 0;
>
>         mm_ops->put_page(ptep);
>
> -       if (kvm_pte_table(pte, level))
> -               mm_ops->put_page(kvm_pte_follow(pte, mm_ops));
> +       if (kvm_pte_table(*old, level))
> +               mm_ops->put_page(kvm_pte_follow(*old, mm_ops));
>
>         return 0;
>  }
> --
> 2.36.0.rc0.470.gd361397f0d-goog
>


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 02/17] KVM: arm64: Only read the pte once per visit
@ 2022-04-21 16:12     ` Ben Gardon
  0 siblings, 0 replies; 165+ messages in thread
From: Ben Gardon @ 2022-04-21 16:12 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvm, Marc Zyngier, Peter Shier, David Matlack, Paolo Bonzini,
	kvmarm, linux-arm-kernel

On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
>
> A subsequent change to KVM will parallelize modifications to the stage-2
> page tables. The various page table walkers read the ptep multiple
> times, which could lead to a visitor seeing multiple values during the
> visit.
>
> Pass through the observed pte to the visitor callbacks. Promote reads of
> the ptep to a full READ_ONCE(), which will matter more when we start
> tweaking ptes atomically. Note that a pointer to the old pte is given to
> visitors, as parallel visitors will need to steer the page table
> traversal as they adjust the page tables.
>
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h  |   2 +-
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c |   7 +-
>  arch/arm64/kvm/hyp/nvhe/setup.c       |   9 +-
>  arch/arm64/kvm/hyp/pgtable.c          | 113 +++++++++++++-------------
>  4 files changed, 63 insertions(+), 68 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 9f339dffbc1a..ea818a5f7408 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -192,7 +192,7 @@ enum kvm_pgtable_walk_flags {
>  };
>
>  typedef int (*kvm_pgtable_visitor_fn_t)(u64 addr, u64 end, u32 level,
> -                                       kvm_pte_t *ptep,
> +                                       kvm_pte_t *ptep, kvm_pte_t *old,
>                                         enum kvm_pgtable_walk_flags flag,
>                                         void * const arg);
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 78edf077fa3b..601a586581d8 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -422,17 +422,16 @@ struct check_walk_data {
>  };
>
>  static int __check_page_state_visitor(u64 addr, u64 end, u32 level,
> -                                     kvm_pte_t *ptep,
> +                                     kvm_pte_t *ptep, kvm_pte_t *old,
>                                       enum kvm_pgtable_walk_flags flag,
>                                       void * const arg)

David mentioned combining the ARM and x86 TDP MMUs, and I wonder if a
first step in that direction could be to adopt the TDP iter here. The
signatures of most of these functions are very similar to the fields
in the TDP iter, and the TDP MMU might benefit from adopting some
version of kvm_pgtable_walk_flags.
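
For illustration, a rough sketch of what an iterator-style interface could
look like on the arm64 side, loosely modeled on x86's struct tdp_iter and
for_each_tdp_pte(); none of the names below exist in this series, they are
purely hypothetical:

struct stage2_iter {
	u64		addr;	/* IPA currently being visited */
	u64		end;	/* end of the range being walked */
	u32		level;	/* level of the current entry */
	kvm_pte_t	*ptep;	/* location of the current entry */
	kvm_pte_t	old;	/* value observed when the entry was read */
};

/*
 * Hypothetical usage, mirroring for_each_tdp_pte() on x86. The caller drives
 * the loop instead of supplying a visitor callback, and 'iter.old' plays the
 * same role as the 'old' snapshot passed to the callbacks in this patch:
 *
 *	for_each_stage2_pte(iter, pgt, addr, size) {
 *		if (!kvm_pte_valid(iter.old))
 *			continue;
 *		...
 *	}
 */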


>  {
>         struct check_walk_data *d = arg;
> -       kvm_pte_t pte = *ptep;
>
> -       if (kvm_pte_valid(pte) && !addr_is_memory(kvm_pte_to_phys(pte)))
> +       if (kvm_pte_valid(*old) && !addr_is_memory(kvm_pte_to_phys(*old)))
>                 return -EINVAL;
>
> -       return d->get_page_state(pte) == d->desired ? 0 : -EPERM;
> +       return d->get_page_state(*old) == d->desired ? 0 : -EPERM;
>  }
>
>  static int check_page_state_range(struct kvm_pgtable *pgt, u64 addr, u64 size,
> diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
> index 27af337f9fea..ecab7a4049d6 100644
> --- a/arch/arm64/kvm/hyp/nvhe/setup.c
> +++ b/arch/arm64/kvm/hyp/nvhe/setup.c
> @@ -162,17 +162,16 @@ static void hpool_put_page(void *addr)
>  }
>
>  static int finalize_host_mappings_walker(u64 addr, u64 end, u32 level,
> -                                        kvm_pte_t *ptep,
> +                                        kvm_pte_t *ptep, kvm_pte_t *old,
>                                          enum kvm_pgtable_walk_flags flag,
>                                          void * const arg)
>  {
>         struct kvm_pgtable_mm_ops *mm_ops = arg;
>         enum kvm_pgtable_prot prot;
>         enum pkvm_page_state state;
> -       kvm_pte_t pte = *ptep;
>         phys_addr_t phys;
>
> -       if (!kvm_pte_valid(pte))
> +       if (!kvm_pte_valid(*old))
>                 return 0;
>
>         /*
> @@ -187,7 +186,7 @@ static int finalize_host_mappings_walker(u64 addr, u64 end, u32 level,
>         if (level != (KVM_PGTABLE_MAX_LEVELS - 1))
>                 return -EINVAL;
>
> -       phys = kvm_pte_to_phys(pte);
> +       phys = kvm_pte_to_phys(*old);
>         if (!addr_is_memory(phys))
>                 return -EINVAL;
>
> @@ -195,7 +194,7 @@ static int finalize_host_mappings_walker(u64 addr, u64 end, u32 level,
>          * Adjust the host stage-2 mappings to match the ownership attributes
>          * configured in the hypervisor stage-1.
>          */
> -       state = pkvm_getstate(kvm_pgtable_hyp_pte_prot(pte));
> +       state = pkvm_getstate(kvm_pgtable_hyp_pte_prot(*old));
>         switch (state) {
>         case PKVM_PAGE_OWNED:
>                 return host_stage2_set_owner_locked(phys, PAGE_SIZE, pkvm_hyp_id);
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index e1506da3e2fb..ad911cd44425 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -178,11 +178,11 @@ static u8 kvm_invalid_pte_owner(kvm_pte_t pte)
>  }
>
>  static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
> -                                 u32 level, kvm_pte_t *ptep,
> +                                 u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
>                                   enum kvm_pgtable_walk_flags flag)
>  {
>         struct kvm_pgtable_walker *walker = data->walker;
> -       return walker->cb(addr, data->end, level, ptep, flag, walker->arg);
> +       return walker->cb(addr, data->end, level, ptep, old, flag, walker->arg);
>  }
>
>  static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data,
> @@ -193,17 +193,17 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
>  {
>         int ret = 0;
>         u64 addr = data->addr;
> -       kvm_pte_t *childp, pte = *ptep;
> +       kvm_pte_t *childp, pte = READ_ONCE(*ptep);
>         bool table = kvm_pte_table(pte, level);
>         enum kvm_pgtable_walk_flags flags = data->walker->flags;
>
>         if (table && (flags & KVM_PGTABLE_WALK_TABLE_PRE)) {
> -               ret = kvm_pgtable_visitor_cb(data, addr, level, ptep,
> +               ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
>                                              KVM_PGTABLE_WALK_TABLE_PRE);
>         }
>
>         if (!table && (flags & KVM_PGTABLE_WALK_LEAF)) {
> -               ret = kvm_pgtable_visitor_cb(data, addr, level, ptep,
> +               ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
>                                              KVM_PGTABLE_WALK_LEAF);
>                 pte = *ptep;
>                 table = kvm_pte_table(pte, level);
> @@ -224,7 +224,7 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
>                 goto out;
>
>         if (flags & KVM_PGTABLE_WALK_TABLE_POST) {
> -               ret = kvm_pgtable_visitor_cb(data, addr, level, ptep,
> +               ret = kvm_pgtable_visitor_cb(data, addr, level, ptep, &pte,
>                                              KVM_PGTABLE_WALK_TABLE_POST);
>         }
>
> @@ -297,12 +297,12 @@ struct leaf_walk_data {
>         u32             level;
>  };
>
> -static int leaf_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> +static int leaf_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
>                        enum kvm_pgtable_walk_flags flag, void * const arg)
>  {
>         struct leaf_walk_data *data = arg;
>
> -       data->pte   = *ptep;
> +       data->pte   = *old;
>         data->level = level;
>
>         return 0;
> @@ -388,10 +388,10 @@ enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte)
>         return prot;
>  }
>
> -static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> -                                   kvm_pte_t *ptep, struct hyp_map_data *data)
> +static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> +                                   kvm_pte_t old, struct hyp_map_data *data)
>  {
> -       kvm_pte_t new, old = *ptep;
> +       kvm_pte_t new;
>         u64 granule = kvm_granule_size(level), phys = data->phys;
>
>         if (!kvm_block_mapping_supported(addr, end, phys, level))
> @@ -410,14 +410,14 @@ static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>         return true;
>  }
>
> -static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> +static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
>                           enum kvm_pgtable_walk_flags flag, void * const arg)
>  {
>         kvm_pte_t *childp;
>         struct hyp_map_data *data = arg;
>         struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
>
> -       if (hyp_map_walker_try_leaf(addr, end, level, ptep, arg))
> +       if (hyp_map_walker_try_leaf(addr, end, level, ptep, *old, arg))
>                 return 0;
>
>         if (WARN_ON(level == KVM_PGTABLE_MAX_LEVELS - 1))
> @@ -461,19 +461,19 @@ struct hyp_unmap_data {
>         struct kvm_pgtable_mm_ops       *mm_ops;
>  };
>
> -static int hyp_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> +static int hyp_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
>                             enum kvm_pgtable_walk_flags flag, void * const arg)
>  {
> -       kvm_pte_t pte = *ptep, *childp = NULL;
> +       kvm_pte_t *childp = NULL;
>         u64 granule = kvm_granule_size(level);
>         struct hyp_unmap_data *data = arg;
>         struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
>
> -       if (!kvm_pte_valid(pte))
> +       if (!kvm_pte_valid(*old))
>                 return -EINVAL;
>
> -       if (kvm_pte_table(pte, level)) {
> -               childp = kvm_pte_follow(pte, mm_ops);
> +       if (kvm_pte_table(*old, level)) {
> +               childp = kvm_pte_follow(*old, mm_ops);
>
>                 if (mm_ops->page_count(childp) != 1)
>                         return 0;
> @@ -537,19 +537,18 @@ int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits,
>         return 0;
>  }
>
> -static int hyp_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> +static int hyp_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
>                            enum kvm_pgtable_walk_flags flag, void * const arg)
>  {
>         struct kvm_pgtable_mm_ops *mm_ops = arg;
> -       kvm_pte_t pte = *ptep;
>
> -       if (!kvm_pte_valid(pte))
> +       if (!kvm_pte_valid(*old))
>                 return 0;
>
>         mm_ops->put_page(ptep);
>
> -       if (kvm_pte_table(pte, level))
> -               mm_ops->put_page(kvm_pte_follow(pte, mm_ops));
> +       if (kvm_pte_table(*old, level))
> +               mm_ops->put_page(kvm_pte_follow(*old, mm_ops));
>
>         return 0;
>  }
> @@ -723,10 +722,10 @@ static bool stage2_leaf_mapping_allowed(u64 addr, u64 end, u32 level,
>  }
>
>  static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> -                                     kvm_pte_t *ptep,
> +                                     kvm_pte_t *ptep, kvm_pte_t old,
>                                       struct stage2_map_data *data)
>  {
> -       kvm_pte_t new, old = *ptep;
> +       kvm_pte_t new;
>         u64 granule = kvm_granule_size(level), phys = data->phys;
>         struct kvm_pgtable *pgt = data->mmu->pgt;
>         struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
> @@ -769,7 +768,7 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>  }
>
>  static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
> -                                    kvm_pte_t *ptep,
> +                                    kvm_pte_t *ptep, kvm_pte_t *old,
>                                      struct stage2_map_data *data)
>  {
>         if (data->anchor)
> @@ -778,7 +777,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>         if (!stage2_leaf_mapping_allowed(addr, end, level, data))
>                 return 0;
>
> -       data->childp = kvm_pte_follow(*ptep, data->mm_ops);
> +       data->childp = kvm_pte_follow(*old, data->mm_ops);
>         kvm_clear_pte(ptep);
>
>         /*
> @@ -792,20 +791,20 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>  }
>
>  static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> -                               struct stage2_map_data *data)
> +                               kvm_pte_t *old, struct stage2_map_data *data)
>  {
>         struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
> -       kvm_pte_t *childp, pte = *ptep;
> +       kvm_pte_t *childp;
>         int ret;
>
>         if (data->anchor) {
> -               if (stage2_pte_is_counted(pte))
> +               if (stage2_pte_is_counted(*old))
>                         mm_ops->put_page(ptep);
>
>                 return 0;
>         }
>
> -       ret = stage2_map_walker_try_leaf(addr, end, level, ptep, data);
> +       ret = stage2_map_walker_try_leaf(addr, end, level, ptep, *old, data);
>         if (ret != -E2BIG)
>                 return ret;
>
> @@ -824,7 +823,7 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>          * a table. Accesses beyond 'end' that fall within the new table
>          * will be mapped lazily.
>          */
> -       if (stage2_pte_is_counted(pte))
> +       if (stage2_pte_is_counted(*old))
>                 stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
>
>         kvm_set_table_pte(ptep, childp, mm_ops);
> @@ -834,7 +833,7 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>  }
>
>  static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
> -                                     kvm_pte_t *ptep,
> +                                     kvm_pte_t *ptep, kvm_pte_t *old,
>                                       struct stage2_map_data *data)
>  {
>         struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
> @@ -848,9 +847,9 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
>                 childp = data->childp;
>                 data->anchor = NULL;
>                 data->childp = NULL;
> -               ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
> +               ret = stage2_map_walk_leaf(addr, end, level, ptep, old, data);
>         } else {
> -               childp = kvm_pte_follow(*ptep, mm_ops);
> +               childp = kvm_pte_follow(*old, mm_ops);
>         }
>
>         mm_ops->put_page(childp);
> @@ -878,18 +877,18 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
>   * the page-table, installing the block entry when it revisits the anchor
>   * pointer and clearing the anchor to NULL.
>   */
> -static int stage2_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> +static int stage2_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, kvm_pte_t *old,
>                              enum kvm_pgtable_walk_flags flag, void * const arg)
>  {
>         struct stage2_map_data *data = arg;
>
>         switch (flag) {
>         case KVM_PGTABLE_WALK_TABLE_PRE:
> -               return stage2_map_walk_table_pre(addr, end, level, ptep, data);
> +               return stage2_map_walk_table_pre(addr, end, level, ptep, old, data);
>         case KVM_PGTABLE_WALK_LEAF:
> -               return stage2_map_walk_leaf(addr, end, level, ptep, data);
> +               return stage2_map_walk_leaf(addr, end, level, ptep, old, data);
>         case KVM_PGTABLE_WALK_TABLE_POST:
> -               return stage2_map_walk_table_post(addr, end, level, ptep, data);
> +               return stage2_map_walk_table_post(addr, end, level, ptep, old, data);
>         }
>
>         return -EINVAL;
> @@ -955,29 +954,29 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
>  }
>
>  static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> -                              enum kvm_pgtable_walk_flags flag,
> +                              kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
>                                void * const arg)
>  {
>         struct kvm_pgtable *pgt = arg;
>         struct kvm_s2_mmu *mmu = pgt->mmu;
>         struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
> -       kvm_pte_t pte = *ptep, *childp = NULL;
> +       kvm_pte_t *childp = NULL;
>         bool need_flush = false;
>
> -       if (!kvm_pte_valid(pte)) {
> -               if (stage2_pte_is_counted(pte)) {
> +       if (!kvm_pte_valid(*old)) {
> +               if (stage2_pte_is_counted(*old)) {
>                         kvm_clear_pte(ptep);
>                         mm_ops->put_page(ptep);
>                 }
>                 return 0;
>         }
>
> -       if (kvm_pte_table(pte, level)) {
> -               childp = kvm_pte_follow(pte, mm_ops);
> +       if (kvm_pte_table(*old, level)) {
> +               childp = kvm_pte_follow(*old, mm_ops);
>
>                 if (mm_ops->page_count(childp) != 1)
>                         return 0;
> -       } else if (stage2_pte_cacheable(pgt, pte)) {
> +       } else if (stage2_pte_cacheable(pgt, *old)) {
>                 need_flush = !stage2_has_fwb(pgt);
>         }
>
> @@ -989,7 +988,7 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>         stage2_put_pte(ptep, mmu, addr, level, mm_ops);
>
>         if (need_flush && mm_ops->dcache_clean_inval_poc)
> -               mm_ops->dcache_clean_inval_poc(kvm_pte_follow(pte, mm_ops),
> +               mm_ops->dcache_clean_inval_poc(kvm_pte_follow(*old, mm_ops),
>                                                kvm_granule_size(level));
>
>         if (childp)
> @@ -1018,10 +1017,10 @@ struct stage2_attr_data {
>  };
>
>  static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> -                             enum kvm_pgtable_walk_flags flag,
> +                             kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
>                               void * const arg)
>  {
> -       kvm_pte_t pte = *ptep;
> +       kvm_pte_t pte = *old;
>         struct stage2_attr_data *data = arg;
>         struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
>
> @@ -1146,18 +1145,17 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
>  }
>
>  static int stage2_flush_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> -                              enum kvm_pgtable_walk_flags flag,
> +                              kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
>                                void * const arg)
>  {
>         struct kvm_pgtable *pgt = arg;
>         struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
> -       kvm_pte_t pte = *ptep;
>
> -       if (!kvm_pte_valid(pte) || !stage2_pte_cacheable(pgt, pte))
> +       if (!kvm_pte_valid(*old) || !stage2_pte_cacheable(pgt, *old))
>                 return 0;
>
>         if (mm_ops->dcache_clean_inval_poc)
> -               mm_ops->dcache_clean_inval_poc(kvm_pte_follow(pte, mm_ops),
> +               mm_ops->dcache_clean_inval_poc(kvm_pte_follow(*old, mm_ops),
>                                                kvm_granule_size(level));
>         return 0;
>  }
> @@ -1206,19 +1204,18 @@ int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
>  }
>
>  static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> -                             enum kvm_pgtable_walk_flags flag,
> +                             kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
>                               void * const arg)
>  {
>         struct kvm_pgtable_mm_ops *mm_ops = arg;
> -       kvm_pte_t pte = *ptep;
>
> -       if (!stage2_pte_is_counted(pte))
> +       if (!stage2_pte_is_counted(*old))
>                 return 0;
>
>         mm_ops->put_page(ptep);
>
> -       if (kvm_pte_table(pte, level))
> -               mm_ops->put_page(kvm_pte_follow(pte, mm_ops));
> +       if (kvm_pte_table(*old, level))
> +               mm_ops->put_page(kvm_pte_follow(*old, mm_ops));
>
>         return 0;
>  }
> --
> 2.36.0.rc0.470.gd361397f0d-goog
>
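
As a concrete illustration of the new callback shape (the walker below is
hypothetical, but the pattern matches the converted walkers in the diff): the
walk core takes a READ_ONCE() snapshot of the entry and passes the same
observed value to the visitor, so every decision inside the visitor is made
against one consistent value even if another thread modifies the entry
concurrently; *ptep is only kept around as the location to write through.

static int example_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
			  kvm_pte_t *old, enum kvm_pgtable_walk_flags flag,
			  void * const arg)
{
	/* Base every check on the snapshot; never re-read *ptep here. */
	if (!kvm_pte_valid(*old))
		return 0;

	if (kvm_pte_table(*old, level)) {
		/* Descending into the child table is handled by the walk core. */
		return 0;
	}

	/* ... act on the observed leaf value (*old) ... */
	return 0;
}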

^ permalink raw reply	[flat|nested] 165+ messages in thread


* Re: [RFC PATCH 14/17] KVM: arm64: Punt last page reference to rcu callback for parallel walk
@ 2022-04-21 16:28     ` Ben Gardon
  -1 siblings, 0 replies; 165+ messages in thread
From: Ben Gardon @ 2022-04-21 16:28 UTC (permalink / raw)
  To: Oliver Upton
  Cc: moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson,
	David Matlack

On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
>
> It is possible that a table page remains visible to another thread until
> the next rcu synchronization event. Because of this, we cannot drop the last
> page reference synchronously with the post-order traversal for a parallel
> table walk.
>
> Schedule an rcu callback to clean up the child table page for parallel
> walks.
>
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h |  3 ++
>  arch/arm64/kvm/hyp/pgtable.c         | 24 +++++++++++++--
>  arch/arm64/kvm/mmu.c                 | 44 +++++++++++++++++++++++++++-
>  3 files changed, 67 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 74955aba5918..52e55e00f0ca 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -81,6 +81,8 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
>   * @put_page:                  Decrement the refcount on a page. When the
>   *                             refcount reaches 0 the page is automatically
>   *                             freed.
> + * @free_table:                        Drop the last page reference, possibly in the
> + *                             next RCU sync if doing a shared walk.
>   * @page_count:                        Return the refcount of a page.
>   * @phys_to_virt:              Convert a physical address into a virtual
>   *                             address mapped in the current context.
> @@ -98,6 +100,7 @@ struct kvm_pgtable_mm_ops {
>         void            (*get_page)(void *addr);
>         void            (*put_page)(void *addr);
>         int             (*page_count)(void *addr);
> +       void            (*free_table)(void *addr, bool shared);
>         void*           (*phys_to_virt)(phys_addr_t phys);
>         phys_addr_t     (*virt_to_phys)(void *addr);
>         void            (*dcache_clean_inval_poc)(void *addr, size_t size);
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 121818d4c33e..a9a48edba63b 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -147,12 +147,19 @@ static inline void kvm_pgtable_walk_end(void)
>  {}
>
>  #define kvm_dereference_ptep   rcu_dereference_raw
> +
> +static inline void kvm_pgtable_destroy_barrier(void)
> +{}
> +
>  #else
>  #define kvm_pgtable_walk_begin rcu_read_lock
>
>  #define kvm_pgtable_walk_end   rcu_read_unlock
>
>  #define kvm_dereference_ptep   rcu_dereference
> +
> +#define kvm_pgtable_destroy_barrier    rcu_barrier
> +
>  #endif
>
>  static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte, struct kvm_pgtable_mm_ops *mm_ops)
> @@ -1063,7 +1070,12 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
>                 childp = kvm_pte_follow(*old, mm_ops);
>         }
>
> -       mm_ops->put_page(childp);
> +       /*
> +        * If we do not have exclusive access to the page tables it is possible
> +        * the unlinked table remains visible to another thread until the next
> +        * rcu synchronization.
> +        */
> +       mm_ops->free_table(childp, shared);
>         mm_ops->put_page(ptep);
>
>         return ret;
> @@ -1203,7 +1215,7 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>                                                kvm_granule_size(level));
>
>         if (childp)
> -               mm_ops->put_page(childp);
> +               mm_ops->free_table(childp, shared);
>
>         return 0;
>  }
> @@ -1433,7 +1445,7 @@ static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>         mm_ops->put_page(ptep);
>
>         if (kvm_pte_table(*old, level))
> -               mm_ops->put_page(kvm_pte_follow(*old, mm_ops));
> +               mm_ops->free_table(kvm_pte_follow(*old, mm_ops), shared);
>
>         return 0;
>  }
> @@ -1452,4 +1464,10 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
>         pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level) * PAGE_SIZE;
>         pgt->mm_ops->free_pages_exact(pgt->pgd, pgd_sz);
>         pgt->pgd = NULL;
> +
> +       /*
> +        * Guarantee that all unlinked subtrees associated with the stage2 page
> +        * table have also been freed before returning.
> +        */
> +       kvm_pgtable_destroy_barrier();

Why do we need to wait for in-flight RCU callbacks to finish here?
Is this function only used on VM teardown and we just want to make
sure all the memory is freed, or is something actually depending on
this behavior?
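
For reference, rcu_barrier() waits for all RCU callbacks queued so far via
call_rcu() to finish executing, whereas synchronize_rcu() only waits for a
grace period and does not guarantee the callbacks themselves have run. A
condensed sketch of the pairing in this patch (simplified, error handling and
locking omitted):

	/*
	 * Shared-walk unlink path: the table page may still be in use by
	 * concurrent walkers inside their RCU read-side critical sections,
	 * so its freeing is deferred to a callback.
	 */
	call_rcu(&hdr->rcu_head, stage2_free_page_rcu_cb);

	/*
	 * Teardown path: returns only once every previously queued callback
	 * has completed, so no stage-2 table page outlives
	 * kvm_pgtable_stage2_destroy().
	 */
	rcu_barrier();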

>  }
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index cc6ed6b06ec2..6ecf37009c21 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -98,9 +98,50 @@ static bool kvm_is_device_pfn(unsigned long pfn)
>  static void *stage2_memcache_zalloc_page(void *arg)
>  {
>         struct kvm_mmu_caches *mmu_caches = arg;
> +       struct stage2_page_header *hdr;
> +       void *addr;
>
>         /* Allocated with __GFP_ZERO, so no need to zero */
> -       return kvm_mmu_memory_cache_alloc(&mmu_caches->page_cache);
> +       addr = kvm_mmu_memory_cache_alloc(&mmu_caches->page_cache);
> +       if (!addr)
> +               return NULL;
> +
> +       hdr = kvm_mmu_memory_cache_alloc(&mmu_caches->header_cache);
> +       if (!hdr) {
> +               free_page((unsigned long)addr);
> +               return NULL;
> +       }
> +
> +       hdr->page = virt_to_page(addr);
> +       set_page_private(hdr->page, (unsigned long)hdr);
> +       return addr;
> +}
> +
> +static void stage2_free_page_now(struct stage2_page_header *hdr)
> +{
> +       WARN_ON(page_ref_count(hdr->page) != 1);
> +
> +       __free_page(hdr->page);
> +       kmem_cache_free(stage2_page_header_cache, hdr);
> +}
> +
> +static void stage2_free_page_rcu_cb(struct rcu_head *head)
> +{
> +       struct stage2_page_header *hdr = container_of(head, struct stage2_page_header,
> +                                                     rcu_head);
> +
> +       stage2_free_page_now(hdr);
> +}
> +
> +static void stage2_free_table(void *addr, bool shared)
> +{
> +       struct page *page = virt_to_page(addr);
> +       struct stage2_page_header *hdr = (struct stage2_page_header *)page_private(page);
> +
> +       if (shared)
> +               call_rcu(&hdr->rcu_head, stage2_free_page_rcu_cb);
> +       else
> +               stage2_free_page_now(hdr);
>  }
>
>  static void *kvm_host_zalloc_pages_exact(size_t size)
> @@ -613,6 +654,7 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
>         .free_pages_exact       = free_pages_exact,
>         .get_page               = kvm_host_get_page,
>         .put_page               = kvm_host_put_page,
> +       .free_table             = stage2_free_table,
>         .page_count             = kvm_host_page_count,
>         .phys_to_virt           = kvm_host_va,
>         .virt_to_phys           = kvm_host_pa,
> --
> 2.36.0.rc0.470.gd361397f0d-goog
>
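
The race the commit message describes can be sketched as follows (simplified;
"walker A" and "walker B" stand for two concurrent stage-2 walkers, and the
break-before-make/TLB invalidation details are omitted):

	/* Walker A: shared walk, inside its RCU read-side critical section. */
	kvm_pte_t pte = READ_ONCE(*ptep);		/* observes a table entry */
	kvm_pte_t *childp = kvm_pte_follow(pte, mm_ops);
	/* ... continues traversing through childp ... */

	/* Walker B: unlinks the same table concurrently. */
	kvm_clear_pte(ptep);
	/*
	 * Walker A may still hold childp, so the page cannot be freed
	 * synchronously; it is queued instead:
	 */
	mm_ops->free_table(childp, true /* shared */);	/* -> call_rcu() */

	/*
	 * Only after walker A exits its RCU read-side critical section does
	 * the callback run and the page actually get freed.
	 */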


^ permalink raw reply	[flat|nested] 165+ messages in thread


* Re: [RFC PATCH 00/17] KVM: arm64: Parallelize stage 2 fault handling
  2022-04-19 18:36     ` Oliver Upton
  (?)
@ 2022-04-21 16:30       ` Ben Gardon
  -1 siblings, 0 replies; 165+ messages in thread
From: Ben Gardon @ 2022-04-21 16:30 UTC (permalink / raw)
  To: Oliver Upton
  Cc: moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson,
	David Matlack

On Tue, Apr 19, 2022 at 11:36 AM Oliver Upton <oupton@google.com> wrote:
>
> On Tue, Apr 19, 2022 at 10:57 AM Ben Gardon <bgardon@google.com> wrote:
> >
> > On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
> > >
> > > Presently KVM only takes a read lock for stage 2 faults if it believes
> > > the fault can be fixed by relaxing permissions on a PTE (write unprotect
> > > for dirty logging). Otherwise, stage 2 faults grab the write lock, which
> > > predictably can pile up all the vCPUs in a sufficiently large VM.
> > >
> > > The x86 port of KVM has what it calls the TDP MMU. Basically, it is an
> > > MMU protected by the combination of a read-write lock and RCU, allowing
> > > page walkers to traverse in parallel.
> > >
> > > This series is strongly inspired by the mechanics of the TDP MMU,
> > > making use of RCU to protect parallel walks. Note that the TLB
> > > invalidation mechanics are a bit different between x86 and ARM, so we
> > > need to use the 'break-before-make' sequence to split/collapse a
> > > block/table mapping, respectively.
> > >
> > > Nonetheless, using atomics on the break side allows fault handlers to
> > > acquire exclusive access to a PTE (lets just call it locked). Once the
> > > PTE lock is acquired it is then safe to assume exclusive access.
> > >
> > > Special consideration is required when pruning the page tables in
> > > parallel. Suppose we are collapsing a table into a block. Allowing
> > > parallel faults means that a software walker could be in the middle of
> > > a lower level traversal when the table is unlinked. Table
> > > walkers that prune the paging structures must now 'lock' all descendent
> > > PTEs, effectively asserting exclusive ownership of the substructure
> > > (no other walker can install something to an already locked pte).
> > >
> > > Additionally, for parallel walks we need to punt the freeing of table
> > > pages to the next RCU sync, as there could be multiple observers of the
> > > table until all walkers exit the RCU critical section. For this I
> > > decided to cram an rcu_head into page private data for every table page.
> > > We wind up spending a bit more on table pages now, but lazily allocating
> > > for rcu callbacks probably doesn't make a lot of sense. Not only would
> > > we need a large cache of them (think about installing a level 1 block)
> > > to wire up callbacks on all descendent tables, but we also then need to
> > > spend memory to actually free memory.
> >
> > FWIW we used a similar approach in early versions of the TDP MMU, but
> > instead of page->private used page->lru so that more metadata could be
> > stored in page->private.
> > Ultimately that ended up being too limiting and we decided to switch
> > to just using the associated struct kvm_mmu_page as the list element.
> > I don't know if ARM has an equivalent construct though.
>
> ARM currently doesn't have any metadata it needs to tie with the table
> pages. We just do very basic page reference counting for every valid
> PTE. I was going to link together pages (hence the page header), but
> we actually do not have a functional need for it at the moment. In
> fact, struct page::rcu_head would probably fit the bill and we can
> avoid extra metadata/memory for the time being.

Ah, right! page::rcu_head was the field I was thinking of.

>
> Perhaps best to keep it simple and do the rest when we have a genuine
> need for it.

Completely agree. I'm surprised that ARM doesn't have a need for a
metadata structure associated with each page of the stage 2 paging
structure, but if you don't need it, that definitely makes things
simpler.
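
As a rough illustration only (not code from the series), reusing the
free_table() hook from patch 14 but pointing call_rcu() at the table
page's own rcu_head could look something like this:

static void stage2_free_page_rcu_cb(struct rcu_head *head)
{
        struct page *page = container_of(head, struct page, rcu_head);

        /* No walker can reach the table anymore; drop the final reference. */
        __free_page(page);
}

static void stage2_free_table(void *addr, bool shared)
{
        struct page *page = virt_to_page(addr);

        /* Defer the free until all walkers leave the RCU read side. */
        if (shared)
                call_rcu(&page->rcu_head, stage2_free_page_rcu_cb);
        else
                __free_page(page);
}

No separate header allocation or page_private() bookkeeping would be
needed in that case.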

>
> --
> Thanks,
> Oliver

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 16/17] KVM: arm64: Enable parallel stage 2 MMU faults
  2022-04-15 21:59   ` Oliver Upton
  (?)
@ 2022-04-21 16:35     ` Ben Gardon
  -1 siblings, 0 replies; 165+ messages in thread
From: Ben Gardon @ 2022-04-21 16:35 UTC (permalink / raw)
  To: Oliver Upton
  Cc: moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson,
	David Matlack

On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
>
> Voila! Since the map walkers are able to work in parallel there is no
> need to take the write lock on a stage 2 memory abort. Relax locking
> on map operations and cross fingers we got it right.

Might be worth a healthy sprinkle of lockdep on the functions taking
"shared" as an argument, just to make sure the wrong value isn't going
down a callstack you didn't expect.
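
A sketch of the kind of annotation being suggested, similar in spirit
to the TDP MMU's kvm_lockdep_assert_mmu_lock_held() -- the helper name
is made up, and it assumes the arch-side callers in mmu.c have the
struct kvm in hand, since the generic pgtable walkers do not:

static void stage2_assert_mmu_lock_held(struct kvm *kvm, bool shared)
{
        /* Shared walks run under the read lock, exclusive ones under write. */
        if (shared)
                lockdep_assert_held_read(&kvm->mmu_lock);
        else
                lockdep_assert_held_write(&kvm->mmu_lock);
}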

>
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>  arch/arm64/kvm/mmu.c | 21 +++------------------
>  1 file changed, 3 insertions(+), 18 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 63cf18cdb978..2881051c3743 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1127,7 +1127,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>         gfn_t gfn;
>         kvm_pfn_t pfn;
>         bool logging_active = memslot_is_logging(memslot);
> -       bool use_read_lock = false;
>         unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
>         unsigned long vma_pagesize, fault_granule;
>         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> @@ -1162,8 +1161,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>         if (logging_active) {
>                 force_pte = true;
>                 vma_shift = PAGE_SHIFT;
> -               use_read_lock = (fault_status == FSC_PERM && write_fault &&
> -                                fault_granule == PAGE_SIZE);
>         } else {
>                 vma_shift = get_vma_page_shift(vma, hva);
>         }
> @@ -1267,15 +1264,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>         if (exec_fault && device)
>                 return -ENOEXEC;
>
> -       /*
> -        * To reduce MMU contentions and enhance concurrency during dirty
> -        * logging dirty logging, only acquire read lock for permission
> -        * relaxation.
> -        */
> -       if (use_read_lock)
> -               read_lock(&kvm->mmu_lock);
> -       else
> -               write_lock(&kvm->mmu_lock);
> +       read_lock(&kvm->mmu_lock);
> +

Ugh, I wish we could get rid of the analogous ugly block on x86.

>         pgt = vcpu->arch.hw_mmu->pgt;
>         if (mmu_notifier_retry(kvm, mmu_seq))
>                 goto out_unlock;
> @@ -1322,8 +1312,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>         if (fault_status == FSC_PERM && vma_pagesize == fault_granule) {
>                 ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
>         } else {
> -               WARN_ONCE(use_read_lock, "Attempted stage-2 map outside of write lock\n");
> -
>                 ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
>                                              __pfn_to_phys(pfn), prot,
>                                              mmu_caches, true);
> @@ -1336,10 +1324,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>         }
>
>  out_unlock:
> -       if (use_read_lock)
> -               read_unlock(&kvm->mmu_lock);
> -       else
> -               write_unlock(&kvm->mmu_lock);
> +       read_unlock(&kvm->mmu_lock);
>         kvm_set_pfn_accessed(pfn);
>         kvm_release_pfn_clean(pfn);
>         return ret != -EAGAIN ? ret : 0;
> --
> 2.36.0.rc0.470.gd361397f0d-goog
>

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 00/17] KVM: arm64: Parallelize stage 2 fault handling
  2022-04-21 16:30       ` Ben Gardon
  (?)
@ 2022-04-21 16:37         ` Paolo Bonzini
  -1 siblings, 0 replies; 165+ messages in thread
From: Paolo Bonzini @ 2022-04-21 16:37 UTC (permalink / raw)
  To: Ben Gardon, Oliver Upton
  Cc: moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Sean Christopherson, David Matlack

On 4/21/22 18:30, Ben Gardon wrote:
> Completely agree. I'm surprised that ARM doesn't have a need for a
> metadata structure associated with each page of the stage 2 paging
> structure, but if you don't need it, that definitely makes things
> simpler.

The uses of struct kvm_mmu_page in the TDP MMU are all relatively new, 
for the work_struct and the roots' reference count.  sp->ptep is only 
used in a very specific path, kvm_recover_nx_lpages.

I wouldn't be surprised if ARM grows more metadata later, but in fact 
it's not _that_ surprising that it doesn't need it yet!

Paolo


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 09/17] KVM: arm64: Tear down unlinked page tables in parallel walk
  2022-04-21 13:21     ` Quentin Perret
  (?)
@ 2022-04-21 16:40       ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-21 16:40 UTC (permalink / raw)
  To: Quentin Perret
  Cc: kvmarm, kvm, Marc Zyngier, Ben Gardon, Peter Shier,
	David Matlack, Paolo Bonzini, linux-arm-kernel

On Thu, Apr 21, 2022 at 01:21:54PM +0000, Quentin Perret wrote:
> Hi Oliver,
> 
> On Friday 15 Apr 2022 at 21:58:53 (+0000), Oliver Upton wrote:
> > Breaking a table pte is insufficient to guarantee ownership of an
> > unlinked subtree. Parallel software walkers could be traversing
> > substructures and changing their mappings.
> > 
> > Recurse through the unlinked subtree and lock all descendent ptes
> > to take ownership of the subtree. Since the ptes are actually being
> > evicted, return table ptes back to the table walker to ensure child
> > tables are also traversed. Note that this is done in both the
> > pre-order and leaf visitors as the underlying pte remains volatile until
> > it is unlinked.
> 
> Still trying to get the full picture of the series so bear with me. IIUC
> the case you're dealing with here is when we're coalescing a table into
> a block with concurrent walkers making changes in the sub-tree. I
> believe this should happen when turning dirty logging off?

Yup, I think that's the only time we wind up collapsing tables.

> Why do we need to recursively lock the entire sub-tree at all in this
> case? As long as the table is turned into a locked invalid PTE, what
> concurrent walkers are doing in the sub-tree should be irrelevant no?
> None of the changes they do will be made visible to the hardware anyway.
> So as long as the sub-tree isn't freed under their feet (which should be
> the point of the RCU protection) this should be all fine? Is there a
> case where this is not actually true?

The problem arises when you're trying to actually free an unlinked
subtree. All bets are off until the next RCU grace period. What would
stop another software walker from installing a table to a PTE that I've
already visited? I think we'd wind up leaking a table page in this case
as the walker doing the table collapse assumes it has successfully freed
everything underneath.

The other option would be to not touch the subtree at all until the rcu
callback, as at that point software will not tweak the tables any more.
No need for atomics/spinning and can just do a boring traversal. Of
course, I lazily avoided this option because it would be a bit more code
but isn't too awfully complicated.
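
For the record, a very rough sketch of the locking approach outside the
walker machinery the series actually uses (the helper and the
KVM_INVALID_PTE_LOCKED encoding are made-up names, standing in for a
reserved invalid-PTE value that no other walker will install over):

static void stage2_lock_unlinked_table(kvm_pte_t *table, u32 level,
                                       struct kvm_pgtable_mm_ops *mm_ops)
{
        u32 idx;

        for (idx = 0; idx < PTRS_PER_PTE; idx++) {
                /* Take exclusive ownership of the slot. */
                kvm_pte_t old = xchg(&table[idx], KVM_INVALID_PTE_LOCKED);

                /* Recurse so child tables cannot be repopulated either. */
                if (kvm_pte_table(old, level))
                        stage2_lock_unlinked_table(kvm_pte_follow(old, mm_ops),
                                                   level + 1, mm_ops);
        }
}

The series expresses the same idea through the walker's pre-order and
leaf visitors rather than open-coded recursion.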

Does this paint a better picture, or have I only managed to confuse even
more? :)

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 00/17] KVM: arm64: Parallelize stage 2 fault handling
  2022-04-16  0:04     ` Oliver Upton
  (?)
@ 2022-04-21 16:43       ` David Matlack
  -1 siblings, 0 replies; 165+ messages in thread
From: David Matlack @ 2022-04-21 16:43 UTC (permalink / raw)
  To: Oliver Upton
  Cc: KVMARM, kvm list, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon

On Fri, Apr 15, 2022 at 5:04 PM Oliver Upton <oupton@google.com> wrote:
>
> On Fri, Apr 15, 2022 at 04:35:24PM -0700, David Matlack wrote:
> > On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
> > >
> > > Presently KVM only takes a read lock for stage 2 faults if it believes
> > > the fault can be fixed by relaxing permissions on a PTE (write unprotect
> > > for dirty logging). Otherwise, stage 2 faults grab the write lock, which
> > > predictably can pile up all the vCPUs in a sufficiently large VM.
> > >
> > > The x86 port of KVM has what it calls the TDP MMU. Basically, it is an
> > > MMU protected by the combination of a read-write lock and RCU, allowing
> > > page walkers to traverse in parallel.
> > >
> > > This series is strongly inspired by the mechanics of the TDP MMU,
> > > making use of RCU to protect parallel walks. Note that the TLB
> > > invalidation mechanics are a bit different between x86 and ARM, so we
> > > need to use the 'break-before-make' sequence to split/collapse a
> > > block/table mapping, respectively.
> >
> > An alternative (or perhaps "v2" [1]) is to make x86's TDP MMU
> > arch-neutral and port it to support ARM's stage-2 MMU. This is based
> > on a few observations:
> >
> > - The problems that motivated the development of the TDP MMU are not
> > x86-specific (e.g. parallelizing faults during the post-copy phase of
> > Live Migration).
> > - The synchronization in the TDP MMU (read/write lock, RCU for PT
> > freeing, atomic compare-exchanges for modifying PTEs) is complex, but
> > would be equivalent across architectures.
> > - Eventually RISC-V is going to want similar performance (my
> > understanding is RISC-V MMU is already a copy-paste of the ARM MMU),
> > and it'd be a shame to re-implement TDP MMU synchronization a third
> > time.
> > - The TDP MMU includes support for various performance features that
> > would benefit other architectures, such as eager page splitting,
> > deferred zapping, lockless write-protection resolution, and (coming
> > soon) in-place huge page promotion.
> > - And then there's the obvious wins from less code duplication in KVM
> > (e.g. get rid of the RISC-V MMU copy, increased code test coverage,
> > ...).
>
> I definitely agree with the observation -- we're all trying to solve the
> same set of issues. And I completely agree that a good long term goal
> would be to create some common parts for all architectures. Less work
> for us ARM folks it would seem ;-)
>
> What's top of mind is how we paper over the architectural differences
> between all of the architectures, especially when we need to do entirely
> different things because of the arch.
>
> For example, I whine about break-before-make a lot throughout this
> series which is somewhat unique to ARM. I don't think we can do eager
> page splitting on the base architecture w/o doing the TLBI for every
> block. Not only that, we can't do a direct valid->valid change without
> first making an invalid PTE visible to hardware. Things get even more
> exciting when hardware revisions relax break-before-make requirements.

Gotcha, so porting the TDP MMU to ARM would require adding
break-before-make support. That seems feasible and we could guard it
behind e.g. a static_key so there is no runtime overhead for
architectures (or ARM hardware revisions) that do not require it.
Anything else come to mind as major architectural differences?
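
For readers following along, the break-before-make sequence under
discussion is roughly the following (illustrative only -- the helper
name is invented, and the locked-PTE handling and error paths from the
series are elided):

static void stage2_replace_entry(struct kvm_s2_mmu *mmu, kvm_pte_t *ptep,
                                 kvm_pte_t new, u64 addr, u32 level)
{
        /* Break: make the entry invalid so no new translations form. */
        WRITE_ONCE(*ptep, 0);

        /*
         * Flush any cached translations for the old entry; the flush
         * helper provides the barriers ordering it against the write
         * above.
         */
        kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, addr, level);

        /* Make: only now may a new valid entry be installed. */
        smp_store_release(ptep, new);
}

Guarding the break half behind something like a static_key, as
suggested above, would let hardware that relaxes the requirement
(e.g. FEAT_BBM) go straight to installing the new entry.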

>
> There's also significant architectural differences between KVM on x86
> and KVM for ARM. Our paging code runs both in the host kernel and the
> hyp/lowvisor, and does:
>
>  - VM two dimensional paging (stage 2 MMU)
>  - Hyp's own MMU (stage 1 MMU)
>  - Host kernel isolation (stage 2 MMU)
>
> each with its own quirks. The 'not exactly in the kernel' part will make
> instrumentation a bit of a hassle too.

Ah, interesting. It'd probably make sense to start with the VM
2-dimensional paging use-case and leave the other use-cases using the
existing MMU, and then investigate transitioning the other use-cases.
Similarly in x86 we still have the legacy MMU for shadow paging (e.g.
hosts with no stage-2 hardware, and nested virtualization).

>
> None of this is meant to disagree with you in the slightest. I firmly
> agree we need to share as many parts between the architectures as
> possible. I'm just trying to call out a few of the things relating to
> ARM that will make this annoying so that way whoever embarks on the
> adventure will see it.
>
> > The side of this I haven't really looked into yet is ARM's stage-2
> > MMU, and how amenable it would be to being managed by the TDP MMU. But
> > I assume it's a conventional page table structure mapping GPAs to
> > HPAs, which is the most important overlap.
> >
> > That all being said, an arch-neutral TDP MMU would be a larger, more
> > complex code change than something like this series (hence my "v2"
> > caveat above). But I wanted to get this idea out there since the
> > rubber is starting to hit the road on improving ARM MMU scalability.
>
> All for it. I cc'ed you on the series for this exact reason, I wanted to
> grab your attention to spark the conversation :)
>
> --
> Thanks,
> Oliver

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 00/17] KVM: arm64: Parallelize stage 2 fault handling
@ 2022-04-21 16:43       ` David Matlack
  0 siblings, 0 replies; 165+ messages in thread
From: David Matlack @ 2022-04-21 16:43 UTC (permalink / raw)
  To: Oliver Upton
  Cc: KVMARM, kvm list, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson, Ben Gardon

On Fri, Apr 15, 2022 at 5:04 PM Oliver Upton <oupton@google.com> wrote:
>
> On Fri, Apr 15, 2022 at 04:35:24PM -0700, David Matlack wrote:
> > On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
> > >
> > > Presently KVM only takes a read lock for stage 2 faults if it believes
> > > the fault can be fixed by relaxing permissions on a PTE (write unprotect
> > > for dirty logging). Otherwise, stage 2 faults grab the write lock, which
> > > predictably can pile up all the vCPUs in a sufficiently large VM.
> > >
> > > The x86 port of KVM has what it calls the TDP MMU. Basically, it is an
> > > MMU protected by the combination of a read-write lock and RCU, allowing
> > > page walkers to traverse in parallel.
> > >
> > > This series is strongly inspired by the mechanics of the TDP MMU,
> > > making use of RCU to protect parallel walks. Note that the TLB
> > > invalidation mechanics are a bit different between x86 and ARM, so we
> > > need to use the 'break-before-make' sequence to split/collapse a
> > > block/table mapping, respectively.
> >
> > An alternative (or perhaps "v2" [1]) is to make x86's TDP MMU
> > arch-neutral and port it to support ARM's stage-2 MMU. This is based
> > on a few observations:
> >
> > - The problems that motivated the development of the TDP MMU are not
> > x86-specific (e.g. parallelizing faults during the post-copy phase of
> > Live Migration).
> > - The synchronization in the TDP MMU (read/write lock, RCU for PT
> > freeing, atomic compare-exchanges for modifying PTEs) is complex, but
> > would be equivalent across architectures.
> > - Eventually RISC-V is going to want similar performance (my
> > understanding is RISC-V MMU is already a copy-paste of the ARM MMU),
> > and it'd be a shame to re-implement TDP MMU synchronization a third
> > time.
> > - The TDP MMU includes support for various performance features that
> > would benefit other architectures, such as eager page splitting,
> > deferred zapping, lockless write-protection resolution, and (coming
> > soon) in-place huge page promotion.
> > - And then there's the obvious wins from less code duplication in KVM
> > (e.g. get rid of the RISC-V MMU copy, increased code test coverage,
> > ...).
>
> I definitely agree with the observation -- we're all trying to solve the
> same set of issues. And I completely agree that a good long term goal
> would be to create some common parts for all architectures. Less work
> for us ARM folks it would seem ;-)
>
> What's top of mind is how we paper over the architectural differences
> between all of the architectures, especially when we need to do entirely
> different things because of the arch.
>
> For example, I whine about break-before-make a lot throughout this
> series which is somewhat unique to ARM. I don't think we can do eager
> page splitting on the base architecture w/o doing the TLBI for every
> block. Not only that, we can't do a direct valid->valid change without
> first making an invalid PTE visible to hardware. Things get even more
> exciting when hardware revisions relax break-before-make requirements.

Gotcha, so porting the TDP MMU to ARM would require adding
break-before-make support. That seems feasible and we could guard it
behind a e.g. static_key so there is no runtime overhead for
architectures (or ARM hardware revisions) that do not require it.
Anything else come to mind as major architectural differences?

 >
> There's also significant architectural differences between KVM on x86
> and KVM for ARM. Our paging code runs both in the host kernel and the
> hyp/lowvisor, and does:
>
>  - VM two dimensional paging (stage 2 MMU)
>  - Hyp's own MMU (stage 1 MMU)
>  - Host kernel isolation (stage 2 MMU)
>
> each with its own quirks. The 'not exactly in the kernel' part will make
> instrumentation a bit of a hassle too.

Ah, interesting. It'd probably make sense to start with the VM
two-dimensional paging use-case, leave the other use-cases on the
existing MMU, and then investigate transitioning them later.
Similarly, on x86 we still have the legacy MMU for shadow paging (e.g.
hosts without stage-2 hardware, and nested virtualization).

>
> None of this is meant to disagree with you in the slightest. I firmly
> agree we need to share as many parts between the architectures as
> possible. I'm just trying to call out a few of the things relating to
> ARM that will make this annoying so that way whoever embarks on the
> adventure will see it.
>
> > The side of this I haven't really looked into yet is ARM's stage-2
> > MMU, and how amenable it would be to being managed by the TDP MMU. But
> > I assume it's a conventional page table structure mapping GPAs to
> > HPAs, which is the most important overlap.
> >
> > That all being said, an arch-neutral TDP MMU would be a larger, more
> > complex code change than something like this series (hence my "v2"
> > caveat above). But I wanted to get this idea out there since the
> > rubber is starting to hit the road on improving ARM MMU scalability.
>
> All for it. I cc'ed you on the series for this exact reason, I wanted to
> grab your attention to spark the conversation :)
>
> --
> Thanks,
> Oliver

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 16/17] KVM: arm64: Enable parallel stage 2 MMU faults
  2022-04-21 16:35     ` Ben Gardon
  (?)
@ 2022-04-21 16:46       ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-21 16:46 UTC (permalink / raw)
  To: Ben Gardon
  Cc: moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson,
	David Matlack

On Thu, Apr 21, 2022 at 09:35:27AM -0700, Ben Gardon wrote:
> On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
> >
> > Voila! Since the map walkers are able to work in parallel there is no
> > need to take the write lock on a stage 2 memory abort. Relax locking
> > on map operations and cross fingers we got it right.
> 
> Might be worth a healthy sprinkle of lockdep on the functions taking
> "shared" as an argument, just to make sure the wrong value isn't going
> down a callstack you didn't expect.

If we're going to go this route we might need to just punch a pointer
to the vCPU through to the stage 2 table walker. All of this plumbing is
built around the idea that there are multiple tables to manage and
needn't be in the context of a vCPU/VM, which is why I went the WARN()
route instead of better lockdep assertions.
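
For what it's worth, if the walker did have a struct kvm (or vCPU) in
hand, the check could be as simple as the sketch below; this helper is
purely illustrative and is not part of the series:

static void stage2_assert_walker_locking(struct kvm *kvm, bool shared)
{
	if (shared)
		lockdep_assert_held_read(&kvm->mmu_lock);
	else
		lockdep_assert_held_write(&kvm->mmu_lock);
}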

> >
> > Signed-off-by: Oliver Upton <oupton@google.com>
> > ---
> >  arch/arm64/kvm/mmu.c | 21 +++------------------
> >  1 file changed, 3 insertions(+), 18 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 63cf18cdb978..2881051c3743 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1127,7 +1127,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >         gfn_t gfn;
> >         kvm_pfn_t pfn;
> >         bool logging_active = memslot_is_logging(memslot);
> > -       bool use_read_lock = false;
> >         unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
> >         unsigned long vma_pagesize, fault_granule;
> >         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > @@ -1162,8 +1161,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >         if (logging_active) {
> >                 force_pte = true;
> >                 vma_shift = PAGE_SHIFT;
> > -               use_read_lock = (fault_status == FSC_PERM && write_fault &&
> > -                                fault_granule == PAGE_SIZE);
> >         } else {
> >                 vma_shift = get_vma_page_shift(vma, hva);
> >         }
> > @@ -1267,15 +1264,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >         if (exec_fault && device)
> >                 return -ENOEXEC;
> >
> > -       /*
> > -        * To reduce MMU contentions and enhance concurrency during dirty
> > -        * logging dirty logging, only acquire read lock for permission
> > -        * relaxation.
> > -        */
> > -       if (use_read_lock)
> > -               read_lock(&kvm->mmu_lock);
> > -       else
> > -               write_lock(&kvm->mmu_lock);
> > +       read_lock(&kvm->mmu_lock);
> > +
> 
> Ugh, I wish we could get rid of the analogous ugly block on x86.

Maybe we could fold it into an MMU macro in the arch-generic scope?
Conditional locking is smelly; I was very pleased to delete these lines :)
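
For concreteness, the kind of helper meant here might look like the
sketch below (hypothetical; KVM has no such arch-generic helper today,
and x86 configs where mmu_lock is a spinlock would need more care):

/* Hypothetical arch-generic helpers, illustrative only. */
static inline void kvm_mmu_lock_shared(struct kvm *kvm, bool shared)
{
	if (shared)
		read_lock(&kvm->mmu_lock);
	else
		write_lock(&kvm->mmu_lock);
}

static inline void kvm_mmu_unlock_shared(struct kvm *kvm, bool shared)
{
	if (shared)
		read_unlock(&kvm->mmu_lock);
	else
		write_unlock(&kvm->mmu_lock);
}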

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 06/17] KVM: arm64: Implement break-before-make sequence for parallel walks
  2022-04-15 21:58   ` Oliver Upton
  (?)
@ 2022-04-21 16:57     ` Ben Gardon
  -1 siblings, 0 replies; 165+ messages in thread
From: Ben Gardon @ 2022-04-21 16:57 UTC (permalink / raw)
  To: Oliver Upton
  Cc: moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson,
	David Matlack

On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
>
> The ARM architecture requires that software use the 'break-before-make'
> sequence whenever memory is being remapped. An additional requirement of
> parallel page walks is a mechanism to ensure exclusive access to a pte,
> thereby avoiding two threads changing the pte and invariably stomping on
> one another.
>
> Roll the two concepts together into a new helper to implement the
> 'break' sequence. Use a special invalid pte value to indicate that the
> pte is under the exclusive control of a thread. If software walkers are
> traversing the tables in parallel, use an atomic compare-exchange to
> break the pte. Retry execution on a failed attempt to break the pte, in
> the hopes that either the instruction will succeed or the pte lock will
> be successfully acquired.
>
> Avoid unnecessary DSBs and TLBIs by only completing the sequence if the
> evicted pte was valid. For counted non-table ptes drop the reference
> immediately. Otherwise, references on tables are dropped in post-order
> traversal as the walker must recurse on the pruned subtree.
>
> All of the new atomics do nothing (for now), as there are a few other
> bits of the map walker that need to be addressed before actually walking
> in parallel.

I want to make sure I understand the break-before-make / PTE locking
patterns here.
Please check my understanding of the following cases:

Case 1: Change a leaf PTE (for some reason)
1. Traverse the page table to the leaf
2. Invalidate the leaf PTE, replacing it with a locked PTE
3. Flush TLBs
4. Replace the locked PTE with the new value

In this case, no need to lock the parent SPTEs, right? This is pretty simple.

Case 2:  Drop a page table
1. Traverse to some non-leaf PTE
2. Lock the PTE, invalidating it
3. Recurse into the child page table
4. Lock the PTEs in the child page table. (We need to lock ALL the
PTEs here right? I don't think we'd get away with locking only the
valid ones)
5. Flush TLBs
6. Unlock the PTE from 2
7. Free the child page after an RCU grace period (via callback)

Case 3: Drop a range of leaf PTEs
1. Traverse the page table to the first leaf
2. For each leaf in the range:
        a. Invalidate the leaf PTE, replacing it with a locked PTE
3. Flush TLBs
4. unlock the locked PTEs

In this case we have to lock ALL PTEs in the range too, right? My
worry about the whole locking scheme is making sure each thread
correctly remembers which PTEs it locked versus which might have been
locked by other threads.
On x86 we solved this by only locking one SPTE at a time, flushing,
then fixing it, but if you're locking a bunch at once it might get
complicated.
Making this locking scheme work without demolishing performance seems hard.
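
To make Case 1 concrete, here is a minimal sketch built on the helpers
introduced by the patch quoted below (the wrapper itself is
illustrative only; it assumes the walker has already traversed to the
leaf, and that 'shared' and 'data' come from the surrounding walker
context):

static int stage2_change_leaf_pte(kvm_pte_t *ptep, kvm_pte_t old, kvm_pte_t new,
				  u64 addr, u32 level, bool shared,
				  struct stage2_map_data *data)
{
	/* Steps 2-3: lock the pte; the helper does the DSB + TLBI if 'old' was valid. */
	if (!stage2_try_break_pte(ptep, old, addr, level, shared, data))
		return -EAGAIN;

	/* Step 4: publish the new value (takes a reference if 'new' is counted). */
	stage2_make_pte(ptep, new, data->mm_ops);
	return 0;
}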

>
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>  arch/arm64/kvm/hyp/pgtable.c | 172 +++++++++++++++++++++++++++++------
>  1 file changed, 146 insertions(+), 26 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index bf46d6d24951..059ebb921125 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -49,6 +49,12 @@
>  #define KVM_INVALID_PTE_OWNER_MASK     GENMASK(9, 2)
>  #define KVM_MAX_OWNER_ID               1
>
> +/*
> + * Used to indicate a pte for which a 'make-before-break' sequence is in
> + * progress.
> + */
> +#define KVM_INVALID_PTE_LOCKED         BIT(10)
> +
>  struct kvm_pgtable_walk_data {
>         struct kvm_pgtable              *pgt;
>         struct kvm_pgtable_walker       *walker;
> @@ -707,6 +713,122 @@ static bool stage2_pte_is_counted(kvm_pte_t pte)
>         return kvm_pte_valid(pte) || kvm_invalid_pte_owner(pte);
>  }
>
> +static bool stage2_pte_is_locked(kvm_pte_t pte)
> +{
> +       return !kvm_pte_valid(pte) && (pte & KVM_INVALID_PTE_LOCKED);
> +}
> +
> +static inline bool kvm_try_set_pte(kvm_pte_t *ptep, kvm_pte_t old, kvm_pte_t new, bool shared)
> +{
> +       if (!shared) {
> +               WRITE_ONCE(*ptep, new);
> +               return true;
> +       }
> +
> +       return cmpxchg(ptep, old, new) == old;
> +}
> +
> +/**
> + * stage2_try_break_pte() - Invalidates a pte according to the
> + *                         'break-before-make' sequence.
> + *
> + * @ptep: Pointer to the pte to break
> + * @old: The previously observed value of the pte; used for compare-exchange in
> + *      a parallel walk
> + * @addr: IPA corresponding to the pte
> + * @level: Table level of the pte
> + * @shared: true if the tables are shared by multiple software walkers
> + * @data: pointer to the map walker data
> + *
> + * Returns: true if the pte was successfully broken.
> + *
> + * If the removed pt was valid, performs the necessary DSB and TLB flush for
> + * the old value. Drops references to the page table if a non-table entry was
> + * removed. Otherwise, the table reference is preserved as the walker must also
> + * recurse through the child tables.
> + *
> + * See ARM DDI0487G.a D5.10.1 "General TLB maintenance requirements" for details
> + * on the 'break-before-make' sequence.
> + */
> +static bool stage2_try_break_pte(kvm_pte_t *ptep, kvm_pte_t old, u64 addr, u32 level, bool shared,
> +                                struct stage2_map_data *data)
> +{
> +       /*
> +        * Another thread could have already visited this pte and taken
> +        * ownership.
> +        */
> +       if (stage2_pte_is_locked(old)) {
> +               /*
> +                * If the table walker has exclusive access to the page tables
> +                * then no other software walkers should have locked the pte.
> +                */
> +               WARN_ON(!shared);
> +               return false;
> +       }
> +
> +       if (!kvm_try_set_pte(ptep, old, KVM_INVALID_PTE_LOCKED, shared))
> +               return false;
> +
> +       /*
> +        * If we removed a valid pte, break-then-make rules are in effect as a
> +        * translation may have been cached that traversed this entry.
> +        */
> +       if (kvm_pte_valid(old)) {
> +               dsb(ishst);
> +
> +               if (kvm_pte_table(old, level))
> +                       /*
> +                        * Invalidate the whole stage-2, as we may have numerous leaf
> +                        * entries below us which would otherwise need invalidating
> +                        * individually.
> +                        */
> +                       kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
> +               else
> +                       kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
> +       }
> +
> +       /*
> +        * Don't drop the reference on table entries yet, as the walker must
> +        * first recurse on the unlinked subtree to unlink and drop references
> +        * to child tables.
> +        */
> +       if (!kvm_pte_table(old, level) && stage2_pte_is_counted(old))
> +               data->mm_ops->put_page(ptep);
> +
> +       return true;
> +}
> +
> +/**
> + * stage2_make_pte() - Installs a new pte according to the 'break-before-make'
> + *                    sequence.
> + *
> + * @ptep: pointer to the pte to make
> + * @new: new pte value to install
> + *
> + * Assumes that the pte addressed by ptep has already been broken and is under
> + * the ownership of the table walker. If the new pte to be installed is a valid
> + * entry, perform a DSB to make the write visible. Raise the reference count on
> + * the table if the new pte requires a reference.
> + *
> + * See ARM DDI0487G.a D5.10.1 "General TLB maintenance requirements" for details
> + * on the 'break-before-make' sequence.
> + */
> +static void stage2_make_pte(kvm_pte_t *ptep, kvm_pte_t new, struct kvm_pgtable_mm_ops *mm_ops)
> +{
> +       /* Yikes! We really shouldn't install to an entry we don't own. */
> +       WARN_ON(!stage2_pte_is_locked(*ptep));
> +
> +       if (stage2_pte_is_counted(new))
> +               mm_ops->get_page(ptep);
> +
> +       if (kvm_pte_valid(new)) {
> +               WRITE_ONCE(*ptep, new);
> +               dsb(ishst);
> +       } else {
> +               smp_store_release(ptep, new);
> +       }
> +}
> +
>  static void stage2_put_pte(kvm_pte_t *ptep, struct kvm_s2_mmu *mmu, u64 addr,
>                            u32 level, struct kvm_pgtable_mm_ops *mm_ops)
>  {
> @@ -760,18 +882,17 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>         else
>                 new = kvm_init_invalid_leaf_owner(data->owner_id);
>
> -       if (stage2_pte_is_counted(old)) {
> -               /*
> -                * Skip updating the PTE if we are trying to recreate the exact
> -                * same mapping or only change the access permissions. Instead,
> -                * the vCPU will exit one more time from guest if still needed
> -                * and then go through the path of relaxing permissions.
> -                */
> -               if (!stage2_pte_needs_update(old, new))
> -                       return -EAGAIN;
> +       /*
> +        * Skip updating the PTE if we are trying to recreate the exact same
> +        * mapping or only change the access permissions. Instead, the vCPU will
> +        * exit one more time from the guest if still needed and then go through
> +        * the path of relaxing permissions.
> +        */
> +       if (!stage2_pte_needs_update(old, new))
> +               return -EAGAIN;
>
> -               stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
> -       }
> +       if (!stage2_try_break_pte(ptep, old, addr, level, shared, data))
> +               return -EAGAIN;
>
>         /* Perform CMOs before installation of the guest stage-2 PTE */
>         if (mm_ops->dcache_clean_inval_poc && stage2_pte_cacheable(pgt, new))
> @@ -781,9 +902,7 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>         if (mm_ops->icache_inval_pou && stage2_pte_executable(new))
>                 mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule);
>
> -       smp_store_release(ptep, new);
> -       if (stage2_pte_is_counted(new))
> -               mm_ops->get_page(ptep);
> +       stage2_make_pte(ptep, new, data->mm_ops);
>         if (kvm_phys_is_valid(phys))
>                 data->phys += granule;
>         return 0;
> @@ -800,15 +919,10 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>         if (!stage2_leaf_mapping_allowed(addr, end, level, data))
>                 return 0;
>
> -       data->childp = kvm_pte_follow(*old, data->mm_ops);
> -       kvm_clear_pte(ptep);
> +       if (!stage2_try_break_pte(ptep, *old, addr, level, shared, data))
> +               return -EAGAIN;
>
> -       /*
> -        * Invalidate the whole stage-2, as we may have numerous leaf
> -        * entries below us which would otherwise need invalidating
> -        * individually.
> -        */
> -       kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
> +       data->childp = kvm_pte_follow(*old, data->mm_ops);
>         data->anchor = ptep;
>         return 0;
>  }
> @@ -837,18 +951,24 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>         if (!data->memcache)
>                 return -ENOMEM;
>
> +       if (!stage2_try_break_pte(ptep, *old, addr, level, shared, data))
> +               return -EAGAIN;
> +
>         childp = mm_ops->zalloc_page(data->memcache);
> -       if (!childp)
> +       if (!childp) {
> +               /*
> +                * Release the pte if we were unable to install a table to allow
> +                * another thread to make an attempt.
> +                */
> +               stage2_make_pte(ptep, 0, data->mm_ops);
>                 return -ENOMEM;
> +       }
>
>         /*
>          * If we've run into an existing block mapping then replace it with
>          * a table. Accesses beyond 'end' that fall within the new table
>          * will be mapped lazily.
>          */
> -       if (stage2_pte_is_counted(*old))
> -               stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
> -
>         kvm_set_table_pte(ptep, childp, mm_ops);
>         mm_ops->get_page(ptep);
>         *old = *ptep;
> --
> 2.36.0.rc0.470.gd361397f0d-goog
>

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 16/17] KVM: arm64: Enable parallel stage 2 MMU faults
  2022-04-21 16:46       ` Oliver Upton
  (?)
@ 2022-04-21 17:03         ` Ben Gardon
  -1 siblings, 0 replies; 165+ messages in thread
From: Ben Gardon @ 2022-04-21 17:03 UTC (permalink / raw)
  To: Oliver Upton
  Cc: moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson,
	David Matlack

On Thu, Apr 21, 2022 at 9:46 AM Oliver Upton <oupton@google.com> wrote:
>
> On Thu, Apr 21, 2022 at 09:35:27AM -0700, Ben Gardon wrote:
> > On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
> > >
> > > Voila! Since the map walkers are able to work in parallel there is no
> > > need to take the write lock on a stage 2 memory abort. Relax locking
> > > on map operations and cross fingers we got it right.
> >
> > Might be worth a healthy sprinkle of lockdep on the functions taking
> > "shared" as an argument, just to make sure the wrong value isn't going
> > down a callstack you didn't expect.
>
> If we're going to go this route we might need to just punch a pointer
> to the vCPU through to the stage 2 table walker. All of this plumbing is
> built around the idea that there are multiple tables to manage and
> needn't be in the context of a vCPU/VM, which is why I went the WARN()
> route instead of better lockdep assertions.

Oh right, it didn't even occur to me that those functions wouldn't
have a vCPU / KVM pointer.

>
> > >
> > > Signed-off-by: Oliver Upton <oupton@google.com>
> > > ---
> > >  arch/arm64/kvm/mmu.c | 21 +++------------------
> > >  1 file changed, 3 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index 63cf18cdb978..2881051c3743 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -1127,7 +1127,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >         gfn_t gfn;
> > >         kvm_pfn_t pfn;
> > >         bool logging_active = memslot_is_logging(memslot);
> > > -       bool use_read_lock = false;
> > >         unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
> > >         unsigned long vma_pagesize, fault_granule;
> > >         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > > @@ -1162,8 +1161,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >         if (logging_active) {
> > >                 force_pte = true;
> > >                 vma_shift = PAGE_SHIFT;
> > > -               use_read_lock = (fault_status == FSC_PERM && write_fault &&
> > > -                                fault_granule == PAGE_SIZE);
> > >         } else {
> > >                 vma_shift = get_vma_page_shift(vma, hva);
> > >         }
> > > @@ -1267,15 +1264,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >         if (exec_fault && device)
> > >                 return -ENOEXEC;
> > >
> > > -       /*
> > > -        * To reduce MMU contentions and enhance concurrency during dirty
> > > -        * logging dirty logging, only acquire read lock for permission
> > > -        * relaxation.
> > > -        */
> > > -       if (use_read_lock)
> > > -               read_lock(&kvm->mmu_lock);
> > > -       else
> > > -               write_lock(&kvm->mmu_lock);
> > > +       read_lock(&kvm->mmu_lock);
> > > +
> >
> > Ugh, I wish we could get rid of the analogous ugly block on x86.
>
> Maybe we could fold it into an MMU macro in the arch-generic scope?
> Conditional locking is smelly, I was very pleased to delete these lines :)

Smelly indeed. I don't think hiding it behind a macro would really
help. It's just something we'll have to live with in x86.

>
> --
> Thanks,
> Oliver

* Re: [RFC PATCH 10/17] KVM: arm64: Assume a table pte is already owned in post-order traversal
  2022-04-21 16:11     ` Ben Gardon
@ 2022-04-21 17:16       ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-21 17:16 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvmarm, kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson,
	David Matlack

On Thu, Apr 21, 2022 at 09:11:37AM -0700, Ben Gardon wrote:
> On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
> >
> > For parallel walks that collapse a table into a block KVM ensures a
> > locked invalid pte is visible to all observers in pre-order traversal.
> > As such, there is no need to try breaking the pte again.
> 
> When you're doing the pre and post-order traversals, are they
> implemented as separate traversals from the root, or is it a kind of
> pre and post-order where non-leaf nodes are visited on the way down
> and on the way up?

The latter. We do one walk of the tables and fire the appropriate
visitor callbacks based on what part of the walk we're in.
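
Roughly like so, in case it helps to see the shape of it (a purely
illustrative toy; none of these names match the actual kvm_pgtable
walker code):

/* Toy model of the single-pass pre/leaf/post walk. */
#include <stdint.h>
#include <stddef.h>

#define PTES_PER_TABLE  512
#define LAST_LEVEL      3

typedef uint64_t pte_t;

struct walker_ops {
        void (*pre)(pte_t *ptep, int level);    /* table pte, on the way down */
        void (*leaf)(pte_t *ptep, int level);   /* leaf or empty pte */
        void (*post)(pte_t *ptep, int level);   /* table pte, on the way up */
};

/* In this toy, a table pte simply stores the child table's pointer. */
static int pte_is_table(pte_t pte, int level)
{
        return level < LAST_LEVEL && pte && !(pte & 1);
}

static void walk_table(pte_t *table, int level, const struct walker_ops *ops)
{
        size_t i;

        for (i = 0; i < PTES_PER_TABLE; i++) {
                pte_t *ptep = &table[i];

                if (!pte_is_table(*ptep, level)) {
                        ops->leaf(ptep, level);         /* single visit */
                        continue;
                }

                ops->pre(ptep, level);                  /* on the way down */
                walk_table((pte_t *)(uintptr_t)*ptep, level + 1, ops);
                ops->post(ptep, level);                 /* on the way back up */
        }
}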

> I assume either could be made to work, but the re-traversal from the
> root probably minimizes TLB flushes, whereas the pre-and-post-order
> would be a more efficient walk?

When we need to start doing operations on a whole range of memory this
way I completely agree (collapse to 2M, shatter to 4K for a memslot,
etc.).

For the current use cases of the stage 2 walker, to coalesce TLBIs we'd
need better heuristics around when to blast all of stage 2 vs. TLBI with
an IPA argument. IOW, we go through a decent bit of trouble to avoid
flushing all of stage 2 unless deemed necessary. The other unfortunate
thing is that I doubt such observations are portable between implementations,
so the point where we cut over to a full flush is likely highly dependent
on the microarch.

Later revisions of the ARM architecture bring us TLBI instructions that
take a range argument, which could help a lot in this department.

--
Thanks,
Oliver

* Re: [RFC PATCH 06/17] KVM: arm64: Implement break-before-make sequence for parallel walks
  2022-04-21 16:57     ` Ben Gardon
@ 2022-04-21 18:52       ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-21 18:52 UTC (permalink / raw)
  To: Ben Gardon
  Cc: moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson,
	David Matlack

On Thu, Apr 21, 2022 at 09:57:32AM -0700, Ben Gardon wrote:
> On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
> >
> > The ARM architecture requires that software use the 'break-before-make'
> > sequence whenever memory is being remapped. An additional requirement of
> > parallel page walks is a mechanism to ensure exclusive access to a pte,
> > thereby avoiding two threads changing the pte and invariably stomping on
> > one another.
> >
> > Roll the two concepts together into a new helper to implement the
> > 'break' sequence. Use a special invalid pte value to indicate that the
> > pte is under the exclusive control of a thread. If software walkers are
> > traversing the tables in parallel, use an atomic compare-exchange to
> > break the pte. Retry execution on a failed attempt to break the pte, in
> > the hopes that either the instruction will succeed or the pte lock will
> > be successfully acquired.
> >
> > Avoid unnecessary DSBs and TLBIs by only completing the sequence if the
> > evicted pte was valid. For counted non-table ptes drop the reference
> > immediately. Otherwise, references on tables are dropped in post-order
> > traversal as the walker must recurse on the pruned subtree.
> >
> > All of the new atomics do nothing (for now), as there are a few other
> > bits of the map walker that need to be addressed before actually walking
> > in parallel.
> 
> I want to make sure I understand the break-before-make / PTE locking
> patterns here.
> Please check my understanding of the following cases:
> 
> Case 1: Change a leaf PTE (for some reason)
> 1. Traverse the page table to the leaf
> 2. Invalidate the leaf PTE, replacing it with a locked PTE
> 3. Flush TLBs
> 4. Replace the locked PTE with the new value
> 
> In this case, no need to lock the parent SPTEs, right? This is pretty simple.

Right, if we're changing the OA of a leaf PTE. If we are just adjusting
attributes on a leaf we go through stage2_attr_walker(), which skips
step 2 and does the rest in this order: 1, 4, 3.

> Case 2:  Drop a page table
> 1. Traverse to some non-leaf PTE
> 2. Lock the PTE, invalidating it
> 3. Recurse into the child page table
> 4. Lock the PTEs in the child page table. (We need to lock ALL the
> PTEs here right? I don't think we'd get away with locking only the
> valid ones)

Right. We can just skip some of the TLBI/DSB dance when making an
invalid->invalid transition.

> 5. Flush TLBs
> 6. Unlock the PTE from 2
> 7. Free the child page after an RCU grace period (via callback)
> 
> Case 3: Drop a range of leaf PTEs
> 1. Traverse the page table to the first leaf
> 2. For each leaf in the range:
>         a. Invalidate the leaf PTE, replacing it with a locked PTE
> 3. Flush TLBs
> 4. unlock the locked PTEs
> 
> In this case we have to lock ALL PTEs in the range too, right? My
> worry about the whole locking scheme is making sure each thread
> correctly remembers which PTEs it locked versus which might have been
> locked by other threads.

Isn't exclusivity accomplished by checking what you get back from the
xchg()? If I get a locked PTE back, some other thread owns the PTE. If I
get anything else, then I've taken ownership of that PTE.
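
i.e. something like this (a sketch with invented names, not the series'
actual stage2_try_break_pte()):

/* An invalid encoding reserved as the 'this pte is owned' marker. */
#define LOCKED_PTE      BIT(10)

static bool pte_try_lock(kvm_pte_t *ptep, kvm_pte_t old)
{
        /* A pte that already carries the marker belongs to someone else. */
        if (old == LOCKED_PTE)
                return false;

        /*
         * Only the thread whose cmpxchg() observed 'old' wins the pte;
         * a losing racer gets a different value back and knows it owns
         * nothing.
         */
        return cmpxchg(ptep, old, LOCKED_PTE) == old;
}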

> On x86 we solved this by only locking one SPTE at a time, flushing,
> then fixing it, but if you're locking a bunch at once it might get
> complicated.
> Making this locking scheme work without demolishing performance seems hard.

We only change at most a single active PTE per fault on the stage 2 MMU.
We do one of three things on that path:

 1. Install a page/block PTE to an empty PTE
 2. Replace a table PTE with a block PTE
 3. Replace a block PTE with a table PTE

1 is pretty cheap and can skip flushes altogether.

2 only requires a single TLBI (a big, painful flush of the stage 2 context),
but child PTEs needn't be flushed.

3 also requires a single TLBI, but can be done with an IPA and level
hint.

Perhaps the answer is to push teardown into the rcu callback altogether,
IOW don't mess with links in the subtree until then. At that point
there's no need for TLBIs nor atomics.
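
e.g. something along these lines (sketch only; the rcu_head placement
and the free helper are illustrative, not what this series actually
does):

static void stage2_free_unlinked_table_rcu_cb(struct rcu_head *head)
{
        struct page *page = container_of(head, struct page, rcu_head);

        /*
         * Every walker that could have observed this subtree has left
         * its RCU read-side critical section by now, so a plain
         * traversal and free (no atomics, no TLBIs) is sufficient.
         */
        stage2_free_unlinked_table(page_address(page));
}

static void stage2_defer_free_table(void *pgtable)
{
        call_rcu(&virt_to_page(pgtable)->rcu_head,
                 stage2_free_unlinked_table_rcu_cb);
}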

--
Thanks,
Oliver

* Re: [RFC PATCH 09/17] KVM: arm64: Tear down unlinked page tables in parallel walk
  2022-04-21 16:40       ` Oliver Upton
@ 2022-04-22 16:00         ` Quentin Perret
  -1 siblings, 0 replies; 165+ messages in thread
From: Quentin Perret @ 2022-04-22 16:00 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, kvm, Marc Zyngier, Ben Gardon, Peter Shier,
	David Matlack, Paolo Bonzini, linux-arm-kernel

On Thursday 21 Apr 2022 at 16:40:56 (+0000), Oliver Upton wrote:
> The other option would be to not touch the subtree at all until the rcu
> callback, as at that point software will not tweak the tables any more.
> No need for atomics/spinning and can just do a boring traversal.

Right, that is sort of what I had in mind. Note that I'm still trying to
make my mind about the overall approach -- I can see how RCU protection
provides a rather elegant solution to this problem, but this makes the
whole thing inaccessible to e.g. pKVM where RCU is a non-starter. A
possible alternative that comes to mind would be to have all walkers
take references on the pages as they walk down, and release them on
their way back, but I'm still not sure how to make this race-safe. I'll
have a think ...

* Re: [RFC PATCH 09/17] KVM: arm64: Tear down unlinked page tables in parallel walk
  2022-04-22 16:00         ` Quentin Perret
@ 2022-04-22 20:41           ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-22 20:41 UTC (permalink / raw)
  To: Quentin Perret
  Cc: kvm, Marc Zyngier, Peter Shier, Ben Gardon, David Matlack,
	Paolo Bonzini, kvmarm, linux-arm-kernel

On Fri, Apr 22, 2022 at 04:00:45PM +0000, Quentin Perret wrote:
> On Thursday 21 Apr 2022 at 16:40:56 (+0000), Oliver Upton wrote:
> > The other option would be to not touch the subtree at all until the rcu
> > callback, as at that point software will not tweak the tables any more.
> > No need for atomics/spinning and can just do a boring traversal.
> 
> Right that is sort of what I had in mind. Note that I'm still trying to
> make my mind about the overall approach -- I can see how RCU protection
> provides a rather elegant solution to this problem, but this makes the
> whole thing inaccessible to e.g. pKVM where RCU is a non-starter.

Heh, figuring out how to do this for pKVM seemed hard, hence my lazy
attempt :)

> A
> possible alternative that comes to mind would be to have all walkers
> take references on the pages as they walk down, and release them on
> their way back, but I'm still not sure how to make this race-safe. I'll
> have a think ...

Does pKVM ever collapse tables into blocks? That is the only reason any
of this mess ever gets roped in. If not, I think it is possible to get
away with an rwlock, with unmap on the write side and everything else on
the read side, right?
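
i.e. roughly (a sketch with invented names, not actual KVM/pKVM code):

static DEFINE_RWLOCK(stage2_lock);

/* Unmap is the only path that frees table pages, so it excludes everyone. */
static void stage2_unmap_range(u64 addr, u64 size)
{
        write_lock(&stage2_lock);
        /* ... unmap the range and free any unlinked tables ... */
        write_unlock(&stage2_lock);
}

/* Map and permission-relax faults share the lock and rely on pte-level
 * atomics for exclusivity. */
static int stage2_handle_fault(u64 addr)
{
        read_lock(&stage2_lock);
        /* ... install or adjust the pte with cmpxchg()/xchg() ... */
        read_unlock(&stage2_lock);

        return 0;
}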

As far as regular KVM goes we get in this business when disabling dirty
logging on a memslot. Guest faults will lazily collapse the tables back
into blocks. An equally valid implementation would be just to unmap the
whole memslot and have the guest build out the tables again, which could
work with the aforementioned rwlock.

Do any of my ramblings sound workable? :)

--
Thanks,
Oliver

* Re: [RFC PATCH 06/17] KVM: arm64: Implement break-before-make sequence for parallel walks
  2022-04-15 21:58   ` Oliver Upton
@ 2022-04-25 15:13     ` Sean Christopherson
  -1 siblings, 0 replies; 165+ messages in thread
From: Sean Christopherson @ 2022-04-25 15:13 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Ben Gardon, David Matlack

On Fri, Apr 15, 2022, Oliver Upton wrote:
> The ARM architecture requires that software use the 'break-before-make'
> sequence whenever memory is being remapped.

What does "remapped" mean here?  Changing the pfn?  Promoting/demoting to/from a
huge page?

* Re: [RFC PATCH 06/17] KVM: arm64: Implement break-before-make sequence for parallel walks
  2022-04-25 15:13     ` Sean Christopherson
@ 2022-04-25 16:53       ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-04-25 16:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Marc Zyngier, Peter Shier, Ben Gardon, David Matlack,
	Paolo Bonzini, kvmarm, linux-arm-kernel

On Mon, Apr 25, 2022 at 8:13 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 15, 2022, Oliver Upton wrote:
> > The ARM architecture requires that software use the 'break-before-make'
> > sequence whenever memory is being remapped.
>
> What does "remapped" mean here?  Changing the pfn?  Promoting/demoting to/from a
> huge page?

Both, but in the case of this series it is mostly concerned with
promotion/demotion. I'll make this language a bit more precise next
time around.

--
Thanks,
Oliver

* Re: [RFC PATCH 06/17] KVM: arm64: Implement break-before-make sequence for parallel walks
  2022-04-25 16:53       ` Oliver Upton
@ 2022-04-25 18:16         ` Sean Christopherson
  -1 siblings, 0 replies; 165+ messages in thread
From: Sean Christopherson @ 2022-04-25 18:16 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Ben Gardon, David Matlack

On Mon, Apr 25, 2022, Oliver Upton wrote:
> On Mon, Apr 25, 2022 at 8:13 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Apr 15, 2022, Oliver Upton wrote:
> > > The ARM architecture requires that software use the 'break-before-make'
> > > sequence whenever memory is being remapped.
> >
> > What does "remapped" mean here?  Changing the pfn?  Promoting/demoting to/from a
> > huge page?
> 
> Both, but in the case of this series it is mostly concerned with
> promotion/demotion. I'll make this language a bit more precise next
> time around.

Please be very precise :-)  It matters because it should be impossible for KVM to
actually change a PFN in a valid PTE.  Callers of mmu_notifier_change_pte() are
required to bookend it with mmu_notifier_invalidate_range_start/end(), i.e. KVM
should have zapped all PTEs and should not establish new PTEs.  I'd actually like
to drop mmu_notifier_change_pte() altogether, because for all intents and purposes,
it's dead code.  But convincing "everyone" that dropping it instead of trying to
salvage it for KSM is too much work :-)

* Re: [RFC PATCH 06/17] KVM: arm64: Implement break-before-make sequence for parallel walks
  2022-04-21 18:52       ` Oliver Upton
@ 2022-04-26 21:32         ` Ben Gardon
  -1 siblings, 0 replies; 165+ messages in thread
From: Ben Gardon @ 2022-04-26 21:32 UTC (permalink / raw)
  To: Oliver Upton
  Cc: moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	kvm, Marc Zyngier, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Peter Shier, Ricardo Koller,
	Reiji Watanabe, Paolo Bonzini, Sean Christopherson,
	David Matlack

On Thu, Apr 21, 2022 at 11:52 AM Oliver Upton <oupton@google.com> wrote:
>
> On Thu, Apr 21, 2022 at 09:57:32AM -0700, Ben Gardon wrote:
> > On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
> > >
> > > The ARM architecture requires that software use the 'break-before-make'
> > > sequence whenever memory is being remapped. An additional requirement of
> > > parallel page walks is a mechanism to ensure exclusive access to a pte,
> > > thereby avoiding two threads changing the pte and invariably stomping on
> > > one another.
> > >
> > > Roll the two concepts together into a new helper to implement the
> > > 'break' sequence. Use a special invalid pte value to indicate that the
> > > pte is under the exclusive control of a thread. If software walkers are
> > > traversing the tables in parallel, use an atomic compare-exchange to
> > > break the pte. Retry execution on a failed attempt to break the pte, in
> > > the hopes that either the instruction will succeed or the pte lock will
> > > be successfully acquired.
> > >
> > > Avoid unnecessary DSBs and TLBIs by only completing the sequence if the
> > > evicted pte was valid. For counted non-table ptes drop the reference
> > > immediately. Otherwise, references on tables are dropped in post-order
> > > traversal as the walker must recurse on the pruned subtree.
> > >
> > > All of the new atomics do nothing (for now), as there are a few other
> > > bits of the map walker that need to be addressed before actually walking
> > > in parallel.
> >
> > I want to make sure I understand the break-before-make / PTE locking
> > patterns here.
> > Please check my understanding of the following cases:
> >
> > Case 1: Change a leaf PTE (for some reason)
> > 1. Traverse the page table to the leaf
> > 2. Invalidate the leaf PTE, replacing it with a locked PTE
> > 3. Flush TLBs
> > 4. Replace the locked PTE with the new value
> >
> > In this case, no need to lock the parent SPTEs, right? This is pretty simple.
>
> Right, if we're changing the OA of a leaf PTE. If we are just adjusting
> attributes on a leaf we go through stage2_attr_walker(), which skips
> step 2 and does the rest in this order: 1, 4, 3.
>
> > Case 2:  Drop a page table
> > 1. Traverse to some non-leaf PTE
> > 2. Lock the PTE, invalidating it
> > 3. Recurse into the child page table
> > 4. Lock the PTEs in the child page table. (We need to lock ALL the
> > PTEs here right? I don't think we'd get away with locking only the
> > valid ones)
>
> Right. We can just skip some of the TLBI/DSB dance when making an
> invalid->invalid transition.
>
> > 5. Flush TLBs
> > 6. Unlock the PTE from 2
> > 7. Free the child page after an RCU grace period (via callback)
> >
> > Case 3: Drop a range of leaf PTEs
> > 1. Traverse the page table to the first leaf
> > 2. For each leaf in the range:
> >         a. Invalidate the leaf PTE, replacing it with a locked PTE
> > 3. Flush TLBs
> > 4. Unlock the locked PTEs
> >
> > In this case we have to lock ALL PTEs in the range too, right? My
> > worry about the whole locking scheme is making sure each thread
> > correctly remembers which PTEs it locked versus which might have been
> > locked by other threads.
>
> Isn't exclusivity accomplished by checking what you get back from the
> xchg()? If I get a locked PTE back, some other thread owns the PTE. If I
> get anything else, then I've taken ownership of that PTE.

That's true if you only modify one PTE at a time, but if you want to
batch flushes by:
1. Locking a bunch of PTEs
2. TLB invalidate
3. Set them to some new value (e.g. 0)
Then you need to track which ones you locked. If you locked an entire
contiguous region, that works, but you need some way to ensure you don't
mistake a PTE locked by another thread for one you locked yourself.
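
For what it's worth, one way to keep that bookkeeping cheap is to let the
cmpxchg() result drive a per-walker bitmap, so only the entries this thread
actually won are rewritten after the single batched invalidation. A sketch,
reusing the hypothetical KVM_INVALID_PTE_LOCKED encoding from above:

static void stage2_zap_batch(kvm_pte_t *ptep, unsigned int nr_ptes,
                             unsigned long *owned)
{
        unsigned int i;

        bitmap_zero(owned, nr_ptes);

        for (i = 0; i < nr_ptes; i++) {
                kvm_pte_t old = READ_ONCE(ptep[i]);

                /* Entries already holding the locked value belong to someone else. */
                if (old == KVM_INVALID_PTE_LOCKED)
                        continue;

                /* A successful cmpxchg() is the proof of ownership. */
                if (cmpxchg(&ptep[i], old, KVM_INVALID_PTE_LOCKED) == old)
                        __set_bit(i, owned);
        }

        /* A single DSB + ranged TLBI for everything this walker locked would go here. */

        for_each_set_bit(i, owned, nr_ptes)
                WRITE_ONCE(ptep[i], 0);
}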

>
> > On x86 we solved this by only locking one SPTE at a time, flushing,
> > then fixing it, but if you're locking a bunch at once it might get
> > complicated.
> > Making this locking scheme work without demolishing performance seems hard.
>
> We only change at most a single active PTE per fault on the stage 2 MMU.
> We do one of three things on that path:
>
>  1. Install a page/block PTE to an empty PTE
>  2. Replace a table PTE with a block PTE
>  3. Replace a block PTE with a table PTE
>
> 1 is pretty cheap and can skip flushes altogether.
>
> 2 only requires a single TLBI (a big, painful flush of the stage 2 context),
> but child PTEs needn't be flushed.
>
> 3 also requires a single TLBI, but can be done with an IPA and level
> hint.
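
Put differently, the flush choice follows directly from what the evicted PTE
was. A sketch of that decision (the helper name is made up, though the two
hyp TLBI calls are the ones the arm64 code already uses):

static void stage2_flush_broken_pte(struct kvm_s2_mmu *mmu, kvm_pte_t old,
                                    u64 addr, u32 level)
{
        if (!kvm_pte_valid(old))
                return;         /* case 1: nothing was ever cached */

        if (kvm_pte_table(old, level))
                /* case 2: collapsing a table, flush the whole stage-2 context */
                kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
        else
                /* case 3: splitting a block, TLBI by IPA with a level hint */
                kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, addr, level);
}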
>
> Perhaps the answer is to push teardown into the rcu callback altogether,
> IOW don't mess with links in the subtree until then. At that point
> there's no need for TLBIs nor atomics.
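
A sketch of that last idea, with entirely hypothetical names (in particular
stage2_free_removed_subtree() does not exist): only the detached root is
handed to RCU after the break and the single TLBI, and the callback can then
free the whole subtree with a plain walk, since no other software walker can
reach it once the grace period has elapsed.

struct stage2_unlinked_table {
        struct rcu_head rcu;
        void *pgtable;  /* root page of the now-unreachable subtree */
        u32 level;
};

static void stage2_unlinked_free_cb(struct rcu_head *head)
{
        struct stage2_unlinked_table *unlinked =
                container_of(head, struct stage2_unlinked_table, rcu);

        /* Post-order walk and free; no atomics or TLBIs needed here. */
        stage2_free_removed_subtree(unlinked->pgtable, unlinked->level);
        kfree(unlinked);
}

/* Called after the parent PTE has been broken and the TLBI issued. */
static void stage2_defer_teardown(void *pgtable, u32 level)
{
        struct stage2_unlinked_table *unlinked;

        unlinked = kzalloc(sizeof(*unlinked), GFP_ATOMIC);
        if (!unlinked)
                return; /* fallback (e.g. synchronous teardown) elided */

        unlinked->pgtable = pgtable;
        unlinked->level = level;
        call_rcu(&unlinked->rcu, stage2_unlinked_free_cb);
}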
>
> --
> Thanks,
> Oliver

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 09/17] KVM: arm64: Tear down unlinked page tables in parallel walk
  2022-04-22 20:41           ` Oliver Upton
@ 2022-05-03 14:17             ` Quentin Perret
  -1 siblings, 0 replies; 165+ messages in thread
From: Quentin Perret @ 2022-05-03 14:17 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, kvm, Marc Zyngier, Ben Gardon, Peter Shier,
	David Matlack, Paolo Bonzini, linux-arm-kernel

On Friday 22 Apr 2022 at 20:41:47 (+0000), Oliver Upton wrote:
> On Fri, Apr 22, 2022 at 04:00:45PM +0000, Quentin Perret wrote:
> > On Thursday 21 Apr 2022 at 16:40:56 (+0000), Oliver Upton wrote:
> > > The other option would be to not touch the subtree at all until the rcu
> > > callback, as at that point software will not tweak the tables any more.
> > > No need for atomics/spinning and can just do a boring traversal.
> > 
> > Right that is sort of what I had in mind. Note that I'm still trying to
> > make my mind about the overall approach -- I can see how RCU protection
> > provides a rather elegant solution to this problem, but this makes the
> > whole thing inaccessible to e.g. pKVM where RCU is a non-starter.
> 
> Heh, figuring out how to do this for pKVM seemed hard hence my lazy
> attempt :)
> 
> > A
> > possible alternative that comes to mind would be to have all walkers
> > take references on the pages as they walk down, and release them on
> > their way back, but I'm still not sure how to make this race-safe. I'll
> > have a think ...
> 
> Does pKVM ever collapse tables into blocks? That is the only reason any
> of this mess ever gets roped in. If not I think it is possible to get
> away with a rwlock with unmap on the write side and everything else on
> the read side, right?
> 
> As far as regular KVM goes we get in this business when disabling dirty
> logging on a memslot. Guest faults will lazily collapse the tables back
> into blocks. An equally valid implementation would be just to unmap the
> whole memslot and have the guest build out the tables again, which could
> work with the aforementioned rwlock.
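
For completeness, a very rough picture of that split, assuming kvm->mmu_lock
is (or becomes) the stage-2 rwlock; the function names are made up and the
bodies are elided:

static void stage2_unmap_range(struct kvm *kvm, u64 ipa, u64 size)
{
        write_lock(&kvm->mmu_lock);
        /* Zap the range, invalidate TLBs, free tables: fully exclusive. */
        write_unlock(&kvm->mmu_lock);
}

static int stage2_handle_fault(struct kvm *kvm, u64 ipa)
{
        int ret = 0;

        read_lock(&kvm->mmu_lock);
        /*
         * Install mappings or relax permissions; races between readers
         * would still be resolved per-PTE (e.g. with cmpxchg()).
         */
        read_unlock(&kvm->mmu_lock);
        return ret;
}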

Apologies for the delay on this one, I was away for a while.

Yup, that all makes sense. FWIW the pKVM use-case I have in mind is
slightly different. Specifically, in the pKVM world the hypervisor
maintains a stage-2 for the host that is all identity mapped. So we use
nice big block mappings as much as we can. But when a protected guest
starts, the hypervisor needs to break down the host stage-2 blocks to
unmap the 4K guest pages from the host (which is where the protection
comes from in pKVM). And when the guest is torn down, the host can
reclaim its pages, hence putting us in a position to coalesce its
stage-2 into nice big blocks again. Note that none of this coalescing
is currently implemented even in our pKVM prototype, so it's a bit
unfair to ask you to deal with this stuff now, but clearly it'd be cool
if there were a way we could make these things coexist and even ideally
share some code...

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 09/17] KVM: arm64: Tear down unlinked page tables in parallel walk
  2022-05-03 14:17             ` Quentin Perret
@ 2022-05-04  6:03               ` Oliver Upton
  -1 siblings, 0 replies; 165+ messages in thread
From: Oliver Upton @ 2022-05-04  6:03 UTC (permalink / raw)
  To: Quentin Perret
  Cc: kvm, Marc Zyngier, Peter Shier, Ben Gardon, David Matlack,
	Paolo Bonzini, kvmarm, linux-arm-kernel

On Tue, May 03, 2022 at 02:17:25PM +0000, Quentin Perret wrote:
> On Friday 22 Apr 2022 at 20:41:47 (+0000), Oliver Upton wrote:
> > On Fri, Apr 22, 2022 at 04:00:45PM +0000, Quentin Perret wrote:
> > > On Thursday 21 Apr 2022 at 16:40:56 (+0000), Oliver Upton wrote:
> > > > The other option would be to not touch the subtree at all until the rcu
> > > > callback, as at that point software will not tweak the tables any more.
> > > > No need for atomics/spinning and can just do a boring traversal.
> > > 
> > > Right that is sort of what I had in mind. Note that I'm still trying to
> > > make my mind about the overall approach -- I can see how RCU protection
> > > provides a rather elegant solution to this problem, but this makes the
> > > whole thing inaccessible to e.g. pKVM where RCU is a non-starter.
> > 
> > Heh, figuring out how to do this for pKVM seemed hard hence my lazy
> > attempt :)
> > 
> > > A
> > > possible alternative that comes to mind would be to have all walkers
> > > take references on the pages as they walk down, and release them on
> > > their way back, but I'm still not sure how to make this race-safe. I'll
> > > have a think ...
> > 
> > Does pKVM ever collapse tables into blocks? That is the only reason any
> > of this mess ever gets roped in. If not I think it is possible to get
> > away with a rwlock with unmap on the write side and everything else on
> > the read side, right?
> > 
> > As far as regular KVM goes we get in this business when disabling dirty
> > logging on a memslot. Guest faults will lazily collapse the tables back
> > into blocks. An equally valid implementation would be just to unmap the
> > whole memslot and have the guest build out the tables again, which could
> > work with the aforementioned rwlock.
> 
> Apologies for the delay on this one, I was away for a while.
> 
> Yup, that all makes sense. FWIW the pKVM use-case I have in mind is
> slightly different. Specifically, in the pKVM world the hypervisor
> maintains a stage-2 for the host that is all identity mapped. So we use
> nice big block mappings as much as we can. But when a protected guest
> starts, the hypervisor needs to break down the host stage-2 blocks to
> unmap the 4K guest pages from the host (which is where the protection
> comes from in pKVM). And when the guest is torn down, the host can
> reclaim its pages, hence putting us in a position to coalesce its
> stage-2 into nice big blocks again. Note that none of this coalescing
> is currently implemented even in our pKVM prototype, so it's a bit
> unfair to ask you to deal with this stuff now, but clearly it'd be cool
> if there were a way we could make these things coexist and even ideally
> share some code...

Oh, it certainly isn't unfair to make sure we've got good constructs
landing for everyone to use :-)

I'll need to chew on this a bit more to have a better answer. The reason
I hesitate to do the giant unmap for non-pKVM is that I believe we'd be
leaving some performance on the table for newer implementations of the
architecture. Having said that, avoiding a tlbi vmalls12e1is on every
collapsed table is highly desirable.

FEAT_BBM=2 semantics in the MMU is also on the todo list. In this case
we'd do a direct table->block transformation on the PTE and elide that
nasty tlbi.

Unless there are objections, I'll probably hobble this series along as-is
for the time being. My hope is that other table walkers can join in on
the parallel party later down the road.

Thanks for getting back to me.

--
Best,
Oliver

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 14/17] KVM: arm64: Punt last page reference to rcu callback for parallel walk
  2022-04-20  0:53         ` Oliver Upton
@ 2022-09-08  0:52           ` David Matlack
  -1 siblings, 0 replies; 165+ messages in thread
From: David Matlack @ 2022-09-08  0:52 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Ricardo Koller, KVMARM, kvm list, Marc Zyngier, James Morse,
	Alexandru Elisei, Suzuki K Poulose, linux-arm-kernel,
	Peter Shier, Reiji Watanabe, Paolo Bonzini, Sean Christopherson,
	Ben Gardon

On Tue, Apr 19, 2022 at 5:53 PM Oliver Upton <oupton@google.com> wrote:
>
> Hi Ricardo,
>
> On Mon, Apr 18, 2022 at 8:09 PM Ricardo Koller <ricarkol@google.com> wrote:
> >
> > On Mon, Apr 18, 2022 at 07:59:04PM -0700, Ricardo Koller wrote:
> > > On Fri, Apr 15, 2022 at 09:58:58PM +0000, Oliver Upton wrote:
> > > > It is possible that a table page remains visible to another thread until
> > > > the next rcu synchronization event. To that end, we cannot drop the last
> > > > page reference synchronous with post-order traversal for a parallel
> > > > table walk.
> > > >
> > > > Schedule an rcu callback to clean up the child table page for parallel
> > > > walks.
> > > >
> > > > Signed-off-by: Oliver Upton <oupton@google.com>
> > > > ---
> > > >  arch/arm64/include/asm/kvm_pgtable.h |  3 ++
> > > >  arch/arm64/kvm/hyp/pgtable.c         | 24 +++++++++++++--
> > > >  arch/arm64/kvm/mmu.c                 | 44 +++++++++++++++++++++++++++-
> > > >  3 files changed, 67 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > > > index 74955aba5918..52e55e00f0ca 100644
> > > > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > > > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > > > @@ -81,6 +81,8 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
> > > >   * @put_page:                      Decrement the refcount on a page. When the
> > > >   *                         refcount reaches 0 the page is automatically
> > > >   *                         freed.
> > > > + * @free_table:                    Drop the last page reference, possibly in the
> > > > + *                         next RCU sync if doing a shared walk.
> > > >   * @page_count:                    Return the refcount of a page.
> > > >   * @phys_to_virt:          Convert a physical address into a virtual
> > > >   *                         address mapped in the current context.
> > > > @@ -98,6 +100,7 @@ struct kvm_pgtable_mm_ops {
> > > >     void            (*get_page)(void *addr);
> > > >     void            (*put_page)(void *addr);
> > > >     int             (*page_count)(void *addr);
> > > > +   void            (*free_table)(void *addr, bool shared);
> > > >     void*           (*phys_to_virt)(phys_addr_t phys);
> > > >     phys_addr_t     (*virt_to_phys)(void *addr);
> > > >     void            (*dcache_clean_inval_poc)(void *addr, size_t size);
> > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > index 121818d4c33e..a9a48edba63b 100644
> > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > @@ -147,12 +147,19 @@ static inline void kvm_pgtable_walk_end(void)
> > > >  {}
> > > >
> > > >  #define kvm_dereference_ptep       rcu_dereference_raw
> > > > +
> > > > +static inline void kvm_pgtable_destroy_barrier(void)
> > > > +{}
> > > > +
> > > >  #else
> > > >  #define kvm_pgtable_walk_begin     rcu_read_lock
> > > >
> > > >  #define kvm_pgtable_walk_end       rcu_read_unlock
> > > >
> > > >  #define kvm_dereference_ptep       rcu_dereference
> > > > +
> > > > +#define kvm_pgtable_destroy_barrier        rcu_barrier
> > > > +
> > > >  #endif
> > > >
> > > >  static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte, struct kvm_pgtable_mm_ops *mm_ops)
> > > > @@ -1063,7 +1070,12 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
> > > >             childp = kvm_pte_follow(*old, mm_ops);
> > > >     }
> > > >
> > > > -   mm_ops->put_page(childp);
> > > > +   /*
> > > > +    * If we do not have exclusive access to the page tables it is possible
> > > > +    * the unlinked table remains visible to another thread until the next
> > > > +    * rcu synchronization.
> > > > +    */
> > > > +   mm_ops->free_table(childp, shared);
> > > >     mm_ops->put_page(ptep);
> > > >
> > > >     return ret;
> > > > @@ -1203,7 +1215,7 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> > > >                                            kvm_granule_size(level));
> > > >
> > > >     if (childp)
> > > > -           mm_ops->put_page(childp);
> > > > +           mm_ops->free_table(childp, shared);
> > > >
> > > >     return 0;
> > > >  }
> > > > @@ -1433,7 +1445,7 @@ static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> > > >     mm_ops->put_page(ptep);
> > > >
> > > >     if (kvm_pte_table(*old, level))
> > > > -           mm_ops->put_page(kvm_pte_follow(*old, mm_ops));
> > > > +           mm_ops->free_table(kvm_pte_follow(*old, mm_ops), shared);
> > > >
> > > >     return 0;
> > > >  }
> > > > @@ -1452,4 +1464,10 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
> > > >     pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level) * PAGE_SIZE;
> > > >     pgt->mm_ops->free_pages_exact(pgt->pgd, pgd_sz);
> > > >     pgt->pgd = NULL;
> > > > +
> > > > +   /*
> > > > +    * Guarantee that all unlinked subtrees associated with the stage2 page
> > > > +    * table have also been freed before returning.
> > > > +    */
> > > > +   kvm_pgtable_destroy_barrier();
> > > >  }
> > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > index cc6ed6b06ec2..6ecf37009c21 100644
> > > > --- a/arch/arm64/kvm/mmu.c
> > > > +++ b/arch/arm64/kvm/mmu.c
> > > > @@ -98,9 +98,50 @@ static bool kvm_is_device_pfn(unsigned long pfn)
> > > >  static void *stage2_memcache_zalloc_page(void *arg)
> > > >  {
> > > >     struct kvm_mmu_caches *mmu_caches = arg;
> > > > +   struct stage2_page_header *hdr;
> > > > +   void *addr;
> > > >
> > > >     /* Allocated with __GFP_ZERO, so no need to zero */
> > > > -   return kvm_mmu_memory_cache_alloc(&mmu_caches->page_cache);
> > > > +   addr = kvm_mmu_memory_cache_alloc(&mmu_caches->page_cache);
> > > > +   if (!addr)
> > > > +           return NULL;
> > > > +
> > > > +   hdr = kvm_mmu_memory_cache_alloc(&mmu_caches->header_cache);
> > > > +   if (!hdr) {
> > > > +           free_page((unsigned long)addr);
> > > > +           return NULL;
> > > > +   }
> > > > +
> > > > +   hdr->page = virt_to_page(addr);
> > > > +   set_page_private(hdr->page, (unsigned long)hdr);
> > > > +   return addr;
> > > > +}
> > > > +
> > > > +static void stage2_free_page_now(struct stage2_page_header *hdr)
> > > > +{
> > > > +   WARN_ON(page_ref_count(hdr->page) != 1);
> > > > +
> > > > +   __free_page(hdr->page);
> > > > +   kmem_cache_free(stage2_page_header_cache, hdr);
> > > > +}
> > > > +
> > > > +static void stage2_free_page_rcu_cb(struct rcu_head *head)
> > > > +{
> > > > +   struct stage2_page_header *hdr = container_of(head, struct stage2_page_header,
> > > > +                                                 rcu_head);
> > > > +
> > > > +   stage2_free_page_now(hdr);
> > > > +}
> > > > +
> > > > +static void stage2_free_table(void *addr, bool shared)
> > > > +{
> > > > +   struct page *page = virt_to_page(addr);
> > > > +   struct stage2_page_header *hdr = (struct stage2_page_header *)page_private(page);
> > > > +
> > > > +   if (shared)
> > > > +           call_rcu(&hdr->rcu_head, stage2_free_page_rcu_cb);
> > >
> > > Can the number of callbacks grow to "dangerous" numbers? Can it be
> > > bounded with something like the following?
> > >
> > > if number of readers is really high:
> > >       synchronize_rcu()
> > > else
> > >       call_rcu()
> >
> > sorry, meant to say "number of callbacks"
>
> Good point. I don't have data for this, but generally speaking I do
> not believe we need to enqueue a callback for every page. In fact,
> since we already make the invalid PTE visible in pre-order traversal
> we could theoretically free all tables from a single RCU callback (per
> fault).

I noticed this change was made in v1, but I don't understand the
reasoning. Whether page tables are freed in many callbacks (one per
table) or a single callback (one per subtree), we will still do the
same amount of work in RCU callbacks. In fact the latter (i.e. v1)
approach seems like it ends up doing more work in the RCU callback
because it has to do the page table traversal rather than just call
free() a bunch of times. I'm also not sure if RCU callbacks have any
limitations on how long they can/should take (it may be better to have
lots of tiny callbacks than one large one). OTOH maybe I'm just
misunderstanding something so I thought I'd ask :)

>
> I think if we used synchronize_rcu() then we would need to drop the
> mmu lock since it will block the thread.
>
> --
> Thanks,
> Oliver

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC PATCH 14/17] KVM: arm64: Punt last page reference to rcu callback for parallel walk
@ 2022-09-08  0:52           ` David Matlack
  0 siblings, 0 replies; 165+ messages in thread
From: David Matlack @ 2022-09-08  0:52 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvm list, Marc Zyngier, Ben Gardon, Peter Shier, Paolo Bonzini,
	KVMARM, linux-arm-kernel

On Tue, Apr 19, 2022 at 5:53 PM Oliver Upton <oupton@google.com> wrote:
>
> Hi Ricardo,
>
> On Mon, Apr 18, 2022 at 8:09 PM Ricardo Koller <ricarkol@google.com> wrote:
> >
> > On Mon, Apr 18, 2022 at 07:59:04PM -0700, Ricardo Koller wrote:
> > > On Fri, Apr 15, 2022 at 09:58:58PM +0000, Oliver Upton wrote:
> > > > It is possible that a table page remains visible to another thread until
> > > > the next rcu synchronization event. To that end, we cannot drop the last
> > > > page reference synchronous with post-order traversal for a parallel
> > > > table walk.
> > > >
> > > > Schedule an rcu callback to clean up the child table page for parallel
> > > > walks.
> > > >
> > > > Signed-off-by: Oliver Upton <oupton@google.com>
> > > > ---
> > > >  arch/arm64/include/asm/kvm_pgtable.h |  3 ++
> > > >  arch/arm64/kvm/hyp/pgtable.c         | 24 +++++++++++++--
> > > >  arch/arm64/kvm/mmu.c                 | 44 +++++++++++++++++++++++++++-
> > > >  3 files changed, 67 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > > > index 74955aba5918..52e55e00f0ca 100644
> > > > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > > > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > > > @@ -81,6 +81,8 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
> > > >   * @put_page:                      Decrement the refcount on a page. When the
> > > >   *                         refcount reaches 0 the page is automatically
> > > >   *                         freed.
> > > > + * @free_table:                    Drop the last page reference, possibly in the
> > > > + *                         next RCU sync if doing a shared walk.
> > > >   * @page_count:                    Return the refcount of a page.
> > > >   * @phys_to_virt:          Convert a physical address into a virtual
> > > >   *                         address mapped in the current context.
> > > > @@ -98,6 +100,7 @@ struct kvm_pgtable_mm_ops {
> > > >     void            (*get_page)(void *addr);
> > > >     void            (*put_page)(void *addr);
> > > >     int             (*page_count)(void *addr);
> > > > +   void            (*free_table)(void *addr, bool shared);
> > > >     void*           (*phys_to_virt)(phys_addr_t phys);
> > > >     phys_addr_t     (*virt_to_phys)(void *addr);
> > > >     void            (*dcache_clean_inval_poc)(void *addr, size_t size);
> > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > index 121818d4c33e..a9a48edba63b 100644
> > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > @@ -147,12 +147,19 @@ static inline void kvm_pgtable_walk_end(void)
> > > >  {}
> > > >
> > > >  #define kvm_dereference_ptep       rcu_dereference_raw
> > > > +
> > > > +static inline void kvm_pgtable_destroy_barrier(void)
> > > > +{}
> > > > +
> > > >  #else
> > > >  #define kvm_pgtable_walk_begin     rcu_read_lock
> > > >
> > > >  #define kvm_pgtable_walk_end       rcu_read_unlock
> > > >
> > > >  #define kvm_dereference_ptep       rcu_dereference
> > > > +
> > > > +#define kvm_pgtable_destroy_barrier        rcu_barrier
> > > > +
> > > >  #endif
> > > >
> > > >  static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte, struct kvm_pgtable_mm_ops *mm_ops)
> > > > @@ -1063,7 +1070,12 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
> > > >             childp = kvm_pte_follow(*old, mm_ops);
> > > >     }
> > > >
> > > > -   mm_ops->put_page(childp);
> > > > +   /*
> > > > +    * If we do not have exclusive access to the page tables it is possible
> > > > +    * the unlinked table remains visible to another thread until the next
> > > > +    * rcu synchronization.
> > > > +    */
> > > > +   mm_ops->free_table(childp, shared);
> > > >     mm_ops->put_page(ptep);
> > > >
> > > >     return ret;
> > > > @@ -1203,7 +1215,7 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> > > >                                            kvm_granule_size(level));
> > > >
> > > >     if (childp)
> > > > -           mm_ops->put_page(childp);
> > > > +           mm_ops->free_table(childp, shared);
> > > >
> > > >     return 0;
> > > >  }
> > > > @@ -1433,7 +1445,7 @@ static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> > > >     mm_ops->put_page(ptep);
> > > >
> > > >     if (kvm_pte_table(*old, level))
> > > > -           mm_ops->put_page(kvm_pte_follow(*old, mm_ops));
> > > > +           mm_ops->free_table(kvm_pte_follow(*old, mm_ops), shared);
> > > >
> > > >     return 0;
> > > >  }
> > > > @@ -1452,4 +1464,10 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
> > > >     pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level) * PAGE_SIZE;
> > > >     pgt->mm_ops->free_pages_exact(pgt->pgd, pgd_sz);
> > > >     pgt->pgd = NULL;
> > > > +
> > > > +   /*
> > > > +    * Guarantee that all unlinked subtrees associated with the stage2 page
> > > > +    * table have also been freed before returning.
> > > > +    */
> > > > +   kvm_pgtable_destroy_barrier();
> > > >  }
> > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > index cc6ed6b06ec2..6ecf37009c21 100644
> > > > --- a/arch/arm64/kvm/mmu.c
> > > > +++ b/arch/arm64/kvm/mmu.c
> > > > @@ -98,9 +98,50 @@ static bool kvm_is_device_pfn(unsigned long pfn)
> > > >  static void *stage2_memcache_zalloc_page(void *arg)
> > > >  {
> > > >     struct kvm_mmu_caches *mmu_caches = arg;
> > > > +   struct stage2_page_header *hdr;
> > > > +   void *addr;
> > > >
> > > >     /* Allocated with __GFP_ZERO, so no need to zero */
> > > > -   return kvm_mmu_memory_cache_alloc(&mmu_caches->page_cache);
> > > > +   addr = kvm_mmu_memory_cache_alloc(&mmu_caches->page_cache);
> > > > +   if (!addr)
> > > > +           return NULL;
> > > > +
> > > > +   hdr = kvm_mmu_memory_cache_alloc(&mmu_caches->header_cache);
> > > > +   if (!hdr) {
> > > > +           free_page((unsigned long)addr);
> > > > +           return NULL;
> > > > +   }
> > > > +
> > > > +   hdr->page = virt_to_page(addr);
> > > > +   set_page_private(hdr->page, (unsigned long)hdr);
> > > > +   return addr;
> > > > +}
> > > > +
> > > > +static void stage2_free_page_now(struct stage2_page_header *hdr)
> > > > +{
> > > > +   WARN_ON(page_ref_count(hdr->page) != 1);
> > > > +
> > > > +   __free_page(hdr->page);
> > > > +   kmem_cache_free(stage2_page_header_cache, hdr);
> > > > +}
> > > > +
> > > > +static void stage2_free_page_rcu_cb(struct rcu_head *head)
> > > > +{
> > > > +   struct stage2_page_header *hdr = container_of(head, struct stage2_page_header,
> > > > +                                                 rcu_head);
> > > > +
> > > > +   stage2_free_page_now(hdr);
> > > > +}
> > > > +
> > > > +static void stage2_free_table(void *addr, bool shared)
> > > > +{
> > > > +   struct page *page = virt_to_page(addr);
> > > > +   struct stage2_page_header *hdr = (struct stage2_page_header *)page_private(page);
> > > > +
> > > > +   if (shared)
> > > > +           call_rcu(&hdr->rcu_head, stage2_free_page_rcu_cb);
> > >
> > > Can the number of callbacks grow to "dangerous" numbers? Can it be
> > > bounded with something like the following?
> > >
> > > if number of readers is really high:
> > >       synchronize_rcu()
> > > else
> > >       call_rcu()
> >
> > sorry, meant to say "number of callbacks"
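A minimal sketch of the bounding idea above, reusing the stage2_page_header and
stage2_free_page_now() pieces from the quoted patch. The counter, the threshold
and the *_bounded names are invented here purely for illustration; they are not
part of the series:

#include <linux/atomic.h>
#include <linux/mm.h>
#include <linux/rcupdate.h>

/* Hypothetical bound on table pages still waiting for a grace period. */
static atomic_t stage2_nr_pending_rcu = ATOMIC_INIT(0);
#define STAGE2_MAX_PENDING_RCU	1024	/* arbitrary, for illustration */

static void stage2_free_page_rcu_cb_bounded(struct rcu_head *head)
{
	struct stage2_page_header *hdr = container_of(head, struct stage2_page_header,
						      rcu_head);

	stage2_free_page_now(hdr);
	atomic_dec(&stage2_nr_pending_rcu);
}

static void stage2_free_table_bounded(void *addr, bool shared)
{
	struct page *page = virt_to_page(addr);
	struct stage2_page_header *hdr = (struct stage2_page_header *)page_private(page);

	if (!shared) {
		/* Exclusive access: no reader can still see the table. */
		stage2_free_page_now(hdr);
		return;
	}

	if (atomic_read(&stage2_nr_pending_rcu) > STAGE2_MAX_PENDING_RCU) {
		/*
		 * Too many frees already queued: wait for a grace period and
		 * free synchronously. This sleeps, so it is only an option
		 * once the mmu lock has been dropped (see further down in
		 * the thread).
		 */
		synchronize_rcu();
		stage2_free_page_now(hdr);
		return;
	}

	atomic_inc(&stage2_nr_pending_rcu);
	call_rcu(&hdr->rcu_head, stage2_free_page_rcu_cb_bounded);
}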
>
> Good point. I don't have data for this, but generally speaking I do
> not believe we need to enqueue a callback for every page. In fact,
> since we already make the invalid PTE visible in pre-order traversal
> we could theoretically free all tables from a single RCU callback (per
> fault).
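For illustration, one shape the "single RCU callback per fault" idea could take.
None of this is code from any version of the series; the batch struct, the
hypothetical node list field in stage2_page_header and the helper names are
made up. The fault handler would collect every table page it unlinks onto a
list and queue one callback for the whole batch:

#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* Hypothetical: all table pages unlinked while handling a single fault. */
struct stage2_free_batch {
	struct list_head pages;		/* stage2_page_header::node, a made-up field */
	struct rcu_head rcu_head;
};

static void stage2_free_batch_rcu_cb(struct rcu_head *head)
{
	struct stage2_free_batch *batch = container_of(head, struct stage2_free_batch,
						       rcu_head);
	struct stage2_page_header *hdr, *tmp;

	/*
	 * By the time this runs, no reader that saw the old tables can still
	 * be inside its RCU read-side critical section, so the whole
	 * unlinked subtree can be freed in one go.
	 */
	list_for_each_entry_safe(hdr, tmp, &batch->pages, node)
		stage2_free_page_now(hdr);

	kfree(batch);
}

The fault path would then issue a single
call_rcu(&batch->rcu_head, stage2_free_batch_rcu_cb) once the unlinked subtree
is no longer reachable, instead of one call_rcu() per table page.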

I noticed this change was made in v1, but I don't understand the
reasoning. Whether page tables are freed in many callbacks (one per
table) or a single callback (one per subtree), we will still do the
same amount of work in RCU callbacks. In fact the latter (i.e. v1)
approach seems like it ends up doing more work in the RCU callback
because it has to do the page table traversal rather than just call
free() a bunch of times. I'm also not sure if RCU callbacks have any
limitations on how long they can/should take (it may be better to have
lots of tiny callbacks than one large one). OTOH maybe I'm just
misunderstanding something so I thought I'd ask :)

>
> I think if we used synchronize_rcu() then we would need to drop the
> mmu lock, since it would block the thread.
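To spell that out with a fragment (illustrative only, and assuming the rwlock
form of kvm->mmu_lock used on this path): synchronize_rcu() may sleep for a
full grace period, so it can only run after the spinning mmu lock is released,
whereas call_rcu() can be issued while the lock is still held.

#include <linux/kvm_host.h>
#include <linux/rcupdate.h>

static void stage2_unlink_and_wait(struct kvm *kvm)
{
	read_lock(&kvm->mmu_lock);
	/*
	 * Unlink a table here. The mmu lock is a spinning lock, so nothing
	 * in this region may sleep.
	 */
	read_unlock(&kvm->mmu_lock);

	/*
	 * Only legal after the unlock: synchronize_rcu() blocks until every
	 * reader that could still see the unlinked table has finished.
	 * call_rcu() avoids the wait entirely by deferring the free to a
	 * callback, which is why the series uses it.
	 */
	synchronize_rcu();
}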
>
> --
> Thanks,
> Oliver

end of thread, other threads:[~2022-09-08  8:34 UTC | newest]

Thread overview: 165+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-15 21:58 [RFC PATCH 00/17] KVM: arm64: Parallelize stage 2 fault handling Oliver Upton
2022-04-15 21:58 ` [RFC PATCH 01/17] KVM: arm64: Directly read owner id field in stage2_pte_is_counted() Oliver Upton
2022-04-15 21:58 ` [RFC PATCH 02/17] KVM: arm64: Only read the pte once per visit Oliver Upton
2022-04-21 16:12   ` Ben Gardon
2022-04-15 21:58 ` [RFC PATCH 03/17] KVM: arm64: Return the next table from map callbacks Oliver Upton
2022-04-15 21:58 ` [RFC PATCH 04/17] KVM: arm64: Protect page table traversal with RCU Oliver Upton
2022-04-19  2:55   ` Ricardo Koller
2022-04-19  3:01     ` Oliver Upton
2022-04-15 21:58 ` [RFC PATCH 05/17] KVM: arm64: Take an argument to indicate parallel walk Oliver Upton
2022-04-16 11:30   ` Marc Zyngier
2022-04-16 16:03     ` Oliver Upton
2022-04-15 21:58 ` [RFC PATCH 06/17] KVM: arm64: Implement break-before-make sequence for parallel walks Oliver Upton
2022-04-20 16:55   ` Quentin Perret
2022-04-20 17:06     ` Oliver Upton
2022-04-21 16:57   ` Ben Gardon
2022-04-21 18:52     ` Oliver Upton
2022-04-26 21:32       ` Ben Gardon
2022-04-25 15:13   ` Sean Christopherson
2022-04-25 16:53     ` Oliver Upton
2022-04-25 18:16       ` Sean Christopherson
2022-04-15 21:58 ` [RFC PATCH 07/17] KVM: arm64: Enlighten perm relax path about " Oliver Upton
2022-04-15 21:58 ` [RFC PATCH 08/17] KVM: arm64: Spin off helper for initializing table pte Oliver Upton
2022-04-15 21:58 ` [RFC PATCH 09/17] KVM: arm64: Tear down unlinked page tables in parallel walk Oliver Upton
2022-04-21 13:21   ` Quentin Perret
2022-04-21 16:40     ` Oliver Upton
2022-04-22 16:00       ` Quentin Perret
2022-04-22 20:41         ` Oliver Upton
2022-05-03 14:17           ` Quentin Perret
2022-05-04  6:03             ` Oliver Upton
2022-04-15 21:58 ` [RFC PATCH 10/17] KVM: arm64: Assume a table pte is already owned in post-order traversal Oliver Upton
2022-04-21 16:11   ` Ben Gardon
2022-04-21 17:16     ` Oliver Upton
2022-04-15 21:58 ` [RFC PATCH 11/17] KVM: arm64: Move MMU cache init/destroy into helpers Oliver Upton
2022-04-15 21:58 ` [RFC PATCH 12/17] KVM: arm64: Stuff mmu page cache in sub struct Oliver Upton
2022-04-15 21:58 ` [RFC PATCH 13/17] KVM: arm64: Setup cache for stage2 page headers Oliver Upton
2022-04-15 21:58 ` [RFC PATCH 14/17] KVM: arm64: Punt last page reference to rcu callback for parallel walk Oliver Upton
2022-04-19  2:59   ` Ricardo Koller
2022-04-19  3:09     ` Ricardo Koller
2022-04-20  0:53       ` Oliver Upton
2022-09-08  0:52         ` David Matlack
2022-04-21 16:28   ` Ben Gardon
2022-04-15 21:58 ` [RFC PATCH 15/17] KVM: arm64: Allow parallel calls to kvm_pgtable_stage2_map() Oliver Upton
2022-04-15 21:59 ` [RFC PATCH 16/17] KVM: arm64: Enable parallel stage 2 MMU faults Oliver Upton
2022-04-21 16:35   ` Ben Gardon
2022-04-21 16:46     ` Oliver Upton
2022-04-21 17:03       ` Ben Gardon
2022-04-15 21:59 ` [RFC PATCH 17/17] TESTONLY: KVM: arm64: Add super lazy accounting of stage 2 table pages Oliver Upton
2022-04-15 23:35 ` [RFC PATCH 00/17] KVM: arm64: Parallelize stage 2 fault handling David Matlack
2022-04-16  0:04   ` Oliver Upton
2022-04-21 16:43     ` David Matlack
2022-04-16  6:23 ` Oliver Upton
2022-04-19 17:57 ` Ben Gardon
2022-04-19 18:36   ` Oliver Upton
2022-04-21 16:30     ` Ben Gardon
2022-04-21 16:37       ` Paolo Bonzini
