* [PATCH 0/5] KVM: Remove kvm_is_transparent_hugepage() and friends
@ 2021-07-17  9:55 ` Marc Zyngier
  0 siblings, 0 replies; 54+ messages in thread
From: Marc Zyngier @ 2021-07-17  9:55 UTC (permalink / raw)
  To: linux-arm-kernel, kvm, kvmarm, linux-mm
  Cc: Sean Christopherson, Matthew Wilcox, Paolo Bonzini, Will Deacon,
	Quentin Perret, James Morse, Suzuki K Poulose, Alexandru Elisei,
	kernel-team

A while ago, Willy and Sean pointed out[1] that arm64 is the last user
of kvm_is_transparent_hugepage(), and that there would actually be
some benefit in looking at the userspace mapping directly instead.

This small series does exactly that, although it doesn't yet try to
support anything larger than a PMD-sized mapping for THPs. We could probably
look into unifying this with the huge PUD code, and there is still
some potential use of the contiguous hint.

As a consequence, it removes kvm_is_transparent_hugepage(),
PageTransCompoundMap() and kvm_get_pfn(), all of which have no user
left after this rework.

This has been lightly tested on an Altra box. Although nothing caught
fire, it requires some careful reviewing on the arm64 side.

[1] https://lore.kernel.org/r/YLpLvFPXrIp8nAK4@google.com
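
For quick reference, the net effect on the stage-2 fault path is roughly
the following (condensed from patches 1 and 4 below, comments trimmed):

	/* transparent_hugepage_adjust(): size the mapping from the host PT */
	if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE) &&
	    get_user_mapping_size(kvm, hva) >= PMD_SIZE) {
		*ipap &= PMD_MASK;
		kvm_release_pfn_clean(pfn);
		pfn &= ~(PTRS_PER_PMD - 1);
		get_page(pfn_to_page(pfn));
		*pfnp = pfn;

		return PMD_SIZE;
	}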

Marc Zyngier (5):
  KVM: arm64: Walk userspace page tables to compute the THP mapping size
  KVM: arm64: Avoid mapping size adjustment on permission fault
  KVM: Remove kvm_is_transparent_hugepage() and PageTransCompoundMap()
  KVM: arm64: Use get_page() instead of kvm_get_pfn()
  KVM: Get rid of kvm_get_pfn()

 arch/arm64/kvm/mmu.c       | 57 +++++++++++++++++++++++++++++++++-----
 include/linux/kvm_host.h   |  1 -
 include/linux/page-flags.h | 37 -------------------------
 virt/kvm/kvm_main.c        | 19 +------------
 4 files changed, 51 insertions(+), 63 deletions(-)

-- 
2.30.2


* [PATCH 1/5] KVM: arm64: Walk userspace page tables to compute the THP mapping size
  2021-07-17  9:55 ` Marc Zyngier
@ 2021-07-17  9:55   ` Marc Zyngier
  -1 siblings, 0 replies; 54+ messages in thread
From: Marc Zyngier @ 2021-07-17  9:55 UTC (permalink / raw)
  To: linux-arm-kernel, kvm, kvmarm, linux-mm
  Cc: Sean Christopherson, Matthew Wilcox, Paolo Bonzini, Will Deacon,
	Quentin Perret, James Morse, Suzuki K Poulose, Alexandru Elisei,
	kernel-team

We currently rely on the kvm_is_transparent_hugepage() helper to
discover whether a given page has the potential to be mapped as
a block mapping.

However, this API doesn't really give us everything we want:
- we don't get the size: this is not crucial today as we only
  support PMD-sized THPs, but we'd like to have larger sizes
  in the future
- we're the only user left of the API, and there is a will
  to remove it altogether

To address the above, implement a simple walker using the existing
page table infrastructure, and plumb it into transparent_hugepage_adjust().
No new page sizes are supported in the process.

Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/kvm/mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 3155c9e778f0..db6314b93e99 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -433,6 +433,44 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
 	return 0;
 }
 
+static struct kvm_pgtable_mm_ops kvm_user_mm_ops = {
+	/* We shouldn't need any other callback to walk the PT */
+	.phys_to_virt		= kvm_host_va,
+};
+
+struct user_walk_data {
+	u32	level;
+};
+
+static int user_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
+		       enum kvm_pgtable_walk_flags flag, void * const arg)
+{
+	struct user_walk_data *data = arg;
+
+	data->level = level;
+	return 0;
+}
+
+static int get_user_mapping_size(struct kvm *kvm, u64 addr)
+{
+	struct user_walk_data data;
+	struct kvm_pgtable pgt = {
+		.pgd		= (kvm_pte_t *)kvm->mm->pgd,
+		.ia_bits	= VA_BITS,
+		.start_level	= 4 - CONFIG_PGTABLE_LEVELS,
+		.mm_ops		= &kvm_user_mm_ops,
+	};
+	struct kvm_pgtable_walker walker = {
+		.cb		= user_walker,
+		.flags		= KVM_PGTABLE_WALK_LEAF,
+		.arg		= &data,
+	};
+
+	kvm_pgtable_walk(&pgt, ALIGN_DOWN(addr, PAGE_SIZE), PAGE_SIZE, &walker);
+
+	return BIT(ARM64_HW_PGTABLE_LEVEL_SHIFT(data.level));
+}
+
 static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
 	.zalloc_page		= stage2_memcache_zalloc_page,
 	.zalloc_pages_exact	= kvm_host_zalloc_pages_exact,
@@ -780,7 +818,7 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
  * Returns the size of the mapping.
  */
 static unsigned long
-transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
+transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
 			    unsigned long hva, kvm_pfn_t *pfnp,
 			    phys_addr_t *ipap)
 {
@@ -791,8 +829,8 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
 	 * sure that the HVA and IPA are sufficiently aligned and that the
 	 * block map is contained within the memslot.
 	 */
-	if (kvm_is_transparent_hugepage(pfn) &&
-	    fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
+	if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE) &&
+	    get_user_mapping_size(kvm, hva) >= PMD_SIZE) {
 		/*
 		 * The address we faulted on is backed by a transparent huge
 		 * page.  However, because we map the compound huge page and
@@ -1051,7 +1089,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * backed by a THP and thus use block mapping if possible.
 	 */
 	if (vma_pagesize == PAGE_SIZE && !(force_pte || device))
-		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
+		vma_pagesize = transparent_hugepage_adjust(kvm, memslot, hva,
 							   &pfn, &fault_ipa);
 
 	if (fault_status != FSC_PERM && !device && kvm_has_mte(kvm)) {
-- 
2.30.2


* [PATCH 2/5] KVM: arm64: Avoid mapping size adjustment on permission fault
  2021-07-17  9:55 ` Marc Zyngier
@ 2021-07-17  9:55   ` Marc Zyngier
  -1 siblings, 0 replies; 54+ messages in thread
From: Marc Zyngier @ 2021-07-17  9:55 UTC (permalink / raw)
  To: linux-arm-kernel, kvm, kvmarm, linux-mm
  Cc: Sean Christopherson, Matthew Wilcox, Paolo Bonzini, Will Deacon,
	Quentin Perret, James Morse, Suzuki K Poulose, Alexandru Elisei,
	kernel-team

Since we only support PMD-sized mappings for THP, getting
a permission fault on a level that results in a mapping
being larger than PAGE_SIZE is a sure indication that we have
already upgraded our mapping to a PMD.

In this case, there is no need to try and parse userspace page
tables, as the fault information already tells us everything.

Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/kvm/mmu.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index db6314b93e99..c036a480ca27 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1088,9 +1088,14 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * If we are not forced to use page mapping, check if we are
 	 * backed by a THP and thus use block mapping if possible.
 	 */
-	if (vma_pagesize == PAGE_SIZE && !(force_pte || device))
-		vma_pagesize = transparent_hugepage_adjust(kvm, memslot, hva,
-							   &pfn, &fault_ipa);
+	if (vma_pagesize == PAGE_SIZE && !force_pte) {
+		if (fault_status == FSC_PERM && fault_granule > PAGE_SIZE)
+			vma_pagesize = fault_granule;
+		else
+			vma_pagesize = transparent_hugepage_adjust(kvm, memslot,
+								   hva, &pfn,
+								   &fault_ipa);
+	}
 
 	if (fault_status != FSC_PERM && !device && kvm_has_mte(kvm)) {
 		/* Check the VMM hasn't introduced a new VM_SHARED VMA */
-- 
2.30.2


* [PATCH 3/5] KVM: Remove kvm_is_transparent_hugepage() and PageTransCompoundMap()
  2021-07-17  9:55 ` Marc Zyngier
@ 2021-07-17  9:55   ` Marc Zyngier
  -1 siblings, 0 replies; 54+ messages in thread
From: Marc Zyngier @ 2021-07-17  9:55 UTC (permalink / raw)
  To: linux-arm-kernel, kvm, kvmarm, linux-mm
  Cc: Sean Christopherson, Matthew Wilcox, Paolo Bonzini, Will Deacon,
	Quentin Perret, James Morse, Suzuki K Poulose, Alexandru Elisei,
	kernel-team

Now that arm64 has stopped using kvm_is_transparent_hugepage(),
we can remove it, as well as PageTransCompoundMap() which was
only used by the former.

Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 include/linux/page-flags.h | 37 -------------------------------------
 virt/kvm/kvm_main.c        | 10 ----------
 2 files changed, 47 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5922031ffab6..1ace27c4a8e0 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -632,43 +632,6 @@ static inline int PageTransCompound(struct page *page)
 	return PageCompound(page);
 }
 
-/*
- * PageTransCompoundMap is the same as PageTransCompound, but it also
- * guarantees the primary MMU has the entire compound page mapped
- * through pmd_trans_huge, which in turn guarantees the secondary MMUs
- * can also map the entire compound page. This allows the secondary
- * MMUs to call get_user_pages() only once for each compound page and
- * to immediately map the entire compound page with a single secondary
- * MMU fault. If there will be a pmd split later, the secondary MMUs
- * will get an update through the MMU notifier invalidation through
- * split_huge_pmd().
- *
- * Unlike PageTransCompound, this is safe to be called only while
- * split_huge_pmd() cannot run from under us, like if protected by the
- * MMU notifier, otherwise it may result in page->_mapcount check false
- * positives.
- *
- * We have to treat page cache THP differently since every subpage of it
- * would get _mapcount inc'ed once it is PMD mapped.  But, it may be PTE
- * mapped in the current process so comparing subpage's _mapcount to
- * compound_mapcount to filter out PTE mapped case.
- */
-static inline int PageTransCompoundMap(struct page *page)
-{
-	struct page *head;
-
-	if (!PageTransCompound(page))
-		return 0;
-
-	if (PageAnon(page))
-		return atomic_read(&page->_mapcount) < 0;
-
-	head = compound_head(page);
-	/* File THP is PMD mapped and not PTE mapped */
-	return atomic_read(&page->_mapcount) ==
-	       atomic_read(compound_mapcount_ptr(head));
-}
-
 /*
  * PageTransTail returns true for both transparent huge pages
  * and hugetlbfs pages, so it should only be called when it's known
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7d95126cda9e..2e410a8a6a67 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -189,16 +189,6 @@ bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 	return true;
 }
 
-bool kvm_is_transparent_hugepage(kvm_pfn_t pfn)
-{
-	struct page *page = pfn_to_page(pfn);
-
-	if (!PageTransCompoundMap(page))
-		return false;
-
-	return is_transparent_hugepage(compound_head(page));
-}
-
 /*
  * Switches to specified vcpu, until a matching vcpu_put()
  */
-- 
2.30.2


* [PATCH 4/5] KVM: arm64: Use get_page() instead of kvm_get_pfn()
  2021-07-17  9:55 ` Marc Zyngier
@ 2021-07-17  9:55   ` Marc Zyngier
  -1 siblings, 0 replies; 54+ messages in thread
From: Marc Zyngier @ 2021-07-17  9:55 UTC (permalink / raw)
  To: linux-arm-kernel, kvm, kvmarm, linux-mm
  Cc: Sean Christopherson, Matthew Wilcox, Paolo Bonzini, Will Deacon,
	Quentin Perret, James Morse, Suzuki K Poulose, Alexandru Elisei,
	kernel-team

When mapping a THP, we are guaranteed that the page isn't reserved,
and we can safely avoid the kvm_is_reserved_pfn() call.

Replace kvm_get_pfn() with get_page(pfn_to_page()).

Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/kvm/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c036a480ca27..0e8dab124cbd 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -852,7 +852,7 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		*ipap &= PMD_MASK;
 		kvm_release_pfn_clean(pfn);
 		pfn &= ~(PTRS_PER_PMD - 1);
-		kvm_get_pfn(pfn);
+		get_page(pfn_to_page(pfn));
 		*pfnp = pfn;
 
 		return PMD_SIZE;
-- 
2.30.2
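
For context, the helper being replaced here (and deleted outright in patch
5 below) was nothing more than a reserved-pfn check wrapped around get_page():

	void kvm_get_pfn(kvm_pfn_t pfn)
	{
		if (!kvm_is_reserved_pfn(pfn))
			get_page(pfn_to_page(pfn));
	}

Since a THP-backed pfn is known not to be reserved, calling get_page()
directly is equivalent.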


* [PATCH 5/5] KVM: Get rid of kvm_get_pfn()
  2021-07-17  9:55 ` Marc Zyngier
@ 2021-07-17  9:55   ` Marc Zyngier
  -1 siblings, 0 replies; 54+ messages in thread
From: Marc Zyngier @ 2021-07-17  9:55 UTC (permalink / raw)
  To: linux-arm-kernel, kvm, kvmarm, linux-mm
  Cc: Sean Christopherson, Matthew Wilcox, Paolo Bonzini, Will Deacon,
	Quentin Perret, James Morse, Suzuki K Poulose, Alexandru Elisei,
	kernel-team

Nobody is using kvm_get_pfn() anymore. Get rid of it.

Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 include/linux/kvm_host.h | 1 -
 virt/kvm/kvm_main.c      | 9 +--------
 2 files changed, 1 insertion(+), 9 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ae7735b490b4..9818d271c2a1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -824,7 +824,6 @@ void kvm_release_pfn_clean(kvm_pfn_t pfn);
 void kvm_release_pfn_dirty(kvm_pfn_t pfn);
 void kvm_set_pfn_dirty(kvm_pfn_t pfn);
 void kvm_set_pfn_accessed(kvm_pfn_t pfn);
-void kvm_get_pfn(kvm_pfn_t pfn);
 
 void kvm_release_pfn(kvm_pfn_t pfn, bool dirty, struct gfn_to_pfn_cache *cache);
 int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2e410a8a6a67..0284418c4400 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2215,7 +2215,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 	 * Get a reference here because callers of *hva_to_pfn* and
 	 * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
 	 * returned pfn.  This is only needed if the VMA has VM_MIXEDMAP
-	 * set, but the kvm_get_pfn/kvm_release_pfn_clean pair will
+	 * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
 	 * simply do nothing for reserved pfns.
 	 *
 	 * Whoever called remap_pfn_range is also going to call e.g.
@@ -2612,13 +2612,6 @@ void kvm_set_pfn_accessed(kvm_pfn_t pfn)
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);
 
-void kvm_get_pfn(kvm_pfn_t pfn)
-{
-	if (!kvm_is_reserved_pfn(pfn))
-		get_page(pfn_to_page(pfn));
-}
-EXPORT_SYMBOL_GPL(kvm_get_pfn);
-
 static int next_segment(unsigned long len, int offset)
 {
 	if (len > PAGE_SIZE - offset)
-- 
2.30.2


* Re: [PATCH 1/5] KVM: arm64: Walk userspace page tables to compute the THP mapping size
  2021-07-17  9:55   ` Marc Zyngier
@ 2021-07-19  6:31     ` Paolo Bonzini
  -1 siblings, 0 replies; 54+ messages in thread
From: Paolo Bonzini @ 2021-07-19  6:31 UTC (permalink / raw)
  To: Marc Zyngier, linux-arm-kernel, kvm, kvmarm, linux-mm
  Cc: Sean Christopherson, Matthew Wilcox, Will Deacon, Quentin Perret,
	James Morse, Suzuki K Poulose, Alexandru Elisei, kernel-team

On 17/07/21 11:55, Marc Zyngier wrote:
> We currently rely on the kvm_is_transparent_hugepage() helper to
> discover whether a given page has the potential to be mapped as
> a block mapping.
> 
> However, this API doesn't really give us everything we want:
> - we don't get the size: this is not crucial today as we only
>    support PMD-sized THPs, but we'd like to have larger sizes
>    in the future
> - we're the only user left of the API, and there is a will
>    to remove it altogether
> 
> To address the above, implement a simple walker using the existing
> page table infrastructure, and plumb it into transparent_hugepage_adjust().
> No new page sizes are supported in the process.
> 
> Signed-off-by: Marc Zyngier <maz@kernel.org>

If it's okay for you to reuse the KVM page walker that's fine of course, 
but the arch/x86/mm functions lookup_address_in_{mm,pgd} are mostly 
machine-independent and it may make sense to move them to mm/.

That would also allow reusing the x86 function host_pfn_mapping_level.
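
For reference, the prototypes in question (as of this posting, so take the
exact signatures with a grain of salt) look like:

	/* declared in arch/x86/include/asm/pgtable_types.h */
	extern pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
					    unsigned int *level);
	extern pte_t *lookup_address_in_mm(struct mm_struct *mm, unsigned long address,
					   unsigned int *level);

i.e. the caller gets the leaf entry back together with a PG_LEVEL_* value,
which is essentially what host_pfn_mapping_level() builds on.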

Paolo

> ---
>   arch/arm64/kvm/mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++----
>   1 file changed, 42 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 3155c9e778f0..db6314b93e99 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -433,6 +433,44 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
>   	return 0;
>   }
>   
> +static struct kvm_pgtable_mm_ops kvm_user_mm_ops = {
> +	/* We shouldn't need any other callback to walk the PT */
> +	.phys_to_virt		= kvm_host_va,
> +};
> +
> +struct user_walk_data {
> +	u32	level;
> +};
> +
> +static int user_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> +		       enum kvm_pgtable_walk_flags flag, void * const arg)
> +{
> +	struct user_walk_data *data = arg;
> +
> +	data->level = level;
> +	return 0;
> +}
> +
> +static int get_user_mapping_size(struct kvm *kvm, u64 addr)
> +{
> +	struct user_walk_data data;
> +	struct kvm_pgtable pgt = {
> +		.pgd		= (kvm_pte_t *)kvm->mm->pgd,
> +		.ia_bits	= VA_BITS,
> +		.start_level	= 4 - CONFIG_PGTABLE_LEVELS,
> +		.mm_ops		= &kvm_user_mm_ops,
> +	};
> +	struct kvm_pgtable_walker walker = {
> +		.cb		= user_walker,
> +		.flags		= KVM_PGTABLE_WALK_LEAF,
> +		.arg		= &data,
> +	};
> +
> +	kvm_pgtable_walk(&pgt, ALIGN_DOWN(addr, PAGE_SIZE), PAGE_SIZE, &walker);
> +
> +	return BIT(ARM64_HW_PGTABLE_LEVEL_SHIFT(data.level));
> +}
> +
>   static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
>   	.zalloc_page		= stage2_memcache_zalloc_page,
>   	.zalloc_pages_exact	= kvm_host_zalloc_pages_exact,
> @@ -780,7 +818,7 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
>    * Returns the size of the mapping.
>    */
>   static unsigned long
> -transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
> +transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
>   			    unsigned long hva, kvm_pfn_t *pfnp,
>   			    phys_addr_t *ipap)
>   {
> @@ -791,8 +829,8 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
>   	 * sure that the HVA and IPA are sufficiently aligned and that the
>   	 * block map is contained within the memslot.
>   	 */
> -	if (kvm_is_transparent_hugepage(pfn) &&
> -	    fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
> +	if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE) &&
> +	    get_user_mapping_size(kvm, hva) >= PMD_SIZE) {
>   		/*
>   		 * The address we faulted on is backed by a transparent huge
>   		 * page.  However, because we map the compound huge page and
> @@ -1051,7 +1089,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>   	 * backed by a THP and thus use block mapping if possible.
>   	 */
>   	if (vma_pagesize == PAGE_SIZE && !(force_pte || device))
> -		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
> +		vma_pagesize = transparent_hugepage_adjust(kvm, memslot, hva,
>   							   &pfn, &fault_ipa);
>   
>   	if (fault_status != FSC_PERM && !device && kvm_has_mte(kvm)) {
> 


* Re: [PATCH 3/5] KVM: Remove kvm_is_transparent_hugepage() and PageTransCompoundMap()
  2021-07-17  9:55   ` Marc Zyngier
  (?)
@ 2021-07-19  6:31     ` Paolo Bonzini
  -1 siblings, 0 replies; 54+ messages in thread
From: Paolo Bonzini @ 2021-07-19  6:31 UTC (permalink / raw)
  To: Marc Zyngier, linux-arm-kernel, kvm, kvmarm, linux-mm
  Cc: Sean Christopherson, Matthew Wilcox, Will Deacon, Quentin Perret,
	James Morse, Suzuki K Poulose, Alexandru Elisei, kernel-team

On 17/07/21 11:55, Marc Zyngier wrote:
> Now that arm64 has stopped using kvm_is_transparent_hugepage(),
> we can remove it, as well as PageTransCompoundMap() which was
> only used by the former.
> 
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
>   include/linux/page-flags.h | 37 -------------------------------------
>   virt/kvm/kvm_main.c        | 10 ----------
>   2 files changed, 47 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 5922031ffab6..1ace27c4a8e0 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -632,43 +632,6 @@ static inline int PageTransCompound(struct page *page)
>   	return PageCompound(page);
>   }
>   
> -/*
> - * PageTransCompoundMap is the same as PageTransCompound, but it also
> - * guarantees the primary MMU has the entire compound page mapped
> - * through pmd_trans_huge, which in turn guarantees the secondary MMUs
> - * can also map the entire compound page. This allows the secondary
> - * MMUs to call get_user_pages() only once for each compound page and
> - * to immediately map the entire compound page with a single secondary
> - * MMU fault. If there will be a pmd split later, the secondary MMUs
> - * will get an update through the MMU notifier invalidation through
> - * split_huge_pmd().
> - *
> - * Unlike PageTransCompound, this is safe to be called only while
> - * split_huge_pmd() cannot run from under us, like if protected by the
> - * MMU notifier, otherwise it may result in page->_mapcount check false
> - * positives.
> - *
> - * We have to treat page cache THP differently since every subpage of it
> - * would get _mapcount inc'ed once it is PMD mapped.  But, it may be PTE
> - * mapped in the current process so comparing subpage's _mapcount to
> - * compound_mapcount to filter out PTE mapped case.
> - */
> -static inline int PageTransCompoundMap(struct page *page)
> -{
> -	struct page *head;
> -
> -	if (!PageTransCompound(page))
> -		return 0;
> -
> -	if (PageAnon(page))
> -		return atomic_read(&page->_mapcount) < 0;
> -
> -	head = compound_head(page);
> -	/* File THP is PMD mapped and not PTE mapped */
> -	return atomic_read(&page->_mapcount) ==
> -	       atomic_read(compound_mapcount_ptr(head));
> -}
> -
>   /*
>    * PageTransTail returns true for both transparent huge pages
>    * and hugetlbfs pages, so it should only be called when it's known
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7d95126cda9e..2e410a8a6a67 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -189,16 +189,6 @@ bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
>   	return true;
>   }
>   
> -bool kvm_is_transparent_hugepage(kvm_pfn_t pfn)
> -{
> -	struct page *page = pfn_to_page(pfn);
> -
> -	if (!PageTransCompoundMap(page))
> -		return false;
> -
> -	return is_transparent_hugepage(compound_head(page));
> -}
> -
>   /*
>    * Switches to specified vcpu, until a matching vcpu_put()
>    */
> 

Acked-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 5/5] KVM: Get rid of kvm_get_pfn()
  2021-07-17  9:55   ` Marc Zyngier
  (?)
@ 2021-07-19  6:31     ` Paolo Bonzini
  -1 siblings, 0 replies; 54+ messages in thread
From: Paolo Bonzini @ 2021-07-19  6:31 UTC (permalink / raw)
  To: Marc Zyngier, linux-arm-kernel, kvm, kvmarm, linux-mm
  Cc: Sean Christopherson, Matthew Wilcox, Will Deacon, Quentin Perret,
	James Morse, Suzuki K Poulose, Alexandru Elisei, kernel-team

On 17/07/21 11:55, Marc Zyngier wrote:
> Nobody is using kvm_get_pfn() anymore. Get rid of it.
> 
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
>   include/linux/kvm_host.h | 1 -
>   virt/kvm/kvm_main.c      | 9 +--------
>   2 files changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index ae7735b490b4..9818d271c2a1 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -824,7 +824,6 @@ void kvm_release_pfn_clean(kvm_pfn_t pfn);
>   void kvm_release_pfn_dirty(kvm_pfn_t pfn);
>   void kvm_set_pfn_dirty(kvm_pfn_t pfn);
>   void kvm_set_pfn_accessed(kvm_pfn_t pfn);
> -void kvm_get_pfn(kvm_pfn_t pfn);
>   
>   void kvm_release_pfn(kvm_pfn_t pfn, bool dirty, struct gfn_to_pfn_cache *cache);
>   int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 2e410a8a6a67..0284418c4400 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2215,7 +2215,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>   	 * Get a reference here because callers of *hva_to_pfn* and
>   	 * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
>   	 * returned pfn.  This is only needed if the VMA has VM_MIXEDMAP
> -	 * set, but the kvm_get_pfn/kvm_release_pfn_clean pair will
> +	 * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
>   	 * simply do nothing for reserved pfns.
>   	 *
>   	 * Whoever called remap_pfn_range is also going to call e.g.
> @@ -2612,13 +2612,6 @@ void kvm_set_pfn_accessed(kvm_pfn_t pfn)
>   }
>   EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);
>   
> -void kvm_get_pfn(kvm_pfn_t pfn)
> -{
> -	if (!kvm_is_reserved_pfn(pfn))
> -		get_page(pfn_to_page(pfn));
> -}
> -EXPORT_SYMBOL_GPL(kvm_get_pfn);
> -
>   static int next_segment(unsigned long len, int offset)
>   {
>   	if (len > PAGE_SIZE - offset)
> 

Acked-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/5] KVM: arm64: Walk userspace page tables to compute the THP mapping size
  2021-07-19  6:31     ` Paolo Bonzini
  (?)
@ 2021-07-19  9:31       ` Marc Zyngier
  -1 siblings, 0 replies; 54+ messages in thread
From: Marc Zyngier @ 2021-07-19  9:31 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-arm-kernel, kvm, kvmarm, linux-mm, Sean Christopherson,
	Matthew Wilcox, Will Deacon, Quentin Perret, James Morse,
	Suzuki K Poulose, Alexandru Elisei, kernel-team

On Mon, 19 Jul 2021 07:31:30 +0100,
Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 17/07/21 11:55, Marc Zyngier wrote:
> > We currently rely on the kvm_is_transparent_hugepage() helper to
> > discover whether a given page has the potential to be mapped as
> > a block mapping.
> > 
> > However, this API doesn't really give us everything we want:
> > - we don't get the size: this is not crucial today as we only
> >    support PMD-sized THPs, but we'd like to have larger sizes
> >    in the future
> > - we're the only user left of the API, and there is a will
> >    to remove it altogether
> > 
> > To address the above, implement a simple walker using the existing
> > page table infrastructure, and plumb it into transparent_hugepage_adjust().
> > No new page sizes are supported in the process.
> > 
> > Signed-off-by: Marc Zyngier <maz@kernel.org>
> 
> If it's okay for you to reuse the KVM page walker that's fine of
> course, but the arch/x86/mm functions lookup_address_in_{mm,pgd} are
> mostly machine-independent and it may make sense to move them to mm/.
>
> That would also allow reusing the x86 function host_pfn_mapping_level.

That could work to some extent, but the way the x86 code equates level
to mapping size is a bit at odds with the multiple page sizes that
arm64 deals with.
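
To illustrate (assuming the usual ARM64_HW_PGTABLE_LEVEL_SHIFT definition
in pgtable-hwdef.h, so take this as a sketch rather than gospel):

	ARM64_HW_PGTABLE_LEVEL_SHIFT(n) = (PAGE_SHIFT - 3) * (4 - n) + 3

	granule		level 3		level 2		level 1
	4K		4K		2M		1G
	16K		16K		32M		64G
	64K		64K		512M		4T

A "level" on its own means nothing until you know the granule, which is
why the walker returns a size rather than a level.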

We're also trying to move away from the whole P*D abstraction, because
this isn't something we want to deal with in the self-contained EL2
object.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/5] KVM: arm64: Walk userspace page tables to compute the THP mapping size
  2021-07-17  9:55   ` Marc Zyngier
  (?)
@ 2021-07-20 17:23     ` Alexandru Elisei
  -1 siblings, 0 replies; 54+ messages in thread
From: Alexandru Elisei @ 2021-07-20 17:23 UTC (permalink / raw)
  To: Marc Zyngier, linux-arm-kernel, kvm, kvmarm, linux-mm
  Cc: Sean Christopherson, Matthew Wilcox, Paolo Bonzini, Will Deacon,
	Quentin Perret, James Morse, Suzuki K Poulose, kernel-team

Hi Marc,

I just can't figure out why having the mmap lock is not needed to walk the
userspace page tables. Any hints? Or am I not seeing where it's taken?

On 7/17/21 10:55 AM, Marc Zyngier wrote:
> We currently rely on the kvm_is_transparent_hugepage() helper to
> discover whether a given page has the potential to be mapped as
> a block mapping.
>
> However, this API doesn't really give us everything we want:
> - we don't get the size: this is not crucial today as we only
>   support PMD-sized THPs, but we'd like to have larger sizes
>   in the future
> - we're the only user left of the API, and there is a will
>   to remove it altogether
>
> To address the above, implement a simple walker using the existing
> page table infrastructure, and plumb it into transparent_hugepage_adjust().
> No new page sizes are supported in the process.
>
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
>  arch/arm64/kvm/mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 42 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 3155c9e778f0..db6314b93e99 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -433,6 +433,44 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
>  	return 0;
>  }
>  
> +static struct kvm_pgtable_mm_ops kvm_user_mm_ops = {
> +	/* We shouldn't need any other callback to walk the PT */
> +	.phys_to_virt		= kvm_host_va,
> +};
> +
> +struct user_walk_data {
> +	u32	level;
> +};
> +
> +static int user_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> +		       enum kvm_pgtable_walk_flags flag, void * const arg)
> +{
> +	struct user_walk_data *data = arg;
> +
> +	data->level = level;
> +	return 0;
> +}
> +
> +static int get_user_mapping_size(struct kvm *kvm, u64 addr)
> +{
> +	struct user_walk_data data;
> +	struct kvm_pgtable pgt = {
> +		.pgd		= (kvm_pte_t *)kvm->mm->pgd,
> +		.ia_bits	= VA_BITS,
> +		.start_level	= 4 - CONFIG_PGTABLE_LEVELS,
> +		.mm_ops		= &kvm_user_mm_ops,
> +	};
> +	struct kvm_pgtable_walker walker = {
> +		.cb		= user_walker,
> +		.flags		= KVM_PGTABLE_WALK_LEAF,
> +		.arg		= &data,
> +	};
> +
> +	kvm_pgtable_walk(&pgt, ALIGN_DOWN(addr, PAGE_SIZE), PAGE_SIZE, &walker);

I take it that it is guaranteed that kvm_pgtable_walk() will never fail? For
example, I can see it failing if someone messes with KVM_PGTABLE_MAX_LEVELS. To be
honest, I would rather have a check here instead of potentially feeding a bogus
value to ARM64_HW_PGTABLE_LEVEL_SHIFT. It could be a VM_WARN_ON, so there's no
runtime overhead unless CONFIG_DEBUG_VM.
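
Something along these lines is what I have in mind (only a sketch on top of
your patch, the exact shape of the check doesn't matter much to me):

	struct user_walk_data data = {
		.level	= ~0,
	};
	int ret;

	[...]

	ret = kvm_pgtable_walk(&pgt, ALIGN_DOWN(addr, PAGE_SIZE), PAGE_SIZE, &walker);
	/* A failed walk or an untouched level means data.level is bogus */
	VM_WARN_ON(ret || data.level >= KVM_PGTABLE_MAX_LEVELS);

	return BIT(ARM64_HW_PGTABLE_LEVEL_SHIFT(data.level));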

The patch looks good to me so far, but I want to give it another look (or two)
after I figure out why the mmap semaphore is not needed.

Thanks,

Alex

> +
> +	return BIT(ARM64_HW_PGTABLE_LEVEL_SHIFT(data.level));
> +}
> +
>  static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
>  	.zalloc_page		= stage2_memcache_zalloc_page,
>  	.zalloc_pages_exact	= kvm_host_zalloc_pages_exact,
> @@ -780,7 +818,7 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
>   * Returns the size of the mapping.
>   */
>  static unsigned long
> -transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
> +transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
>  			    unsigned long hva, kvm_pfn_t *pfnp,
>  			    phys_addr_t *ipap)
>  {
> @@ -791,8 +829,8 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
>  	 * sure that the HVA and IPA are sufficiently aligned and that the
>  	 * block map is contained within the memslot.
>  	 */
> -	if (kvm_is_transparent_hugepage(pfn) &&
> -	    fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
> +	if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE) &&
> +	    get_user_mapping_size(kvm, hva) >= PMD_SIZE) {
>  		/*
>  		 * The address we faulted on is backed by a transparent huge
>  		 * page.  However, because we map the compound huge page and
> @@ -1051,7 +1089,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	 * backed by a THP and thus use block mapping if possible.
>  	 */
>  	if (vma_pagesize == PAGE_SIZE && !(force_pte || device))
> -		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
> +		vma_pagesize = transparent_hugepage_adjust(kvm, memslot, hva,
>  							   &pfn, &fault_ipa);
>  
>  	if (fault_status != FSC_PERM && !device && kvm_has_mte(kvm)) {

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/5] KVM: arm64: Walk userspace page tables to compute the THP mapping size
  2021-07-20 17:23     ` Alexandru Elisei
  (?)
@ 2021-07-20 20:33       ` Sean Christopherson
  -1 siblings, 0 replies; 54+ messages in thread
From: Sean Christopherson @ 2021-07-20 20:33 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: Marc Zyngier, linux-arm-kernel, kvm, kvmarm, linux-mm,
	Matthew Wilcox, Paolo Bonzini, Will Deacon, Quentin Perret,
	James Morse, Suzuki K Poulose, kernel-team

On Tue, Jul 20, 2021, Alexandru Elisei wrote:
> Hi Marc,
> 
> I just can't figure out why having the mmap lock is not needed to walk the
> userspace page tables. Any hints? Or am I not seeing where it's taken?

Disclaimer: I'm not super familiar with arm64's page tables, but the relevant KVM
functionality is common across x86 and arm64.

KVM arm64 (and x86) unconditionally registers a mmu_notifier for the mm_struct
associated with the VM, and disallows calling ioctls from a different process,
i.e. walking the page tables during KVM_RUN is guaranteed to use the mm for which
KVM registered the mmu_notifier.  As part of registration, the mmu_notifier
does mmgrab() and doesn't do mmdrop() until it's unregistered.  That ensures the
mm_struct itself is live.

For the page tables liveliness, KVM implements mmu_notifier_ops.release, which is
invoked at the beginning of exit_mmap(), before the page tables are freed.  In
its implementation, KVM takes mmu_lock and zaps all its shadow page tables, a.k.a.
the stage2 tables in KVM arm64.  The flow in question, get_user_mapping_size(),
also runs under mmu_lock, and so effectively blocks exit_mmap() and thus is
guaranteed to run with live userspace tables.

Lastly, KVM also implements mmu_notifier_ops.invalidate_range_{start,end}.  KVM's
invalidate_range implementations also take mmu_lock, and also update a sequence
counter and a flag stating that there's an invalidation in progress.  When
installing a stage2 entry, KVM snapshots the sequence counter before taking
mmu_lock, and then checks it again after acquiring mmu_lock.  If the counter
mismatches, or an invalidation is in-progress, then KVM bails and resumes the
guest without fixing the fault.
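
In code, the fault-handler side of that dance looks roughly like this
(heavily trimmed from user_mem_abort(), so it's a sketch of the pattern,
not the exact code):

	mmu_seq = vcpu->kvm->mmu_notifier_seq;
	smp_rmb();	/* order the snapshot before walking/faulting host PTs */

	pfn = __gfn_to_pfn_memslot(memslot, gfn, ...);	/* may fault pages in */

	spin_lock(&kvm->mmu_lock);
	if (mmu_notifier_retry(kvm, mmu_seq))	/* raced with an invalidation? */
		goto out_unlock;
	/* ... safe to install the stage2 entry ... */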

E.g. if the host zaps userspace page tables and KVM "wins" the race, the subsequent
kvm_mmu_notifier_invalidate_range_start() will zap the recently installed stage2
entries.  And if the host zap "wins" the race, KVM will resume the guest, which
in normal operation will hit the exception again and go back through the entire
process of installing stage2 entries.

Looking at the arm64 code, one thing I'm not clear on is whether arm64 correctly
handles the case where exit_mmap() wins the race.  The invalidate_range hooks will
still be called, so userspace page tables aren't a problem, but
kvm_arch_flush_shadow_all() -> kvm_free_stage2_pgd() nullifies mmu->pgt without
any additional notifications that I see.  x86 deals with this by ensuring its
top-level TDP entry (stage2 equivalent) is valid while the page fault handler is
running.

  void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
  {
	struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
	struct kvm_pgtable *pgt = NULL;

	spin_lock(&kvm->mmu_lock);
	pgt = mmu->pgt;
	if (pgt) {
		mmu->pgd_phys = 0;
		mmu->pgt = NULL;
		free_percpu(mmu->last_vcpu_ran);
	}
	spin_unlock(&kvm->mmu_lock);

	...
  }

AFAICT, nothing in user_mem_abort() would prevent consuming that null mmu->pgt
if exit_mmap() collided with user_mem_abort().

  static int user_mem_abort(...)
  {

	...

	spin_lock(&kvm->mmu_lock);
	pgt = vcpu->arch.hw_mmu->pgt;         <-- hw_mmu->pgt may be NULL (hw_mmu points at vcpu->kvm->arch.mmu)
	if (mmu_notifier_retry(kvm, mmu_seq)) <-- mmu_seq not guaranteed to change
		goto out_unlock;

	...

	if (fault_status == FSC_PERM && vma_pagesize == fault_granule) {
		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
	} else {
		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
					     __pfn_to_phys(pfn), prot,
					     memcache);
	}
  }

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/5] KVM: arm64: Walk userspace page tables to compute the THP mapping size
  2021-07-20 20:33       ` Sean Christopherson
  (?)
@ 2021-07-21 14:58         ` Will Deacon
  -1 siblings, 0 replies; 54+ messages in thread
From: Will Deacon @ 2021-07-21 14:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Alexandru Elisei, Marc Zyngier, linux-arm-kernel, kvm, kvmarm,
	linux-mm, Matthew Wilcox, Paolo Bonzini, Quentin Perret,
	James Morse, Suzuki K Poulose, kernel-team

Hey Sean,

On Tue, Jul 20, 2021 at 08:33:46PM +0000, Sean Christopherson wrote:
> On Tue, Jul 20, 2021, Alexandru Elisei wrote:
> > I just can't figure out why having the mmap lock is not needed to walk the
> > userspace page tables. Any hints? Or am I not seeing where it's taken?
> 
> Disclaimer: I'm not super familiar with arm64's page tables, but the relevant KVM
> functionality is common across x86 and arm64.

No need for the disclaimer, there are so many moving parts here that I don't
think it's possible to be familiar with them all! Thanks for taking the time
to write it up so clearly.

> KVM arm64 (and x86) unconditionally registers a mmu_notifier for the mm_struct
> associated with the VM, and disallows calling ioctls from a different process,
> i.e. walking the page tables during KVM_RUN is guaranteed to use the mm for which
> KVM registered the mmu_notifier.  As part of registration, the mmu_notifier
> does mmgrab() and doesn't do mmdrop() until it's unregistered.  That ensures the
> mm_struct itself is live.
> 
> For the page tables liveliness, KVM implements mmu_notifier_ops.release, which is
> invoked at the beginning of exit_mmap(), before the page tables are freed.  In
> its implementation, KVM takes mmu_lock and zaps all its shadow page tables, a.k.a.
> the stage2 tables in KVM arm64.  The flow in question, get_user_mapping_size(),
> also runs under mmu_lock, and so effectively blocks exit_mmap() and thus is
> guaranteed to run with live userspace tables.

Unless I missed a case, exit_mmap() only runs when mm_struct::mm_users drops
to zero, right? The vCPU tasks should hold references to that afaict, so I
don't think it should be possible for exit_mmap() to run while there are
vCPUs running with the corresponding page-table.
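
As a sketch of the refcounting distinction in play here (mmgrab()/mmdrop() and
mmget_not_zero()/mmput() are the real mm helpers; example_walk() is made up):

  /*
   * mmgrab()/mmdrop() pin mm->mm_count: the mm_struct stays allocated,
   * but exit_mmap() can still run and free the page tables.
   *
   * mmget()/mmput() pin mm->mm_users: exit_mmap() (run from __mmput()
   * once mm_users hits zero) cannot start while the reference is held.
   * A vCPU thread in KVM_RUN has this mm as current->mm, which keeps
   * mm_users elevated for KVM_RUN's duration.
   */
  static void example_walk(struct mm_struct *mm)
  {
	if (!mmget_not_zero(mm))	/* mm_users already zero */
		return;
	/* ... userspace page tables are guaranteed to be live here ... */
	mmput(mm);
  }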

> Looking at the arm64 code, one thing I'm not clear on is whether arm64 correctly
> handles the case where exit_mmap() wins the race.  The invalidate_range hooks will
> still be called, so userspace page tables aren't a problem, but
> kvm_arch_flush_shadow_all() -> kvm_free_stage2_pgd() nullifies mmu->pgt without
> any additional notifications that I see.  x86 deals with this by ensuring its
> top-level TDP entry (stage2 equivalent) is valid while the page fault handler is
> running.

But the fact that x86 handles this race has me worried. What am I missing?

I agree that, if the race can occur, we don't appear to handle it in the
arm64 backend.

Cheers,

Will

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/5] KVM: arm64: Walk userspace page tables to compute the THP mapping size
  2021-07-21 14:58         ` Will Deacon
  (?)
@ 2021-07-21 15:56           ` Sean Christopherson
  -1 siblings, 0 replies; 54+ messages in thread
From: Sean Christopherson @ 2021-07-21 15:56 UTC (permalink / raw)
  To: Will Deacon
  Cc: Alexandru Elisei, Marc Zyngier, linux-arm-kernel, kvm, kvmarm,
	linux-mm, Matthew Wilcox, Paolo Bonzini, Quentin Perret,
	James Morse, Suzuki K Poulose, kernel-team

On Wed, Jul 21, 2021, Will Deacon wrote:
> > For the page tables liveliness, KVM implements mmu_notifier_ops.release, which is
> > invoked at the beginning of exit_mmap(), before the page tables are freed.  In
> > its implementation, KVM takes mmu_lock and zaps all its shadow page tables, a.k.a.
> > the stage2 tables in KVM arm64.  The flow in question, get_user_mapping_size(),
> > also runs under mmu_lock, and so effectively blocks exit_mmap() and thus is
> > guaranteed to run with live userspace tables.
> 
> Unless I missed a case, exit_mmap() only runs when mm_struct::mm_users drops
> to zero, right?

Yep.

> The vCPU tasks should hold references to that afaict, so I don't think it
> should be possible for exit_mmap() to run while there are vCPUs running with
> the corresponding page-table.

Ah, right, I was thinking of non-KVM code that operated on the page tables without
holding a reference to mm_users.

> > Looking at the arm64 code, one thing I'm not clear on is whether arm64 correctly
> > handles the case where exit_mmap() wins the race.  The invalidate_range hooks will
> > still be called, so userspace page tables aren't a problem, but
> > kvm_arch_flush_shadow_all() -> kvm_free_stage2_pgd() nullifies mmu->pgt without
> > any additional notifications that I see.  x86 deals with this by ensuring its
> > top-level TDP entry (stage2 equivalent) is valid while the page fault handler is
> > running.
> 
> But the fact that x86 handles this race has me worried. What am I missing?

I don't think you're missing anything.  I forgot that KVM_RUN would require an
elevated mm_users.  x86 does handle the impossible race, but that's coincidental.
The extra protections in x86 are to deal with other cases where a vCPU's top-level
SPTE can be invalidated while the vCPU is running.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/5] KVM: arm64: Walk userspace page tables to compute the THP mapping size
  2021-07-20 20:33       ` Sean Christopherson
  (?)
@ 2021-07-21 16:37         ` Alexandru Elisei
  -1 siblings, 0 replies; 54+ messages in thread
From: Alexandru Elisei @ 2021-07-21 16:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, linux-arm-kernel, kvm, kvmarm, linux-mm,
	Matthew Wilcox, Paolo Bonzini, Will Deacon, Quentin Perret,
	James Morse, Suzuki K Poulose, kernel-team

Hi Sean,

Thank you for writing this, it explains exactly what I wanted to know.

On 7/20/21 9:33 PM, Sean Christopherson wrote:
> On Tue, Jul 20, 2021, Alexandru Elisei wrote:
>> Hi Marc,
>>
>> I just can't figure out why having the mmap lock is not needed to walk the
>> userspace page tables. Any hints? Or am I not seeing where it's taken?
> Disclaimer: I'm not super familiar with arm64's page tables, but the relevant KVM
> functionality is common across x86 and arm64.
>
> KVM arm64 (and x86) unconditionally registers a mmu_notifier for the mm_struct
> associated with the VM, and disallows calling ioctls from a different process,
> i.e. walking the page tables during KVM_RUN is guaranteed to use the mm for which
> KVM registered the mmu_notifier.  As part of registration, the mmu_notifier
> does mmgrab() and doesn't do mmdrop() until it's unregistered.  That ensures the
> mm_struct itself is live.
>
> For the page tables liveliness, KVM implements mmu_notifier_ops.release, which is
> invoked at the beginning of exit_mmap(), before the page tables are freed.  In
> its implementation, KVM takes mmu_lock and zaps all its shadow page tables, a.k.a.
> the stage2 tables in KVM arm64.  The flow in question, get_user_mapping_size(),
> also runs under mmu_lock, and so effectively blocks exit_mmap() and thus is
> guaranteed to run with live userspace tables.
>
> Lastly, KVM also implements mmu_notifier_ops.invalidate_range_{start,end}.  KVM's
> invalidate_range implementations also take mmu_lock, and also update a sequence
> counter and a flag stating that there's an invalidation in progress.  When
> installing a stage2 entry, KVM snapshots the sequence counter before taking
> mmu_lock, and then checks it again after acquiring mmu_lock.  If the counter
> mismatches, or an invalidation is in-progress, then KVM bails and resumes the
> guest without fixing the fault.
>
> E.g. if the host zaps userspace page tables and KVM "wins" the race, the subsequent
> kvm_mmu_notifier_invalidate_range_start() will zap the recently installed stage2
> entries.  And if the host zap "wins" the race, KVM will resume the guest, which
> in normal operation will hit the exception again and go back through the entire
> process of installing stage2 entries.
>
> Looking at the arm64 code, one thing I'm not clear on is whether arm64 correctly
> handles the case where exit_mmap() wins the race.  The invalidate_range hooks will
> still be called, so userspace page tables aren't a problem, but
> kvm_arch_flush_shadow_all() -> kvm_free_stage2_pgd() nullifies mmu->pgt without
> any additional notifications that I see.  x86 deals with this by ensuring its
> top-level TDP entry (stage2 equivalent) is valid while the page fault handler is
> running.
>
>   void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>   {
> 	struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
> 	struct kvm_pgtable *pgt = NULL;
>
> 	spin_lock(&kvm->mmu_lock);
> 	pgt = mmu->pgt;
> 	if (pgt) {
> 		mmu->pgd_phys = 0;
> 		mmu->pgt = NULL;
> 		free_percpu(mmu->last_vcpu_ran);
> 	}
> 	spin_unlock(&kvm->mmu_lock);
>
> 	...
>   }
>
> AFAICT, nothing in user_mem_abort() would prevent consuming that null mmu->pgt
> if exit_mmap() collided with user_mem_abort().
>
>   static int user_mem_abort(...)
>   {
>
> 	...
>
> 	spin_lock(&kvm->mmu_lock);
> 	pgt = vcpu->arch.hw_mmu->pgt;         <-- hw_mmu->pgt may be NULL (hw_mmu points at vcpu->kvm->arch.mmu)
> 	if (mmu_notifier_retry(kvm, mmu_seq)) <-- mmu_seq not guaranteed to change
> 		goto out_unlock;
>
> 	...
>
> 	if (fault_status == FSC_PERM && vma_pagesize == fault_granule) {
> 		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
> 	} else {
> 		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
> 					     __pfn_to_phys(pfn), prot,
> 					     memcache);
> 	}
>   }

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/5] KVM: arm64: Walk userspace page tables to compute the THP mapping size
  2021-07-20 17:23     ` Alexandru Elisei
  (?)
@ 2021-07-23  8:48       ` Marc Zyngier
  -1 siblings, 0 replies; 54+ messages in thread
From: Marc Zyngier @ 2021-07-23  8:48 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: linux-arm-kernel, kvm, kvmarm, linux-mm, Sean Christopherson,
	Matthew Wilcox, Paolo Bonzini, Will Deacon, Quentin Perret,
	James Morse, Suzuki K Poulose, kernel-team

On Tue, 20 Jul 2021 18:23:02 +0100,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Hi Marc,
> 
> I just can't figure out why having the mmap lock is not needed to walk the
> userspace page tables. Any hints? Or am I not seeing where it's taken?

I trust Sean's explanation was complete enough!

> On 7/17/21 10:55 AM, Marc Zyngier wrote:
> > We currently rely on the kvm_is_transparent_hugepage() helper to
> > discover whether a given page has the potential to be mapped as
> > a block mapping.
> >
> > However, this API doesn't really give us everything we want:
> > - we don't get the size: this is not crucial today as we only
> >   support PMD-sized THPs, but we'd like to have larger sizes
> >   in the future
> > - we're the only user left of the API, and there is a will
> >   to remove it altogether
> >
> > To address the above, implement a simple walker using the existing
> > page table infrastructure, and plumb it into transparent_hugepage_adjust().
> > No new page sizes are supported in the process.
> >
> > Signed-off-by: Marc Zyngier <maz@kernel.org>
> > ---
> >  arch/arm64/kvm/mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 42 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 3155c9e778f0..db6314b93e99 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -433,6 +433,44 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
> >  	return 0;
> >  }
> >  
> > +static struct kvm_pgtable_mm_ops kvm_user_mm_ops = {
> > +	/* We shouldn't need any other callback to walk the PT */
> > +	.phys_to_virt		= kvm_host_va,
> > +};
> > +
> > +struct user_walk_data {
> > +	u32	level;
> > +};
> > +
> > +static int user_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> > +		       enum kvm_pgtable_walk_flags flag, void * const arg)
> > +{
> > +	struct user_walk_data *data = arg;
> > +
> > +	data->level = level;
> > +	return 0;
> > +}
> > +
> > +static int get_user_mapping_size(struct kvm *kvm, u64 addr)
> > +{
> > +	struct user_walk_data data;
> > +	struct kvm_pgtable pgt = {
> > +		.pgd		= (kvm_pte_t *)kvm->mm->pgd,
> > +		.ia_bits	= VA_BITS,
> > +		.start_level	= 4 - CONFIG_PGTABLE_LEVELS,
> > +		.mm_ops		= &kvm_user_mm_ops,
> > +	};
> > +	struct kvm_pgtable_walker walker = {
> > +		.cb		= user_walker,
> > +		.flags		= KVM_PGTABLE_WALK_LEAF,
> > +		.arg		= &data,
> > +	};
> > +
> > +	kvm_pgtable_walk(&pgt, ALIGN_DOWN(addr, PAGE_SIZE), PAGE_SIZE, &walker);
> 
> I take it that it is guaranteed that kvm_pgtable_walk() will never
> fail? For example, I can see it failing if someone messes with
> KVM_PGTABLE_MAX_LEVELS.

But that's an architectural constant. How could it be messed with?
When we introduce 5 levels of page tables, we'll have to check all
this anyway.

> To be honest, I would rather have a check here instead of
> potentially feeding a bogus value to ARM64_HW_PGTABLE_LEVEL_SHIFT.
> It could be a VM_WARN_ON, so there's no runtime overhead unless
> CONFIG_DEBUG_VM.

Fair enough. That's easy enough to check.
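
FWIW, the minimal form of that check, on top of the hunk quoted above, would
be something like (untested):

	int ret;

	ret = kvm_pgtable_walk(&pgt, ALIGN_DOWN(addr, PAGE_SIZE), PAGE_SIZE,
			       &walker);
	VM_WARN_ON(ret);	/* only has a runtime cost with CONFIG_DEBUG_VM */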

> The patch looks good to me so far, but I want to give it another
> look (or two) after I figure out why the mmap semaphore is not
> needed.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/5] KVM: arm64: Avoid mapping size adjustment on permission fault
  2021-07-17  9:55   ` Marc Zyngier
  (?)
@ 2021-07-23 15:55     ` Alexandru Elisei
  -1 siblings, 0 replies; 54+ messages in thread
From: Alexandru Elisei @ 2021-07-23 15:55 UTC (permalink / raw)
  To: Marc Zyngier, linux-arm-kernel, kvm, kvmarm, linux-mm
  Cc: Sean Christopherson, Matthew Wilcox, Paolo Bonzini, Will Deacon,
	Quentin Perret, James Morse, Suzuki K Poulose, kernel-team

Hi Marc,

On 7/17/21 10:55 AM, Marc Zyngier wrote:
> Since we only support PMD-sized mappings for THP, getting
> a permission fault on a level that results in a mapping
> being larger than PAGE_SIZE is a sure indication that we have
> already upgraded our mapping to a PMD.
>
> In this case, there is no need to try and parse userspace page
> tables, as the fault information already tells us everything.
>
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
>  arch/arm64/kvm/mmu.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index db6314b93e99..c036a480ca27 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1088,9 +1088,14 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	 * If we are not forced to use page mapping, check if we are
>  	 * backed by a THP and thus use block mapping if possible.
>  	 */
> -	if (vma_pagesize == PAGE_SIZE && !(force_pte || device))
> -		vma_pagesize = transparent_hugepage_adjust(kvm, memslot, hva,
> -							   &pfn, &fault_ipa);
> +	if (vma_pagesize == PAGE_SIZE && !force_pte) {

Looks like now it's possible to call transparent_hugepage_adjust() for devices (if
fault_status != FSC_PERM). Commit 2aa53d68cee6 ("KVM: arm64: Try stage2 block
mapping for host device MMIO") makes a good case for the !device check. Was the
check removed unintentionally?

> +		if (fault_status == FSC_PERM && fault_granule > PAGE_SIZE)
> +			vma_pagesize = fault_granule;
> +		else
> +			vma_pagesize = transparent_hugepage_adjust(kvm, memslot,
> +								   hva, &pfn,
> +								   &fault_ipa);
> +	}

This change makes sense to me - we can only get stage 2 permission faults on a
leaf entry since stage 2 tables don't have the APTable/XNTable/PXNTable bits. The
biggest block mapping size that we support at stage 2 is PMD size (from
transparent_hugepage_adjust()), therefore if fault_granule is larger than
PAGE_SIZE, then it must be PMD_SIZE.
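
Or, stated as a hypothetical assertion rather than prose (not something the
patch needs to add):

	/*
	 * A stage 2 permission fault is always on a leaf entry, and the
	 * largest block transparent_hugepage_adjust() installs is a PMD.
	 */
	if (fault_status == FSC_PERM && fault_granule > PAGE_SIZE)
		VM_WARN_ON(fault_granule != PMD_SIZE);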

Thanks,

Alex

>  
>  	if (fault_status != FSC_PERM && !device && kvm_has_mte(kvm)) {
>  		/* Check the VMM hasn't introduced a new VM_SHARED VMA */

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/5] KVM: arm64: Avoid mapping size adjustment on permission fault
  2021-07-23 15:55     ` Alexandru Elisei
  (?)
@ 2021-07-23 16:18       ` Marc Zyngier
  -1 siblings, 0 replies; 54+ messages in thread
From: Marc Zyngier @ 2021-07-23 16:18 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: linux-arm-kernel, kvm, kvmarm, linux-mm, Sean Christopherson,
	Matthew Wilcox, Paolo Bonzini, Will Deacon, Quentin Perret,
	James Morse, Suzuki K Poulose, kernel-team

On Fri, 23 Jul 2021 16:55:39 +0100,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Hi Marc,
> 
> On 7/17/21 10:55 AM, Marc Zyngier wrote:
> > Since we only support PMD-sized mappings for THP, getting
> > a permission fault on a level that results in a mapping
> > being larger than PAGE_SIZE is a sure indication that we have
> > already upgraded our mapping to a PMD.
> >
> > In this case, there is no need to try and parse userspace page
> > tables, as the fault information already tells us everything.
> >
> > Signed-off-by: Marc Zyngier <maz@kernel.org>
> > ---
> >  arch/arm64/kvm/mmu.c | 11 ++++++++---
> >  1 file changed, 8 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index db6314b93e99..c036a480ca27 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1088,9 +1088,14 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >  	 * If we are not forced to use page mapping, check if we are
> >  	 * backed by a THP and thus use block mapping if possible.
> >  	 */
> > -	if (vma_pagesize == PAGE_SIZE && !(force_pte || device))
> > -		vma_pagesize = transparent_hugepage_adjust(kvm, memslot, hva,
> > -							   &pfn, &fault_ipa);
> > +	if (vma_pagesize == PAGE_SIZE && !force_pte) {
> 
> Looks like now it's possible to call transparent_hugepage_adjust()
> for devices (if fault_status != FSC_PERM). Commit 2aa53d68cee6
> ("KVM: arm64: Try stage2 block mapping for host device MMIO") makes
> a good case for the !device check. Was the check removed
> unintentionally?

That's what stupid bugs are made of. I must have resolved a rebase
conflict the wrong way, and lost this crucial bit. Thanks for spotting
this! Now fixed.
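
(Guessing at the respin: presumably it just restores the !device check in the
new condition, along the lines of the following, untested:)

	if (vma_pagesize == PAGE_SIZE && !(force_pte || device)) {
		if (fault_status == FSC_PERM && fault_granule > PAGE_SIZE)
			vma_pagesize = fault_granule;
		else
			vma_pagesize = transparent_hugepage_adjust(kvm, memslot,
								   hva, &pfn,
								   &fault_ipa);
	}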

> 
> > +		if (fault_status == FSC_PERM && fault_granule > PAGE_SIZE)
> > +			vma_pagesize = fault_granule;
> > +		else
> > +			vma_pagesize = transparent_hugepage_adjust(kvm, memslot,
> > +								   hva, &pfn,
> > +								   &fault_ipa);
> > +	}
> 
> This change makes sense to me - we can only get stage 2 permission
> faults on a leaf entry since stage 2 tables don't have the
> APTable/XNTable/PXNTable bits. The biggest block mapping size that
> we support at stage 2 is PMD size (from
> transparent_hugepage_adjust()), therefore if fault_granule is larger
> than PAGE_SIZE, then it must be PMD_SIZE.

Yup, exactly.

Thanks again,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2021-07-23 16:19 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-17  9:55 [PATCH 0/5] KVM: Remove kvm_is_transparent_hugepage() and friends Marc Zyngier
2021-07-17  9:55 ` Marc Zyngier
2021-07-17  9:55 ` Marc Zyngier
2021-07-17  9:55 ` [PATCH 1/5] KVM: arm64: Walk userspace page tables to compute the THP mapping size Marc Zyngier
2021-07-17  9:55   ` Marc Zyngier
2021-07-17  9:55   ` Marc Zyngier
2021-07-19  6:31   ` Paolo Bonzini
2021-07-19  6:31     ` Paolo Bonzini
2021-07-19  6:31     ` Paolo Bonzini
2021-07-19  9:31     ` Marc Zyngier
2021-07-19  9:31       ` Marc Zyngier
2021-07-19  9:31       ` Marc Zyngier
2021-07-20 17:23   ` Alexandru Elisei
2021-07-20 17:23     ` Alexandru Elisei
2021-07-20 17:23     ` Alexandru Elisei
2021-07-20 20:33     ` Sean Christopherson
2021-07-20 20:33       ` Sean Christopherson
2021-07-20 20:33       ` Sean Christopherson
2021-07-21 14:58       ` Will Deacon
2021-07-21 14:58         ` Will Deacon
2021-07-21 14:58         ` Will Deacon
2021-07-21 15:56         ` Sean Christopherson
2021-07-21 15:56           ` Sean Christopherson
2021-07-21 15:56           ` Sean Christopherson
2021-07-21 16:37       ` Alexandru Elisei
2021-07-21 16:37         ` Alexandru Elisei
2021-07-21 16:37         ` Alexandru Elisei
2021-07-23  8:48     ` Marc Zyngier
2021-07-23  8:48       ` Marc Zyngier
2021-07-23  8:48       ` Marc Zyngier
2021-07-17  9:55 ` [PATCH 2/5] KVM: arm64: Avoid mapping size adjustment on permission fault Marc Zyngier
2021-07-17  9:55   ` Marc Zyngier
2021-07-17  9:55   ` Marc Zyngier
2021-07-23 15:55   ` Alexandru Elisei
2021-07-23 15:55     ` Alexandru Elisei
2021-07-23 15:55     ` Alexandru Elisei
2021-07-23 16:18     ` Marc Zyngier
2021-07-23 16:18       ` Marc Zyngier
2021-07-23 16:18       ` Marc Zyngier
2021-07-17  9:55 ` [PATCH 3/5] KVM: Remove kvm_is_transparent_hugepage() and PageTransCompoundMap() Marc Zyngier
2021-07-17  9:55   ` Marc Zyngier
2021-07-17  9:55   ` Marc Zyngier
2021-07-19  6:31   ` Paolo Bonzini
2021-07-19  6:31     ` Paolo Bonzini
2021-07-19  6:31     ` Paolo Bonzini
2021-07-17  9:55 ` [PATCH 4/5] KVM: arm64: Use get_page() instead of kvm_get_pfn() Marc Zyngier
2021-07-17  9:55   ` Marc Zyngier
2021-07-17  9:55   ` Marc Zyngier
2021-07-17  9:55 ` [PATCH 5/5] KVM: Get rid " Marc Zyngier
2021-07-17  9:55   ` Marc Zyngier
2021-07-17  9:55   ` Marc Zyngier
2021-07-19  6:31   ` Paolo Bonzini
2021-07-19  6:31     ` Paolo Bonzini
2021-07-19  6:31     ` Paolo Bonzini

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.