* [PATCH v2] KVM: arm64: Limit stage2_apply_range() batch size to 1GB
@ 2022-09-26 22:21 Oliver Upton
  2022-09-27 11:34 ` Marc Zyngier
  0 siblings, 1 reply; 4+ messages in thread
From: Oliver Upton @ 2022-09-26 22:21 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Marc Zyngier, James Morse,
	Alexandru Elisei, Suzuki K Poulose, Oliver Upton
  Cc: Ricardo Koller, linux-arm-kernel, linux-kernel, kvmarm

Presently stage2_apply_range() works on a batch of memory addressed by a
stage 2 root table entry for the VM. Depending on the IPA limit of the
VM and PAGE_SIZE of the host, this could address a massive range of
memory. Some examples:

  4 level, 4K paging -> 512 GB batch size

  3 level, 64K paging -> 4TB batch size
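
  (The batch is the size mapped by a single entry at the root level of
   the stage-2 table, i.e. 1 << ((PAGE_SHIFT - 3) * levels + 3), which
   works out to 2^39 = 512GB for 4-level 4K and 2^42 = 4TB for 3-level
   64K.)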

Unsurprisingly, working on such a large range of memory can lead to soft
lockups. When running dirty_log_perf_test:

  ./dirty_log_perf_test -m -2 -s anonymous_thp -b 4G -v 48

  watchdog: BUG: soft lockup - CPU#0 stuck for 45s! [dirty_log_perf_:16703]
  Modules linked in: vfat fat cdc_ether usbnet mii xhci_pci xhci_hcd sha3_generic gq(O)
  CPU: 0 PID: 16703 Comm: dirty_log_perf_ Tainted: G           O       6.0.0-smp-DEV #1
  pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  pc : dcache_clean_inval_poc+0x24/0x38
  lr : clean_dcache_guest_page+0x28/0x4c
  sp : ffff800021763990
  pmr_save: 000000e0
  x29: ffff800021763990 x28: 0000000000000005 x27: 0000000000000de0
  x26: 0000000000000001 x25: 00400830b13bc77f x24: ffffad4f91ead9c0
  x23: 0000000000000000 x22: ffff8000082ad9c8 x21: 0000fffafa7bc000
  x20: ffffad4f9066ce50 x19: 0000000000000003 x18: ffffad4f92402000
  x17: 000000000000011b x16: 000000000000011b x15: 0000000000000124
  x14: ffff07ff8301d280 x13: 0000000000000000 x12: 00000000ffffffff
  x11: 0000000000010001 x10: fffffc0000000000 x9 : ffffad4f9069e580
  x8 : 000000000000000c x7 : 0000000000000000 x6 : 000000000000003f
  x5 : ffff07ffa2076980 x4 : 0000000000000001 x3 : 000000000000003f
  x2 : 0000000000000040 x1 : ffff0830313bd000 x0 : ffff0830313bcc40
  Call trace:
   dcache_clean_inval_poc+0x24/0x38
   stage2_unmap_walker+0x138/0x1ec
   __kvm_pgtable_walk+0x130/0x1d4
   __kvm_pgtable_walk+0x170/0x1d4
   __kvm_pgtable_walk+0x170/0x1d4
   __kvm_pgtable_walk+0x170/0x1d4
   kvm_pgtable_stage2_unmap+0xc4/0xf8
   kvm_arch_flush_shadow_memslot+0xa4/0x10c
   kvm_set_memslot+0xb8/0x454
   __kvm_set_memory_region+0x194/0x244
   kvm_vm_ioctl_set_memory_region+0x58/0x7c
   kvm_vm_ioctl+0x49c/0x560
   __arm64_sys_ioctl+0x9c/0xd4
   invoke_syscall+0x4c/0x124
   el0_svc_common+0xc8/0x194
   do_el0_svc+0x38/0xc0
   el0_svc+0x2c/0xa4
   el0t_64_sync_handler+0x84/0xf0
   el0t_64_sync+0x1a0/0x1a4

Given the various paging configurations used by KVM at stage 2 there
isn't a sensible page table level to use as the batch size. Use 1GB as
the batch size instead, as it is evenly divisible by all supported
hugepage sizes across 4K, 16K, and 64K paging (2MB and 1GB blocks with
4K pages, 32MB with 16K, and 512MB with 64K).

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
---

Applies to 6.0-rc3. Tested with 4K, 16K, and 64K pages with the above
dirty_log_perf_test command and noticed no more soft lockups.

v1: https://lore.kernel.org/kvmarm/20220920183630.3376939-1-oliver.upton@linux.dev/

v1 -> v2:
 - Align down to the next 1GB boundary (Ricardo)

 arch/arm64/include/asm/stage2_pgtable.h | 20 --------------------
 arch/arm64/kvm/mmu.c                    |  8 +++++++-
 2 files changed, 7 insertions(+), 21 deletions(-)

diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
index fe341a6578c3..c8dca8ae359c 100644
--- a/arch/arm64/include/asm/stage2_pgtable.h
+++ b/arch/arm64/include/asm/stage2_pgtable.h
@@ -10,13 +10,6 @@
 
 #include <linux/pgtable.h>
 
-/*
- * PGDIR_SHIFT determines the size a top-level page table entry can map
- * and depends on the number of levels in the page table. Compute the
- * PGDIR_SHIFT for a given number of levels.
- */
-#define pt_levels_pgdir_shift(lvls)	ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - (lvls))
-
 /*
  * The hardware supports concatenation of up to 16 tables at stage2 entry
  * level and we use the feature whenever possible, which means we resolve 4
@@ -30,11 +23,6 @@
 #define stage2_pgtable_levels(ipa)	ARM64_HW_PGTABLE_LEVELS((ipa) - 4)
 #define kvm_stage2_levels(kvm)		VTCR_EL2_LVLS(kvm->arch.vtcr)
 
-/* stage2_pgdir_shift() is the size mapped by top-level stage2 entry for the VM */
-#define stage2_pgdir_shift(kvm)		pt_levels_pgdir_shift(kvm_stage2_levels(kvm))
-#define stage2_pgdir_size(kvm)		(1ULL << stage2_pgdir_shift(kvm))
-#define stage2_pgdir_mask(kvm)		~(stage2_pgdir_size(kvm) - 1)
-
 /*
  * kvm_mmmu_cache_min_pages() is the number of pages required to install
  * a stage-2 translation. We pre-allocate the entry level page table at
@@ -42,12 +30,4 @@
  */
 #define kvm_mmu_cache_min_pages(kvm)	(kvm_stage2_levels(kvm) - 1)
 
-static inline phys_addr_t
-stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
-{
-	phys_addr_t boundary = (addr + stage2_pgdir_size(kvm)) & stage2_pgdir_mask(kvm);
-
-	return (boundary - 1 < end - 1) ? boundary : end;
-}
-
 #endif	/* __ARM64_S2_PGTABLE_H_ */
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c9a13e487187..5d05bb92e129 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -31,6 +31,12 @@ static phys_addr_t hyp_idmap_vector;
 
 static unsigned long io_map_base;
 
+static inline phys_addr_t stage2_apply_range_next(phys_addr_t addr, phys_addr_t end)
+{
+	phys_addr_t boundary = round_down(addr + SZ_1G, SZ_1G);
+
+	return (boundary - 1 < end - 1) ? boundary : end;
+}
 
 /*
  * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
@@ -52,7 +58,7 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
 		if (!pgt)
 			return -EINVAL;
 
-		next = stage2_pgd_addr_end(kvm, addr, end);
+		next = stage2_apply_range_next(addr, end);
 		ret = fn(pgt, addr, next - addr);
 		if (ret)
 			break;

base-commit: b90cb1053190353cc30f0fef0ef1f378ccc063c5
-- 
2.37.3.998.g577e59143f-goog



* Re: [PATCH v2] KVM: arm64: Limit stage2_apply_range() batch size to 1GB
  2022-09-26 22:21 [PATCH v2] KVM: arm64: Limit stage2_apply_range() batch size to 1GB Oliver Upton
@ 2022-09-27 11:34 ` Marc Zyngier
  2022-09-27 19:43   ` Oliver Upton
  0 siblings, 1 reply; 4+ messages in thread
From: Marc Zyngier @ 2022-09-27 11:34 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Catalin Marinas, Will Deacon, James Morse, Alexandru Elisei,
	Suzuki K Poulose, Ricardo Koller, linux-arm-kernel, linux-kernel,
	kvmarm

On Mon, 26 Sep 2022 18:21:45 -0400,
Oliver Upton <oliver.upton@linux.dev> wrote:
> 
[...]

> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c9a13e487187..5d05bb92e129 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -31,6 +31,12 @@ static phys_addr_t hyp_idmap_vector;
>  
>  static unsigned long io_map_base;
>  
> +static inline phys_addr_t stage2_apply_range_next(phys_addr_t addr, phys_addr_t end)

Please drop the inline. I'm sure the compiler will perform its
magic.

Can I also bikeshed a bit about the name? This doesn't "apply"
anything, nor does it return the next range. It really computes the
end of the current one.

Something like stage2_range_addr_end() would at least be consistent
with the rest of the arm64 code (grep for _addr_end ...).

> +{
> +	phys_addr_t boundary = round_down(addr + SZ_1G, SZ_1G);

nit: the rest of the code is using ALIGN_DOWN(). Any reason why this
can't be used here?

> +
> +	return (boundary - 1 < end - 1) ? boundary : end;
> +}
>  
>  /*
>   * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
> @@ -52,7 +58,7 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
>  		if (!pgt)
>  			return -EINVAL;
>  
> -		next = stage2_pgd_addr_end(kvm, addr, end);
> +		next = stage2_apply_range_next(addr, end);
>  		ret = fn(pgt, addr, next - addr);
>  		if (ret)
>  			break;
> 

The main problem I see with this is that some entries now get visited
multiple times if they cover more than a single 1GB entry (like a
512GB level-0 entry with 4k pages and 48bit IPA). As long as this
isn't destructive (CMOs, for example), this is probably OK. For
operations that are not idempotent (such as stage2_unmap_walker), this
is a significant change in behaviour.

My concern is that we have on one side a walker that is strictly
driven by the page-table sizes, and we now get an arbitrary value that
isn't necessarily a multiple of block sizes. Yes, this works right
now because you can't create a block mapping larger than 1GB with any
of the supported page sizes.

But with 52bit VA/PA support, this changes: we can have 512GB (4k),
64GB (16k) and 4TB (64k) block mappings at S2. We don't support this
yet at S2, but when this hits, we'll be in potential trouble.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


* Re: [PATCH v2] KVM: arm64: Limit stage2_apply_range() batch size to 1GB
  2022-09-27 11:34 ` Marc Zyngier
@ 2022-09-27 19:43   ` Oliver Upton
  2022-09-28 11:05     ` Marc Zyngier
  0 siblings, 1 reply; 4+ messages in thread
From: Oliver Upton @ 2022-09-27 19:43 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Catalin Marinas, Will Deacon, James Morse, Alexandru Elisei,
	Suzuki K Poulose, Ricardo Koller, linux-arm-kernel, linux-kernel,
	kvmarm

Hi Marc,

On Tue, Sep 27, 2022 at 07:34:22AM -0400, Marc Zyngier wrote:

[...]

> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index c9a13e487187..5d05bb92e129 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -31,6 +31,12 @@ static phys_addr_t hyp_idmap_vector;
> >  
> >  static unsigned long io_map_base;
> >  
> > +static inline phys_addr_t stage2_apply_range_next(phys_addr_t addr, phys_addr_t end)
> 
> Please drop the inline. I'm sure the compiler will perform its
> magic.
> 
> Can I also bikeshed a bit about the name? This doesn't "apply"
> anything, nor does it return the next range. It really computes the
> end of the current one.
> 
> Something like stage2_range_addr_end() would at least be consistent
> with the rest of the arm64 code (grep for _addr_end ...).

Bikeshed all you want :) But yeah, I like your suggestion.

> > +{
> > +	phys_addr_t boundary = round_down(addr + SZ_1G, SZ_1G);
> 
> nit: the rest of the code is using ALIGN_DOWN(). Any reason why this
> can't be used here?

Nope!
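
So with both of your suggestions folded in, the helper would look
something like this (untested sketch, just to confirm the shape):

static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
{
	/* Walk at most up to the next 1GB boundary, or to 'end' if closer */
	phys_addr_t boundary = ALIGN_DOWN(addr + SZ_1G, SZ_1G);

	/* same wrap-safe comparison as the generic pgd_addr_end() */
	return (boundary - 1 < end - 1) ? boundary : end;
}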

> > +
> > +	return (boundary - 1 < end - 1) ? boundary : end;
> > +}
> >  
> >  /*
> >   * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
> > @@ -52,7 +58,7 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
> >  		if (!pgt)
> >  			return -EINVAL;
> >  
> > -		next = stage2_pgd_addr_end(kvm, addr, end);
> > +		next = stage2_apply_range_next(addr, end);
> >  		ret = fn(pgt, addr, next - addr);
> >  		if (ret)
> >  			break;
> > 
> 
> The main problem I see with this is that some entries now get visited
> multiple times if they cover more than a single 1GB entry (like a
> 512GB level-0 entry with 4k pages and 48bit IPA). As long as this
> isn't destructive (CMOs, for example), this is probably OK. For
> operations that are not idempotent (such as stage2_unmap_walker), this
> is a significant change in behaviour.
> 
> My concern is that we have on one side a walker that is strictly
> driven by the page-table sizes, and we now get an arbitrary value that
> isn't necessarily a multiple of block sizes. Yes, this works right
> now because you can't create a block mapping larger than 1GB with any
> of the supported page sizes.
> 
> But with 52bit VA/PA support, this changes: we can have 512GB (4k),
> 64GB (16k) and 4TB (64k) block mappings at S2. We don't support this
> yet at S2, but when this hits, we'll be in potential trouble.

Ah, I didn't fully capture the reasoning about the batch size. I had
thought about batching by operating on at most 1 block of the largest
supported granularity, but that felt like an inefficient walk restarting
from root every 32M (for 16K paging).

OTOH, if/when we add support for larger blocks in S2 we will run into
the same exact problem if we batch on the largest block size. If
dirty logging caused the large blocks to be shattered down to leaf
granularity then we will visit a crazy amount of PTEs before releasing
the lock.

I guess what I'm getting at is we need to detect lock contention and the
need to reschedule in the middle of the walk instead of at some
predetermined boundary, though that ventures into the territory of 'TDP
MMU' features...
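
(For the record, the rough shape of that check, borrowing from what the
x86 TDP MMU does in tdp_mmu_iter_cond_resched(); untested on arm64 and
only meant as a sketch:

	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
		cond_resched_rwlock_write(&kvm->mmu_lock);
		/* ... and resume the walk from the current address */
	}

i.e. yield whenever someone else wants the lock or we need to
reschedule, rather than at a fixed stride.)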

So, seems to me we can crack this a few ways:

  1. Batch at the arbitrary 1GB since it works currently and produces a
    more efficient walk for all page sizes. I can rework some of the
    logic in kvm_level_supports_block_mapping() such that we can
    BUILD_BUG_ON() if the largest block size exceeds 1GB. Kicks the can
    down the road on a better implementation.

  2. Batch by the largest supported block mapping size. This will result
    in less efficient walks for !4K page sizes and likely produce soft
    lockups when we support even larger blocks. Nonetheless, the
    implementation will remain correct regardless of block size.

  3. Test for lock contention and need_resched() in the middle of the
    walk, rewalking from wherever we left off when scheduled again. TDP
    MMU already does this, so it could be a wasted effort adding support
    for it to the ARM MMU if we are to switch over at some point.

WDYT?

--
Thanks,
Oliver


* Re: [PATCH v2] KVM: arm64: Limit stage2_apply_range() batch size to 1GB
  2022-09-27 19:43   ` Oliver Upton
@ 2022-09-28 11:05     ` Marc Zyngier
  0 siblings, 0 replies; 4+ messages in thread
From: Marc Zyngier @ 2022-09-28 11:05 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Catalin Marinas, Will Deacon, James Morse, Alexandru Elisei,
	Suzuki K Poulose, Ricardo Koller, linux-arm-kernel, linux-kernel,
	kvmarm, Quentin Perret

On Tue, 27 Sep 2022 15:43:10 -0400,
Oliver Upton <oliver.upton@linux.dev> wrote:
> 
> Hi Marc,
> 
> On Tue, Sep 27, 2022 at 07:34:22AM -0400, Marc Zyngier wrote:
> 
[...]
> 
> Ah, I didn't fully capture the reasoning about the batch size. I had
> thought about batching by operating on at most 1 block of the largest
> supported granularity, but that felt like an inefficient walk restarting
> from root every 32M (for 16K paging).
> 
> OTOH, if/when we add support for larger blocks in S2 we will run into
> the same exact problem if we batch on the largest block size. If
> dirty logging caused the large blocks to be shattered down to leaf
> granularity then we will visit a crazy amount of PTEs before releasing
> the lock.
> 
> I guess what I'm getting at is we need to detect lock contention and the
> need to reschedule in the middle of the walk instead of at some
> predetermined boundary, though that ventures into the territory of 'TDP
> MMU' features...
> 
> So, seems to me we can crack this a few ways:
> 
>   1. Batch at the arbitrary 1GB since it works currently and produces a
>     more efficient walk for all page sizes. I can rework some of the
>     logic in kvm_level_supports_block_mapping() such that we can
>     BUILD_BUG_ON() if the largest block size exceeds 1GB. Kicks the can
>     down the road on a better implementation.
> 
>   2. Batch by the largest supported block mapping size. This will result
>     in less efficient walks for !4K page sizes and likely produce soft
>     lockups when we support even larger blocks. Nonetheless, the
>     implementation will remain correct regardless of block size.
> 
>   3. Test for lock contention and need_resched() in the middle of the
>     walk, rewalking from wherever we left off when scheduled again. TDP
>     MMU already does this, so it could be a wasted effort adding support
>     for it to the ARM MMU if we are to switch over at some point.
> 
> WDYT?

[+ Quentin, who is looking at this as well]

At this stage, I'm more keen on option (2). It solves your immediate
issue, and while it doesn't improve the humongous block mapping case,
it doesn't change the behaviour of the existing walkers (you are
guaranteed to visit a leaf only once per traversal).
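
Something along these lines, I'd imagine (completely untested;
kvm_granule_size() already exists, but the KVM_PGTABLE_MIN_BLOCK_LEVEL
define below would be a new addition naming the shallowest level at
which we allow block mappings):

/*
 * Largest block sizes KVM maps at stage-2 today:
 * 4K -> level 1 (1GB), 16K -> level 2 (32MB), 64K -> level 2 (512MB).
 */
#ifdef CONFIG_ARM64_4K_PAGES
#define KVM_PGTABLE_MIN_BLOCK_LEVEL	1U
#else
#define KVM_PGTABLE_MIN_BLOCK_LEVEL	2U
#endif

static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
{
	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
	phys_addr_t boundary = ALIGN_DOWN(addr + size, size);

	return (boundary - 1 < end - 1) ? boundary : end;
}

That way the batch keeps tracking the largest block size if/when the
bigger 52-bit PA blocks are supported, instead of being an arbitrary
constant.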

Option (3) is more of a long term goal, and I'd rather we improve
things now.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

