From: Oliver Upton <oliver.upton@linux.dev>
To: Krister Johansen <kjlx@templeofstupid.com>
Cc: Marc Zyngier <maz@kernel.org>, James Morse <james.morse@arm.com>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	Zenghui Yu <yuzenghui@huawei.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>, Ali Saidi <alisaidi@amazon.com>,
	David Reaver <me@davidreaver.com>,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] KVM: arm64: Limit stage2_apply_range() batch size to smallest block
Date: Fri, 29 Mar 2024 06:48:38 -0700
Message-ID: <ZgbGtpj5mStTkAkn@linux.dev>
In-Reply-To: <ebf0fac84cb1d19bdc6e73576e3cc40a9cab0635.1711649501.git.kjlx@templeofstupid.com>

Hi Krister,

On Thu, Mar 28, 2024 at 12:05:08PM -0700, Krister Johansen wrote:
> stage2_apply_range() for unmap operations can interfere with the
> performance of IO if the device's interrupts share the CPU where the
> unmap operation is occurring.  Commit 5994bc9e05c2 ("KVM: arm64: Limit
> stage2_apply_range() batch size to largest block") improved this.  Prior
> to that commit, workloads that were unfortunate enough to have their IO
> interrupts pinned to the same CPU as the unmap operation would observe a
> complete stall.  With the switch to using the largest block size, it is
> possible for IO to make progress, albeit at a reduced speed.

Can you describe the workload a bit more? I'm having a hard time
understanding how you're unmapping that much memory on the fly in
your workload. Is guest memory getting swapped? Are VMs being torn
down?

Also, it seems a bit odd to steer interrupts *into* the workload you
care about...

> Further reducing the stage2_apply_range() batch size yields substantial
> performance improvements for IO that shares a CPU with an unmap
> operation.  By switching to a 2MB chunk, IO performance regressions were
> no longer observed in this author's tests.  E.g., it was possible to
> obtain the advertised device throughput despite an unmap operation
> occurring on the CPU where the interrupt was running.  There is a
> tradeoff, however.  No changes were observed in per-operation timings
> when running the kvm_page_table_test without an interrupt load.
> However, with a 64GB VM, 1 vCPU, 4K pages, and an IO load, map times
> increased by about 15% and unmap times increased by about 58%.  In
> essence, this trades slower map/unmap times for improved IO throughput.

There are other users of the range-based operations, like
write-protection. Live migration is especially sensitive to the latency
of page table updates as it can affect the VMM's ability to converge
with the guest.

> Cc: <stable@vger.kernel.org> # 5.15.x: 3b5c082bbfa2: KVM: arm64: Work out supported block level at compile time
> Cc: <stable@vger.kernel.org> # 5.15.x: 5994bc9e05c2: KVM: arm64: Limit stage2_apply_range() batch size to largest block
> Cc: <stable@vger.kernel.org> # 5.15.x

This is a performance improvement, *not* a correctness fix. Please don't
cc stable for it.

> Suggested-by: Ali Saidi <alisaidi@amazon.com>
> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h | 4 ++++
>  arch/arm64/kvm/mmu.c                 | 2 +-
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 19278dfe7978..b0c4651a4d9a 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -19,11 +19,15 @@
>   *  - 4K (level 1):	1GB
>   *  - 16K (level 2):	32MB
>   *  - 64K (level 2):	512MB
> + *
> + *  The max block level is the _smallest_ supported block size for KVM.

This feels like a non sequitur given the old comment is left in place...
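
If both macros are sticking around, it may read better folded into a
single blurb, e.g. (just a sketch, not a demand):

  /*
   * KVM_PGTABLE_MIN_BLOCK_LEVEL is the highest level (largest block
   * size) KVM can map at stage-2; KVM_PGTABLE_MAX_BLOCK_LEVEL is the
   * lowest level (smallest block size), used to bound the batch size
   * of stage2_apply_range().
   */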

>   */
>  #ifdef CONFIG_ARM64_4K_PAGES
>  #define KVM_PGTABLE_MIN_BLOCK_LEVEL	1
> +#define KVM_PGTABLE_MAX_BLOCK_LEVEL	2
>  #else
>  #define KVM_PGTABLE_MIN_BLOCK_LEVEL	2
> +#define KVM_PGTABLE_MAX_BLOCK_LEVEL	KVM_PGTABLE_MIN_BLOCK_LEVEL
>  #endif
>  
>  #define kvm_lpa2_is_enabled()		system_supports_lpa2()
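
As an aside, for anyone mapping these level numbers to sizes: the batch
this selects is the level-2 block, i.e. 2MB with 4K pages (32MB with
16K, 512MB with 64K). A quick userspace model of the granule math (my
own sketch, not kvm_granule_size() verbatim):

  #include <stdint.h>
  #include <stdio.h>

  /*
   * With a 4K translation granule, each level down divides coverage
   * by 512: shift(level) = PAGE_SHIFT + (last - level) * (PAGE_SHIFT - 3),
   * with PAGE_SHIFT = 12 and last level 3.
   */
  static uint64_t granule_size_4k(int level)
  {
          return 1ULL << (12 + (3 - level) * 9);
  }

  int main(void)
  {
          /* level 1 -> 1GB (the old batch), level 2 -> 2MB (the new one) */
          printf("level 1: %llu MB\n",
                 (unsigned long long)(granule_size_4k(1) >> 20));
          printf("level 2: %llu MB\n",
                 (unsigned long long)(granule_size_4k(2) >> 20));
          return 0;
  }
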
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index dc04bc767865..1e927b306aee 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -41,7 +41,7 @@ static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
>  
>  static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
>  {
> -	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> +	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MAX_BLOCK_LEVEL);
>  
>  	return __stage2_range_addr_end(addr, end, size);
>  }

This doesn't feel right to me. A property that we had before is that
leaf entries are visited at most once, since every mapping size divided
evenly into the KVM_PGTABLE_MIN_BLOCK_LEVEL granule size.

Seems like we could wind up visiting a PUD mapping 512 times, at least
for 4K pages.
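
Back-of-the-envelope (a rough model of the batching walk, assuming a 4K
granule; the real loop lives in stage2_apply_range()):

  #include <stdio.h>

  /*
   * Model: the range is walked in batch-sized steps, so a single 1GB
   * level-1 block mapping gets revisited once per step.
   */
  int main(void)
  {
          unsigned long long pud    = 1ULL << 30; /* 1GB block mapping */
          unsigned long long batch  = 2ULL << 20; /* 2MB batch size */
          unsigned long long visits = 0;

          for (unsigned long long addr = 0; addr < pud; addr += batch)
                  visits++;

          printf("visits per 1GB mapping: %llu\n", visits); /* 512 */
          return 0;
  }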

-- 
Thanks,
Oliver
