From: Krister Johansen <kjlx@templeofstupid.com>
To: Marc Zyngier <maz@kernel.org>, Oliver Upton <oliver.upton@linux.dev>
Cc: James Morse <james.morse@arm.com>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	Zenghui Yu <yuzenghui@huawei.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>, Ali Saidi <alisaidi@amazon.com>,
	David Reaver <me@davidreaver.com>,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-kernel@vger.kernel.org
Subject: [RFC] KVM: arm64: improving IO performance during unmap?
Date: Thu, 28 Mar 2024 12:04:34 -0700
Message-ID: <cover.1711649501.git.kjlx@templeofstupid.com>

Hi,
Ali and I have been looking into ways to reduce the impact an unmap_stage2_range
operation has on IO performance when a device interrupt is handled on the same
CPU where the unmap operation is occurring.

This came to our attention after porting a container VM / hardware-virtualized
containers workload from x86_64 to arm64.  On arm64, the unmap operations took
longer: kvm_tlb_flush_vmid_ipa runs with interrupts disabled, so unmaps that
don't check for a reschedule promptly can delay the IO.
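
To make the timing concrete, here's a rough sketch of the shape of the range
walk as I understand it.  This is simplified and paraphrased rather than the
actual code in arch/arm64/kvm/mmu.c; the function name and locals below are
mine, not the kernel's:

static int stage2_apply_range_sketch(struct kvm *kvm, struct kvm_pgtable *pgt,
                                     phys_addr_t addr, phys_addr_t end,
                                     int (*fn)(struct kvm_pgtable *, u64, u64),
                                     bool resched)
{
        phys_addr_t next;
        int ret = 0;

        do {
                /* One batch; the batch size is what the follow-up patch tunes. */
                next = stage2_range_addr_end(addr, end);

                /*
                 * fn() unmaps and invalidates this batch, and the TLBI path
                 * (kvm_tlb_flush_vmid_ipa) runs with interrupts disabled.
                 */
                ret = fn(pgt, addr, next - addr);
                if (ret)
                        break;

                /* The only chance to reschedule is between batches. */
                if (resched && next != end)
                        cond_resched_rwlock_write(&kvm->mmu_lock);
        } while (addr = next, addr != end);

        return ret;
}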

One approach that we investigated was to modify the deferred TLBI code to run
even if range-based operations were not supported, provided FWB is enabled.  If
range-based operations were supported, the code would use them.  However, if the
CPU didn't support FEAT_TLBIRANGE or the unmap was larger than a certain size,
we'd fall back to vmalls12e1is instead.  This reduced the impact of the unmap
operation on IO performance to less than 5%.  However, with Will's recent
patches[1] to fix cases where freed PTEs may still be referenced, we were
concerned this might not be a viable approach.
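
For reference, the selection logic we prototyped looked roughly like the sketch
below.  The function name, the threshold, and where the FWB check would sit are
all placeholders here, not the actual prototype:

/* Placeholder cutoff: ranges above this fall back to a full-VMID flush. */
#define DEFERRED_TLBI_MAX_SIZE  SZ_2M

static void stage2_deferred_flush_sketch(struct kvm_s2_mmu *mmu,
                                         phys_addr_t addr, size_t size)
{
        if (system_supports_tlb_range() && size <= DEFERRED_TLBI_MAX_SIZE) {
                /* FEAT_TLBIRANGE: invalidate only the deferred range. */
                kvm_tlb_flush_vmid_range(mmu, addr, size);
        } else {
                /*
                 * No FEAT_TLBIRANGE, or the range is too large: invalidate
                 * the whole stage-2 for this VMID (TLBI VMALLS12E1IS).
                 */
                kvm_flush_remote_tlbs(kvm_s2_mmu_to_kvm(mmu));
        }
}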

As a follow-up to this e-mail, I'm sending a patch for a different approach.  It
shrinks the stage2_apply_range batch size to the minimum block size instead of
the maximum block size.  This eliminates the IO performance regressions, but
increases the overall map/unmap operation times when the CPU is receiving IO
interrupts.  I'm unsure whether this is the optimal solution, however, since it
may generate extra unmap walks on 1GB hugepages.  I'm also unclear whether this
creates problems for any of the other users of stage2_apply_range().
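
As a rough illustration of the kind of change the patch makes (the block-level
macro below is a placeholder; see the patch itself for the real change):

/* Placeholder: level of the smallest block mapping (2MB with 4K pages). */
#define SMALLEST_BLOCK_LEVEL    (KVM_PGTABLE_MAX_LEVELS - 2)

static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
{
        /*
         * Previously each batch covered the largest block size (1GB with 4K
         * pages).  Capping it at the smallest block size means the walker
         * reaches the reschedule point between batches far more often.
         */
        phys_addr_t size = kvm_granule_size(SMALLEST_BLOCK_LEVEL);
        phys_addr_t boundary = ALIGN_DOWN(addr + size, size);

        return (boundary - 1 < end - 1) ? boundary : end;
}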

I'd love to get some feedback on the best way to proceed here.

Thanks,

-K

[1] https://lore.kernel.org/kvmarm/20240325185158.8565-1-will@kernel.org/
