All of lore.kernel.org
 help / color / mirror / Atom feed
From: Krister Johansen <kjlx@templeofstupid.com>
To: Marc Zyngier <maz@kernel.org>
Cc: Krister Johansen <kjlx@templeofstupid.com>,
	Oliver Upton <oliver.upton@linux.dev>,
	James Morse <james.morse@arm.com>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	Zenghui Yu <yuzenghui@huawei.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>, Ali Saidi <alisaidi@amazon.com>,
	David Reaver <me@davidreaver.com>,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] KVM: arm64: Limit stage2_apply_range() batch size to smallest block
Date: Tue, 2 Apr 2024 10:00:52 -0700	[thread overview]
Message-ID: <20240402170052.GA1988@templeofstupid.com> (raw)
In-Reply-To: <87r0fsrpko.wl-maz@kernel.org>

Hi Marc,

On Sat, Mar 30, 2024 at 10:17:43AM +0000, Marc Zyngier wrote:
> On Fri, 29 Mar 2024 19:15:37 +0000,
> Krister Johansen <kjlx@templeofstupid.com> wrote:
> > On Fri, Mar 29, 2024 at 06:48:38AM -0700, Oliver Upton wrote:
> > > On Thu, Mar 28, 2024 at 12:05:08PM -0700, Krister Johansen wrote:
> > > > stage2_apply_range() for unmap operations can interfere with the
> > > > performance of IO if the device's interrupts share the CPU where the
> > > > unmap operation is occurring.  commit 5994bc9e05c2 ("KVM: arm64: Limit
> > > > stage2_apply_range() batch size to largest block") improved this.  Prior
> > > > to that commit, workloads that were unfortunate enough to have their IO
> > > > interrupts pinned to the same CPU as the unmap operation would observe a
> > > > complete stall.  With the switch to using the largest block size, it is
> > > > possible for IO to make progress, albeit at a reduced speed.
> > > 
> > > Can you describe the workload a bit more? I'm having a hard time
> > > understanding how you're unmapping that much memory on the fly in
> > > your workload. Is guest memory getting swapped? Are VMs being torn
> > > down?
> > 
> > Sorry I wasn't clear here.  Yes, it's the VMs getting torn down that's
> > causing the problems.  The container VMs don't have long lifetimes, but
> > some may be up to 256Gb in size, depending on the user.  The workloads
> > running the VMs aren't especially performance sensitive, but their users
> > do notice when network connections time-out.  IOW, if the performance is
> > bad enough to temporarily prevent new TCP connections from being
> > established or requests / responses being recieved in a timely fashion,
> > we'll hear about it.  Users deploy their services a lot, so there's a
> > lot of container vm churn.  (Really it's automation redeploying the
> > services on behalf of the users in response to new commits to their
> > repos...)
> 
> I think this advocates for a teardown-specific code path rather than
> just relying on the usual S2 unmapping which is really designed for
> eviction. There are two things to consider here:
> 
> - TLB invalidation: this should only take a single VMALLS12E1, rather
>   than iterating over the PTs
> 
> - Cache maintenance: this could be elided with FWB, or *optionally*
>   elided if userspace buys in a "I don't need to see the memory of the
>   guest after teardown" type of behaviour

This approach would work for this workload, I think.  The hardware
supports FWB and AFAIK isn't looking at the guest memory after teardown.
This is also desirable because in the future we'd like to support
hotplug of VFIO devices. A separate path for unmap the memory used by
the device vs unmap all of the guest seems smart.

> > > Also, it seems a bit odd to steer interrupts *into* the workload you
> > > care about...
> > 
> > Ah, that was only intentionally done for the purposes of measuring the
> > impact.  That's not done on purpose in production.
> > 
> > Nevertheless, the example we tend to run into is that a box may have 2
> > NICs and each NIC has 32 Tx-Rx queues.  This means we've got 64 NIC
> > interrupts, each assigned to a different CPU.  Our systems have 64 CPUs.
> > What happens in practice is that a VM will get torn down, and that has a
> > 1-in-64 chance of impacting the performance of the subset of the flows
> > that are mapped via RSS to the interrupt that happens to be assigned to
> > the CPU where the VM is being torn down.
> > 
> > Of course, the obvious next question is why not just bind the VMs flows
> > to the CPUs the VM is running on?  We don't have a 1:1 mapping of
> > network device to VM, or VM to CPU right now, which frustrates this
> > approach.
> > 
> > > > Further reducing the stage2_apply_range() batch size has substantial
> > > > performance improvements for IO that share a CPU performing an unmap
> > > > operation.  By switching to a 2mb chunk, IO performance regressions were
> > > > no longer observed in this author's tests.  E.g. it was possible to
> > > > obtain the advertised device throughput despite an unmap operation
> > > > occurring on the CPU where the interrupt was running.  There is a
> > > > tradeoff, however.  No changes were observed in per-operation timings
> > > > when running the kvm_pagetable_test without an interrupt load.  However,
> > > > with a 64gb VM, 1 vcpu, and 4k pages and a IO load, map times increased
> > > > by about 15% and unmap times increased by about 58%.  In essence, this
> > > > trades slower map/unmap times for improved IO throughput.
> > > 
> > > There are other users of the range-based operations, like
> > > write-protection. Live migration is especially sensitive to the latency
> > > of page table updates as it can affect the VMM's ability to converge
> > > with the guest.
> > 
> > To be clear, the reduction in performance was observed when I
> > concurrently executed both the kvm_pagetable_test and a networking
> > benchmark where the NIC's interrupts were assigned to the same CPU where
> > the pagetable test was executing.  I didn't see a slowdown just running
> > the pagetable test.
> 
> Any chance you could share more details about your HW configuration
> (what CPU is that?)  and the type of traffic? This is the sort of
> things I'd like to be able to reproduce in order to experiment various
> strategies.

Sure, I only have access to documentation that is publicly available.

The hardware where we ran into this inititally was Graviton 3, which is
a Neoverse-V1 based core.  It does not support FEAT_TLBIRANGE.  I've
also tested on Graviton 4, which is Neoverse-V2 based.  It _does_
support FEAT_TLBIRANGE.  The deferred range based invalidation
support, was enough to allow us to teardown a large VM based on 4k pages
and not incur a visible performance penalty.  I haven't had a chance to
test to see if and how Will's patches change this, though.

The tests themselves were not especially fancy. The networking hardware
was a ENA device on an EC2 box with 30Gbps limit (5/10 Gbps per flow,
depending on the config).  The storage tested was a gp3 EBS device
configured to max IOPS/throughput (16,000 IOPS / 1000Mb/s).

Networking tests were iperf3 with a 9001 byte packet size.  The storage
tests were fio's randwrite workload in directio mode using the libaio
backend.  The "IOPS" test used a 4k blocksize and a queue depth of 128.
The "throughput" test used a blocksize of 64k and an iodepth of 32.  For
the fio tests, it was a 10gb file and 2 workers, mostly because the EBS
devices have two hardware queues for data.

I ran the kvm_page_table_test with a few different sizes, but settled on
64G with 1 vcpu for most tests.

Let me know if there's anything else I can share here.

-K

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

WARNING: multiple messages have this Message-ID (diff)
From: Krister Johansen <kjlx@templeofstupid.com>
To: Marc Zyngier <maz@kernel.org>
Cc: Krister Johansen <kjlx@templeofstupid.com>,
	Oliver Upton <oliver.upton@linux.dev>,
	James Morse <james.morse@arm.com>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	Zenghui Yu <yuzenghui@huawei.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>, Ali Saidi <alisaidi@amazon.com>,
	David Reaver <me@davidreaver.com>,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] KVM: arm64: Limit stage2_apply_range() batch size to smallest block
Date: Tue, 2 Apr 2024 10:00:52 -0700	[thread overview]
Message-ID: <20240402170052.GA1988@templeofstupid.com> (raw)
In-Reply-To: <87r0fsrpko.wl-maz@kernel.org>

Hi Marc,

On Sat, Mar 30, 2024 at 10:17:43AM +0000, Marc Zyngier wrote:
> On Fri, 29 Mar 2024 19:15:37 +0000,
> Krister Johansen <kjlx@templeofstupid.com> wrote:
> > On Fri, Mar 29, 2024 at 06:48:38AM -0700, Oliver Upton wrote:
> > > On Thu, Mar 28, 2024 at 12:05:08PM -0700, Krister Johansen wrote:
> > > > stage2_apply_range() for unmap operations can interfere with the
> > > > performance of IO if the device's interrupts share the CPU where the
> > > > unmap operation is occurring.  commit 5994bc9e05c2 ("KVM: arm64: Limit
> > > > stage2_apply_range() batch size to largest block") improved this.  Prior
> > > > to that commit, workloads that were unfortunate enough to have their IO
> > > > interrupts pinned to the same CPU as the unmap operation would observe a
> > > > complete stall.  With the switch to using the largest block size, it is
> > > > possible for IO to make progress, albeit at a reduced speed.
> > > 
> > > Can you describe the workload a bit more? I'm having a hard time
> > > understanding how you're unmapping that much memory on the fly in
> > > your workload. Is guest memory getting swapped? Are VMs being torn
> > > down?
> > 
> > Sorry I wasn't clear here.  Yes, it's the VMs getting torn down that's
> > causing the problems.  The container VMs don't have long lifetimes, but
> > some may be up to 256Gb in size, depending on the user.  The workloads
> > running the VMs aren't especially performance sensitive, but their users
> > do notice when network connections time-out.  IOW, if the performance is
> > bad enough to temporarily prevent new TCP connections from being
> > established or requests / responses being recieved in a timely fashion,
> > we'll hear about it.  Users deploy their services a lot, so there's a
> > lot of container vm churn.  (Really it's automation redeploying the
> > services on behalf of the users in response to new commits to their
> > repos...)
> 
> I think this advocates for a teardown-specific code path rather than
> just relying on the usual S2 unmapping which is really designed for
> eviction. There are two things to consider here:
> 
> - TLB invalidation: this should only take a single VMALLS12E1, rather
>   than iterating over the PTs
> 
> - Cache maintenance: this could be elided with FWB, or *optionally*
>   elided if userspace buys in a "I don't need to see the memory of the
>   guest after teardown" type of behaviour

This approach would work for this workload, I think.  The hardware
supports FWB and AFAIK isn't looking at the guest memory after teardown.
This is also desirable because in the future we'd like to support
hotplug of VFIO devices. A separate path for unmap the memory used by
the device vs unmap all of the guest seems smart.

> > > Also, it seems a bit odd to steer interrupts *into* the workload you
> > > care about...
> > 
> > Ah, that was only intentionally done for the purposes of measuring the
> > impact.  That's not done on purpose in production.
> > 
> > Nevertheless, the example we tend to run into is that a box may have 2
> > NICs and each NIC has 32 Tx-Rx queues.  This means we've got 64 NIC
> > interrupts, each assigned to a different CPU.  Our systems have 64 CPUs.
> > What happens in practice is that a VM will get torn down, and that has a
> > 1-in-64 chance of impacting the performance of the subset of the flows
> > that are mapped via RSS to the interrupt that happens to be assigned to
> > the CPU where the VM is being torn down.
> > 
> > Of course, the obvious next question is why not just bind the VMs flows
> > to the CPUs the VM is running on?  We don't have a 1:1 mapping of
> > network device to VM, or VM to CPU right now, which frustrates this
> > approach.
> > 
> > > > Further reducing the stage2_apply_range() batch size has substantial
> > > > performance improvements for IO that share a CPU performing an unmap
> > > > operation.  By switching to a 2mb chunk, IO performance regressions were
> > > > no longer observed in this author's tests.  E.g. it was possible to
> > > > obtain the advertised device throughput despite an unmap operation
> > > > occurring on the CPU where the interrupt was running.  There is a
> > > > tradeoff, however.  No changes were observed in per-operation timings
> > > > when running the kvm_pagetable_test without an interrupt load.  However,
> > > > with a 64gb VM, 1 vcpu, and 4k pages and a IO load, map times increased
> > > > by about 15% and unmap times increased by about 58%.  In essence, this
> > > > trades slower map/unmap times for improved IO throughput.
> > > 
> > > There are other users of the range-based operations, like
> > > write-protection. Live migration is especially sensitive to the latency
> > > of page table updates as it can affect the VMM's ability to converge
> > > with the guest.
> > 
> > To be clear, the reduction in performance was observed when I
> > concurrently executed both the kvm_pagetable_test and a networking
> > benchmark where the NIC's interrupts were assigned to the same CPU where
> > the pagetable test was executing.  I didn't see a slowdown just running
> > the pagetable test.
> 
> Any chance you could share more details about your HW configuration
> (what CPU is that?)  and the type of traffic? This is the sort of
> things I'd like to be able to reproduce in order to experiment various
> strategies.

Sure, I only have access to documentation that is publicly available.

The hardware where we ran into this inititally was Graviton 3, which is
a Neoverse-V1 based core.  It does not support FEAT_TLBIRANGE.  I've
also tested on Graviton 4, which is Neoverse-V2 based.  It _does_
support FEAT_TLBIRANGE.  The deferred range based invalidation
support, was enough to allow us to teardown a large VM based on 4k pages
and not incur a visible performance penalty.  I haven't had a chance to
test to see if and how Will's patches change this, though.

The tests themselves were not especially fancy. The networking hardware
was a ENA device on an EC2 box with 30Gbps limit (5/10 Gbps per flow,
depending on the config).  The storage tested was a gp3 EBS device
configured to max IOPS/throughput (16,000 IOPS / 1000Mb/s).

Networking tests were iperf3 with a 9001 byte packet size.  The storage
tests were fio's randwrite workload in directio mode using the libaio
backend.  The "IOPS" test used a 4k blocksize and a queue depth of 128.
The "throughput" test used a blocksize of 64k and an iodepth of 32.  For
the fio tests, it was a 10gb file and 2 workers, mostly because the EBS
devices have two hardware queues for data.

I ran the kvm_page_table_test with a few different sizes, but settled on
64G with 1 vcpu for most tests.

Let me know if there's anything else I can share here.

-K

  reply	other threads:[~2024-04-02 17:01 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-28 19:04 [RFC] KVM: arm64: improving IO performance during unmap? Krister Johansen
2024-03-28 19:04 ` Krister Johansen
2024-03-28 19:05 ` [PATCH] KVM: arm64: Limit stage2_apply_range() batch size to smallest block Krister Johansen
2024-03-28 19:05   ` Krister Johansen
2024-03-29 13:48   ` Oliver Upton
2024-03-29 13:48     ` Oliver Upton
2024-03-29 19:15     ` Krister Johansen
2024-03-29 19:15       ` Krister Johansen
2024-03-30 10:17       ` Marc Zyngier
2024-03-30 10:17         ` Marc Zyngier
2024-04-02 17:00         ` Krister Johansen [this message]
2024-04-02 17:00           ` Krister Johansen
2024-04-04  4:40           ` Krister Johansen
2024-04-04  4:40             ` Krister Johansen
2024-04-04 21:27             ` Ali Saidi
2024-04-04 21:27               ` Ali Saidi
2024-04-04 21:41               ` Krister Johansen
2024-04-04 21:41                 ` Krister Johansen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240402170052.GA1988@templeofstupid.com \
    --to=kjlx@templeofstupid.com \
    --cc=alisaidi@amazon.com \
    --cc=catalin.marinas@arm.com \
    --cc=james.morse@arm.com \
    --cc=kvmarm@lists.linux.dev \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maz@kernel.org \
    --cc=me@davidreaver.com \
    --cc=oliver.upton@linux.dev \
    --cc=suzuki.poulose@arm.com \
    --cc=will@kernel.org \
    --cc=yuzenghui@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.