From: Krister Johansen <kjlx@templeofstupid.com>
To: Marc Zyngier <maz@kernel.org>, Oliver Upton <oliver.upton@linux.dev>
Cc: James Morse <james.morse@arm.com>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	Zenghui Yu <yuzenghui@huawei.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>, Ali Saidi <alisaidi@amazon.com>,
	David Reaver <me@davidreaver.com>,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-kernel@vger.kernel.org
Subject: [RFC] KVM: arm64: improving IO performance during unmap?
Date: Thu, 28 Mar 2024 12:04:34 -0700
Message-ID: <cover.1711649501.git.kjlx@templeofstupid.com>

Hi,
Ali and I have been looking into ways to reduce the impact an unmap_stage2_range
operation has on IO performance when a device interrupt is handled on the same
CPU where the unmap operation is occurring.

This came to our attention after porting a container VM / hardware-virtualized
containers workload from x86_64 to arm64.  On arm64, the unmap operations took
longer: kvm_tlb_flush_vmid_ipa runs with interrupts disabled, so unmaps that
don't check for a reschedule promptly can delay the IO.
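
To make the timing concrete, here's a rough sketch of the shape of the range
walk as I understand it.  This is simplified and paraphrased rather than the
actual code in arch/arm64/kvm/mmu.c; the function name and locals below are
mine, not the kernel's:

static int stage2_apply_range_sketch(struct kvm *kvm, struct kvm_pgtable *pgt,
                                     phys_addr_t addr, phys_addr_t end,
                                     int (*fn)(struct kvm_pgtable *, u64, u64),
                                     bool resched)
{
        phys_addr_t next;
        int ret = 0;

        do {
                /* One batch; the batch size is what the follow-up patch tunes. */
                next = stage2_range_addr_end(addr, end);

                /*
                 * fn() unmaps and invalidates this batch, and the TLBI path
                 * (kvm_tlb_flush_vmid_ipa) runs with interrupts disabled.
                 */
                ret = fn(pgt, addr, next - addr);
                if (ret)
                        break;

                /* The only chance to reschedule is between batches. */
                if (resched && next != end)
                        cond_resched_rwlock_write(&kvm->mmu_lock);
        } while (addr = next, addr != end);

        return ret;
}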

One approach that we investigated was to modify the deferred TLBI code to run
even if range-based operations were not supported, provided FWB is enabled.  If
range-based operations were supported, the code would use them.  However, if the
CPU didn't support FEAT_TLBIRANGE or the unmap was larger than a certain size,
we'd fall back to vmalls12e1is instead.  This reduced the impact of the unmap
operation on IO performance to less than 5%.  However, with Will's recent
patches[1] to fix cases where freed PTEs may still be referenced, we were
concerned this might not be a viable approach.
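
For reference, the selection logic we prototyped looked roughly like the sketch
below.  The function name, the threshold, and where the FWB check would sit are
all placeholders here, not the actual prototype:

/* Placeholder cutoff: ranges above this fall back to a full-VMID flush. */
#define DEFERRED_TLBI_MAX_SIZE  SZ_2M

static void stage2_deferred_flush_sketch(struct kvm_s2_mmu *mmu,
                                         phys_addr_t addr, size_t size)
{
        if (system_supports_tlb_range() && size <= DEFERRED_TLBI_MAX_SIZE) {
                /* FEAT_TLBIRANGE: invalidate only the deferred range. */
                kvm_tlb_flush_vmid_range(mmu, addr, size);
        } else {
                /*
                 * No FEAT_TLBIRANGE, or the range is too large: invalidate
                 * the whole stage-2 for this VMID (TLBI VMALLS12E1IS).
                 */
                kvm_flush_remote_tlbs(kvm_s2_mmu_to_kvm(mmu));
        }
}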

As a follow-up to this e-mail, I'm sending a patch for a different approach.  It
shrinks the stage2_apply_range batch size to the minimum block size instead of
the maximum block size.  This eliminates the IO performance regressions, but
increases the overall map/unmap operation times when the CPU is receiving IO
interrupts.  I'm unsure whether this is the optimal solution, however, since it
may generate extra unmap walks on 1GB hugepages.  I'm also unclear whether this
creates problems for any of the other users of stage2_apply_range().
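
As a rough illustration of the kind of change the patch makes (the block-level
macro below is a placeholder; see the patch itself for the real change):

/* Placeholder: level of the smallest block mapping (2MB with 4K pages). */
#define SMALLEST_BLOCK_LEVEL    (KVM_PGTABLE_MAX_LEVELS - 2)

static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
{
        /*
         * Previously each batch covered the largest block size (1GB with 4K
         * pages).  Capping it at the smallest block size means the walker
         * reaches the reschedule point between batches far more often.
         */
        phys_addr_t size = kvm_granule_size(SMALLEST_BLOCK_LEVEL);
        phys_addr_t boundary = ALIGN_DOWN(addr + size, size);

        return (boundary - 1 < end - 1) ? boundary : end;
}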

I'd love to get some feedback on the best way to proceed here.

Thanks,

-K

[1] https://lore.kernel.org/kvmarm/20240325185158.8565-1-will@kernel.org/
