From: Krister Johansen <kjlx@templeofstupid.com>
To: Marc Zyngier <maz@kernel.org>, Oliver Upton <oliver.upton@linux.dev>
Cc: James Morse <james.morse@arm.com>, Suzuki K Poulose <suzuki.poulose@arm.com>, Zenghui Yu <yuzenghui@huawei.com>, Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, Ali Saidi <alisaidi@amazon.com>, David Reaver <me@davidreaver.com>, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: [RFC] KVM: arm64: improving IO performance during unmap?
Date: Thu, 28 Mar 2024 12:04:34 -0700
Message-ID: <cover.1711649501.git.kjlx@templeofstupid.com>

Hi,

Ali and I have been looking into ways to reduce the impact an
unmap_stage2_range operation has on IO performance when a device interrupt
shares the CPU where the unmap operation is occurring. This came to our
attention after porting a container VM / hardware-virtualized containers
workload from x86_64 to arm64. On arm64, the unmap operations took longer:
kvm_tlb_flush_vmid_ipa runs with interrupts disabled, so unmaps that don't
check for reschedule promptly may delay the IO.

One approach we investigated was to modify the deferred TLBI code to run
even if range-based operations were not supported (provided FWB is enabled).
If range-based operations were supported, the code would use them. However,
if the CPU didn't support FEAT_TLBIRANGE, or the unmap was larger than a
certain size, we'd fall back to vmalls12e1is instead. This reduced the cost
of the unmap operation to less than a 5% impact on IO performance. However,
with Will's recent patches[1] to fix cases where freed PTEs may still be
referenced, we were concerned this might not be a viable approach.

As a follow-up to this e-mail, I'm sending a patch for a different approach.
It shrinks the stage2_apply_range() batch size to the minimum block size
instead of the maximum block size.
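For illustration, the fallback strategy from our first approach boils down to
the decision below. This is only a rough sketch, not kernel code: the
function name, enum, and the size threshold are all hypothetical, chosen to
model the choice between a ranged invalidation, per-page invalidations, and
a full vmalls12e1is flush.

```c
#include <stdbool.h>

/* Hypothetical sketch of the TLBI strategy choice described above.
 * None of these names exist in KVM; the threshold is illustrative. */
enum tlbi_strategy {
	TLBI_RANGE,	/* FEAT_TLBIRANGE: one ranged invalidation */
	TLBI_PER_PAGE,	/* small unmap: invalidate page by page */
	TLBI_VMALL,	/* large unmap, no ranges: full vmalls12e1is */
};

#define TLBI_VMALL_THRESHOLD_PAGES 512UL	/* hypothetical cutoff */

static enum tlbi_strategy pick_tlbi_strategy(bool has_tlbirange,
					     unsigned long nr_pages)
{
	if (has_tlbirange)
		return TLBI_RANGE;
	if (nr_pages > TLBI_VMALL_THRESHOLD_PAGES)
		return TLBI_VMALL;	/* cheaper than many per-page ops */
	return TLBI_PER_PAGE;
}
```

The point of the vmalls12e1is fallback is that past some size, one full
stage-1+2 invalidation for the VMID is cheaper than issuing thousands of
individual IPA invalidations with interrupts disabled.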
This eliminates the IO performance regressions, but increases the overall
map/unmap operation times when the CPU is receiving IO interrupts. I'm
unsure whether this is the optimal solution, though, since it may generate
extra unmap walks on 1GB hugepages. I'm also unclear whether this creates
problems for any of the other users of stage2_apply_range(). I'd love to
get some feedback on the best way to proceed here.

Thanks,

-K

[1] https://lore.kernel.org/kvmarm/20240325185158.8565-1-will@kernel.org/
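To make the trade-off concrete, here is a sketch of what batching an IPA
range at the minimum block size looks like. Again, this is illustrative
only, not the actual stage2_apply_range() implementation: the function,
the 2 MiB constant, and the batch counter are my own, modeling how a
smaller batch size multiplies the number of walks (and resched points).

```c
#include <stdint.h>

/* Smallest stage-2 block size, assuming 4K granule: 2 MiB.
 * Illustrative constant; the real value depends on the page-table config. */
#define MIN_BLOCK_SIZE (2ULL << 20)

/* Walk [start, end) in MIN_BLOCK_SIZE batches, returning the batch count.
 * Each iteration models one stage2_apply_range() step, after which the
 * real code can check for a pending reschedule before continuing. */
static unsigned long stage2_batches(uint64_t start, uint64_t end)
{
	unsigned long batches = 0;

	while (start < end) {
		uint64_t next = start + MIN_BLOCK_SIZE;

		if (next > end)
			next = end;
		/* ...unmap [start, next) here... */
		start = next;
		batches++;	/* cond_resched() opportunity between batches */
	}
	return batches;
}
```

With a 1 GiB maximum block size, a 1 GiB hugepage unmap is one batch; at the
2 MiB minimum it becomes 512 batches, which is exactly the extra-walk
overhead (and the extra preemption opportunity) described above.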
Thread overview: 18+ messages

2024-03-28 19:04 [RFC] KVM: arm64: improving IO performance during unmap? Krister Johansen [this message]
2024-03-28 19:05 ` [PATCH] KVM: arm64: Limit stage2_apply_range() batch size to smallest block Krister Johansen
2024-03-29 13:48   ` Oliver Upton
2024-03-29 19:15     ` Krister Johansen
2024-03-30 10:17       ` Marc Zyngier
2024-04-02 17:00         ` Krister Johansen
2024-04-04  4:40           ` Krister Johansen
2024-04-04 21:27             ` Ali Saidi
2024-04-04 21:41               ` Krister Johansen