From: Chuck Lever <chuck.lever@oracle.com>
To: Will Deacon <will@kernel.org>
Cc: iommu@lists.linux-foundation.org, linux-rdma <linux-rdma@vger.kernel.org>
Subject: performance regression noted in v5.11-rc after c062db039f40
Date: Fri, 8 Jan 2021 16:18:36 -0500
Message-ID: <D81314ED-5673-44A6-B597-090E3CB83EB0@oracle.com>

Hi-

[ Please cc: me on replies, I'm not currently subscribed to iommu@lists ]

I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:

  /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I

For those not familiar with the way storage protocols use RDMA: the
initiator/client sets up memory regions, and the target/server uses
RDMA Read and Write to move data out of and into those regions. In
other words, the initiator/client performs only RDMA memory
registration and invalidation operations, while the target/server
issues the RDMA Reads and Writes.

My NFS client is a two-socket, 12-core x86_64 system with its I/O MMU
enabled using the kernel command line options "intel_iommu=on
iommu=strict".

Recently I've noticed a significant (25-30%) loss in NFS throughput,
and I was able to bisect on my client to the following commits.

Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
map_sg"). This is about normal for this test.
    Children see throughput for 12 initial writers = 4732581.09 kB/sec
    Parent sees throughput for 12 initial writers  = 4646810.21 kB/sec
    Min throughput per process                     =  387764.34 kB/sec
    Max throughput per process                     =  399655.47 kB/sec
    Avg throughput per process                     =  394381.76 kB/sec
    Min xfer                                       = 1017344.00 kB
    CPU Utilization: Wall time 2.671  CPU time 1.974  CPU utilization 73.89 %

    Children see throughput for 12 rewriters       = 4837741.94 kB/sec
    Parent sees throughput for 12 rewriters        = 4833509.35 kB/sec
    Min throughput per process                     =  398983.72 kB/sec
    Max throughput per process                     =  406199.66 kB/sec
    Avg throughput per process                     =  403145.16 kB/sec
    Min xfer                                       = 1030656.00 kB
    CPU Utilization: Wall time 2.584  CPU time 1.959  CPU utilization 75.82 %

    Children see throughput for 12 readers         = 5921370.94 kB/sec
    Parent sees throughput for 12 readers          = 5914106.69 kB/sec
    Min throughput per process                     =  491812.38 kB/sec
    Max throughput per process                     =  494777.28 kB/sec
    Avg throughput per process                     =  493447.58 kB/sec
    Min xfer                                       = 1042688.00 kB
    CPU Utilization: Wall time 2.122  CPU time 1.968  CPU utilization 92.75 %

    Children see throughput for 12 re-readers      = 5947985.69 kB/sec
    Parent sees throughput for 12 re-readers       = 5941348.51 kB/sec
    Min throughput per process                     =  492805.81 kB/sec
    Max throughput per process                     =  497280.19 kB/sec
    Avg throughput per process                     =  495665.47 kB/sec
    Min xfer                                       = 1039360.00 kB
    CPU Utilization: Wall time 2.111  CPU time 1.968  CPU utilization 93.22 %

Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
iommu_ops.at(de)tach_dev"). It's losing some steam here.
    Children see throughput for 12 initial writers = 4342419.12 kB/sec
    Parent sees throughput for 12 initial writers  = 4310612.79 kB/sec
    Min throughput per process                     =  359299.06 kB/sec
    Max throughput per process                     =  363866.16 kB/sec
    Avg throughput per process                     =  361868.26 kB/sec
    Min xfer                                       = 1035520.00 kB
    CPU Utilization: Wall time 2.902  CPU time 1.951  CPU utilization 67.22 %

    Children see throughput for 12 rewriters       = 4408576.66 kB/sec
    Parent sees throughput for 12 rewriters        = 4404280.87 kB/sec
    Min throughput per process                     =  364553.88 kB/sec
    Max throughput per process                     =  370029.28 kB/sec
    Avg throughput per process                     =  367381.39 kB/sec
    Min xfer                                       = 1033216.00 kB
    CPU Utilization: Wall time 2.836  CPU time 1.956  CPU utilization 68.97 %

    Children see throughput for 12 readers         = 5406879.47 kB/sec
    Parent sees throughput for 12 readers          = 5401862.78 kB/sec
    Min throughput per process                     =  449583.03 kB/sec
    Max throughput per process                     =  451761.69 kB/sec
    Avg throughput per process                     =  450573.29 kB/sec
    Min xfer                                       = 1044224.00 kB
    CPU Utilization: Wall time 2.323  CPU time 1.977  CPU utilization 85.12 %

    Children see throughput for 12 re-readers      = 5410601.12 kB/sec
    Parent sees throughput for 12 re-readers       = 5403504.40 kB/sec
    Min throughput per process                     =  449918.12 kB/sec
    Max throughput per process                     =  452489.28 kB/sec
    Avg throughput per process                     =  450883.43 kB/sec
    Min xfer                                       = 1043456.00 kB
    CPU Utilization: Wall time 2.321  CPU time 1.978  CPU utilization 85.21 %

And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
the iommu ops"). Significant throughput loss.
    Children see throughput for 12 initial writers = 3812036.91 kB/sec
    Parent sees throughput for 12 initial writers  = 3753683.40 kB/sec
    Min throughput per process                     =  313672.25 kB/sec
    Max throughput per process                     =  321719.44 kB/sec
    Avg throughput per process                     =  317669.74 kB/sec
    Min xfer                                       = 1022464.00 kB
    CPU Utilization: Wall time 3.309  CPU time 1.986  CPU utilization 60.02 %

    Children see throughput for 12 rewriters       = 3786831.94 kB/sec
    Parent sees throughput for 12 rewriters        = 3783205.58 kB/sec
    Min throughput per process                     =  313654.44 kB/sec
    Max throughput per process                     =  317844.50 kB/sec
    Avg throughput per process                     =  315569.33 kB/sec
    Min xfer                                       = 1035520.00 kB
    CPU Utilization: Wall time 3.302  CPU time 1.945  CPU utilization 58.90 %

    Children see throughput for 12 readers         = 4265828.28 kB/sec
    Parent sees throughput for 12 readers          = 4261844.88 kB/sec
    Min throughput per process                     =  352305.00 kB/sec
    Max throughput per process                     =  357726.22 kB/sec
    Avg throughput per process                     =  355485.69 kB/sec
    Min xfer                                       = 1032960.00 kB
    CPU Utilization: Wall time 2.934  CPU time 1.942  CPU utilization 66.20 %

    Children see throughput for 12 re-readers      = 4220651.19 kB/sec
    Parent sees throughput for 12 re-readers       = 4216096.04 kB/sec
    Min throughput per process                     =  348677.16 kB/sec
    Max throughput per process                     =  353467.44 kB/sec
    Avg throughput per process                     =  351720.93 kB/sec
    Min xfer                                       = 1035264.00 kB
    CPU Utilization: Wall time 2.969  CPU time 1.952  CPU utilization 65.74 %

The regression appears to be 100% reproducible.

--
Chuck Lever
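As a quick sanity check on the figures quoted above, the per-test loss between the first bisect point (65f746e8285f) and the last (c588072bba6b) can be computed directly from the "Children see throughput" lines. This is illustrative arithmetic only, not part of the original report:

```python
# Per-test throughput loss, computed from the iozone "Children see
# throughput" summary lines quoted above (values in kB/sec).
baseline = {
    "initial writers": 4732581.09,  # 65f746e8285f
    "rewriters":       4837741.94,
    "readers":         5921370.94,
    "re-readers":      5947985.69,
}
regressed = {
    "initial writers": 3812036.91,  # c588072bba6b
    "rewriters":       3786831.94,
    "readers":         4265828.28,
    "re-readers":      4220651.19,
}

def pct_loss(before: float, after: float) -> float:
    """Throughput loss as a percentage of the baseline figure."""
    return 100.0 * (before - after) / before

for test in baseline:
    print(f"{test}: {pct_loss(baseline[test], regressed[test]):.1f}% loss")
```

With these numbers the loss works out to roughly 19-22% for the write phases and 28-29% for the read phases, consistent with the 25-30% figure cited in the report.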
Thread overview (each message was delivered to both lists; duplicate
copies omitted):

2021-01-08 21:18   Chuck Lever [this message]
2021-01-12 14:38 ` Will Deacon
2021-01-13  2:25 ` Lu Baolu
2021-01-13 14:07 ` Chuck Lever
2021-01-13 18:30 ` Chuck Lever
2021-01-18 16:18 ` Chuck Lever
2021-01-18 18:00 ` Robin Murphy
2021-01-18 20:09 ` Chuck Lever
2021-01-19  1:22 ` Lu Baolu
2021-01-19 14:37 ` Chuck Lever
2021-01-20  2:11 ` Lu Baolu
2021-01-20 20:25 ` Chuck Lever
2021-01-21 19:09 ` Chuck Lever
2021-01-22  3:00 ` Lu Baolu
2021-01-22 16:18 ` Chuck Lever
2021-01-22 17:38 ` Robin Murphy
2021-01-22 18:38 ` Chuck Lever
2021-01-24  7:17 ` Lu Baolu