From: Chuck Lever <chuck.lever@oracle.com>
To: Will Deacon <will@kernel.org>
Cc: iommu@lists.linux-foundation.org,
	linux-rdma <linux-rdma@vger.kernel.org>
Subject: performance regression noted in v5.11-rc after c062db039f40
Date: Fri, 8 Jan 2021 16:18:36 -0500
Message-ID: <D81314ED-5673-44A6-B597-090E3CB83EB0@oracle.com>

Hi-

[ Please cc: me on replies; I'm not currently subscribed to
iommu@lists ].

I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:

/home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
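
For reference, the mount itself looks something like this (the
server name and export path below are placeholders, not my actual
setup; 20049 is the standard port for NFS/RDMA):

	mount -t nfs -o vers=3,proto=rdma,port=20049 server:/export /mnt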

For those not familiar with the way storage protocols use RDMA: the
initiator/client registers memory regions, and the target/server uses
RDMA Read and Write to move data out of and into those regions. In
other words, the initiator/client issues only memory registration and
invalidation operations, while the target/server issues all of the
RDMA Reads and Writes.

My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
enabled using the kernel command line options "intel_iommu=on
iommu=strict".

Recently I've noticed a significant (25-30%) loss in NFS throughput.
I was able to bisect the regression on my client to the commits below.
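
For the record, the bisect followed the usual procedure, roughly as
sketched below; the good/bad points shown are illustrative, not the
exact range I used:

	git bisect start
	git bisect bad v5.11-rc2    # example: first kernel showing the loss
	git bisect good v5.10       # example: last known-good kernel
	# At each step: build, boot, run the iozone command above,
	# then mark the kernel:
	git bisect good    # or: git bisect bad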

Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
map_sg"). Throughput here is about normal for this test.

	Children see throughput for 12 initial writers 	= 4732581.09 kB/sec
 	Parent sees throughput for 12 initial writers 	= 4646810.21 kB/sec
 	Min throughput per process 			=  387764.34 kB/sec
 	Max throughput per process 			=  399655.47 kB/sec
 	Avg throughput per process 			=  394381.76 kB/sec
 	Min xfer 					= 1017344.00 kB
 	CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
 	Children see throughput for 12 rewriters 	= 4837741.94 kB/sec
 	Parent sees throughput for 12 rewriters 	= 4833509.35 kB/sec
 	Min throughput per process 			=  398983.72 kB/sec
 	Max throughput per process 			=  406199.66 kB/sec
 	Avg throughput per process 			=  403145.16 kB/sec
 	Min xfer 					= 1030656.00 kB
 	CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
 	Min throughput per process 			=  491812.38 kB/sec
 	Max throughput per process 			=  494777.28 kB/sec
 	Avg throughput per process 			=  493447.58 kB/sec
 	Min xfer 					= 1042688.00 kB
 	CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
 	Children see throughput for 12 re-readers 	= 5947985.69 kB/sec
 	Parent sees throughput for 12 re-readers 	= 5941348.51 kB/sec
 	Min throughput per process 			=  492805.81 kB/sec
 	Max throughput per process 			=  497280.19 kB/sec
 	Avg throughput per process 			=  495665.47 kB/sec
 	Min xfer 					= 1039360.00 kB
 	CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %

Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
iommu_ops.at(de)tach_dev"). It's losing some steam here.

	Children see throughput for 12 initial writers 	= 4342419.12 kB/sec
 	Parent sees throughput for 12 initial writers 	= 4310612.79 kB/sec
 	Min throughput per process 			=  359299.06 kB/sec
 	Max throughput per process 			=  363866.16 kB/sec
 	Avg throughput per process 			=  361868.26 kB/sec
 	Min xfer 					= 1035520.00 kB
 	CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
 	Children see throughput for 12 rewriters 	= 4408576.66 kB/sec
 	Parent sees throughput for 12 rewriters 	= 4404280.87 kB/sec
 	Min throughput per process 			=  364553.88 kB/sec
 	Max throughput per process 			=  370029.28 kB/sec
 	Avg throughput per process 			=  367381.39 kB/sec
 	Min xfer 					= 1033216.00 kB
 	CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
 	Children see throughput for 12 readers 		= 5406879.47 kB/sec
 	Parent sees throughput for 12 readers 		= 5401862.78 kB/sec
 	Min throughput per process 			=  449583.03 kB/sec
 	Max throughput per process 			=  451761.69 kB/sec
 	Avg throughput per process 			=  450573.29 kB/sec
 	Min xfer 					= 1044224.00 kB
 	CPU utilization: Wall time    2.323    CPU time    1.977    CPU utilization  85.12 %
 	Children see throughput for 12 re-readers 	= 5410601.12 kB/sec
 	Parent sees throughput for 12 re-readers 	= 5403504.40 kB/sec
 	Min throughput per process 			=  449918.12 kB/sec
 	Max throughput per process 			=  452489.28 kB/sec
 	Avg throughput per process 			=  450883.43 kB/sec
 	Min xfer 					= 1043456.00 kB
 	CPU utilization: Wall time    2.321    CPU time    1.978    CPU utilization  85.21 %

And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
the iommu ops"). Significant throughput loss.

	Children see throughput for 12 initial writers 	= 3812036.91 kB/sec
 	Parent sees throughput for 12 initial writers 	= 3753683.40 kB/sec
 	Min throughput per process 			=  313672.25 kB/sec
 	Max throughput per process 			=  321719.44 kB/sec
 	Avg throughput per process 			=  317669.74 kB/sec
 	Min xfer 					= 1022464.00 kB
 	CPU Utilization: Wall time    3.309    CPU time    1.986    CPU utilization  60.02 %
 	Children see throughput for 12 rewriters 	= 3786831.94 kB/sec
 	Parent sees throughput for 12 rewriters 	= 3783205.58 kB/sec
 	Min throughput per process 			=  313654.44 kB/sec
 	Max throughput per process 			=  317844.50 kB/sec
 	Avg throughput per process 			=  315569.33 kB/sec
 	Min xfer 					= 1035520.00 kB
 	CPU utilization: Wall time    3.302    CPU time    1.945    CPU utilization  58.90 %
 	Children see throughput for 12 readers 		= 4265828.28 kB/sec
 	Parent sees throughput for 12 readers 		= 4261844.88 kB/sec
 	Min throughput per process 			=  352305.00 kB/sec
 	Max throughput per process 			=  357726.22 kB/sec
 	Avg throughput per process 			=  355485.69 kB/sec
 	Min xfer 					= 1032960.00 kB
 	CPU utilization: Wall time    2.934    CPU time    1.942    CPU utilization  66.20 %
 	Children see throughput for 12 re-readers 	= 4220651.19 kB/sec
 	Parent sees throughput for 12 re-readers 	= 4216096.04 kB/sec
 	Min throughput per process 			=  348677.16 kB/sec
 	Max throughput per process 			=  353467.44 kB/sec
 	Avg throughput per process 			=  351720.93 kB/sec
 	Min xfer 					= 1035264.00 kB
 	CPU utilization: Wall time    2.969    CPU time    1.952    CPU utilization  65.74 %

The regression appears to be 100% reproducible. 


--
Chuck Lever




