* performance regression noted in v5.11-rc after c062db039f40
From: Chuck Lever @ 2021-01-08 21:18 UTC
  To: Will Deacon; +Cc: iommu, linux-rdma

Hi-

[ Please cc: me on replies, I'm not currently subscribed to
iommu@lists ].

I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:

/home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
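
For anyone reproducing this, the mount itself is a plain NFSv3-over-RDMA
mount, roughly as follows; the server name and export path here are
placeholders, not my exact setup:

# load the client RPC-over-RDMA transport (module name is rpcrdma or
# xprtrdma depending on kernel), then mount with proto=rdma
modprobe xprtrdma
mount -t nfs -o vers=3,proto=rdma,port=20049 server:/export /mnt/nfs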

For those not familiar with the way storage protocols use RDMA: the
initiator/client sets up memory regions, and the target/server uses
RDMA Read and Write to move data out of and into those regions. In
other words, the initiator/client issues only memory registration and
invalidation operations, while the target/server issues the RDMA Read
and Write operations.

My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
enabled using the kernel command line options "intel_iommu=on
iommu=strict".
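
Both options can be confirmed on the running kernel with something like
the following (the exact DMAR/IOMMU messages vary by kernel version):

cat /proc/cmdline
dmesg | grep -i -e dmar -e iommu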

Recently I've noticed a significant (25-30%) loss in NFS throughput.
I was able to bisect on my client to the following commits.
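
The bisect was driven in the usual way, roughly as below; the good/bad
endpoints shown here are illustrative rather than the exact tags I used:

git bisect start
git bisect bad v5.11-rc1
git bisect good v5.10
# build and boot each candidate, rerun the iozone command above, then
# mark it good or bad until the culprit commit is isolated:
git bisect good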

Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
map_sg"). This is about normal for this test.

	Children see throughput for 12 initial writers 	= 4732581.09 kB/sec
 	Parent sees throughput for 12 initial writers 	= 4646810.21 kB/sec
 	Min throughput per process 			=  387764.34 kB/sec
 	Max throughput per process 			=  399655.47 kB/sec
 	Avg throughput per process 			=  394381.76 kB/sec
 	Min xfer 					= 1017344.00 kB
 	CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
 	Children see throughput for 12 rewriters 	= 4837741.94 kB/sec
 	Parent sees throughput for 12 rewriters 	= 4833509.35 kB/sec
 	Min throughput per process 			=  398983.72 kB/sec
 	Max throughput per process 			=  406199.66 kB/sec
 	Avg throughput per process 			=  403145.16 kB/sec
 	Min xfer 					= 1030656.00 kB
 	CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
 	Min throughput per process 			=  491812.38 kB/sec
 	Max throughput per process 			=  494777.28 kB/sec
 	Avg throughput per process 			=  493447.58 kB/sec
 	Min xfer 					= 1042688.00 kB
 	CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
 	Children see throughput for 12 re-readers 	= 5947985.69 kB/sec
 	Parent sees throughput for 12 re-readers 	= 5941348.51 kB/sec
 	Min throughput per process 			=  492805.81 kB/sec
 	Max throughput per process 			=  497280.19 kB/sec
 	Avg throughput per process 			=  495665.47 kB/sec
 	Min xfer 					= 1039360.00 kB
 	CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %

Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
iommu_ops.at(de)tach_dev"). It's losing some steam here.

	Children see throughput for 12 initial writers 	= 4342419.12 kB/sec
 	Parent sees throughput for 12 initial writers 	= 4310612.79 kB/sec
 	Min throughput per process 			=  359299.06 kB/sec
 	Max throughput per process 			=  363866.16 kB/sec
 	Avg throughput per process 			=  361868.26 kB/sec
 	Min xfer 					= 1035520.00 kB
 	CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
 	Children see throughput for 12 rewriters 	= 4408576.66 kB/sec
 	Parent sees throughput for 12 rewriters 	= 4404280.87 kB/sec
 	Min throughput per process 			=  364553.88 kB/sec
 	Max throughput per process 			=  370029.28 kB/sec
 	Avg throughput per process 			=  367381.39 kB/sec
 	Min xfer 					= 1033216.00 kB
 	CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
 	Children see throughput for 12 readers 		= 5406879.47 kB/sec
 	Parent sees throughput for 12 readers 		= 5401862.78 kB/sec
 	Min throughput per process 			=  449583.03 kB/sec
 	Max throughput per process 			=  451761.69 kB/sec
 	Avg throughput per process 			=  450573.29 kB/sec
 	Min xfer 					= 1044224.00 kB
 	CPU utilization: Wall time    2.323    CPU time    1.977    CPU utilization  85.12 %
 	Children see throughput for 12 re-readers 	= 5410601.12 kB/sec
 	Parent sees throughput for 12 re-readers 	= 5403504.40 kB/sec
 	Min throughput per process 			=  449918.12 kB/sec
 	Max throughput per process 			=  452489.28 kB/sec
 	Avg throughput per process 			=  450883.43 kB/sec
 	Min xfer 					= 1043456.00 kB
 	CPU utilization: Wall time    2.321    CPU time    1.978    CPU utilization  85.21 %

And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
the iommu ops"). Significant throughput loss.

	Children see throughput for 12 initial writers 	= 3812036.91 kB/sec
 	Parent sees throughput for 12 initial writers 	= 3753683.40 kB/sec
 	Min throughput per process 			=  313672.25 kB/sec
 	Max throughput per process 			=  321719.44 kB/sec
 	Avg throughput per process 			=  317669.74 kB/sec
 	Min xfer 					= 1022464.00 kB
 	CPU Utilization: Wall time    3.309    CPU time    1.986    CPU utilization  60.02 %
 	Children see throughput for 12 rewriters 	= 3786831.94 kB/sec
 	Parent sees throughput for 12 rewriters 	= 3783205.58 kB/sec
 	Min throughput per process 			=  313654.44 kB/sec
 	Max throughput per process 			=  317844.50 kB/sec
 	Avg throughput per process 			=  315569.33 kB/sec
 	Min xfer 					= 1035520.00 kB
 	CPU utilization: Wall time    3.302    CPU time    1.945    CPU utilization  58.90 %
 	Children see throughput for 12 readers 		= 4265828.28 kB/sec
 	Parent sees throughput for 12 readers 		= 4261844.88 kB/sec
 	Min throughput per process 			=  352305.00 kB/sec
 	Max throughput per process 			=  357726.22 kB/sec
 	Avg throughput per process 			=  355485.69 kB/sec
 	Min xfer 					= 1032960.00 kB
 	CPU utilization: Wall time    2.934    CPU time    1.942    CPU utilization  66.20 %
 	Children see throughput for 12 re-readers 	= 4220651.19 kB/sec
 	Parent sees throughput for 12 re-readers 	= 4216096.04 kB/sec
 	Min throughput per process 			=  348677.16 kB/sec
 	Max throughput per process 			=  353467.44 kB/sec
 	Avg throughput per process 			=  351720.93 kB/sec
 	Min xfer 					= 1035264.00 kB
 	CPU utilization: Wall time    2.969    CPU time    1.952    CPU utilization  65.74 %

The regression appears to be 100% reproducible. 


--
Chuck Lever




* Re: performance regression noted in v5.11-rc after c062db039f40
From: Will Deacon @ 2021-01-12 14:38 UTC
  To: Chuck Lever
  Cc: iommu, linux-rdma, baolu.lu, logang, hch, murphyt7, robin.murphy

[Expanding cc list to include DMA-IOMMU and intel IOMMU folks]

On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
> Hi-
> 
> [ Please cc: me on replies, I'm not currently subscribed to
> iommu@lists ].
> 
> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
> 
> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
> 
> For those not familiar with the way storage protocols use RDMA, The
> initiator/client sets up memory regions and the target/server uses
> RDMA Read and Write to move data out of and into those regions. The
> initiator/client uses only RDMA memory registration and invalidation
> operations, and the target/server uses RDMA Read and Write.
> 
> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
> enabled using the kernel command line options "intel_iommu=on
> iommu=strict".
> 
> Recently I've noticed a significant (25-30%) loss in NFS throughput.
> I was able to bisect on my client to the following commits.
> 
> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
> map_sg"). This is about normal for this test.
> 
> 	Children see throughput for 12 initial writers 	= 4732581.09 kB/sec
>  	Parent sees throughput for 12 initial writers 	= 4646810.21 kB/sec
>  	Min throughput per process 			=  387764.34 kB/sec
>  	Max throughput per process 			=  399655.47 kB/sec
>  	Avg throughput per process 			=  394381.76 kB/sec
>  	Min xfer 					= 1017344.00 kB
>  	CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
>  	Children see throughput for 12 rewriters 	= 4837741.94 kB/sec
>  	Parent sees throughput for 12 rewriters 	= 4833509.35 kB/sec
>  	Min throughput per process 			=  398983.72 kB/sec
>  	Max throughput per process 			=  406199.66 kB/sec
>  	Avg throughput per process 			=  403145.16 kB/sec
>  	Min xfer 					= 1030656.00 kB
>  	CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
>  	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>  	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>  	Min throughput per process 			=  491812.38 kB/sec
>  	Max throughput per process 			=  494777.28 kB/sec
>  	Avg throughput per process 			=  493447.58 kB/sec
>  	Min xfer 					= 1042688.00 kB
>  	CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
>  	Children see throughput for 12 re-readers 	= 5947985.69 kB/sec
>  	Parent sees throughput for 12 re-readers 	= 5941348.51 kB/sec
>  	Min throughput per process 			=  492805.81 kB/sec
>  	Max throughput per process 			=  497280.19 kB/sec
>  	Avg throughput per process 			=  495665.47 kB/sec
>  	Min xfer 					= 1039360.00 kB
>  	CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %
> 
> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
> iommu_ops.at(de)tach_dev"). It's losing some steam here.
> 
> 	Children see throughput for 12 initial writers 	= 4342419.12 kB/sec
>  	Parent sees throughput for 12 initial writers 	= 4310612.79 kB/sec
>  	Min throughput per process 			=  359299.06 kB/sec
>  	Max throughput per process 			=  363866.16 kB/sec
>  	Avg throughput per process 			=  361868.26 kB/sec
>  	Min xfer 					= 1035520.00 kB
>  	CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
>  	Children see throughput for 12 rewriters 	= 4408576.66 kB/sec
>  	Parent sees throughput for 12 rewriters 	= 4404280.87 kB/sec
>  	Min throughput per process 			=  364553.88 kB/sec
>  	Max throughput per process 			=  370029.28 kB/sec
>  	Avg throughput per process 			=  367381.39 kB/sec
>  	Min xfer 					= 1033216.00 kB
>  	CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
>  	Children see throughput for 12 readers 		= 5406879.47 kB/sec
>  	Parent sees throughput for 12 readers 		= 5401862.78 kB/sec
>  	Min throughput per process 			=  449583.03 kB/sec
>  	Max throughput per process 			=  451761.69 kB/sec
>  	Avg throughput per process 			=  450573.29 kB/sec
>  	Min xfer 					= 1044224.00 kB
>  	CPU utilization: Wall time    2.323    CPU time    1.977    CPU utilization  85.12 %
>  	Children see throughput for 12 re-readers 	= 5410601.12 kB/sec
>  	Parent sees throughput for 12 re-readers 	= 5403504.40 kB/sec
>  	Min throughput per process 			=  449918.12 kB/sec
>  	Max throughput per process 			=  452489.28 kB/sec
>  	Avg throughput per process 			=  450883.43 kB/sec
>  	Min xfer 					= 1043456.00 kB
>  	CPU utilization: Wall time    2.321    CPU time    1.978    CPU utilization  85.21 %
> 
> And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
> the iommu ops"). Significant throughput loss.
> 
> 	Children see throughput for 12 initial writers 	= 3812036.91 kB/sec
>  	Parent sees throughput for 12 initial writers 	= 3753683.40 kB/sec
>  	Min throughput per process 			=  313672.25 kB/sec
>  	Max throughput per process 			=  321719.44 kB/sec
>  	Avg throughput per process 			=  317669.74 kB/sec
>  	Min xfer 					= 1022464.00 kB
>  	CPU Utilization: Wall time    3.309    CPU time    1.986    CPU utilization  60.02 %
>  	Children see throughput for 12 rewriters 	= 3786831.94 kB/sec
>  	Parent sees throughput for 12 rewriters 	= 3783205.58 kB/sec
>  	Min throughput per process 			=  313654.44 kB/sec
>  	Max throughput per process 			=  317844.50 kB/sec
>  	Avg throughput per process 			=  315569.33 kB/sec
>  	Min xfer 					= 1035520.00 kB
>  	CPU utilization: Wall time    3.302    CPU time    1.945    CPU utilization  58.90 %
>  	Children see throughput for 12 readers 		= 4265828.28 kB/sec
>  	Parent sees throughput for 12 readers 		= 4261844.88 kB/sec
>  	Min throughput per process 			=  352305.00 kB/sec
>  	Max throughput per process 			=  357726.22 kB/sec
>  	Avg throughput per process 			=  355485.69 kB/sec
>  	Min xfer 					= 1032960.00 kB
>  	CPU utilization: Wall time    2.934    CPU time    1.942    CPU utilization  66.20 %
>  	Children see throughput for 12 re-readers 	= 4220651.19 kB/sec
>  	Parent sees throughput for 12 re-readers 	= 4216096.04 kB/sec
>  	Min throughput per process 			=  348677.16 kB/sec
>  	Max throughput per process 			=  353467.44 kB/sec
>  	Avg throughput per process 			=  351720.93 kB/sec
>  	Min xfer 					= 1035264.00 kB
>  	CPU utilization: Wall time    2.969    CPU time    1.952    CPU utilization  65.74 %
> 
> The regression appears to be 100% reproducible. 
> 
> 
> --
> Chuck Lever
> 
> 
> 

* Re: performance regression noted in v5.11-rc after c062db039f40
From: Lu Baolu @ 2021-01-13  2:25 UTC
  To: Will Deacon, Chuck Lever
  Cc: baolu.lu, iommu, linux-rdma, logang, hch, murphyt7, robin.murphy

Hi,

On 1/12/21 10:38 PM, Will Deacon wrote:
> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
> 
> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
>> Hi-
>>
>> [ Please cc: me on replies, I'm not currently subscribed to
>> iommu@lists ].
>>
>> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
>> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
>>
>> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
>>
>> For those not familiar with the way storage protocols use RDMA, The
>> initiator/client sets up memory regions and the target/server uses
>> RDMA Read and Write to move data out of and into those regions. The
>> initiator/client uses only RDMA memory registration and invalidation
>> operations, and the target/server uses RDMA Read and Write.
>>
>> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
>> enabled using the kernel command line options "intel_iommu=on
>> iommu=strict".
>>
>> Recently I've noticed a significant (25-30%) loss in NFS throughput.
>> I was able to bisect on my client to the following commits.
>>
>> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>> map_sg"). This is about normal for this test.
>>
>> 	Children see throughput for 12 initial writers 	= 4732581.09 kB/sec
>>   	Parent sees throughput for 12 initial writers 	= 4646810.21 kB/sec
>>   	Min throughput per process 			=  387764.34 kB/sec
>>   	Max throughput per process 			=  399655.47 kB/sec
>>   	Avg throughput per process 			=  394381.76 kB/sec
>>   	Min xfer 					= 1017344.00 kB
>>   	CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
>>   	Children see throughput for 12 rewriters 	= 4837741.94 kB/sec
>>   	Parent sees throughput for 12 rewriters 	= 4833509.35 kB/sec
>>   	Min throughput per process 			=  398983.72 kB/sec
>>   	Max throughput per process 			=  406199.66 kB/sec
>>   	Avg throughput per process 			=  403145.16 kB/sec
>>   	Min xfer 					= 1030656.00 kB
>>   	CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
>>   	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>>   	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>>   	Min throughput per process 			=  491812.38 kB/sec
>>   	Max throughput per process 			=  494777.28 kB/sec
>>   	Avg throughput per process 			=  493447.58 kB/sec
>>   	Min xfer 					= 1042688.00 kB
>>   	CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
>>   	Children see throughput for 12 re-readers 	= 5947985.69 kB/sec
>>   	Parent sees throughput for 12 re-readers 	= 5941348.51 kB/sec
>>   	Min throughput per process 			=  492805.81 kB/sec
>>   	Max throughput per process 			=  497280.19 kB/sec
>>   	Avg throughput per process 			=  495665.47 kB/sec
>>   	Min xfer 					= 1039360.00 kB
>>   	CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %
>>
>> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
>> iommu_ops.at(de)tach_dev"). It's losing some steam here.
>>
>> 	Children see throughput for 12 initial writers 	= 4342419.12 kB/sec
>>   	Parent sees throughput for 12 initial writers 	= 4310612.79 kB/sec
>>   	Min throughput per process 			=  359299.06 kB/sec
>>   	Max throughput per process 			=  363866.16 kB/sec
>>   	Avg throughput per process 			=  361868.26 kB/sec
>>   	Min xfer 					= 1035520.00 kB
>>   	CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
>>   	Children see throughput for 12 rewriters 	= 4408576.66 kB/sec
>>   	Parent sees throughput for 12 rewriters 	= 4404280.87 kB/sec
>>   	Min throughput per process 			=  364553.88 kB/sec
>>   	Max throughput per process 			=  370029.28 kB/sec
>>   	Avg throughput per process 			=  367381.39 kB/sec
>>   	Min xfer 					= 1033216.00 kB
>>   	CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
>>   	Children see throughput for 12 readers 		= 5406879.47 kB/sec
>>   	Parent sees throughput for 12 readers 		= 5401862.78 kB/sec
>>   	Min throughput per process 			=  449583.03 kB/sec
>>   	Max throughput per process 			=  451761.69 kB/sec
>>   	Avg throughput per process 			=  450573.29 kB/sec
>>   	Min xfer 					= 1044224.00 kB
>>   	CPU utilization: Wall time    2.323    CPU time    1.977    CPU utilization  85.12 %
>>   	Children see throughput for 12 re-readers 	= 5410601.12 kB/sec
>>   	Parent sees throughput for 12 re-readers 	= 5403504.40 kB/sec
>>   	Min throughput per process 			=  449918.12 kB/sec
>>   	Max throughput per process 			=  452489.28 kB/sec
>>   	Avg throughput per process 			=  450883.43 kB/sec
>>   	Min xfer 					= 1043456.00 kB
>>   	CPU utilization: Wall time    2.321    CPU time    1.978    CPU utilization  85.21 %
>>
>> And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
>> the iommu ops"). Significant throughput loss.
>>
>> 	Children see throughput for 12 initial writers 	= 3812036.91 kB/sec
>>   	Parent sees throughput for 12 initial writers 	= 3753683.40 kB/sec
>>   	Min throughput per process 			=  313672.25 kB/sec
>>   	Max throughput per process 			=  321719.44 kB/sec
>>   	Avg throughput per process 			=  317669.74 kB/sec
>>   	Min xfer 					= 1022464.00 kB
>>   	CPU Utilization: Wall time    3.309    CPU time    1.986    CPU utilization  60.02 %
>>   	Children see throughput for 12 rewriters 	= 3786831.94 kB/sec
>>   	Parent sees throughput for 12 rewriters 	= 3783205.58 kB/sec
>>   	Min throughput per process 			=  313654.44 kB/sec
>>   	Max throughput per process 			=  317844.50 kB/sec
>>   	Avg throughput per process 			=  315569.33 kB/sec
>>   	Min xfer 					= 1035520.00 kB
>>   	CPU utilization: Wall time    3.302    CPU time    1.945    CPU utilization  58.90 %
>>   	Children see throughput for 12 readers 		= 4265828.28 kB/sec
>>   	Parent sees throughput for 12 readers 		= 4261844.88 kB/sec
>>   	Min throughput per process 			=  352305.00 kB/sec
>>   	Max throughput per process 			=  357726.22 kB/sec
>>   	Avg throughput per process 			=  355485.69 kB/sec
>>   	Min xfer 					= 1032960.00 kB
>>   	CPU utilization: Wall time    2.934    CPU time    1.942    CPU utilization  66.20 %
>>   	Children see throughput for 12 re-readers 	= 4220651.19 kB/sec
>>   	Parent sees throughput for 12 re-readers 	= 4216096.04 kB/sec
>>   	Min throughput per process 			=  348677.16 kB/sec
>>   	Max throughput per process 			=  353467.44 kB/sec
>>   	Avg throughput per process 			=  351720.93 kB/sec
>>   	Min xfer 					= 1035264.00 kB
>>   	CPU utilization: Wall time    2.969    CPU time    1.952    CPU utilization  65.74 %
>>
>> The regression appears to be 100% reproducible.

The commit 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
map_sg") is a temporary workaround. We have reverted it recently (in
5.11-rc3). Can you please try a kernel at -rc3 or later?
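
One way to double-check that the revert is present in the tree under
test (the range and path below are my guess at where it landed):

git log --oneline v5.11-rc2..v5.11-rc3 -- drivers/iommu | grep -i revert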

Best regards,
baolu

* Re: performance regression noted in v5.11-rc after c062db039f40
From: Chuck Lever @ 2021-01-13 14:07 UTC
  To: Lu Baolu
  Cc: Will Deacon, iommu, linux-rdma, logang, hch, murphyt7, robin.murphy



> On Jan 12, 2021, at 9:25 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
> 
> Hi,
> 
> On 1/12/21 10:38 PM, Will Deacon wrote:
>> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
>> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
>>> Hi-
>>> 
>>> [ Please cc: me on replies, I'm not currently subscribed to
>>> iommu@lists ].
>>> 
>>> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
>>> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
>>> 
>>> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
>>> 
>>> For those not familiar with the way storage protocols use RDMA, The
>>> initiator/client sets up memory regions and the target/server uses
>>> RDMA Read and Write to move data out of and into those regions. The
>>> initiator/client uses only RDMA memory registration and invalidation
>>> operations, and the target/server uses RDMA Read and Write.
>>> 
>>> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
>>> enabled using the kernel command line options "intel_iommu=on
>>> iommu=strict".
>>> 
>>> Recently I've noticed a significant (25-30%) loss in NFS throughput.
>>> I was able to bisect on my client to the following commits.
>>> 
>>> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>>> map_sg"). This is about normal for this test.
>>> 
>>> 	Children see throughput for 12 initial writers 	= 4732581.09 kB/sec
>>>  	Parent sees throughput for 12 initial writers 	= 4646810.21 kB/sec
>>>  	Min throughput per process 			=  387764.34 kB/sec
>>>  	Max throughput per process 			=  399655.47 kB/sec
>>>  	Avg throughput per process 			=  394381.76 kB/sec
>>>  	Min xfer 					= 1017344.00 kB
>>>  	CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
>>>  	Children see throughput for 12 rewriters 	= 4837741.94 kB/sec
>>>  	Parent sees throughput for 12 rewriters 	= 4833509.35 kB/sec
>>>  	Min throughput per process 			=  398983.72 kB/sec
>>>  	Max throughput per process 			=  406199.66 kB/sec
>>>  	Avg throughput per process 			=  403145.16 kB/sec
>>>  	Min xfer 					= 1030656.00 kB
>>>  	CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
>>>  	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>>>  	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>>>  	Min throughput per process 			=  491812.38 kB/sec
>>>  	Max throughput per process 			=  494777.28 kB/sec
>>>  	Avg throughput per process 			=  493447.58 kB/sec
>>>  	Min xfer 					= 1042688.00 kB
>>>  	CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
>>>  	Children see throughput for 12 re-readers 	= 5947985.69 kB/sec
>>>  	Parent sees throughput for 12 re-readers 	= 5941348.51 kB/sec
>>>  	Min throughput per process 			=  492805.81 kB/sec
>>>  	Max throughput per process 			=  497280.19 kB/sec
>>>  	Avg throughput per process 			=  495665.47 kB/sec
>>>  	Min xfer 					= 1039360.00 kB
>>>  	CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %
>>> 
>>> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
>>> iommu_ops.at(de)tach_dev"). It's losing some steam here.
>>> 
>>> 	Children see throughput for 12 initial writers 	= 4342419.12 kB/sec
>>>  	Parent sees throughput for 12 initial writers 	= 4310612.79 kB/sec
>>>  	Min throughput per process 			=  359299.06 kB/sec
>>>  	Max throughput per process 			=  363866.16 kB/sec
>>>  	Avg throughput per process 			=  361868.26 kB/sec
>>>  	Min xfer 					= 1035520.00 kB
>>>  	CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
>>>  	Children see throughput for 12 rewriters 	= 4408576.66 kB/sec
>>>  	Parent sees throughput for 12 rewriters 	= 4404280.87 kB/sec
>>>  	Min throughput per process 			=  364553.88 kB/sec
>>>  	Max throughput per process 			=  370029.28 kB/sec
>>>  	Avg throughput per process 			=  367381.39 kB/sec
>>>  	Min xfer 					= 1033216.00 kB
>>>  	CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
>>>  	Children see throughput for 12 readers 		= 5406879.47 kB/sec
>>>  	Parent sees throughput for 12 readers 		= 5401862.78 kB/sec
>>>  	Min throughput per process 			=  449583.03 kB/sec
>>>  	Max throughput per process 			=  451761.69 kB/sec
>>>  	Avg throughput per process 			=  450573.29 kB/sec
>>>  	Min xfer 					= 1044224.00 kB
>>>  	CPU utilization: Wall time    2.323    CPU time    1.977    CPU utilization  85.12 %
>>>  	Children see throughput for 12 re-readers 	= 5410601.12 kB/sec
>>>  	Parent sees throughput for 12 re-readers 	= 5403504.40 kB/sec
>>>  	Min throughput per process 			=  449918.12 kB/sec
>>>  	Max throughput per process 			=  452489.28 kB/sec
>>>  	Avg throughput per process 			=  450883.43 kB/sec
>>>  	Min xfer 					= 1043456.00 kB
>>>  	CPU utilization: Wall time    2.321    CPU time    1.978    CPU utilization  85.21 %
>>> 
>>> And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
>>> the iommu ops"). Significant throughput loss.
>>> 
>>> 	Children see throughput for 12 initial writers 	= 3812036.91 kB/sec
>>>  	Parent sees throughput for 12 initial writers 	= 3753683.40 kB/sec
>>>  	Min throughput per process 			=  313672.25 kB/sec
>>>  	Max throughput per process 			=  321719.44 kB/sec
>>>  	Avg throughput per process 			=  317669.74 kB/sec
>>>  	Min xfer 					= 1022464.00 kB
>>>  	CPU Utilization: Wall time    3.309    CPU time    1.986    CPU utilization  60.02 %
>>>  	Children see throughput for 12 rewriters 	= 3786831.94 kB/sec
>>>  	Parent sees throughput for 12 rewriters 	= 3783205.58 kB/sec
>>>  	Min throughput per process 			=  313654.44 kB/sec
>>>  	Max throughput per process 			=  317844.50 kB/sec
>>>  	Avg throughput per process 			=  315569.33 kB/sec
>>>  	Min xfer 					= 1035520.00 kB
>>>  	CPU utilization: Wall time    3.302    CPU time    1.945    CPU utilization  58.90 %
>>>  	Children see throughput for 12 readers 		= 4265828.28 kB/sec
>>>  	Parent sees throughput for 12 readers 		= 4261844.88 kB/sec
>>>  	Min throughput per process 			=  352305.00 kB/sec
>>>  	Max throughput per process 			=  357726.22 kB/sec
>>>  	Avg throughput per process 			=  355485.69 kB/sec
>>>  	Min xfer 					= 1032960.00 kB
>>>  	CPU utilization: Wall time    2.934    CPU time    1.942    CPU utilization  66.20 %
>>>  	Children see throughput for 12 re-readers 	= 4220651.19 kB/sec
>>>  	Parent sees throughput for 12 re-readers 	= 4216096.04 kB/sec
>>>  	Min throughput per process 			=  348677.16 kB/sec
>>>  	Max throughput per process 			=  353467.44 kB/sec
>>>  	Avg throughput per process 			=  351720.93 kB/sec
>>>  	Min xfer 					= 1035264.00 kB
>>>  	CPU utilization: Wall time    2.969    CPU time    1.952    CPU utilization  65.74 %
>>> 
>>> The regression appears to be 100% reproducible.
> 
> The commit 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
> map_sg") is a temporary workaround. We have reverted it recently (5.11-
> rc3). Can you please try the a kernel version after -rc3?

I don't see a change in write results with v5.11-rc3 (the initial
writers are still roughly 19% below the 65f746e8285f numbers above),
but read throughput appears to improve a little: the readers are now
about 24% below that baseline rather than 28%.


	Children see throughput for 12 initial writers 	= 3854295.72 kB/sec
	Parent sees throughput for 12 initial writers 	= 3744064.85 kB/sec
	Min throughput per process 			=  313499.41 kB/sec 
	Max throughput per process 			=  328151.44 kB/sec
	Avg throughput per process 			=  321191.31 kB/sec
	Min xfer 					= 1001728.00 kB
	CPU Utilization: Wall time    3.289    CPU time    2.075    CPU utilization  63.10 %


	Children see throughput for 12 rewriters 	= 3692675.22 kB/sec
	Parent sees throughput for 12 rewriters 	= 3688975.23 kB/sec
	Min throughput per process 			=  304863.84 kB/sec 
	Max throughput per process 			=  311000.16 kB/sec
	Avg throughput per process 			=  307722.93 kB/sec
	Min xfer 					= 1028096.00 kB
	CPU utilization: Wall time    3.375    CPU time    2.051    CPU utilization  60.76 %


	Children see throughput for 12 readers 		= 4521975.69 kB/sec
	Parent sees throughput for 12 readers 		= 4516965.08 kB/sec
	Min throughput per process 			=  372762.16 kB/sec 
	Max throughput per process 			=  382233.84 kB/sec
	Avg throughput per process 			=  376831.31 kB/sec
	Min xfer 					= 1022720.00 kB
	CPU utilization: Wall time    2.747    CPU time    1.961    CPU utilization  71.39 %


	Children see throughput for 12 re-readers 	= 4684127.06 kB/sec
	Parent sees throughput for 12 re-readers 	= 4678990.23 kB/sec
	Min throughput per process 			=  385586.34 kB/sec 
	Max throughput per process 			=  395542.47 kB/sec
	Avg throughput per process 			=  390343.92 kB/sec
	Min xfer 					= 1022208.00 kB
	CPU utilization: Wall time    2.653    CPU time    1.941    CPU utilization  73.16 %



--
Chuck Lever




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: performance regression noted in v5.11-rc after c062db039f40
@ 2021-01-13 14:07       ` Chuck Lever
  0 siblings, 0 replies; 36+ messages in thread
From: Chuck Lever @ 2021-01-13 14:07 UTC (permalink / raw)
  To: Lu Baolu
  Cc: linux-rdma, Will Deacon, robin.murphy, murphyt7, iommu, logang, hch



> On Jan 12, 2021, at 9:25 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
> 
> Hi,
> 
> On 1/12/21 10:38 PM, Will Deacon wrote:
>> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
>> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
>>> Hi-
>>> 
>>> [ Please cc: me on replies, I'm not currently subscribed to
>>> iommu@lists ].
>>> 
>>> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
>>> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
>>> 
>>> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
>>> 
>>> For those not familiar with the way storage protocols use RDMA, The
>>> initiator/client sets up memory regions and the target/server uses
>>> RDMA Read and Write to move data out of and into those regions. The
>>> initiator/client uses only RDMA memory registration and invalidation
>>> operations, and the target/server uses RDMA Read and Write.
>>> 
>>> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
>>> enabled using the kernel command line options "intel_iommu=on
>>> iommu=strict".
>>> 
>>> Recently I've noticed a significant (25-30%) loss in NFS throughput.
>>> I was able to bisect on my client to the following commits.
>>> 
>>> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>>> map_sg"). This is about normal for this test.
>>> 
>>> 	Children see throughput for 12 initial writers 	= 4732581.09 kB/sec
>>>  	Parent sees throughput for 12 initial writers 	= 4646810.21 kB/sec
>>>  	Min throughput per process 			=  387764.34 kB/sec
>>>  	Max throughput per process 			=  399655.47 kB/sec
>>>  	Avg throughput per process 			=  394381.76 kB/sec
>>>  	Min xfer 					= 1017344.00 kB
>>>  	CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
>>>  	Children see throughput for 12 rewriters 	= 4837741.94 kB/sec
>>>  	Parent sees throughput for 12 rewriters 	= 4833509.35 kB/sec
>>>  	Min throughput per process 			=  398983.72 kB/sec
>>>  	Max throughput per process 			=  406199.66 kB/sec
>>>  	Avg throughput per process 			=  403145.16 kB/sec
>>>  	Min xfer 					= 1030656.00 kB
>>>  	CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
>>>  	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>>>  	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>>>  	Min throughput per process 			=  491812.38 kB/sec
>>>  	Max throughput per process 			=  494777.28 kB/sec
>>>  	Avg throughput per process 			=  493447.58 kB/sec
>>>  	Min xfer 					= 1042688.00 kB
>>>  	CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
>>>  	Children see throughput for 12 re-readers 	= 5947985.69 kB/sec
>>>  	Parent sees throughput for 12 re-readers 	= 5941348.51 kB/sec
>>>  	Min throughput per process 			=  492805.81 kB/sec
>>>  	Max throughput per process 			=  497280.19 kB/sec
>>>  	Avg throughput per process 			=  495665.47 kB/sec
>>>  	Min xfer 					= 1039360.00 kB
>>>  	CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %
>>> 
>>> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
>>> iommu_ops.at(de)tach_dev"). It's losing some steam here.
>>> 
>>> 	Children see throughput for 12 initial writers 	= 4342419.12 kB/sec
>>>  	Parent sees throughput for 12 initial writers 	= 4310612.79 kB/sec
>>>  	Min throughput per process 			=  359299.06 kB/sec
>>>  	Max throughput per process 			=  363866.16 kB/sec
>>>  	Avg throughput per process 			=  361868.26 kB/sec
>>>  	Min xfer 					= 1035520.00 kB
>>>  	CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
>>>  	Children see throughput for 12 rewriters 	= 4408576.66 kB/sec
>>>  	Parent sees throughput for 12 rewriters 	= 4404280.87 kB/sec
>>>  	Min throughput per process 			=  364553.88 kB/sec
>>>  	Max throughput per process 			=  370029.28 kB/sec
>>>  	Avg throughput per process 			=  367381.39 kB/sec
>>>  	Min xfer 					= 1033216.00 kB
>>>  	CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
>>>  	Children see throughput for 12 readers 		= 5406879.47 kB/sec
>>>  	Parent sees throughput for 12 readers 		= 5401862.78 kB/sec
>>>  	Min throughput per process 			=  449583.03 kB/sec
>>>  	Max throughput per process 			=  451761.69 kB/sec
>>>  	Avg throughput per process 			=  450573.29 kB/sec
>>>  	Min xfer 					= 1044224.00 kB
>>>  	CPU utilization: Wall time    2.323    CPU time    1.977    CPU utilization  85.12 %
>>>  	Children see throughput for 12 re-readers 	= 5410601.12 kB/sec
>>>  	Parent sees throughput for 12 re-readers 	= 5403504.40 kB/sec
>>>  	Min throughput per process 			=  449918.12 kB/sec
>>>  	Max throughput per process 			=  452489.28 kB/sec
>>>  	Avg throughput per process 			=  450883.43 kB/sec
>>>  	Min xfer 					= 1043456.00 kB
>>>  	CPU utilization: Wall time    2.321    CPU time    1.978    CPU utilization  85.21 %
>>> 
>>> And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
>>> the iommu ops"). Significant throughput loss.
>>> 
>>> 	Children see throughput for 12 initial writers 	= 3812036.91 kB/sec
>>>  	Parent sees throughput for 12 initial writers 	= 3753683.40 kB/sec
>>>  	Min throughput per process 			=  313672.25 kB/sec
>>>  	Max throughput per process 			=  321719.44 kB/sec
>>>  	Avg throughput per process 			=  317669.74 kB/sec
>>>  	Min xfer 					= 1022464.00 kB
>>>  	CPU Utilization: Wall time    3.309    CPU time    1.986    CPU utilization  60.02 %
>>>  	Children see throughput for 12 rewriters 	= 3786831.94 kB/sec
>>>  	Parent sees throughput for 12 rewriters 	= 3783205.58 kB/sec
>>>  	Min throughput per process 			=  313654.44 kB/sec
>>>  	Max throughput per process 			=  317844.50 kB/sec
>>>  	Avg throughput per process 			=  315569.33 kB/sec
>>>  	Min xfer 					= 1035520.00 kB
>>>  	CPU utilization: Wall time    3.302    CPU time    1.945    CPU utilization  58.90 %
>>>  	Children see throughput for 12 readers 		= 4265828.28 kB/sec
>>>  	Parent sees throughput for 12 readers 		= 4261844.88 kB/sec
>>>  	Min throughput per process 			=  352305.00 kB/sec
>>>  	Max throughput per process 			=  357726.22 kB/sec
>>>  	Avg throughput per process 			=  355485.69 kB/sec
>>>  	Min xfer 					= 1032960.00 kB
>>>  	CPU utilization: Wall time    2.934    CPU time    1.942    CPU utilization  66.20 %
>>>  	Children see throughput for 12 re-readers 	= 4220651.19 kB/sec
>>>  	Parent sees throughput for 12 re-readers 	= 4216096.04 kB/sec
>>>  	Min throughput per process 			=  348677.16 kB/sec
>>>  	Max throughput per process 			=  353467.44 kB/sec
>>>  	Avg throughput per process 			=  351720.93 kB/sec
>>>  	Min xfer 					= 1035264.00 kB
>>>  	CPU utilization: Wall time    2.969    CPU time    1.952    CPU utilization  65.74 %
>>> 
>>> The regression appears to be 100% reproducible.
> 
> The commit 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
> map_sg") is a temporary workaround. We have reverted it recently (5.11-
> rc3). Can you please try a kernel version after -rc3?

I don't see a change in write results with v5.11-rc3, but read throughput
appears to improve a little.


	Children see throughput for 12 initial writers 	= 3854295.72 kB/sec
	Parent sees throughput for 12 initial writers 	= 3744064.85 kB/sec
	Min throughput per process 			=  313499.41 kB/sec 
	Max throughput per process 			=  328151.44 kB/sec
	Avg throughput per process 			=  321191.31 kB/sec
	Min xfer 					= 1001728.00 kB
	CPU Utilization: Wall time    3.289    CPU time    2.075    CPU utilization  63.10 %


	Children see throughput for 12 rewriters 	= 3692675.22 kB/sec
	Parent sees throughput for 12 rewriters 	= 3688975.23 kB/sec
	Min throughput per process 			=  304863.84 kB/sec 
	Max throughput per process 			=  311000.16 kB/sec
	Avg throughput per process 			=  307722.93 kB/sec
	Min xfer 					= 1028096.00 kB
	CPU utilization: Wall time    3.375    CPU time    2.051    CPU utilization  60.76 %


	Children see throughput for 12 readers 		= 4521975.69 kB/sec
	Parent sees throughput for 12 readers 		= 4516965.08 kB/sec
	Min throughput per process 			=  372762.16 kB/sec 
	Max throughput per process 			=  382233.84 kB/sec
	Avg throughput per process 			=  376831.31 kB/sec
	Min xfer 					= 1022720.00 kB
	CPU utilization: Wall time    2.747    CPU time    1.961    CPU utilization  71.39 %


	Children see throughput for 12 re-readers 	= 4684127.06 kB/sec
	Parent sees throughput for 12 re-readers 	= 4678990.23 kB/sec
	Min throughput per process 			=  385586.34 kB/sec 
	Max throughput per process 			=  395542.47 kB/sec
	Avg throughput per process 			=  390343.92 kB/sec
	Min xfer 					= 1022208.00 kB
	CPU utilization: Wall time    2.653    CPU time    1.941    CPU utilization  73.16 %



--
Chuck Lever





* Re: performance regression noted in v5.11-rc after c062db039f40
  2021-01-13 14:07       ` Chuck Lever
@ 2021-01-13 18:30         ` Chuck Lever
  -1 siblings, 0 replies; 36+ messages in thread
From: Chuck Lever @ 2021-01-13 18:30 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Will Deacon, iommu, linux-rdma, logang, hch, murphyt7, robin.murphy



> On Jan 13, 2021, at 9:07 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
> 
> 
> 
>> On Jan 12, 2021, at 9:25 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>> 
>> Hi,
>> 
>> On 1/12/21 10:38 PM, Will Deacon wrote:
>>> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
>>> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
>>>> Hi-
>>>> 
>>>> [ Please cc: me on replies, I'm not currently subscribed to
>>>> iommu@lists ].
>>>> 
>>>> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
>>>> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
>>>> 
>>>> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
>>>> 
>>>> For those not familiar with the way storage protocols use RDMA, The
>>>> initiator/client sets up memory regions and the target/server uses
>>>> RDMA Read and Write to move data out of and into those regions. The
>>>> initiator/client uses only RDMA memory registration and invalidation
>>>> operations, and the target/server uses RDMA Read and Write.
>>>> 
>>>> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
>>>> enabled using the kernel command line options "intel_iommu=on
>>>> iommu=strict".
>>>> 
>>>> Recently I've noticed a significant (25-30%) loss in NFS throughput.
>>>> I was able to bisect on my client to the following commits.
>>>> 
>>>> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>>>> map_sg"). This is about normal for this test.
>>>> 
>>>> 	Children see throughput for 12 initial writers 	= 4732581.09 kB/sec
>>>> 	Parent sees throughput for 12 initial writers 	= 4646810.21 kB/sec
>>>> 	Min throughput per process 			=  387764.34 kB/sec
>>>> 	Max throughput per process 			=  399655.47 kB/sec
>>>> 	Avg throughput per process 			=  394381.76 kB/sec
>>>> 	Min xfer 					= 1017344.00 kB
>>>> 	CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
>>>> 	Children see throughput for 12 rewriters 	= 4837741.94 kB/sec
>>>> 	Parent sees throughput for 12 rewriters 	= 4833509.35 kB/sec
>>>> 	Min throughput per process 			=  398983.72 kB/sec
>>>> 	Max throughput per process 			=  406199.66 kB/sec
>>>> 	Avg throughput per process 			=  403145.16 kB/sec
>>>> 	Min xfer 					= 1030656.00 kB
>>>> 	CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
>>>> 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>>>> 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>>>> 	Min throughput per process 			=  491812.38 kB/sec
>>>> 	Max throughput per process 			=  494777.28 kB/sec
>>>> 	Avg throughput per process 			=  493447.58 kB/sec
>>>> 	Min xfer 					= 1042688.00 kB
>>>> 	CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
>>>> 	Children see throughput for 12 re-readers 	= 5947985.69 kB/sec
>>>> 	Parent sees throughput for 12 re-readers 	= 5941348.51 kB/sec
>>>> 	Min throughput per process 			=  492805.81 kB/sec
>>>> 	Max throughput per process 			=  497280.19 kB/sec
>>>> 	Avg throughput per process 			=  495665.47 kB/sec
>>>> 	Min xfer 					= 1039360.00 kB
>>>> 	CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %
>>>> 
>>>> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
>>>> iommu_ops.at(de)tach_dev"). It's losing some steam here.
>>>> 
>>>> 	Children see throughput for 12 initial writers 	= 4342419.12 kB/sec
>>>> 	Parent sees throughput for 12 initial writers 	= 4310612.79 kB/sec
>>>> 	Min throughput per process 			=  359299.06 kB/sec
>>>> 	Max throughput per process 			=  363866.16 kB/sec
>>>> 	Avg throughput per process 			=  361868.26 kB/sec
>>>> 	Min xfer 					= 1035520.00 kB
>>>> 	CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
>>>> 	Children see throughput for 12 rewriters 	= 4408576.66 kB/sec
>>>> 	Parent sees throughput for 12 rewriters 	= 4404280.87 kB/sec
>>>> 	Min throughput per process 			=  364553.88 kB/sec
>>>> 	Max throughput per process 			=  370029.28 kB/sec
>>>> 	Avg throughput per process 			=  367381.39 kB/sec
>>>> 	Min xfer 					= 1033216.00 kB
>>>> 	CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
>>>> 	Children see throughput for 12 readers 		= 5406879.47 kB/sec
>>>> 	Parent sees throughput for 12 readers 		= 5401862.78 kB/sec
>>>> 	Min throughput per process 			=  449583.03 kB/sec
>>>> 	Max throughput per process 			=  451761.69 kB/sec
>>>> 	Avg throughput per process 			=  450573.29 kB/sec
>>>> 	Min xfer 					= 1044224.00 kB
>>>> 	CPU utilization: Wall time    2.323    CPU time    1.977    CPU utilization  85.12 %
>>>> 	Children see throughput for 12 re-readers 	= 5410601.12 kB/sec
>>>> 	Parent sees throughput for 12 re-readers 	= 5403504.40 kB/sec
>>>> 	Min throughput per process 			=  449918.12 kB/sec
>>>> 	Max throughput per process 			=  452489.28 kB/sec
>>>> 	Avg throughput per process 			=  450883.43 kB/sec
>>>> 	Min xfer 					= 1043456.00 kB
>>>> 	CPU utilization: Wall time    2.321    CPU time    1.978    CPU utilization  85.21 %
>>>> 
>>>> And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
>>>> the iommu ops"). Significant throughput loss.
>>>> 
>>>> 	Children see throughput for 12 initial writers 	= 3812036.91 kB/sec
>>>> 	Parent sees throughput for 12 initial writers 	= 3753683.40 kB/sec
>>>> 	Min throughput per process 			=  313672.25 kB/sec
>>>> 	Max throughput per process 			=  321719.44 kB/sec
>>>> 	Avg throughput per process 			=  317669.74 kB/sec
>>>> 	Min xfer 					= 1022464.00 kB
>>>> 	CPU Utilization: Wall time    3.309    CPU time    1.986    CPU utilization  60.02 %
>>>> 	Children see throughput for 12 rewriters 	= 3786831.94 kB/sec
>>>> 	Parent sees throughput for 12 rewriters 	= 3783205.58 kB/sec
>>>> 	Min throughput per process 			=  313654.44 kB/sec
>>>> 	Max throughput per process 			=  317844.50 kB/sec
>>>> 	Avg throughput per process 			=  315569.33 kB/sec
>>>> 	Min xfer 					= 1035520.00 kB
>>>> 	CPU utilization: Wall time    3.302    CPU time    1.945    CPU utilization  58.90 %
>>>> 	Children see throughput for 12 readers 		= 4265828.28 kB/sec
>>>> 	Parent sees throughput for 12 readers 		= 4261844.88 kB/sec
>>>> 	Min throughput per process 			=  352305.00 kB/sec
>>>> 	Max throughput per process 			=  357726.22 kB/sec
>>>> 	Avg throughput per process 			=  355485.69 kB/sec
>>>> 	Min xfer 					= 1032960.00 kB
>>>> 	CPU utilization: Wall time    2.934    CPU time    1.942    CPU utilization  66.20 %
>>>> 	Children see throughput for 12 re-readers 	= 4220651.19 kB/sec
>>>> 	Parent sees throughput for 12 re-readers 	= 4216096.04 kB/sec
>>>> 	Min throughput per process 			=  348677.16 kB/sec
>>>> 	Max throughput per process 			=  353467.44 kB/sec
>>>> 	Avg throughput per process 			=  351720.93 kB/sec
>>>> 	Min xfer 					= 1035264.00 kB
>>>> 	CPU utilization: Wall time    2.969    CPU time    1.952    CPU utilization  65.74 %
>>>> 
>>>> The regression appears to be 100% reproducible.
>> 
>> The commit 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>> map_sg") is a temporary workaround. We have reverted it recently (5.11-
>> rc3). Can you please try a kernel version after -rc3?
> 
> I don't see a change in write results with v5.11-rc3, but read throughput
> appears to improve a little.
> 
> 
> 	Children see throughput for 12 initial writers 	= 3854295.72 kB/sec
> 	Parent sees throughput for 12 initial writers 	= 3744064.85 kB/sec
> 	Min throughput per process 			=  313499.41 kB/sec 
> 	Max throughput per process 			=  328151.44 kB/sec
> 	Avg throughput per process 			=  321191.31 kB/sec
> 	Min xfer 					= 1001728.00 kB
> 	CPU Utilization: Wall time    3.289    CPU time    2.075    CPU utilization  63.10 %
> 
> 
> 	Children see throughput for 12 rewriters 	= 3692675.22 kB/sec
> 	Parent sees throughput for 12 rewriters 	= 3688975.23 kB/sec
> 	Min throughput per process 			=  304863.84 kB/sec 
> 	Max throughput per process 			=  311000.16 kB/sec
> 	Avg throughput per process 			=  307722.93 kB/sec
> 	Min xfer 					= 1028096.00 kB
> 	CPU utilization: Wall time    3.375    CPU time    2.051    CPU utilization  60.76 %
> 
> 
> 	Children see throughput for 12 readers 		= 4521975.69 kB/sec
> 	Parent sees throughput for 12 readers 		= 4516965.08 kB/sec
> 	Min throughput per process 			=  372762.16 kB/sec 
> 	Max throughput per process 			=  382233.84 kB/sec
> 	Avg throughput per process 			=  376831.31 kB/sec
> 	Min xfer 					= 1022720.00 kB
> 	CPU utilization: Wall time    2.747    CPU time    1.961    CPU utilization  71.39 %
> 
> 
> 	Children see throughput for 12 re-readers 	= 4684127.06 kB/sec
> 	Parent sees throughput for 12 re-readers 	= 4678990.23 kB/sec
> 	Min throughput per process 			=  385586.34 kB/sec 
> 	Max throughput per process 			=  395542.47 kB/sec
> 	Avg throughput per process 			=  390343.92 kB/sec
> 	Min xfer 					= 1022208.00 kB
> 	CPU utilization: Wall time    2.653    CPU time    1.941    CPU utilization  73.16 %

I should also mention that at 65f746e8285f ("iommu: Add quirk for Intel
graphic devices in map_sg") I didn't measure any throughput regression.
It's the following two commits that introduce the problem.
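
If it would help to nail down that attribution, the obvious experiment on
my end is to revert just those two commits on top of v5.11-rc3 and re-run
the same iozone test. Roughly (a sketch only, not something I've run yet;
reverting the conversion commit will almost certainly need manual conflict
fix-ups):

	# sketch: assumes both reverts apply to v5.11-rc3
	git checkout -b iommu-regression-check v5.11-rc3
	git revert --no-edit c588072bba6b c062db039f40
	# rebuild, reboot with "intel_iommu=on iommu=strict", then:
	/home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I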


--
Chuck Lever





* Re: performance regression noted in v5.11-rc after c062db039f40
  2021-01-12 14:38   ` Will Deacon
@ 2021-01-18 16:18     ` Chuck Lever
  -1 siblings, 0 replies; 36+ messages in thread
From: Chuck Lever @ 2021-01-18 16:18 UTC (permalink / raw)
  To: iommu
  Cc: Will Deacon, linux-rdma, Lu Baolu, logang, Christoph Hellwig,
	murphyt7, robin.murphy



> On Jan 12, 2021, at 9:38 AM, Will Deacon <will@kernel.org> wrote:
> 
> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
> 
> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
>> Hi-
>> 
>> [ Please cc: me on replies, I'm not currently subscribed to
>> iommu@lists ].
>> 
>> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
>> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
>> 
>> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
>> 
>> For those not familiar with the way storage protocols use RDMA, The
>> initiator/client sets up memory regions and the target/server uses
>> RDMA Read and Write to move data out of and into those regions. The
>> initiator/client uses only RDMA memory registration and invalidation
>> operations, and the target/server uses RDMA Read and Write.
>> 
>> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
>> enabled using the kernel command line options "intel_iommu=on
>> iommu=strict".
>> 
>> Recently I've noticed a significant (25-30%) loss in NFS throughput.
>> I was able to bisect on my client to the following commits.
>> 
>> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>> map_sg"). This is about normal for this test.
>> 
>> 	Children see throughput for 12 initial writers 	= 4732581.09 kB/sec
>> 	Parent sees throughput for 12 initial writers 	= 4646810.21 kB/sec
>> 	Min throughput per process 			=  387764.34 kB/sec
>> 	Max throughput per process 			=  399655.47 kB/sec
>> 	Avg throughput per process 			=  394381.76 kB/sec
>> 	Min xfer 					= 1017344.00 kB
>> 	CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
>> 	Children see throughput for 12 rewriters 	= 4837741.94 kB/sec
>> 	Parent sees throughput for 12 rewriters 	= 4833509.35 kB/sec
>> 	Min throughput per process 			=  398983.72 kB/sec
>> 	Max throughput per process 			=  406199.66 kB/sec
>> 	Avg throughput per process 			=  403145.16 kB/sec
>> 	Min xfer 					= 1030656.00 kB
>> 	CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
>> 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>> 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>> 	Min throughput per process 			=  491812.38 kB/sec
>> 	Max throughput per process 			=  494777.28 kB/sec
>> 	Avg throughput per process 			=  493447.58 kB/sec
>> 	Min xfer 					= 1042688.00 kB
>> 	CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
>> 	Children see throughput for 12 re-readers 	= 5947985.69 kB/sec
>> 	Parent sees throughput for 12 re-readers 	= 5941348.51 kB/sec
>> 	Min throughput per process 			=  492805.81 kB/sec
>> 	Max throughput per process 			=  497280.19 kB/sec
>> 	Avg throughput per process 			=  495665.47 kB/sec
>> 	Min xfer 					= 1039360.00 kB
>> 	CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %
>> 
>> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
>> iommu_ops.at(de)tach_dev"). It's losing some steam here.
>> 
>> 	Children see throughput for 12 initial writers 	= 4342419.12 kB/sec
>> 	Parent sees throughput for 12 initial writers 	= 4310612.79 kB/sec
>> 	Min throughput per process 			=  359299.06 kB/sec
>> 	Max throughput per process 			=  363866.16 kB/sec
>> 	Avg throughput per process 			=  361868.26 kB/sec
>> 	Min xfer 					= 1035520.00 kB
>> 	CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
>> 	Children see throughput for 12 rewriters 	= 4408576.66 kB/sec
>> 	Parent sees throughput for 12 rewriters 	= 4404280.87 kB/sec
>> 	Min throughput per process 			=  364553.88 kB/sec
>> 	Max throughput per process 			=  370029.28 kB/sec
>> 	Avg throughput per process 			=  367381.39 kB/sec
>> 	Min xfer 					= 1033216.00 kB
>> 	CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
>> 	Children see throughput for 12 readers 		= 5406879.47 kB/sec
>> 	Parent sees throughput for 12 readers 		= 5401862.78 kB/sec
>> 	Min throughput per process 			=  449583.03 kB/sec
>> 	Max throughput per process 			=  451761.69 kB/sec
>> 	Avg throughput per process 			=  450573.29 kB/sec
>> 	Min xfer 					= 1044224.00 kB
>> 	CPU utilization: Wall time    2.323    CPU time    1.977    CPU utilization  85.12 %
>> 	Children see throughput for 12 re-readers 	= 5410601.12 kB/sec
>> 	Parent sees throughput for 12 re-readers 	= 5403504.40 kB/sec
>> 	Min throughput per process 			=  449918.12 kB/sec
>> 	Max throughput per process 			=  452489.28 kB/sec
>> 	Avg throughput per process 			=  450883.43 kB/sec
>> 	Min xfer 					= 1043456.00 kB
>> 	CPU utilization: Wall time    2.321    CPU time    1.978    CPU utilization  85.21 %
>> 
>> And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
>> the iommu ops"). Significant throughput loss.
>> 
>> 	Children see throughput for 12 initial writers 	= 3812036.91 kB/sec
>> 	Parent sees throughput for 12 initial writers 	= 3753683.40 kB/sec
>> 	Min throughput per process 			=  313672.25 kB/sec
>> 	Max throughput per process 			=  321719.44 kB/sec
>> 	Avg throughput per process 			=  317669.74 kB/sec
>> 	Min xfer 					= 1022464.00 kB
>> 	CPU Utilization: Wall time    3.309    CPU time    1.986    CPU utilization  60.02 %
>> 	Children see throughput for 12 rewriters 	= 3786831.94 kB/sec
>> 	Parent sees throughput for 12 rewriters 	= 3783205.58 kB/sec
>> 	Min throughput per process 			=  313654.44 kB/sec
>> 	Max throughput per process 			=  317844.50 kB/sec
>> 	Avg throughput per process 			=  315569.33 kB/sec
>> 	Min xfer 					= 1035520.00 kB
>> 	CPU utilization: Wall time    3.302    CPU time    1.945    CPU utilization  58.90 %
>> 	Children see throughput for 12 readers 		= 4265828.28 kB/sec
>> 	Parent sees throughput for 12 readers 		= 4261844.88 kB/sec
>> 	Min throughput per process 			=  352305.00 kB/sec
>> 	Max throughput per process 			=  357726.22 kB/sec
>> 	Avg throughput per process 			=  355485.69 kB/sec
>> 	Min xfer 					= 1032960.00 kB
>> 	CPU utilization: Wall time    2.934    CPU time    1.942    CPU utilization  66.20 %
>> 	Children see throughput for 12 re-readers 	= 4220651.19 kB/sec
>> 	Parent sees throughput for 12 re-readers 	= 4216096.04 kB/sec
>> 	Min throughput per process 			=  348677.16 kB/sec
>> 	Max throughput per process 			=  353467.44 kB/sec
>> 	Avg throughput per process 			=  351720.93 kB/sec
>> 	Min xfer 					= 1035264.00 kB
>> 	CPU utilization: Wall time    2.969    CPU time    1.952    CPU utilization  65.74 %
>> 
>> The regression appears to be 100% reproducible.

Any thoughts?

How about some tools to try, or some debugging advice? I don't know where to start.


--
Chuck Lever





* Re: performance regression noted in v5.11-rc after c062db039f40
  2021-01-18 16:18     ` Chuck Lever
@ 2021-01-18 18:00       ` Robin Murphy
  -1 siblings, 0 replies; 36+ messages in thread
From: Robin Murphy @ 2021-01-18 18:00 UTC (permalink / raw)
  To: Chuck Lever, iommu
  Cc: Will Deacon, linux-rdma, Lu Baolu, logang, Christoph Hellwig, murphyt7

On 2021-01-18 16:18, Chuck Lever wrote:
> 
> 
>> On Jan 12, 2021, at 9:38 AM, Will Deacon <will@kernel.org> wrote:
>>
>> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
>>
>> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
>>> Hi-
>>>
>>> [ Please cc: me on replies, I'm not currently subscribed to
>>> iommu@lists ].
>>>
>>> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
>>> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
>>>
>>> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
>>>
>>> For those not familiar with the way storage protocols use RDMA, The
>>> initiator/client sets up memory regions and the target/server uses
>>> RDMA Read and Write to move data out of and into those regions. The
>>> initiator/client uses only RDMA memory registration and invalidation
>>> operations, and the target/server uses RDMA Read and Write.
>>>
>>> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
>>> enabled using the kernel command line options "intel_iommu=on
>>> iommu=strict".
>>>
>>> Recently I've noticed a significant (25-30%) loss in NFS throughput.
>>> I was able to bisect on my client to the following commits.
>>>
>>> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>>> map_sg"). This is about normal for this test.
>>>
>>> 	Children see throughput for 12 initial writers 	= 4732581.09 kB/sec
>>> 	Parent sees throughput for 12 initial writers 	= 4646810.21 kB/sec
>>> 	Min throughput per process 			=  387764.34 kB/sec
>>> 	Max throughput per process 			=  399655.47 kB/sec
>>> 	Avg throughput per process 			=  394381.76 kB/sec
>>> 	Min xfer 					= 1017344.00 kB
>>> 	CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
>>> 	Children see throughput for 12 rewriters 	= 4837741.94 kB/sec
>>> 	Parent sees throughput for 12 rewriters 	= 4833509.35 kB/sec
>>> 	Min throughput per process 			=  398983.72 kB/sec
>>> 	Max throughput per process 			=  406199.66 kB/sec
>>> 	Avg throughput per process 			=  403145.16 kB/sec
>>> 	Min xfer 					= 1030656.00 kB
>>> 	CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
>>> 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>>> 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>>> 	Min throughput per process 			=  491812.38 kB/sec
>>> 	Max throughput per process 			=  494777.28 kB/sec
>>> 	Avg throughput per process 			=  493447.58 kB/sec
>>> 	Min xfer 					= 1042688.00 kB
>>> 	CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
>>> 	Children see throughput for 12 re-readers 	= 5947985.69 kB/sec
>>> 	Parent sees throughput for 12 re-readers 	= 5941348.51 kB/sec
>>> 	Min throughput per process 			=  492805.81 kB/sec
>>> 	Max throughput per process 			=  497280.19 kB/sec
>>> 	Avg throughput per process 			=  495665.47 kB/sec
>>> 	Min xfer 					= 1039360.00 kB
>>> 	CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %
>>>
>>> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
>>> iommu_ops.at(de)tach_dev"). It's losing some steam here.
>>>
>>> 	Children see throughput for 12 initial writers 	= 4342419.12 kB/sec
>>> 	Parent sees throughput for 12 initial writers 	= 4310612.79 kB/sec
>>> 	Min throughput per process 			=  359299.06 kB/sec
>>> 	Max throughput per process 			=  363866.16 kB/sec
>>> 	Avg throughput per process 			=  361868.26 kB/sec
>>> 	Min xfer 					= 1035520.00 kB
>>> 	CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
>>> 	Children see throughput for 12 rewriters 	= 4408576.66 kB/sec
>>> 	Parent sees throughput for 12 rewriters 	= 4404280.87 kB/sec
>>> 	Min throughput per process 			=  364553.88 kB/sec
>>> 	Max throughput per process 			=  370029.28 kB/sec
>>> 	Avg throughput per process 			=  367381.39 kB/sec
>>> 	Min xfer 					= 1033216.00 kB
>>> 	CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
>>> 	Children see throughput for 12 readers 		= 5406879.47 kB/sec
>>> 	Parent sees throughput for 12 readers 		= 5401862.78 kB/sec
>>> 	Min throughput per process 			=  449583.03 kB/sec
>>> 	Max throughput per process 			=  451761.69 kB/sec
>>> 	Avg throughput per process 			=  450573.29 kB/sec
>>> 	Min xfer 					= 1044224.00 kB
>>> 	CPU utilization: Wall time    2.323    CPU time    1.977    CPU utilization  85.12 %
>>> 	Children see throughput for 12 re-readers 	= 5410601.12 kB/sec
>>> 	Parent sees throughput for 12 re-readers 	= 5403504.40 kB/sec
>>> 	Min throughput per process 			=  449918.12 kB/sec
>>> 	Max throughput per process 			=  452489.28 kB/sec
>>> 	Avg throughput per process 			=  450883.43 kB/sec
>>> 	Min xfer 					= 1043456.00 kB
>>> 	CPU utilization: Wall time    2.321    CPU time    1.978    CPU utilization  85.21 %
>>>
>>> And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
>>> the iommu ops"). Significant throughput loss.
>>>
>>> 	Children see throughput for 12 initial writers 	= 3812036.91 kB/sec
>>> 	Parent sees throughput for 12 initial writers 	= 3753683.40 kB/sec
>>> 	Min throughput per process 			=  313672.25 kB/sec
>>> 	Max throughput per process 			=  321719.44 kB/sec
>>> 	Avg throughput per process 			=  317669.74 kB/sec
>>> 	Min xfer 					= 1022464.00 kB
>>> 	CPU Utilization: Wall time    3.309    CPU time    1.986    CPU utilization  60.02 %
>>> 	Children see throughput for 12 rewriters 	= 3786831.94 kB/sec
>>> 	Parent sees throughput for 12 rewriters 	= 3783205.58 kB/sec
>>> 	Min throughput per process 			=  313654.44 kB/sec
>>> 	Max throughput per process 			=  317844.50 kB/sec
>>> 	Avg throughput per process 			=  315569.33 kB/sec
>>> 	Min xfer 					= 1035520.00 kB
>>> 	CPU utilization: Wall time    3.302    CPU time    1.945    CPU utilization  58.90 %
>>> 	Children see throughput for 12 readers 		= 4265828.28 kB/sec
>>> 	Parent sees throughput for 12 readers 		= 4261844.88 kB/sec
>>> 	Min throughput per process 			=  352305.00 kB/sec
>>> 	Max throughput per process 			=  357726.22 kB/sec
>>> 	Avg throughput per process 			=  355485.69 kB/sec
>>> 	Min xfer 					= 1032960.00 kB
>>> 	CPU utilization: Wall time    2.934    CPU time    1.942    CPU utilization  66.20 %
>>> 	Children see throughput for 12 re-readers 	= 4220651.19 kB/sec
>>> 	Parent sees throughput for 12 re-readers 	= 4216096.04 kB/sec
>>> 	Min throughput per process 			=  348677.16 kB/sec
>>> 	Max throughput per process 			=  353467.44 kB/sec
>>> 	Avg throughput per process 			=  351720.93 kB/sec
>>> 	Min xfer 					= 1035264.00 kB
>>> 	CPU utilization: Wall time    2.969    CPU time    1.952    CPU utilization  65.74 %
>>>
>>> The regression appears to be 100% reproducible.
> 
> Any thoughts?
> 
> How about some tools to try, or some debugging advice? I don't know where to start.

I'm not familiar enough with VT-d internals or InfiniBand to have a clue
why the middle commit makes any difference (the calculation itself is
not on a fast path, so AFAICS the worst it could do is change your
maximum DMA address size from 48/57 bits to 47/56, and that seems
relatively benign).

With the last commit, though, at least part of it is likely to be the 
unfortunate inevitable overhead of the internal indirection through the 
IOMMU API. There's a coincidental performance-related thread where we've 
already started pondering some ideas in that area[1] (note that Intel is 
the last one to the party here; AMD has been using this path for a 
while, and it's all that arm64 systems have ever known). I'm not sure if 
there's any difference in the strict invalidation behaviour between the 
IOMMU API calls and the old intel_dma_ops, but I suppose that might be 
worth quickly double-checking as well. I guess the main thing would be 
to do some profiling to see where time is being spent in iommu-dma and 
intel-iommu vs. just different parts of intel-iommu before, and whether 
anything in particular stands out beyond the extra call overhead 
currently incurred by iommu_{map,unmap}.
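
(For concreteness, the sort of thing I have in mind; only a sketch,
assuming perf works on your client, and the symbol filter below is
illustrative rather than an exhaustive list of the interesting functions:

	# sample the whole system with call graphs while the iozone run is in flight
	perf record -a -g -- sleep 30
	# then see where the DMA mapping path is spending its time
	perf report --stdio --no-children --sort symbol | grep -iE 'iommu|dma'

Comparing that between a kernel before c588072bba6b and one after it
should show whether the extra cost really is in the iommu-dma/intel-iommu
call chain or somewhere less obvious.)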

I'm mildly puzzled by your "CPU time" metric remaining more or less 
constant and the utilisation jumping up and down though, or is that only 
counting time in the userspace workload such that spending more time 
busy in the kernel skews it?
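
(Doing the arithmetic on the numbers you posted, the utilization figure
itself does look like a plain CPU-time/wall-time ratio, e.g.

	1.974 / 2.671 = 73.9 %   (initial writers at 65f746e8285f)
	1.986 / 3.309 = 60.0 %   (initial writers at c588072bba6b)

so the percentage seems to fall simply because wall time grows while the
reported CPU time stays roughly constant; whether that CPU time includes
time spent in the kernel is a separate question.)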

Robin.

[1] 
https://lore.kernel.org/linux-iommu/20210112163307.GA1199965@infradead.org/


* Re: performance regression noted in v5.11-rc after c062db039f40
  2021-01-18 18:00       ` Robin Murphy
@ 2021-01-18 20:09         ` Chuck Lever
  -1 siblings, 0 replies; 36+ messages in thread
From: Chuck Lever @ 2021-01-18 20:09 UTC (permalink / raw)
  To: Robin Murphy, Lu Baolu
  Cc: iommu, Will Deacon, linux-rdma, logang, Christoph Hellwig, murphyt7



> On Jan 18, 2021, at 1:00 PM, Robin Murphy <robin.murphy@arm.com> wrote:
> 
> On 2021-01-18 16:18, Chuck Lever wrote:
>>> On Jan 12, 2021, at 9:38 AM, Will Deacon <will@kernel.org> wrote:
>>> 
>>> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
>>> 
>>> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
>>>> Hi-
>>>> 
>>>> [ Please cc: me on replies, I'm not currently subscribed to
>>>> iommu@lists ].
>>>> 
>>>> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
>>>> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
>>>> 
>>>> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
>>>> 
>>>> For those not familiar with the way storage protocols use RDMA, The
>>>> initiator/client sets up memory regions and the target/server uses
>>>> RDMA Read and Write to move data out of and into those regions. The
>>>> initiator/client uses only RDMA memory registration and invalidation
>>>> operations, and the target/server uses RDMA Read and Write.
>>>> 
>>>> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
>>>> enabled using the kernel command line options "intel_iommu=on
>>>> iommu=strict".
>>>> 
>>>> Recently I've noticed a significant (25-30%) loss in NFS throughput.
>>>> I was able to bisect on my client to the following commits.
>>>> 
>>>> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>>>> map_sg"). This is about normal for this test.
>>>> 
>>>> 	Children see throughput for 12 initial writers 	= 4732581.09 kB/sec
>>>> 	Parent sees throughput for 12 initial writers 	= 4646810.21 kB/sec
>>>> 	Min throughput per process 			=  387764.34 kB/sec
>>>> 	Max throughput per process 			=  399655.47 kB/sec
>>>> 	Avg throughput per process 			=  394381.76 kB/sec
>>>> 	Min xfer 					= 1017344.00 kB
>>>> 	CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
>>>> 	Children see throughput for 12 rewriters 	= 4837741.94 kB/sec
>>>> 	Parent sees throughput for 12 rewriters 	= 4833509.35 kB/sec
>>>> 	Min throughput per process 			=  398983.72 kB/sec
>>>> 	Max throughput per process 			=  406199.66 kB/sec
>>>> 	Avg throughput per process 			=  403145.16 kB/sec
>>>> 	Min xfer 					= 1030656.00 kB
>>>> 	CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
>>>> 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>>>> 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>>>> 	Min throughput per process 			=  491812.38 kB/sec
>>>> 	Max throughput per process 			=  494777.28 kB/sec
>>>> 	Avg throughput per process 			=  493447.58 kB/sec
>>>> 	Min xfer 					= 1042688.00 kB
>>>> 	CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
>>>> 	Children see throughput for 12 re-readers 	= 5947985.69 kB/sec
>>>> 	Parent sees throughput for 12 re-readers 	= 5941348.51 kB/sec
>>>> 	Min throughput per process 			=  492805.81 kB/sec
>>>> 	Max throughput per process 			=  497280.19 kB/sec
>>>> 	Avg throughput per process 			=  495665.47 kB/sec
>>>> 	Min xfer 					= 1039360.00 kB
>>>> 	CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %
>>>> 
>>>> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
>>>> iommu_ops.at(de)tach_dev"). It's losing some steam here.
>>>> 
>>>> 	Children see throughput for 12 initial writers 	= 4342419.12 kB/sec
>>>> 	Parent sees throughput for 12 initial writers 	= 4310612.79 kB/sec
>>>> 	Min throughput per process 			=  359299.06 kB/sec
>>>> 	Max throughput per process 			=  363866.16 kB/sec
>>>> 	Avg throughput per process 			=  361868.26 kB/sec
>>>> 	Min xfer 					= 1035520.00 kB
>>>> 	CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
>>>> 	Children see throughput for 12 rewriters 	= 4408576.66 kB/sec
>>>> 	Parent sees throughput for 12 rewriters 	= 4404280.87 kB/sec
>>>> 	Min throughput per process 			=  364553.88 kB/sec
>>>> 	Max throughput per process 			=  370029.28 kB/sec
>>>> 	Avg throughput per process 			=  367381.39 kB/sec
>>>> 	Min xfer 					= 1033216.00 kB
>>>> 	CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
>>>> 	Children see throughput for 12 readers 		= 5406879.47 kB/sec
>>>> 	Parent sees throughput for 12 readers 		= 5401862.78 kB/sec
>>>> 	Min throughput per process 			=  449583.03 kB/sec
>>>> 	Max throughput per process 			=  451761.69 kB/sec
>>>> 	Avg throughput per process 			=  450573.29 kB/sec
>>>> 	Min xfer 					= 1044224.00 kB
>>>> 	CPU utilization: Wall time    2.323    CPU time    1.977    CPU utilization  85.12 %
>>>> 	Children see throughput for 12 re-readers 	= 5410601.12 kB/sec
>>>> 	Parent sees throughput for 12 re-readers 	= 5403504.40 kB/sec
>>>> 	Min throughput per process 			=  449918.12 kB/sec
>>>> 	Max throughput per process 			=  452489.28 kB/sec
>>>> 	Avg throughput per process 			=  450883.43 kB/sec
>>>> 	Min xfer 					= 1043456.00 kB
>>>> 	CPU utilization: Wall time    2.321    CPU time    1.978    CPU utilization  85.21 %
>>>> 
>>>> And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
>>>> the iommu ops"). Significant throughput loss.
>>>> 
>>>> 	Children see throughput for 12 initial writers 	= 3812036.91 kB/sec
>>>> 	Parent sees throughput for 12 initial writers 	= 3753683.40 kB/sec
>>>> 	Min throughput per process 			=  313672.25 kB/sec
>>>> 	Max throughput per process 			=  321719.44 kB/sec
>>>> 	Avg throughput per process 			=  317669.74 kB/sec
>>>> 	Min xfer 					= 1022464.00 kB
>>>> 	CPU Utilization: Wall time    3.309    CPU time    1.986    CPU utilization  60.02 %
>>>> 	Children see throughput for 12 rewriters 	= 3786831.94 kB/sec
>>>> 	Parent sees throughput for 12 rewriters 	= 3783205.58 kB/sec
>>>> 	Min throughput per process 			=  313654.44 kB/sec
>>>> 	Max throughput per process 			=  317844.50 kB/sec
>>>> 	Avg throughput per process 			=  315569.33 kB/sec
>>>> 	Min xfer 					= 1035520.00 kB
>>>> 	CPU utilization: Wall time    3.302    CPU time    1.945    CPU utilization  58.90 %
>>>> 	Children see throughput for 12 readers 		= 4265828.28 kB/sec
>>>> 	Parent sees throughput for 12 readers 		= 4261844.88 kB/sec
>>>> 	Min throughput per process 			=  352305.00 kB/sec
>>>> 	Max throughput per process 			=  357726.22 kB/sec
>>>> 	Avg throughput per process 			=  355485.69 kB/sec
>>>> 	Min xfer 					= 1032960.00 kB
>>>> 	CPU utilization: Wall time    2.934    CPU time    1.942    CPU utilization  66.20 %
>>>> 	Children see throughput for 12 re-readers 	= 4220651.19 kB/sec
>>>> 	Parent sees throughput for 12 re-readers 	= 4216096.04 kB/sec
>>>> 	Min throughput per process 			=  348677.16 kB/sec
>>>> 	Max throughput per process 			=  353467.44 kB/sec
>>>> 	Avg throughput per process 			=  351720.93 kB/sec
>>>> 	Min xfer 					= 1035264.00 kB
>>>> 	CPU utilization: Wall time    2.969    CPU time    1.952    CPU utilization  65.74 %
>>>> 
>>>> The regression appears to be 100% reproducible.
>> Any thoughts?
>> How about some tools to try or debugging advice? I don't know where to start.
> 
> I'm not familiar enough with VT-D internals or Infiniband to have a clue why the middle commit makes any difference (the calculation itself is not on a fast path, so AFAICS the worst it could do is change your maximum DMA address size from 48/57 bits to 47/56, and that seems relatively benign).

Thanks for your response. Understood that you don't have an explanation
for the middle commit (c062db039f40).

However, that's a pretty small and straightforward change, so I've
experimented a bit with that. Commenting out the new code there, I
get some relief:

	Children see throughput for 12 initial writers 	= 4266621.62 kB/sec
	Parent sees throughput for 12 initial writers 	= 4254756.31 kB/sec
	Min throughput per process 			=  354847.75 kB/sec 
	Max throughput per process 			=  356167.59 kB/sec
	Avg throughput per process 			=  355551.80 kB/sec
	Min xfer 					= 1044736.00 kB
	CPU Utilization: Wall time    2.951    CPU time    1.981    CPU utilization  67.11 %


	Children see throughput for 12 rewriters 	= 4314827.34 kB/sec
	Parent sees throughput for 12 rewriters 	= 4310347.32 kB/sec
	Min throughput per process 			=  358599.72 kB/sec 
	Max throughput per process 			=  360319.06 kB/sec
	Avg throughput per process 			=  359568.95 kB/sec
	Min xfer 					= 1043968.00 kB
	CPU utilization: Wall time    2.912    CPU time    2.057    CPU utilization  70.62 %


	Children see throughput for 12 readers 		= 4614004.47 kB/sec
	Parent sees throughput for 12 readers 		= 4609014.68 kB/sec
	Min throughput per process 			=  382414.81 kB/sec 
	Max throughput per process 			=  388519.50 kB/sec
	Avg throughput per process 			=  384500.37 kB/sec
	Min xfer 					= 1032192.00 kB
	CPU utilization: Wall time    2.701    CPU time    1.900    CPU utilization  70.35 %


	Children see throughput for 12 re-readers 	= 4653743.81 kB/sec
	Parent sees throughput for 12 re-readers 	= 4647155.31 kB/sec
	Min throughput per process 			=  384995.69 kB/sec 
	Max throughput per process 			=  390874.09 kB/sec
	Avg throughput per process 			=  387811.98 kB/sec
	Min xfer 					= 1032960.00 kB
	CPU utilization: Wall time    2.684    CPU time    1.907    CPU utilization  71.06 %

I instrumented the code to show the "before" and "after" values.

The value of domain->domain.geometry.aperture_end on my system
before this commit (and before the c062db039f40 code) is:

144,115,188,075,855,871 = 2^57 - 1

The c062db039f40 code sets domain->domain.geometry.aperture_end to:

281,474,976,710,655 = 2^48 - 1
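
(The instrumentation itself was just a printk next to the geometry 
update, roughly like the line below; the placement and the old_end 
local are hypothetical, not the literal hack I used:)

	pr_info("aperture_end: %llu -> %llu\n", (unsigned long long)old_end,
		(unsigned long long)domain->domain.geometry.aperture_end);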

Fwiw, this system uses the Intel C612 chipset with Intel(R) Xeon(R)
E5-2603 v3 @ 1.60GHz CPUs.


My sense is that "CPU time" remains about the same because the problem
isn't manifesting as an increase in instruction path length. Wall time
goes up, CPU time stays the same, the ratio of those (ie, utilization)
drops.
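
For example, taking the initial-writer numbers quoted above and just 
dividing the reported figures:

	utilization = CPU time / wall time

	65f746e8285f:  1.974 / 2.671 = 73.9%
	c588072bba6b:  1.986 / 3.309 = 60.0%

Roughly the same amount of CPU work, spread over noticeably more 
wall-clock time.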


--
Chuck Lever




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: performance regression noted in v5.11-rc after c062db039f40
  2021-01-18 20:09         ` Chuck Lever
@ 2021-01-19  1:22           ` Lu Baolu
  -1 siblings, 0 replies; 36+ messages in thread
From: Lu Baolu @ 2021-01-19  1:22 UTC (permalink / raw)
  To: Chuck Lever, Robin Murphy
  Cc: baolu.lu, iommu, Will Deacon, linux-rdma, logang,
	Christoph Hellwig, murphyt7

Hi Chuck,

On 1/19/21 4:09 AM, Chuck Lever wrote:
> 
> 
>> On Jan 18, 2021, at 1:00 PM, Robin Murphy <robin.murphy@arm.com> wrote:
>>
>> On 2021-01-18 16:18, Chuck Lever wrote:
>>>> On Jan 12, 2021, at 9:38 AM, Will Deacon <will@kernel.org> wrote:
>>>>
>>>> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
>>>>
>>>> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
>>>>> Hi-
>>>>>
>>>>> [ Please cc: me on replies, I'm not currently subscribed to
>>>>> iommu@lists ].
>>>>>
>>>>> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
>>>>> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
>>>>>
>>>>> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
>>>>>
>>>>> For those not familiar with the way storage protocols use RDMA, The
>>>>> initiator/client sets up memory regions and the target/server uses
>>>>> RDMA Read and Write to move data out of and into those regions. The
>>>>> initiator/client uses only RDMA memory registration and invalidation
>>>>> operations, and the target/server uses RDMA Read and Write.
>>>>>
>>>>> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
>>>>> enabled using the kernel command line options "intel_iommu=on
>>>>> iommu=strict".
>>>>>
>>>>> Recently I've noticed a significant (25-30%) loss in NFS throughput.
>>>>> I was able to bisect on my client to the following commits.
>>>>>
>>>>> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>>>>> map_sg"). This is about normal for this test.
>>>>>
>>>>> 	Children see throughput for 12 initial writers 	= 4732581.09 kB/sec
>>>>> 	Parent sees throughput for 12 initial writers 	= 4646810.21 kB/sec
>>>>> 	Min throughput per process 			=  387764.34 kB/sec
>>>>> 	Max throughput per process 			=  399655.47 kB/sec
>>>>> 	Avg throughput per process 			=  394381.76 kB/sec
>>>>> 	Min xfer 					= 1017344.00 kB
>>>>> 	CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
>>>>> 	Children see throughput for 12 rewriters 	= 4837741.94 kB/sec
>>>>> 	Parent sees throughput for 12 rewriters 	= 4833509.35 kB/sec
>>>>> 	Min throughput per process 			=  398983.72 kB/sec
>>>>> 	Max throughput per process 			=  406199.66 kB/sec
>>>>> 	Avg throughput per process 			=  403145.16 kB/sec
>>>>> 	Min xfer 					= 1030656.00 kB
>>>>> 	CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
>>>>> 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>>>>> 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>>>>> 	Min throughput per process 			=  491812.38 kB/sec
>>>>> 	Max throughput per process 			=  494777.28 kB/sec
>>>>> 	Avg throughput per process 			=  493447.58 kB/sec
>>>>> 	Min xfer 					= 1042688.00 kB
>>>>> 	CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
>>>>> 	Children see throughput for 12 re-readers 	= 5947985.69 kB/sec
>>>>> 	Parent sees throughput for 12 re-readers 	= 5941348.51 kB/sec
>>>>> 	Min throughput per process 			=  492805.81 kB/sec
>>>>> 	Max throughput per process 			=  497280.19 kB/sec
>>>>> 	Avg throughput per process 			=  495665.47 kB/sec
>>>>> 	Min xfer 					= 1039360.00 kB
>>>>> 	CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %
>>>>>
>>>>> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
>>>>> iommu_ops.at(de)tach_dev"). It's losing some steam here.
>>>>>
>>>>> 	Children see throughput for 12 initial writers 	= 4342419.12 kB/sec
>>>>> 	Parent sees throughput for 12 initial writers 	= 4310612.79 kB/sec
>>>>> 	Min throughput per process 			=  359299.06 kB/sec
>>>>> 	Max throughput per process 			=  363866.16 kB/sec
>>>>> 	Avg throughput per process 			=  361868.26 kB/sec
>>>>> 	Min xfer 					= 1035520.00 kB
>>>>> 	CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
>>>>> 	Children see throughput for 12 rewriters 	= 4408576.66 kB/sec
>>>>> 	Parent sees throughput for 12 rewriters 	= 4404280.87 kB/sec
>>>>> 	Min throughput per process 			=  364553.88 kB/sec
>>>>> 	Max throughput per process 			=  370029.28 kB/sec
>>>>> 	Avg throughput per process 			=  367381.39 kB/sec
>>>>> 	Min xfer 					= 1033216.00 kB
>>>>> 	CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
>>>>> 	Children see throughput for 12 readers 		= 5406879.47 kB/sec
>>>>> 	Parent sees throughput for 12 readers 		= 5401862.78 kB/sec
>>>>> 	Min throughput per process 			=  449583.03 kB/sec
>>>>> 	Max throughput per process 			=  451761.69 kB/sec
>>>>> 	Avg throughput per process 			=  450573.29 kB/sec
>>>>> 	Min xfer 					= 1044224.00 kB
>>>>> 	CPU utilization: Wall time    2.323    CPU time    1.977    CPU utilization  85.12 %
>>>>> 	Children see throughput for 12 re-readers 	= 5410601.12 kB/sec
>>>>> 	Parent sees throughput for 12 re-readers 	= 5403504.40 kB/sec
>>>>> 	Min throughput per process 			=  449918.12 kB/sec
>>>>> 	Max throughput per process 			=  452489.28 kB/sec
>>>>> 	Avg throughput per process 			=  450883.43 kB/sec
>>>>> 	Min xfer 					= 1043456.00 kB
>>>>> 	CPU utilization: Wall time    2.321    CPU time    1.978    CPU utilization  85.21 %
>>>>>
>>>>> And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
>>>>> the iommu ops"). Significant throughput loss.
>>>>>
>>>>> 	Children see throughput for 12 initial writers 	= 3812036.91 kB/sec
>>>>> 	Parent sees throughput for 12 initial writers 	= 3753683.40 kB/sec
>>>>> 	Min throughput per process 			=  313672.25 kB/sec
>>>>> 	Max throughput per process 			=  321719.44 kB/sec
>>>>> 	Avg throughput per process 			=  317669.74 kB/sec
>>>>> 	Min xfer 					= 1022464.00 kB
>>>>> 	CPU Utilization: Wall time    3.309    CPU time    1.986    CPU utilization  60.02 %
>>>>> 	Children see throughput for 12 rewriters 	= 3786831.94 kB/sec
>>>>> 	Parent sees throughput for 12 rewriters 	= 3783205.58 kB/sec
>>>>> 	Min throughput per process 			=  313654.44 kB/sec
>>>>> 	Max throughput per process 			=  317844.50 kB/sec
>>>>> 	Avg throughput per process 			=  315569.33 kB/sec
>>>>> 	Min xfer 					= 1035520.00 kB
>>>>> 	CPU utilization: Wall time    3.302    CPU time    1.945    CPU utilization  58.90 %
>>>>> 	Children see throughput for 12 readers 		= 4265828.28 kB/sec
>>>>> 	Parent sees throughput for 12 readers 		= 4261844.88 kB/sec
>>>>> 	Min throughput per process 			=  352305.00 kB/sec
>>>>> 	Max throughput per process 			=  357726.22 kB/sec
>>>>> 	Avg throughput per process 			=  355485.69 kB/sec
>>>>> 	Min xfer 					= 1032960.00 kB
>>>>> 	CPU utilization: Wall time    2.934    CPU time    1.942    CPU utilization  66.20 %
>>>>> 	Children see throughput for 12 re-readers 	= 4220651.19 kB/sec
>>>>> 	Parent sees throughput for 12 re-readers 	= 4216096.04 kB/sec
>>>>> 	Min throughput per process 			=  348677.16 kB/sec
>>>>> 	Max throughput per process 			=  353467.44 kB/sec
>>>>> 	Avg throughput per process 			=  351720.93 kB/sec
>>>>> 	Min xfer 					= 1035264.00 kB
>>>>> 	CPU utilization: Wall time    2.969    CPU time    1.952    CPU utilization  65.74 %
>>>>>
>>>>> The regression appears to be 100% reproducible.
>>> Any thoughts?
>>> How about some tools to try or debugging advice? I don't know where to start.
>>
>> I'm not familiar enough with VT-D internals or Infiniband to have a clue why the middle commit makes any difference (the calculation itself is not on a fast path, so AFAICS the worst it could do is change your maximum DMA address size from 48/57 bits to 47/56, and that seems relatively benign).
> 
> Thanks for your response. Understood that you don't have an explanation
> for the middle commit (c062db039f40).
> 
> However, that's a pretty small and straightforward change, so I've
> experimented a bit with that. Commenting out the new code there, I
> get some relief:
> 
> 	Children see throughput for 12 initial writers 	= 4266621.62 kB/sec
> 	Parent sees throughput for 12 initial writers 	= 4254756.31 kB/sec
> 	Min throughput per process 			=  354847.75 kB/sec
> 	Max throughput per process 			=  356167.59 kB/sec
> 	Avg throughput per process 			=  355551.80 kB/sec
> 	Min xfer 					= 1044736.00 kB
> 	CPU Utilization: Wall time    2.951    CPU time    1.981    CPU utilization  67.11 %
> 
> 
> 	Children see throughput for 12 rewriters 	= 4314827.34 kB/sec
> 	Parent sees throughput for 12 rewriters 	= 4310347.32 kB/sec
> 	Min throughput per process 			=  358599.72 kB/sec
> 	Max throughput per process 			=  360319.06 kB/sec
> 	Avg throughput per process 			=  359568.95 kB/sec
> 	Min xfer 					= 1043968.00 kB
> 	CPU utilization: Wall time    2.912    CPU time    2.057    CPU utilization  70.62 %
> 
> 
> 	Children see throughput for 12 readers 		= 4614004.47 kB/sec
> 	Parent sees throughput for 12 readers 		= 4609014.68 kB/sec
> 	Min throughput per process 			=  382414.81 kB/sec
> 	Max throughput per process 			=  388519.50 kB/sec
> 	Avg throughput per process 			=  384500.37 kB/sec
> 	Min xfer 					= 1032192.00 kB
> 	CPU utilization: Wall time    2.701    CPU time    1.900    CPU utilization  70.35 %
> 
> 
> 	Children see throughput for 12 re-readers 	= 4653743.81 kB/sec
> 	Parent sees throughput for 12 re-readers 	= 4647155.31 kB/sec
> 	Min throughput per process 			=  384995.69 kB/sec
> 	Max throughput per process 			=  390874.09 kB/sec
> 	Avg throughput per process 			=  387811.98 kB/sec
> 	Min xfer 					= 1032960.00 kB
> 	CPU utilization: Wall time    2.684    CPU time    1.907    CPU utilization  71.06 %
> 
> I instrumented the code to show the "before" and "after" values.
> 
> The value of domain->domain.geometry.aperture_end on my system
> before this commit (and before the c062db039f40 code) is:
> 
> 144,115,188,075,855,871 = 2^57 - 1

domain->domain.geometry.aperture_end makes no sense before c062db039f40.

> 
> The c062db039f40 code sets domain->domain.geometry.aperture_end to:
> 
> 281,474,976,710,655 = 2^48 - 1

Do you mind posting the cap and ecap of the iommu used by your device?

You can get it via sysfs, for example:

/sys/bus/pci/devices/0000:00:14.0/iommu/intel-iommu# ls
address  cap  domains_supported  domains_used  ecap  version

> 
> Fwiw, this system uses the Intel C612 chipset with Intel(R) Xeon(R)
> E5-2603 v3 @ 1.60GHz CPUs.
> 

Can you please also hack a line of code to check the return value of
iommu_dma_map_sg()?

> 
> My sense is that "CPU time" remains about the same because the problem
> isn't manifesting as an increase in instruction path length. Wall time
> goes up, CPU time stays the same, the ratio of those (ie, utilization)
> drops.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: performance regression noted in v5.11-rc after c062db039f40
  2021-01-19  1:22           ` Lu Baolu
@ 2021-01-19 14:37             ` Chuck Lever
  -1 siblings, 0 replies; 36+ messages in thread
From: Chuck Lever @ 2021-01-19 14:37 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Robin Murphy, iommu, Will Deacon, linux-rdma, logang,
	Christoph Hellwig, murphyt7



> On Jan 18, 2021, at 8:22 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
> 
> Do you mind posting the cap and ecap of the iommu used by your device?
> 
> You can get it via sysfs, for example:
> 
> /sys/bus/pci/devices/0000:00:14.0/iommu/intel-iommu# ls
> address  cap  domains_supported  domains_used  ecap  version

[root@manet intel-iommu]# lspci | grep Mellanox
03:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
[root@manet intel-iommu]# pwd
/sys/devices/pci0000:00/0000:00:03.0/0000:03:00.0/iommu/intel-iommu
[root@manet intel-iommu]# for i in *; do   echo -n $i ": ";   cat $i; done
address : c7ffc000
cap : d2078c106f0466
domains_supported : 65536
domains_used : 62
ecap : f020de
version : 1:0
[root@manet intel-iommu]#


>> Fwiw, this system uses the Intel C612 chipset with Intel(R) Xeon(R)
>> E5-2603 v3 @ 1.60GHz CPUs.
> 
> Can you please also hack a line of code to check the return value of
> iommu_dma_map_sg()?

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index baca49fe83af..e811562ead0e 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -328,6 +328,7 @@ struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt,
 
        dma_nents = ib_dma_map_sg(ep->re_id->device, mr->mr_sg, mr->mr_nents,
                                  mr->mr_dir);
+       trace_printk("ib_dma_map_sg(%d) returns %d\n", mr->mr_nents, dma_nents);
        if (!dma_nents)
                goto out_dmamap_err;
        mr->mr_device = ep->re_id->device;

During the 256KB iozone test I used before, this trace log is generated:

   kworker/u28:3-1269  [000]   336.054743: bprint:               frwr_map: ib_dma_map_sg(30) returns 1
   kworker/u28:3-1269  [000]   336.054835: bprint:               frwr_map: ib_dma_map_sg(30) returns 1
   kworker/u28:3-1269  [000]   336.055022: bprint:               frwr_map: ib_dma_map_sg(4) returns 1
   kworker/u28:3-1269  [000]   336.055118: bprint:               frwr_map: ib_dma_map_sg(30) returns 1
   kworker/u28:3-1269  [000]   336.055312: bprint:               frwr_map: ib_dma_map_sg(30) returns 1
   kworker/u28:3-1269  [000]   336.055407: bprint:               frwr_map: ib_dma_map_sg(4) returns 1

--
Chuck Lever




^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: performance regression noted in v5.11-rc after c062db039f40
  2021-01-19 14:37             ` Chuck Lever
@ 2021-01-20  2:11               ` Lu Baolu
  -1 siblings, 0 replies; 36+ messages in thread
From: Lu Baolu @ 2021-01-20  2:11 UTC (permalink / raw)
  To: Chuck Lever
  Cc: baolu.lu, Robin Murphy, iommu, Will Deacon, linux-rdma, logang,
	Christoph Hellwig, murphyt7

On 1/19/21 10:37 PM, Chuck Lever wrote:
> 
> 
>> On Jan 18, 2021, at 8:22 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>>
>> Do you mind posting the cap and ecap of the iommu used by your device?
>>
>> You can get it via sysfs, for example:
>>
>> /sys/bus/pci/devices/0000:00:14.0/iommu/intel-iommu# ls
>> address  cap  domains_supported  domains_used  ecap  version
> 
> [root@manet intel-iommu]# lspci | grep Mellanox
> 03:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> [root@manet intel-iommu]# pwd
> /sys/devices/pci0000:00/0000:00:03.0/0000:03:00.0/iommu/intel-iommu
> [root@manet intel-iommu]# for i in *; do   echo -n $i ": ";   cat $i; done
> address : c7ffc000
> cap : d2078c106f0466

MGAW: 101111 (supporting 48-bit address width)
SAGAW: 00100 (supporting 48-bit 4-level page table)

So the calculation of domain->domain.geometry.aperture_end is right.
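
(A quick userspace sanity check of that decode; the bit positions follow 
the VT-d spec, MGAW in bits 21:16 and SAGAW in bits 12:8 of the 
capability register, matching the driver's cap_mgaw()/cap_sagaw() 
helpers:)

#include <stdio.h>
#include <stdint.h>

/* Field extraction per the VT-d capability register layout. */
#define cap_mgaw(c)	((((c) >> 16) & 0x3fULL) + 1)
#define cap_sagaw(c)	(((c) >> 8) & 0x1fULL)

int main(void)
{
	uint64_t cap = 0xd2078c106f0466ULL;

	printf("MGAW  = %llu bits\n", (unsigned long long)cap_mgaw(cap));  /* 48 */
	printf("SAGAW = 0x%llx\n", (unsigned long long)cap_sagaw(cap));    /* 0x4: 48-bit, 4-level */
	return 0;
}

With a 48-bit GAW the aperture ends at (1ULL << 48) - 1 = 
281,474,976,710,655, which matches the value you reported.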

> domains_supported : 65536
> domains_used : 62
> ecap : f020de
> version : 1:0
> [root@manet intel-iommu]#
> 
> 
>>> Fwiw, this system uses the Intel C612 chipset with Intel(R) Xeon(R)
>>> E5-2603 v3 @ 1.60GHz CPUs.
>>
>> Can you please also hack a line of code to check the return value of
>> iommu_dma_map_sg()?
> 
> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
> index baca49fe83af..e811562ead0e 100644
> --- a/net/sunrpc/xprtrdma/frwr_ops.c
> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
> @@ -328,6 +328,7 @@ struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt,
>   
>          dma_nents = ib_dma_map_sg(ep->re_id->device, mr->mr_sg, mr->mr_nents,
>                                    mr->mr_dir);
> +       trace_printk("ib_dma_map_sg(%d) returns %d\n", mr->mr_nents, dma_nents);
>          if (!dma_nents)
>                  goto out_dmamap_err;
>          mr->mr_device = ep->re_id->device;
> 
> During the 256KB iozone test I used before, this trace log is generated:
> 
>     kworker/u28:3-1269  [000]   336.054743: bprint:               frwr_map: ib_dma_map_sg(30) returns 1
>     kworker/u28:3-1269  [000]   336.054835: bprint:               frwr_map: ib_dma_map_sg(30) returns 1
>     kworker/u28:3-1269  [000]   336.055022: bprint:               frwr_map: ib_dma_map_sg(4) returns 1
>     kworker/u28:3-1269  [000]   336.055118: bprint:               frwr_map: ib_dma_map_sg(30) returns 1
>     kworker/u28:3-1269  [000]   336.055312: bprint:               frwr_map: ib_dma_map_sg(30) returns 1
>     kworker/u28:3-1269  [000]   336.055407: bprint:               frwr_map: ib_dma_map_sg(4) returns 1

This is the result after commit c062db039f40, right? It also looks good
to me. Are you using iotlb strict mode (intel_iommu=strict) or lazy mode
(by default)?

Best regards,
baolu

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: performance regression noted in v5.11-rc after c062db039f40
  2021-01-20  2:11               ` Lu Baolu
@ 2021-01-20 20:25                 ` Chuck Lever
  -1 siblings, 0 replies; 36+ messages in thread
From: Chuck Lever @ 2021-01-20 20:25 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Robin Murphy, iommu, Will Deacon, linux-rdma, logang,
	Christoph Hellwig, murphyt7



> On Jan 19, 2021, at 9:11 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
> 
> On 1/19/21 10:37 PM, Chuck Lever wrote:
>>> On Jan 18, 2021, at 8:22 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>>> 
>>> Do you mind posting the cap and ecap of the iommu used by your device?
>>> 
>>> You can get it via sysfs, for example:
>>> 
>>> /sys/bus/pci/devices/0000:00:14.0/iommu/intel-iommu# ls
>>> address  cap  domains_supported  domains_used  ecap  version
>> [root@manet intel-iommu]# lspci | grep Mellanox
>> 03:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
>> [root@manet intel-iommu]# pwd
>> /sys/devices/pci0000:00/0000:00:03.0/0000:03:00.0/iommu/intel-iommu
>> [root@manet intel-iommu]# for i in *; do   echo -n $i ": ";   cat $i; done
>> address : c7ffc000
>> cap : d2078c106f0466
> 
> MGAW: 101111 (supporting 48-bit address width)
> SAGAW: 00100 (supporting 48-bit 4-level page table)
> 
> So the calculation of domain->domain.geometry.aperture_end is right.

I found the cause of the performance loss with c062db039f40: it was
a testing error on my part. I will begin looking at c588072bba6b
("iommu/vt-d: Convert intel iommu driver to the iommu ops").


--
Chuck Lever




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: performance regression noted in v5.11-rc after c062db039f40
  2021-01-18 18:00       ` Robin Murphy
@ 2021-01-21 19:09         ` Chuck Lever
  -1 siblings, 0 replies; 36+ messages in thread
From: Chuck Lever @ 2021-01-21 19:09 UTC (permalink / raw)
  To: Robin Murphy
  Cc: iommu, Will Deacon, linux-rdma, Lu Baolu, logang,
	Christoph Hellwig, murphyt7



> On Jan 18, 2021, at 1:00 PM, Robin Murphy <robin.murphy@arm.com> wrote:
> 
> On 2021-01-18 16:18, Chuck Lever wrote:
>>> On Jan 12, 2021, at 9:38 AM, Will Deacon <will@kernel.org> wrote:
>>> 
>>> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
>>> 
>>> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
>>>> Hi-
>>>> 
>>>> [ Please cc: me on replies, I'm not currently subscribed to
>>>> iommu@lists ].
>>>> 
>>>> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
>>>> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
>>>> 
>>>> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
>>>> 
>>>> For those not familiar with the way storage protocols use RDMA, The
>>>> initiator/client sets up memory regions and the target/server uses
>>>> RDMA Read and Write to move data out of and into those regions. The
>>>> initiator/client uses only RDMA memory registration and invalidation
>>>> operations, and the target/server uses RDMA Read and Write.
>>>> 
>>>> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
>>>> enabled using the kernel command line options "intel_iommu=on
>>>> iommu=strict".
>>>> 
>>>> Recently I've noticed a significant (25-30%) loss in NFS throughput.
>>>> I was able to bisect on my client to the following commits.
>>>> 
>>>> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>>>> map_sg"). This is about normal for this test.
>>>> 
>>>> 	Children see throughput for 12 initial writers 	= 4732581.09 kB/sec
>>>> 	Parent sees throughput for 12 initial writers 	= 4646810.21 kB/sec
>>>> 	Min throughput per process 			=  387764.34 kB/sec
>>>> 	Max throughput per process 			=  399655.47 kB/sec
>>>> 	Avg throughput per process 			=  394381.76 kB/sec
>>>> 	Min xfer 					= 1017344.00 kB
>>>> 	CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
>>>> 	Children see throughput for 12 rewriters 	= 4837741.94 kB/sec
>>>> 	Parent sees throughput for 12 rewriters 	= 4833509.35 kB/sec
>>>> 	Min throughput per process 			=  398983.72 kB/sec
>>>> 	Max throughput per process 			=  406199.66 kB/sec
>>>> 	Avg throughput per process 			=  403145.16 kB/sec
>>>> 	Min xfer 					= 1030656.00 kB
>>>> 	CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
>>>> 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>>>> 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>>>> 	Min throughput per process 			=  491812.38 kB/sec
>>>> 	Max throughput per process 			=  494777.28 kB/sec
>>>> 	Avg throughput per process 			=  493447.58 kB/sec
>>>> 	Min xfer 					= 1042688.00 kB
>>>> 	CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
>>>> 	Children see throughput for 12 re-readers 	= 5947985.69 kB/sec
>>>> 	Parent sees throughput for 12 re-readers 	= 5941348.51 kB/sec
>>>> 	Min throughput per process 			=  492805.81 kB/sec
>>>> 	Max throughput per process 			=  497280.19 kB/sec
>>>> 	Avg throughput per process 			=  495665.47 kB/sec
>>>> 	Min xfer 					= 1039360.00 kB
>>>> 	CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %
>>>> 
>>>> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
>>>> iommu_ops.at(de)tach_dev"). It's losing some steam here.
>>>> 
>>>> 	Children see throughput for 12 initial writers 	= 4342419.12 kB/sec
>>>> 	Parent sees throughput for 12 initial writers 	= 4310612.79 kB/sec
>>>> 	Min throughput per process 			=  359299.06 kB/sec
>>>> 	Max throughput per process 			=  363866.16 kB/sec
>>>> 	Avg throughput per process 			=  361868.26 kB/sec
>>>> 	Min xfer 					= 1035520.00 kB
>>>> 	CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
>>>> 	Children see throughput for 12 rewriters 	= 4408576.66 kB/sec
>>>> 	Parent sees throughput for 12 rewriters 	= 4404280.87 kB/sec
>>>> 	Min throughput per process 			=  364553.88 kB/sec
>>>> 	Max throughput per process 			=  370029.28 kB/sec
>>>> 	Avg throughput per process 			=  367381.39 kB/sec
>>>> 	Min xfer 					= 1033216.00 kB
>>>> 	CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
>>>> 	Children see throughput for 12 readers 		= 5406879.47 kB/sec
>>>> 	Parent sees throughput for 12 readers 		= 5401862.78 kB/sec
>>>> 	Min throughput per process 			=  449583.03 kB/sec
>>>> 	Max throughput per process 			=  451761.69 kB/sec
>>>> 	Avg throughput per process 			=  450573.29 kB/sec
>>>> 	Min xfer 					= 1044224.00 kB
>>>> 	CPU utilization: Wall time    2.323    CPU time    1.977    CPU utilization  85.12 %
>>>> 	Children see throughput for 12 re-readers 	= 5410601.12 kB/sec
>>>> 	Parent sees throughput for 12 re-readers 	= 5403504.40 kB/sec
>>>> 	Min throughput per process 			=  449918.12 kB/sec
>>>> 	Max throughput per process 			=  452489.28 kB/sec
>>>> 	Avg throughput per process 			=  450883.43 kB/sec
>>>> 	Min xfer 					= 1043456.00 kB
>>>> 	CPU utilization: Wall time    2.321    CPU time    1.978    CPU utilization  85.21 %
>>>> 
>>>> And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
>>>> the iommu ops"). Significant throughput loss.
>>>> 
>>>> 	Children see throughput for 12 initial writers 	= 3812036.91 kB/sec
>>>> 	Parent sees throughput for 12 initial writers 	= 3753683.40 kB/sec
>>>> 	Min throughput per process 			=  313672.25 kB/sec
>>>> 	Max throughput per process 			=  321719.44 kB/sec
>>>> 	Avg throughput per process 			=  317669.74 kB/sec
>>>> 	Min xfer 					= 1022464.00 kB
>>>> 	CPU Utilization: Wall time    3.309    CPU time    1.986    CPU utilization  60.02 %
>>>> 	Children see throughput for 12 rewriters 	= 3786831.94 kB/sec
>>>> 	Parent sees throughput for 12 rewriters 	= 3783205.58 kB/sec
>>>> 	Min throughput per process 			=  313654.44 kB/sec
>>>> 	Max throughput per process 			=  317844.50 kB/sec
>>>> 	Avg throughput per process 			=  315569.33 kB/sec
>>>> 	Min xfer 					= 1035520.00 kB
>>>> 	CPU utilization: Wall time    3.302    CPU time    1.945    CPU utilization  58.90 %
>>>> 	Children see throughput for 12 readers 		= 4265828.28 kB/sec
>>>> 	Parent sees throughput for 12 readers 		= 4261844.88 kB/sec
>>>> 	Min throughput per process 			=  352305.00 kB/sec
>>>> 	Max throughput per process 			=  357726.22 kB/sec
>>>> 	Avg throughput per process 			=  355485.69 kB/sec
>>>> 	Min xfer 					= 1032960.00 kB
>>>> 	CPU utilization: Wall time    2.934    CPU time    1.942    CPU utilization  66.20 %
>>>> 	Children see throughput for 12 re-readers 	= 4220651.19 kB/sec
>>>> 	Parent sees throughput for 12 re-readers 	= 4216096.04 kB/sec
>>>> 	Min throughput per process 			=  348677.16 kB/sec
>>>> 	Max throughput per process 			=  353467.44 kB/sec
>>>> 	Avg throughput per process 			=  351720.93 kB/sec
>>>> 	Min xfer 					= 1035264.00 kB
>>>> 	CPU utilization: Wall time    2.969    CPU time    1.952    CPU utilization  65.74 %
>>>> 
>>>> The regression appears to be 100% reproducible.
>> Any thoughts?
>> How about some tools to try or debugging advice? I don't know where to start.
> 
> I'm not familiar enough with VT-D internals or Infiniband to have a clue why the middle commit makes any difference (the calculation itself is not on a fast path, so AFAICS the worst it could do is change your maximum DMA address size from 48/57 bits to 47/56, and that seems relatively benign).
> 
> With the last commit, though, at least part of it is likely to be the unfortunate inevitable overhead of the internal indirection through the IOMMU API. There's a coincidental performance-related thread where we've already started pondering some ideas in that area[1] (note that Intel is the last one to the party here; AMD has been using this path for a while, and it's all that arm64 systems have ever known). I'm not sure if there's any difference in the strict invalidation behaviour between the IOMMU API calls and the old intel_dma_ops, but I suppose that might be worth quickly double-checking as well. I guess the main thing would be to do some profiling to see where time is being spent in iommu-dma and intel-iommu vs. just different parts of intel-iommu before, and whether anything in particular stands out beyond the extra call overhead currently incurred by iommu_{map,unmap}.

I did a function_graph trace of the above iozone test on a v5.10 NFS
client and again on v5.11-rc. There is a substantial timing difference
in dma_map_sg_attrs. Each excerpt below is for DMA-mapping a 120KB set
of pages that are part of an NFS/RDMA WRITE operation.

v5.10:

1072.028308: funcgraph_entry:                   |  dma_map_sg_attrs() {
1072.028308: funcgraph_entry:                   |    intel_map_sg() {
1072.028309: funcgraph_entry:                   |      find_domain() {
1072.028309: funcgraph_entry:        0.280 us   |        get_domain_info();
1072.028310: funcgraph_exit:         0.930 us   |      }
1072.028310: funcgraph_entry:        0.360 us   |      domain_get_iommu();
1072.028311: funcgraph_entry:                   |      intel_alloc_iova() {
1072.028311: funcgraph_entry:                   |        alloc_iova_fast() {
1072.028311: funcgraph_entry:        0.375 us   |          _raw_spin_lock_irqsave();
1072.028312: funcgraph_entry:        0.285 us   |          __lock_text_start();
1072.028313: funcgraph_exit:         1.500 us   |        }
1072.028313: funcgraph_exit:         2.052 us   |      }
1072.028313: funcgraph_entry:                   |      domain_mapping() {
1072.028313: funcgraph_entry:                   |        __domain_mapping() {
1072.028314: funcgraph_entry:        0.350 us   |          pfn_to_dma_pte();
1072.028315: funcgraph_entry:        0.942 us   |          domain_flush_cache();
1072.028316: funcgraph_exit:         2.852 us   |        }
1072.028316: funcgraph_entry:        0.275 us   |        iommu_flush_write_buffer();
1072.028317: funcgraph_exit:         4.213 us   |      }
1072.028318: funcgraph_exit:         9.392 us   |    }
1072.028318: funcgraph_exit:       + 10.073 us  |  }
1072.028323: xprtrdma_mr_map:      mr.id=115 nents=30 122880@0xe476ca03f1180000:0x18011105 (TO_DEVICE)
1072.028323: xprtrdma_chunk_read:  task:63879@5 pos=148 122880@0xe476ca03f1180000:0x18011105 (more)


v5.11-rc:

57.602990: funcgraph_entry:                   |  dma_map_sg_attrs() {
57.602990: funcgraph_entry:                   |    iommu_dma_map_sg() {
57.602990: funcgraph_entry:        0.285 us   |      iommu_get_dma_domain();
57.602991: funcgraph_entry:        0.270 us   |      iommu_dma_deferred_attach();
57.602991: funcgraph_entry:                   |      iommu_dma_sync_sg_for_device() {
57.602992: funcgraph_entry:        0.268 us   |        dev_is_untrusted();
57.602992: funcgraph_exit:         0.815 us   |      }
57.602993: funcgraph_entry:        0.267 us   |      dev_is_untrusted();
57.602993: funcgraph_entry:                   |      iommu_dma_alloc_iova() {
57.602994: funcgraph_entry:                   |        alloc_iova_fast() {
57.602994: funcgraph_entry:        0.260 us   |          _raw_spin_lock_irqsave();
57.602995: funcgraph_entry:        0.293 us   |          _raw_spin_lock();
57.602995: funcgraph_entry:        0.273 us   |          _raw_spin_unlock_irqrestore();
57.602996: funcgraph_entry:        1.147 us   |          alloc_iova();
57.602997: funcgraph_exit:         3.370 us   |        }
57.602997: funcgraph_exit:         3.945 us   |      }
57.602998: funcgraph_entry:        0.272 us   |      dma_info_to_prot();
57.602998: funcgraph_entry:                   |      iommu_map_sg_atomic() {
57.602998: funcgraph_entry:                   |        __iommu_map_sg() {
57.602999: funcgraph_entry:        1.733 us   |          __iommu_map();
57.603001: funcgraph_entry:        1.642 us   |          __iommu_map();
57.603003: funcgraph_entry:        1.638 us   |          __iommu_map();
57.603005: funcgraph_entry:        1.645 us   |          __iommu_map();
57.603007: funcgraph_entry:        1.630 us   |          __iommu_map();
57.603009: funcgraph_entry:        1.770 us   |          __iommu_map();
57.603011: funcgraph_entry:        1.730 us   |          __iommu_map();
57.603013: funcgraph_entry:        1.633 us   |          __iommu_map();
57.603015: funcgraph_entry:        1.605 us   |          __iommu_map();
57.603017: funcgraph_entry:        2.847 us   |          __iommu_map();
57.603020: funcgraph_entry:        2.847 us   |          __iommu_map();
57.603024: funcgraph_entry:        2.955 us   |          __iommu_map();
57.603027: funcgraph_entry:        2.928 us   |          __iommu_map();
57.603030: funcgraph_entry:        2.933 us   |          __iommu_map();
57.603034: funcgraph_entry:        2.943 us   |          __iommu_map();
57.603037: funcgraph_entry:        2.928 us   |          __iommu_map();
57.603040: funcgraph_entry:        2.857 us   |          __iommu_map();
57.603044: funcgraph_entry:        2.953 us   |          __iommu_map();
57.603047: funcgraph_entry:        3.023 us   |          __iommu_map();
57.603050: funcgraph_entry:        1.645 us   |          __iommu_map();
57.603052: funcgraph_exit:       + 53.648 us  |        }
57.603052: funcgraph_exit:       + 54.178 us  |      }
57.603053: funcgraph_exit:       + 62.953 us  |    }
57.603053: funcgraph_exit:       + 63.567 us  |  }
57.603059: xprtrdma_mr_map:      task:60@5 mr.id=4 nents=30 122880@0xd79cc0e2f18c0000:0x00010501 (TO_DEVICE)
57.603060: xprtrdma_chunk_read:  task:60@5 pos=148 122880@0xd79cc0e2f18c0000:0x00010501 (more)
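
The run of __iommu_map() calls above is what I'd expect from the common
path mapping the scatterlist one physically-contiguous run at a time
through the iommu_ops indirection, instead of the whole list being walked
inside the VT-d driver as intel_map_sg() did. Purely for illustration,
here is a user-space model of that call pattern; "seg" and map_one() are
invented names and this is not the kernel's code:

#include <stdio.h>
#include <stddef.h>

/* Stand-in for one scatterlist entry: a physical address and a length. */
struct seg {
	unsigned long long phys;
	size_t len;
};

/* Hypothetical per-range mapping call; in the trace above this role is
 * played by __iommu_map(), invoked once per physically-contiguous run. */
static void map_one(unsigned long long iova, unsigned long long phys, size_t len)
{
	printf("map iova=0x%llx phys=0x%llx len=%zu\n", iova, phys, len);
}

/* Coalesce adjacent entries, then issue one map call per run.  A 30-entry
 * list whose pages are only partly contiguous collapses to roughly the 20
 * calls visible in the v5.11-rc excerpt. */
static void map_sg_model(unsigned long long iova, const struct seg *sg, int nents)
{
	unsigned long long start = 0, mapped = 0;
	size_t len = 0;

	for (int i = 0; i < nents; i++) {
		if (len && sg[i].phys != start + len) {
			map_one(iova + mapped, start, len);
			mapped += len;
			len = 0;
		}
		if (!len)
			start = sg[i].phys;
		len += sg[i].len;
	}
	if (len)
		map_one(iova + mapped, start, len);
}

int main(void)
{
	const struct seg sg[] = {
		{ 0x100000, 4096 }, { 0x101000, 4096 },	/* contiguous: merged */
		{ 0x200000, 4096 },			/* gap: separate call */
	};

	map_sg_model(0x80000000ULL, sg, 3);
	return 0;
}

Each separate map call goes through the ops indirection and its own
page-table walk, which is where the extra ~50 us appears to be spent.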


--
Chuck Lever




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: performance regression noted in v5.11-rc after c062db039f40
  2021-01-21 19:09         ` Chuck Lever
@ 2021-01-22  3:00           ` Lu Baolu
  -1 siblings, 0 replies; 36+ messages in thread
From: Lu Baolu @ 2021-01-22  3:00 UTC (permalink / raw)
  To: Chuck Lever, Robin Murphy
  Cc: baolu.lu, iommu, Will Deacon, linux-rdma, logang,
	Christoph Hellwig, murphyt7, isaacm

+Isaac

On 1/22/21 3:09 AM, Chuck Lever wrote:
> 
> 
>> On Jan 18, 2021, at 1:00 PM, Robin Murphy <robin.murphy@arm.com> wrote:
>>
>> On 2021-01-18 16:18, Chuck Lever wrote:
>>>> On Jan 12, 2021, at 9:38 AM, Will Deacon <will@kernel.org> wrote:
>>>>
>>>> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
>>>>
>>>> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
>>>>> Hi-
>>>>>
>>>>> [ Please cc: me on replies, I'm not currently subscribed to
>>>>> iommu@lists ].
>>>>>
>>>>> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
>>>>> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
>>>>>
>>>>> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
>>>>>
>>>>> For those not familiar with the way storage protocols use RDMA, The
>>>>> initiator/client sets up memory regions and the target/server uses
>>>>> RDMA Read and Write to move data out of and into those regions. The
>>>>> initiator/client uses only RDMA memory registration and invalidation
>>>>> operations, and the target/server uses RDMA Read and Write.
>>>>>
>>>>> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
>>>>> enabled using the kernel command line options "intel_iommu=on
>>>>> iommu=strict".
>>>>>
>>>>> Recently I've noticed a significant (25-30%) loss in NFS throughput.
>>>>> I was able to bisect on my client to the following commits.
>>>>>
>>>>> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>>>>> map_sg"). This is about normal for this test.
>>>>>
>>>>> 	Children see throughput for 12 initial writers 	= 4732581.09 kB/sec
>>>>> 	Parent sees throughput for 12 initial writers 	= 4646810.21 kB/sec
>>>>> 	Min throughput per process 			=  387764.34 kB/sec
>>>>> 	Max throughput per process 			=  399655.47 kB/sec
>>>>> 	Avg throughput per process 			=  394381.76 kB/sec
>>>>> 	Min xfer 					= 1017344.00 kB
>>>>> 	CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
>>>>> 	Children see throughput for 12 rewriters 	= 4837741.94 kB/sec
>>>>> 	Parent sees throughput for 12 rewriters 	= 4833509.35 kB/sec
>>>>> 	Min throughput per process 			=  398983.72 kB/sec
>>>>> 	Max throughput per process 			=  406199.66 kB/sec
>>>>> 	Avg throughput per process 			=  403145.16 kB/sec
>>>>> 	Min xfer 					= 1030656.00 kB
>>>>> 	CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
>>>>> 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>>>>> 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>>>>> 	Min throughput per process 			=  491812.38 kB/sec
>>>>> 	Max throughput per process 			=  494777.28 kB/sec
>>>>> 	Avg throughput per process 			=  493447.58 kB/sec
>>>>> 	Min xfer 					= 1042688.00 kB
>>>>> 	CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
>>>>> 	Children see throughput for 12 re-readers 	= 5947985.69 kB/sec
>>>>> 	Parent sees throughput for 12 re-readers 	= 5941348.51 kB/sec
>>>>> 	Min throughput per process 			=  492805.81 kB/sec
>>>>> 	Max throughput per process 			=  497280.19 kB/sec
>>>>> 	Avg throughput per process 			=  495665.47 kB/sec
>>>>> 	Min xfer 					= 1039360.00 kB
>>>>> 	CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %
>>>>>
>>>>> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
>>>>> iommu_ops.at(de)tach_dev"). It's losing some steam here.
>>>>>
>>>>> 	Children see throughput for 12 initial writers 	= 4342419.12 kB/sec
>>>>> 	Parent sees throughput for 12 initial writers 	= 4310612.79 kB/sec
>>>>> 	Min throughput per process 			=  359299.06 kB/sec
>>>>> 	Max throughput per process 			=  363866.16 kB/sec
>>>>> 	Avg throughput per process 			=  361868.26 kB/sec
>>>>> 	Min xfer 					= 1035520.00 kB
>>>>> 	CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
>>>>> 	Children see throughput for 12 rewriters 	= 4408576.66 kB/sec
>>>>> 	Parent sees throughput for 12 rewriters 	= 4404280.87 kB/sec
>>>>> 	Min throughput per process 			=  364553.88 kB/sec
>>>>> 	Max throughput per process 			=  370029.28 kB/sec
>>>>> 	Avg throughput per process 			=  367381.39 kB/sec
>>>>> 	Min xfer 					= 1033216.00 kB
>>>>> 	CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
>>>>> 	Children see throughput for 12 readers 		= 5406879.47 kB/sec
>>>>> 	Parent sees throughput for 12 readers 		= 5401862.78 kB/sec
>>>>> 	Min throughput per process 			=  449583.03 kB/sec
>>>>> 	Max throughput per process 			=  451761.69 kB/sec
>>>>> 	Avg throughput per process 			=  450573.29 kB/sec
>>>>> 	Min xfer 					= 1044224.00 kB
>>>>> 	CPU utilization: Wall time    2.323    CPU time    1.977    CPU utilization  85.12 %
>>>>> 	Children see throughput for 12 re-readers 	= 5410601.12 kB/sec
>>>>> 	Parent sees throughput for 12 re-readers 	= 5403504.40 kB/sec
>>>>> 	Min throughput per process 			=  449918.12 kB/sec
>>>>> 	Max throughput per process 			=  452489.28 kB/sec
>>>>> 	Avg throughput per process 			=  450883.43 kB/sec
>>>>> 	Min xfer 					= 1043456.00 kB
>>>>> 	CPU utilization: Wall time    2.321    CPU time    1.978    CPU utilization  85.21 %
>>>>>
>>>>> And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
>>>>> the iommu ops"). Significant throughput loss.
>>>>>
>>>>> 	Children see throughput for 12 initial writers 	= 3812036.91 kB/sec
>>>>> 	Parent sees throughput for 12 initial writers 	= 3753683.40 kB/sec
>>>>> 	Min throughput per process 			=  313672.25 kB/sec
>>>>> 	Max throughput per process 			=  321719.44 kB/sec
>>>>> 	Avg throughput per process 			=  317669.74 kB/sec
>>>>> 	Min xfer 					= 1022464.00 kB
>>>>> 	CPU Utilization: Wall time    3.309    CPU time    1.986    CPU utilization  60.02 %
>>>>> 	Children see throughput for 12 rewriters 	= 3786831.94 kB/sec
>>>>> 	Parent sees throughput for 12 rewriters 	= 3783205.58 kB/sec
>>>>> 	Min throughput per process 			=  313654.44 kB/sec
>>>>> 	Max throughput per process 			=  317844.50 kB/sec
>>>>> 	Avg throughput per process 			=  315569.33 kB/sec
>>>>> 	Min xfer 					= 1035520.00 kB
>>>>> 	CPU utilization: Wall time    3.302    CPU time    1.945    CPU utilization  58.90 %
>>>>> 	Children see throughput for 12 readers 		= 4265828.28 kB/sec
>>>>> 	Parent sees throughput for 12 readers 		= 4261844.88 kB/sec
>>>>> 	Min throughput per process 			=  352305.00 kB/sec
>>>>> 	Max throughput per process 			=  357726.22 kB/sec
>>>>> 	Avg throughput per process 			=  355485.69 kB/sec
>>>>> 	Min xfer 					= 1032960.00 kB
>>>>> 	CPU utilization: Wall time    2.934    CPU time    1.942    CPU utilization  66.20 %
>>>>> 	Children see throughput for 12 re-readers 	= 4220651.19 kB/sec
>>>>> 	Parent sees throughput for 12 re-readers 	= 4216096.04 kB/sec
>>>>> 	Min throughput per process 			=  348677.16 kB/sec
>>>>> 	Max throughput per process 			=  353467.44 kB/sec
>>>>> 	Avg throughput per process 			=  351720.93 kB/sec
>>>>> 	Min xfer 					= 1035264.00 kB
>>>>> 	CPU utilization: Wall time    2.969    CPU time    1.952    CPU utilization  65.74 %
>>>>>
>>>>> The regression appears to be 100% reproducible.
>>> Any thoughts?
>>> How about some tools to try or debugging advice? I don't know where to start.
>>
>> I'm not familiar enough with VT-D internals or Infiniband to have a clue why the middle commit makes any difference (the calculation itself is not on a fast path, so AFAICS the worst it could do is change your maximum DMA address size from 48/57 bits to 47/56, and that seems relatively benign).
>>
>> With the last commit, though, at least part of it is likely to be the unfortunate inevitable overhead of the internal indirection through the IOMMU API. There's a coincidental performance-related thread where we've already started pondering some ideas in that area[1] (note that Intel is the last one to the party here; AMD has been using this path for a while, and it's all that arm64 systems have ever known). I'm not sure if there's any difference in the strict invalidation behaviour between the IOMMU API calls and the old intel_dma_ops, but I suppose that might be worth quickly double-checking as well. I guess the main thing would be to do some profiling to see where time is being spent in iommu-dma and intel-iommu vs. just different parts of intel-iommu before, and whether anything in particular stands out beyond the extra call overhead currently incurred by iommu_{map,unmap}.
> 
> I did a function_graph trace of the above iozone test on a v5.10 NFS
> client and again on v5.11-rc. There is a substantial timing difference
> in dma_map_sg_attrs. Each excerpt below is for DMA-mapping a 120KB set
> of pages that are part of an NFS/RDMA WRITE operation.
> 
> v5.10:
> 
> 1072.028308: funcgraph_entry:                   |  dma_map_sg_attrs() {
> 1072.028308: funcgraph_entry:                   |    intel_map_sg() {
> 1072.028309: funcgraph_entry:                   |      find_domain() {
> 1072.028309: funcgraph_entry:        0.280 us   |        get_domain_info();
> 1072.028310: funcgraph_exit:         0.930 us   |      }
> 1072.028310: funcgraph_entry:        0.360 us   |      domain_get_iommu();
> 1072.028311: funcgraph_entry:                   |      intel_alloc_iova() {
> 1072.028311: funcgraph_entry:                   |        alloc_iova_fast() {
> 1072.028311: funcgraph_entry:        0.375 us   |          _raw_spin_lock_irqsave();
> 1072.028312: funcgraph_entry:        0.285 us   |          __lock_text_start();
> 1072.028313: funcgraph_exit:         1.500 us   |        }
> 1072.028313: funcgraph_exit:         2.052 us   |      }
> 1072.028313: funcgraph_entry:                   |      domain_mapping() {
> 1072.028313: funcgraph_entry:                   |        __domain_mapping() {
> 1072.028314: funcgraph_entry:        0.350 us   |          pfn_to_dma_pte();
> 1072.028315: funcgraph_entry:        0.942 us   |          domain_flush_cache();
> 1072.028316: funcgraph_exit:         2.852 us   |        }
> 1072.028316: funcgraph_entry:        0.275 us   |        iommu_flush_write_buffer();
> 1072.028317: funcgraph_exit:         4.213 us   |      }
> 1072.028318: funcgraph_exit:         9.392 us   |    }
> 1072.028318: funcgraph_exit:       + 10.073 us  |  }
> 1072.028323: xprtrdma_mr_map:      mr.id=115 nents=30 122880@0xe476ca03f1180000:0x18011105 (TO_DEVICE)
> 1072.028323: xprtrdma_chunk_read:  task:63879@5 pos=148 122880@0xe476ca03f1180000:0x18011105 (more)
> 
> 
> v5.11-rc:
> 
> 57.602990: funcgraph_entry:                   |  dma_map_sg_attrs() {
> 57.602990: funcgraph_entry:                   |    iommu_dma_map_sg() {
> 57.602990: funcgraph_entry:        0.285 us   |      iommu_get_dma_domain();
> 57.602991: funcgraph_entry:        0.270 us   |      iommu_dma_deferred_attach();
> 57.602991: funcgraph_entry:                   |      iommu_dma_sync_sg_for_device() {
> 57.602992: funcgraph_entry:        0.268 us   |        dev_is_untrusted();
> 57.602992: funcgraph_exit:         0.815 us   |      }
> 57.602993: funcgraph_entry:        0.267 us   |      dev_is_untrusted();
> 57.602993: funcgraph_entry:                   |      iommu_dma_alloc_iova() {
> 57.602994: funcgraph_entry:                   |        alloc_iova_fast() {
> 57.602994: funcgraph_entry:        0.260 us   |          _raw_spin_lock_irqsave();
> 57.602995: funcgraph_entry:        0.293 us   |          _raw_spin_lock();
> 57.602995: funcgraph_entry:        0.273 us   |          _raw_spin_unlock_irqrestore();
> 57.602996: funcgraph_entry:        1.147 us   |          alloc_iova();
> 57.602997: funcgraph_exit:         3.370 us   |        }
> 57.602997: funcgraph_exit:         3.945 us   |      }
> 57.602998: funcgraph_entry:        0.272 us   |      dma_info_to_prot();
> 57.602998: funcgraph_entry:                   |      iommu_map_sg_atomic() {
> 57.602998: funcgraph_entry:                   |        __iommu_map_sg() {
> 57.602999: funcgraph_entry:        1.733 us   |          __iommu_map();
> 57.603001: funcgraph_entry:        1.642 us   |          __iommu_map();
> 57.603003: funcgraph_entry:        1.638 us   |          __iommu_map();
> 57.603005: funcgraph_entry:        1.645 us   |          __iommu_map();
> 57.603007: funcgraph_entry:        1.630 us   |          __iommu_map();
> 57.603009: funcgraph_entry:        1.770 us   |          __iommu_map();
> 57.603011: funcgraph_entry:        1.730 us   |          __iommu_map();
> 57.603013: funcgraph_entry:        1.633 us   |          __iommu_map();
> 57.603015: funcgraph_entry:        1.605 us   |          __iommu_map();
> 57.603017: funcgraph_entry:        2.847 us   |          __iommu_map();
> 57.603020: funcgraph_entry:        2.847 us   |          __iommu_map();
> 57.603024: funcgraph_entry:        2.955 us   |          __iommu_map();
> 57.603027: funcgraph_entry:        2.928 us   |          __iommu_map();
> 57.603030: funcgraph_entry:        2.933 us   |          __iommu_map();
> 57.603034: funcgraph_entry:        2.943 us   |          __iommu_map();
> 57.603037: funcgraph_entry:        2.928 us   |          __iommu_map();
> 57.603040: funcgraph_entry:        2.857 us   |          __iommu_map();
> 57.603044: funcgraph_entry:        2.953 us   |          __iommu_map();
> 57.603047: funcgraph_entry:        3.023 us   |          __iommu_map();
> 57.603050: funcgraph_entry:        1.645 us   |          __iommu_map();
> 57.603052: funcgraph_exit:       + 53.648 us  |        }
> 57.603052: funcgraph_exit:       + 54.178 us  |      }
> 57.603053: funcgraph_exit:       + 62.953 us  |    }
> 57.603053: funcgraph_exit:       + 63.567 us  |  }
> 57.603059: xprtrdma_mr_map:      task:60@5 mr.id=4 nents=30 122880@0xd79cc0e2f18c0000:0x00010501 (TO_DEVICE)
> 57.603060: xprtrdma_chunk_read:  task:60@5 pos=148 122880@0xd79cc0e2f18c0000:0x00010501 (more)
> 

I suspect the indirect calls are the cause. A similar performance issue
has also been reported on ARM:

https://lore.kernel.org/linux-iommu/1610376862-927-1-git-send-email-isaacm@codeaurora.org/

Maybe we can try changing the indirect calls to static calls to verify
that this is the problem.
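
To spell out the contrast, this is roughly the difference in question; the
names below are invented for the sketch (the real counterparts would be
the iommu_ops table and the kernel's static_call() machinery), and it is
not kernel code:

#include <stdio.h>
#include <stddef.h>

struct map_ops {
	int (*map)(unsigned long iova, unsigned long phys, size_t len);
};

static int vtd_map(unsigned long iova, unsigned long phys, size_t len)
{
	return 0;	/* pretend to install page-table entries */
}

static const struct map_ops ops = { .map = vtd_map };

int main(void)
{
	const struct map_ops *d = &ops;

	/* Indirect: the target is loaded from a pointer at run time.
	 * The generic DMA path does one such call per mapped range. */
	d->map(0x1000, 0x2000, 4096);

	/* Direct/static: the target is fixed at build time (or patched
	 * in once), so there is no indirect branch to resolve. */
	vtd_map(0x1000, 0x2000, 4096);
	return 0;
}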

Best regards,
baolu

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: performance regression noted in v5.11-rc after c062db039f40
  2021-01-22  3:00           ` Lu Baolu
@ 2021-01-22 16:18             ` Chuck Lever
  -1 siblings, 0 replies; 36+ messages in thread
From: Chuck Lever @ 2021-01-22 16:18 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Robin Murphy, iommu, Will Deacon, linux-rdma, logang,
	Christoph Hellwig, murphyt7, isaacm



> On Jan 21, 2021, at 10:00 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
> 
> +Isaac
> 
> On 1/22/21 3:09 AM, Chuck Lever wrote:
>>> On Jan 18, 2021, at 1:00 PM, Robin Murphy <robin.murphy@arm.com> wrote:
>>> 
>>> On 2021-01-18 16:18, Chuck Lever wrote:
>>>>> On Jan 12, 2021, at 9:38 AM, Will Deacon <will@kernel.org> wrote:
>>>>> 
>>>>> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
>>>>> 
>>>>> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
>>>>>> Hi-
>>>>>> 
>>>>>> [ Please cc: me on replies, I'm not currently subscribed to
>>>>>> iommu@lists ].
>>>>>> 
>>>>>> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
>>>>>> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
>>>>>> 
>>>>>> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
>>>>>> 
>>>>>> For those not familiar with the way storage protocols use RDMA, The
>>>>>> initiator/client sets up memory regions and the target/server uses
>>>>>> RDMA Read and Write to move data out of and into those regions. The
>>>>>> initiator/client uses only RDMA memory registration and invalidation
>>>>>> operations, and the target/server uses RDMA Read and Write.
>>>>>> 
>>>>>> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
>>>>>> enabled using the kernel command line options "intel_iommu=on
>>>>>> iommu=strict".
>>>>>> 
>>>>>> Recently I've noticed a significant (25-30%) loss in NFS throughput.
>>>>>> I was able to bisect on my client to the following commits.
>>>>>> 
>>>>>> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>>>>>> map_sg"). This is about normal for this test.
>>>>>> 
>>>>>> 	Children see throughput for 12 initial writers 	= 4732581.09 kB/sec
>>>>>> 	Parent sees throughput for 12 initial writers 	= 4646810.21 kB/sec
>>>>>> 	Min throughput per process 			=  387764.34 kB/sec
>>>>>> 	Max throughput per process 			=  399655.47 kB/sec
>>>>>> 	Avg throughput per process 			=  394381.76 kB/sec
>>>>>> 	Min xfer 					= 1017344.00 kB
>>>>>> 	CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
>>>>>> 	Children see throughput for 12 rewriters 	= 4837741.94 kB/sec
>>>>>> 	Parent sees throughput for 12 rewriters 	= 4833509.35 kB/sec
>>>>>> 	Min throughput per process 			=  398983.72 kB/sec
>>>>>> 	Max throughput per process 			=  406199.66 kB/sec
>>>>>> 	Avg throughput per process 			=  403145.16 kB/sec
>>>>>> 	Min xfer 					= 1030656.00 kB
>>>>>> 	CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
>>>>>> 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>>>>>> 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>>>>>> 	Min throughput per process 			=  491812.38 kB/sec
>>>>>> 	Max throughput per process 			=  494777.28 kB/sec
>>>>>> 	Avg throughput per process 			=  493447.58 kB/sec
>>>>>> 	Min xfer 					= 1042688.00 kB
>>>>>> 	CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
>>>>>> 	Children see throughput for 12 re-readers 	= 5947985.69 kB/sec
>>>>>> 	Parent sees throughput for 12 re-readers 	= 5941348.51 kB/sec
>>>>>> 	Min throughput per process 			=  492805.81 kB/sec
>>>>>> 	Max throughput per process 			=  497280.19 kB/sec
>>>>>> 	Avg throughput per process 			=  495665.47 kB/sec
>>>>>> 	Min xfer 					= 1039360.00 kB
>>>>>> 	CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %
>>>>>> 
>>>>>> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
>>>>>> iommu_ops.at(de)tach_dev"). It's losing some steam here.
>>>>>> 
>>>>>> 	Children see throughput for 12 initial writers 	= 4342419.12 kB/sec
>>>>>> 	Parent sees throughput for 12 initial writers 	= 4310612.79 kB/sec
>>>>>> 	Min throughput per process 			=  359299.06 kB/sec
>>>>>> 	Max throughput per process 			=  363866.16 kB/sec
>>>>>> 	Avg throughput per process 			=  361868.26 kB/sec
>>>>>> 	Min xfer 					= 1035520.00 kB
>>>>>> 	CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
>>>>>> 	Children see throughput for 12 rewriters 	= 4408576.66 kB/sec
>>>>>> 	Parent sees throughput for 12 rewriters 	= 4404280.87 kB/sec
>>>>>> 	Min throughput per process 			=  364553.88 kB/sec
>>>>>> 	Max throughput per process 			=  370029.28 kB/sec
>>>>>> 	Avg throughput per process 			=  367381.39 kB/sec
>>>>>> 	Min xfer 					= 1033216.00 kB
>>>>>> 	CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
>>>>>> 	Children see throughput for 12 readers 		= 5406879.47 kB/sec
>>>>>> 	Parent sees throughput for 12 readers 		= 5401862.78 kB/sec
>>>>>> 	Min throughput per process 			=  449583.03 kB/sec
>>>>>> 	Max throughput per process 			=  451761.69 kB/sec
>>>>>> 	Avg throughput per process 			=  450573.29 kB/sec
>>>>>> 	Min xfer 					= 1044224.00 kB
>>>>>> 	CPU utilization: Wall time    2.323    CPU time    1.977    CPU utilization  85.12 %
>>>>>> 	Children see throughput for 12 re-readers 	= 5410601.12 kB/sec
>>>>>> 	Parent sees throughput for 12 re-readers 	= 5403504.40 kB/sec
>>>>>> 	Min throughput per process 			=  449918.12 kB/sec
>>>>>> 	Max throughput per process 			=  452489.28 kB/sec
>>>>>> 	Avg throughput per process 			=  450883.43 kB/sec
>>>>>> 	Min xfer 					= 1043456.00 kB
>>>>>> 	CPU utilization: Wall time    2.321    CPU time    1.978    CPU utilization  85.21 %
>>>>>> 
>>>>>> And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
>>>>>> the iommu ops"). Significant throughput loss.
>>>>>> 
>>>>>> 	Children see throughput for 12 initial writers 	= 3812036.91 kB/sec
>>>>>> 	Parent sees throughput for 12 initial writers 	= 3753683.40 kB/sec
>>>>>> 	Min throughput per process 			=  313672.25 kB/sec
>>>>>> 	Max throughput per process 			=  321719.44 kB/sec
>>>>>> 	Avg throughput per process 			=  317669.74 kB/sec
>>>>>> 	Min xfer 					= 1022464.00 kB
>>>>>> 	CPU Utilization: Wall time    3.309    CPU time    1.986    CPU utilization  60.02 %
>>>>>> 	Children see throughput for 12 rewriters 	= 3786831.94 kB/sec
>>>>>> 	Parent sees throughput for 12 rewriters 	= 3783205.58 kB/sec
>>>>>> 	Min throughput per process 			=  313654.44 kB/sec
>>>>>> 	Max throughput per process 			=  317844.50 kB/sec
>>>>>> 	Avg throughput per process 			=  315569.33 kB/sec
>>>>>> 	Min xfer 					= 1035520.00 kB
>>>>>> 	CPU utilization: Wall time    3.302    CPU time    1.945    CPU utilization  58.90 %
>>>>>> 	Children see throughput for 12 readers 		= 4265828.28 kB/sec
>>>>>> 	Parent sees throughput for 12 readers 		= 4261844.88 kB/sec
>>>>>> 	Min throughput per process 			=  352305.00 kB/sec
>>>>>> 	Max throughput per process 			=  357726.22 kB/sec
>>>>>> 	Avg throughput per process 			=  355485.69 kB/sec
>>>>>> 	Min xfer 					= 1032960.00 kB
>>>>>> 	CPU utilization: Wall time    2.934    CPU time    1.942    CPU utilization  66.20 %
>>>>>> 	Children see throughput for 12 re-readers 	= 4220651.19 kB/sec
>>>>>> 	Parent sees throughput for 12 re-readers 	= 4216096.04 kB/sec
>>>>>> 	Min throughput per process 			=  348677.16 kB/sec
>>>>>> 	Max throughput per process 			=  353467.44 kB/sec
>>>>>> 	Avg throughput per process 			=  351720.93 kB/sec
>>>>>> 	Min xfer 					= 1035264.00 kB
>>>>>> 	CPU utilization: Wall time    2.969    CPU time    1.952    CPU utilization  65.74 %
>>>>>> 
>>>>>> The regression appears to be 100% reproducible.
>>>> Any thoughts?
>>>> How about some tools to try or debugging advice? I don't know where to start.
>>> 
>>> I'm not familiar enough with VT-D internals or Infiniband to have a clue why the middle commit makes any difference (the calculation itself is not on a fast path, so AFAICS the worst it could do is change your maximum DMA address size from 48/57 bits to 47/56, and that seems relatively benign).
>>> 
>>> With the last commit, though, at least part of it is likely to be the unfortunate inevitable overhead of the internal indirection through the IOMMU API. There's a coincidental performance-related thread where we've already started pondering some ideas in that area[1] (note that Intel is the last one to the party here; AMD has been using this path for a while, and it's all that arm64 systems have ever known). I'm not sure if there's any difference in the strict invalidation behaviour between the IOMMU API calls and the old intel_dma_ops, but I suppose that might be worth quickly double-checking as well. I guess the main thing would be to do some profiling to see where time is being spent in iommu-dma and intel-iommu vs. just different parts of intel-iommu before, and whether anything in particular stands out beyond the extra call overhead currently incurred by iommu_{map,unmap}.
>> I did a function_graph trace of the above iozone test on a v5.10 NFS
>> client and again on v5.11-rc. There is a substantial timing difference
>> in dma_map_sg_attrs. Each excerpt below is for DMA-mapping a 120KB set
>> of pages that are part of an NFS/RDMA WRITE operation.
>> v5.10:
>> 1072.028308: funcgraph_entry:                   |  dma_map_sg_attrs() {
>> 1072.028308: funcgraph_entry:                   |    intel_map_sg() {
>> 1072.028309: funcgraph_entry:                   |      find_domain() {
>> 1072.028309: funcgraph_entry:        0.280 us   |        get_domain_info();
>> 1072.028310: funcgraph_exit:         0.930 us   |      }
>> 1072.028310: funcgraph_entry:        0.360 us   |      domain_get_iommu();
>> 1072.028311: funcgraph_entry:                   |      intel_alloc_iova() {
>> 1072.028311: funcgraph_entry:                   |        alloc_iova_fast() {
>> 1072.028311: funcgraph_entry:        0.375 us   |          _raw_spin_lock_irqsave();
>> 1072.028312: funcgraph_entry:        0.285 us   |          __lock_text_start();
>> 1072.028313: funcgraph_exit:         1.500 us   |        }
>> 1072.028313: funcgraph_exit:         2.052 us   |      }
>> 1072.028313: funcgraph_entry:                   |      domain_mapping() {
>> 1072.028313: funcgraph_entry:                   |        __domain_mapping() {
>> 1072.028314: funcgraph_entry:        0.350 us   |          pfn_to_dma_pte();
>> 1072.028315: funcgraph_entry:        0.942 us   |          domain_flush_cache();
>> 1072.028316: funcgraph_exit:         2.852 us   |        }
>> 1072.028316: funcgraph_entry:        0.275 us   |        iommu_flush_write_buffer();
>> 1072.028317: funcgraph_exit:         4.213 us   |      }
>> 1072.028318: funcgraph_exit:         9.392 us   |    }
>> 1072.028318: funcgraph_exit:       + 10.073 us  |  }
>> 1072.028323: xprtrdma_mr_map:      mr.id=115 nents=30 122880@0xe476ca03f1180000:0x18011105 (TO_DEVICE)
>> 1072.028323: xprtrdma_chunk_read:  task:63879@5 pos=148 122880@0xe476ca03f1180000:0x18011105 (more)
>> v5.11-rc:
>> 57.602990: funcgraph_entry:                   |  dma_map_sg_attrs() {
>> 57.602990: funcgraph_entry:                   |    iommu_dma_map_sg() {
>> 57.602990: funcgraph_entry:        0.285 us   |      iommu_get_dma_domain();
>> 57.602991: funcgraph_entry:        0.270 us   |      iommu_dma_deferred_attach();
>> 57.602991: funcgraph_entry:                   |      iommu_dma_sync_sg_for_device() {
>> 57.602992: funcgraph_entry:        0.268 us   |        dev_is_untrusted();
>> 57.602992: funcgraph_exit:         0.815 us   |      }
>> 57.602993: funcgraph_entry:        0.267 us   |      dev_is_untrusted();
>> 57.602993: funcgraph_entry:                   |      iommu_dma_alloc_iova() {
>> 57.602994: funcgraph_entry:                   |        alloc_iova_fast() {
>> 57.602994: funcgraph_entry:        0.260 us   |          _raw_spin_lock_irqsave();
>> 57.602995: funcgraph_entry:        0.293 us   |          _raw_spin_lock();
>> 57.602995: funcgraph_entry:        0.273 us   |          _raw_spin_unlock_irqrestore();
>> 57.602996: funcgraph_entry:        1.147 us   |          alloc_iova();
>> 57.602997: funcgraph_exit:         3.370 us   |        }
>> 57.602997: funcgraph_exit:         3.945 us   |      }
>> 57.602998: funcgraph_entry:        0.272 us   |      dma_info_to_prot();
>> 57.602998: funcgraph_entry:                   |      iommu_map_sg_atomic() {
>> 57.602998: funcgraph_entry:                   |        __iommu_map_sg() {
>> 57.602999: funcgraph_entry:        1.733 us   |          __iommu_map();
>> 57.603001: funcgraph_entry:        1.642 us   |          __iommu_map();
>> 57.603003: funcgraph_entry:        1.638 us   |          __iommu_map();
>> 57.603005: funcgraph_entry:        1.645 us   |          __iommu_map();
>> 57.603007: funcgraph_entry:        1.630 us   |          __iommu_map();
>> 57.603009: funcgraph_entry:        1.770 us   |          __iommu_map();
>> 57.603011: funcgraph_entry:        1.730 us   |          __iommu_map();
>> 57.603013: funcgraph_entry:        1.633 us   |          __iommu_map();
>> 57.603015: funcgraph_entry:        1.605 us   |          __iommu_map();
>> 57.603017: funcgraph_entry:        2.847 us   |          __iommu_map();
>> 57.603020: funcgraph_entry:        2.847 us   |          __iommu_map();
>> 57.603024: funcgraph_entry:        2.955 us   |          __iommu_map();
>> 57.603027: funcgraph_entry:        2.928 us   |          __iommu_map();
>> 57.603030: funcgraph_entry:        2.933 us   |          __iommu_map();
>> 57.603034: funcgraph_entry:        2.943 us   |          __iommu_map();
>> 57.603037: funcgraph_entry:        2.928 us   |          __iommu_map();
>> 57.603040: funcgraph_entry:        2.857 us   |          __iommu_map();
>> 57.603044: funcgraph_entry:        2.953 us   |          __iommu_map();
>> 57.603047: funcgraph_entry:        3.023 us   |          __iommu_map();
>> 57.603050: funcgraph_entry:        1.645 us   |          __iommu_map();
>> 57.603052: funcgraph_exit:       + 53.648 us  |        }
>> 57.603052: funcgraph_exit:       + 54.178 us  |      }
>> 57.603053: funcgraph_exit:       + 62.953 us  |    }
>> 57.603053: funcgraph_exit:       + 63.567 us  |  }
>> 57.603059: xprtrdma_mr_map:      task:60@5 mr.id=4 nents=30 122880@0xd79cc0e2f18c0000:0x00010501 (TO_DEVICE)
>> 57.603060: xprtrdma_chunk_read:  task:60@5 pos=148 122880@0xd79cc0e2f18c0000:0x00010501 (more)
> 
> I kind of believe it's due to the indirect calls. This is also reported
> on ARM.
> 
> https://lore.kernel.org/linux-iommu/1610376862-927-1-git-send-email-isaacm@codeaurora.org/
> 
> Maybe we can try changing indirect calls to static ones to verify this
> problem.

I liked the idea of map_sg() enough to try my hand at building a PoC for
Intel, based on Isaac's patch series. It's just a cut-and-paste of the
generic iommu.c code with the indirect calls to ops->map() replaced.

The indirect calls do not seem to be the problem. Calling intel_iommu_map
directly appears to be as costly as calling it indirectly.

However, perhaps there are other ways map_sg() can be beneficial. In
v5.10, __domain_mapping and iommu_flush_write_buffer() appear to be
invoked just once for each large map operation, for example.
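
The direction I have in mind is roughly the following (hypothetical
sketch only, with made-up names: __intel_map_range() stands in for
whatever ends up wrapping __domain_mapping(), intel_flush_after_map()
for the single iommu_flush_write_buffer() call, and the op signature is
invented too; the prototype traced below does not do this yet, it still
flushes once per intel_iommu_map() call):

static int intel_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
			      struct scatterlist *sg, unsigned int nents,
			      int prot, gfp_t gfp, size_t *mapped)
{
	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
	struct scatterlist *s;
	size_t len = 0;
	unsigned int i;
	int ret;

	/* Install PTEs for every segment; no per-segment flushing. */
	for_each_sg(sg, s, nents, i) {
		ret = __intel_map_range(dmar_domain, iova + len, sg_phys(s),
					s->length, prot, gfp);
		if (ret)
			return ret;
		len += s->length;
	}

	/* Flush once for the whole request, as the v5.10 path did. */
	intel_flush_after_map(dmar_domain, iova, len);

	*mapped = len;
	return 0;
}

(Error unwinding and the offset/alignment handling the real code needs
are omitted here.)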


Here's a trace of my prototype in operation:

380.620150: funcgraph_entry:                   |  iommu_dma_map_sg() {
380.620150: funcgraph_entry:        0.285 us   |    iommu_get_dma_domain();
380.620150: funcgraph_entry:        0.265 us   |    iommu_dma_deferred_attach();
380.620151: funcgraph_entry:                   |    iommu_dma_sync_sg_for_device() {
380.620151: funcgraph_entry:        0.285 us   |      dev_is_untrusted();
380.620152: funcgraph_exit:         0.860 us   |    }
380.620152: funcgraph_entry:        0.263 us   |    dev_is_untrusted();
380.620153: funcgraph_entry:                   |    iommu_dma_alloc_iova() {
380.620153: funcgraph_entry:                   |      alloc_iova_fast() {
380.620153: funcgraph_entry:        0.268 us   |        _raw_spin_lock_irqsave();
380.620154: funcgraph_entry:        0.275 us   |        _raw_spin_unlock_irqrestore();
380.620155: funcgraph_exit:         1.402 us   |      }
380.620155: funcgraph_exit:         1.955 us   |    }
380.620155: funcgraph_entry:        0.265 us   |    dma_info_to_prot();
380.620156: funcgraph_entry:                   |    iommu_map_sg_atomic() {
380.620156: funcgraph_entry:                   |      __iommu_map_sg() {
380.620156: funcgraph_entry:                   |        intel_iommu_map_sg() {
380.620157: funcgraph_entry:        0.270 us   |          iommu_pgsize();
380.620157: funcgraph_entry:                   |          intel_iommu_map() {
380.620157: funcgraph_entry:        0.970 us   |            __domain_mapping();
380.620159: funcgraph_entry:        0.265 us   |            iommu_flush_write_buffer();
380.620159: funcgraph_exit:         2.322 us   |          }
380.620160: funcgraph_entry:        0.270 us   |          iommu_pgsize();
380.620160: funcgraph_entry:                   |          intel_iommu_map() {
380.620161: funcgraph_entry:        0.957 us   |            __domain_mapping();
380.620162: funcgraph_entry:        0.275 us   |            iommu_flush_write_buffer();
380.620163: funcgraph_exit:         2.315 us   |          }
380.620163: funcgraph_entry:        0.265 us   |          iommu_pgsize();
380.620163: funcgraph_entry:                   |          intel_iommu_map() {
380.620164: funcgraph_entry:        0.940 us   |            __domain_mapping();
380.620165: funcgraph_entry:        0.270 us   |            iommu_flush_write_buffer();
380.620166: funcgraph_exit:         2.295 us   |          }

 ....

380.620247: funcgraph_entry:        0.262 us   |          iommu_pgsize();
380.620248: funcgraph_entry:                   |          intel_iommu_map() {
380.620248: funcgraph_entry:        0.935 us   |            __domain_mapping();
380.620249: funcgraph_entry:        0.305 us   |            iommu_flush_write_buffer();
380.620250: funcgraph_exit:         2.315 us   |          }
380.620250: funcgraph_entry:        0.273 us   |          iommu_pgsize();
380.620251: funcgraph_entry:                   |          intel_iommu_map() {
380.620251: funcgraph_entry:        0.967 us   |            __domain_mapping();
380.620253: funcgraph_entry:        0.265 us   |            iommu_flush_write_buffer();
380.620253: funcgraph_exit:         2.310 us   |          }
380.620254: funcgraph_exit:       + 97.388 us  |        }
380.620254: funcgraph_exit:       + 97.960 us  |      }
380.620254: funcgraph_exit:       + 98.482 us  |    }
380.620255: funcgraph_exit:       ! 105.175 us |  }
380.620260: xprtrdma_mr_map:      task:1607@5 mr.id=126 nents=30 122880@0xf06ee5bbf1920000:0x70011104 (TO_DEVICE)
380.620261: xprtrdma_chunk_read:  task:1607@5 pos=148 122880@0xf06ee5bbf1920000:0x70011104 (more)


--
Chuck Lever




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: performance regression noted in v5.11-rc after c062db039f40
  2021-01-22 16:18             ` Chuck Lever
@ 2021-01-22 17:38               ` Robin Murphy
  -1 siblings, 0 replies; 36+ messages in thread
From: Robin Murphy @ 2021-01-22 17:38 UTC (permalink / raw)
  To: Chuck Lever, Lu Baolu
  Cc: iommu, Will Deacon, linux-rdma, logang, Christoph Hellwig,
	murphyt7, isaacm

On 2021-01-22 16:18, Chuck Lever wrote:
> 
> 
>> On Jan 21, 2021, at 10:00 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>>
>> +Isaac
>>
>> On 1/22/21 3:09 AM, Chuck Lever wrote:
>>>> On Jan 18, 2021, at 1:00 PM, Robin Murphy <robin.murphy@arm.com> wrote:
>>>>
>>>> On 2021-01-18 16:18, Chuck Lever wrote:
>>>>>> On Jan 12, 2021, at 9:38 AM, Will Deacon <will@kernel.org> wrote:
>>>>>>
>>>>>> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
>>>>>>
>>>>>> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
>>>>>>> Hi-
>>>>>>>
>>>>>>> [ Please cc: me on replies, I'm not currently subscribed to
>>>>>>> iommu@lists ].
>>>>>>>
>>>>>>> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
>>>>>>> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
>>>>>>>
>>>>>>> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
>>>>>>>
>>>>>>> For those not familiar with the way storage protocols use RDMA, The
>>>>>>> initiator/client sets up memory regions and the target/server uses
>>>>>>> RDMA Read and Write to move data out of and into those regions. The
>>>>>>> initiator/client uses only RDMA memory registration and invalidation
>>>>>>> operations, and the target/server uses RDMA Read and Write.
>>>>>>>
>>>>>>> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
>>>>>>> enabled using the kernel command line options "intel_iommu=on
>>>>>>> iommu=strict".
>>>>>>>
>>>>>>> Recently I've noticed a significant (25-30%) loss in NFS throughput.
>>>>>>> I was able to bisect on my client to the following commits.
>>>>>>>
>>>>>>> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>>>>>>> map_sg"). This is about normal for this test.
>>>>>>>
>>>>>>> 	Children see throughput for 12 initial writers 	= 4732581.09 kB/sec
>>>>>>> 	Parent sees throughput for 12 initial writers 	= 4646810.21 kB/sec
>>>>>>> 	Min throughput per process 			=  387764.34 kB/sec
>>>>>>> 	Max throughput per process 			=  399655.47 kB/sec
>>>>>>> 	Avg throughput per process 			=  394381.76 kB/sec
>>>>>>> 	Min xfer 					= 1017344.00 kB
>>>>>>> 	CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
>>>>>>> 	Children see throughput for 12 rewriters 	= 4837741.94 kB/sec
>>>>>>> 	Parent sees throughput for 12 rewriters 	= 4833509.35 kB/sec
>>>>>>> 	Min throughput per process 			=  398983.72 kB/sec
>>>>>>> 	Max throughput per process 			=  406199.66 kB/sec
>>>>>>> 	Avg throughput per process 			=  403145.16 kB/sec
>>>>>>> 	Min xfer 					= 1030656.00 kB
>>>>>>> 	CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
>>>>>>> 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>>>>>>> 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>>>>>>> 	Min throughput per process 			=  491812.38 kB/sec
>>>>>>> 	Max throughput per process 			=  494777.28 kB/sec
>>>>>>> 	Avg throughput per process 			=  493447.58 kB/sec
>>>>>>> 	Min xfer 					= 1042688.00 kB
>>>>>>> 	CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
>>>>>>> 	Children see throughput for 12 re-readers 	= 5947985.69 kB/sec
>>>>>>> 	Parent sees throughput for 12 re-readers 	= 5941348.51 kB/sec
>>>>>>> 	Min throughput per process 			=  492805.81 kB/sec
>>>>>>> 	Max throughput per process 			=  497280.19 kB/sec
>>>>>>> 	Avg throughput per process 			=  495665.47 kB/sec
>>>>>>> 	Min xfer 					= 1039360.00 kB
>>>>>>> 	CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %
>>>>>>>
>>>>>>> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
>>>>>>> iommu_ops.at(de)tach_dev"). It's losing some steam here.
>>>>>>>
>>>>>>> 	Children see throughput for 12 initial writers 	= 4342419.12 kB/sec
>>>>>>> 	Parent sees throughput for 12 initial writers 	= 4310612.79 kB/sec
>>>>>>> 	Min throughput per process 			=  359299.06 kB/sec
>>>>>>> 	Max throughput per process 			=  363866.16 kB/sec
>>>>>>> 	Avg throughput per process 			=  361868.26 kB/sec
>>>>>>> 	Min xfer 					= 1035520.00 kB
>>>>>>> 	CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
>>>>>>> 	Children see throughput for 12 rewriters 	= 4408576.66 kB/sec
>>>>>>> 	Parent sees throughput for 12 rewriters 	= 4404280.87 kB/sec
>>>>>>> 	Min throughput per process 			=  364553.88 kB/sec
>>>>>>> 	Max throughput per process 			=  370029.28 kB/sec
>>>>>>> 	Avg throughput per process 			=  367381.39 kB/sec
>>>>>>> 	Min xfer 					= 1033216.00 kB
>>>>>>> 	CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
>>>>>>> 	Children see throughput for 12 readers 		= 5406879.47 kB/sec
>>>>>>> 	Parent sees throughput for 12 readers 		= 5401862.78 kB/sec
>>>>>>> 	Min throughput per process 			=  449583.03 kB/sec
>>>>>>> 	Max throughput per process 			=  451761.69 kB/sec
>>>>>>> 	Avg throughput per process 			=  450573.29 kB/sec
>>>>>>> 	Min xfer 					= 1044224.00 kB
>>>>>>> 	CPU utilization: Wall time    2.323    CPU time    1.977    CPU utilization  85.12 %
>>>>>>> 	Children see throughput for 12 re-readers 	= 5410601.12 kB/sec
>>>>>>> 	Parent sees throughput for 12 re-readers 	= 5403504.40 kB/sec
>>>>>>> 	Min throughput per process 			=  449918.12 kB/sec
>>>>>>> 	Max throughput per process 			=  452489.28 kB/sec
>>>>>>> 	Avg throughput per process 			=  450883.43 kB/sec
>>>>>>> 	Min xfer 					= 1043456.00 kB
>>>>>>> 	CPU utilization: Wall time    2.321    CPU time    1.978    CPU utilization  85.21 %
>>>>>>>
>>>>>>> And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
>>>>>>> the iommu ops"). Significant throughput loss.
>>>>>>>
>>>>>>> 	Children see throughput for 12 initial writers 	= 3812036.91 kB/sec
>>>>>>> 	Parent sees throughput for 12 initial writers 	= 3753683.40 kB/sec
>>>>>>> 	Min throughput per process 			=  313672.25 kB/sec
>>>>>>> 	Max throughput per process 			=  321719.44 kB/sec
>>>>>>> 	Avg throughput per process 			=  317669.74 kB/sec
>>>>>>> 	Min xfer 					= 1022464.00 kB
>>>>>>> 	CPU Utilization: Wall time    3.309    CPU time    1.986    CPU utilization  60.02 %
>>>>>>> 	Children see throughput for 12 rewriters 	= 3786831.94 kB/sec
>>>>>>> 	Parent sees throughput for 12 rewriters 	= 3783205.58 kB/sec
>>>>>>> 	Min throughput per process 			=  313654.44 kB/sec
>>>>>>> 	Max throughput per process 			=  317844.50 kB/sec
>>>>>>> 	Avg throughput per process 			=  315569.33 kB/sec
>>>>>>> 	Min xfer 					= 1035520.00 kB
>>>>>>> 	CPU utilization: Wall time    3.302    CPU time    1.945    CPU utilization  58.90 %
>>>>>>> 	Children see throughput for 12 readers 		= 4265828.28 kB/sec
>>>>>>> 	Parent sees throughput for 12 readers 		= 4261844.88 kB/sec
>>>>>>> 	Min throughput per process 			=  352305.00 kB/sec
>>>>>>> 	Max throughput per process 			=  357726.22 kB/sec
>>>>>>> 	Avg throughput per process 			=  355485.69 kB/sec
>>>>>>> 	Min xfer 					= 1032960.00 kB
>>>>>>> 	CPU utilization: Wall time    2.934    CPU time    1.942    CPU utilization  66.20 %
>>>>>>> 	Children see throughput for 12 re-readers 	= 4220651.19 kB/sec
>>>>>>> 	Parent sees throughput for 12 re-readers 	= 4216096.04 kB/sec
>>>>>>> 	Min throughput per process 			=  348677.16 kB/sec
>>>>>>> 	Max throughput per process 			=  353467.44 kB/sec
>>>>>>> 	Avg throughput per process 			=  351720.93 kB/sec
>>>>>>> 	Min xfer 					= 1035264.00 kB
>>>>>>> 	CPU utilization: Wall time    2.969    CPU time    1.952    CPU utilization  65.74 %
>>>>>>>
>>>>>>> The regression appears to be 100% reproducible.
>>>>> Any thoughts?
>>>>> How about some tools to try or debugging advice? I don't know where to start.
>>>>
>>>> I'm not familiar enough with VT-D internals or Infiniband to have a clue why the middle commit makes any difference (the calculation itself is not on a fast path, so AFAICS the worst it could do is change your maximum DMA address size from 48/57 bits to 47/56, and that seems relatively benign).
>>>>
>>>> With the last commit, though, at least part of it is likely to be the unfortunate inevitable overhead of the internal indirection through the IOMMU API. There's a coincidental performance-related thread where we've already started pondering some ideas in that area[1] (note that Intel is the last one to the party here; AMD has been using this path for a while, and it's all that arm64 systems have ever known). I'm not sure if there's any difference in the strict invalidation behaviour between the IOMMU API calls and the old intel_dma_ops, but I suppose that might be worth quickly double-checking as well. I guess the main thing would be to do some profiling to see where time is being spent in iommu-dma and intel-iommu vs. just different parts of intel-iommu before, and whether anything in particular stands out beyond the extra call overhead currently incurred by iommu_{map,unmap}.
>>> I did a function_graph trace of the above iozone test on a v5.10 NFS
>>> client and again on v5.11-rc. There is a substantial timing difference
>>> in dma_map_sg_attrs. Each excerpt below is for DMA-mapping a 120KB set
>>> of pages that are part of an NFS/RDMA WRITE operation.
>>> v5.10:
>>> 1072.028308: funcgraph_entry:                   |  dma_map_sg_attrs() {
>>> 1072.028308: funcgraph_entry:                   |    intel_map_sg() {
>>> 1072.028309: funcgraph_entry:                   |      find_domain() {
>>> 1072.028309: funcgraph_entry:        0.280 us   |        get_domain_info();
>>> 1072.028310: funcgraph_exit:         0.930 us   |      }
>>> 1072.028310: funcgraph_entry:        0.360 us   |      domain_get_iommu();
>>> 1072.028311: funcgraph_entry:                   |      intel_alloc_iova() {
>>> 1072.028311: funcgraph_entry:                   |        alloc_iova_fast() {
>>> 1072.028311: funcgraph_entry:        0.375 us   |          _raw_spin_lock_irqsave();
>>> 1072.028312: funcgraph_entry:        0.285 us   |          __lock_text_start();
>>> 1072.028313: funcgraph_exit:         1.500 us   |        }
>>> 1072.028313: funcgraph_exit:         2.052 us   |      }
>>> 1072.028313: funcgraph_entry:                   |      domain_mapping() {
>>> 1072.028313: funcgraph_entry:                   |        __domain_mapping() {
>>> 1072.028314: funcgraph_entry:        0.350 us   |          pfn_to_dma_pte();
>>> 1072.028315: funcgraph_entry:        0.942 us   |          domain_flush_cache();
>>> 1072.028316: funcgraph_exit:         2.852 us   |        }
>>> 1072.028316: funcgraph_entry:        0.275 us   |        iommu_flush_write_buffer();
>>> 1072.028317: funcgraph_exit:         4.213 us   |      }
>>> 1072.028318: funcgraph_exit:         9.392 us   |    }
>>> 1072.028318: funcgraph_exit:       + 10.073 us  |  }
>>> 1072.028323: xprtrdma_mr_map:      mr.id=115 nents=30 122880@0xe476ca03f1180000:0x18011105 (TO_DEVICE)
>>> 1072.028323: xprtrdma_chunk_read:  task:63879@5 pos=148 122880@0xe476ca03f1180000:0x18011105 (more)
>>> v5.11-rc:
>>> 57.602990: funcgraph_entry:                   |  dma_map_sg_attrs() {
>>> 57.602990: funcgraph_entry:                   |    iommu_dma_map_sg() {
>>> 57.602990: funcgraph_entry:        0.285 us   |      iommu_get_dma_domain();
>>> 57.602991: funcgraph_entry:        0.270 us   |      iommu_dma_deferred_attach();
>>> 57.602991: funcgraph_entry:                   |      iommu_dma_sync_sg_for_device() {
>>> 57.602992: funcgraph_entry:        0.268 us   |        dev_is_untrusted();
>>> 57.602992: funcgraph_exit:         0.815 us   |      }
>>> 57.602993: funcgraph_entry:        0.267 us   |      dev_is_untrusted();
>>> 57.602993: funcgraph_entry:                   |      iommu_dma_alloc_iova() {
>>> 57.602994: funcgraph_entry:                   |        alloc_iova_fast() {
>>> 57.602994: funcgraph_entry:        0.260 us   |          _raw_spin_lock_irqsave();
>>> 57.602995: funcgraph_entry:        0.293 us   |          _raw_spin_lock();
>>> 57.602995: funcgraph_entry:        0.273 us   |          _raw_spin_unlock_irqrestore();
>>> 57.602996: funcgraph_entry:        1.147 us   |          alloc_iova();
>>> 57.602997: funcgraph_exit:         3.370 us   |        }
>>> 57.602997: funcgraph_exit:         3.945 us   |      }
>>> 57.602998: funcgraph_entry:        0.272 us   |      dma_info_to_prot();
>>> 57.602998: funcgraph_entry:                   |      iommu_map_sg_atomic() {
>>> 57.602998: funcgraph_entry:                   |        __iommu_map_sg() {
>>> 57.602999: funcgraph_entry:        1.733 us   |          __iommu_map();
>>> 57.603001: funcgraph_entry:        1.642 us   |          __iommu_map();
>>> 57.603003: funcgraph_entry:        1.638 us   |          __iommu_map();
>>> 57.603005: funcgraph_entry:        1.645 us   |          __iommu_map();
>>> 57.603007: funcgraph_entry:        1.630 us   |          __iommu_map();
>>> 57.603009: funcgraph_entry:        1.770 us   |          __iommu_map();
>>> 57.603011: funcgraph_entry:        1.730 us   |          __iommu_map();
>>> 57.603013: funcgraph_entry:        1.633 us   |          __iommu_map();
>>> 57.603015: funcgraph_entry:        1.605 us   |          __iommu_map();
>>> 57.603017: funcgraph_entry:        2.847 us   |          __iommu_map();
>>> 57.603020: funcgraph_entry:        2.847 us   |          __iommu_map();
>>> 57.603024: funcgraph_entry:        2.955 us   |          __iommu_map();
>>> 57.603027: funcgraph_entry:        2.928 us   |          __iommu_map();
>>> 57.603030: funcgraph_entry:        2.933 us   |          __iommu_map();
>>> 57.603034: funcgraph_entry:        2.943 us   |          __iommu_map();
>>> 57.603037: funcgraph_entry:        2.928 us   |          __iommu_map();
>>> 57.603040: funcgraph_entry:        2.857 us   |          __iommu_map();
>>> 57.603044: funcgraph_entry:        2.953 us   |          __iommu_map();
>>> 57.603047: funcgraph_entry:        3.023 us   |          __iommu_map();
>>> 57.603050: funcgraph_entry:        1.645 us   |          __iommu_map();
>>> 57.603052: funcgraph_exit:       + 53.648 us  |        }
>>> 57.603052: funcgraph_exit:       + 54.178 us  |      }
>>> 57.603053: funcgraph_exit:       + 62.953 us  |    }
>>> 57.603053: funcgraph_exit:       + 63.567 us  |  }
>>> 57.603059: xprtrdma_mr_map:      task:60@5 mr.id=4 nents=30 122880@0xd79cc0e2f18c0000:0x00010501 (TO_DEVICE)
>>> 57.603060: xprtrdma_chunk_read:  task:60@5 pos=148 122880@0xd79cc0e2f18c0000:0x00010501 (more)
>>
>> I kind of believe it's due to the indirect calls. This is also reported
>> on ARM.
>>
>> https://lore.kernel.org/linux-iommu/1610376862-927-1-git-send-email-isaacm@codeaurora.org/
>>
>> Maybe we can try changing indirect calls to static ones to verify this
>> problem.
> 
> I liked the idea of map_sg() enough to try my hand at building a PoC for
> Intel, based on Isaac's patch series. It's just a cut-and-paste of the
> generic iommu.c code with the indirect calls to ops->map() replaced.
> 
> The indirect calls do not seem to be the problem. Calling intel_iommu_map
> directly appears to be as costly as calling it indirectly.
> 
> However, perhaps there are other ways map_sg() can be beneficial. In
> v5.10, __domain_mapping and iommu_flush_write_buffer() appear to be
> invoked just once for each large map operation, for example.

Oh, if the driver needs to do maintenance beyond just installing PTEs, 
that should probably be devolved to iotlb_sync_map anyway. There's a 
patch series here generalising that to be more useful, which is 
hopefully just waiting to be merged now:

https://lore.kernel.org/linux-iommu/20210107122909.16317-1-yong.wu@mediatek.com/
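
If I read that series right, iommu_map_sg() then ends up calling
ops->iotlb_sync_map() just once for the whole list, with the aggregate
iova/size, so on the Intel side it might look something like this
(untested sketch, written from memory):

static void intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
				       unsigned long iova, size_t size)
{
	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
	int idx;

	/* One write-buffer flush per map request, not per page. */
	for_each_domain_iommu(idx, dmar_domain)
		iommu_flush_write_buffer(g_iommus[idx]);
}

with ".iotlb_sync_map = intel_iommu_iotlb_sync_map" added to
intel_iommu_ops and the flush dropped from the per-page map path.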

Robin.

> Here's a trace of my prototype in operation:
> 
> 380.620150: funcgraph_entry:                   |  iommu_dma_map_sg() {
> 380.620150: funcgraph_entry:        0.285 us   |    iommu_get_dma_domain();
> 380.620150: funcgraph_entry:        0.265 us   |    iommu_dma_deferred_attach();
> 380.620151: funcgraph_entry:                   |    iommu_dma_sync_sg_for_device() {
> 380.620151: funcgraph_entry:        0.285 us   |      dev_is_untrusted();
> 380.620152: funcgraph_exit:         0.860 us   |    }
> 380.620152: funcgraph_entry:        0.263 us   |    dev_is_untrusted();
> 380.620153: funcgraph_entry:                   |    iommu_dma_alloc_iova() {
> 380.620153: funcgraph_entry:                   |      alloc_iova_fast() {
> 380.620153: funcgraph_entry:        0.268 us   |        _raw_spin_lock_irqsave();
> 380.620154: funcgraph_entry:        0.275 us   |        _raw_spin_unlock_irqrestore();
> 380.620155: funcgraph_exit:         1.402 us   |      }
> 380.620155: funcgraph_exit:         1.955 us   |    }
> 380.620155: funcgraph_entry:        0.265 us   |    dma_info_to_prot();
> 380.620156: funcgraph_entry:                   |    iommu_map_sg_atomic() {
> 380.620156: funcgraph_entry:                   |      __iommu_map_sg() {
> 380.620156: funcgraph_entry:                   |        intel_iommu_map_sg() {
> 380.620157: funcgraph_entry:        0.270 us   |          iommu_pgsize();
> 380.620157: funcgraph_entry:                   |          intel_iommu_map() {
> 380.620157: funcgraph_entry:        0.970 us   |            __domain_mapping();
> 380.620159: funcgraph_entry:        0.265 us   |            iommu_flush_write_buffer();
> 380.620159: funcgraph_exit:         2.322 us   |          }
> 380.620160: funcgraph_entry:        0.270 us   |          iommu_pgsize();
> 380.620160: funcgraph_entry:                   |          intel_iommu_map() {
> 380.620161: funcgraph_entry:        0.957 us   |            __domain_mapping();
> 380.620162: funcgraph_entry:        0.275 us   |            iommu_flush_write_buffer();
> 380.620163: funcgraph_exit:         2.315 us   |          }
> 380.620163: funcgraph_entry:        0.265 us   |          iommu_pgsize();
> 380.620163: funcgraph_entry:                   |          intel_iommu_map() {
> 380.620164: funcgraph_entry:        0.940 us   |            __domain_mapping();
> 380.620165: funcgraph_entry:        0.270 us   |            iommu_flush_write_buffer();
> 380.620166: funcgraph_exit:         2.295 us   |          }
> 
>   ....
> 
> 380.620247: funcgraph_entry:        0.262 us   |          iommu_pgsize();
> 380.620248: funcgraph_entry:                   |          intel_iommu_map() {
> 380.620248: funcgraph_entry:        0.935 us   |            __domain_mapping();
> 380.620249: funcgraph_entry:        0.305 us   |            iommu_flush_write_buffer();
> 380.620250: funcgraph_exit:         2.315 us   |          }
> 380.620250: funcgraph_entry:        0.273 us   |          iommu_pgsize();
> 380.620251: funcgraph_entry:                   |          intel_iommu_map() {
> 380.620251: funcgraph_entry:        0.967 us   |            __domain_mapping();
> 380.620253: funcgraph_entry:        0.265 us   |            iommu_flush_write_buffer();
> 380.620253: funcgraph_exit:         2.310 us   |          }
> 380.620254: funcgraph_exit:       + 97.388 us  |        }
> 380.620254: funcgraph_exit:       + 97.960 us  |      }
> 380.620254: funcgraph_exit:       + 98.482 us  |    }
> 380.620255: funcgraph_exit:       ! 105.175 us |  }
> 380.620260: xprtrdma_mr_map:      task:1607@5 mr.id=126 nents=30 122880@0xf06ee5bbf1920000:0x70011104 (TO_DEVICE)
> 380.620261: xprtrdma_chunk_read:  task:1607@5 pos=148 122880@0xf06ee5bbf1920000:0x70011104 (more)
> 
> 
> --
> Chuck Lever
> 
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: performance regression noted in v5.11-rc after c062db039f40
@ 2021-01-22 17:38               ` Robin Murphy
  0 siblings, 0 replies; 36+ messages in thread
From: Robin Murphy @ 2021-01-22 17:38 UTC (permalink / raw)
  To: Chuck Lever, Lu Baolu
  Cc: isaacm, linux-rdma, Will Deacon, murphyt7, iommu, logang,
	Christoph Hellwig

On 2021-01-22 16:18, Chuck Lever wrote:
> 
> 
>> On Jan 21, 2021, at 10:00 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>>
>> +Isaac
>>
>> On 1/22/21 3:09 AM, Chuck Lever wrote:
>>>> On Jan 18, 2021, at 1:00 PM, Robin Murphy <robin.murphy@arm.com> wrote:
>>>>
>>>> On 2021-01-18 16:18, Chuck Lever wrote:
>>>>>> On Jan 12, 2021, at 9:38 AM, Will Deacon <will@kernel.org> wrote:
>>>>>>
>>>>>> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
>>>>>>
>>>>>> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
>>>>>>> Hi-
>>>>>>>
>>>>>>> [ Please cc: me on replies, I'm not currently subscribed to
>>>>>>> iommu@lists ].
>>>>>>>
>>>>>>> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
>>>>>>> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
>>>>>>>
>>>>>>> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
>>>>>>>
>>>>>>> For those not familiar with the way storage protocols use RDMA, The
>>>>>>> initiator/client sets up memory regions and the target/server uses
>>>>>>> RDMA Read and Write to move data out of and into those regions. The
>>>>>>> initiator/client uses only RDMA memory registration and invalidation
>>>>>>> operations, and the target/server uses RDMA Read and Write.
>>>>>>>
>>>>>>> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
>>>>>>> enabled using the kernel command line options "intel_iommu=on
>>>>>>> iommu=strict".
>>>>>>>
>>>>>>> Recently I've noticed a significant (25-30%) loss in NFS throughput.
>>>>>>> I was able to bisect on my client to the following commits.
>>>>>>>
>>>>>>> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>>>>>>> map_sg"). This is about normal for this test.
>>>>>>>
>>>>>>> 	Children see throughput for 12 initial writers 	= 4732581.09 kB/sec
>>>>>>> 	Parent sees throughput for 12 initial writers 	= 4646810.21 kB/sec
>>>>>>> 	Min throughput per process 			=  387764.34 kB/sec
>>>>>>> 	Max throughput per process 			=  399655.47 kB/sec
>>>>>>> 	Avg throughput per process 			=  394381.76 kB/sec
>>>>>>> 	Min xfer 					= 1017344.00 kB
>>>>>>> 	CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
>>>>>>> 	Children see throughput for 12 rewriters 	= 4837741.94 kB/sec
>>>>>>> 	Parent sees throughput for 12 rewriters 	= 4833509.35 kB/sec
>>>>>>> 	Min throughput per process 			=  398983.72 kB/sec
>>>>>>> 	Max throughput per process 			=  406199.66 kB/sec
>>>>>>> 	Avg throughput per process 			=  403145.16 kB/sec
>>>>>>> 	Min xfer 					= 1030656.00 kB
>>>>>>> 	CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
>>>>>>> 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>>>>>>> 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>>>>>>> 	Min throughput per process 			=  491812.38 kB/sec
>>>>>>> 	Max throughput per process 			=  494777.28 kB/sec
>>>>>>> 	Avg throughput per process 			=  493447.58 kB/sec
>>>>>>> 	Min xfer 					= 1042688.00 kB
>>>>>>> 	CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
>>>>>>> 	Children see throughput for 12 re-readers 	= 5947985.69 kB/sec
>>>>>>> 	Parent sees throughput for 12 re-readers 	= 5941348.51 kB/sec
>>>>>>> 	Min throughput per process 			=  492805.81 kB/sec
>>>>>>> 	Max throughput per process 			=  497280.19 kB/sec
>>>>>>> 	Avg throughput per process 			=  495665.47 kB/sec
>>>>>>> 	Min xfer 					= 1039360.00 kB
>>>>>>> 	CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %
>>>>>>>
>>>>>>> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
>>>>>>> iommu_ops.at(de)tach_dev"). It's losing some steam here.
>>>>>>>
>>>>>>> 	Children see throughput for 12 initial writers 	= 4342419.12 kB/sec
>>>>>>> 	Parent sees throughput for 12 initial writers 	= 4310612.79 kB/sec
>>>>>>> 	Min throughput per process 			=  359299.06 kB/sec
>>>>>>> 	Max throughput per process 			=  363866.16 kB/sec
>>>>>>> 	Avg throughput per process 			=  361868.26 kB/sec
>>>>>>> 	Min xfer 					= 1035520.00 kB
>>>>>>> 	CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
>>>>>>> 	Children see throughput for 12 rewriters 	= 4408576.66 kB/sec
>>>>>>> 	Parent sees throughput for 12 rewriters 	= 4404280.87 kB/sec
>>>>>>> 	Min throughput per process 			=  364553.88 kB/sec
>>>>>>> 	Max throughput per process 			=  370029.28 kB/sec
>>>>>>> 	Avg throughput per process 			=  367381.39 kB/sec
>>>>>>> 	Min xfer 					= 1033216.00 kB
>>>>>>> 	CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
>>>>>>> 	Children see throughput for 12 readers 		= 5406879.47 kB/sec
>>>>>>> 	Parent sees throughput for 12 readers 		= 5401862.78 kB/sec
>>>>>>> 	Min throughput per process 			=  449583.03 kB/sec
>>>>>>> 	Max throughput per process 			=  451761.69 kB/sec
>>>>>>> 	Avg throughput per process 			=  450573.29 kB/sec
>>>>>>> 	Min xfer 					= 1044224.00 kB
>>>>>>> 	CPU utilization: Wall time    2.323    CPU time    1.977    CPU utilization  85.12 %
>>>>>>> 	Children see throughput for 12 re-readers 	= 5410601.12 kB/sec
>>>>>>> 	Parent sees throughput for 12 re-readers 	= 5403504.40 kB/sec
>>>>>>> 	Min throughput per process 			=  449918.12 kB/sec
>>>>>>> 	Max throughput per process 			=  452489.28 kB/sec
>>>>>>> 	Avg throughput per process 			=  450883.43 kB/sec
>>>>>>> 	Min xfer 					= 1043456.00 kB
>>>>>>> 	CPU utilization: Wall time    2.321    CPU time    1.978    CPU utilization  85.21 %
>>>>>>>
>>>>>>> And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
>>>>>>> the iommu ops"). Significant throughput loss.
>>>>>>>
>>>>>>> 	Children see throughput for 12 initial writers 	= 3812036.91 kB/sec
>>>>>>> 	Parent sees throughput for 12 initial writers 	= 3753683.40 kB/sec
>>>>>>> 	Min throughput per process 			=  313672.25 kB/sec
>>>>>>> 	Max throughput per process 			=  321719.44 kB/sec
>>>>>>> 	Avg throughput per process 			=  317669.74 kB/sec
>>>>>>> 	Min xfer 					= 1022464.00 kB
>>>>>>> 	CPU Utilization: Wall time    3.309    CPU time    1.986    CPU utilization  60.02 %
>>>>>>> 	Children see throughput for 12 rewriters 	= 3786831.94 kB/sec
>>>>>>> 	Parent sees throughput for 12 rewriters 	= 3783205.58 kB/sec
>>>>>>> 	Min throughput per process 			=  313654.44 kB/sec
>>>>>>> 	Max throughput per process 			=  317844.50 kB/sec
>>>>>>> 	Avg throughput per process 			=  315569.33 kB/sec
>>>>>>> 	Min xfer 					= 1035520.00 kB
>>>>>>> 	CPU utilization: Wall time    3.302    CPU time    1.945    CPU utilization  58.90 %
>>>>>>> 	Children see throughput for 12 readers 		= 4265828.28 kB/sec
>>>>>>> 	Parent sees throughput for 12 readers 		= 4261844.88 kB/sec
>>>>>>> 	Min throughput per process 			=  352305.00 kB/sec
>>>>>>> 	Max throughput per process 			=  357726.22 kB/sec
>>>>>>> 	Avg throughput per process 			=  355485.69 kB/sec
>>>>>>> 	Min xfer 					= 1032960.00 kB
>>>>>>> 	CPU utilization: Wall time    2.934    CPU time    1.942    CPU utilization  66.20 %
>>>>>>> 	Children see throughput for 12 re-readers 	= 4220651.19 kB/sec
>>>>>>> 	Parent sees throughput for 12 re-readers 	= 4216096.04 kB/sec
>>>>>>> 	Min throughput per process 			=  348677.16 kB/sec
>>>>>>> 	Max throughput per process 			=  353467.44 kB/sec
>>>>>>> 	Avg throughput per process 			=  351720.93 kB/sec
>>>>>>> 	Min xfer 					= 1035264.00 kB
>>>>>>> 	CPU utilization: Wall time    2.969    CPU time    1.952    CPU utilization  65.74 %
>>>>>>>
>>>>>>> The regression appears to be 100% reproducible.
>>>>> Any thoughts?
>>>>> How about some tools to try or debugging advice? I don't know where to start.
>>>>
>>>> I'm not familiar enough with VT-D internals or Infiniband to have a clue why the middle commit makes any difference (the calculation itself is not on a fast path, so AFAICS the worst it could do is change your maximum DMA address size from 48/57 bits to 47/56, and that seems relatively benign).
>>>>
>>>> With the last commit, though, at least part of it is likely to be the unfortunate inevitable overhead of the internal indirection through the IOMMU API. There's a coincidental performance-related thread where we've already started pondering some ideas in that area[1] (note that Intel is the last one to the party here; AMD has been using this path for a while, and it's all that arm64 systems have ever known). I'm not sure if there's any difference in the strict invalidation behaviour between the IOMMU API calls and the old intel_dma_ops, but I suppose that might be worth quickly double-checking as well. I guess the main thing would be to do some profiling to see where time is being spent in iommu-dma and intel-iommu vs. just different parts of intel-iommu before, and whether anything in particular stands out beyond the extra call overhead currently incurred by iommu_{map,unmap}.
>>> I did a function_graph trace of the above iozone test on a v5.10 NFS
>>> client and again on v5.11-rc. There is a substantial timing difference
>>> in dma_map_sg_attrs. Each excerpt below is for DMA-mapping a 120KB set
>>> of pages that are part of an NFS/RDMA WRITE operation.
>>> v5.10:
>>> 1072.028308: funcgraph_entry:                   |  dma_map_sg_attrs() {
>>> 1072.028308: funcgraph_entry:                   |    intel_map_sg() {
>>> 1072.028309: funcgraph_entry:                   |      find_domain() {
>>> 1072.028309: funcgraph_entry:        0.280 us   |        get_domain_info();
>>> 1072.028310: funcgraph_exit:         0.930 us   |      }
>>> 1072.028310: funcgraph_entry:        0.360 us   |      domain_get_iommu();
>>> 1072.028311: funcgraph_entry:                   |      intel_alloc_iova() {
>>> 1072.028311: funcgraph_entry:                   |        alloc_iova_fast() {
>>> 1072.028311: funcgraph_entry:        0.375 us   |          _raw_spin_lock_irqsave();
>>> 1072.028312: funcgraph_entry:        0.285 us   |          __lock_text_start();
>>> 1072.028313: funcgraph_exit:         1.500 us   |        }
>>> 1072.028313: funcgraph_exit:         2.052 us   |      }
>>> 1072.028313: funcgraph_entry:                   |      domain_mapping() {
>>> 1072.028313: funcgraph_entry:                   |        __domain_mapping() {
>>> 1072.028314: funcgraph_entry:        0.350 us   |          pfn_to_dma_pte();
>>> 1072.028315: funcgraph_entry:        0.942 us   |          domain_flush_cache();
>>> 1072.028316: funcgraph_exit:         2.852 us   |        }
>>> 1072.028316: funcgraph_entry:        0.275 us   |        iommu_flush_write_buffer();
>>> 1072.028317: funcgraph_exit:         4.213 us   |      }
>>> 1072.028318: funcgraph_exit:         9.392 us   |    }
>>> 1072.028318: funcgraph_exit:       + 10.073 us  |  }
>>> 1072.028323: xprtrdma_mr_map:      mr.id=115 nents=30 122880@0xe476ca03f1180000:0x18011105 (TO_DEVICE)
>>> 1072.028323: xprtrdma_chunk_read:  task:63879@5 pos=148 122880@0xe476ca03f1180000:0x18011105 (more)
>>> v5.11-rc:
>>> 57.602990: funcgraph_entry:                   |  dma_map_sg_attrs() {
>>> 57.602990: funcgraph_entry:                   |    iommu_dma_map_sg() {
>>> 57.602990: funcgraph_entry:        0.285 us   |      iommu_get_dma_domain();
>>> 57.602991: funcgraph_entry:        0.270 us   |      iommu_dma_deferred_attach();
>>> 57.602991: funcgraph_entry:                   |      iommu_dma_sync_sg_for_device() {
>>> 57.602992: funcgraph_entry:        0.268 us   |        dev_is_untrusted();
>>> 57.602992: funcgraph_exit:         0.815 us   |      }
>>> 57.602993: funcgraph_entry:        0.267 us   |      dev_is_untrusted();
>>> 57.602993: funcgraph_entry:                   |      iommu_dma_alloc_iova() {
>>> 57.602994: funcgraph_entry:                   |        alloc_iova_fast() {
>>> 57.602994: funcgraph_entry:        0.260 us   |          _raw_spin_lock_irqsave();
>>> 57.602995: funcgraph_entry:        0.293 us   |          _raw_spin_lock();
>>> 57.602995: funcgraph_entry:        0.273 us   |          _raw_spin_unlock_irqrestore();
>>> 57.602996: funcgraph_entry:        1.147 us   |          alloc_iova();
>>> 57.602997: funcgraph_exit:         3.370 us   |        }
>>> 57.602997: funcgraph_exit:         3.945 us   |      }
>>> 57.602998: funcgraph_entry:        0.272 us   |      dma_info_to_prot();
>>> 57.602998: funcgraph_entry:                   |      iommu_map_sg_atomic() {
>>> 57.602998: funcgraph_entry:                   |        __iommu_map_sg() {
>>> 57.602999: funcgraph_entry:        1.733 us   |          __iommu_map();
>>> 57.603001: funcgraph_entry:        1.642 us   |          __iommu_map();
>>> 57.603003: funcgraph_entry:        1.638 us   |          __iommu_map();
>>> 57.603005: funcgraph_entry:        1.645 us   |          __iommu_map();
>>> 57.603007: funcgraph_entry:        1.630 us   |          __iommu_map();
>>> 57.603009: funcgraph_entry:        1.770 us   |          __iommu_map();
>>> 57.603011: funcgraph_entry:        1.730 us   |          __iommu_map();
>>> 57.603013: funcgraph_entry:        1.633 us   |          __iommu_map();
>>> 57.603015: funcgraph_entry:        1.605 us   |          __iommu_map();
>>> 57.603017: funcgraph_entry:        2.847 us   |          __iommu_map();
>>> 57.603020: funcgraph_entry:        2.847 us   |          __iommu_map();
>>> 57.603024: funcgraph_entry:        2.955 us   |          __iommu_map();
>>> 57.603027: funcgraph_entry:        2.928 us   |          __iommu_map();
>>> 57.603030: funcgraph_entry:        2.933 us   |          __iommu_map();
>>> 57.603034: funcgraph_entry:        2.943 us   |          __iommu_map();
>>> 57.603037: funcgraph_entry:        2.928 us   |          __iommu_map();
>>> 57.603040: funcgraph_entry:        2.857 us   |          __iommu_map();
>>> 57.603044: funcgraph_entry:        2.953 us   |          __iommu_map();
>>> 57.603047: funcgraph_entry:        3.023 us   |          __iommu_map();
>>> 57.603050: funcgraph_entry:        1.645 us   |          __iommu_map();
>>> 57.603052: funcgraph_exit:       + 53.648 us  |        }
>>> 57.603052: funcgraph_exit:       + 54.178 us  |      }
>>> 57.603053: funcgraph_exit:       + 62.953 us  |    }
>>> 57.603053: funcgraph_exit:       + 63.567 us  |  }
>>> 57.603059: xprtrdma_mr_map:      task:60@5 mr.id=4 nents=30 122880@0xd79cc0e2f18c0000:0x00010501 (TO_DEVICE)
>>> 57.603060: xprtrdma_chunk_read:  task:60@5 pos=148 122880@0xd79cc0e2f18c0000:0x00010501 (more)
>>
>> I kind of believe it's due to the indirect calls. This is also reported
>> on ARM.
>>
>> https://lore.kernel.org/linux-iommu/1610376862-927-1-git-send-email-isaacm@codeaurora.org/
>>
>> Maybe we can try changing indirect calls to static ones to verify this
>> problem.
> 
> I liked the idea of map_sg() enough to try my hand at building a PoC for
> Intel, based on Isaac's patch series. It's just a cut-and-paste of the
> generic iommu.c code with the indirect calls to ops->map() replaced.
> 
> The indirect calls do not seem to be the problem. Calling intel_iommu_map
> directly appears to be as costly as calling it indirectly.
> 
> However, perhaps there are other ways map_sg() can be beneficial. In
> v5.10, __domain_mapping and iommu_flush_write_buffer() appear to be
> invoked just once for each large map operation, for example.

Oh, if the driver needs to do maintenance beyond just installing PTEs, 
that should probably be devolved to iotlb_sync_map anyway. There's a 
patch series here generalising that to be more useful, which is 
hopefully just waiting to be merged now:

https://lore.kernel.org/linux-iommu/20210107122909.16317-1-yong.wu@mediatek.com/

Robin.
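
To make the quoted PoC concrete: the batching Chuck describes above boils
down to a driver-level map_sg that walks the scatterlist, installs PTEs for
each segment, and performs the expensive maintenance once at the end. A
rough sketch under those assumptions (the example_* helpers are
placeholders, the signature is illustrative rather than the one from
Isaac's series, and real code would also need alignment handling and error
unwinding):

/* Sketch only; not intel-iommu code. */
static size_t example_map_sg(struct iommu_domain *domain, unsigned long iova,
			     struct scatterlist *sglist, unsigned int nents,
			     int prot)
{
	struct scatterlist *sg;
	size_t mapped = 0;
	unsigned int i;

	for_each_sg(sglist, sg, nents, i) {
		if (example_install_ptes(domain, iova + mapped,
					 sg_phys(sg), sg->length, prot))
			break;
		mapped += sg->length;
	}

	/* One flush for the whole list instead of one per segment. */
	example_flush(domain);
	return mapped;
}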

> Here's a trace of my prototype in operation:
> 
> 380.620150: funcgraph_entry:                   |  iommu_dma_map_sg() {
> 380.620150: funcgraph_entry:        0.285 us   |    iommu_get_dma_domain();
> 380.620150: funcgraph_entry:        0.265 us   |    iommu_dma_deferred_attach();
> 380.620151: funcgraph_entry:                   |    iommu_dma_sync_sg_for_device() {
> 380.620151: funcgraph_entry:        0.285 us   |      dev_is_untrusted();
> 380.620152: funcgraph_exit:         0.860 us   |    }
> 380.620152: funcgraph_entry:        0.263 us   |    dev_is_untrusted();
> 380.620153: funcgraph_entry:                   |    iommu_dma_alloc_iova() {
> 380.620153: funcgraph_entry:                   |      alloc_iova_fast() {
> 380.620153: funcgraph_entry:        0.268 us   |        _raw_spin_lock_irqsave();
> 380.620154: funcgraph_entry:        0.275 us   |        _raw_spin_unlock_irqrestore();
> 380.620155: funcgraph_exit:         1.402 us   |      }
> 380.620155: funcgraph_exit:         1.955 us   |    }
> 380.620155: funcgraph_entry:        0.265 us   |    dma_info_to_prot();
> 380.620156: funcgraph_entry:                   |    iommu_map_sg_atomic() {
> 380.620156: funcgraph_entry:                   |      __iommu_map_sg() {
> 380.620156: funcgraph_entry:                   |        intel_iommu_map_sg() {
> 380.620157: funcgraph_entry:        0.270 us   |          iommu_pgsize();
> 380.620157: funcgraph_entry:                   |          intel_iommu_map() {
> 380.620157: funcgraph_entry:        0.970 us   |            __domain_mapping();
> 380.620159: funcgraph_entry:        0.265 us   |            iommu_flush_write_buffer();
> 380.620159: funcgraph_exit:         2.322 us   |          }
> 380.620160: funcgraph_entry:        0.270 us   |          iommu_pgsize();
> 380.620160: funcgraph_entry:                   |          intel_iommu_map() {
> 380.620161: funcgraph_entry:        0.957 us   |            __domain_mapping();
> 380.620162: funcgraph_entry:        0.275 us   |            iommu_flush_write_buffer();
> 380.620163: funcgraph_exit:         2.315 us   |          }
> 380.620163: funcgraph_entry:        0.265 us   |          iommu_pgsize();
> 380.620163: funcgraph_entry:                   |          intel_iommu_map() {
> 380.620164: funcgraph_entry:        0.940 us   |            __domain_mapping();
> 380.620165: funcgraph_entry:        0.270 us   |            iommu_flush_write_buffer();
> 380.620166: funcgraph_exit:         2.295 us   |          }
> 
>   ....
> 
> 380.620247: funcgraph_entry:        0.262 us   |          iommu_pgsize();
> 380.620248: funcgraph_entry:                   |          intel_iommu_map() {
> 380.620248: funcgraph_entry:        0.935 us   |            __domain_mapping();
> 380.620249: funcgraph_entry:        0.305 us   |            iommu_flush_write_buffer();
> 380.620250: funcgraph_exit:         2.315 us   |          }
> 380.620250: funcgraph_entry:        0.273 us   |          iommu_pgsize();
> 380.620251: funcgraph_entry:                   |          intel_iommu_map() {
> 380.620251: funcgraph_entry:        0.967 us   |            __domain_mapping();
> 380.620253: funcgraph_entry:        0.265 us   |            iommu_flush_write_buffer();
> 380.620253: funcgraph_exit:         2.310 us   |          }
> 380.620254: funcgraph_exit:       + 97.388 us  |        }
> 380.620254: funcgraph_exit:       + 97.960 us  |      }
> 380.620254: funcgraph_exit:       + 98.482 us  |    }
> 380.620255: funcgraph_exit:       ! 105.175 us |  }
> 380.620260: xprtrdma_mr_map:      task:1607@5 mr.id=126 nents=30 122880@0xf06ee5bbf1920000:0x70011104 (TO_DEVICE)
> 380.620261: xprtrdma_chunk_read:  task:1607@5 pos=148 122880@0xf06ee5bbf1920000:0x70011104 (more)
> 
> 
> --
> Chuck Lever
> 
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: performance regression noted in v5.11-rc after c062db039f40
  2021-01-22 17:38               ` Robin Murphy
@ 2021-01-22 18:38                 ` Chuck Lever
  -1 siblings, 0 replies; 36+ messages in thread
From: Chuck Lever @ 2021-01-22 18:38 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Lu Baolu, iommu, Will Deacon, linux-rdma, logang,
	Christoph Hellwig, murphyt7, isaacm



> On Jan 22, 2021, at 12:38 PM, Robin Murphy <robin.murphy@arm.com> wrote:
> 
> On 2021-01-22 16:18, Chuck Lever wrote:
>>> On Jan 21, 2021, at 10:00 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>>> 
>>> +Isaac
>>> 
>>> On 1/22/21 3:09 AM, Chuck Lever wrote:
>>>>> On Jan 18, 2021, at 1:00 PM, Robin Murphy <robin.murphy@arm.com> wrote:
>>>>> 
>>>>> On 2021-01-18 16:18, Chuck Lever wrote:
>>>>>>> On Jan 12, 2021, at 9:38 AM, Will Deacon <will@kernel.org> wrote:
>>>>>>> 
>>>>>>> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
>>>>>>> 
>>>>>>> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
>>>>>>>> Hi-
>>>>>>>> 
>>>>>>>> [ Please cc: me on replies, I'm not currently subscribed to
>>>>>>>> iommu@lists ].
>>>>>>>> 
>>>>>>>> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
>>>>>>>> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
>>>>>>>> 
>>>>>>>> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
>>>>>>>> 
>>>>>>>> For those not familiar with the way storage protocols use RDMA, The
>>>>>>>> initiator/client sets up memory regions and the target/server uses
>>>>>>>> RDMA Read and Write to move data out of and into those regions. The
>>>>>>>> initiator/client uses only RDMA memory registration and invalidation
>>>>>>>> operations, and the target/server uses RDMA Read and Write.
>>>>>>>> 
>>>>>>>> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
>>>>>>>> enabled using the kernel command line options "intel_iommu=on
>>>>>>>> iommu=strict".
>>>>>>>> 
>>>>>>>> Recently I've noticed a significant (25-30%) loss in NFS throughput.
>>>>>>>> I was able to bisect on my client to the following commits.
>>>>>>>> 
>>>>>>>> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>>>>>>>> map_sg"). This is about normal for this test.
>>>>>>>> 
>>>>>>>> 	Children see throughput for 12 initial writers 	= 4732581.09 kB/sec
>>>>>>>> 	Parent sees throughput for 12 initial writers 	= 4646810.21 kB/sec
>>>>>>>> 	Min throughput per process 			=  387764.34 kB/sec
>>>>>>>> 	Max throughput per process 			=  399655.47 kB/sec
>>>>>>>> 	Avg throughput per process 			=  394381.76 kB/sec
>>>>>>>> 	Min xfer 					= 1017344.00 kB
>>>>>>>> 	CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
>>>>>>>> 	Children see throughput for 12 rewriters 	= 4837741.94 kB/sec
>>>>>>>> 	Parent sees throughput for 12 rewriters 	= 4833509.35 kB/sec
>>>>>>>> 	Min throughput per process 			=  398983.72 kB/sec
>>>>>>>> 	Max throughput per process 			=  406199.66 kB/sec
>>>>>>>> 	Avg throughput per process 			=  403145.16 kB/sec
>>>>>>>> 	Min xfer 					= 1030656.00 kB
>>>>>>>> 	CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
>>>>>>>> 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>>>>>>>> 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>>>>>>>> 	Min throughput per process 			=  491812.38 kB/sec
>>>>>>>> 	Max throughput per process 			=  494777.28 kB/sec
>>>>>>>> 	Avg throughput per process 			=  493447.58 kB/sec
>>>>>>>> 	Min xfer 					= 1042688.00 kB
>>>>>>>> 	CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
>>>>>>>> 	Children see throughput for 12 re-readers 	= 5947985.69 kB/sec
>>>>>>>> 	Parent sees throughput for 12 re-readers 	= 5941348.51 kB/sec
>>>>>>>> 	Min throughput per process 			=  492805.81 kB/sec
>>>>>>>> 	Max throughput per process 			=  497280.19 kB/sec
>>>>>>>> 	Avg throughput per process 			=  495665.47 kB/sec
>>>>>>>> 	Min xfer 					= 1039360.00 kB
>>>>>>>> 	CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %
>>>>>>>> 
>>>>>>>> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
>>>>>>>> iommu_ops.at(de)tach_dev"). It's losing some steam here.
>>>>>>>> 
>>>>>>>> 	Children see throughput for 12 initial writers 	= 4342419.12 kB/sec
>>>>>>>> 	Parent sees throughput for 12 initial writers 	= 4310612.79 kB/sec
>>>>>>>> 	Min throughput per process 			=  359299.06 kB/sec
>>>>>>>> 	Max throughput per process 			=  363866.16 kB/sec
>>>>>>>> 	Avg throughput per process 			=  361868.26 kB/sec
>>>>>>>> 	Min xfer 					= 1035520.00 kB
>>>>>>>> 	CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
>>>>>>>> 	Children see throughput for 12 rewriters 	= 4408576.66 kB/sec
>>>>>>>> 	Parent sees throughput for 12 rewriters 	= 4404280.87 kB/sec
>>>>>>>> 	Min throughput per process 			=  364553.88 kB/sec
>>>>>>>> 	Max throughput per process 			=  370029.28 kB/sec
>>>>>>>> 	Avg throughput per process 			=  367381.39 kB/sec
>>>>>>>> 	Min xfer 					= 1033216.00 kB
>>>>>>>> 	CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
>>>>>>>> 	Children see throughput for 12 readers 		= 5406879.47 kB/sec
>>>>>>>> 	Parent sees throughput for 12 readers 		= 5401862.78 kB/sec
>>>>>>>> 	Min throughput per process 			=  449583.03 kB/sec
>>>>>>>> 	Max throughput per process 			=  451761.69 kB/sec
>>>>>>>> 	Avg throughput per process 			=  450573.29 kB/sec
>>>>>>>> 	Min xfer 					= 1044224.00 kB
>>>>>>>> 	CPU utilization: Wall time    2.323    CPU time    1.977    CPU utilization  85.12 %
>>>>>>>> 	Children see throughput for 12 re-readers 	= 5410601.12 kB/sec
>>>>>>>> 	Parent sees throughput for 12 re-readers 	= 5403504.40 kB/sec
>>>>>>>> 	Min throughput per process 			=  449918.12 kB/sec
>>>>>>>> 	Max throughput per process 			=  452489.28 kB/sec
>>>>>>>> 	Avg throughput per process 			=  450883.43 kB/sec
>>>>>>>> 	Min xfer 					= 1043456.00 kB
>>>>>>>> 	CPU utilization: Wall time    2.321    CPU time    1.978    CPU utilization  85.21 %
>>>>>>>> 
>>>>>>>> And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
>>>>>>>> the iommu ops"). Significant throughput loss.
>>>>>>>> 
>>>>>>>> 	Children see throughput for 12 initial writers 	= 3812036.91 kB/sec
>>>>>>>> 	Parent sees throughput for 12 initial writers 	= 3753683.40 kB/sec
>>>>>>>> 	Min throughput per process 			=  313672.25 kB/sec
>>>>>>>> 	Max throughput per process 			=  321719.44 kB/sec
>>>>>>>> 	Avg throughput per process 			=  317669.74 kB/sec
>>>>>>>> 	Min xfer 					= 1022464.00 kB
>>>>>>>> 	CPU Utilization: Wall time    3.309    CPU time    1.986    CPU utilization  60.02 %
>>>>>>>> 	Children see throughput for 12 rewriters 	= 3786831.94 kB/sec
>>>>>>>> 	Parent sees throughput for 12 rewriters 	= 3783205.58 kB/sec
>>>>>>>> 	Min throughput per process 			=  313654.44 kB/sec
>>>>>>>> 	Max throughput per process 			=  317844.50 kB/sec
>>>>>>>> 	Avg throughput per process 			=  315569.33 kB/sec
>>>>>>>> 	Min xfer 					= 1035520.00 kB
>>>>>>>> 	CPU utilization: Wall time    3.302    CPU time    1.945    CPU utilization  58.90 %
>>>>>>>> 	Children see throughput for 12 readers 		= 4265828.28 kB/sec
>>>>>>>> 	Parent sees throughput for 12 readers 		= 4261844.88 kB/sec
>>>>>>>> 	Min throughput per process 			=  352305.00 kB/sec
>>>>>>>> 	Max throughput per process 			=  357726.22 kB/sec
>>>>>>>> 	Avg throughput per process 			=  355485.69 kB/sec
>>>>>>>> 	Min xfer 					= 1032960.00 kB
>>>>>>>> 	CPU utilization: Wall time    2.934    CPU time    1.942    CPU utilization  66.20 %
>>>>>>>> 	Children see throughput for 12 re-readers 	= 4220651.19 kB/sec
>>>>>>>> 	Parent sees throughput for 12 re-readers 	= 4216096.04 kB/sec
>>>>>>>> 	Min throughput per process 			=  348677.16 kB/sec
>>>>>>>> 	Max throughput per process 			=  353467.44 kB/sec
>>>>>>>> 	Avg throughput per process 			=  351720.93 kB/sec
>>>>>>>> 	Min xfer 					= 1035264.00 kB
>>>>>>>> 	CPU utilization: Wall time    2.969    CPU time    1.952    CPU utilization  65.74 %
>>>>>>>> 
>>>>>>>> The regression appears to be 100% reproducible.
>>>>>> Any thoughts?
>>>>>> How about some tools to try or debugging advice? I don't know where to start.
>>>>> 
>>>>> I'm not familiar enough with VT-D internals or Infiniband to have a clue why the middle commit makes any difference (the calculation itself is not on a fast path, so AFAICS the worst it could do is change your maximum DMA address size from 48/57 bits to 47/56, and that seems relatively benign).
>>>>> 
>>>>> With the last commit, though, at least part of it is likely to be the unfortunate inevitable overhead of the internal indirection through the IOMMU API. There's a coincidental performance-related thread where we've already started pondering some ideas in that area[1] (note that Intel is the last one to the party here; AMD has been using this path for a while, and it's all that arm64 systems have ever known). I'm not sure if there's any difference in the strict invalidation behaviour between the IOMMU API calls and the old intel_dma_ops, but I suppose that might be worth quickly double-checking as well. I guess the main thing would be to do some profiling to see where time is being spent in iommu-dma and intel-iommu vs. just different parts of intel-iommu before, and whether anything in particular stands out beyond the extra call overhead currently incurred by iommu_{map,unmap}.
>>>> I did a function_graph trace of the above iozone test on a v5.10 NFS
>>>> client and again on v5.11-rc. There is a substantial timing difference
>>>> in dma_map_sg_attrs. Each excerpt below is for DMA-mapping a 120KB set
>>>> of pages that are part of an NFS/RDMA WRITE operation.
>>>> v5.10:
>>>> 1072.028308: funcgraph_entry:                   |  dma_map_sg_attrs() {
>>>> 1072.028308: funcgraph_entry:                   |    intel_map_sg() {
>>>> 1072.028309: funcgraph_entry:                   |      find_domain() {
>>>> 1072.028309: funcgraph_entry:        0.280 us   |        get_domain_info();
>>>> 1072.028310: funcgraph_exit:         0.930 us   |      }
>>>> 1072.028310: funcgraph_entry:        0.360 us   |      domain_get_iommu();
>>>> 1072.028311: funcgraph_entry:                   |      intel_alloc_iova() {
>>>> 1072.028311: funcgraph_entry:                   |        alloc_iova_fast() {
>>>> 1072.028311: funcgraph_entry:        0.375 us   |          _raw_spin_lock_irqsave();
>>>> 1072.028312: funcgraph_entry:        0.285 us   |          __lock_text_start();
>>>> 1072.028313: funcgraph_exit:         1.500 us   |        }
>>>> 1072.028313: funcgraph_exit:         2.052 us   |      }
>>>> 1072.028313: funcgraph_entry:                   |      domain_mapping() {
>>>> 1072.028313: funcgraph_entry:                   |        __domain_mapping() {
>>>> 1072.028314: funcgraph_entry:        0.350 us   |          pfn_to_dma_pte();
>>>> 1072.028315: funcgraph_entry:        0.942 us   |          domain_flush_cache();
>>>> 1072.028316: funcgraph_exit:         2.852 us   |        }
>>>> 1072.028316: funcgraph_entry:        0.275 us   |        iommu_flush_write_buffer();
>>>> 1072.028317: funcgraph_exit:         4.213 us   |      }
>>>> 1072.028318: funcgraph_exit:         9.392 us   |    }
>>>> 1072.028318: funcgraph_exit:       + 10.073 us  |  }
>>>> 1072.028323: xprtrdma_mr_map:      mr.id=115 nents=30 122880@0xe476ca03f1180000:0x18011105 (TO_DEVICE)
>>>> 1072.028323: xprtrdma_chunk_read:  task:63879@5 pos=148 122880@0xe476ca03f1180000:0x18011105 (more)
>>>> v5.11-rc:
>>>> 57.602990: funcgraph_entry:                   |  dma_map_sg_attrs() {
>>>> 57.602990: funcgraph_entry:                   |    iommu_dma_map_sg() {
>>>> 57.602990: funcgraph_entry:        0.285 us   |      iommu_get_dma_domain();
>>>> 57.602991: funcgraph_entry:        0.270 us   |      iommu_dma_deferred_attach();
>>>> 57.602991: funcgraph_entry:                   |      iommu_dma_sync_sg_for_device() {
>>>> 57.602992: funcgraph_entry:        0.268 us   |        dev_is_untrusted();
>>>> 57.602992: funcgraph_exit:         0.815 us   |      }
>>>> 57.602993: funcgraph_entry:        0.267 us   |      dev_is_untrusted();
>>>> 57.602993: funcgraph_entry:                   |      iommu_dma_alloc_iova() {
>>>> 57.602994: funcgraph_entry:                   |        alloc_iova_fast() {
>>>> 57.602994: funcgraph_entry:        0.260 us   |          _raw_spin_lock_irqsave();
>>>> 57.602995: funcgraph_entry:        0.293 us   |          _raw_spin_lock();
>>>> 57.602995: funcgraph_entry:        0.273 us   |          _raw_spin_unlock_irqrestore();
>>>> 57.602996: funcgraph_entry:        1.147 us   |          alloc_iova();
>>>> 57.602997: funcgraph_exit:         3.370 us   |        }
>>>> 57.602997: funcgraph_exit:         3.945 us   |      }
>>>> 57.602998: funcgraph_entry:        0.272 us   |      dma_info_to_prot();
>>>> 57.602998: funcgraph_entry:                   |      iommu_map_sg_atomic() {
>>>> 57.602998: funcgraph_entry:                   |        __iommu_map_sg() {
>>>> 57.602999: funcgraph_entry:        1.733 us   |          __iommu_map();
>>>> 57.603001: funcgraph_entry:        1.642 us   |          __iommu_map();
>>>> 57.603003: funcgraph_entry:        1.638 us   |          __iommu_map();
>>>> 57.603005: funcgraph_entry:        1.645 us   |          __iommu_map();
>>>> 57.603007: funcgraph_entry:        1.630 us   |          __iommu_map();
>>>> 57.603009: funcgraph_entry:        1.770 us   |          __iommu_map();
>>>> 57.603011: funcgraph_entry:        1.730 us   |          __iommu_map();
>>>> 57.603013: funcgraph_entry:        1.633 us   |          __iommu_map();
>>>> 57.603015: funcgraph_entry:        1.605 us   |          __iommu_map();
>>>> 57.603017: funcgraph_entry:        2.847 us   |          __iommu_map();
>>>> 57.603020: funcgraph_entry:        2.847 us   |          __iommu_map();
>>>> 57.603024: funcgraph_entry:        2.955 us   |          __iommu_map();
>>>> 57.603027: funcgraph_entry:        2.928 us   |          __iommu_map();
>>>> 57.603030: funcgraph_entry:        2.933 us   |          __iommu_map();
>>>> 57.603034: funcgraph_entry:        2.943 us   |          __iommu_map();
>>>> 57.603037: funcgraph_entry:        2.928 us   |          __iommu_map();
>>>> 57.603040: funcgraph_entry:        2.857 us   |          __iommu_map();
>>>> 57.603044: funcgraph_entry:        2.953 us   |          __iommu_map();
>>>> 57.603047: funcgraph_entry:        3.023 us   |          __iommu_map();
>>>> 57.603050: funcgraph_entry:        1.645 us   |          __iommu_map();
>>>> 57.603052: funcgraph_exit:       + 53.648 us  |        }
>>>> 57.603052: funcgraph_exit:       + 54.178 us  |      }
>>>> 57.603053: funcgraph_exit:       + 62.953 us  |    }
>>>> 57.603053: funcgraph_exit:       + 63.567 us  |  }
>>>> 57.603059: xprtrdma_mr_map:      task:60@5 mr.id=4 nents=30 122880@0xd79cc0e2f18c0000:0x00010501 (TO_DEVICE)
>>>> 57.603060: xprtrdma_chunk_read:  task:60@5 pos=148 122880@0xd79cc0e2f18c0000:0x00010501 (more)
>>> 
>>> I kind of believe it's due to the indirect calls. This is also reported
>>> on ARM.
>>> 
>>> https://lore.kernel.org/linux-iommu/1610376862-927-1-git-send-email-isaacm@codeaurora.org/
>>> 
>>> Maybe we can try changing indirect calls to static ones to verify this
>>> problem.
>> I liked the idea of map_sg() enough to try my hand at building a PoC for
>> Intel, based on Isaac's patch series. It's just a cut-and-paste of the
>> generic iommu.c code with the indirect calls to ops->map() replaced.
>> The indirect calls do not seem to be the problem. Calling intel_iommu_map
>> directly appears to be as costly as calling it indirectly.
>> However, perhaps there are other ways map_sg() can be beneficial. In
>> v5.10, __domain_mapping and iommu_flush_write_buffer() appear to be
>> invoked just once for each large map operation, for example.
> 
> Oh, if the driver needs to do maintenance beyond just installing PTEs, that should probably be devolved to iotlb_sync_map anyway.

My naive observation is that the expensive part of intel_iommu_map() seems
to be clflush_cache_range().
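
(For context: that flush is only issued when the IOMMU is not cache
coherent; the v5.10-era helper in drivers/iommu/intel/iommu.c is
approximately the following, so splitting one large mapping into ~20
__iommu_map() calls also multiplies the number of flushes of the
just-written PTE lines.)

static void domain_flush_cache(struct dmar_domain *domain,
			       void *addr, int size)
{
	if (!domain->iommu_coherency)
		clflush_cache_range(addr, size);
}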


> There's a patch series here generalising that to be more useful, which is hopefully just waiting to be merged now:
> 
> https://lore.kernel.org/linux-iommu/20210107122909.16317-1-yong.wu@mediatek.com/

The Intel IOMMU driver would have to grow an iotlb_sync_map callback,
if that's an appropriate place to handle a clflush.
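
One possible shape for that callback, assuming the extended
iotlb_sync_map(domain, iova, size) signature proposed in the series above;
intel_flush_pte_range() is a made-up placeholder for whatever would walk
and clflush the page-table lines covering the range:

/* Hypothetical sketch, not a tested patch. */
static void intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
				       unsigned long iova, size_t size)
{
	struct dmar_domain *dmar_domain = to_dmar_domain(domain);

	if (dmar_domain->iommu_coherency)
		return;

	/* Flush the CPU cache lines covering the PTEs for [iova, iova+size). */
	intel_flush_pte_range(dmar_domain, iova, size);	/* placeholder */
}

plus a matching .iotlb_sync_map = intel_iommu_iotlb_sync_map entry in
intel_iommu_ops.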

My concern is that none of these deeper changes seem appropriate for
5.11-rc. What is to be done to address the rather noticeable
regression in performance before v5.11 final?


> Robin.
> 
>> Here's a trace of my prototype in operation:
>> 380.620150: funcgraph_entry:                   |  iommu_dma_map_sg() {
>> 380.620150: funcgraph_entry:        0.285 us   |    iommu_get_dma_domain();
>> 380.620150: funcgraph_entry:        0.265 us   |    iommu_dma_deferred_attach();
>> 380.620151: funcgraph_entry:                   |    iommu_dma_sync_sg_for_device() {
>> 380.620151: funcgraph_entry:        0.285 us   |      dev_is_untrusted();
>> 380.620152: funcgraph_exit:         0.860 us   |    }
>> 380.620152: funcgraph_entry:        0.263 us   |    dev_is_untrusted();
>> 380.620153: funcgraph_entry:                   |    iommu_dma_alloc_iova() {
>> 380.620153: funcgraph_entry:                   |      alloc_iova_fast() {
>> 380.620153: funcgraph_entry:        0.268 us   |        _raw_spin_lock_irqsave();
>> 380.620154: funcgraph_entry:        0.275 us   |        _raw_spin_unlock_irqrestore();
>> 380.620155: funcgraph_exit:         1.402 us   |      }
>> 380.620155: funcgraph_exit:         1.955 us   |    }
>> 380.620155: funcgraph_entry:        0.265 us   |    dma_info_to_prot();
>> 380.620156: funcgraph_entry:                   |    iommu_map_sg_atomic() {
>> 380.620156: funcgraph_entry:                   |      __iommu_map_sg() {
>> 380.620156: funcgraph_entry:                   |        intel_iommu_map_sg() {
>> 380.620157: funcgraph_entry:        0.270 us   |          iommu_pgsize();
>> 380.620157: funcgraph_entry:                   |          intel_iommu_map() {
>> 380.620157: funcgraph_entry:        0.970 us   |            __domain_mapping();
>> 380.620159: funcgraph_entry:        0.265 us   |            iommu_flush_write_buffer();
>> 380.620159: funcgraph_exit:         2.322 us   |          }
>> 380.620160: funcgraph_entry:        0.270 us   |          iommu_pgsize();
>> 380.620160: funcgraph_entry:                   |          intel_iommu_map() {
>> 380.620161: funcgraph_entry:        0.957 us   |            __domain_mapping();
>> 380.620162: funcgraph_entry:        0.275 us   |            iommu_flush_write_buffer();
>> 380.620163: funcgraph_exit:         2.315 us   |          }
>> 380.620163: funcgraph_entry:        0.265 us   |          iommu_pgsize();
>> 380.620163: funcgraph_entry:                   |          intel_iommu_map() {
>> 380.620164: funcgraph_entry:        0.940 us   |            __domain_mapping();
>> 380.620165: funcgraph_entry:        0.270 us   |            iommu_flush_write_buffer();
>> 380.620166: funcgraph_exit:         2.295 us   |          }
>>  ....
>> 380.620247: funcgraph_entry:        0.262 us   |          iommu_pgsize();
>> 380.620248: funcgraph_entry:                   |          intel_iommu_map() {
>> 380.620248: funcgraph_entry:        0.935 us   |            __domain_mapping();
>> 380.620249: funcgraph_entry:        0.305 us   |            iommu_flush_write_buffer();
>> 380.620250: funcgraph_exit:         2.315 us   |          }
>> 380.620250: funcgraph_entry:        0.273 us   |          iommu_pgsize();
>> 380.620251: funcgraph_entry:                   |          intel_iommu_map() {
>> 380.620251: funcgraph_entry:        0.967 us   |            __domain_mapping();
>> 380.620253: funcgraph_entry:        0.265 us   |            iommu_flush_write_buffer();
>> 380.620253: funcgraph_exit:         2.310 us   |          }
>> 380.620254: funcgraph_exit:       + 97.388 us  |        }
>> 380.620254: funcgraph_exit:       + 97.960 us  |      }
>> 380.620254: funcgraph_exit:       + 98.482 us  |    }
>> 380.620255: funcgraph_exit:       ! 105.175 us |  }
>> 380.620260: xprtrdma_mr_map:      task:1607@5 mr.id=126 nents=30 122880@0xf06ee5bbf1920000:0x70011104 (TO_DEVICE)
>> 380.620261: xprtrdma_chunk_read:  task:1607@5 pos=148 122880@0xf06ee5bbf1920000:0x70011104 (more)
>> --
>> Chuck Lever

--
Chuck Lever




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: performance regression noted in v5.11-rc after c062db039f40
@ 2021-01-22 18:38                 ` Chuck Lever
  0 siblings, 0 replies; 36+ messages in thread
From: Chuck Lever @ 2021-01-22 18:38 UTC (permalink / raw)
  To: Robin Murphy
  Cc: isaacm, linux-rdma, Will Deacon, murphyt7, iommu, logang,
	Christoph Hellwig



> On Jan 22, 2021, at 12:38 PM, Robin Murphy <robin.murphy@arm.com> wrote:
> 
> On 2021-01-22 16:18, Chuck Lever wrote:
>>> On Jan 21, 2021, at 10:00 PM, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>>> 
>>> +Isaac
>>> 
>>> On 1/22/21 3:09 AM, Chuck Lever wrote:
>>>>> On Jan 18, 2021, at 1:00 PM, Robin Murphy <robin.murphy@arm.com> wrote:
>>>>> 
>>>>> On 2021-01-18 16:18, Chuck Lever wrote:
>>>>>>> On Jan 12, 2021, at 9:38 AM, Will Deacon <will@kernel.org> wrote:
>>>>>>> 
>>>>>>> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
>>>>>>> 
>>>>>>> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
>>>>>>>> Hi-
>>>>>>>> 
>>>>>>>> [ Please cc: me on replies, I'm not currently subscribed to
>>>>>>>> iommu@lists ].
>>>>>>>> 
>>>>>>>> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
>>>>>>>> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
>>>>>>>> 
>>>>>>>> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
>>>>>>>> 
>>>>>>>> For those not familiar with the way storage protocols use RDMA, The
>>>>>>>> initiator/client sets up memory regions and the target/server uses
>>>>>>>> RDMA Read and Write to move data out of and into those regions. The
>>>>>>>> initiator/client uses only RDMA memory registration and invalidation
>>>>>>>> operations, and the target/server uses RDMA Read and Write.
>>>>>>>> 
>>>>>>>> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
>>>>>>>> enabled using the kernel command line options "intel_iommu=on
>>>>>>>> iommu=strict".
>>>>>>>> 
>>>>>>>> Recently I've noticed a significant (25-30%) loss in NFS throughput.
>>>>>>>> I was able to bisect on my client to the following commits.
>>>>>>>> 
>>>>>>>> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>>>>>>>> map_sg"). This is about normal for this test.
>>>>>>>> 
>>>>>>>> 	Children see throughput for 12 initial writers 	= 4732581.09 kB/sec
>>>>>>>> 	Parent sees throughput for 12 initial writers 	= 4646810.21 kB/sec
>>>>>>>> 	Min throughput per process 			=  387764.34 kB/sec
>>>>>>>> 	Max throughput per process 			=  399655.47 kB/sec
>>>>>>>> 	Avg throughput per process 			=  394381.76 kB/sec
>>>>>>>> 	Min xfer 					= 1017344.00 kB
>>>>>>>> 	CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
>>>>>>>> 	Children see throughput for 12 rewriters 	= 4837741.94 kB/sec
>>>>>>>> 	Parent sees throughput for 12 rewriters 	= 4833509.35 kB/sec
>>>>>>>> 	Min throughput per process 			=  398983.72 kB/sec
>>>>>>>> 	Max throughput per process 			=  406199.66 kB/sec
>>>>>>>> 	Avg throughput per process 			=  403145.16 kB/sec
>>>>>>>> 	Min xfer 					= 1030656.00 kB
>>>>>>>> 	CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
>>>>>>>> 	Children see throughput for 12 readers 		= 5921370.94 kB/sec
>>>>>>>> 	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
>>>>>>>> 	Min throughput per process 			=  491812.38 kB/sec
>>>>>>>> 	Max throughput per process 			=  494777.28 kB/sec
>>>>>>>> 	Avg throughput per process 			=  493447.58 kB/sec
>>>>>>>> 	Min xfer 					= 1042688.00 kB
>>>>>>>> 	CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
>>>>>>>> 	Children see throughput for 12 re-readers 	= 5947985.69 kB/sec
>>>>>>>> 	Parent sees throughput for 12 re-readers 	= 5941348.51 kB/sec
>>>>>>>> 	Min throughput per process 			=  492805.81 kB/sec
>>>>>>>> 	Max throughput per process 			=  497280.19 kB/sec
>>>>>>>> 	Avg throughput per process 			=  495665.47 kB/sec
>>>>>>>> 	Min xfer 					= 1039360.00 kB
>>>>>>>> 	CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %
>>>>>>>> 
>>>>>>>> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
>>>>>>>> iommu_ops.at(de)tach_dev"). It's losing some steam here.
>>>>>>>> 
>>>>>>>> 	Children see throughput for 12 initial writers 	= 4342419.12 kB/sec
>>>>>>>> 	Parent sees throughput for 12 initial writers 	= 4310612.79 kB/sec
>>>>>>>> 	Min throughput per process 			=  359299.06 kB/sec
>>>>>>>> 	Max throughput per process 			=  363866.16 kB/sec
>>>>>>>> 	Avg throughput per process 			=  361868.26 kB/sec
>>>>>>>> 	Min xfer 					= 1035520.00 kB
>>>>>>>> 	CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
>>>>>>>> 	Children see throughput for 12 rewriters 	= 4408576.66 kB/sec
>>>>>>>> 	Parent sees throughput for 12 rewriters 	= 4404280.87 kB/sec
>>>>>>>> 	Min throughput per process 			=  364553.88 kB/sec
>>>>>>>> 	Max throughput per process 			=  370029.28 kB/sec
>>>>>>>> 	Avg throughput per process 			=  367381.39 kB/sec
>>>>>>>> 	Min xfer 					= 1033216.00 kB
>>>>>>>> 	CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
>>>>>>>> 	Children see throughput for 12 readers 		= 5406879.47 kB/sec
>>>>>>>> 	Parent sees throughput for 12 readers 		= 5401862.78 kB/sec
>>>>>>>> 	Min throughput per process 			=  449583.03 kB/sec
>>>>>>>> 	Max throughput per process 			=  451761.69 kB/sec
>>>>>>>> 	Avg throughput per process 			=  450573.29 kB/sec
>>>>>>>> 	Min xfer 					= 1044224.00 kB
>>>>>>>> 	CPU utilization: Wall time    2.323    CPU time    1.977    CPU utilization  85.12 %
>>>>>>>> 	Children see throughput for 12 re-readers 	= 5410601.12 kB/sec
>>>>>>>> 	Parent sees throughput for 12 re-readers 	= 5403504.40 kB/sec
>>>>>>>> 	Min throughput per process 			=  449918.12 kB/sec
>>>>>>>> 	Max throughput per process 			=  452489.28 kB/sec
>>>>>>>> 	Avg throughput per process 			=  450883.43 kB/sec
>>>>>>>> 	Min xfer 					= 1043456.00 kB
>>>>>>>> 	CPU utilization: Wall time    2.321    CPU time    1.978    CPU utilization  85.21 %
>>>>>>>> 
>>>>>>>> And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
>>>>>>>> the iommu ops"). Significant throughput loss.
>>>>>>>> 
>>>>>>>> 	Children see throughput for 12 initial writers 	= 3812036.91 kB/sec
>>>>>>>> 	Parent sees throughput for 12 initial writers 	= 3753683.40 kB/sec
>>>>>>>> 	Min throughput per process 			=  313672.25 kB/sec
>>>>>>>> 	Max throughput per process 			=  321719.44 kB/sec
>>>>>>>> 	Avg throughput per process 			=  317669.74 kB/sec
>>>>>>>> 	Min xfer 					= 1022464.00 kB
>>>>>>>> 	CPU Utilization: Wall time    3.309    CPU time    1.986    CPU utilization  60.02 %
>>>>>>>> 	Children see throughput for 12 rewriters 	= 3786831.94 kB/sec
>>>>>>>> 	Parent sees throughput for 12 rewriters 	= 3783205.58 kB/sec
>>>>>>>> 	Min throughput per process 			=  313654.44 kB/sec
>>>>>>>> 	Max throughput per process 			=  317844.50 kB/sec
>>>>>>>> 	Avg throughput per process 			=  315569.33 kB/sec
>>>>>>>> 	Min xfer 					= 1035520.00 kB
>>>>>>>> 	CPU utilization: Wall time    3.302    CPU time    1.945    CPU utilization  58.90 %
>>>>>>>> 	Children see throughput for 12 readers 		= 4265828.28 kB/sec
>>>>>>>> 	Parent sees throughput for 12 readers 		= 4261844.88 kB/sec
>>>>>>>> 	Min throughput per process 			=  352305.00 kB/sec
>>>>>>>> 	Max throughput per process 			=  357726.22 kB/sec
>>>>>>>> 	Avg throughput per process 			=  355485.69 kB/sec
>>>>>>>> 	Min xfer 					= 1032960.00 kB
>>>>>>>> 	CPU utilization: Wall time    2.934    CPU time    1.942    CPU utilization  66.20 %
>>>>>>>> 	Children see throughput for 12 re-readers 	= 4220651.19 kB/sec
>>>>>>>> 	Parent sees throughput for 12 re-readers 	= 4216096.04 kB/sec
>>>>>>>> 	Min throughput per process 			=  348677.16 kB/sec
>>>>>>>> 	Max throughput per process 			=  353467.44 kB/sec
>>>>>>>> 	Avg throughput per process 			=  351720.93 kB/sec
>>>>>>>> 	Min xfer 					= 1035264.00 kB
>>>>>>>> 	CPU utilization: Wall time    2.969    CPU time    1.952    CPU utilization  65.74 %
>>>>>>>> 
>>>>>>>> The regression appears to be 100% reproducible.
>>>>>> Any thoughts?
>>>>>> How about some tools to try or debugging advice? I don't know where to start.
>>>>> 
>>>>> I'm not familiar enough with VT-D internals or Infiniband to have a clue why the middle commit makes any difference (the calculation itself is not on a fast path, so AFAICS the worst it could do is change your maximum DMA address size from 48/57 bits to 47/56, and that seems relatively benign).
>>>>> 
>>>>> With the last commit, though, at least part of it is likely to be the unfortunate inevitable overhead of the internal indirection through the IOMMU API. There's a coincidental performance-related thread where we've already started pondering some ideas in that area[1] (note that Intel is the last one to the party here; AMD has been using this path for a while, and it's all that arm64 systems have ever known). I'm not sure if there's any difference in the strict invalidation behaviour between the IOMMU API calls and the old intel_dma_ops, but I suppose that might be worth quickly double-checking as well. I guess the main thing would be to do some profiling to see where time is being spent in iommu-dma and intel-iommu vs. just different parts of intel-iommu before, and whether anything in particular stands out beyond the extra call overhead currently incurred by iommu_{map,unmap}.
>>>> I did a function_graph trace of the above iozone test on a v5.10 NFS
>>>> client and again on v5.11-rc. There is a substantial timing difference
>>>> in dma_map_sg_attrs. Each excerpt below is for DMA-mapping a 120KB set
>>>> of pages that are part of an NFS/RDMA WRITE operation.
>>>> v5.10:
>>>> 1072.028308: funcgraph_entry:                   |  dma_map_sg_attrs() {
>>>> 1072.028308: funcgraph_entry:                   |    intel_map_sg() {
>>>> 1072.028309: funcgraph_entry:                   |      find_domain() {
>>>> 1072.028309: funcgraph_entry:        0.280 us   |        get_domain_info();
>>>> 1072.028310: funcgraph_exit:         0.930 us   |      }
>>>> 1072.028310: funcgraph_entry:        0.360 us   |      domain_get_iommu();
>>>> 1072.028311: funcgraph_entry:                   |      intel_alloc_iova() {
>>>> 1072.028311: funcgraph_entry:                   |        alloc_iova_fast() {
>>>> 1072.028311: funcgraph_entry:        0.375 us   |          _raw_spin_lock_irqsave();
>>>> 1072.028312: funcgraph_entry:        0.285 us   |          __lock_text_start();
>>>> 1072.028313: funcgraph_exit:         1.500 us   |        }
>>>> 1072.028313: funcgraph_exit:         2.052 us   |      }
>>>> 1072.028313: funcgraph_entry:                   |      domain_mapping() {
>>>> 1072.028313: funcgraph_entry:                   |        __domain_mapping() {
>>>> 1072.028314: funcgraph_entry:        0.350 us   |          pfn_to_dma_pte();
>>>> 1072.028315: funcgraph_entry:        0.942 us   |          domain_flush_cache();
>>>> 1072.028316: funcgraph_exit:         2.852 us   |        }
>>>> 1072.028316: funcgraph_entry:        0.275 us   |        iommu_flush_write_buffer();
>>>> 1072.028317: funcgraph_exit:         4.213 us   |      }
>>>> 1072.028318: funcgraph_exit:         9.392 us   |    }
>>>> 1072.028318: funcgraph_exit:       + 10.073 us  |  }
>>>> 1072.028323: xprtrdma_mr_map:      mr.id=115 nents=30 122880@0xe476ca03f1180000:0x18011105 (TO_DEVICE)
>>>> 1072.028323: xprtrdma_chunk_read:  task:63879@5 pos=148 122880@0xe476ca03f1180000:0x18011105 (more)
>>>> v5.11-rc:
>>>> 57.602990: funcgraph_entry:                   |  dma_map_sg_attrs() {
>>>> 57.602990: funcgraph_entry:                   |    iommu_dma_map_sg() {
>>>> 57.602990: funcgraph_entry:        0.285 us   |      iommu_get_dma_domain();
>>>> 57.602991: funcgraph_entry:        0.270 us   |      iommu_dma_deferred_attach();
>>>> 57.602991: funcgraph_entry:                   |      iommu_dma_sync_sg_for_device() {
>>>> 57.602992: funcgraph_entry:        0.268 us   |        dev_is_untrusted();
>>>> 57.602992: funcgraph_exit:         0.815 us   |      }
>>>> 57.602993: funcgraph_entry:        0.267 us   |      dev_is_untrusted();
>>>> 57.602993: funcgraph_entry:                   |      iommu_dma_alloc_iova() {
>>>> 57.602994: funcgraph_entry:                   |        alloc_iova_fast() {
>>>> 57.602994: funcgraph_entry:        0.260 us   |          _raw_spin_lock_irqsave();
>>>> 57.602995: funcgraph_entry:        0.293 us   |          _raw_spin_lock();
>>>> 57.602995: funcgraph_entry:        0.273 us   |          _raw_spin_unlock_irqrestore();
>>>> 57.602996: funcgraph_entry:        1.147 us   |          alloc_iova();
>>>> 57.602997: funcgraph_exit:         3.370 us   |        }
>>>> 57.602997: funcgraph_exit:         3.945 us   |      }
>>>> 57.602998: funcgraph_entry:        0.272 us   |      dma_info_to_prot();
>>>> 57.602998: funcgraph_entry:                   |      iommu_map_sg_atomic() {
>>>> 57.602998: funcgraph_entry:                   |        __iommu_map_sg() {
>>>> 57.602999: funcgraph_entry:        1.733 us   |          __iommu_map();
>>>> 57.603001: funcgraph_entry:        1.642 us   |          __iommu_map();
>>>> 57.603003: funcgraph_entry:        1.638 us   |          __iommu_map();
>>>> 57.603005: funcgraph_entry:        1.645 us   |          __iommu_map();
>>>> 57.603007: funcgraph_entry:        1.630 us   |          __iommu_map();
>>>> 57.603009: funcgraph_entry:        1.770 us   |          __iommu_map();
>>>> 57.603011: funcgraph_entry:        1.730 us   |          __iommu_map();
>>>> 57.603013: funcgraph_entry:        1.633 us   |          __iommu_map();
>>>> 57.603015: funcgraph_entry:        1.605 us   |          __iommu_map();
>>>> 57.603017: funcgraph_entry:        2.847 us   |          __iommu_map();
>>>> 57.603020: funcgraph_entry:        2.847 us   |          __iommu_map();
>>>> 57.603024: funcgraph_entry:        2.955 us   |          __iommu_map();
>>>> 57.603027: funcgraph_entry:        2.928 us   |          __iommu_map();
>>>> 57.603030: funcgraph_entry:        2.933 us   |          __iommu_map();
>>>> 57.603034: funcgraph_entry:        2.943 us   |          __iommu_map();
>>>> 57.603037: funcgraph_entry:        2.928 us   |          __iommu_map();
>>>> 57.603040: funcgraph_entry:        2.857 us   |          __iommu_map();
>>>> 57.603044: funcgraph_entry:        2.953 us   |          __iommu_map();
>>>> 57.603047: funcgraph_entry:        3.023 us   |          __iommu_map();
>>>> 57.603050: funcgraph_entry:        1.645 us   |          __iommu_map();
>>>> 57.603052: funcgraph_exit:       + 53.648 us  |        }
>>>> 57.603052: funcgraph_exit:       + 54.178 us  |      }
>>>> 57.603053: funcgraph_exit:       + 62.953 us  |    }
>>>> 57.603053: funcgraph_exit:       + 63.567 us  |  }
>>>> 57.603059: xprtrdma_mr_map:      task:60@5 mr.id=4 nents=30 122880@0xd79cc0e2f18c0000:0x00010501 (TO_DEVICE)
>>>> 57.603060: xprtrdma_chunk_read:  task:60@5 pos=148 122880@0xd79cc0e2f18c0000:0x00010501 (more)
>>> 
>>> I kind of believe it's due to the indirect calls. This is also reported
>>> on ARM.
>>> 
>>> https://lore.kernel.org/linux-iommu/1610376862-927-1-git-send-email-isaacm@codeaurora.org/
>>> 
>>> Maybe we can try changing indirect calls to static ones to verify this
>>> problem.
>> I liked the idea of map_sg() enough to try my hand at building a PoC for
>> Intel, based on Isaac's patch series. It's just a cut-and-paste of the
>> generic iommu.c code with the indirect calls to ops->map() replaced.
>> The indirect calls do not seem to be the problem. Calling intel_iommu_map
>> directly appears to be as costly as calling it indirectly.
>> However, perhaps there are other ways map_sg() can be beneficial. In
>> v5.10, __domain_mapping and iommu_flush_write_buffer() appear to be
>> invoked just once for each large map operation, for example.
> 
> Oh, if the driver needs to do maintenance beyond just installing PTEs, that should probably be devolved to iotlb_sync_map anyway.

My naive observation is that the expensive part of intel_iommu_map()
seems to be clflush_cache_range().
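
Purely as a sketch of where that cost comes from (not a quote of the
driver; the helper names below are invented for illustration, only
clflush_cache_range() is real), the pattern on hardware whose IOMMU
does not snoop the CPU cache looks roughly like:

	/* illustrative sketch, not the actual intel-iommu.c code */
	static void sketch_flush_new_ptes(struct dmar_domain *domain,
					  void *first_pte, int size)
	{
		/* a non-coherent IOMMU forces the CPU to write back the
		 * PTEs it just filled in -- once per map call */
		if (!domain_is_coherent(domain))	/* hypothetical check */
			clflush_cache_range(first_pte, size);
	}

so a 30-element scatterlist mapped one element at a time pays for
roughly 30 of these flushes rather than one.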


> There's a patch series here generalising that to be more useful, which is hopefully just waiting to be merged now:
> 
> https://lore.kernel.org/linux-iommu/20210107122909.16317-1-yong.wu@mediatek.com/

The Intel IOMMU driver would have to grow an iotlb_sync_map callback,
if that's an appropriate place to handle a clflush.
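
Purely as a thought experiment (this is not a patch; the callback body
and the helper below are invented, and I'm assuming the current
iommu_ops prototype that takes only the domain), the shape of such a
change might be:

	/* hypothetical sketch only */
	static void intel_iommu_iotlb_sync_map(struct iommu_domain *domain)
	{
		struct dmar_domain *dmar_domain = to_dmar_domain(domain);

		/* write back every PTE dirtied since the last sync and
		 * poke the write buffer once, instead of per element */
		flush_accumulated_pte_writes(dmar_domain);	/* made-up helper */
	}

	...
		.map		= intel_iommu_map,
		.iotlb_sync_map	= intel_iommu_iotlb_sync_map,
	...

with __domain_mapping() skipping its per-call flush when it knows a
sync will follow.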

My concern is that none of these deeper changes seem appropriate for
5.11-rc. What can be done to address this rather noticeable performance
regression before v5.11 final?


> Robin.
> 
>> Here's a trace of my prototype in operation:
>> 380.620150: funcgraph_entry:                   |  iommu_dma_map_sg() {
>> 380.620150: funcgraph_entry:        0.285 us   |    iommu_get_dma_domain();
>> 380.620150: funcgraph_entry:        0.265 us   |    iommu_dma_deferred_attach();
>> 380.620151: funcgraph_entry:                   |    iommu_dma_sync_sg_for_device() {
>> 380.620151: funcgraph_entry:        0.285 us   |      dev_is_untrusted();
>> 380.620152: funcgraph_exit:         0.860 us   |    }
>> 380.620152: funcgraph_entry:        0.263 us   |    dev_is_untrusted();
>> 380.620153: funcgraph_entry:                   |    iommu_dma_alloc_iova() {
>> 380.620153: funcgraph_entry:                   |      alloc_iova_fast() {
>> 380.620153: funcgraph_entry:        0.268 us   |        _raw_spin_lock_irqsave();
>> 380.620154: funcgraph_entry:        0.275 us   |        _raw_spin_unlock_irqrestore();
>> 380.620155: funcgraph_exit:         1.402 us   |      }
>> 380.620155: funcgraph_exit:         1.955 us   |    }
>> 380.620155: funcgraph_entry:        0.265 us   |    dma_info_to_prot();
>> 380.620156: funcgraph_entry:                   |    iommu_map_sg_atomic() {
>> 380.620156: funcgraph_entry:                   |      __iommu_map_sg() {
>> 380.620156: funcgraph_entry:                   |        intel_iommu_map_sg() {
>> 380.620157: funcgraph_entry:        0.270 us   |          iommu_pgsize();
>> 380.620157: funcgraph_entry:                   |          intel_iommu_map() {
>> 380.620157: funcgraph_entry:        0.970 us   |            __domain_mapping();
>> 380.620159: funcgraph_entry:        0.265 us   |            iommu_flush_write_buffer();
>> 380.620159: funcgraph_exit:         2.322 us   |          }
>> 380.620160: funcgraph_entry:        0.270 us   |          iommu_pgsize();
>> 380.620160: funcgraph_entry:                   |          intel_iommu_map() {
>> 380.620161: funcgraph_entry:        0.957 us   |            __domain_mapping();
>> 380.620162: funcgraph_entry:        0.275 us   |            iommu_flush_write_buffer();
>> 380.620163: funcgraph_exit:         2.315 us   |          }
>> 380.620163: funcgraph_entry:        0.265 us   |          iommu_pgsize();
>> 380.620163: funcgraph_entry:                   |          intel_iommu_map() {
>> 380.620164: funcgraph_entry:        0.940 us   |            __domain_mapping();
>> 380.620165: funcgraph_entry:        0.270 us   |            iommu_flush_write_buffer();
>> 380.620166: funcgraph_exit:         2.295 us   |          }
>>  ....
>> 380.620247: funcgraph_entry:        0.262 us   |          iommu_pgsize();
>> 380.620248: funcgraph_entry:                   |          intel_iommu_map() {
>> 380.620248: funcgraph_entry:        0.935 us   |            __domain_mapping();
>> 380.620249: funcgraph_entry:        0.305 us   |            iommu_flush_write_buffer();
>> 380.620250: funcgraph_exit:         2.315 us   |          }
>> 380.620250: funcgraph_entry:        0.273 us   |          iommu_pgsize();
>> 380.620251: funcgraph_entry:                   |          intel_iommu_map() {
>> 380.620251: funcgraph_entry:        0.967 us   |            __domain_mapping();
>> 380.620253: funcgraph_entry:        0.265 us   |            iommu_flush_write_buffer();
>> 380.620253: funcgraph_exit:         2.310 us   |          }
>> 380.620254: funcgraph_exit:       + 97.388 us  |        }
>> 380.620254: funcgraph_exit:       + 97.960 us  |      }
>> 380.620254: funcgraph_exit:       + 98.482 us  |    }
>> 380.620255: funcgraph_exit:       ! 105.175 us |  }
>> 380.620260: xprtrdma_mr_map:      task:1607@5 mr.id=126 nents=30 122880@0xf06ee5bbf1920000:0x70011104 (TO_DEVICE)
>> 380.620261: xprtrdma_chunk_read:  task:1607@5 pos=148 122880@0xf06ee5bbf1920000:0x70011104 (more)
>> --
>> Chuck Lever

--
Chuck Lever



* Re: performance regression noted in v5.11-rc after c062db039f40
  2021-01-22 17:38               ` Robin Murphy
@ 2021-01-24  7:17                 ` Lu Baolu
  -1 siblings, 0 replies; 36+ messages in thread
From: Lu Baolu @ 2021-01-24  7:17 UTC (permalink / raw)
  To: Robin Murphy, Chuck Lever
  Cc: baolu.lu, iommu, Will Deacon, linux-rdma, logang,
	Christoph Hellwig, murphyt7, isaacm

On 2021/1/23 1:38, Robin Murphy wrote:
>>> I kind of believe it's due to the indirect calls. This is also reported
>>> on ARM.
>>>
>>> https://lore.kernel.org/linux-iommu/1610376862-927-1-git-send-email-isaacm@codeaurora.org/ 
>>>
>>>
>>> Maybe we can try changing indirect calls to static ones to verify this
>>> problem.
>>
>> I liked the idea of map_sg() enough to try my hand at building a PoC for
>> Intel, based on Isaac's patch series. It's just a cut-and-paste of the
>> generic iommu.c code with the indirect calls to ops->map() replaced.
>>
>> The indirect calls do not seem to be the problem. Calling intel_iommu_map
>> directly appears to be as costly as calling it indirectly.
>>
>> However, perhaps there are other ways map_sg() can be beneficial. In
>> v5.10, __domain_mapping and iommu_flush_write_buffer() appear to be
>> invoked just once for each large map operation, for example.
> 
> Oh, if the driver needs to do maintenance beyond just installing PTEs, 
> that should probably be devolved to iotlb_sync_map anyway. There's a 
> patch series here generalising that to be more useful, which is 
> hopefully just waiting to be merged now:
> 
> https://lore.kernel.org/linux-iommu/20210107122909.16317-1-yong.wu@mediatek.com/ 
> 

As far as I can see, iotlb_sync_map() could help here. I will post a
call-for-test patch set later.
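
Roughly (just a sketch of the idea, not the actual patch), the benefit
is that the core sg-mapping path can defer all maintenance to a single
sync at the end instead of flushing inside every map:

	/* illustrative only; simplified from the generic __iommu_map_sg() shape */
	static size_t sketch_map_sg(struct iommu_domain *domain, unsigned long iova,
				    struct scatterlist *sg, unsigned int nents, int prot)
	{
		const struct iommu_ops *ops = domain->ops;
		struct scatterlist *s;
		size_t mapped = 0;
		unsigned int i;

		for_each_sg(sg, s, nents, i) {
			if (__iommu_map(domain, iova + mapped, sg_phys(s),
					s->length, prot, GFP_ATOMIC))	/* no flush here */
				break;
			mapped += s->length;
		}

		if (ops->iotlb_sync_map)
			ops->iotlb_sync_map(domain);	/* one batched flush */

		return mapped;
	}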

> 
> Robin.

Best regards,
baolu


Thread overview: 36+ messages
2021-01-08 21:18 performance regression noted in v5.11-rc after c062db039f40 Chuck Lever
2021-01-12 14:38 ` Will Deacon
2021-01-13  2:25   ` Lu Baolu
2021-01-13 14:07     ` Chuck Lever
2021-01-13 18:30       ` Chuck Lever
2021-01-18 16:18   ` Chuck Lever
2021-01-18 18:00     ` Robin Murphy
2021-01-18 20:09       ` Chuck Lever
2021-01-19  1:22         ` Lu Baolu
2021-01-19 14:37           ` Chuck Lever
2021-01-20  2:11             ` Lu Baolu
2021-01-20 20:25               ` Chuck Lever
2021-01-21 19:09       ` Chuck Lever
2021-01-22  3:00         ` Lu Baolu
2021-01-22 16:18           ` Chuck Lever
2021-01-22 17:38             ` Robin Murphy
2021-01-22 18:38               ` Chuck Lever
2021-01-24  7:17               ` Lu Baolu
