linux-kernel.vger.kernel.org archive mirror
* [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote NUMA node
@ 2021-07-09  8:38 Ming Lei
  2021-07-09 10:16 ` Russell King (Oracle)
  2021-07-09 10:26 ` Robin Murphy
  0 siblings, 2 replies; 30+ messages in thread
From: Ming Lei @ 2021-07-09  8:38 UTC (permalink / raw)
  To: linux-nvme, Will Deacon, linux-arm-kernel, iommu; +Cc: linux-kernel

Hello,

I observed that NVMe performance is very bad when running fio on a CPU
(aarch64) in the remote NUMA node, compared with the NVMe device's own
PCI NUMA node.

Please see the test results[1]: 327K vs. 34.9K IOPS.

A latency trace shows that one big difference is in iommu_dma_unmap_sg():
1111 nsecs vs. 25437 nsecs.


[1] fio test & results

1) fio test result:

- run fio on local CPU
taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting

IOPS: 327K
avg latency of iommu_dma_unmap_sg(): 1111 nsecs


- run fio on remote CPU
taskset -c 80 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting

IOPS: 34.9K
avg latency of iommu_dma_unmap_sg(): 25437 nsecs
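A rough cross-check of these figures (a derived sketch, not part of the original report; it treats completion processing as serialized on the single polling CPU) suggests the extra unmap latency accounts for nearly all of the per-IO slowdown:

```python
# Rough per-IO time budget derived from the figures above.
# All times in nanoseconds.

local_unmap_ns = 1111
remote_unmap_ns = 25437

local_iops = 327_000
remote_iops = 34_900

# Average wall-clock time per IO at each IOPS figure (single fio job,
# hipri polling, so completions are roughly serialized on one CPU).
local_io_ns = 1e9 / local_iops      # ~3058 ns
remote_io_ns = 1e9 / remote_iops    # ~28653 ns

extra_io_ns = remote_io_ns - local_io_ns           # ~25595 ns
extra_unmap_ns = remote_unmap_ns - local_unmap_ns  # 24326 ns

print(f"extra per-IO time:   {extra_io_ns:.0f} ns")
print(f"extra unmap latency: {extra_unmap_ns} ns")
print(f"fraction explained:  {extra_unmap_ns / extra_io_ns:.0%}")
```

Under that (simplified) model, the added unmap cost explains roughly 95% of the IOPS gap.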

2) system info
[root@ampere-mtjade-04 ~]# lscpu | grep NUMA
NUMA node(s):                    2
NUMA node0 CPU(s):               0-79
NUMA node1 CPU(s):               80-159

lspci | grep NVMe
0003:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983

[root@ampere-mtjade-04 ~]# cat /sys/block/nvme1n1/device/device/numa_node
0



Thanks, 
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote NUMA node
  2021-07-09  8:38 [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote NUMA node Ming Lei
@ 2021-07-09 10:16 ` Russell King (Oracle)
  2021-07-09 14:21   ` Ming Lei
  2021-07-09 10:26 ` Robin Murphy
  1 sibling, 1 reply; 30+ messages in thread
From: Russell King (Oracle) @ 2021-07-09 10:16 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-nvme, Will Deacon, linux-arm-kernel, iommu, linux-kernel

On Fri, Jul 09, 2021 at 04:38:09PM +0800, Ming Lei wrote:
> I observed that NVMe performance is very bad when running fio on a CPU
> (aarch64) in the remote NUMA node, compared with the NVMe device's own
> PCI NUMA node.

Have you checked the effect of running a memory-heavy process using
memory from node 1 while being executed by CPUs in node 0?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


* Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote NUMA node
  2021-07-09  8:38 [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote NUMA node Ming Lei
  2021-07-09 10:16 ` Russell King (Oracle)
@ 2021-07-09 10:26 ` Robin Murphy
  2021-07-09 11:04   ` John Garry
  2021-07-09 14:24   ` Ming Lei
  1 sibling, 2 replies; 30+ messages in thread
From: Robin Murphy @ 2021-07-09 10:26 UTC (permalink / raw)
  To: Ming Lei, linux-nvme, Will Deacon, linux-arm-kernel, iommu; +Cc: linux-kernel

On 2021-07-09 09:38, Ming Lei wrote:
> Hello,
> 
> I observed that NVMe performance is very bad when running fio on a CPU
> (aarch64) in the remote NUMA node, compared with the NVMe device's own
> PCI NUMA node.
> 
> Please see the test results[1]: 327K vs. 34.9K IOPS.
> 
> A latency trace shows that one big difference is in iommu_dma_unmap_sg():
> 1111 nsecs vs. 25437 nsecs.

Are you able to dig down further into that? iommu_dma_unmap_sg() itself 
doesn't do anything particularly special, so whatever makes a difference 
is probably happening at a lower level, and I suspect there's probably 
an SMMU involved. If for instance it turns out to go all the way down to 
__arm_smmu_cmdq_poll_until_consumed() because polling MMIO from the 
wrong node is slow, there's unlikely to be much you can do about that 
other than the global "go faster" knobs (iommu.strict and 
iommu.passthrough) with their associated compromises.
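For reference, both knobs are standard kernel command-line parameters (documented in Documentation/admin-guide/kernel-parameters.txt); setting either would look something like the sketch below, with the compromises noted above:

```
# Relax invalidation: defer and batch IOTLB invalidations
# (weaker device isolation while mappings are being torn down)
iommu.strict=0

# Or bypass DMA translation entirely (no IOMMU protection at all)
iommu.passthrough=1
```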

Robin.

> [1] fio test & results
> 
> 1) fio test result:
> 
> - run fio on local CPU
> taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
> + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
> 
> IOPS: 327K
> avg latency of iommu_dma_unmap_sg(): 1111 nsecs
> 
> 
> - run fio on remote CPU
> taskset -c 80 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
> + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
> 
> IOPS: 34.9K
> avg latency of iommu_dma_unmap_sg(): 25437 nsecs
> 
> 2) system info
> [root@ampere-mtjade-04 ~]# lscpu | grep NUMA
> NUMA node(s):                    2
> NUMA node0 CPU(s):               0-79
> NUMA node1 CPU(s):               80-159
> 
> lspci | grep NVMe
> 0003:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
> 
> [root@ampere-mtjade-04 ~]# cat /sys/block/nvme1n1/device/device/numa_node
> 0
> 
> 
> 
> Thanks,
> Ming
> 
> 


* Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote NUMA node
  2021-07-09 10:26 ` Robin Murphy
@ 2021-07-09 11:04   ` John Garry
  2021-07-09 12:34     ` Robin Murphy
  2021-07-09 14:24   ` Ming Lei
  1 sibling, 1 reply; 30+ messages in thread
From: John Garry @ 2021-07-09 11:04 UTC (permalink / raw)
  To: Robin Murphy, Ming Lei, linux-nvme, Will Deacon, linux-arm-kernel, iommu
  Cc: linux-kernel

On 09/07/2021 11:26, Robin Murphy wrote:
> On 2021-07-09 09:38, Ming Lei wrote:
>> Hello,
>>
>> I observed that NVMe performance is very bad when running fio on a CPU
>> (aarch64) in the remote NUMA node, compared with the NVMe device's own
>> PCI NUMA node.
>>
>> Please see the test results[1]: 327K vs. 34.9K IOPS.
>>
>> A latency trace shows that one big difference is in iommu_dma_unmap_sg():
>> 1111 nsecs vs. 25437 nsecs.
> 
> Are you able to dig down further into that? iommu_dma_unmap_sg() itself 
> doesn't do anything particularly special, so whatever makes a difference 
> is probably happening at a lower level, and I suspect there's probably 
> an SMMU involved. If for instance it turns out to go all the way down to 
> __arm_smmu_cmdq_poll_until_consumed() because polling MMIO from the 
> wrong node is slow, there's unlikely to be much you can do about that 
> other than the global "go faster" knobs (iommu.strict and 
> iommu.passthrough) with their associated compromises.

There was also the disable_msipolling option:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c#n42

But I am not sure if that platform even supports MSI polling (or has an 
SMMUv3).

You could also try the iommu.forcedac=1 cmdline option. But I doubt it 
will help, since the issue was mentioned to be NUMA-related.
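For completeness, the two options mentioned here are also boot-time settings. Assuming the SMMUv3 driver is built in, they would look something like this (illustrative only, untested on this platform; the parameter prefix follows the driver's module name):

```
# Force polling-based CMD_SYNC completion instead of MSI-based
arm_smmu_v3.disable_msipolling=1

# Allocate IOVAs from the full address space rather than
# preferring addresses below 32 bits
iommu.forcedac=1
```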

> 
> Robin.
> 
>> [1] fio test & results
>>
>> 1) fio test result:
>>
>> - run fio on local CPU
>> taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
>> + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri 
>> --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 
>> --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 
>> --rw=randread --name=test --group_reporting
>>
>> IOPS: 327K
>> avg latency of iommu_dma_unmap_sg(): 1111 nsecs
>>
>>
>> - run fio on remote CPU
>> taskset -c 80 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
>> + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri 
>> --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 
>> --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 
>> --rw=randread --name=test --group_reporting
>>
>> IOPS: 34.9K
>> avg latency of iommu_dma_unmap_sg(): 25437 nsecs
>>
>> 2) system info
>> [root@ampere-mtjade-04 ~]# lscpu | grep NUMA
>> NUMA node(s):                    2
>> NUMA node0 CPU(s):               0-79
>> NUMA node1 CPU(s):               80-159
>>
>> lspci | grep NVMe
>> 0003:01:00.0 Non-Volatile memory controller: Samsung Electronics Co 
>> Ltd NVMe SSD Controller SM981/PM981/PM983
>>
>> [root@ampere-mtjade-04 ~]# cat /sys/block/nvme1n1/device/device/numa_node 

Since it's Ampere, I guess it's SMMUv3.

BTW, if you remember, I did raise a performance issue of smmuv3 with 
NVMe before:
https://lore.kernel.org/linux-iommu/b2a6e26d-6d0d-7f0d-f222-589812f701d2@huawei.com/

I did have this series to improve performance for systems with many 
CPUs, like the above, but it was not accepted:
https://lore.kernel.org/linux-iommu/1598018062-175608-1-git-send-email-john.garry@huawei.com/

Thanks,
John



* Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote NUMA node
  2021-07-09 11:04   ` John Garry
@ 2021-07-09 12:34     ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2021-07-09 12:34 UTC (permalink / raw)
  To: John Garry, Ming Lei, linux-nvme, Will Deacon, linux-arm-kernel, iommu
  Cc: linux-kernel

On 2021-07-09 12:04, John Garry wrote:
> On 09/07/2021 11:26, Robin Murphy wrote:
>> On 2021-07-09 09:38, Ming Lei wrote:
>>> Hello,
>>>
>>> I observed that NVMe performance is very bad when running fio on a CPU
>>> (aarch64) in the remote NUMA node, compared with the NVMe device's own
>>> PCI NUMA node.
>>>
>>> Please see the test results[1]: 327K vs. 34.9K IOPS.
>>>
>>> A latency trace shows that one big difference is in iommu_dma_unmap_sg():
>>> 1111 nsecs vs. 25437 nsecs.
>>
>> Are you able to dig down further into that? iommu_dma_unmap_sg() 
>> itself doesn't do anything particularly special, so whatever makes a 
>> difference is probably happening at a lower level, and I suspect 
>> there's probably an SMMU involved. If for instance it turns out to go 
>> all the way down to __arm_smmu_cmdq_poll_until_consumed() because 
>> polling MMIO from the wrong node is slow, there's unlikely to be much 
>> you can do about that other than the global "go faster" knobs 
>> (iommu.strict and iommu.passthrough) with their associated compromises.
> 
> There was also the disable_msipolling option:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c#n42 
> 
> 
> But I am not sure if that platform even supports MSI polling (or has 
> smmu v3).

Hmm, I suppose in principle the MSI polling path could lead to a bit of 
cacheline ping-pong with the SMMU fetching and writing back to the sync 
command, but I'd rather find out more details of where exactly the extra 
time is being spent in this particular situation than speculate much 
further.

> You could also try iommu.forcedac=1 cmdline option. But I doubt it will 
> help since the issue was mentioned to be NUMA related.

Plus that shouldn't make any difference to unmaps anyway.

>>> [1] fio test & results
>>>
>>> 1) fio test result:
>>>
>>> - run fio on local CPU
>>> taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
>>> + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri 
>>> --iodepth=64 --iodepth_batch_submit=16 
>>> --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 
>>> --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
>>>
>>> IOPS: 327K
>>> avg latency of iommu_dma_unmap_sg(): 1111 nsecs
>>>
>>>
>>> - run fio on remote CPU
>>> taskset -c 80 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
>>> + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri 
>>> --iodepth=64 --iodepth_batch_submit=16 
>>> --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 
>>> --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
>>>
>>> IOPS: 34.9K
>>> avg latency of iommu_dma_unmap_sg(): 25437 nsecs
>>>
>>> 2) system info
>>> [root@ampere-mtjade-04 ~]# lscpu | grep NUMA
>>> NUMA node(s):                    2
>>> NUMA node0 CPU(s):               0-79
>>> NUMA node1 CPU(s):               80-159
>>>
>>> lspci | grep NVMe
>>> 0003:01:00.0 Non-Volatile memory controller: Samsung Electronics Co 
>>> Ltd NVMe SSD Controller SM981/PM981/PM983
>>>
>>> [root@ampere-mtjade-04 ~]# cat 
>>> /sys/block/nvme1n1/device/device/numa_node 
> 
> Since it's ampere, I guess it's smmu v3.
> 
> BTW, if you remember, I did raise a performance issue of smmuv3 with 
> NVMe before:
> https://lore.kernel.org/linux-iommu/b2a6e26d-6d0d-7f0d-f222-589812f701d2@huawei.com/ 

It doesn't seem like the best-case throughput is of concern in this case 
though, and my hunch is that a ~23x discrepancy in SMMU unmap 
performance depending on locality probably isn't specific to NVMe.

Robin.

> I did have this series to improve performance for systems with lots of 
> CPUs, like above, but not accepted:
> https://lore.kernel.org/linux-iommu/1598018062-175608-1-git-send-email-john.garry@huawei.com/ 
> 
> 
> Thanks,
> John
> 


* Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote NUMA node
  2021-07-09 10:16 ` Russell King (Oracle)
@ 2021-07-09 14:21   ` Ming Lei
  0 siblings, 0 replies; 30+ messages in thread
From: Ming Lei @ 2021-07-09 14:21 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: linux-nvme, Will Deacon, linux-arm-kernel, iommu, linux-kernel

On Fri, Jul 09, 2021 at 11:16:14AM +0100, Russell King (Oracle) wrote:
> On Fri, Jul 09, 2021 at 04:38:09PM +0800, Ming Lei wrote:
> > I observed that NVMe performance is very bad when running fio on a CPU
> > (aarch64) in the remote NUMA node, compared with the NVMe device's own
> > PCI NUMA node.
> 
> Have you checked the effect of running a memory-heavy process using
> memory from node 1 while being executed by CPUs in node 0?

1) aarch64
[root@ampere-mtjade-04 ~]# taskset -c 0 numactl -m 0  perf bench mem memcpy -s 4GB -f default
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 4GB bytes ...

      11.511752 GB/sec
[root@ampere-mtjade-04 ~]# taskset -c 0 numactl -m 1  perf bench mem memcpy -s 4GB -f default
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 4GB bytes ...

       3.084333 GB/sec


2) x86_64[1]
[root@hp-dl380g10-01 mingl]#  taskset -c 0 numactl -m 0  perf bench mem memcpy -s 4GB -f default
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 4GB bytes ...

       4.193927 GB/sec
[root@hp-dl380g10-01 mingl]#  taskset -c 0 numactl -m 1  perf bench mem memcpy -s 4GB -f default
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 4GB bytes ...

       3.553392 GB/sec


[1] On this x86_64 machine, IOPS can reach 680K in the same fio NVMe test.
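The local/remote penalty implied by these numbers differs sharply between the two machines (a quick derived computation, not in the original mail):

```python
# Cross-node memcpy penalty implied by the 'perf bench mem memcpy' runs above.
aarch64_local, aarch64_remote = 11.511752, 3.084333  # GB/sec
x86_local, x86_remote = 4.193927, 3.553392           # GB/sec

# aarch64 loses ~3.7x of its memcpy bandwidth cross-node; x86_64 only ~1.2x.
print(f"aarch64 local/remote: {aarch64_local / aarch64_remote:.2f}x")
print(f"x86_64  local/remote: {x86_local / x86_remote:.2f}x")
```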



Thanks,
Ming



* Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote NUMA node
  2021-07-09 10:26 ` Robin Murphy
  2021-07-09 11:04   ` John Garry
@ 2021-07-09 14:24   ` Ming Lei
  2021-07-19 16:14     ` John Garry
  1 sibling, 1 reply; 30+ messages in thread
From: Ming Lei @ 2021-07-09 14:24 UTC (permalink / raw)
  To: Robin Murphy
  Cc: linux-nvme, Will Deacon, linux-arm-kernel, iommu, linux-kernel

On Fri, Jul 09, 2021 at 11:26:53AM +0100, Robin Murphy wrote:
> On 2021-07-09 09:38, Ming Lei wrote:
> > Hello,
> > 
> > I observed that NVMe performance is very bad when running fio on a CPU
> > (aarch64) in the remote NUMA node, compared with the NVMe device's own
> > PCI NUMA node.
> > 
> > Please see the test results[1]: 327K vs. 34.9K IOPS.
> > 
> > A latency trace shows that one big difference is in iommu_dma_unmap_sg():
> > 1111 nsecs vs. 25437 nsecs.
> 
> Are you able to dig down further into that? iommu_dma_unmap_sg() itself
> doesn't do anything particularly special, so whatever makes a difference is
> probably happening at a lower level, and I suspect there's probably an SMMU
> involved. If for instance it turns out to go all the way down to
> __arm_smmu_cmdq_poll_until_consumed() because polling MMIO from the wrong
> node is slow, there's unlikely to be much you can do about that other than
> the global "go faster" knobs (iommu.strict and iommu.passthrough) with their
> associated compromises.

The 'perf report' log follows:

1) good (run fio on CPUs in the NVMe's NUMA node)

-   34.86%     1.73%  fio       [nvme]              [k] nvme_process_cq
   - 33.13% nvme_process_cq
      - 32.93% nvme_pci_complete_rq
         - 24.92% nvme_unmap_data
            - 20.08% dma_unmap_sg_attrs
               - 19.79% iommu_dma_unmap_sg
                  - 19.55% __iommu_dma_unmap
                     - 16.86% arm_smmu_iotlb_sync
                        - 16.81% arm_smmu_tlb_inv_range_domain
                           - 14.73% __arm_smmu_tlb_inv_range
                                14.44% arm_smmu_cmdq_issue_cmdlist
                             0.89% __pi_memset
                             0.75% arm_smmu_atc_inv_domain
                     + 1.58% iommu_unmap_fast
                     + 0.71% iommu_dma_free_iova
            - 3.25% dma_unmap_page_attrs
               - 3.21% iommu_dma_unmap_page
                  - 3.14% __iommu_dma_unmap_swiotlb
                     - 2.86% __iommu_dma_unmap
                        - 2.48% arm_smmu_iotlb_sync
                           - 2.47% arm_smmu_tlb_inv_range_domain
                              - 2.19% __arm_smmu_tlb_inv_range
                                   2.16% arm_smmu_cmdq_issue_cmdlist
            + 1.34% mempool_free
         + 7.68% nvme_complete_rq
   + 1.73% _start


2) bad (run fio on CPUs not in the NVMe's NUMA node)
-   49.25%     3.03%  fio       [nvme]              [k] nvme_process_cq
   - 46.22% nvme_process_cq
      - 46.07% nvme_pci_complete_rq
         - 41.02% nvme_unmap_data
            - 34.92% dma_unmap_sg_attrs
               - 34.75% iommu_dma_unmap_sg
                  - 34.58% __iommu_dma_unmap
                     - 33.04% arm_smmu_iotlb_sync
                        - 33.00% arm_smmu_tlb_inv_range_domain
                           - 31.86% __arm_smmu_tlb_inv_range
                                31.71% arm_smmu_cmdq_issue_cmdlist
                     + 0.90% iommu_unmap_fast
            - 5.17% dma_unmap_page_attrs
               - 5.15% iommu_dma_unmap_page
                  - 5.12% __iommu_dma_unmap_swiotlb
                     - 5.05% __iommu_dma_unmap
                        - 4.86% arm_smmu_iotlb_sync
                           - 4.85% arm_smmu_tlb_inv_range_domain
                              - 4.70% __arm_smmu_tlb_inv_range
                                   4.67% arm_smmu_cmdq_issue_cmdlist
            + 0.74% mempool_free
         + 4.83% nvme_complete_rq
   + 3.03% _start
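Summing the arm_smmu_cmdq_issue_cmdlist samples across both unmap paths in the two profiles above gives a quick picture of where the time moved (a derived summary, not from the original mail):

```python
# Share of CPU cycles in arm_smmu_cmdq_issue_cmdlist, summed over the
# dma_unmap_sg_attrs and dma_unmap_page_attrs paths in the profiles above.
local_pct = 14.44 + 2.16   # fio pinned to the NVMe's node
remote_pct = 31.71 + 4.67  # fio pinned to the other node

print(f"local:  {local_pct:.2f}% of cycles")
print(f"remote: {remote_pct:.2f}% of cycles")
# The remote run also completes ~9.4x fewer IOs in the same wall time,
# so the absolute time per IO spent issuing SMMU commands grows far more
# than the relative increase below suggests.
print(f"relative increase: {remote_pct / local_pct:.1f}x")
```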


Thanks, 
Ming



* Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote NUMA node
  2021-07-09 14:24   ` Ming Lei
@ 2021-07-19 16:14     ` John Garry
  2021-07-21  1:40       ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: John Garry @ 2021-07-19 16:14 UTC (permalink / raw)
  To: Ming Lei, Robin Murphy
  Cc: iommu, Will Deacon, linux-arm-kernel, linux-nvme, linux-kernel

On 09/07/2021 15:24, Ming Lei wrote:
>> associated compromises.
> Follows the log of 'perf report'
> 
> 1) good(run fio from cpus in the nvme's numa node)

Hi Ming,

If you're still interested in this issue, you can try my rebased patches 
here (as an experiment only):

https://github.com/hisilicon/kernel-dev/commits/private-topic-smmu-5.14-cmdq-4

I think that you should see a significant performance boost.

Thanks
John

> 
> -   34.86%     1.73%  fio       [nvme]              [k] nvme_process_cq
>     - 33.13% nvme_process_cq
>        - 32.93% nvme_pci_complete_rq
>           - 24.92% nvme_unmap_data
>              - 20.08% dma_unmap_sg_attrs
>                 - 19.79% iommu_dma_unmap_sg
>                    - 19.55% __iommu_dma_unmap
>                       - 16.86% arm_smmu_iotlb_sync
>                          - 16.81% arm_smmu_tlb_inv_range_domain
>                             - 14.73% __arm_smmu_tlb_inv_range
>                                  14.44% arm_smmu_cmdq_issue_cmdlist
>                               0.89% __pi_memset
>                               0.75% arm_smmu_atc_inv_domain
>                       + 1.58% iommu_unmap_fast
>                       + 0.71% iommu_dma_free_iova
>              - 3.25% dma_unmap_page_attrs
>                 - 3.21% iommu_dma_unmap_page
>                    - 3.14% __iommu_dma_unmap_swiotlb
>                       - 2.86% __iommu_dma_unmap
>                          - 2.48% arm_smmu_iotlb_sync
>                             - 2.47% arm_smmu_tlb_inv_range_domain
>                                - 2.19% __arm_smmu_tlb_inv_range
>                                     2.16% arm_smmu_cmdq_issue_cmdlist
>              + 1.34% mempool_free
>           + 7.68% nvme_complete_rq
>     + 1.73% _start
> 
> 
> 2) bad (run fio on CPUs not in the NVMe's NUMA node)
> -   49.25%     3.03%  fio       [nvme]              [k] nvme_process_cq
>     - 46.22% nvme_process_cq
>        - 46.07% nvme_pci_complete_rq
>           - 41.02% nvme_unmap_data
>              - 34.92% dma_unmap_sg_attrs
>                 - 34.75% iommu_dma_unmap_sg
>                    - 34.58% __iommu_dma_unmap
>                       - 33.04% arm_smmu_iotlb_sync
>                          - 33.00% arm_smmu_tlb_inv_range_domain
>                             - 31.86% __arm_smmu_tlb_inv_range
>                                  31.71% arm_smmu_cmdq_issue_cmdlist
>                       + 0.90% iommu_unmap_fast
>              - 5.17% dma_unmap_page_attrs
>                 - 5.15% iommu_dma_unmap_page
>                    - 5.12% __iommu_dma_unmap_swiotlb
>                       - 5.05% __iommu_dma_unmap
>                          - 4.86% arm_smmu_iotlb_sync
>                             - 4.85% arm_smmu_tlb_inv_range_domain
>                                - 4.70% __arm_smmu_tlb_inv_range
>                                     4.67% arm_smmu_cmdq_issue_cmdlist
>              + 0.74% mempool_free
>           + 4.83% nvme_complete_rq
>     + 3.03% _start



* Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote NUMA node
  2021-07-19 16:14     ` John Garry
@ 2021-07-21  1:40       ` Ming Lei
  2021-07-21  9:23         ` John Garry
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2021-07-21  1:40 UTC (permalink / raw)
  To: John Garry
  Cc: Robin Murphy, iommu, Will Deacon, linux-arm-kernel, linux-nvme,
	linux-kernel

On Mon, Jul 19, 2021 at 05:14:28PM +0100, John Garry wrote:
> On 09/07/2021 15:24, Ming Lei wrote:
> > > associated compromises.
> > The 'perf report' log follows:
> > 
> > 1) good (run fio on CPUs in the NVMe's NUMA node)
> 
> Hi Ming,
> 
> If you're still interested in this issue, as an experiment only you can try
> my rebased patches here:
> 
> https://github.com/hisilicon/kernel-dev/commits/private-topic-smmu-5.14-cmdq-4
> 
> I think that you should see a significant performance boost.

There is a build issue, please check your tree:

  MODPOST vmlinux.symvers
  MODINFO modules.builtin.modinfo
  GEN     modules.builtin
  LD      .tmp_vmlinux.btf
ld: Unexpected GOT/PLT entries detected!
ld: Unexpected run-time procedure linkages detected!
ld: drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.o: in function `smmu_test_store':
/root/git/linux/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:3892: undefined reference to `smmu_test_core'
  BTF     .btf.vmlinux.bin.o
pahole: .tmp_vmlinux.btf: No such file or directory
  LD      .tmp_vmlinux.kallsyms1
.btf.vmlinux.bin.o: file not recognized: file format not recognized
make: *** [Makefile:1177: vmlinux] Error 1


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote numa node
  2021-07-21  1:40       ` Ming Lei
@ 2021-07-21  9:23         ` John Garry
  2021-07-21  9:59           ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: John Garry @ 2021-07-21  9:23 UTC (permalink / raw)
  To: Ming Lei
  Cc: Robin Murphy, iommu, Will Deacon, linux-arm-kernel, linux-nvme,
	linux-kernel

On 21/07/2021 02:40, Ming Lei wrote:
>> I think that you should see a significant performance boost.
> There is build issue, please check your tree:
> 
>    MODPOST vmlinux.symvers
>    MODINFO modules.builtin.modinfo
>    GEN     modules.builtin
>    LD      .tmp_vmlinux.btf
> ld: Unexpected GOT/PLT entries detected!
> ld: Unexpected run-time procedure linkages detected!
> ld: drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.o: in function `smmu_test_store':
> /root/git/linux/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:3892: undefined reference to `smmu_test_core'
>    BTF     .btf.vmlinux.bin.o
> pahole: .tmp_vmlinux.btf: No such file or directory
>    LD      .tmp_vmlinux.kallsyms1
> .btf.vmlinux.bin.o: file not recognized: file format not recognized
> make: *** [Makefile:1177: vmlinux] Error 1

Ah, sorry. I had some test code which was not properly guarded with the
necessary build switches.

I have now removed that from the tree, so please re-pull.

Thanks,
John

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote numa node
  2021-07-21  9:23         ` John Garry
@ 2021-07-21  9:59           ` Ming Lei
  2021-07-21 11:07             ` John Garry
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2021-07-21  9:59 UTC (permalink / raw)
  To: John Garry
  Cc: Robin Murphy, iommu, Will Deacon, linux-arm-kernel, linux-nvme,
	linux-kernel

On Wed, Jul 21, 2021 at 10:23:38AM +0100, John Garry wrote:
> On 21/07/2021 02:40, Ming Lei wrote:
> > > I think that you should see a significant performance boost.
> > There is build issue, please check your tree:
> > 
> >    MODPOST vmlinux.symvers
> >    MODINFO modules.builtin.modinfo
> >    GEN     modules.builtin
> >    LD      .tmp_vmlinux.btf
> > ld: Unexpected GOT/PLT entries detected!
> > ld: Unexpected run-time procedure linkages detected!
> > ld: drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.o: in function `smmu_test_store':
> > /root/git/linux/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:3892: undefined reference to `smmu_test_core'
> >    BTF     .btf.vmlinux.bin.o
> > pahole: .tmp_vmlinux.btf: No such file or directory
> >    LD      .tmp_vmlinux.kallsyms1
> > .btf.vmlinux.bin.o: file not recognized: file format not recognized
> > make: *** [Makefile:1177: vmlinux] Error 1
> 
> Ah, sorry. I had some test code which was not properly guarded with
> necessary build switches.
> 
> I have now removed that from the tree, so please re-pull.

Now the kernel builds successfully, but I don't see an obvious improvement
on the reported issue:

[root@ampere-mtjade-04 ~]# uname -a
Linux ampere-mtjade-04.khw4.lab.eng.bos.redhat.com 5.14.0-rc2_smmu_fix+ #2 SMP Wed Jul 21 05:49:03 EDT 2021 aarch64 aarch64 aarch64 GNU/Linux

[root@ampere-mtjade-04 ~]# taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=1503MiB/s][r=385k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3143: Wed Jul 21 05:58:14 2021
  read: IOPS=384k, BW=1501MiB/s (1573MB/s)(14.7GiB/10001msec)

[root@ampere-mtjade-04 ~]# taskset -c 80 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=138MiB/s][r=35.4k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3063: Wed Jul 21 05:55:31 2021
  read: IOPS=35.4k, BW=138MiB/s (145MB/s)(1383MiB/10001msec)
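
As an aside, the IOPS figures in summaries like the two runs above can be
pulled out programmatically when comparing local- vs remote-node runs. A
minimal sketch (the "IOPS=384k" token format and the "k" suffix handling
are assumptions about fio 3.x's summary line):

```shell
#!/bin/sh
# Extract the IOPS value from a fio summary line and print it as a
# plain integer, e.g. "read: IOPS=384k, ..." -> 384000.
parse_iops() {
    echo "$1" | awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^IOPS=/) {
                v = $i
                sub(/^IOPS=/, "", v)   # drop the key
                sub(/,$/, "", v)       # drop the trailing comma
                mult = (v ~ /k$/) ? 1000 : 1
                sub(/k$/, "", v)
                printf "%d\n", v * mult
            }
    }'
}

parse_iops "  read: IOPS=384k, BW=1501MiB/s (1573MB/s)(14.7GiB/10001msec)"   # -> 384000
parse_iops "  read: IOPS=35.4k, BW=138MiB/s (145MB/s)(1383MiB/10001msec)"    # -> 35400
```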


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote numa node
  2021-07-21  9:59           ` Ming Lei
@ 2021-07-21 11:07             ` John Garry
  2021-07-21 11:58               ` Ming Lei
  2021-07-22  7:58               ` Ming Lei
  0 siblings, 2 replies; 30+ messages in thread
From: John Garry @ 2021-07-21 11:07 UTC (permalink / raw)
  To: Ming Lei
  Cc: Robin Murphy, iommu, Will Deacon, linux-arm-kernel, linux-nvme,
	linux-kernel

On 21/07/2021 10:59, Ming Lei wrote:
>> I have now removed that from the tree, so please re-pull.
> Now the kernel can be built successfully, but not see obvious improvement
> on the reported issue:
> 
> [root@ampere-mtjade-04 ~]# uname -a
> Linux ampere-mtjade-04.khw4.lab.eng.bos.redhat.com 5.14.0-rc2_smmu_fix+ #2 SMP Wed Jul 21 05:49:03 EDT 2021 aarch64 aarch64 aarch64 GNU/Linux
> 
> [root@ampere-mtjade-04 ~]# taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
> + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
> test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
> fio-3.27
> Starting 1 process
> Jobs: 1 (f=1): [r(1)][100.0%][r=1503MiB/s][r=385k IOPS][eta 00m:00s]
> test: (groupid=0, jobs=1): err= 0: pid=3143: Wed Jul 21 05:58:14 2021
>    read: IOPS=384k, BW=1501MiB/s (1573MB/s)(14.7GiB/10001msec)

I am not sure what baseline you used previously, but you were getting 
327K then, so at least this would be an improvement.

> 
> [root@ampere-mtjade-04 ~]# taskset -c 80 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
> + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
> test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
> fio-3.27
> Starting 1 process
> Jobs: 1 (f=1): [r(1)][100.0%][r=138MiB/s][r=35.4k IOPS][eta 00m:00s]
> test: (groupid=0, jobs=1): err= 0: pid=3063: Wed Jul 21 05:55:31 2021
>    read: IOPS=35.4k, BW=138MiB/s (145MB/s)(1383MiB/10001msec)

I can try similar on our arm64 board when I get a chance.

Thanks,
John

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote numa node
  2021-07-21 11:07             ` John Garry
@ 2021-07-21 11:58               ` Ming Lei
  2021-07-22  7:58               ` Ming Lei
  1 sibling, 0 replies; 30+ messages in thread
From: Ming Lei @ 2021-07-21 11:58 UTC (permalink / raw)
  To: John Garry
  Cc: Robin Murphy, iommu, Will Deacon, linux-arm-kernel, linux-nvme,
	linux-kernel

On Wed, Jul 21, 2021 at 12:07:22PM +0100, John Garry wrote:
> On 21/07/2021 10:59, Ming Lei wrote:
> > > I have now removed that from the tree, so please re-pull.
> > Now the kernel can be built successfully, but not see obvious improvement
> > on the reported issue:
> > 
> > [root@ampere-mtjade-04 ~]# uname -a
> > Linux ampere-mtjade-04.khw4.lab.eng.bos.redhat.com 5.14.0-rc2_smmu_fix+ #2 SMP Wed Jul 21 05:49:03 EDT 2021 aarch64 aarch64 aarch64 GNU/Linux
> > 
> > [root@ampere-mtjade-04 ~]# taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
> > + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
> > test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
> > fio-3.27
> > Starting 1 process
> > Jobs: 1 (f=1): [r(1)][100.0%][r=1503MiB/s][r=385k IOPS][eta 00m:00s]
> > test: (groupid=0, jobs=1): err= 0: pid=3143: Wed Jul 21 05:58:14 2021
> >    read: IOPS=384k, BW=1501MiB/s (1573MB/s)(14.7GiB/10001msec)
> 
> I am not sure what baseline you used previously, but you were getting 327K
> then, so at least this would be an improvement.

Yeah, that might be an improvement, but I haven't checked it since the
code base changed.

> 
> > 
> > [root@ampere-mtjade-04 ~]# taskset -c 80 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
> > + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
> > test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
> > fio-3.27
> > Starting 1 process
> > Jobs: 1 (f=1): [r(1)][100.0%][r=138MiB/s][r=35.4k IOPS][eta 00m:00s]
> > test: (groupid=0, jobs=1): err= 0: pid=3063: Wed Jul 21 05:55:31 2021
> >    read: IOPS=35.4k, BW=138MiB/s (145MB/s)(1383MiB/10001msec)
> 
> I can try similar on our arm64 board when I get a chance.

The issue I reported is this one.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote numa node
  2021-07-21 11:07             ` John Garry
  2021-07-21 11:58               ` Ming Lei
@ 2021-07-22  7:58               ` Ming Lei
  2021-07-22 10:05                 ` John Garry
  1 sibling, 1 reply; 30+ messages in thread
From: Ming Lei @ 2021-07-22  7:58 UTC (permalink / raw)
  To: John Garry
  Cc: Robin Murphy, iommu, Will Deacon, linux-arm-kernel, linux-nvme,
	linux-kernel

On Wed, Jul 21, 2021 at 12:07:22PM +0100, John Garry wrote:
> On 21/07/2021 10:59, Ming Lei wrote:
> > > I have now removed that from the tree, so please re-pull.
> > Now the kernel can be built successfully, but not see obvious improvement
> > on the reported issue:
> > 
> > [root@ampere-mtjade-04 ~]# uname -a
> > Linux ampere-mtjade-04.khw4.lab.eng.bos.redhat.com 5.14.0-rc2_smmu_fix+ #2 SMP Wed Jul 21 05:49:03 EDT 2021 aarch64 aarch64 aarch64 GNU/Linux
> > 
> > [root@ampere-mtjade-04 ~]# taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
> > + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
> > test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
> > fio-3.27
> > Starting 1 process
> > Jobs: 1 (f=1): [r(1)][100.0%][r=1503MiB/s][r=385k IOPS][eta 00m:00s]
> > test: (groupid=0, jobs=1): err= 0: pid=3143: Wed Jul 21 05:58:14 2021
> >    read: IOPS=384k, BW=1501MiB/s (1573MB/s)(14.7GiB/10001msec)
> 
> I am not sure what baseline you used previously, but you were getting 327K
> then, so at least this would be an improvement.

It looks like the improvement isn't from your patches; please see the test
result on v5.14-rc2:

[root@ampere-mtjade-04 ~]# uname -a
Linux ampere-mtjade-04.khw4.lab.eng.bos.redhat.com 5.14.0-rc2_linus #3 SMP Thu Jul 22 03:41:24 EDT 2021 aarch64 aarch64 aarch64 GNU/Linux
[root@ampere-mtjade-04 ~]# taskset -c 0 ~/git/tools/test/nvme/io_uring 20 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=1489MiB/s][r=381k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3099: Thu Jul 22 03:53:04 2021
  read: IOPS=381k, BW=1487MiB/s (1559MB/s)(29.0GiB/20001msec)


thanks, 
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote numa node
  2021-07-22  7:58               ` Ming Lei
@ 2021-07-22 10:05                 ` John Garry
  2021-07-22 10:19                   ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: John Garry @ 2021-07-22 10:05 UTC (permalink / raw)
  To: Ming Lei
  Cc: Robin Murphy, iommu, Will Deacon, linux-arm-kernel, linux-nvme,
	linux-kernel

On 22/07/2021 08:58, Ming Lei wrote:
> On Wed, Jul 21, 2021 at 12:07:22PM +0100, John Garry wrote:
>> On 21/07/2021 10:59, Ming Lei wrote:
>>>> I have now removed that from the tree, so please re-pull.
>>> Now the kernel can be built successfully, but not see obvious improvement
>>> on the reported issue:
>>>
>>> [root@ampere-mtjade-04 ~]# uname -a
>>> Linux ampere-mtjade-04.khw4.lab.eng.bos.redhat.com 5.14.0-rc2_smmu_fix+ #2 SMP Wed Jul 21 05:49:03 EDT 2021 aarch64 aarch64 aarch64 GNU/Linux
>>>
>>> [root@ampere-mtjade-04 ~]# taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
>>> + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
>>> test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
>>> fio-3.27
>>> Starting 1 process
>>> Jobs: 1 (f=1): [r(1)][100.0%][r=1503MiB/s][r=385k IOPS][eta 00m:00s]
>>> test: (groupid=0, jobs=1): err= 0: pid=3143: Wed Jul 21 05:58:14 2021
>>>     read: IOPS=384k, BW=1501MiB/s (1573MB/s)(14.7GiB/10001msec)
>> I am not sure what baseline you used previously, but you were getting 327K
>> then, so at least this would be an improvement.
> Looks the improvement isn't from your patches, please see the test result on
> v5.14-rc2:
> 
> [root@ampere-mtjade-04 ~]# uname -a
> Linux ampere-mtjade-04.khw4.lab.eng.bos.redhat.com 5.14.0-rc2_linus #3 SMP Thu Jul 22 03:41:24 EDT 2021 aarch64 aarch64 aarch64 GNU/Linux
> [root@ampere-mtjade-04 ~]# taskset -c 0 ~/git/tools/test/nvme/io_uring 20 1 /dev/nvme1n1 4k
> + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting
> test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
> fio-3.27
> Starting 1 process
> Jobs: 1 (f=1): [r(1)][100.0%][r=1489MiB/s][r=381k IOPS][eta 00m:00s]
> test: (groupid=0, jobs=1): err= 0: pid=3099: Thu Jul 22 03:53:04 2021
>    read: IOPS=381k, BW=1487MiB/s (1559MB/s)(29.0GiB/20001msec)

I'm a bit surprised at that.

Anyway, I don't see the issue you are seeing on my system. In 
general, running from different nodes doesn't make a huge difference. 
But note that the NVMe device is on NUMA node #2 on my 4-node system. I 
assume that the IOMMU is also located in that node.

sudo taskset -c 0 fio/fio --bs=4k --ioengine=io_uring --fixedbufs 
--registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 
--iodepth_batch_complete_min=16 --filename=/dev/nvme0n1 --direct=1 
--runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting

  read: IOPS=479k

---
sudo taskset -c 4 fio/fio --bs=4k --ioengine=io_uring --fixedbufs 
--registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 
--iodepth_batch_complete_min=16 --filename=/dev/nvme0n1 --direct=1 
--runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting

  read: IOPS=307k

---
sudo taskset -c 32 fio/fio --bs=4k --ioengine=io_uring --fixedbufs 
--registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 
--iodepth_batch_complete_min=16 --filename=/dev/nvme0n1 --direct=1 
--runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting

read: IOPS=566k

--
sudo taskset -c 64 fio/fio --bs=4k --ioengine=io_uring --fixedbufs 
--registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 
--iodepth_batch_complete_min=16 --filename=/dev/nvme0n1 --direct=1 
--runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting

read: IOPS=488k

---
sudo taskset -c 96 fio/fio --bs=4k --ioengine=io_uring --fixedbufs 
--registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 
--iodepth_batch_complete_min=16 --filename=/dev/nvme0n1 --direct=1 
--runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting

  read: IOPS=508k


If you check below, you can see that cpu4 services an NVMe irq. From 
checking htop, that cpu is at 100% load during the test, which is what 
I put the performance drop (vs cpu0) down to.

Here's some system info:

HW queue irq affinities:
PCI name is 81:00.0: nvme0n1
-eirq 298, cpu list 67, effective list 67
-eirq 299, cpu list 32-38, effective list 35
-eirq 300, cpu list 39-45, effective list 39
-eirq 301, cpu list 46-51, effective list 46
-eirq 302, cpu list 52-57, effective list 52
-eirq 303, cpu list 58-63, effective list 60
-eirq 304, cpu list 64-69, effective list 68
-eirq 305, cpu list 70-75, effective list 70
-eirq 306, cpu list 76-80, effective list 76
-eirq 307, cpu list 81-85, effective list 84
-eirq 308, cpu list 86-90, effective list 86
-eirq 309, cpu list 91-95, effective list 92
-eirq 310, cpu list 96-101, effective list 100
-eirq 311, cpu list 102-107, effective list 102
-eirq 312, cpu list 108-112, effective list 108
-eirq 313, cpu list 113-117, effective list 116
-eirq 314, cpu list 118-122, effective list 118
-eirq 315, cpu list 123-127, effective list 124
-eirq 316, cpu list 0-5, effective list 4
-eirq 317, cpu list 6-11, effective list 6
-eirq 318, cpu list 12-16, effective list 12
-eirq 319, cpu list 17-21, effective list 20
-eirq 320, cpu list 22-26, effective list 22
-eirq 321, cpu list 27-31, effective list 28
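
For cross-checking the affinities above, the "cpu list" strings can be
expanded into individual CPU numbers with a small helper. A sketch; the
handling of comma-separated entries assumes the kernel's usual cpulist
format, though the lists above happen to be single ranges:

```shell
#!/bin/sh
# Expand a kernel cpulist string (e.g. "32-38", "67", "0-2,8") into
# space-separated CPU numbers.
expand_cpu_list() {
    echo "$1" | tr ',' '\n' | while read -r range; do
        case "$range" in
            *-*) seq "${range%-*}" "${range#*-}" ;;   # a-b range
            *)   echo "$range" ;;                     # single CPU
        esac
    done | xargs
}

expand_cpu_list "0-5"    # -> 0 1 2 3 4 5
expand_cpu_list "67"     # -> 67
```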


john@ubuntu:~$ lscpu | grep NUMA
NUMA node(s):  4
NUMA node0 CPU(s):   0-31
NUMA node1 CPU(s):   32-63
NUMA node2 CPU(s):   64-95
NUMA node3 CPU(s):   96-127
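
Given that layout, mapping a CPU number back to its node is handy when
reading the per-CPU fio results. A sketch with the node ranges hard-coded
from the lscpu output above (this 4-node, 128-CPU box only):

```shell
#!/bin/sh
# Map a CPU number to its NUMA node, per the lscpu ranges above:
# node0: 0-31, node1: 32-63, node2: 64-95, node3: 96-127.
cpu_to_node() {
    cpu=$1
    if   [ "$cpu" -le 31 ]; then echo 0
    elif [ "$cpu" -le 63 ]; then echo 1
    elif [ "$cpu" -le 95 ]; then echo 2
    else                         echo 3
    fi
}

cpu_to_node 4    # -> 0
cpu_to_node 67   # -> 2
```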

john@ubuntu:~$ lspci | grep -i non
81:00.0 Non-Volatile memory controller: Huawei Technologies Co., Ltd. 
Device 0123 (rev 45)

cat /sys/block/nvme0n1/device/device/numa_node
2

[   52.968495] nvme 0000:81:00.0: Adding to iommu group 5
[   52.980484] nvme nvme0: pci function 0000:81:00.0
[   52.999881] nvme nvme0: 23/0/0 default/read/poll queues
[   53.019821]  nvme0n1: p1

john@ubuntu:~$ uname -a
Linux ubuntu 5.14.0-rc2-dirty #297 SMP PREEMPT Thu Jul 22 09:23:33 BST 
2021 aarch64 aarch64 aarch64 GNU/Linux

Thanks,
John

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote numa node
  2021-07-22 10:05                 ` John Garry
@ 2021-07-22 10:19                   ` Ming Lei
  2021-07-22 11:12                     ` John Garry
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2021-07-22 10:19 UTC (permalink / raw)
  To: John Garry
  Cc: Robin Murphy, iommu, Will Deacon, linux-arm-kernel, linux-nvme,
	linux-kernel

On Thu, Jul 22, 2021 at 11:05:00AM +0100, John Garry wrote:
> On 22/07/2021 08:58, Ming Lei wrote:
> > On Wed, Jul 21, 2021 at 12:07:22PM +0100, John Garry wrote:
> > > On 21/07/2021 10:59, Ming Lei wrote:
> > > > > I have now removed that from the tree, so please re-pull.
> > > > Now the kernel can be built successfully, but not see obvious improvement
> > > > on the reported issue:
> > > > 
> > > > [root@ampere-mtjade-04 ~]# uname -a
> > > > Linux ampere-mtjade-04.khw4.lab.eng.bos.redhat.com 5.14.0-rc2_smmu_fix+ #2 SMP Wed Jul 21 05:49:03 EDT 2021 aarch64 aarch64 aarch64 GNU/Linux
> > > > 
> > > > [root@ampere-mtjade-04 ~]# taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
> > > > + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
> > > > test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
> > > > fio-3.27
> > > > Starting 1 process
> > > > Jobs: 1 (f=1): [r(1)][100.0%][r=1503MiB/s][r=385k IOPS][eta 00m:00s]
> > > > test: (groupid=0, jobs=1): err= 0: pid=3143: Wed Jul 21 05:58:14 2021
> > > >     read: IOPS=384k, BW=1501MiB/s (1573MB/s)(14.7GiB/10001msec)
> > > I am not sure what baseline you used previously, but you were getting 327K
> > > then, so at least this would be an improvement.
> > Looks the improvement isn't from your patches, please see the test result on
> > v5.14-rc2:
> > 
> > [root@ampere-mtjade-04 ~]# uname -a
> > Linux ampere-mtjade-04.khw4.lab.eng.bos.redhat.com 5.14.0-rc2_linus #3 SMP Thu Jul 22 03:41:24 EDT 2021 aarch64 aarch64 aarch64 GNU/Linux
> > [root@ampere-mtjade-04 ~]# taskset -c 0 ~/git/tools/test/nvme/io_uring 20 1 /dev/nvme1n1 4k
> > + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting
> > test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
> > fio-3.27
> > Starting 1 process
> > Jobs: 1 (f=1): [r(1)][100.0%][r=1489MiB/s][r=381k IOPS][eta 00m:00s]
> > test: (groupid=0, jobs=1): err= 0: pid=3099: Thu Jul 22 03:53:04 2021
> >    read: IOPS=381k, BW=1487MiB/s (1559MB/s)(29.0GiB/20001msec)
> 
> I'm a bit surprised at that.
> 
> Anyway, I don't see such an issue as you are seeing on my system. In
> general, running from different nodes doesn't make a huge difference. But
> note that the NVMe device is on NUMA node #2 on my 4-node system. I assume
> that the IOMMU is also located in that node.
> 
> sudo taskset -c 0 fio/fio --bs=4k --ioengine=io_uring --fixedbufs
> --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16
> --iodepth_batch_complete_min=16 --filename=/dev/nvme0n1 --direct=1
> --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting
> 
>  read: IOPS=479k
> 
> ---
> sudo taskset -c 4 fio/fio --bs=4k --ioengine=io_uring --fixedbufs
> --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16
> --iodepth_batch_complete_min=16 --filename=/dev/nvme0n1 --direct=1
> --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting
> 
>  read: IOPS=307k
> 
> ---
> sudo taskset -c 32 fio/fio --bs=4k --ioengine=io_uring --fixedbufs
> --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16
> --iodepth_batch_complete_min=16 --filename=/dev/nvme0n1 --direct=1
> --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting
> 
> read: IOPS=566k
> 
> --
> sudo taskset -c 64 fio/fio --bs=4k --ioengine=io_uring --fixedbufs
> --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16
> --iodepth_batch_complete_min=16 --filename=/dev/nvme0n1 --direct=1
> --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting
> 
> read: IOPS=488k
> 
> ---
> sudo taskset -c 96 fio/fio --bs=4k --ioengine=io_uring --fixedbufs
> --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16
> --iodepth_batch_complete_min=16 --filename=/dev/nvme0n1 --direct=1
> --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting
> 
>  read: IOPS=508k
> 
> 
> If you check below, you can see that cpu4 services an NVMe irq. From
> checking htop, during the test that cpu is at 100% load, which I put the
> performance drop (vs cpu0) down to.

nvme.poll_queues is 2 in my test, and no irq is involved. But the irq mode
fio test is still as bad as io_uring.

> 
> Here's some system info:
> 
> HW queue irq affinities:
> PCI name is 81:00.0: nvme0n1
> -eirq 298, cpu list 67, effective list 67
> -eirq 299, cpu list 32-38, effective list 35
> -eirq 300, cpu list 39-45, effective list 39
> -eirq 301, cpu list 46-51, effective list 46
> -eirq 302, cpu list 52-57, effective list 52
> -eirq 303, cpu list 58-63, effective list 60
> -eirq 304, cpu list 64-69, effective list 68
> -eirq 305, cpu list 70-75, effective list 70
> -eirq 306, cpu list 76-80, effective list 76
> -eirq 307, cpu list 81-85, effective list 84
> -eirq 308, cpu list 86-90, effective list 86
> -eirq 309, cpu list 91-95, effective list 92
> -eirq 310, cpu list 96-101, effective list 100
> -eirq 311, cpu list 102-107, effective list 102
> -eirq 312, cpu list 108-112, effective list 108
> -eirq 313, cpu list 113-117, effective list 116
> -eirq 314, cpu list 118-122, effective list 118
> -eirq 315, cpu list 123-127, effective list 124
> -eirq 316, cpu list 0-5, effective list 4
> -eirq 317, cpu list 6-11, effective list 6
> -eirq 318, cpu list 12-16, effective list 12
> -eirq 319, cpu list 17-21, effective list 20
> -eirq 320, cpu list 22-26, effective list 22
> -eirq 321, cpu list 27-31, effective list 28
> 
> 
> john@ubuntu:~$ lscpu | grep NUMA
> NUMA node(s):  4
> NUMA node0 CPU(s):   0-31
> NUMA node1 CPU(s):   32-63
> NUMA node2 CPU(s):   64-95
> NUMA node3 CPU(s):   96-127
> 
> john@ubuntu:~$ lspci | grep -i non
> 81:00.0 Non-Volatile memory controller: Huawei Technologies Co., Ltd. Device
> 0123 (rev 45)
> 
> cat /sys/block/nvme0n1/device/device/numa_node
> 2

BTW, the nvme driver doesn't apply the PCI device's NUMA node, and I guess
the following patch is needed:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 11779be42186..3c5e10e8b0c2 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4366,7 +4366,11 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	ctrl->dev = dev;
 	ctrl->ops = ops;
 	ctrl->quirks = quirks;
+#ifdef CONFIG_NUMA
+	ctrl->numa_node = dev->numa_node;
+#else
 	ctrl->numa_node = NUMA_NO_NODE;
+#endif
 	INIT_WORK(&ctrl->scan_work, nvme_scan_work);
 	INIT_WORK(&ctrl->async_event_work, nvme_async_event_work);
 	INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work);

> 
> [   52.968495] nvme 0000:81:00.0: Adding to iommu group 5
> [   52.980484] nvme nvme0: pci function 0000:81:00.0
> [   52.999881] nvme nvme0: 23/0/0 default/read/poll queues

Looks like you didn't enable polling. In irq mode, it isn't strange
to observe IOPS differences when running fio on different CPUs.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote numa node
  2021-07-22 10:19                   ` Ming Lei
@ 2021-07-22 11:12                     ` John Garry
  2021-07-22 12:53                       ` Marc Zyngier
  2021-07-22 15:54                       ` Ming Lei
  0 siblings, 2 replies; 30+ messages in thread
From: John Garry @ 2021-07-22 11:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: Robin Murphy, iommu, Will Deacon, linux-arm-kernel, linux-nvme,
	linux-kernel

On 22/07/2021 11:19, Ming Lei wrote:
>> If you check below, you can see that cpu4 services an NVMe irq. From
>> checking htop, during the test that cpu is at 100% load, which I put the
>> performance drop (vs cpu0) down to.
> nvme.poll_queues is 2 in my test, and no irq is involved. But the irq mode
> fio test is still as bad as io_uring.
> 

I tried that:

dmesg | grep -i nvme
[    0.000000] Kernel command line: BOOT_IMAGE=/john/Image rdinit=/init 
crashkernel=256M@32M console=ttyAMA0,115200 earlycon acpi=force 
pcie_aspm=off noinitrd root=/dev/sda1 rw log_buf_len=16M user_debug=1 
iommu.strict=1 nvme.use_threaded_interrupts=0 irqchip.gicv3_pseudo_nmi=1 
nvme.poll_queues=2

[   30.531989] megaraid_sas 0000:08:00.0: NVMe passthru support : Yes
[   30.615336] megaraid_sas 0000:08:00.0: NVME page size   : (4096)
[   52.035895] nvme 0000:81:00.0: Adding to iommu group 5
[   52.047732] nvme nvme0: pci function 0000:81:00.0
[   52.067216] nvme nvme0: 22/0/2 default/read/poll queues
[   52.087318]  nvme0n1: p1

So I get these results:
cpu0 335K
cpu32 346K
cpu64 300K
cpu96 300K

So still not massive changes.
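
The sweep behind those numbers can be sketched as a dry-run script (fio
options abbreviated from the command used earlier in the thread; set
RUN=1 to actually execute):

```shell
#!/bin/sh
# Print (or, with RUN=1, execute) one fio run pinned to a CPU in each
# of the four NUMA nodes, mirroring the cpu0/32/64/96 results above.
sweep() {
    for cpu in 0 32 64 96; do
        cmd="taskset -c $cpu fio --bs=4k --ioengine=io_uring --hipri --iodepth=64 --filename=/dev/nvme0n1 --direct=1 --runtime=20 --numjobs=1 --rw=randread --name=test"
        if [ "${RUN:-0}" = 1 ]; then
            $cmd
        else
            echo "$cmd"
        fi
    done
}
sweep
```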

>> Here's some system info:
>>
>> HW queue irq affinities:
>> PCI name is 81:00.0: nvme0n1
>> -eirq 298, cpu list 67, effective list 67
>> -eirq 299, cpu list 32-38, effective list 35
>> -eirq 300, cpu list 39-45, effective list 39
>> -eirq 301, cpu list 46-51, effective list 46
>> -eirq 302, cpu list 52-57, effective list 52
>> -eirq 303, cpu list 58-63, effective list 60
>> -eirq 304, cpu list 64-69, effective list 68
>> -eirq 305, cpu list 70-75, effective list 70
>> -eirq 306, cpu list 76-80, effective list 76
>> -eirq 307, cpu list 81-85, effective list 84
>> -eirq 308, cpu list 86-90, effective list 86
>> -eirq 309, cpu list 91-95, effective list 92
>> -eirq 310, cpu list 96-101, effective list 100
>> -eirq 311, cpu list 102-107, effective list 102
>> -eirq 312, cpu list 108-112, effective list 108
>> -eirq 313, cpu list 113-117, effective list 116
>> -eirq 314, cpu list 118-122, effective list 118
>> -eirq 315, cpu list 123-127, effective list 124
>> -eirq 316, cpu list 0-5, effective list 4
>> -eirq 317, cpu list 6-11, effective list 6
>> -eirq 318, cpu list 12-16, effective list 12
>> -eirq 319, cpu list 17-21, effective list 20
>> -eirq 320, cpu list 22-26, effective list 22
>> -eirq 321, cpu list 27-31, effective list 28
>>
>>
>> john@ubuntu:~$ lscpu | grep NUMA
>> NUMA node(s):  4
>> NUMA node0 CPU(s):   0-31
>> NUMA node1 CPU(s):   32-63
>> NUMA node2 CPU(s):   64-95
>> NUMA node3 CPU(s):   96-127
>>
>> john@ubuntu:~$ lspci | grep -i non
>> 81:00.0 Non-Volatile memory controller: Huawei Technologies Co., Ltd. Device
>> 0123 (rev 45)
>>
>> cat /sys/block/nvme0n1/device/device/numa_node
>> 2
> BTW, nvme driver doesn't apply the pci numa node, and I guess the
> following patch is needed:
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 11779be42186..3c5e10e8b0c2 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4366,7 +4366,11 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
>   	ctrl->dev = dev;
>   	ctrl->ops = ops;
>   	ctrl->quirks = quirks;
> +#ifdef CONFIG_NUMA
> +	ctrl->numa_node = dev->numa_node;
> +#else
>   	ctrl->numa_node = NUMA_NO_NODE;
> +#endif

 From a quick look at the code, is this then later set for the PCI 
device in nvme_pci_configure_admin_queue()?

>   	INIT_WORK(&ctrl->scan_work, nvme_scan_work);
>   	INIT_WORK(&ctrl->async_event_work, nvme_async_event_work);
>   	INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work);
> 
>> [   52.968495] nvme 0000:81:00.0: Adding to iommu group 5
>> [   52.980484] nvme nvme0: pci function 0000:81:00.0
>> [   52.999881] nvme nvme0: 23/0/0 default/read/poll queues
> Looks you didn't enabling polling. In irq mode, it isn't strange
> to observe IOPS difference when running fio on different CPUs.

If you are still keen to investigate further, then you can try either of these:

- add iommu.strict=0 to the cmdline

- use perf record+annotate to find the hotspot
  - For this you need to enable pseudo-NMI with two steps:
     CONFIG_ARM64_PSEUDO_NMI=y in defconfig
     Add irqchip.gicv3_pseudo_nmi=1

     See 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/Kconfig#n1745
     Your kernel log should show:
     [    0.000000] GICv3: Pseudo-NMIs enabled using forced ICC_PMR_EL1 
synchronisation
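A quick way to check all three prerequisites at once (rough, untested
sketch; the /boot/config path is distro-dependent):

```shell
#!/bin/sh
# Rough sketch of the pseudo-NMI checks above; the config file path
# (/boot/config-$(uname -r)) varies by distro.
cfg=/boot/config-$(uname -r)
nmi_cfg=no; nmi_cmdline=no; nmi_boot=no
grep -qs '^CONFIG_ARM64_PSEUDO_NMI=y' "$cfg" && nmi_cfg=yes || true
grep -qs 'irqchip\.gicv3_pseudo_nmi=1' /proc/cmdline && nmi_cmdline=yes || true
{ dmesg 2>/dev/null || true; } | grep -qi 'Pseudo-NMIs enabled' && nmi_boot=yes || true
echo "config=$nmi_cfg cmdline=$nmi_cmdline boot_log=$nmi_boot"
```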

But my impression is that this may be a HW implementation issue, 
considering we don't see such a huge drop off on our HW.

Thanks,
John

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [bug report] iommu_dma_unmap_sg() is very slow then running IO from remote numa node
  2021-07-22 11:12                     ` John Garry
@ 2021-07-22 12:53                       ` Marc Zyngier
  2021-07-22 13:54                         ` John Garry
  2021-07-22 15:54                       ` Ming Lei
  1 sibling, 1 reply; 30+ messages in thread
From: Marc Zyngier @ 2021-07-22 12:53 UTC (permalink / raw)
  To: John Garry
  Cc: Ming Lei, Robin Murphy, iommu, Will Deacon, linux-arm-kernel,
	linux-nvme, linux-kernel

On 2021-07-22 12:12, John Garry wrote:

Hi John,

[...]

>     Your kernel log should show:
>     [    0.000000] GICv3: Pseudo-NMIs enabled using forced ICC_PMR_EL1
> synchronisation

Unrelated, but you seem to be running with ICC_CTLR_EL3.PMHE set,
which makes the overhead of pseudo-NMIs much higher than it should be
(you take a DSB SY on each interrupt unmasking).

If you are not using 1:N distribution of SPIs on the secure side,
consider turning that off in your firmware. This should make NMIs
slightly more pleasant to use.

         M.
-- 
Jazz is not dead. It just smells funny...


* Re: [bug report] iommu_dma_unmap_sg() is very slow then running IO from remote numa node
  2021-07-22 12:53                       ` Marc Zyngier
@ 2021-07-22 13:54                         ` John Garry
  0 siblings, 0 replies; 30+ messages in thread
From: John Garry @ 2021-07-22 13:54 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Ming Lei, Robin Murphy, iommu, Will Deacon, linux-arm-kernel,
	linux-nvme, linux-kernel

On 22/07/2021 13:53, Marc Zyngier wrote:
> Hi John,
> 
> [...]
> 
>>     Your kernel log should show:
>>     [    0.000000] GICv3: Pseudo-NMIs enabled using forced ICC_PMR_EL1
>> synchronisation
> 
> Unrelated, but you seem to be running with ICC_CTLR_EL3.PMHE set,
> which makes the overhead of pseudo-NMIs much higher than it should be
> (you take a DSB SY on each interrupt unmasking).
> 
> If you are not using 1:N distribution of SPIs on the secure side,
> consider turning that off in your firmware. This should make NMIs
> slightly more pleasant to use.

Thanks for the hint. I'll speak to the BIOS guys.

John


* Re: [bug report] iommu_dma_unmap_sg() is very slow then running IO from remote numa node
  2021-07-22 11:12                     ` John Garry
  2021-07-22 12:53                       ` Marc Zyngier
@ 2021-07-22 15:54                       ` Ming Lei
  2021-07-22 17:40                         ` Robin Murphy
  1 sibling, 1 reply; 30+ messages in thread
From: Ming Lei @ 2021-07-22 15:54 UTC (permalink / raw)
  To: John Garry
  Cc: Robin Murphy, iommu, Will Deacon, linux-arm-kernel, linux-nvme,
	linux-kernel

On Thu, Jul 22, 2021 at 12:12:05PM +0100, John Garry wrote:
> On 22/07/2021 11:19, Ming Lei wrote:
> > > If you check below, you can see that cpu4 services an NVMe irq. From
> > > checking htop, during the test that cpu is at 100% load, which I put the
> > > performance drop (vs cpu0) down to.
> > nvme.poll_queues is 2 in my test, and no irq is involved. But the irq mode
> > fio test is still as bad as io_uring.
> > 
> 
> I tried that:
> 
> dmesg | grep -i nvme
> [    0.000000] Kernel command line: BOOT_IMAGE=/john/Image rdinit=/init
> crashkernel=256M@32M console=ttyAMA0,115200 earlycon acpi=force
> pcie_aspm=off noinitrd root=/dev/sda1 rw log_buf_len=16M user_debug=1
> iommu.strict=1 nvme.use_threaded_interrupts=0 irqchip.gicv3_pseudo_nmi=1
> nvme.poll_queues=2
> 
> [   30.531989] megaraid_sas 0000:08:00.0: NVMe passthru support : Yes
> [   30.615336] megaraid_sas 0000:08:00.0: NVME page size   : (4096)
> [   52.035895] nvme 0000:81:00.0: Adding to iommu group 5
> [   52.047732] nvme nvme0: pci function 0000:81:00.0
> [   52.067216] nvme nvme0: 22/0/2 default/read/poll queues
> [   52.087318]  nvme0n1: p1
> 
> So I get these results:
> cpu0 335K
> cpu32 346K
> cpu64 300K
> cpu96 300K
> 
> So still not massive changes.

In your last email, the results are the following with irq mode io_uring:

 cpu0  497K
 cpu4  307K
 cpu32 566K
 cpu64 488K
 cpu96 508K

So it looks like you get a much worse result with real io_polling?

> 
> > > Here's some system info:
> > > 
> > > HW queue irq affinities:
> > > PCI name is 81:00.0: nvme0n1
> > > -eirq 298, cpu list 67, effective list 67
> > > -eirq 299, cpu list 32-38, effective list 35
> > > -eirq 300, cpu list 39-45, effective list 39
> > > -eirq 301, cpu list 46-51, effective list 46
> > > -eirq 302, cpu list 52-57, effective list 52
> > > -eirq 303, cpu list 58-63, effective list 60
> > > -eirq 304, cpu list 64-69, effective list 68
> > > -eirq 305, cpu list 70-75, effective list 70
> > > -eirq 306, cpu list 76-80, effective list 76
> > > -eirq 307, cpu list 81-85, effective list 84
> > > -eirq 308, cpu list 86-90, effective list 86
> > > -eirq 309, cpu list 91-95, effective list 92
> > > -eirq 310, cpu list 96-101, effective list 100
> > > -eirq 311, cpu list 102-107, effective list 102
> > > -eirq 312, cpu list 108-112, effective list 108
> > > -eirq 313, cpu list 113-117, effective list 116
> > > -eirq 314, cpu list 118-122, effective list 118
> > > -eirq 315, cpu list 123-127, effective list 124
> > > -eirq 316, cpu list 0-5, effective list 4
> > > -eirq 317, cpu list 6-11, effective list 6
> > > -eirq 318, cpu list 12-16, effective list 12
> > > -eirq 319, cpu list 17-21, effective list 20
> > > -eirq 320, cpu list 22-26, effective list 22
> > > -eirq 321, cpu list 27-31, effective list 28
> > > 
> > > 
> > > john@ubuntu:~$ lscpu | grep NUMA
> > > NUMA node(s):  4
> > > NUMA node0 CPU(s):   0-31
> > > NUMA node1 CPU(s):   32-63
> > > NUMA node2 CPU(s):   64-95
> > > NUMA node3 CPU(s):   96-127
> > > 
> > > john@ubuntu:~$ lspci | grep -i non
> > > 81:00.0 Non-Volatile memory controller: Huawei Technologies Co., Ltd. Device
> > > 0123 (rev 45)
> > > 
> > > cat /sys/block/nvme0n1/device/device/numa_node
> > > 2
> > BTW, the nvme driver doesn't apply the PCI NUMA node, and I guess the
> > following patch is needed:
> > 
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 11779be42186..3c5e10e8b0c2 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -4366,7 +4366,11 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
> >   	ctrl->dev = dev;
> >   	ctrl->ops = ops;
> >   	ctrl->quirks = quirks;
> > +#ifdef CONFIG_NUMA
> > +	ctrl->numa_node = dev->numa_node;
> > +#else
> >   	ctrl->numa_node = NUMA_NO_NODE;
> > +#endif
> 
> From a quick look at the code, is this then later set for the PCI device in
> nvme_pci_configure_admin_queue()?

Yeah, you are right, the PCI NUMA node is already used.

> 
> >   	INIT_WORK(&ctrl->scan_work, nvme_scan_work);
> >   	INIT_WORK(&ctrl->async_event_work, nvme_async_event_work);
> >   	INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work);
> > 
> > > [   52.968495] nvme 0000:81:00.0: Adding to iommu group 5
> > > [   52.980484] nvme nvme0: pci function 0000:81:00.0
> > > [   52.999881] nvme nvme0: 23/0/0 default/read/poll queues
> > Looks like you didn't enable polling. In irq mode, it isn't strange
> > to observe an IOPS difference when running fio on different CPUs.
> 
> If you are still keen to investigate more, then you can try either of these:
> 
> - add iommu.strict=0 to the cmdline
> 
> - use perf record+annotate to find the hotspot
>   - For this you need to enable pseudo-NMI with two steps:
>     CONFIG_ARM64_PSEUDO_NMI=y in defconfig
>     Add irqchip.gicv3_pseudo_nmi=1
> 
>     See https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/Kconfig#n1745
>     Your kernel log should show:
>     [    0.000000] GICv3: Pseudo-NMIs enabled using forced ICC_PMR_EL1
> synchronisation

OK, will try the above tomorrow.

> 
> But my impression is that this may be a HW implementation issue, considering
> we don't see such a huge drop off on our HW.

Besides ampere-mtjade, we saw bad NVMe performance on ThunderX2® CN99XX too,
but I don't have a CN99XX system to check whether the issue is the same as
this one.


Thanks,
Ming



* Re: [bug report] iommu_dma_unmap_sg() is very slow then running IO from remote numa node
  2021-07-22 15:54                       ` Ming Lei
@ 2021-07-22 17:40                         ` Robin Murphy
  2021-07-23 10:21                           ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Robin Murphy @ 2021-07-22 17:40 UTC (permalink / raw)
  To: Ming Lei, John Garry
  Cc: linux-kernel, linux-nvme, iommu, Will Deacon, linux-arm-kernel

On 2021-07-22 16:54, Ming Lei wrote:
[...]
>> If you are still keen to investigate more, then you can try either of these:
>>
>> - add iommu.strict=0 to the cmdline
>>
>> - use perf record+annotate to find the hotspot
>>    - For this you need to enable pseudo-NMI with two steps:
>>      CONFIG_ARM64_PSEUDO_NMI=y in defconfig
>>      Add irqchip.gicv3_pseudo_nmi=1
>>
>>      See https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/Kconfig#n1745
>>      Your kernel log should show:
>>      [    0.000000] GICv3: Pseudo-NMIs enabled using forced ICC_PMR_EL1
>> synchronisation
> 
> OK, will try the above tomorrow.

Thanks, I was also going to suggest the latter, since it's what 
arm_smmu_cmdq_issue_cmdlist() does with IRQs masked that should be most 
indicative of where the slowness most likely stems from.

FWIW I would expect iommu.strict=0 to give a proportional reduction in 
SMMU overhead for both cases since it should effectively mean only 1/256 
as many invalidations are issued.

Could you also check whether the SMMU platform devices have "numa_node" 
properties exposed in sysfs (and if so whether the values look right), 
and share all the SMMU output from the boot log?
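Something like this should show them (rough, untested sketch; the *smmu*
glob is a guess at the platform device naming):

```shell
#!/bin/sh
# Untested sketch: print the numa_node property (if present) of each SMMU
# platform device; the *smmu* name pattern is an assumption.
found=0
for dev in /sys/devices/platform/*smmu*; do
    [ -d "$dev" ] || continue
    found=1
    if [ -r "$dev/numa_node" ]; then
        printf '%s: numa_node=%s\n' "$dev" "$(cat "$dev/numa_node")"
    else
        printf '%s: no numa_node attribute\n' "$dev"
    fi
done
[ "$found" -eq 1 ] || echo 'no SMMU platform devices found'
```

A value of -1 would mean NUMA_NO_NODE, i.e. no affinity described by firmware.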

I still suspect that the most significant bottleneck is likely to be 
MMIO access across chips, incurring the CML/CCIX latency twice for every 
single read, but it's also possible that the performance of the SMMU 
itself could be reduced if its NUMA affinity is not described and we end 
up allocating stuff like pagetables on the wrong node as well.

>> But my impression is that this may be a HW implementation issue, considering
>> we don't see such a huge drop off on our HW.
> 
> Except for mpere-mtjade, we saw bad nvme performance on ThunderX2® CN99XX too,
> but I don't get one CN99XX system to check if the issue is same with
> this one.

I know Cavium's SMMU implementation didn't support MSIs, so that case 
would quite possibly lean towards the MMIO polling angle as well (albeit 
with a very different interconnect).

Robin.


* Re: [bug report] iommu_dma_unmap_sg() is very slow then running IO from remote numa node
  2021-07-22 17:40                         ` Robin Murphy
@ 2021-07-23 10:21                           ` Ming Lei
  2021-07-26  7:51                             ` John Garry
  2021-07-27 17:08                             ` Robin Murphy
  0 siblings, 2 replies; 30+ messages in thread
From: Ming Lei @ 2021-07-23 10:21 UTC (permalink / raw)
  To: Robin Murphy
  Cc: John Garry, linux-kernel, linux-nvme, iommu, Will Deacon,
	linux-arm-kernel


On Thu, Jul 22, 2021 at 06:40:18PM +0100, Robin Murphy wrote:
> On 2021-07-22 16:54, Ming Lei wrote:
> [...]
> > > If you are still keen to investigate more, then you can try either of these:
> > > 
> > > - add iommu.strict=0 to the cmdline
> > > 
> > > - use perf record+annotate to find the hotspot
> > >    - For this you need to enable pseudo-NMI with two steps:
> > >      CONFIG_ARM64_PSEUDO_NMI=y in defconfig
> > >      Add irqchip.gicv3_pseudo_nmi=1
> > > 
> > >      See https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/Kconfig#n1745
> > >      Your kernel log should show:
> > >      [    0.000000] GICv3: Pseudo-NMIs enabled using forced ICC_PMR_EL1
> > > synchronisation
> > 
> > OK, will try the above tomorrow.
> 
> Thanks, I was also going to suggest the latter, since it's what
> arm_smmu_cmdq_issue_cmdlist() does with IRQs masked that should be most
> indicative of where the slowness most likely stems from.

The improvement from 'iommu.strict=0' is very small:

[root@ampere-mtjade-04 ~]# cat /proc/cmdline
BOOT_IMAGE=(hd2,gpt2)/vmlinuz-5.14.0-rc2_linus root=UUID=cff79b49-6661-4347-b366-eb48273fe0c1 ro nvme.poll_queues=2 iommu.strict=0

[root@ampere-mtjade-04 ~]# taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=1530MiB/s][r=392k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=2999: Fri Jul 23 06:05:15 2021
  read: IOPS=392k, BW=1530MiB/s (1604MB/s)(14.9GiB/10001msec)

[root@ampere-mtjade-04 ~]# taskset -c 80 ~/git/tools/test/nvme/io_uring 20 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=150MiB/s][r=38.4k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3063: Fri Jul 23 06:05:49 2021
  read: IOPS=38.4k, BW=150MiB/s (157MB/s)(3000MiB/20002msec)

> 
> FWIW I would expect iommu.strict=0 to give a proportional reduction in SMMU
> overhead for both cases since it should effectively mean only 1/256 as many
> invalidations are issued.
> 
> Could you also check whether the SMMU platform devices have "numa_node"
> properties exposed in sysfs (and if so whether the values look right), and
> share all the SMMU output from the boot log?

No numa_node attribute was found for the SMMU platform device, and the whole
dmesg log is attached.


Thanks, 
Ming

[-- Attachment #2: arm64.log.tar.gz --]
[-- Type: application/gzip, Size: 34200 bytes --]


* Re: [bug report] iommu_dma_unmap_sg() is very slow then running IO from remote numa node
  2021-07-23 10:21                           ` Ming Lei
@ 2021-07-26  7:51                             ` John Garry
  2021-07-28  1:32                               ` Ming Lei
  2021-07-27 17:08                             ` Robin Murphy
  1 sibling, 1 reply; 30+ messages in thread
From: John Garry @ 2021-07-26  7:51 UTC (permalink / raw)
  To: Ming Lei, Robin Murphy
  Cc: linux-kernel, linux-nvme, iommu, Will Deacon, linux-arm-kernel

On 23/07/2021 11:21, Ming Lei wrote:
>> Thanks, I was also going to suggest the latter, since it's what
>> arm_smmu_cmdq_issue_cmdlist() does with IRQs masked that should be most
>> indicative of where the slowness most likely stems from.
> The improvement from 'iommu.strict=0' is very small:
> 

Have you tried turning off the IOMMU to ensure that this is really just 
an IOMMU problem?

You can try setting CONFIG_ARM_SMMU_V3=n in the defconfig or passing 
cmdline param iommu.passthrough=1 to bypass the SMMU (equivalent to 
disabling it for kernel drivers).
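To confirm which translation mode the device actually ended up with, the
IOMMU group's "type" attribute can be read back (rough, untested sketch;
PCI address taken from your boot log):

```shell
#!/bin/sh
# Rough sketch: report the IOMMU group domain type for the NVMe device;
# "identity" means bypass/passthrough, "DMA" or "DMA-FQ" means translated.
dev=0000:81:00.0          # PCI address from the boot log above
typ=/sys/bus/pci/devices/$dev/iommu_group/type
if [ -r "$typ" ]; then
    echo "$dev iommu group type: $(cat "$typ")"
else
    echo "$dev: no iommu_group/type (IOMMU disabled or attribute not present)"
fi
```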

Thanks,
John


* Re: [bug report] iommu_dma_unmap_sg() is very slow then running IO from remote numa node
  2021-07-23 10:21                           ` Ming Lei
  2021-07-26  7:51                             ` John Garry
@ 2021-07-27 17:08                             ` Robin Murphy
  1 sibling, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2021-07-27 17:08 UTC (permalink / raw)
  To: Ming Lei
  Cc: John Garry, linux-kernel, linux-nvme, iommu, Will Deacon,
	linux-arm-kernel

On 2021-07-23 11:21, Ming Lei wrote:
> On Thu, Jul 22, 2021 at 06:40:18PM +0100, Robin Murphy wrote:
>> On 2021-07-22 16:54, Ming Lei wrote:
>> [...]
>>>> If you are still keen to investigate more, then you can try either of these:
>>>>
>>>> - add iommu.strict=0 to the cmdline
>>>>
>>>> - use perf record+annotate to find the hotspot
>>>>     - For this you need to enable pseudo-NMI with two steps:
>>>>       CONFIG_ARM64_PSEUDO_NMI=y in defconfig
>>>>       Add irqchip.gicv3_pseudo_nmi=1
>>>>
>>>>       See https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/Kconfig#n1745
>>>>       Your kernel log should show:
>>>>       [    0.000000] GICv3: Pseudo-NMIs enabled using forced ICC_PMR_EL1
>>>> synchronisation
>>>
>>> OK, will try the above tomorrow.
>>
>> Thanks, I was also going to suggest the latter, since it's what
>> arm_smmu_cmdq_issue_cmdlist() does with IRQs masked that should be most
>> indicative of where the slowness most likely stems from.
> 
> The improvement from 'iommu.strict=0' is very small:
> 
> [root@ampere-mtjade-04 ~]# cat /proc/cmdline
> BOOT_IMAGE=(hd2,gpt2)/vmlinuz-5.14.0-rc2_linus root=UUID=cff79b49-6661-4347-b366-eb48273fe0c1 ro nvme.poll_queues=2 iommu.strict=0
> 
> [root@ampere-mtjade-04 ~]# taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
> + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
> test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
> fio-3.27
> Starting 1 process
> Jobs: 1 (f=1): [r(1)][100.0%][r=1530MiB/s][r=392k IOPS][eta 00m:00s]
> test: (groupid=0, jobs=1): err= 0: pid=2999: Fri Jul 23 06:05:15 2021
>    read: IOPS=392k, BW=1530MiB/s (1604MB/s)(14.9GiB/10001msec)
> 
> [root@ampere-mtjade-04 ~]# taskset -c 80 ~/git/tools/test/nvme/io_uring 20 1 /dev/nvme1n1 4k
> + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting
> test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
> fio-3.27
> Starting 1 process
> Jobs: 1 (f=1): [r(1)][100.0%][r=150MiB/s][r=38.4k IOPS][eta 00m:00s]
> test: (groupid=0, jobs=1): err= 0: pid=3063: Fri Jul 23 06:05:49 2021
>    read: IOPS=38.4k, BW=150MiB/s (157MB/s)(3000MiB/20002msec)

OK, that appears to confirm that the invalidation overhead is more of a 
symptom than the major contributing factor, which also seems to line up 
fairly well with the other information.

>> FWIW I would expect iommu.strict=0 to give a proportional reduction in SMMU
>> overhead for both cases since it should effectively mean only 1/256 as many
>> invalidations are issued.
>>
>> Could you also check whether the SMMU platform devices have "numa_node"
>> properties exposed in sysfs (and if so whether the values look right), and
>> share all the SMMU output from the boot log?
> 
> No found numa_node attribute for smmu platform device, and the whole dmesg log is
> attached.

Thanks, so it seems like the SMMUs have MSI capability and are correctly 
described as coherent, which means completion polling should be 
happening in memory and so hopefully not contributing much more than a 
couple of cross-socket cacheline migrations and/or snoops. Combined with 
the difference in the perf traces looking a lot smaller than the 
order-of-magnitude difference in the overall IOPS throughput, I suspect 
this is overall SMMU overhead exacerbated by the missing NUMA info. If 
every new 4K block touched by the NVMe means a TLB miss where the SMMU 
has to walk pagetables from the wrong side of the system, I'm sure 
that's going to add up.

I'd suggest following John's suggestion and getting some baseline 
figures for just the cross-socket overhead between the CPU and NVMe with 
the SMMU right out of the picture, then have a hack at the firmware (or 
pester the system vendor) to see how much of the difference you can make 
back up by having the SMMU proximity domains described correctly such 
that there's minimal likelihood of the SMMUs having to make non-local 
accesses to their in-memory data. FWIW I don't think it should be *too* 
hard to disassemble the IORT, fill in the proximity domain numbers and 
valid flags on the SMMU nodes, then assemble it again to load as an 
override (it's anything involving offsets in that table that's a real pain).
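Roughly, with acpica-tools, the rebuild flow would look like this (untested
sketch; editing the proximity domain values in iort.dsl is the manual part,
and dumping the table needs root):

```shell
#!/bin/sh
# Untested sketch of rebuilding the IORT with corrected SMMU proximity domains.
if command -v acpidump >/dev/null 2>&1 && command -v iasl >/dev/null 2>&1; then
    acpidump -n IORT -o iort.dump || true   # dump the firmware IORT (needs root)
    acpixtract -a iort.dump || true         # extract the binary -> iort.dat
    iasl -d iort.dat || true                # disassemble -> iort.dsl
    # ...manually edit the SMMU nodes' proximity domain values and valid
    # flags in iort.dsl, then reassemble:
    iasl iort.dsl || true                   # -> iort.aml
    # pack iort.aml into an uncompressed cpio under kernel/firmware/acpi/
    # and prepend it to the initrd so the kernel loads it as an override
    status='rebuild attempted'
else
    status='acpica-tools not installed; commands shown for reference'
fi
echo "$status"
```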

Note that you might also need to make sure you have CMA set up and sized 
appropriately with CONFIG_DMA_PERNUMA_CMA enabled to get the full benefit.
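A quick way to check the CMA side (rough sketch; the config path is
distro-dependent, and the cma_pernuma= parameter name is as in mainline
around this kernel version):

```shell
#!/bin/sh
# Rough sketch: check whether per-NUMA CMA is configured and sized.
cfg=/boot/config-$(uname -r)
pernuma=$(grep -c '^CONFIG_DMA_PERNUMA_CMA=y' "$cfg" 2>/dev/null || true)
cma_arg=$(grep -o 'cma_pernuma=[^ ]*' /proc/cmdline 2>/dev/null || true)
echo "CONFIG_DMA_PERNUMA_CMA enabled: ${pernuma:-0}, cmdline: ${cma_arg:-not set}"
```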

Robin.


* Re: [bug report] iommu_dma_unmap_sg() is very slow then running IO from remote numa node
  2021-07-26  7:51                             ` John Garry
@ 2021-07-28  1:32                               ` Ming Lei
  2021-07-28 10:38                                 ` John Garry
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2021-07-28  1:32 UTC (permalink / raw)
  To: John Garry
  Cc: Robin Murphy, linux-kernel, linux-nvme, iommu, Will Deacon,
	linux-arm-kernel

On Mon, Jul 26, 2021 at 3:51 PM John Garry <john.garry@huawei.com> wrote:
>
> On 23/07/2021 11:21, Ming Lei wrote:
> >> Thanks, I was also going to suggest the latter, since it's what
> >> arm_smmu_cmdq_issue_cmdlist() does with IRQs masked that should be most
> >> indicative of where the slowness most likely stems from.
> > The improvement from 'iommu.strict=0' is very small:
> >
>
> Have you tried turning off the IOMMU to ensure that this is really just
> an IOMMU problem?
>
> You can try setting CONFIG_ARM_SMMU_V3=n in the defconfig or passing
> cmdline param iommu.passthrough=1 to bypass the the SMMU (equivalent to
> disabling for kernel drivers).

Bypassing the SMMU via iommu.passthrough=1 basically doesn't make a
difference to this issue.

And from the fio log, submission latency is good, but completion latency
is pretty bad; maybe writes to PCI memory aren't being committed to the
HW in time?

BTW, adding one mb() at the exit of nvme_queue_rq() doesn't make a difference.


Follows the fio log after passing iommu.passthrough=1:

[root@ampere-mtjade-04 ~]# taskset -c 0 ~/git/tools/test/nvme/io_uring
10 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri
--iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16
--filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1
--rw=randread --name=test --group_reporting
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
4096B-4096B, ioengine=io_uring, iodepth=64
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=1538MiB/s][r=394k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3053: Tue Jul 27 20:57:04 2021
  read: IOPS=393k, BW=1536MiB/s (1611MB/s)(15.0GiB/10001msec)
    slat (usec): min=12, max=343, avg=18.54, stdev= 3.47
    clat (usec): min=46, max=487, avg=140.15, stdev=22.72
     lat (usec): min=63, max=508, avg=158.72, stdev=22.29
    clat percentiles (usec):
     |  1.00th=[   87],  5.00th=[  104], 10.00th=[  113], 20.00th=[  123],
     | 30.00th=[  130], 40.00th=[  135], 50.00th=[  141], 60.00th=[  145],
     | 70.00th=[  151], 80.00th=[  159], 90.00th=[  167], 95.00th=[  176],
     | 99.00th=[  196], 99.50th=[  206], 99.90th=[  233], 99.95th=[  326],
     | 99.99th=[  392]
   bw (  MiB/s): min= 1533, max= 1539, per=100.00%, avg=1537.99,
stdev= 1.36, samples=19
   iops        : min=392672, max=394176, avg=393724.63, stdev=348.25, samples=19
  lat (usec)   : 50=0.01%, 100=3.64%, 250=96.30%, 500=0.06%
  cpu          : usr=17.58%, sys=82.03%, ctx=1113, majf=0, minf=5
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=3933712,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=1536MiB/s (1611MB/s), 1536MiB/s-1536MiB/s
(1611MB/s-1611MB/s), io=15.0GiB (16.1GB), run=10001-10001msec

Disk stats (read/write):
  nvme1n1: ios=3890950/0, merge=0/0, ticks=529137/0, in_queue=529137,
util=99.15%
[root@ampere-mtjade-04 ~]#
[root@ampere-mtjade-04 ~]# taskset -c 80
~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri
--iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16
--filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1
--rw=randread --name=test --group_reporting
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
4096B-4096B, ioengine=io_uring, iodepth=64
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=150MiB/s][r=38.4k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3062: Tue Jul 27 20:57:23 2021
  read: IOPS=38.4k, BW=150MiB/s (157MB/s)(1501MiB/10002msec)
    slat (usec): min=14, max=376, avg=20.21, stdev= 4.66
    clat (usec): min=439, max=2457, avg=1640.85, stdev=17.01
     lat (usec): min=559, max=2494, avg=1661.09, stdev=15.67
    clat percentiles (usec):
     |  1.00th=[ 1614],  5.00th=[ 1631], 10.00th=[ 1647], 20.00th=[ 1647],
     | 30.00th=[ 1647], 40.00th=[ 1647], 50.00th=[ 1647], 60.00th=[ 1647],
     | 70.00th=[ 1647], 80.00th=[ 1647], 90.00th=[ 1647], 95.00th=[ 1647],
     | 99.00th=[ 1647], 99.50th=[ 1663], 99.90th=[ 1729], 99.95th=[ 1827],
     | 99.99th=[ 2057]
   bw (  KiB/s): min=153600, max=153984, per=100.00%, avg=153876.21,
stdev=88.10, samples=19
   iops        : min=38400, max=38496, avg=38469.05, stdev=22.02, samples=19
  lat (usec)   : 500=0.01%, 1000=0.01%
  lat (msec)   : 2=99.96%, 4=0.03%
  cpu          : usr=2.00%, sys=97.65%, ctx=1056, majf=0, minf=5
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=384288,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=150MiB/s (157MB/s), 150MiB/s-150MiB/s (157MB/s-157MB/s),
io=1501MiB (1574MB), run=10002-10002msec

Disk stats (read/write):
  nvme1n1: ios=380266/0, merge=0/0, ticks=554940/0, in_queue=554940, util=99.22%



* Re: [bug report] iommu_dma_unmap_sg() is very slow then running IO from remote numa node
  2021-07-28  1:32                               ` Ming Lei
@ 2021-07-28 10:38                                 ` John Garry
  2021-07-28 15:17                                   ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: John Garry @ 2021-07-28 10:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Robin Murphy, linux-kernel, linux-nvme, iommu, Will Deacon,
	linux-arm-kernel

On 28/07/2021 02:32, Ming Lei wrote:
> On Mon, Jul 26, 2021 at 3:51 PM John Garry<john.garry@huawei.com>  wrote:
>> On 23/07/2021 11:21, Ming Lei wrote:
>>>> Thanks, I was also going to suggest the latter, since it's what
>>>> arm_smmu_cmdq_issue_cmdlist() does with IRQs masked that should be most
>>>> indicative of where the slowness most likely stems from.
>>> The improvement from 'iommu.strict=0' is very small:
>>>
>> Have you tried turning off the IOMMU to ensure that this is really just
>> an IOMMU problem?
>>
>> You can try setting CONFIG_ARM_SMMU_V3=n in the defconfig or passing
>> cmdline param iommu.passthrough=1 to bypass the SMMU (equivalent to
>> disabling it for kernel drivers).
> Bypassing SMMU via iommu.passthrough=1 basically doesn't make a difference
> on this issue.

A ~90% throughput drop still seems to me to be too high to be a software 
issue, more so since I don't see anything similar on my system. And, per 
the fio log, that throughput drop does not come with a corresponding drop 
in total CPU usage.

Do you know if anyone has run memory benchmark tests on this board to 
find out NUMA effect? I think lmbench or stream could be used for this.

Testing network performance in an equivalent fashion to storage could 
also be an idea.

Thanks,
John

> 
> And from the fio log, submission latency is good, but completion latency
> is pretty bad; maybe writes to PCI memory aren't being committed to the
> HW in time?



* Re: [bug report] iommu_dma_unmap_sg() is very slow then running IO from remote numa node
  2021-07-28 10:38                                 ` John Garry
@ 2021-07-28 15:17                                   ` Ming Lei
  2021-07-28 15:39                                     ` Robin Murphy
  2021-08-10  9:36                                     ` John Garry
  0 siblings, 2 replies; 30+ messages in thread
From: Ming Lei @ 2021-07-28 15:17 UTC (permalink / raw)
  To: John Garry
  Cc: Robin Murphy, linux-kernel, linux-nvme, iommu, Will Deacon,
	linux-arm-kernel

On Wed, Jul 28, 2021 at 11:38:18AM +0100, John Garry wrote:
> On 28/07/2021 02:32, Ming Lei wrote:
> > On Mon, Jul 26, 2021 at 3:51 PM John Garry<john.garry@huawei.com>  wrote:
> > > On 23/07/2021 11:21, Ming Lei wrote:
> > > > > Thanks, I was also going to suggest the latter, since it's what
> > > > > arm_smmu_cmdq_issue_cmdlist() does with IRQs masked that should be most
> > > > > indicative of where the slowness most likely stems from.
> > > > The improvement from 'iommu.strict=0' is very small:
> > > > 
> > > Have you tried turning off the IOMMU to ensure that this is really just
> > > an IOMMU problem?
> > > 
> > > You can try setting CONFIG_ARM_SMMU_V3=n in the defconfig or passing
> > > cmdline param iommu.passthrough=1 to bypass the SMMU (equivalent to
> > > disabling it for kernel drivers).
> > Bypassing SMMU via iommu.passthrough=1 basically doesn't make a difference
> > on this issue.
> 
> A ~90% throughput drop still seems to me to be too high to be a software
> issue. More so since I don't see similar on my system. And that throughput
> drop does not lead to a total CPU usage drop, from the fio log.
> 
> Do you know if anyone has run memory benchmark tests on this board to find
> out NUMA effect? I think lmbench or stream could be used for this.

https://lore.kernel.org/lkml/YOhbc5C47IzC893B@T590/

-- 
Ming



* Re: [bug report] iommu_dma_unmap_sg() is very slow then running IO from remote numa node
  2021-07-28 15:17                                   ` Ming Lei
@ 2021-07-28 15:39                                     ` Robin Murphy
  2021-08-10  9:36                                     ` John Garry
  1 sibling, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2021-07-28 15:39 UTC (permalink / raw)
  To: Ming Lei, John Garry
  Cc: linux-kernel, linux-nvme, iommu, Will Deacon, linux-arm-kernel

On 2021-07-28 16:17, Ming Lei wrote:
> On Wed, Jul 28, 2021 at 11:38:18AM +0100, John Garry wrote:
>> On 28/07/2021 02:32, Ming Lei wrote:
>>> On Mon, Jul 26, 2021 at 3:51 PM John Garry<john.garry@huawei.com>  wrote:
>>>> On 23/07/2021 11:21, Ming Lei wrote:
>>>>>> Thanks, I was also going to suggest the latter, since it's what
>>>>>> arm_smmu_cmdq_issue_cmdlist() does with IRQs masked that should be most
>>>>>> indicative of where the slowness most likely stems from.
>>>>> The improvement from 'iommu.strict=0' is very small:
>>>>>
>>>> Have you tried turning off the IOMMU to ensure that this is really just
>>>> an IOMMU problem?
>>>>
>>>> You can try setting CONFIG_ARM_SMMU_V3=n in the defconfig or passing
>>>> cmdline param iommu.passthrough=1 to bypass the SMMU (equivalent to
>>>> disabling for kernel drivers).
>>> Bypassing SMMU via iommu.passthrough=1 basically doesn't make a difference
>>> on this issue.
>>
>> A ~90% throughput drop still seems to me to be too high to be a software
>> issue. More so since I don't see similar on my system. And that throughput
>> drop does not lead to a total CPU usage drop, from the fio log.

Indeed, it now sounds like $SUBJECT has been a complete red herring, and 
although the SMMU may be reflecting the underlying slowness it is not in 
fact a significant contributor to it. Presumably perf shows any 
difference in CPU time moving elsewhere once iommu_dma_unmap_sg() is out 
of the picture?
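That could be checked along the lines of the sketch below, assuming perf is
available and the fio jobs from the start of the thread are running: record a
profile on the local and the remote CPU separately, then diff the two. The
helper prints the commands rather than executing them.

```shell
# Profile the given CPU for 10s while the corresponding fio job is active,
# writing a tagged perf.data file; compare the two profiles afterwards.
perf_cmd() {
    echo "perf record -C $1 -g -o perf.$2.data -- sleep 10"
}

perf_cmd 0 local      # run during the taskset -c 0 fio job
perf_cmd 80 remote    # run during the taskset -c 80 fio job
# afterwards: perf diff perf.local.data perf.remote.data
```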

>> Do you know if anyone has run memory benchmark tests on this board to find
>> out NUMA effect? I think lmbench or stream could be used for this.
> 
> https://lore.kernel.org/lkml/YOhbc5C47IzC893B@T590/

Hmm, a ~4x discrepancy in CPU<->memory bandwidth is pretty significant, 
but it's still not the ~10x discrepancy in NVMe throughput. Possibly 
CPU<->PCIe and/or PCIe<->memory bandwidth is even further impacted 
between sockets, or perhaps all the individual latencies just add up - 
that level of detailed performance analysis is beyond my expertise. 
Either way I guess it's probably time to take it up with the system 
vendor to see if there's anything which can be tuned in hardware/firmware.

Robin.


* Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote numa node
  2021-07-28 15:17                                   ` Ming Lei
  2021-07-28 15:39                                     ` Robin Murphy
@ 2021-08-10  9:36                                     ` John Garry
  2021-08-10 10:35                                       ` Ming Lei
  1 sibling, 1 reply; 30+ messages in thread
From: John Garry @ 2021-08-10  9:36 UTC (permalink / raw)
  To: Ming Lei
  Cc: Robin Murphy, linux-kernel, linux-nvme, iommu, Will Deacon,
	linux-arm-kernel

On 28/07/2021 16:17, Ming Lei wrote:
>>>> Have you tried turning off the IOMMU to ensure that this is really just
>>>> an IOMMU problem?
>>>>
>>>> You can try setting CONFIG_ARM_SMMU_V3=n in the defconfig or passing
>>>> cmdline param iommu.passthrough=1 to bypass the SMMU (equivalent to
>>>> disabling for kernel drivers).
>>> Bypassing SMMU via iommu.passthrough=1 basically doesn't make a difference
>>> on this issue.
>> A ~90% throughput drop still seems to me to be too high to be a software
>> issue. More so since I don't see similar on my system. And that throughput
>> drop does not lead to a total CPU usage drop, from the fio log.
>>
>> Do you know if anyone has run memory benchmark tests on this board to find
>> out NUMA effect? I think lmbench or stream could be used for this.
> https://lore.kernel.org/lkml/YOhbc5C47IzC893B@T590/

Hi Ming,

Out of curiosity, did you investigate this topic any further?

And you also asked about my results earlier:

On 22/07/2021 16:54, Ming Lei wrote:
 >> [   52.035895] nvme 0000:81:00.0: Adding to iommu group 5
 >> [   52.047732] nvme nvme0: pci function 0000:81:00.0
 >> [   52.067216] nvme nvme0: 22/0/2 default/read/poll queues
 >> [   52.087318]  nvme0n1: p1
 >>
 >> So I get these results:
 >> cpu0 335K
 >> cpu32 346K
 >> cpu64 300K
 >> cpu96 300K
 >>
 >> So still not massive changes.
 > In your last email, the results are the following with irq mode io_uring:
 >
 >   cpu0  497K
 >   cpu4  307K
 >   cpu32 566K
 >   cpu64 488K
 >   cpu96 508K
 >
 > So looks you get much worse result with real io_polling?
 >

Would the expectation be that at least I get the same performance with 
io_polling here? Is there anything else you can suggest trying to 
investigate this lower performance?

Thanks,
John


* Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote numa node
  2021-08-10  9:36                                     ` John Garry
@ 2021-08-10 10:35                                       ` Ming Lei
  0 siblings, 0 replies; 30+ messages in thread
From: Ming Lei @ 2021-08-10 10:35 UTC (permalink / raw)
  To: John Garry
  Cc: Robin Murphy, linux-kernel, linux-nvme, iommu, Will Deacon,
	linux-arm-kernel

On Tue, Aug 10, 2021 at 10:36:47AM +0100, John Garry wrote:
> On 28/07/2021 16:17, Ming Lei wrote:
> > > > > Have you tried turning off the IOMMU to ensure that this is really just
> > > > > an IOMMU problem?
> > > > > 
> > > > > You can try setting CONFIG_ARM_SMMU_V3=n in the defconfig or passing
> > > > > cmdline param iommu.passthrough=1 to bypass the SMMU (equivalent to
> > > > > disabling for kernel drivers).
> > > > Bypassing SMMU via iommu.passthrough=1 basically doesn't make a difference
> > > > on this issue.
> > > A ~90% throughput drop still seems to me to be too high to be a software
> > > issue. More so since I don't see similar on my system. And that throughput
> > > drop does not lead to a total CPU usage drop, from the fio log.
> > > 
> > > Do you know if anyone has run memory benchmark tests on this board to find
> > > out NUMA effect? I think lmbench or stream could be used for this.
> > https://lore.kernel.org/lkml/YOhbc5C47IzC893B@T590/
> 
> Hi Ming,
> 
> Out of curiosity, did you investigate this topic any further?

IMO, the issue is probably on the device/system side, since completion latency
is increased a lot while submission latency is unchanged.

Either the submission isn't committed to hardware in time, or the
completion status isn't updated by the hardware in time, from the CPU's
viewpoint.

We have tried updating to newer firmware, but it made no difference.
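The submission/completion split described above is visible directly in fio's
normal output, which reports slat (submission latency) and clat (completion
latency) separately. A small sketch extracting the averages from captured
output; the sample lines mimic fio's reporting format (which can vary slightly
between fio versions) and the numbers in them are made up.

```shell
# Pull "avg=NNN.NN" out of the matching slat/clat line on stdin.
avg_of() {
    sed -n "s/.*$1.*avg=\([0-9.]*\).*/\1/p"
}

# Sample lines in fio's reporting format (made-up values).
sample='    slat (nsec): min=1000, max=9000, avg=812.34, stdev=12.00
    clat (nsec): min=2000, max=90000, avg=25678.90, stdev=100.00'

echo "$sample" | avg_of slat    # submission-side average
echo "$sample" | avg_of clat    # completion-side average
```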

> 
> And you also asked about my results earlier:
> 
> On 22/07/2021 16:54, Ming Lei wrote:
> >> [   52.035895] nvme 0000:81:00.0: Adding to iommu group 5
> >> [   52.047732] nvme nvme0: pci function 0000:81:00.0
> >> [   52.067216] nvme nvme0: 22/0/2 default/read/poll queues
> >> [   52.087318]  nvme0n1: p1
> >>
> >> So I get these results:
> >> cpu0 335K
> >> cpu32 346K
> >> cpu64 300K
> >> cpu96 300K
> >>
> >> So still not massive changes.
> > In your last email, the results are the following with irq mode io_uring:
> >
> >   cpu0  497K
> >   cpu4  307K
> >   cpu32 566K
> >   cpu64 488K
> >   cpu96 508K
> >
> > So looks you get much worse result with real io_polling?
> >
> 
> Would the expectation be that at least I get the same performance with
> io_polling here?

io_polling is supposed to improve IO latency a lot compared with irq
mode, and the perf data shows that clearly on x86_64.

> Is there anything else you can suggest trying to investigate
> this lower performance?

You may try comparing IRQ mode and polling to narrow down the possible
reasons; I don't have an exact suggestion for how to investigate it, :-(
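For what it's worth, that comparison could look something like the sketch
below. The CPU numbers and device path are assumptions based on John's earlier
mails, --hipri needs nvme poll queues enabled (e.g. nvme.poll_queues=2) to
actually poll rather than fall back to IRQs, and the helper only prints the
command lines for inspection.

```shell
# Build the same randread fio job in IRQ mode or polling mode (--hipri),
# pinned to a given CPU; print it instead of running it.
fio_cmd() {
    cpu=$1; mode=$2
    extra=""
    [ "$mode" = "poll" ] && extra=" --hipri"
    echo "taskset -c $cpu fio --bs=4k --ioengine=io_uring --direct=1 --iodepth=64 --runtime=10 --rw=randread --filename=/dev/nvme0n1 --name=test$extra"
}

for cpu in 0 64; do           # one CPU per NUMA node
    for mode in irq poll; do
        fio_cmd "$cpu" "$mode"
    done
done
```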


Thanks,
Ming



end of thread, other threads:[~2021-08-10 10:36 UTC | newest]

Thread overview: 30+ messages
-- links below jump to the message on this page --
2021-07-09  8:38 [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote numa node Ming Lei
2021-07-09 10:16 ` Russell King (Oracle)
2021-07-09 14:21   ` Ming Lei
2021-07-09 10:26 ` Robin Murphy
2021-07-09 11:04   ` John Garry
2021-07-09 12:34     ` Robin Murphy
2021-07-09 14:24   ` Ming Lei
2021-07-19 16:14     ` John Garry
2021-07-21  1:40       ` Ming Lei
2021-07-21  9:23         ` John Garry
2021-07-21  9:59           ` Ming Lei
2021-07-21 11:07             ` John Garry
2021-07-21 11:58               ` Ming Lei
2021-07-22  7:58               ` Ming Lei
2021-07-22 10:05                 ` John Garry
2021-07-22 10:19                   ` Ming Lei
2021-07-22 11:12                     ` John Garry
2021-07-22 12:53                       ` Marc Zyngier
2021-07-22 13:54                         ` John Garry
2021-07-22 15:54                       ` Ming Lei
2021-07-22 17:40                         ` Robin Murphy
2021-07-23 10:21                           ` Ming Lei
2021-07-26  7:51                             ` John Garry
2021-07-28  1:32                               ` Ming Lei
2021-07-28 10:38                                 ` John Garry
2021-07-28 15:17                                   ` Ming Lei
2021-07-28 15:39                                     ` Robin Murphy
2021-08-10  9:36                                     ` John Garry
2021-08-10 10:35                                       ` Ming Lei
2021-07-27 17:08                             ` Robin Murphy
