On Thu, Jul 22, 2021 at 06:40:18PM +0100, Robin Murphy wrote:
> On 2021-07-22 16:54, Ming Lei wrote:
> [...]
> > > If you are still keen to investigate more, then you can try either of these:
> > >
> > > - add iommu.strict=0 to the cmdline
> > >
> > > - use perf record+annotate to find the hotspot
> > >   - For this you need to enable pseudo-NMI with two steps:
> > >     CONFIG_ARM64_PSEUDO_NMI=y in defconfig
> > >     Add irqchip.gicv3_pseudo_nmi=1
> > >
> > >     See https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/Kconfig#n1745
> > >     Your kernel log should show:
> > >     [    0.000000] GICv3: Pseudo-NMIs enabled using forced ICC_PMR_EL1
> > >     synchronisation
> >
> > OK, will try the above tomorrow.
>
> Thanks, I was also going to suggest the latter, since it's what
> arm_smmu_cmdq_issue_cmdlist() does with IRQs masked that should be most
> indicative of where the slowness most likely stems from.

The improvement from 'iommu.strict=0' is very small:

[root@ampere-mtjade-04 ~]# cat /proc/cmdline
BOOT_IMAGE=(hd2,gpt2)/vmlinuz-5.14.0-rc2_linus root=UUID=cff79b49-6661-4347-b366-eb48273fe0c1 ro nvme.poll_queues=2 iommu.strict=0

[root@ampere-mtjade-04 ~]# taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=1530MiB/s][r=392k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=2999: Fri Jul 23 06:05:15 2021
  read: IOPS=392k, BW=1530MiB/s (1604MB/s)(14.9GiB/10001msec)

[root@ampere-mtjade-04 ~]# taskset -c 80 ~/git/tools/test/nvme/io_uring 20 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=150MiB/s][r=38.4k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3063: Fri Jul 23 06:05:49 2021
  read: IOPS=38.4k, BW=150MiB/s (157MB/s)(3000MiB/20002msec)

>
> FWIW I would expect iommu.strict=0 to give a proportional reduction in SMMU
> overhead for both cases since it should effectively mean only 1/256 as many
> invalidations are issued.
>
> Could you also check whether the SMMU platform devices have "numa_node"
> properties exposed in sysfs (and if so whether the values look right), and
> share all the SMMU output from the boot log?

No numa_node attribute is exposed for the SMMU platform devices (checked along
the lines of the sketch below), and the whole dmesg log is attached.

Thanks,
Ming
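
For reference, a minimal sketch of how the sysfs check can be done, assuming
the SMMU instances show up under /sys/devices/platform/ with "smmu" in their
names (the exact ACPI/IORT device names are a guess and may differ on this
machine):

# List SMMU platform devices and print their numa_node attribute, if any.
# The "*smmu*" glob is an assumption; adjust it to the actual device names.
for d in /sys/devices/platform/*smmu*; do
        if [ -f "$d/numa_node" ]; then
                echo "$d: numa_node=$(cat "$d/numa_node")"
        else
                echo "$d: no numa_node attribute"
        fi
done

A numa_node value of -1 would mean the node is unknown to the kernel.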