* Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
@ 2021-07-27 20:32 Finlayson, James M CIV (USA)
2021-07-27 21:52 ` Chris Murphy
` (4 more replies)
0 siblings, 5 replies; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-07-27 20:32 UTC (permalink / raw)
To: 'linux-raid@vger.kernel.org'; +Cc: Finlayson, James M CIV (USA)
Sorry, this will be a long email with everything I find relevant. I can get over 110GB/s of 4kB random reads out of the individual NVMe SSDs, but I'm at a loss as to why mdraid can only do a very small fraction of that. I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful. Below is everything I do to a server to make the I/O crank. My role is that of a lab researcher/resident expert/consultant, and I'm just stumped why I can't do better. If there is a fine manual that somebody can point me to, I'm happy to read it.
I have tried both RAID5 and RAID6, trying to be highly cognizant of NUMAness. The Rome is set to one NUMA node per socket (NPS1), and the BIOS is set to maximize Infinity Fabric and PCIe performance per AMD's white papers. The NVMe drives are all Gen4 (I believe HPE-rebadged Samsung 1733a?); I can get the drives doing 1.45M 4KB random reads each if I try hard.
Everything I can think to share:
[root@<server> <server>]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)
[root@<server> <server>]# uname -r
4.18.0-305.el8.x86_64
[root@<server> ~]# modinfo raid6
filename: /lib/modules/4.18.0-305.el8.x86_64/kernel/drivers/md/raid456.ko.xz
alias: raid6
alias: raid5
alias: md-level-6
alias: md-raid6
alias: md-personality-8
alias: md-level-4
alias: md-level-5
alias: md-raid4
alias: md-raid5
alias: md-personality-4
description: RAID4/5/6 (striping with parity) personality for MD
license: GPL
rhelversion: 8.4
srcversion: FE86A53E1C1CDAE8F972CBA
depends: async_raid6_recov,async_pq,libcrc32c,raid6_pq,async_tx,async_memcpy,async_xor
intree: Y
name: raid456
vermagic: 4.18.0-305.el8.x86_64 SMP mod_unload modversions
sig_id: PKCS#7
signer: Red Hat Enterprise Linux kernel signing key
[root@<server> ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme16n1 259:0 0 1.8T 0 disk
├─nvme16n1p1 259:1 0 512M 0 part /boot/efi
├─nvme16n1p2 259:2 0 512M 0 part /boot
├─nvme16n1p3 259:3 0 49.4G 0 part [SWAP]
└─nvme16n1p4 259:4 0 1.7T 0 part /
nvme0n1 259:5 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme1n1 259:6 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme2n1 259:7 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme3n1 259:8 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme7n1 259:9 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme11n1 259:10 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme10n1 259:11 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme14n1 259:12 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme5n1 259:13 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme8n1 259:14 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme6n1 259:15 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme9n1 259:16 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme15n1 259:17 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme20n1 259:18 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme13n1 259:19 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme18n1 259:20 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme4n1 259:21 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme21n1 259:22 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme22n1 259:23 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme24n1 259:24 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme12n1 259:25 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme17n1 259:26 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme19n1 259:27 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme23n1 259:28 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
[root@<server> ~]# lsblk -o KNAME,MODEL,VENDOR
KNAME MODEL VENDOR
nvme0n1 MZXL515THALA-000H3
nvme1n1 MZXL515THALA-000H3
nvme2n1 MZXL515THALA-000H3
nvme3n1 MZXL515THALA-000H3
nvme7n1 MZXL515THALA-000H3
nvme11n1 MZXL515THALA-000H3
nvme10n1 MZXL515THALA-000H3
nvme14n1 MZXL515THALA-000H3
nvme5n1 MZXL515THALA-000H3
nvme8n1 MZXL515THALA-000H3
nvme6n1 MZXL515THALA-000H3
nvme9n1 MZXL515THALA-000H3
nvme15n1 MZXL515THALA-000H3
nvme20n1 MZXL515THALA-000H3
nvme13n1 MZXL515THALA-000H3
nvme18n1 MZXL515THALA-000H3
nvme4n1 MZXL515THALA-000H3
nvme21n1 MZXL515THALA-000H3
nvme22n1 MZXL515THALA-000H3
nvme24n1 MZXL515THALA-000H3
nvme12n1 MZXL515THALA-000H3
nvme17n1 MZXL515THALA-000H3
nvme19n1 MZXL515THALA-000H3
nvme23n1 MZXL515THALA-000H3
[root@<server> jim]# ./map_numa.sh (nvme16 is the boot drive; nvme0-11 are on NUMA 0, nvme12-24 on NUMA 1)
device: nvme8 numanode: 0
device: nvme9 numanode: 0
device: nvme10 numanode: 0
device: nvme11 numanode: 0
device: nvme4 numanode: 0
device: nvme5 numanode: 0
device: nvme6 numanode: 0
device: nvme7 numanode: 0
device: nvme2 numanode: 0
device: nvme3 numanode: 0
device: nvme0 numanode: 0
device: nvme1 numanode: 0
device: nvme21 numanode: 1
device: nvme22 numanode: 1
device: nvme23 numanode: 1
device: nvme24 numanode: 1
device: nvme16 numanode: 1
device: nvme17 numanode: 1
device: nvme18 numanode: 1
device: nvme19 numanode: 1
device: nvme20 numanode: 1
device: nvme14 numanode: 1
device: nvme15 numanode: 1
device: nvme12 numanode: 1
device: nvme13 numanode: 1
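map_numa.sh isn't shown; a minimal equivalent, assuming the stock sysfs layout, would be something like:
#!/bin/bash
# report the NUMA node each NVMe controller hangs off of
for dev in /sys/class/nvme/nvme*; do
    echo "device: $(basename "$dev") numanode: $(cat "$dev/device/numa_node")"
done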
[root@<server> jim]# cat /etc/udev/rules.d/99-abj.nr_32.rules
KERNEL=="nvme*[0-9]n*[0-9]",ATTRS{model}=="MZXL515THALA-000H3",ATTR{queue/io_poll}="1",ATTR{queue/io_poll_delay}="100000",ATTR{queue/nomerges}="2",ATTR{queue/nr_requests}="1023",ATTR{queue/rotational}="0",ATTR{queue/rq_affinity}="2",ATTR{queue/scheduler}="none",ATTR{queue/add_random}="0",ATTR{queue/max_sectors_kb}="4096",PROGRAM="/usr/sbin/nvme set-feature /dev/%k --feature-id 8 --value 522 " {coalesce up to 10 interrupts per device}
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="md*", ATTR{md/sync_speed_max}="2000000",ATTR{md/group_thread_cnt}="64", ATTR{md/stripe_cache_size}="8192",ATTR{queue/io_poll}="1",ATTR{queue/io_poll_delay}="100000",ATTR{queue/nomerges}="2",ATTR{queue/nr_requests}="1023",ATTR{queue/rotational}="0",ATTR{queue/rq_affinity}="2",ATTR{queue/scheduler}="none",ATTR{queue/add_random}="0",ATTR{queue/max_sectors_kb}="4096"
(I know the nr_requests=1023 doesn't take effect on the md devices, but it's there for reference.) We tune for max IOPS, not for latency, hence going hard at rq_affinity, nomerges, etc.
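A quick way to double-check that the rules actually landed (standard sysfs paths; one drive and one array shown):
for f in scheduler nr_requests nomerges rq_affinity max_sectors_kb io_poll; do
    printf '%-16s %s\n' "$f" "$(cat /sys/block/nvme0n1/queue/$f)"
done
grep . /sys/block/md0/md/{group_thread_cnt,stripe_cache_size}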
[root@<server> <server>]# cat /proc/mdstat (the 128K chunk is just something Fusion-io told me way back when and I never needed to change)
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 nvme11n1[11](S) nvme10n1[10] nvme9n1[9] nvme8n1[8] nvme7n1[7] nvme6n1[6] nvme5n1[5] nvme4n1[4] nvme3n1[3] nvme2n1[2] nvme1n1[1] nvme0n1[0]
150007961600 blocks super 1.2 level 5, 128k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
bitmap: 0/112 pages [0KB], 65536KB chunk
md1 : active raid5 nvme24n1[11](S) nvme23n1[10] nvme22n1[9] nvme21n1[8] nvme20n1[7] nvme19n1[6] nvme18n1[5] nvme17n1[4] nvme15n1[3] nvme14n1[2] nvme13n1[1] nvme12n1[0]
150007961600 blocks super 1.2 level 5, 128k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
bitmap: 0/112 pages [0KB], 65536KB chunk
unused devices: <none>
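For reference, an array matching the mdstat above (11 active devices plus one spare, 128K chunk) would have been created with something like the following - a reconstruction, not necessarily the exact command used:
mdadm --create /dev/md0 --level=5 --chunk=128 \
      --raid-devices=11 --spare-devices=1 \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1  /dev/nvme3n1 \
      /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1  /dev/nvme7n1 \
      /dev/nvme8n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1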
What troubles me: if mdraid checked parity on every read I could somewhat understand the slowdown, but I would think reads are nearly a pass-through.
[root@<server> /]# grep raid /var/log/messages
Jul 27 00:00:02 <server> rpmlist_verification[12745]: libblockdev-mdraid 2.24 Thu 22 Jul 2021 02:58:37 PM GMT
Jul 27 18:00:28 <server> kernel: raid6: sse2x1 gen() 9792 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x1 xor() 6436 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x2 gen() 11198 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x2 xor() 9546 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x4 gen() 14271 MB/s
Jul 27 18:00:29 <server> kernel: raid6: sse2x4 xor() 6354 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x1 gen() 22838 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x1 xor() 14069 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x2 gen() 26973 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x2 xor() 18380 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x4 gen() 26601 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x4 xor() 7025 MB/s
Jul 27 18:00:29 <server> kernel: raid6: using algorithm avx2x2 gen() 26973 MB/s
Jul 27 18:00:29 <server> kernel: raid6: .... xor() 18380 MB/s, rmw enabled
Jul 27 18:00:29 <server> kernel: raid6: using avx2x2 recovery algorithm
[root@<server> <server>]# cat fiojim.hpdl385.nps1
[global]
name=random
iodepth=128
ioengine=libaio
direct=1
norandommap
group_reporting
randrepeat=1
random_generator=tausworthe64
bs=4k
rw=randread
numjobs=64
runtime=60
[socket0]
new_group
numa_mem_policy=bind:0
numa_cpu_nodes=0
filename=/dev/nvme0n1
filename=/dev/nvme1n1
filename=/dev/nvme2n1
filename=/dev/nvme3n1
filename=/dev/nvme4n1
filename=/dev/nvme5n1
filename=/dev/nvme6n1
filename=/dev/nvme7n1
filename=/dev/nvme8n1
filename=/dev/nvme9n1
filename=/dev/nvme10n1
filename=/dev/nvme11n1
[socket1]
new_group
numa_mem_policy=bind:1
numa_cpu_nodes=1
filename=/dev/nvme12n1
filename=/dev/nvme13n1
filename=/dev/nvme14n1
filename=/dev/nvme15n1
filename=/dev/nvme17n1
filename=/dev/nvme18n1
filename=/dev/nvme19n1
filename=/dev/nvme20n1
filename=/dev/nvme21n1
filename=/dev/nvme22n1
filename=/dev/nvme23n1
filename=/dev/nvme24n1
[socket0-md]
stonewall
numa_mem_policy=bind:0
numa_cpu_nodes=0
filename=/dev/md0
[socket1-md]
new_group
numa_mem_policy=bind:1
numa_cpu_nodes=1
filename=/dev/md1
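Note the stonewall in [socket0-md]: it waits for all preceding jobs, so the two raw-device groups run first and then both md groups run together. The file is run as-is, e.g.:
# raw-device groups complete before the md groups start
fio fiojim.hpdl385.nps1 --output=fiojim.out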
iostat -xkz 1 with the drives raw:
avg-cpu: %user %nice %system %iowait %steal %idle
8.32 0.00 38.30 0.00 0.00 53.39
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme0n1 1317510.00 0.00 5270044.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00 411.95 4.00 0.00 0.00 100.40
nvme1n1 1317548.00 0.00 5270192.00 0.00 0.00 0.00 0.00 0.00 0.32 0.00 417.38 4.00 0.00 0.00 100.00
nvme2n1 1317578.00 0.00 5270316.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00 414.77 4.00 0.00 0.00 100.20
nvme3n1 1317554.00 0.00 5270216.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00 413.25 4.00 0.00 0.00 100.40
nvme7n1 1317559.00 0.00 5270236.00 0.00 0.00 0.00 0.00 0.00 0.33 0.00 430.03 4.00 0.00 0.00 100.40
nvme11n1 1317502.00 0.00 5269996.00 0.00 0.00 0.00 0.00 0.00 0.73 0.00 964.85 4.00 0.00 0.00 100.40
nvme10n1 1317656.00 0.00 5270624.00 0.00 0.00 0.00 0.00 0.00 0.80 0.00 1050.05 4.00 0.00 0.00 100.40
nvme14n1 1107632.00 0.00 4430528.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 307.52 4.00 0.00 0.00 100.40
nvme5n1 1317583.00 0.00 5270332.00 0.00 0.00 0.00 0.00 0.00 0.33 0.00 430.47 4.00 0.00 0.00 100.00
nvme8n1 1317617.00 0.00 5270468.00 0.00 0.00 0.00 0.00 0.00 0.74 0.00 972.52 4.00 0.00 0.00 101.00
nvme6n1 1317535.00 0.00 5270144.00 0.00 0.00 0.00 0.00 0.00 0.33 0.00 432.48 4.00 0.00 0.00 100.60
nvme9n1 1317582.00 0.00 5270328.00 0.00 0.00 0.00 0.00 0.00 0.75 0.00 992.82 4.00 0.00 0.00 100.40
nvme15n1 1107703.00 0.00 4430816.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 305.93 4.00 0.00 0.00 100.60
nvme20n1 1107712.00 0.00 4430848.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 306.72 4.00 0.00 0.00 100.20
nvme13n1 1107714.00 0.00 4430852.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 307.10 4.00 0.00 0.00 101.40
nvme18n1 1107674.00 0.00 4430696.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 306.04 4.00 0.00 0.00 100.20
nvme4n1 1317521.00 0.00 5270076.00 0.00 0.00 0.00 0.00 0.00 0.33 0.00 431.63 4.00 0.00 0.00 100.20
nvme21n1 1107714.00 0.00 4430856.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 309.11 4.00 0.00 0.00 100.40
nvme22n1 1107711.00 0.00 4430840.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 308.52 4.00 0.00 0.00 100.60
nvme24n1 1107441.00 0.00 4429768.00 0.00 0.00 0.00 0.00 0.00 3.86 0.00 4271.29 4.00 0.00 0.00 100.20
nvme12n1 1107733.00 0.00 4430932.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 307.70 4.00 0.00 0.00 100.40
nvme17n1 1107858.00 0.00 4431436.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 307.95 4.00 0.00 0.00 100.60
nvme19n1 1107766.00 0.00 4431064.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 307.17 4.00 0.00 0.00 100.40
nvme23n1 1108033.00 0.00 4432132.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00 340.62 4.00 0.00 0.00 100.00
iostat -xkz 1 with the md's
avg-cpu: %user %nice %system %iowait %steal %idle
0.56 0.00 49.94 0.00 0.00 49.51
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme0n1 114589.00 0.00 458356.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.54 4.00 0.00 0.01 100.00
nvme1n1 115284.00 0.00 461136.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.77 4.00 0.00 0.01 100.00
nvme2n1 114911.00 0.00 459644.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.61 4.00 0.00 0.01 100.00
nvme3n1 114538.00 0.00 458152.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.55 4.00 0.00 0.01 100.00
nvme7n1 114524.00 0.00 458096.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.53 4.00 0.00 0.01 100.00
nvme10n1 114934.00 0.00 459736.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.61 4.00 0.00 0.01 100.00
nvme14n1 97399.00 0.00 389596.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.41 4.00 0.00 0.01 100.00
nvme5n1 114929.00 0.00 459716.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.61 4.00 0.00 0.01 100.00
nvme8n1 114393.00 0.00 457572.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.40 4.00 0.00 0.01 99.90
nvme6n1 114731.00 0.00 458924.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.56 4.00 0.00 0.01 99.90
nvme9n1 114146.00 0.00 456584.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.37 4.00 0.00 0.01 99.90
nvme15n1 96960.00 0.00 387840.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.30 4.00 0.00 0.01 100.00
nvme20n1 97171.00 0.00 388684.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.36 4.00 0.00 0.01 100.00
nvme13n1 96874.00 0.00 387496.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.31 4.00 0.00 0.01 100.00
nvme18n1 96696.00 0.00 386784.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.16 4.00 0.00 0.01 100.00
nvme4n1 115220.00 0.00 460876.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.75 4.00 0.00 0.01 100.00
nvme21n1 96756.00 0.00 387024.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.24 4.00 0.00 0.01 100.00
nvme22n1 97352.00 0.00 389408.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.36 4.00 0.00 0.01 100.00
nvme12n1 96899.00 0.00 387596.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.22 4.00 0.00 0.01 100.20
nvme17n1 96748.00 0.00 386992.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.24 4.00 0.00 0.01 100.00
nvme19n1 97191.00 0.00 388764.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.30 4.00 0.00 0.01 100.00
nvme23n1 96787.00 0.00 387148.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 28.41 4.00 0.00 0.01 99.90
md1 1066812.00 0.00 4267248.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.00 0.00 0.00 0.00
md0 1262173.00 0.00 5048692.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.00 0.00 0.00 0.00
fio output:
socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket0-md: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1-md: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.26
Starting 256 processes
Jobs: 128 (f=128): [_(128),r(128)][1.6%][r=9103MiB/s][r=2330k IOPS][eta 02h:08m:00s]
socket0: (groupid=0, jobs=64): err= 0: pid=18344: Tue Jul 27 20:00:10 2021
read: IOPS=16.0M, BW=60.8GiB/s (65.3GB/s)(3651GiB/60003msec)
slat (nsec): min=1222, max=18033k, avg=2429.23, stdev=2975.48
clat (usec): min=24, max=20221, avg=510.51, stdev=336.57
lat (usec): min=30, max=20240, avg=513.01, stdev=336.58
clat percentiles (usec):
| 1.00th=[ 147], 5.00th=[ 194], 10.00th=[ 229], 20.00th=[ 281],
| 30.00th=[ 326], 40.00th=[ 367], 50.00th=[ 412], 60.00th=[ 469],
| 70.00th=[ 553], 80.00th=[ 676], 90.00th=[ 914], 95.00th=[ 1156],
| 99.00th=[ 1778], 99.50th=[ 2073], 99.90th=[ 2868], 99.95th=[ 3294],
| 99.99th=[ 4424]
bw ( MiB/s): min=52367, max=65429, per=32.81%, avg=62388.68, stdev=33.73, samples=7424
iops : min=13406054, max=16749890, avg=15971477.42, stdev=8635.86, samples=7424
lat (usec) : 50=0.01%, 100=0.02%, 250=13.89%, 500=50.33%, 750=19.72%
lat (usec) : 1000=8.24%
lat (msec) : 2=7.22%, 4=0.57%, 10=0.02%, 20=0.01%, 50=0.01%
cpu : usr=17.93%, sys=49.30%, ctx=21719222, majf=0, minf=9915
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=957111950,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
socket1: (groupid=1, jobs=64): err= 0: pid=18408: Tue Jul 27 20:00:10 2021
read: IOPS=13.5M, BW=51.4GiB/s (55.2GB/s)(3085GiB/60008msec)
slat (nsec): min=1232, max=1696.9k, avg=2580.28, stdev=2841.95
clat (usec): min=21, max=26808, avg=604.58, stdev=1211.79
lat (usec): min=26, max=26810, avg=607.23, stdev=1211.80
clat percentiles (usec):
| 1.00th=[ 124], 5.00th=[ 157], 10.00th=[ 184], 20.00th=[ 225],
| 30.00th=[ 258], 40.00th=[ 289], 50.00th=[ 318], 60.00th=[ 351],
| 70.00th=[ 388], 80.00th=[ 437], 90.00th=[ 586], 95.00th=[ 2769],
| 99.00th=[ 6587], 99.50th=[ 9372], 99.90th=[12649], 99.95th=[13829],
| 99.99th=[16712]
bw ( MiB/s): min=32950, max=67704, per=20.46%, avg=52713.11, stdev=106.96, samples=7424
iops : min=8435402, max=17332350, avg=13494532.64, stdev=27383.02, samples=7424
lat (usec) : 50=0.01%, 100=0.16%, 250=27.38%, 500=59.09%, 750=4.93%
lat (usec) : 1000=0.30%
lat (msec) : 2=0.60%, 4=5.67%, 10=1.47%, 20=0.39%, 50=0.01%
cpu : usr=14.86%, sys=45.29%, ctx=36050249, majf=0, minf=10046
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=808781317,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
socket0-md: (groupid=2, jobs=64): err= 0: pid=18479: Tue Jul 27 20:00:10 2021
read: IOPS=1263k, BW=4934MiB/s (5174MB/s)(289GiB/60001msec)
slat (nsec): min=1512, max=48037k, avg=49957.85, stdev=33615.19
clat (usec): min=176, max=51614, avg=6432.56, stdev=410.54
lat (usec): min=178, max=51639, avg=6482.58, stdev=412.23
clat percentiles (usec):
| 1.00th=[ 6128], 5.00th=[ 6259], 10.00th=[ 6325], 20.00th=[ 6325],
| 30.00th=[ 6390], 40.00th=[ 6390], 50.00th=[ 6456], 60.00th=[ 6456],
| 70.00th=[ 6521], 80.00th=[ 6521], 90.00th=[ 6587], 95.00th=[ 6587],
| 99.00th=[ 6652], 99.50th=[ 6718], 99.90th=[ 7635], 99.95th=[16909],
| 99.99th=[18220]
bw ( MiB/s): min= 4582, max= 5934, per=100.00%, avg=4938.25, stdev= 2.07, samples=7616
iops : min=1173219, max=1519297, avg=1264175.97, stdev=528.77, samples=7616
lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.34%, 10=99.57%, 20=0.08%, 50=0.01%
lat (msec) : 100=0.01%
cpu : usr=1.23%, sys=95.69%, ctx=2557, majf=0, minf=9064
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=75789817,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
socket1-md: (groupid=3, jobs=64): err= 0: pid=18543: Tue Jul 27 20:00:10 2021
read: IOPS=1071k, BW=4183MiB/s (4386MB/s)(245GiB/60002msec)
slat (nsec): min=1563, max=14080k, avg=59051.10, stdev=22401.39
clat (usec): min=179, max=20799, avg=7588.23, stdev=303.92
lat (usec): min=211, max=20853, avg=7647.34, stdev=305.26
clat percentiles (usec):
| 1.00th=[ 7111], 5.00th=[ 7373], 10.00th=[ 7439], 20.00th=[ 7504],
| 30.00th=[ 7504], 40.00th=[ 7570], 50.00th=[ 7570], 60.00th=[ 7635],
| 70.00th=[ 7635], 80.00th=[ 7701], 90.00th=[ 7767], 95.00th=[ 7767],
| 99.00th=[ 7898], 99.50th=[ 7898], 99.90th=[ 8586], 99.95th=[13304],
| 99.99th=[19006]
bw ( MiB/s): min= 3955, max= 4642, per=100.00%, avg=4186.20, stdev= 0.98, samples=7616
iops : min=1012714, max=1188416, avg=1071653.68, stdev=251.68, samples=7616
lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=99.94%, 20=0.05%, 50=0.01%
cpu : usr=1.06%, sys=95.70%, ctx=1980, majf=0, minf=9030
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=64246431,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
READ: bw=60.8GiB/s (65.3GB/s), 60.8GiB/s-60.8GiB/s (65.3GB/s-65.3GB/s), io=3651GiB (3920GB), run=60003-60003msec
Run status group 1 (all jobs):
READ: bw=51.4GiB/s (55.2GB/s), 51.4GiB/s-51.4GiB/s (55.2GB/s-55.2GB/s), io=3085GiB (3313GB), run=60008-60008msec
Run status group 2 (all jobs):
READ: bw=4934MiB/s (5174MB/s), 4934MiB/s-4934MiB/s (5174MB/s-5174MB/s), io=289GiB (310GB), run=60001-60001msec
Run status group 3 (all jobs):
READ: bw=4183MiB/s (4386MB/s), 4183MiB/s-4183MiB/s (4386MB/s-4386MB/s), io=245GiB (263GB), run=60002-60002msec
Disk stats (read/write):
nvme0n1: ios=79463384/0, merge=0/0, ticks=25148472/0, in_queue=25148472, util=98.78%
nvme1n1: ios=79463574/0, merge=0/0, ticks=25224784/0, in_queue=25224784, util=98.87%
nvme2n1: ios=79463699/0, merge=0/0, ticks=25305193/0, in_queue=25305193, util=98.96%
nvme3n1: ios=79463925/0, merge=0/0, ticks=25234093/0, in_queue=25234093, util=99.00%
nvme4n1: ios=79464135/0, merge=0/0, ticks=25396547/0, in_queue=25396547, util=99.06%
nvme5n1: ios=79464346/0, merge=0/0, ticks=25393624/0, in_queue=25393624, util=99.10%
nvme6n1: ios=79464535/0, merge=0/0, ticks=25330700/0, in_queue=25330700, util=99.19%
nvme7n1: ios=79464721/0, merge=0/0, ticks=25349171/0, in_queue=25349171, util=99.24%
nvme8n1: ios=79464029/0, merge=0/0, ticks=59063115/0, in_queue=59063115, util=99.32%
nvme9n1: ios=79464120/0, merge=0/0, ticks=59023913/0, in_queue=59023913, util=99.33%
nvme10n1: ios=79464799/0, merge=0/0, ticks=59136926/0, in_queue=59136927, util=99.39%
nvme11n1: ios=79465392/0, merge=0/0, ticks=59091104/0, in_queue=59091104, util=99.51%
nvme12n1: ios=67137057/0, merge=0/0, ticks=18685135/0, in_queue=18685136, util=99.60%
nvme13n1: ios=67137217/0, merge=0/0, ticks=18638940/0, in_queue=18638940, util=99.76%
nvme14n1: ios=67137341/0, merge=0/0, ticks=18663275/0, in_queue=18663275, util=99.70%
nvme15n1: ios=67137620/0, merge=0/0, ticks=18629947/0, in_queue=18629948, util=99.77%
nvme17n1: ios=67137778/0, merge=0/0, ticks=18709586/0, in_queue=18709585, util=99.80%
nvme18n1: ios=67137952/0, merge=0/0, ticks=18591798/0, in_queue=18591797, util=99.72%
nvme19n1: ios=67138199/0, merge=0/0, ticks=18669545/0, in_queue=18669545, util=99.86%
nvme20n1: ios=67138378/0, merge=0/0, ticks=18600128/0, in_queue=18600128, util=99.89%
nvme21n1: ios=67138562/0, merge=0/0, ticks=18720763/0, in_queue=18720763, util=100.00%
nvme22n1: ios=67138772/0, merge=0/0, ticks=18659716/0, in_queue=18659716, util=100.00%
nvme23n1: ios=67138982/0, merge=0/0, ticks=27862395/0, in_queue=27862395, util=100.00%
nvme24n1: ios=67134934/0, merge=0/0, ticks=241977879/0, in_queue=241977879, util=100.00%
md0: ios=75701982/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
md1: ios=64175011/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
I'm used to tuning interrupts, so here are the interrupts during the hero portion of the fio run, followed by the mdraid portion. Without polling they are just well-balanced IRQs across the different NVMe MQs.
[root@<server> jim]# ./top-irq.pl -k 1
reporting top 10 every 6 secs subject to thresh=10 kernel=1
CAL 532284 CPU146 Function call interrupts
CAL 529615 CPU154 Function call interrupts
CAL 526198 CPU162 Function call interrupts
CAL 524012 CPU142 Function call interrupts
CAL 521467 CPU174 Function call interrupts
CAL 520821 CPU178 Function call interrupts
CAL 518798 CPU176 Function call interrupts
CAL 518244 CPU166 Function call interrupts
CAL 517524 CPU180 Function call interrupts
CAL 514563 CPU136 Function call interrupts
reported top 10 (of 1885)
reported interrupts = 5223526 870587.7 per sec 6.8% of all interrupts
^C
[root@<server> jim]# !!
./top-irq.pl -k 1
reporting top 10 every 6 secs subject to thresh=10 kernel=1
CAL 63759 CPU15 Function call interrupts
CAL 63664 CPU178 Function call interrupts
CAL 63428 CPU142 Function call interrupts
CAL 63382 CPU51 Function call interrupts
CAL 63285 CPU140 Function call interrupts
CAL 63068 CPU150 Function call interrupts
CAL 63017 CPU148 Function call interrupts
CAL 62984 CPU144 Function call interrupts
CAL 62842 CPU25 Function call interrupts
CAL 62835 CPU37 Function call interrupts
reported top 10 (of 1885)
reported interrupts = 632264 105377.3 per sec 4.0% of all interrupts
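top-irq.pl is a local script; a rough one-shot equivalent for the CAL row (function-call IPIs), assuming the standard /proc/interrupts layout, is below. It prints counts since boot, whereas the script reports 6-second deltas:
awk '$1 == "CAL:" { for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) print $i, "CPU" (i - 2) }' \
    /proc/interrupts | sort -rn | head -10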
Lastly, I can't make md0 and md1 each get ~2M IOPS at the same time. Sometimes the NUMA0 md is the faster one, sometimes the NUMA1 md is - I think there might be some sort of bottleneck/race somewhere. It stays that way until I stop and reassemble the arrays, and then it may switch. I haven't troubleshot enough to notice the pattern.
The socket0/socket1 difference in hero numbers (16.0M vs 13.5M) is something I'll have to take up with HPE - maybe there is a card slowing down the drives on socket1.
Any help is greatly appreciated. Criticism will be accepted, and worst case, IF I HAVEN'T MISSED SOMETHING SO UTTERLY SILLY, this becomes a de facto "where to start" for base users like me before the kernel-level experts get involved.
As an FYI, I have booted a 5.13 kernel and started using io_uring - no noticeable difference in md performance on a different server with Gen3 drives. I can raise my "hero numbers" when I have time to play, but right now my job is to get protected IOPS.
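For the io_uring runs, the only required job-file change is the engine; the polling-related extras below are assumptions, not something from the original runs:
[global]
ioengine=io_uring
# optional extras, kernel and fio support permitting:
# hipri            (polled completions)
# registerfiles=1  (pre-register files with the kernel)
# fixedbufs=1      (pre-register IO buffers)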
Jim Finlayson
U.S. Department of Defense
* Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-27 20:32 Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing????? Finlayson, James M CIV (USA)
@ 2021-07-27 21:52 ` Chris Murphy
2021-07-27 22:42 ` Peter Grandi
` (3 subsequent siblings)
4 siblings, 0 replies; 28+ messages in thread
From: Chris Murphy @ 2021-07-27 21:52 UTC (permalink / raw)
To: linux-raid; +Cc: Finlayson, James M CIV (USA)
On Tue, Jul 27, 2021 at 2:40 PM Finlayson, James M CIV (USA)
<james.m.finlayson4.civ@mail.mil> wrote:
>
> [root@<server> <server>]# cat /etc/redhat-release
> Red Hat Enterprise Linux release 8.4 (Ootpa)
> [root@<server> <server>]# uname -r
> 4.18.0-305.el8.x86_64
I think you'll get a better response by opening a support ticket with
your distro. That's a distro kernel, and upstream has pretty much let
that kernel version set sail a long time ago; upstream is mainly
concerned with linux-next, mainline, and stable kernels. You could
retest with kernel-ml from elrepo.org; 5.13.5 has been up there for a
couple of days.
--
Chris Murphy
* Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-27 20:32 Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing????? Finlayson, James M CIV (USA)
2021-07-27 21:52 ` Chris Murphy
@ 2021-07-27 22:42 ` Peter Grandi
2021-07-28 10:31 ` Matt Wallis
` (2 subsequent siblings)
4 siblings, 0 replies; 28+ messages in thread
From: Peter Grandi @ 2021-07-27 22:42 UTC (permalink / raw)
To: list Linux RAID
[...]
> Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
> nvme0n1 1317510.00 0.00 5270044.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00 411.95 4.00 0.00 0.00 100.40
[...]
> Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
> nvme0n1 114589.00 0.00 458356.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.54 4.00 0.00 0.01 100.00
The obvious difference is the factor of 10 in "aqu-sz", which
corresponds to the factor of 10 in "r/s" and "rkB/s".
I have noticed that MD RAID does some weird things to the
queueing - it is not a "normal" block device - and this often
creates oddities (the same happens with DM/LVM2).
Try creating a filesystem on top of 'md0' and 'md1' and testing
that; things may be quite different.
* Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-27 20:32 Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing????? Finlayson, James M CIV (USA)
2021-07-27 21:52 ` Chris Murphy
2021-07-27 22:42 ` Peter Grandi
@ 2021-07-28 10:31 ` Matt Wallis
2021-07-28 10:43 ` [Non-DoD Source] " Finlayson, James M CIV (USA)
2021-08-01 11:21 ` Gal Ofri
[not found] ` <AS8PR04MB799205817C4647DAC740DE9A91EA9@AS8PR04MB7992.eurprd04.prod.outlook.com>
4 siblings, 1 reply; 28+ messages in thread
From: Matt Wallis @ 2021-07-28 10:31 UTC (permalink / raw)
To: Finlayson, James M CIV (USA); +Cc: linux-raid
Hi Jim,
> On 28 Jul 2021, at 06:32, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>
> Sorry, this will be a long email with everything I find to be relevant, but I can get over 110GB/s of 4kB random reads from individual NVMe SSDs, but I'm at a loss why mdraid can only do a very small fraction of it. I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful. This is everything I do to a server to make the I/O crank.....My role is that of a lab researcher/resident expert/consultant. I'm just stumped why I can't do better. If there is a fine manual that somebody can point me to, I'm happy to read it…
I am probably going to get corrected on some, if not all, of this, but from what I understand, and from my own little experiments on a similar Intel-based system…
1. NVMe is stupid fast; you need a good chunk of CPU performance to max it out.
2. Most block IO in the kernel is limited in terms of threading; it may even be essentially single-threaded. (This is where I will get corrected.)
3. AFAICT this includes mdraid; there's a single thread per RAID device handling all the RAID calculations (mdX_raid6).
What I did to get IOPS up in a system with 24 NVMe drives, split into 12 per NUMA domain (see the sketch below):
1. Create 8 partitions on each drive (this may be overkill; I just started here for some reason).
2. Create 8 RAID6 arrays with 1 partition per drive.
3. Use LVM to create a single striped logical volume over all 8 RAID arrays - RAID 0+6, as it were.
You now have an LVM thread that is basically doing nothing more than chunking the data as it comes in, then sending the chunks to 8 separate RAID devices, each with their own threads, buffers, queues, etc., all of which can be spread over more cores.
I saw a significant (for me, significant is >20%) increase in IOPS doing this.
You still have RAID6 protection, but you might want to write a couple of scripts to help you manage the arrays, because a single failed drive now needs to be removed from 8 RAID arrays.
There's not a lot of capacity lost doing this; I'm pretty sure I lost less than 100MB to the partitions and the RAID overhead.
You would never consider this on spinning disk, of course - it's already way too slow and you'd just make it slower. NVMe, as you noticed, has the IOPS to spare, so I'm pretty sure it's just that we're not able to get the data to it fast enough.
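For concreteness, a sketch of steps 1-3 (a reconstruction, not Matt's exact commands; 12 drives in one NUMA domain, with names and sizes illustrative):
drives=(/dev/nvme{0..11}n1)              # the 12 drives in one NUMA domain
# 1. eight roughly equal GPT partitions per drive
for d in "${drives[@]}"; do
    parted -s "$d" mklabel gpt
    for p in $(seq 1 8); do
        parted -s "$d" mkpart "part$p" "$(( (p - 1) * 12 ))%" "$(( p * 12 ))%"
    done
done
# 2. one RAID6 array per partition index, across all 12 drives
for p in $(seq 1 8); do
    mdadm --create "/dev/md$((100 + p))" --level=6 --raid-devices=12 \
          "${drives[@]/%/p$p}"
done
# 3. a single striped LV across the eight arrays (RAID 0+6)
pvcreate /dev/md10{1..8}
vgcreate vg_nvme /dev/md10{1..8}
lvcreate -n lv_nvme -i 8 -I 128 -l 100%FREE vg_nvme
lvcreate -i 8 stripes the LV across the eight arrays; the 128K stripe size (-I 128) is an assumption chosen to match the md chunk size used elsewhere in this thread.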
Matt
* RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-28 10:31 ` Matt Wallis
@ 2021-07-28 10:43 ` Finlayson, James M CIV (USA)
2021-07-29 0:54 ` [Non-DoD Source] " Matt Wallis
0 siblings, 1 reply; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-07-28 10:43 UTC (permalink / raw)
To: 'Matt Wallis'
Cc: 'linux-raid@vger.kernel.org', Finlayson, James M CIV (USA)
Matt,
I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about long-term sustainability. As a researcher I can do these cool science experiments, but I still have to hand designs to sustainment folks. I was also running into an issue with doing an mdraid RAID0 on top of the RAID6s (so I could put one xfs filesystem on top of each NUMA node's drives): the final RAID0 stripe over all of the RAID6s couldn't generate the queue depth needed. We even recompiled the kernel to change the mdraid nr_requests max from 128 to 1023.
I will have to try the LVM experiment. I'm an LVM neophyte, so it might take me the rest of today/tomorrow to get new results, as I tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time also. One might be able to argue that configuration isn't too much of a "Frankenstein's monster" for me to hand it off.
Thanks,
Jim
-----Original Message-----
From: Matt Wallis <mattw@madmonks.org>
Sent: Wednesday, July 28, 2021 6:32 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: linux-raid@vger.kernel.org
Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
Hi Jim,
> On 28 Jul 2021, at 06:32, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>
> Sorry, this will be a long email with everything I find to be relevant, but I can get over 110GB/s of 4kB random reads from individual NVMe SSDs, but I'm at a loss why mdraid can only do a very small fraction of it. I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful. This is everything I do to a server to make the I/O crank.....My role is that of a lab researcher/resident expert/consultant. I'm just stumped why I can't do better. If there is a fine manual that somebody can point me to, I'm happy to read it…
I am probably going to get corrected on some if not all of this, but from what I understand, and from my own little experiments on a similar Intel based system…
1. NVMe is stupid fast, you need a good chunk of CPU performance to max it out.
2. Most block IO in the kernel is limited in terms of threading, it may even be essentially single threaded. (This is where I will get corrected)
3. AFAICT, this includes mdraid, there’s a single thread per RAID device handling all the RAID calculations. (mdX_raid6)
What I did to get IOPs up in a system with 24 NVMe, split up into 12 per NUMA domain.
1. Create 8 partitions on each drive (this may be overkill, I just started here for some reason)
2. Create 8 RAID6 arrays with 1 partition per drive.
3. Use LVM to create a single striped logical volume over all 8 RAID volumes. RAID 0+6 as it were.
You now have an LVM thread that is basically doing nothing more than chunking the data as it comes in, then, sending the chunks to 8 separate RAID devices, each with their own threads, buffers, queues etc, all of which can be spread over more cores.
I saw a significant (for me, significant is >20%) increase in IOPs doing this.
You still have RAID6 protection, but you might want to write a couple of scripts to help you manage the arrays, because a single failed drive now needs to be removed from 8 RAID volumes.
There’s not a lot of capacity lost doing this, pretty sure I lost less than 100MB to the partitions and the RAID overhead.
You would never consider this on spinning disk of course, way to slow and you’re just going to make it slower, NVMe as you noticed has the IOPs to spare, so I’m pretty sure it’s just that we’re not able to get the data to it fast enough.
Matt
* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-28 10:43 ` [Non-DoD Source] " Finlayson, James M CIV (USA)
@ 2021-07-29 0:54 ` Matt Wallis
2021-07-29 16:35 ` Wols Lists
2021-07-29 22:05 ` Finlayson, James M CIV (USA)
0 siblings, 2 replies; 28+ messages in thread
From: Matt Wallis @ 2021-07-29 0:54 UTC (permalink / raw)
To: Finlayson, James M CIV (USA); +Cc: linux-raid
Hi Jim,
Totally get the Frankenstein’s monster aspect; I try not to build those where I can, but at the moment I don’t think there’s much that can be done about it.
Not sure if LVM is better than MDRAID 0; it just gives you more control over the volumes that can be created, instead of having it all in one big chunk. If you just need one big chunk, then MDRAID 0 is probably fine.
I think if you create a couple of scripts that allow the admin to fail a drive out of all the arrays it’s in at once, then it's not much worse than managing an MDRAID normally is.
Matt.
> On 28 Jul 2021, at 20:43, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>
> Matt,
> I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about our sustainability long term. As a researcher, I can do these cool science experiments, but I still have to hand designs to sustainment folks. I was also running into an issue of doing a mdraid RAID0 on top of the RAID6's so I could toss one xfs file system on top each of the numa node's drives and the last RAID0 stripe of all of the RAID6's couldn't generate the queue depth needed. We even recompiled the kernel to change the mdraid nr_request max from 128 to 1023.
>
> I will have to try the LVM experiment. I'm an LVM neophyte, so it might take me the rest of today/tomorrow to get new results as I tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time also. Once might be able to argue that configuration isn't too much of a "Frankenstein's monster" for me to hand it off.
>
> Thanks,
> Jim
>
>
>
>
> -----Original Message-----
> From: Matt Wallis <mattw@madmonks.org>
> Sent: Wednesday, July 28, 2021 6:32 AM
> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> Cc: linux-raid@vger.kernel.org
> Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
>
> Hi Jim,
>
>> On 28 Jul 2021, at 06:32, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>>
>> Sorry, this will be a long email with everything I find to be relevant, but I can get over 110GB/s of 4kB random reads from individual NVMe SSDs, but I'm at a loss why mdraid can only do a very small fraction of it. I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful. This is everything I do to a server to make the I/O crank.....My role is that of a lab researcher/resident expert/consultant. I'm just stumped why I can't do better. If there is a fine manual that somebody can point me to, I'm happy to read it…
>
> I am probably going to get corrected on some if not all of this, but from what I understand, and from my own little experiments on a similar Intel based system…
> 1. NVMe is stupid fast, you need a good chunk of CPU performance to max it out.
> 2. Most block IO in the kernel is limited in terms of threading, it may even be essentially single threaded. (This is where I will get corrected)
> 3. AFAICT, this includes mdraid, there’s a single thread per RAID device handling all the RAID calculations. (mdX_raid6)
>
> What I did to get IOPs up in a system with 24 NVMe, split up into 12 per NUMA domain.
> 1. Create 8 partitions on each drive (this may be overkill, I just started here for some reason)
> 2. Create 8 RAID6 arrays with 1 partition per drive.
> 3. Use LVM to create a single striped logical volume over all 8 RAID volumes. RAID 0+6 as it were.
>
> You now have an LVM thread that is basically doing nothing more than chunking the data as it comes in, then, sending the chunks to 8 separate RAID devices, each with their own threads, buffers, queues etc, all of which can be spread over more cores.
>
> I saw a significant (for me, significant is >20%) increase in IOPs doing this.
>
> You still have RAID6 protection, but you might want to write a couple of scripts to help you manage the arrays, because a single failed drive now needs to be removed from 8 RAID volumes.
>
> There’s not a lot of capacity lost doing this, pretty sure I lost less than 100MB to the partitions and the RAID overhead.
>
> You would never consider this on spinning disk of course, way to slow and you’re just going to make it slower, NVMe as you noticed has the IOPs to spare, so I’m pretty sure it’s just that we’re not able to get the data to it fast enough.
>
> Matt
* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-29 0:54 ` [Non-DoD Source] " Matt Wallis
@ 2021-07-29 16:35 ` Wols Lists
2021-07-29 18:12 ` Finlayson, James M CIV (USA)
2021-07-29 22:05 ` Finlayson, James M CIV (USA)
1 sibling, 1 reply; 28+ messages in thread
From: Wols Lists @ 2021-07-29 16:35 UTC (permalink / raw)
To: Matt Wallis, Finlayson, James M CIV (USA); +Cc: linux-raid
On 29/07/21 01:54, Matt Wallis wrote:
> Hi Jim,
>
> Totally get the Frankenstein’s monster aspect, I try not to build those where I can, but at the moment I don’t think there’s much that can be done about it.
> Not sure if LVM is better than MDRAID 0, it just gives you more control over the volumes that can be created, instead of having it all in one big chunk. If you just need one big chunk, then MDRAID 0 is probably fine.
>
Sticking raid 0 on top of raid 6 sounds like an extremely weird thing
to do. What I guess you might want to do instead is write a partition
table to the raid-6? That's perfectly normal, if, imho, a bit unusual.
And LVM would be MUCH better than raid-0, I'm sure, because it
addresses this very issue by design rather than by accident.
> I think if you can create a couple of scripts that allows the admin to fail a drive out of all the arrays that it’s in at once, then it's not that much worse than managing an MDRAID is normally.
Is that wise? KISS.
>
> Matt.
>
>> On 28 Jul 2021, at 20:43, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>>
>> Matt,
>> I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about our sustainability long term. As a researcher, I can do these cool science experiments, but I still have to hand designs to sustainment folks. I was also running into an issue of doing a mdraid RAID0 on top of the RAID6's so I could toss one xfs file system on top each of the numa node's drives and the last RAID0 stripe of all of the RAID6's couldn't generate the queue depth needed. We even recompiled the kernel to change the mdraid nr_request max from 128 to 1023.
>>
>> I will have to try the LVM experiment. I'm an LVM neophyte, so it might take me the rest of today/tomorrow to get new results as I tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time also. Once might be able to argue that configuration isn't too much of a "Frankenstein's monster" for me to hand it off.
>>
Do. If it solves what you want, then it's worth it. I'm moving my stuff
over to LVM.
To throw something else into the mix: you've gone for raid 6, which
lets you lose two drives, or survive corruption on one drive. Do you
need the two-drive redundancy? The calculations are a lot more
expensive than raid-5's if you're worried about write speed. I don't
know its impact, but I'm playing with dm-integrity, which provides
some protection against corruption.
>> Thanks,
>> Jim
>>
Cheers,
Wol
* RE: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-29 16:35 ` Wols Lists
@ 2021-07-29 18:12 ` Finlayson, James M CIV (USA)
0 siblings, 0 replies; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-07-29 18:12 UTC (permalink / raw)
To: 'Wols Lists', 'Matt Wallis'
Cc: 'linux-raid@vger.kernel.org'
Hi,
Actually, the RAID5/RAID6 mdraid implementations can't support the IOPS or the queue depths required from a single basic mdraid raid5/raid6 LUN. The partitions are just there to create more mdraid stripes, to allow for more threads to do the RAID parity work and to be able to issue more I/Os to the entirety of the NVMe SSDs through mdraid. Ultimately, I need one volume per NUMA domain comprised of RAIDed NVMe SSDs. We're just exploring creative workarounds to the NVMe mdraid IOPS issues to get the most IOPS out of a collection of SSDs. I still have to put an xfs filesystem on each volume for something useful to occur.
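The thread-per-array behavior is easy to see - each raid456 array gets one mdX_raidY kernel thread, plus optional group helper threads. A quick check (not from the original post):
# one mdX_raidY thread per array; more partitions => more arrays => more threads
ps -eo pid,psr,comm | grep -E 'md[0-9]+_raid[456]'
# per-array helper threads and stripe cache, as set in the udev rules earlier
cat /sys/block/md0/md/group_thread_cnt /sys/block/md0/md/stripe_cache_size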
Thanks,
Jim
-----Original Message-----
From: Wols Lists <antlists@youngman.org.uk>
Sent: Thursday, July 29, 2021 12:35 PM
To: Matt Wallis <mattw@madmonks.org>; Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: linux-raid@vger.kernel.org
Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
On 29/07/21 01:54, Matt Wallis wrote:
> Hi Jim,
>
> Totally get the Frankenstein’s monster aspect, I try not to build those where I can, but at the moment I don’t think there’s much that can be done about it.
> Not sure if LVM is better than MDRAID 0, it just gives you more control over the volumes that can be created, instead of having it all in one big chunk. If you just need one big chunk, then MDRAID 0 is probably fine.
>
sticking raid 0 on top of raid 6 sounds an extremely weird thing to do.
What I guess you might be wanting to do instead is write a partition table to the raid-6? That's perfectly normal if, imho, a bit unusual?
And LVM would be MUCH better than raid-0, I'm sure, because it addresses this very issue by design, rather than by accident.
> I think if you can create a couple of scripts that allows the admin to fail a drive out of all the arrays that it’s in at once, then it's not that much worse than managing an MDRAID is normally.
Is that wise? KISS.
>
> Matt.
>
>> On 28 Jul 2021, at 20:43, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>>
>> Matt,
>> I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about our sustainability long term. As a researcher, I can do these cool science experiments, but I still have to hand designs to sustainment folks. I was also running into an issue of doing a mdraid RAID0 on top of the RAID6's so I could toss one xfs file system on top each of the numa node's drives and the last RAID0 stripe of all of the RAID6's couldn't generate the queue depth needed. We even recompiled the kernel to change the mdraid nr_request max from 128 to 1023.
>>
>> I will have to try the LVM experiment. I'm an LVM neophyte, so it might take me the rest of today/tomorrow to get new results as I tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time also. Once might be able to argue that configuration isn't too much of a "Frankenstein's monster" for me to hand it off.
>>
Do. If it solves what you want, then it's worth it. I'm moving my stuff over to LVM.
To throw something else into the mix, you've gone for raid 6, which enables you to lose two drives, or corrupt one drive. Do you need the two-drive redundancy? The calculations are a lot more expensive than
raid-5 if you're worried over write speed. I don't know the impact of it but I'm playing with dm-integrity which provides some protection against corruption.
>> Thanks,
>> Jim
>>
Cheers,
Wol
* RE: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-29 0:54 ` [Non-DoD Source] " Matt Wallis
2021-07-29 16:35 ` Wols Lists
@ 2021-07-29 22:05 ` Finlayson, James M CIV (USA)
2021-07-30 8:28 ` Matt Wallis
1 sibling, 1 reply; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-07-29 22:05 UTC (permalink / raw)
To: 'Matt Wallis'
Cc: 'linux-raid@vger.kernel.org', Finlayson, James M CIV (USA)
Matt,
Thank you for the tip. I have put 32 partitions on each of my NVMe drives, made 32 raid5 stripes, and then made an LVM volume from the 32 raid5 stripes - one volume per socket, spanning the 10 NVMe drives on that socket. I believe I have successfully harnessed "alien technology" (inside joke) to create a Frankenstein's monster. The results are substantially better than the RAID0 stripe over partitioned md's I tried in the past. I whipped all of this together in two 15-minute sessions (last night and just now), so I might run some more extensive tests when I have the cycles. I didn't intend to leave the thread hanging.
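The per-socket assembly, roughly - a sketch assuming socket 0's 32 arrays are /dev/md100 through /dev/md131 (actual names may differ):
pvcreate /dev/md1{00..31}
vgcreate vg_socket0 /dev/md1{00..31}
# one 32-way striped LV per socket; the 128K stripe size matches the md chunk
lvcreate -n lv_socket0 -i 32 -I 128 -l 100%FREE vg_socket0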
BLUF - detailed fio output below:
9 drives per socket, raw:
socket0, 9 drives, raw 4K random reads: 13.6M IOPS; socket1, 9 drives: 12.3M IOPS
%Cpu(s): 4.4 us, 25.6 sy, 0.0 ni, 56.7 id, 0.0 wa, 13.1 hi, 0.2 si, 0.0 st
9 data drives per socket, RAID5/LVM (9+1):
socket0, 4K random reads through the LV: 8.57M IOPS; socket1: 8.57M IOPS
%Cpu(s): 7.0 us, 22.3 sy, 0.0 ni, 58.4 id, 0.0 wa, 12.1 hi, 0.2 si, 0.0 st
All,
I intend to test the 4.15 kernel patch next week. My SA would prefer that the patch had already made it into the kernel-ml stream, so he could install an rpm, but I'll get him to build the kernel if need be.
If the kernel patch doesn't alleviate the issues, I have a strong desire to see mdraid made better.
Quick fio results:
socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket0-lv: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1-lv: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.26
Starting 256 processes
socket0: (groupid=0, jobs=64): err= 0: pid=64032: Thu Jul 29 21:48:32 2021
read: IOPS=13.6M, BW=51.7GiB/s (55.6GB/s)(3105GiB/60003msec)
slat (nsec): min=1292, max=1376.7k, avg=2545.32, stdev=2696.74
clat (usec): min=36, max=71580, avg=600.68, stdev=361.36
lat (usec): min=38, max=71616, avg=603.30, stdev=361.38
clat percentiles (usec):
| 1.00th=[ 169], 5.00th=[ 231], 10.00th=[ 277], 20.00th=[ 347],
| 30.00th=[ 404], 40.00th=[ 457], 50.00th=[ 519], 60.00th=[ 594],
| 70.00th=[ 676], 80.00th=[ 791], 90.00th=[ 996], 95.00th=[ 1205],
| 99.00th=[ 1909], 99.50th=[ 2409], 99.90th=[ 3556], 99.95th=[ 3884],
| 99.99th=[ 5538]
bw ( MiB/s): min=39960, max=56660, per=20.94%, avg=53040.69, stdev=49.61, samples=7488
iops : min=10229946, max=14504941, avg=13578391.80, stdev=12699.61, samples=7488
lat (usec) : 50=0.01%, 100=0.01%, 250=6.83%, 500=40.06%, 750=29.92%
lat (usec) : 1000=13.42%
lat (msec) : 2=8.91%, 4=0.82%, 10=0.04%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%
cpu : usr=14.82%, sys=46.57%, ctx=35564249, majf=0, minf=9754
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=813909201,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
socket1: (groupid=1, jobs=64): err= 0: pid=64096: Thu Jul 29 21:48:32 2021
read: IOPS=12.3M, BW=46.9GiB/s (50.3GB/s)(2812GiB/60003msec)
slat (nsec): min=1292, max=1672.2k, avg=2672.21, stdev=2742.06
clat (usec): min=25, max=73526, avg=663.35, stdev=611.06
lat (usec): min=28, max=73545, avg=666.09, stdev=611.08
clat percentiles (usec):
| 1.00th=[ 143], 5.00th=[ 190], 10.00th=[ 227], 20.00th=[ 285],
| 30.00th=[ 338], 40.00th=[ 400], 50.00th=[ 478], 60.00th=[ 586],
| 70.00th=[ 725], 80.00th=[ 930], 90.00th=[ 1254], 95.00th=[ 1614],
| 99.00th=[ 3490], 99.50th=[ 4146], 99.90th=[ 6390], 99.95th=[ 6980],
| 99.99th=[ 8356]
bw ( MiB/s): min=28962, max=55326, per=12.70%, avg=48036.10, stdev=96.75, samples=7488
iops : min=7414327, max=14163615, avg=12297214.98, stdev=24768.82, samples=7488
lat (usec) : 50=0.01%, 100=0.03%, 250=13.75%, 500=38.84%, 750=18.71%
lat (usec) : 1000=11.55%
lat (msec) : 2=14.30%, 4=2.23%, 10=0.60%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%
cpu : usr=13.41%, sys=44.44%, ctx=39379711, majf=0, minf=9982
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=737168913,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
socket0-lv: (groupid=2, jobs=64): err= 0: pid=64166: Thu Jul 29 21:48:32 2021
read: IOPS=8570k, BW=32.7GiB/s (35.1GB/s)(1962GiB/60006msec)
slat (nsec): min=1873, max=11085k, avg=4694.47, stdev=4825.39
clat (usec): min=24, max=21739, avg=950.52, stdev=948.83
lat (usec): min=51, max=21743, avg=955.29, stdev=948.88
clat percentiles (usec):
| 1.00th=[ 155], 5.00th=[ 217], 10.00th=[ 265], 20.00th=[ 338],
| 30.00th=[ 404], 40.00th=[ 486], 50.00th=[ 594], 60.00th=[ 766],
| 70.00th=[ 1029], 80.00th=[ 1418], 90.00th=[ 2089], 95.00th=[ 2737],
| 99.00th=[ 4490], 99.50th=[ 5669], 99.90th=[ 8586], 99.95th=[ 9896],
| 99.99th=[12125]
bw ( MiB/s): min=24657, max=37453, per=-25.17%, avg=33516.00, stdev=35.32, samples=7616
iops : min=6312326, max=9588007, avg=8580076.03, stdev=9041.88, samples=7616
lat (usec) : 50=0.01%, 100=0.01%, 250=8.40%, 500=33.21%, 750=17.54%
lat (usec) : 1000=9.89%
lat (msec) : 2=19.93%, 4=9.58%, 10=1.40%, 20=0.05%, 50=0.01%
cpu : usr=9.01%, sys=51.48%, ctx=27829950, majf=0, minf=9028
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=514275323,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
socket1-lv: (groupid=3, jobs=64): err= 0: pid=64230: Thu Jul 29 21:48:32 2021
read: IOPS=8571k, BW=32.7GiB/s (35.1GB/s)(1962GiB/60006msec)
slat (nsec): min=1823, max=14362k, avg=4809.30, stdev=4940.42
clat (usec): min=50, max=22856, avg=950.31, stdev=948.13
lat (usec): min=54, max=22860, avg=955.19, stdev=948.19
clat percentiles (usec):
| 1.00th=[ 157], 5.00th=[ 221], 10.00th=[ 269], 20.00th=[ 343],
| 30.00th=[ 412], 40.00th=[ 490], 50.00th=[ 603], 60.00th=[ 766],
| 70.00th=[ 1029], 80.00th=[ 1418], 90.00th=[ 2089], 95.00th=[ 2737],
| 99.00th=[ 4293], 99.50th=[ 5604], 99.90th=[ 9503], 99.95th=[10683],
| 99.99th=[12649]
bw ( MiB/s): min=23434, max=36909, per=-25.17%, avg=33517.14, stdev=50.36, samples=7616
iops : min=5999220, max=9448818, avg=8580368.69, stdev=12892.93, samples=7616
lat (usec) : 100=0.01%, 250=7.88%, 500=33.09%, 750=18.09%, 1000=10.02%
lat (msec) : 2=19.91%, 4=9.75%, 10=1.16%, 20=0.08%, 50=0.01%
cpu : usr=9.14%, sys=51.94%, ctx=25524808, majf=0, minf=9037
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=514294010,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
READ: bw=51.7GiB/s (55.6GB/s), 51.7GiB/s-51.7GiB/s (55.6GB/s-55.6GB/s), io=3105GiB (3334GB), run=60003-60003msec
Run status group 1 (all jobs):
READ: bw=46.9GiB/s (50.3GB/s), 46.9GiB/s-46.9GiB/s (50.3GB/s-50.3GB/s), io=2812GiB (3019GB), run=60003-60003msec
Run status group 2 (all jobs):
READ: bw=32.7GiB/s (35.1GB/s), 32.7GiB/s-32.7GiB/s (35.1GB/s-35.1GB/s), io=1962GiB (2106GB), run=60006-60006msec
Run status group 3 (all jobs):
READ: bw=32.7GiB/s (35.1GB/s), 32.7GiB/s-32.7GiB/s (35.1GB/s-35.1GB/s), io=1962GiB (2107GB), run=60006-60006msec
Disk stats (read/write):
nvme0n1: ios=90336694/0, merge=0/0, ticks=45102163/0, in_queue=45102163, util=97.44%
nvme1n1: ios=90337153/0, merge=0/0, ticks=47422886/0, in_queue=47422887, util=97.81%
nvme2n1: ios=90337516/0, merge=0/0, ticks=46419782/0, in_queue=46419782, util=97.95%
nvme3n1: ios=90337843/0, merge=0/0, ticks=46256374/0, in_queue=46256374, util=97.95%
nvme4n1: ios=90337742/0, merge=0/0, ticks=59122226/0, in_queue=59122225, util=98.19%
nvme5n1: ios=90338813/0, merge=0/0, ticks=57811758/0, in_queue=57811758, util=98.33%
nvme6n1: ios=90339194/0, merge=0/0, ticks=57369337/0, in_queue=57369337, util=98.37%
nvme7n1: ios=90339048/0, merge=0/0, ticks=55791076/0, in_queue=55791076, util=98.78%
nvme8n1: ios=90340234/0, merge=0/0, ticks=44977001/0, in_queue=44977001, util=99.01%
nvme12n1: ios=81819608/0, merge=0/0, ticks=26788080/0, in_queue=26788079, util=99.24%
nvme13n1: ios=81819831/0, merge=0/0, ticks=26736682/0, in_queue=26736681, util=99.57%
nvme14n1: ios=81820006/0, merge=0/0, ticks=26772951/0, in_queue=26772951, util=99.67%
nvme15n1: ios=81820215/0, merge=0/0, ticks=26741532/0, in_queue=26741532, util=99.78%
nvme17n1: ios=81819922/0, merge=0/0, ticks=76459192/0, in_queue=76459192, util=99.84%
nvme18n1: ios=81820146/0, merge=0/0, ticks=86756309/0, in_queue=86756309, util=99.82%
nvme19n1: ios=81820481/0, merge=0/0, ticks=75008919/0, in_queue=75008919, util=100.00%
nvme20n1: ios=81819690/0, merge=0/0, ticks=91888274/0, in_queue=91888275, util=100.00%
nvme21n1: ios=81821809/0, merge=0/0, ticks=26653056/0, in_queue=26653057, util=100.00%
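A job file that produces output shaped like the above would look roughly as follows. This is a sketch with hypothetical device paths and CPU sets, not the exact job file used:

[global]
ioengine=libaio
direct=1
rw=randread
bs=4k
iodepth=128
numjobs=64
runtime=60
time_based
group_reporting

[socket0]
cpus_allowed=0-63                    # hypothetical: cores on socket 0
cpus_allowed_policy=split
filename=/dev/nvme0n1:/dev/nvme1n1   # colon-separated list of socket-0 drives

[socket0-lv]
new_group
cpus_allowed=0-63
filename=/dev/socket0vg/socket0lv    # hypothetical LV path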
-----Original Message-----
From: Matt Wallis <mattw@madmonks.org>
Sent: Wednesday, July 28, 2021 8:54 PM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: linux-raid@vger.kernel.org
Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
Hi Jim,
Totally get the Frankenstein’s monster aspect; I try not to build those where I can, but at the moment I don’t think there’s much that can be done about it.
Not sure if LVM is better than MDRAID 0; it just gives you more control over the volumes that can be created, instead of having it all in one big chunk. If you just need one big chunk, then MDRAID 0 is probably fine.
I think if you can create a couple of scripts that allow the admin to fail a drive out of all the arrays it’s in at once, then it's not that much worse than managing a normal MDRAID array.
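Something along these lines might do it, as a sketch only (untested), assuming the drive's partitions follow the usual nvmeXnYpZ naming:

#!/bin/bash
# Fail and remove every partition of one physical NVMe drive from
# whichever md arrays currently hold it.
drive=${1:?usage: $0 nvmeXn1}
for part in /dev/"$drive"p[0-9]*; do
    p=${part##*/}
    # /proc/mdstat lists members as e.g. "md0 : active raid5 nvme3n1p1[3] ..."
    md=$(awk -v p="${p}[" 'index($0, p) { print $1 }' /proc/mdstat)
    [ -n "$md" ] || continue
    mdadm "/dev/$md" --fail "$part" --remove "$part"
done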
Matt.
> On 28 Jul 2021, at 20:43, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>
> Matt,
> I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about our long-term sustainability. As a researcher, I can do these cool science experiments, but I still have to hand designs to sustainment folks. I was also running into an issue doing a mdraid RAID0 on top of the RAID6's so I could toss one xfs file system on top of each NUMA node's drives: the final RAID0 stripe over all of the RAID6's couldn't generate the queue depth needed. We even recompiled the kernel to change the mdraid nr_requests max from 128 to 1023.
>
> I will have to try the LVM experiment. I'm an LVM neophyte, so it might take me the rest of today/tomorrow to get new results; I tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time also. One might be able to argue that configuration isn't too much of a "Frankenstein's monster" for me to hand it off.
>
> Thanks,
> Jim
>
>
>
>
> -----Original Message-----
> From: Matt Wallis <mattw@madmonks.org>
> Sent: Wednesday, July 28, 2021 6:32 AM
> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> Cc: linux-raid@vger.kernel.org
> Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
>
> Hi Jim,
>
>> On 28 Jul 2021, at 06:32, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>>
>> Sorry, this will be a long email with everything I find to be relevant, but I can get over 110GB/s of 4kB random reads from individual NVMe SSDs, but I'm at a loss why mdraid can only do a very small fraction of it. I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful. This is everything I do to a server to make the I/O crank.....My role is that of a lab researcher/resident expert/consultant. I'm just stumped why I can't do better. If there is a fine manual that somebody can point me to, I'm happy to read it…
>
> I am probably going to get corrected on some if not all of this, but from what I understand, and from my own little experiments on a similar Intel-based system…
> 1. NVMe is stupid fast; you need a good chunk of CPU performance to max it out.
> 2. Most block IO in the kernel is limited in terms of threading; it may even be essentially single-threaded. (This is where I will get corrected.)
> 3. AFAICT, this includes mdraid: there’s a single thread per RAID device handling all the RAID calculations (mdX_raid6).
>
> What I did to get IOPS up in a system with 24 NVMe drives, split into 12 per NUMA domain (a command-level sketch follows after this message):
> 1. Create 8 partitions on each drive (this may be overkill, I just started here for some reason).
> 2. Create 8 RAID6 arrays with 1 partition per drive.
> 3. Use LVM to create a single striped logical volume over all 8 RAID volumes. RAID 0+6, as it were.
>
> You now have an LVM thread that is basically doing nothing more than chunking the data as it comes in, then sending the chunks to 8 separate RAID devices, each with their own threads, buffers, queues, etc., all of which can be spread over more cores.
>
> I saw a significant (for me, significant is >20%) increase in IOPS doing this.
>
> You still have RAID6 protection, but you might want to write a couple of scripts to help you manage the arrays, because a single failed drive now needs to be removed from 8 RAID volumes.
>
> There’s not a lot of capacity lost doing this; pretty sure I lost less than 100MB to the partitions and the RAID overhead.
>
> You would never consider this on spinning disk, of course: it’s way too slow and you’re just going to make it slower. NVMe, as you noticed, has the IOPS to spare, so I’m pretty sure it’s just that we’re not able to get the data to it fast enough.
>
> Matt
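In concrete terms, the layout described above can be built roughly like this. It is a hedged sketch with hypothetical device names, using 4 partitions per drive instead of 8 for brevity, not the exact commands used here:

# Partition 12 hypothetical drives into 4 equal slices each
for d in /dev/nvme{0..11}n1; do
    parted -s "$d" mklabel gpt
    for i in 1 2 3 4; do
        parted -s "$d" mkpart "raid$i" "$(( (i-1)*25 ))%" "$(( i*25 ))%"
    done
done
# One RAID6 array per partition index, 12 members each
for i in 1 2 3 4; do
    mdadm --create "/dev/md10$i" --level=6 --raid-devices=12 /dev/nvme{0..11}n1p"$i"
done
# Stripe a single logical volume across the four arrays (the "RAID 0+6" layer)
pvcreate /dev/md10{1..4}
vgcreate vg_nvme /dev/md10{1..4}
lvcreate -i 4 -I 512k -l 100%FREE -n lv_nvme vg_nvme

Each array then runs its own mdX_raid6 thread, which is the point of the exercise; top -H during a run should show the parity work spread across cores instead of pegging one.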
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-29 22:05 ` Finlayson, James M CIV (USA)
@ 2021-07-30 8:28 ` Matt Wallis
2021-07-30 8:45 ` Miao Wang
2021-07-30 9:54 ` Finlayson, James M CIV (USA)
0 siblings, 2 replies; 28+ messages in thread
From: Matt Wallis @ 2021-07-30 8:28 UTC (permalink / raw)
To: Finlayson, James M CIV (USA); +Cc: linux-raid
Hi Jim,
That’s significantly better than I expected; I need to see if I can get someone to send me the box I was using so I can spend some more time on it.
Good luck with the rest of it. The bit I was looking at as a next step was potentially tweaking stripe widths and the like to see how much difference they make on different workloads.
Matt.
> On 30 Jul 2021, at 08:05, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>
> Matt,
> Thank you for the tip. I have put 32 partitions on each of my NVMe drives, made 32 raid5 stripes, and then made an LVM volume from the 32 raid5 stripes, one volume per 10 NVMe drives on each socket. I believe I have successfully harnessed "alien technology" (inside joke) to create a Frankenstein's monster. These results are substantially better than doing a RAID0 stripe over the partitioned md's, as in the past. I whipped all of this together in two 15-minute sessions, last night and just now, so I might run some more extensive tests when I have the cycles. I didn't intend to leave the thread hanging.
> BLUF - fio detailed output below.....
> 9 drives per socket raw
> socket0, 9 drives, raw 4K random reads: 13.6M IOPS; socket1, 9 drives, raw 4K random reads: 12.3M IOPS
> %Cpu(s): 4.4 us, 25.6 sy, 0.0 ni, 56.7 id, 0.0 wa, 13.1 hi, 0.2 si, 0.0 st
>
> 9 data drives per socket, RAID5/LVM (9+1)
> socket0, 9 drives, 4K random reads: 8.57M IOPS; socket1, 9 drives, 4K random reads: 8.57M IOPS
> %Cpu(s): 7.0 us, 22.3 sy, 0.0 ni, 58.4 id, 0.0 wa, 12.1 hi, 0.2 si, 0.0 st
>
>
> All,
> I intend to test the 4.15 kernel patch next week. My SA would prefer that the patch had already made it into the kernel-ml stream so he could install an RPM, but I'll get him to build the kernel if need be.
> If the 4.15 kernel patch doesn't alleviate the issues, I have a strong desire to have mdraid made better.
>
>
> Quick fio results: [snip: full fio output and the earlier quoted messages trimmed; identical to the text earlier in the thread]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-30 8:28 ` Matt Wallis
@ 2021-07-30 8:45 ` Miao Wang
2021-07-30 9:59 ` Finlayson, James M CIV (USA)
2021-07-30 13:17 ` Peter Grandi
2021-07-30 9:54 ` Finlayson, James M CIV (USA)
1 sibling, 2 replies; 28+ messages in thread
From: Miao Wang @ 2021-07-30 8:45 UTC (permalink / raw)
To: Finlayson, James M CIV (USA); +Cc: Matt Wallis, linux-raid
Hi Jim,
Nice to hear about your findings on how to make Linux md work better on fast NVMe drives; I was previously stuck on a similar problem and finally gave up. Since it is very difficult to find such an environment with so many fast NVMe drives, I wonder if you have any interest in ZFS. Maybe you can set up a similar raidz configuration on those drives and see whether its performance is better or worse; a rough sketch follows.
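For example, something along these lines (hypothetical pool and device names; a sketch, not a tuned configuration):

zpool create -o ashift=12 tank raidz /dev/nvme{0..9}n1   # 10-wide raidz, roughly a 9+1 RAID5
zfs set recordsize=4k tank           # match the 4K random-read workload
zfs set primarycache=metadata tank   # optional: keep data reads from churning the ARC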
Cheers,
Miao Wang
> On 30 Jul 2021, at 16:28, Matt Wallis <mattw@madmonks.org> wrote:
> [snip: quoted thread trimmed; identical to the messages above]
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-30 8:45 ` Miao Wang
@ 2021-07-30 9:59 ` Finlayson, James M CIV (USA)
2021-07-30 14:03 ` Doug Ledford
2021-07-30 13:17 ` Peter Grandi
1 sibling, 1 reply; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-07-30 9:59 UTC (permalink / raw)
To: 'Miao Wang'
Cc: 'Matt Wallis', 'linux-raid@vger.kernel.org'
There is interest in ZFS. We're waiting for the direct I/O patches to settle in OpenZFS, because we couldn't find any way to get around the ARC (everything has to touch the ARC), and ZFS spins an entire CPU core or more worrying about which ARC entries it has to evict. I know who is doing the work; once it settles, I'll see if they are willing to publish to zfs-discuss.
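A quick way to watch that ARC housekeeping overhead on an OpenZFS-on-Linux box, as a rough sketch (kstat fields and thread names vary by OpenZFS version):

grep -E '^(size|evict_skip|memory_throttle_count)' /proc/spl/kstat/zfs/arcstats
top -b -H -n 1 | grep -E 'arc_(evict|prune|reclaim)'   # ARC eviction/pruning kernel threads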
-----Original Message-----
From: Miao Wang <shankerwangmiao@gmail.com>
Sent: Friday, July 30, 2021 4:46 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: Matt Wallis <mattw@madmonks.org>; linux-raid@vger.kernel.org
Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
Hi Jim,
[snip: quoted message trimmed; identical to Miao's message above]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-30 9:59 ` Finlayson, James M CIV (USA)
@ 2021-07-30 14:03 ` Doug Ledford
0 siblings, 0 replies; 28+ messages in thread
From: Doug Ledford @ 2021-07-30 14:03 UTC (permalink / raw)
To: Finlayson, James M CIV (USA); +Cc: Miao Wang, Matt Wallis, linux-raid
You can try btrfs in lieu of zfs. As long as metadata is raid1, data
can be raid5/6 and things are ok. The raid5 write issue only applies
to metadata.
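A minimal sketch of that layout, with hypothetical device names:

mkfs.btrfs -f -m raid1 -d raid5 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mount /dev/nvme0n1 /mnt/test    # any member device mounts the whole filesystem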
On Fri, Jul 30, 2021 at 6:09 AM Finlayson, James M CIV (USA)
<james.m.finlayson4.civ@mail.mil> wrote:
>
> There is interest in ZFS. We're waiting for the direct I/O patches to settle in Open ZFS because we couldn't find any way to get around the ARC (everything has to touch the ARC). ZFS spins an entire CPU core or more worrying about which ARC entries it has to evict. I know who is doing the work. Once it settles, I'll see if they are willing to publish to zfs-discuss.
>
> -----Original Message-----
> From: Miao Wang <shankerwangmiao@gmail.com>
> Sent: Friday, July 30, 2021 4:46 AM
> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> Cc: Matt Wallis <mattw@madmonks.org>; linux-raid@vger.kernel.org
> Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
>
> Hi Jim,
>
> Nice to hear about your findings on how to let linux md work better on fast nvme drives, because previously I was also stuck in a similar problem and finally gave up. Since it is very difficult to find such environment with so many fast nvme drives, I wonder if you have any interest in ZFS. Maybe you can set up a similar raidz configuration on those drives and see whether its performance is better or worse.
>
> Cheers,
>
> Miao Wang
>
> > On 30 Jul 2021, at 16:28, Matt Wallis <mattw@madmonks.org> wrote:
> >
> > Hi Jim,
> >
> > That’s significantly better than I expected, I need to see if I can get someone to send me the box I was using so I can spend some more time on it.
> >
> > Good luck with the rest of it, the bit I was looking for as a next step was going to be potentially tweaking stripe widths and the like to see how much difference it made on different workloads.
> >
> > Matt.
> >
> >> On 30 Jul 2021, at 08:05, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> >>
> >> Matt,
> >> Thank you for the tip. I have put 32 partitions on each of my NVMe drives, made 32 raid5 stripes and then made an LVM from the 32 raid5 stripes. I then created one physical volume per 10 NVMe drives on each socket. I believe I have successfully harnessed "alien technology" (inside joke) to create a Frankenstein's monster. These are substantially better than doing a RAID0 stripe over the partitioned md's in the past. I whipped all of this together in two 15 minute sessions last night and just right now, so I might run some more extensive tests when I have the cycles. I didn't intend to leave the thread hanging.
> >> BLUF - fio detailed output below.....
> >> 9 drives per socket, raw:
> >> socket0: 9 drives, raw 4K random reads, 13.6M IOPS
> >> socket1: 9 drives, raw 4K random reads, 12.3M IOPS
> >> %Cpu(s): 4.4 us, 25.6 sy, 0.0 ni, 56.7 id, 0.0 wa, 13.1 hi, 0.2 si, 0.0 st
> >>
> >> 9 data drives per socket, RAID5/LVM (9+1):
> >> socket0: 9 drives, 4K random reads, 8.57M IOPS
> >> socket1: 9 drives, 4K random reads, 8.57M IOPS
> >> %Cpu(s): 7.0 us, 22.3 sy, 0.0 ni, 58.4 id, 0.0 wa, 12.1 hi, 0.2 si, 0.0 st
> >>
> >>
> >> All,
> >> I intend to test the 4.15 kernel patch next week. My SA would prefer that the patch had already made it into the kernel-ml stream, so he could just install an RPM, but I'll get him to build the kernel if need be.
> >> If the 4.15 kernel patch doesn't alleviate the issues, I have a strong desire to have mdraid made better.
> >>
> >>
> >> Quick fio results:
> >> socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> >> 4096B-4096B, ioengine=libaio, iodepth=128 ...
> >> socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> >> 4096B-4096B, ioengine=libaio, iodepth=128 ...
> >> socket0-lv: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B,
> >> (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
> >> socket1-lv: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B,
> >> (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
> >> fio-3.26
> >> Starting 256 processes
> >>
> >> socket0: (groupid=0, jobs=64): err= 0: pid=64032: Thu Jul 29 21:48:32
> >> 2021
> >> read: IOPS=13.6M, BW=51.7GiB/s (55.6GB/s)(3105GiB/60003msec)
> >> slat (nsec): min=1292, max=1376.7k, avg=2545.32, stdev=2696.74
> >> clat (usec): min=36, max=71580, avg=600.68, stdev=361.36
> >> lat (usec): min=38, max=71616, avg=603.30, stdev=361.38
> >> clat percentiles (usec):
> >> | 1.00th=[ 169], 5.00th=[ 231], 10.00th=[ 277], 20.00th=[ 347],
> >> | 30.00th=[ 404], 40.00th=[ 457], 50.00th=[ 519], 60.00th=[ 594],
> >> | 70.00th=[ 676], 80.00th=[ 791], 90.00th=[ 996], 95.00th=[ 1205],
> >> | 99.00th=[ 1909], 99.50th=[ 2409], 99.90th=[ 3556], 99.95th=[ 3884],
> >> | 99.99th=[ 5538]
> >> bw ( MiB/s): min=39960, max=56660, per=20.94%, avg=53040.69, stdev=49.61, samples=7488
> >> iops : min=10229946, max=14504941, avg=13578391.80, stdev=12699.61, samples=7488
> >> lat (usec) : 50=0.01%, 100=0.01%, 250=6.83%, 500=40.06%, 750=29.92%
> >> lat (usec) : 1000=13.42%
> >> lat (msec) : 2=8.91%, 4=0.82%, 10=0.04%, 20=0.01%, 50=0.01%
> >> lat (msec) : 100=0.01%
> >> cpu : usr=14.82%, sys=46.57%, ctx=35564249, majf=0, minf=9754
> >> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> >> issued rwts: total=813909201,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> >> latency : target=0, window=0, percentile=100.00%, depth=128
> >> socket1: (groupid=1, jobs=64): err= 0: pid=64096: Thu Jul 29 21:48:32
> >> 2021
> >> read: IOPS=12.3M, BW=46.9GiB/s (50.3GB/s)(2812GiB/60003msec)
> >> slat (nsec): min=1292, max=1672.2k, avg=2672.21, stdev=2742.06
> >> clat (usec): min=25, max=73526, avg=663.35, stdev=611.06
> >> lat (usec): min=28, max=73545, avg=666.09, stdev=611.08
> >> clat percentiles (usec):
> >> | 1.00th=[ 143], 5.00th=[ 190], 10.00th=[ 227], 20.00th=[ 285],
> >> | 30.00th=[ 338], 40.00th=[ 400], 50.00th=[ 478], 60.00th=[ 586],
> >> | 70.00th=[ 725], 80.00th=[ 930], 90.00th=[ 1254], 95.00th=[ 1614],
> >> | 99.00th=[ 3490], 99.50th=[ 4146], 99.90th=[ 6390], 99.95th=[ 6980],
> >> | 99.99th=[ 8356]
> >> bw ( MiB/s): min=28962, max=55326, per=12.70%, avg=48036.10, stdev=96.75, samples=7488
> >> iops : min=7414327, max=14163615, avg=12297214.98, stdev=24768.82, samples=7488
> >> lat (usec) : 50=0.01%, 100=0.03%, 250=13.75%, 500=38.84%, 750=18.71%
> >> lat (usec) : 1000=11.55%
> >> lat (msec) : 2=14.30%, 4=2.23%, 10=0.60%, 20=0.01%, 50=0.01%
> >> lat (msec) : 100=0.01%
> >> cpu : usr=13.41%, sys=44.44%, ctx=39379711, majf=0, minf=9982
> >> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> >> issued rwts: total=737168913,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> >> latency : target=0, window=0, percentile=100.00%, depth=128
> >> socket0-lv: (groupid=2, jobs=64): err= 0: pid=64166: Thu Jul 29
> >> 21:48:32 2021
> >> read: IOPS=8570k, BW=32.7GiB/s (35.1GB/s)(1962GiB/60006msec)
> >> slat (nsec): min=1873, max=11085k, avg=4694.47, stdev=4825.39
> >> clat (usec): min=24, max=21739, avg=950.52, stdev=948.83
> >> lat (usec): min=51, max=21743, avg=955.29, stdev=948.88
> >> clat percentiles (usec):
> >> | 1.00th=[ 155], 5.00th=[ 217], 10.00th=[ 265], 20.00th=[ 338],
> >> | 30.00th=[ 404], 40.00th=[ 486], 50.00th=[ 594], 60.00th=[ 766],
> >> | 70.00th=[ 1029], 80.00th=[ 1418], 90.00th=[ 2089], 95.00th=[ 2737],
> >> | 99.00th=[ 4490], 99.50th=[ 5669], 99.90th=[ 8586], 99.95th=[ 9896],
> >> | 99.99th=[12125]
> >> bw ( MiB/s): min=24657, max=37453, per=-25.17%, avg=33516.00, stdev=35.32, samples=7616
> >> iops : min=6312326, max=9588007, avg=8580076.03, stdev=9041.88, samples=7616
> >> lat (usec) : 50=0.01%, 100=0.01%, 250=8.40%, 500=33.21%, 750=17.54%
> >> lat (usec) : 1000=9.89%
> >> lat (msec) : 2=19.93%, 4=9.58%, 10=1.40%, 20=0.05%, 50=0.01%
> >> cpu : usr=9.01%, sys=51.48%, ctx=27829950, majf=0, minf=9028
> >> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> >> issued rwts: total=514275323,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> >> latency : target=0, window=0, percentile=100.00%, depth=128
> >> socket1-lv: (groupid=3, jobs=64): err= 0: pid=64230: Thu Jul 29
> >> 21:48:32 2021
> >> read: IOPS=8571k, BW=32.7GiB/s (35.1GB/s)(1962GiB/60006msec)
> >> slat (nsec): min=1823, max=14362k, avg=4809.30, stdev=4940.42
> >> clat (usec): min=50, max=22856, avg=950.31, stdev=948.13
> >> lat (usec): min=54, max=22860, avg=955.19, stdev=948.19
> >> clat percentiles (usec):
> >> | 1.00th=[ 157], 5.00th=[ 221], 10.00th=[ 269], 20.00th=[ 343],
> >> | 30.00th=[ 412], 40.00th=[ 490], 50.00th=[ 603], 60.00th=[ 766],
> >> | 70.00th=[ 1029], 80.00th=[ 1418], 90.00th=[ 2089], 95.00th=[ 2737],
> >> | 99.00th=[ 4293], 99.50th=[ 5604], 99.90th=[ 9503], 99.95th=[10683],
> >> | 99.99th=[12649]
> >> bw ( MiB/s): min=23434, max=36909, per=-25.17%, avg=33517.14, stdev=50.36, samples=7616
> >> iops : min=5999220, max=9448818, avg=8580368.69, stdev=12892.93, samples=7616
> >> lat (usec) : 100=0.01%, 250=7.88%, 500=33.09%, 750=18.09%, 1000=10.02%
> >> lat (msec) : 2=19.91%, 4=9.75%, 10=1.16%, 20=0.08%, 50=0.01%
> >> cpu : usr=9.14%, sys=51.94%, ctx=25524808, majf=0, minf=9037
> >> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> >> issued rwts: total=514294010,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> >> latency : target=0, window=0, percentile=100.00%, depth=128
> >>
> >> Run status group 0 (all jobs):
> >> READ: bw=51.7GiB/s (55.6GB/s), 51.7GiB/s-51.7GiB/s
> >> (55.6GB/s-55.6GB/s), io=3105GiB (3334GB), run=60003-60003msec
> >>
> >> Run status group 1 (all jobs):
> >> READ: bw=46.9GiB/s (50.3GB/s), 46.9GiB/s-46.9GiB/s
> >> (50.3GB/s-50.3GB/s), io=2812GiB (3019GB), run=60003-60003msec
> >>
> >> Run status group 2 (all jobs):
> >> READ: bw=32.7GiB/s (35.1GB/s), 32.7GiB/s-32.7GiB/s
> >> (35.1GB/s-35.1GB/s), io=1962GiB (2106GB), run=60006-60006msec
> >>
> >> Run status group 3 (all jobs):
> >> READ: bw=32.7GiB/s (35.1GB/s), 32.7GiB/s-32.7GiB/s
> >> (35.1GB/s-35.1GB/s), io=1962GiB (2107GB), run=60006-60006msec
> >>
> >> Disk stats (read/write):
> >> nvme0n1: ios=90336694/0, merge=0/0, ticks=45102163/0,
> >> in_queue=45102163, util=97.44%
> >> nvme1n1: ios=90337153/0, merge=0/0, ticks=47422886/0,
> >> in_queue=47422887, util=97.81%
> >> nvme2n1: ios=90337516/0, merge=0/0, ticks=46419782/0,
> >> in_queue=46419782, util=97.95%
> >> nvme3n1: ios=90337843/0, merge=0/0, ticks=46256374/0,
> >> in_queue=46256374, util=97.95%
> >> nvme4n1: ios=90337742/0, merge=0/0, ticks=59122226/0,
> >> in_queue=59122225, util=98.19%
> >> nvme5n1: ios=90338813/0, merge=0/0, ticks=57811758/0,
> >> in_queue=57811758, util=98.33%
> >> nvme6n1: ios=90339194/0, merge=0/0, ticks=57369337/0,
> >> in_queue=57369337, util=98.37%
> >> nvme7n1: ios=90339048/0, merge=0/0, ticks=55791076/0,
> >> in_queue=55791076, util=98.78%
> >> nvme8n1: ios=90340234/0, merge=0/0, ticks=44977001/0,
> >> in_queue=44977001, util=99.01%
> >> nvme12n1: ios=81819608/0, merge=0/0, ticks=26788080/0,
> >> in_queue=26788079, util=99.24%
> >> nvme13n1: ios=81819831/0, merge=0/0, ticks=26736682/0,
> >> in_queue=26736681, util=99.57%
> >> nvme14n1: ios=81820006/0, merge=0/0, ticks=26772951/0,
> >> in_queue=26772951, util=99.67%
> >> nvme15n1: ios=81820215/0, merge=0/0, ticks=26741532/0,
> >> in_queue=26741532, util=99.78%
> >> nvme17n1: ios=81819922/0, merge=0/0, ticks=76459192/0,
> >> in_queue=76459192, util=99.84%
> >> nvme18n1: ios=81820146/0, merge=0/0, ticks=86756309/0,
> >> in_queue=86756309, util=99.82%
> >> nvme19n1: ios=81820481/0, merge=0/0, ticks=75008919/0,
> >> in_queue=75008919, util=100.00%
> >> nvme20n1: ios=81819690/0, merge=0/0, ticks=91888274/0,
> >> in_queue=91888275, util=100.00%
> >> nvme21n1: ios=81821809/0, merge=0/0, ticks=26653056/0,
> >> in_queue=26653057, util=100.00%
> >>
> >> -----Original Message-----
> >> From: Matt Wallis <mattw@madmonks.org>
> >> Sent: Wednesday, July 28, 2021 8:54 PM
> >> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> >> Cc: linux-raid@vger.kernel.org
> >> Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
> >>
> >> Hi Jim,
> >>
> >> Totally get the Frankenstein’s monster aspect, I try not to build those where I can, but at the moment I don’t think there’s much that can be done about it.
> >> Not sure if LVM is better than MDRAID 0; it just gives you more control over the volumes that can be created, instead of having it all in one big chunk. If you just need one big chunk, then MDRAID 0 is probably fine.
> >>
> >> I think if you can create a couple of scripts that allow the admin to fail a drive out of all the arrays it’s in at once, then it’s not that much worse than managing an MDRAID normally is (a sketch of such a script follows below).
> >>
> >> Matt.
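A sketch of the kind of helper script Matt mentions, assuming a hypothetical layout where /dev/mdN is built from partition N+1 of every drive:

  #!/bin/bash
  # fail_drive.sh <drive>: fail and remove every partition of one
  # physical NVMe drive from each of the 8 arrays it belongs to
  drive=$1   # e.g. nvme5n1
  for n in $(seq 0 7); do
      part=/dev/${drive}p$((n + 1))
      mdadm /dev/md$n --fail "$part" --remove "$part"
  done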
> >>
> >>> On 28 Jul 2021, at 20:43, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> >>>
> >>> Matt,
> >>> I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about our sustainability long term. As a researcher, I can do these cool science experiments, but I still have to hand designs to sustainment folks. I was also running into an issue doing a mdraid RAID0 on top of the RAID6's so I could toss one xfs file system on top of each NUMA node's drives: the last RAID0 stripe over all of the RAID6's couldn't generate the queue depth needed. We even recompiled the kernel to change the mdraid nr_request max from 128 to 1023.
> >>>
> >>> I will have to try the LVM experiment. I'm an LVM neophyte, so it might take me the rest of today/tomorrow to get new results, as I tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time also. One might be able to argue that configuration isn't too much of a "Frankenstein's monster" for me to hand it off.
> >>>
> >>> Thanks,
> >>> Jim
> >>>
> >>>
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Matt Wallis <mattw@madmonks.org>
> >>> Sent: Wednesday, July 28, 2021 6:32 AM
> >>> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> >>> Cc: linux-raid@vger.kernel.org
> >>> Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
> >>>
> >>> Hi Jim,
> >>>
> >>>> On 28 Jul 2021, at 06:32, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> >>>>
> >>>> Sorry, this will be a long email with everything I find to be relevant, but I can get over 110GB/s of 4kB random reads from individual NVMe SSDs, but I'm at a loss why mdraid can only do a very small fraction of it. I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful. This is everything I do to a server to make the I/O crank.....My role is that of a lab researcher/resident expert/consultant. I'm just stumped why I can't do better. If there is a fine manual that somebody can point me to, I'm happy to read it…
> >>>
> >>> I am probably going to get corrected on some if not all of this, but from what I understand, and from my own little experiments on a similar Intel based system…
> >>> 1. NVMe is stupid fast; you need a good chunk of CPU performance to max it out.
> >>> 2. Most block IO in the kernel is limited in terms of threading; it may even be essentially single threaded. (This is where I will get corrected)
> >>> 3. AFAICT, this includes mdraid: there’s a single thread per RAID device handling all the RAID calculations. (mdX_raid6)
> >>>
> >>> What I did to get IOPs up in a system with 24 NVMe, split up into 12 per NUMA domain:
> >>> 1. Create 8 partitions on each drive (this may be overkill, I just started here for some reason).
> >>> 2. Create 8 RAID6 arrays with 1 partition per drive.
> >>> 3. Use LVM to create a single striped logical volume over all 8 RAID volumes. RAID 0+6 as it were.
> >>>
> >>> You now have an LVM thread that is basically doing nothing more than chunking the data as it comes in, then, sending the chunks to 8 separate RAID devices, each with their own threads, buffers, queues etc, all of which can be spread over more cores.
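As a sketch, the build Matt describes might look like this (device names, drive count, and chunk size are assumptions):

  # one RAID6 array per partition index, 12 drives wide
  for n in $(seq 1 8); do
      mdadm --create /dev/md$((n - 1)) --level=6 --chunk=128 \
            --raid-devices=12 /dev/nvme{0..11}n1p$n
  done
  # then one striped LV over the 8 arrays (RAID 0+6), as in the
  # LVM sketch earlier in the thread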
> >>>
> >>> I saw a significant (for me, significant is >20%) increase in IOPs doing this.
> >>>
> >>> You still have RAID6 protection, but you might want to write a couple of scripts to help you manage the arrays, because a single failed drive now needs to be removed from 8 RAID volumes.
> >>>
> >>> There’s not a lot of capacity lost doing this; I’m pretty sure I lost less than 100MB to the partitions and the RAID overhead.
> >>>
> >>> You would never consider this on spinning disk of course; it’s way too slow and you’re just going to make it slower. NVMe, as you noticed, has the IOPS to spare, so I’m pretty sure it’s just that we’re not able to get the data to it fast enough.
> >>>
> >>> Matt
>
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-30 8:45 ` Miao Wang
2021-07-30 9:59 ` Finlayson, James M CIV (USA)
@ 2021-07-30 13:17 ` Peter Grandi
1 sibling, 0 replies; 28+ messages in thread
From: Peter Grandi @ 2021-07-30 13:17 UTC (permalink / raw)
To: list Linux RAID
>>> On Fri, 30 Jul 2021 16:45:32 +0800, Miao Wang
>>> <shankerwangmiao@gmail.com> said:
> [...] was also stuck in a similar problem and finally gave
> up. Since it is very difficult to find such environment with
> so many fast nvme drives, I wonder if you have any interest in
> ZFS. [...]
Or Btrfs or the new 'bcachefs', which is OK for simple
configurations (RAID10-like).
But part of the issue here with MD RAID is that it is in theory
mostly a translation layer like 'loop', but also sort of like a
virtual block device too, and weird things happen as IO requests
get reshaped and requeued.
My impression, as I mentioned in a previous message, is that the
critical detail is probably this:
>> Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
>> nvme0n1 1317510.00 0.00 5270044.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00 411.95 4.00 0.00 0.00 100.40
>> [...]
>> Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
>> nvme0n1 114589.00 0.00 458356.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.54 4.00 0.00 0.01 100.00
> The obvious difference is the factor of 10 in "aqu-sz" and that
> correspond to the factor of 10 in "r/s" and "rkB/s".
That may happen because the test is run directly on the 'md[01]'
block device, which can do odd things. Counterintuitively, much
bigger 'aqu-sz' and thus much better speed could be achieved by
doing the test using a suitable filesystem on top of the 'md[01]'
device.
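The two cases Peter contrasts could be compared with something like this (paths and sizes are hypothetical):

  # case 1: random reads directly against the md block device
  fio --name=md-raw --filename=/dev/md0 --rw=randread --bs=4k \
      --ioengine=libaio --iodepth=128 --numjobs=64 --direct=1 \
      --group_reporting --time_based --runtime=60

  # case 2: the same workload through a file on an XFS filesystem on md0
  fio --name=md-xfs --directory=/mnt/md0 --size=100G --rw=randread --bs=4k \
      --ioengine=libaio --iodepth=128 --numjobs=64 --direct=1 \
      --group_reporting --time_based --runtime=60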
With ZFS, since striping is integrated within ZFS itself, there is
a good chance that could happen too, especially on highly parallel
workloads.
There is however a huge warning: the test is run on IOPS with
4KiB blocks, and ZFS in COW mode does not work well with that
(especially for writes, but also for reads, if compression and
checksumming are enabled, for RAIDz), so I think that it should be
run with COW disabled, or perhaps on a 'zvol'.
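A sketch of the zvol variant Peter suggests (pool and volume names are hypothetical):

  # 4KiB-block zvol, exposed as /dev/zvol/tank/bench
  zfs create -V 500G -o volblocksize=4k tank/bench
  # keep the ARC from caching data blocks during a pure IOPS test
  zfs set primarycache=metadata tank/bench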
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-30 8:28 ` Matt Wallis
2021-07-30 8:45 ` Miao Wang
@ 2021-07-30 9:54 ` Finlayson, James M CIV (USA)
1 sibling, 0 replies; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-07-30 9:54 UTC (permalink / raw)
To: 'Matt Wallis'; +Cc: 'linux-raid@vger.kernel.org'
I just always used 128K: large enough that most small IOPS operations need only one read from a single drive to complete the I/O, and small enough that a 1MB O_DIRECT write will get us a full stripe write. Plus, I always plug the old "Fusion I/O" crowd. Best white papers ever: "here are our numbers, here's how we got them, here are the instructions for you to get them too".
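Spelled out, that full-stripe arithmetic assumes eight data drives (the drive count is my assumption; it is not stated above):

  full stripe = data drives × chunk = 8 × 128KiB = 1MiB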
-----Original Message-----
From: Matt Wallis <mattw@madmonks.org>
Sent: Friday, July 30, 2021 4:28 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: linux-raid@vger.kernel.org
Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
Hi Jim,
That’s significantly better than I expected; I need to see if I can get someone to send me the box I was using so I can spend some more time on it.
Good luck with the rest of it, the bit I was looking for as a next step was going to be potentially tweaking stripe widths and the like to see how much difference it made on different workloads.
Matt.
> On 30 Jul 2021, at 08:05, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> [...]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-27 20:32 Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing????? Finlayson, James M CIV (USA)
` (2 preceding siblings ...)
2021-07-28 10:31 ` Matt Wallis
@ 2021-08-01 11:21 ` Gal Ofri
2021-08-03 14:59 ` [Non-DoD Source] " Finlayson, James M CIV (USA)
[not found] ` <AS8PR04MB799205817C4647DAC740DE9A91EA9@AS8PR04MB7992.eurprd04.prod.outlook.com>
4 siblings, 1 reply; 28+ messages in thread
From: Gal Ofri @ 2021-08-01 11:21 UTC (permalink / raw)
To: Finlayson, James M CIV (USA); +Cc: 'linux-raid@vger.kernel.org'
Hey Jim,
Read iops (rand/seq) were addressed in a recent commit:
97ae27252f49 md/raid5: avoid device_lock in read_one_chunk()
https://github.com/torvalds/linux/commit/97ae27252f4962d0fcc38ee1d9f913d817a2024e
It was merged into 5.14, so you can either cherry-pick it or just use a
latest-master kernel.
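For instance (branch name and base tag are illustrative; the commit hash is the one cited above):

  # apply the read_one_chunk() fix on top of an older kernel tree
  git checkout -b raid5-iops v5.13
  git cherry-pick 97ae27252f49
  # then rebuild and install the kernel as usual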
Sounds like your environment is stronger than the one I used for the
testing, so please do share your benchmark if you manage to surpass the
results described in the commit message.
Cheers,
Gal Ofri,
Volumez (formerly storing.io)
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-08-01 11:21 ` Gal Ofri
@ 2021-08-03 14:59 ` Finlayson, James M CIV (USA)
2021-08-04 9:33 ` Gal Ofri
0 siblings, 1 reply; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-08-03 14:59 UTC (permalink / raw)
To: 'Gal Ofri', 'linux-raid@vger.kernel.org'
Gal,
My SA just gave me the server with the 5.14 RC4 kernel built. I have a two-pass preconditioning run going right now to get us maximum results. I expect to be able to run the tests hopefully by COB Wednesday. Preconditioning will unfortunately take 8 hours (15.36TB drives); I have to make BIOS changes for apples-to-apples "hero runs" and then get the mdraids created. In your opinion, if I bypass the initial formatting with mdadm --assume-clean, will that make a difference in the results? I usually let the format run, but I want to get you results as soon as possible.
Thanks,
Jim
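For reference, a sketch of the flag in question (the array geometry here is illustrative):

  # skip the initial resync; parity is only trustworthy once the
  # array has been fully written, so use with care
  mdadm --create /dev/md0 --level=5 --raid-devices=10 --chunk=128 \
        --assume-clean /dev/nvme{0..9}n1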
-----Original Message-----
From: Gal Ofri <gal.ofri@volumez.com>
Sent: Sunday, August 1, 2021 7:21 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: 'linux-raid@vger.kernel.org' <linux-raid@vger.kernel.org>
Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
Hey Jim,
Read iops (rand/seq) were addressed in a recent commit:
97ae27252f49 md/raid5: avoid device_lock in read_one_chunk() https://github.com/torvalds/linux/commit/97ae27252f4962d0fcc38ee1d9f913d817a2024e
It was merged into 5.14, so you can either cherry-pick it or just use a latest-master kernel.
Sounds like your environment is stronger than the one I used for the testing, so please do share your benchmark if you manage to surpass the results described in the commit message.
Cheers,
Gal Ofri,
Volumez (formerly storing.io)
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-08-03 14:59 ` [Non-DoD Source] " Finlayson, James M CIV (USA)
@ 2021-08-04 9:33 ` Gal Ofri
0 siblings, 0 replies; 28+ messages in thread
From: Gal Ofri @ 2021-08-04 9:33 UTC (permalink / raw)
To: Finlayson, James M CIV (USA); +Cc: 'linux-raid@vger.kernel.org'
On Tue, 3 Aug 2021 14:59:45 +0000
"Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@mail.mil> wrote:
> My SA just gave me the server with the 5.14 RC4 kernel built. I have a two-pass preconditioning run going right now to get us maximum results. I expect to be able to run the tests hopefully by COB Wednesday. Preconditioning will unfortunately take 8 hours (15.36TB drives); I have to make BIOS changes for apples-to-apples "hero runs" and then get the mdraids created. In your opinion, if I bypass the initial formatting with mdadm --assume-clean, will that make a difference in the results? I usually let the format run, but I want to get you results as soon as possible.
You're running a reads workload, so it doesn't make sense to read stuff
without formatting first. IMO, you'd better wait for the formatting to
complete rather than try to skip/reduce it.
Also, note that in my tests I had to set up XFS over the raid in order to
avoid queueing issues (a sketch follows below). Feel free to ping me if
you want the setup script or the vdbench file(s) that I used.
Cheers,
Gal
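A sketch of an XFS aligned to the array's stripe geometry (the su/sw values assume a 128KiB chunk and 9 data drives):

  # align XFS stripe unit/width to the md layout
  mkfs.xfs -d su=128k,sw=9 /dev/md0
  mount /dev/md0 /mnt/md0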
^ permalink raw reply [flat|nested] 28+ messages in thread
[parent not found: <AS8PR04MB799205817C4647DAC740DE9A91EA9@AS8PR04MB7992.eurprd04.prod.outlook.com>]
end of thread, other threads:[~2021-08-18 19:59 UTC | newest]
Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-27 20:32 Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing????? Finlayson, James M CIV (USA)
2021-07-27 21:52 ` Chris Murphy
2021-07-27 22:42 ` Peter Grandi
2021-07-28 10:31 ` Matt Wallis
2021-07-28 10:43 ` [Non-DoD Source] " Finlayson, James M CIV (USA)
2021-07-29 0:54 ` [Non-DoD Source] " Matt Wallis
2021-07-29 16:35 ` Wols Lists
2021-07-29 18:12 ` Finlayson, James M CIV (USA)
2021-07-29 22:05 ` Finlayson, James M CIV (USA)
2021-07-30 8:28 ` Matt Wallis
2021-07-30 8:45 ` Miao Wang
2021-07-30 9:59 ` Finlayson, James M CIV (USA)
2021-07-30 14:03 ` Doug Ledford
2021-07-30 13:17 ` Peter Grandi
2021-07-30 9:54 ` Finlayson, James M CIV (USA)
2021-08-01 11:21 ` Gal Ofri
2021-08-03 14:59 ` [Non-DoD Source] " Finlayson, James M CIV (USA)
2021-08-04 9:33 ` Gal Ofri
[not found] ` <AS8PR04MB799205817C4647DAC740DE9A91EA9@AS8PR04MB7992.eurprd04.prod.outlook.com>
[not found] ` <5EAED86C53DED2479E3E145969315A2385856AD0@UMECHPA7B.easf.csd.disa.mil>
[not found] ` <5EAED86C53DED2479E3E145969315A2385856AF7@UMECHPA7B.easf.csd.disa.mil>
2021-08-05 19:52 ` Finlayson, James M CIV (USA)
2021-08-05 20:50 ` Finlayson, James M CIV (USA)
2021-08-05 21:10 ` Finlayson, James M CIV (USA)
2021-08-08 14:43 ` Gal Ofri
2021-08-09 19:01 ` Finlayson, James M CIV (USA)
2021-08-17 21:21 ` Finlayson, James M CIV (USA)
2021-08-18 0:45 ` [Non-DoD Source] " Matt Wallis
2021-08-18 10:20 ` Finlayson, James M CIV (USA)
2021-08-18 19:48 ` Doug Ledford
2021-08-18 19:59 ` Doug Ledford