From: "Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@mail.mil>
To: "'linux-raid@vger.kernel.org'" <linux-raid@vger.kernel.org>
Cc: "Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@mail.mil>
Subject: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
Date: Tue, 27 Jul 2021 20:32:00 +0000
Message-ID: <5EAED86C53DED2479E3E145969315A2385841062@UMECHPA7B.easf.csd.disa.mil>
Sorry, this will be a long email with everything I find to be relevant. I can get over 110GB/s of 4kB random reads in aggregate from the individual NVMe SSDs, but I'm at a loss as to why mdraid can only do a very small fraction of that. I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful. This is everything I do to a server to make the I/O crank. My role is that of a lab researcher/resident expert/consultant. I'm just stumped why I can't do better. If there is a fine manual that somebody can point me to, I'm happy to read it.
I have tried both RAID5 and RAID6, trying to be highly cognizant of NUMAness. The ROME system is set to 1 NUMA node per socket (NPS1), and the BIOS is set to maximize Infinity Fabric and PCIe performance per AMD's white papers. The NVMe drives are all Gen4 (I believe HPE-rebadged Samsung 1733a); I can get the drives doing 1.45M 4KB random reads each if I try hard.
Everything I can think to share:
[root@<server> <server>]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)
[root@<server> <server>]# uname -r
4.18.0-305.el8.x86_64
[root@<server> ~]# modinfo raid6
filename: /lib/modules/4.18.0-305.el8.x86_64/kernel/drivers/md/raid456.ko.xz
alias: raid6
alias: raid5
alias: md-level-6
alias: md-raid6
alias: md-personality-8
alias: md-level-4
alias: md-level-5
alias: md-raid4
alias: md-raid5
alias: md-personality-4
description: RAID4/5/6 (striping with parity) personality for MD
license: GPL
rhelversion: 8.4
srcversion: FE86A53E1C1CDAE8F972CBA
depends: async_raid6_recov,async_pq,libcrc32c,raid6_pq,async_tx,async_memcpy,async_xor
intree: Y
name: raid456
vermagic: 4.18.0-305.el8.x86_64 SMP mod_unload modversions
sig_id: PKCS#7
signer: Red Hat Enterprise Linux kernel signing key
[root@<server> ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme16n1 259:0 0 1.8T 0 disk
├─nvme16n1p1 259:1 0 512M 0 part /boot/efi
├─nvme16n1p2 259:2 0 512M 0 part /boot
├─nvme16n1p3 259:3 0 49.4G 0 part [SWAP]
└─nvme16n1p4 259:4 0 1.7T 0 part /
nvme0n1 259:5 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme1n1 259:6 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme2n1 259:7 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme3n1 259:8 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme7n1 259:9 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme11n1 259:10 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme10n1 259:11 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme14n1 259:12 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme5n1 259:13 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme8n1 259:14 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme6n1 259:15 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme9n1 259:16 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme15n1 259:17 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme20n1 259:18 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme13n1 259:19 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme18n1 259:20 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme4n1 259:21 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme21n1 259:22 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme22n1 259:23 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme24n1 259:24 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme12n1 259:25 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme17n1 259:26 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme19n1 259:27 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme23n1 259:28 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
[root@<server> ~]# lsblk -o KNAME,MODEL,VENDOR
KNAME MODEL VENDOR
nvme0n1 MZXL515THALA-000H3
nvme1n1 MZXL515THALA-000H3
nvme2n1 MZXL515THALA-000H3
nvme3n1 MZXL515THALA-000H3
nvme7n1 MZXL515THALA-000H3
nvme11n1 MZXL515THALA-000H3
nvme10n1 MZXL515THALA-000H3
nvme14n1 MZXL515THALA-000H3
nvme5n1 MZXL515THALA-000H3
nvme8n1 MZXL515THALA-000H3
nvme6n1 MZXL515THALA-000H3
nvme9n1 MZXL515THALA-000H3
nvme15n1 MZXL515THALA-000H3
nvme20n1 MZXL515THALA-000H3
nvme13n1 MZXL515THALA-000H3
nvme18n1 MZXL515THALA-000H3
nvme4n1 MZXL515THALA-000H3
nvme21n1 MZXL515THALA-000H3
nvme22n1 MZXL515THALA-000H3
nvme24n1 MZXL515THALA-000H3
nvme12n1 MZXL515THALA-000H3
nvme17n1 MZXL515THALA-000H3
nvme19n1 MZXL515THALA-000H3
nvme23n1 MZXL515THALA-000H3
[root@<server> jim]# ./map_numa.sh (nvme16 is the boot drive; nvme0-11 are on NUMA node 0, nvme12-24 on NUMA node 1)
device: nvme8 numanode: 0
device: nvme9 numanode: 0
device: nvme10 numanode: 0
device: nvme11 numanode: 0
device: nvme4 numanode: 0
device: nvme5 numanode: 0
device: nvme6 numanode: 0
device: nvme7 numanode: 0
device: nvme2 numanode: 0
device: nvme3 numanode: 0
device: nvme0 numanode: 0
device: nvme1 numanode: 0
device: nvme21 numanode: 1
device: nvme22 numanode: 1
device: nvme23 numanode: 1
device: nvme24 numanode: 1
device: nvme16 numanode: 1
device: nvme17 numanode: 1
device: nvme18 numanode: 1
device: nvme19 numanode: 1
device: nvme20 numanode: 1
device: nvme14 numanode: 1
device: nvme15 numanode: 1
device: nvme12 numanode: 1
device: nvme13 numanode: 1
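The map_numa.sh script itself is not included in this message. As a rough sketch of how such a mapping can be produced, the Python below reads the `numa_node` attribute that Linux exposes for each NVMe controller's PCI device. The function name `map_numa` and the `sysfs_root` parameter are illustrative assumptions, not the original script.

```python
import os
import re


def map_numa(sysfs_root="/sys/class/nvme"):
    """Return {controller_name: numa_node} for each NVMe controller found.

    Hypothetical stand-in for map_numa.sh; assumes the usual sysfs layout
    <sysfs_root>/nvmeN/device/numa_node.
    """
    mapping = {}
    for name in sorted(os.listdir(sysfs_root)):
        # Only controller entries such as nvme0, nvme21 (skip anything else).
        if not re.fullmatch(r"nvme\d+", name):
            continue
        node_file = os.path.join(sysfs_root, name, "device", "numa_node")
        try:
            with open(node_file) as f:
                mapping[name] = int(f.read().strip())
        except (OSError, ValueError):
            # Attribute missing or unreadable: record it as unknown.
            mapping[name] = None
    return mapping


if __name__ == "__main__":
    if os.path.isdir("/sys/class/nvme"):
        for dev, node in map_numa().items():
            print(f"device: {dev} numanode: {node}")
```

A value of -1 in `numa_node` means the kernel has no NUMA affinity recorded for that device.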
[root@<server> jim]# cat /etc/udev/rules.d/99-abj.nr_32.rules
KERNEL=="nvme*[0-9]n*[0-9]",ATTRS{model}=="MZXL515THALA-000H3",ATTR{queue/io_poll}="1",ATTR{queue/io_poll_delay}="100000",ATTR{queue/nomerges}="2",ATTR{queue/nr_requests}="1023",ATTR{queue/rotational}="0",ATTR{queue/rq_affinity}="2",ATTR{queue/scheduler}="none",ATTR{queue/add_random}="0",ATTR{queue/max_sectors_kb}="4096",PROGRAM="/usr/sbin/nvme set-feature /dev/%k --feature-id 8 --value 522 " {coalesce up to 10 interrupts per device}
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="md*", ATTR{md/sync_speed_max}="2000000",ATTR{md/group_thread_cnt}="64", ATTR{md/stripe_cache_size}="8192",ATTR{queue/io_poll}="1",ATTR{queue/io_poll_delay}="100000",ATTR{queue/nomerges}="2",ATTR{queue/nr_requests}="1023",ATTR{queue/rotational}="0",ATTR{queue/rq_affinity}="2",ATTR{queue/scheduler}="none",ATTR{queue/add_random}="0",ATTR{queue/max_sectors_kb}="4096"
(I know the nr_requests value of 1023 doesn't take effect on the md devices, but it is there for reference.) We tune for max IOPS, not for latency, hence going hard at rq_affinity, nomerges, etc.
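For readers wondering where the value 522 in the `nvme set-feature --feature-id 8` call comes from, here is a small decoding sketch. It assumes the Interrupt Coalescing layout from the NVMe base specification as I understand it: bits 7:0 hold the aggregation threshold (a 0's-based count of completion entries) and bits 15:8 hold the aggregation time in 100-microsecond units. The function name is made up for illustration.

```python
def decode_interrupt_coalescing(value):
    """Split an NVMe feature 0x08 value into its two fields.

    Assumed layout (NVMe base spec, Interrupt Coalescing):
      bits 7:0  -> aggregation threshold (0's based)
      bits 15:8 -> aggregation time, in units of 100 microseconds
    """
    threshold_field = value & 0xFF
    time_units = (value >> 8) & 0xFF
    return {"threshold_field": threshold_field, "time_us": time_units * 100}


# 522 == 0x20A
print(decode_interrupt_coalescing(522))
```

Under that assumption, 522 requests a threshold field of 10 with a 200-microsecond aggregation window, which lines up with the "coalesce up to 10 interrupts" note in the rule. Because the threshold is 0's based, the exact entry count is worth confirming by reading the feature back from the controller.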
[root@<server> <server>]# cat /proc/mdstat (128K chunk is just something Fusion IO told me way back when and never needed to change)
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 nvme11n1[11](S) nvme10n1[10] nvme9n1[9] nvme8n1[8] nvme7n1[7] nvme6n1[6] nvme5n1[5] nvme4n1[4] nvme3n1[3] nvme2n1[2] nvme1n1[1] nvme0n1[0]
150007961600 blocks super 1.2 level 5, 128k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
bitmap: 0/112 pages [0KB], 65536KB chunk
md1 : active raid5 nvme24n1[11](S) nvme23n1[10] nvme22n1[9] nvme21n1[8] nvme20n1[7] nvme19n1[6] nvme18n1[5] nvme17n1[4] nvme15n1[3] nvme14n1[2] nvme13n1[1] nvme12n1[0]
150007961600 blocks super 1.2 level 5, 128k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
bitmap: 0/112 pages [0KB], 65536KB chunk
unused devices: <none>
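As a quick sanity check on the mdstat output, the arithmetic below relates the reported block count to the sizes lsblk shows. It assumes /proc/mdstat reports 1 KiB blocks and that a RAID5 array with 11 active members (the spare excluded) exposes 10 members' worth of data; the variable names are only for illustration.

```python
KIB_PER_TIB = 1024 ** 3

md_blocks_kib = 150_007_961_600    # "blocks" value from /proc/mdstat
active_members = 11                # [11/11]; the (S) spare is not counted
data_members = active_members - 1  # RAID5 spends one member's worth on parity

array_tib = md_blocks_kib / KIB_PER_TIB
member_tib = md_blocks_kib / data_members / KIB_PER_TIB

print(f"array : {array_tib:.1f} TiB")
print(f"member: {member_tib:.2f} TiB used per drive")
```

This gives about 139.7 TiB for the array and just under 14 TiB per member, consistent with the 139.7T and 14T figures in the lsblk listing.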
[root@<server> /]# grep raid /var/log/messages (What troubles me: if mdraid checked parity on read, I could somewhat understand the slowdown, but I would think reads are nearly a pass-through.)
Jul 27 00:00:02 <server> rpmlist_verification[12745]: libblockdev-mdraid 2.24 Thu 22 Jul 2021 02:58:37 PM GMT
Jul 27 18:00:28 <server> kernel: raid6: sse2x1 gen() 9792 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x1 xor() 6436 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x2 gen() 11198 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x2 xor() 9546 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x4 gen() 14271 MB/s
Jul 27 18:00:29 <server> kernel: raid6: sse2x4 xor() 6354 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x1 gen() 22838 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x1 xor() 14069 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x2 gen() 26973 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x2 xor() 18380 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x4 gen() 26601 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x4 xor() 7025 MB/s
Jul 27 18:00:29 <server> kernel: raid6: using algorithm avx2x2 gen() 26973 MB/s
Jul 27 18:00:29 <server> kernel: raid6: .... xor() 18380 MB/s, rmw enabled
Jul 27 18:00:29 <server> kernel: raid6: using avx2x2 recovery algorithm
[root@<server> <server>]# cat fiojim.hpdl385.nps1
[global]
name=random
iodepth=128
ioengine=libaio
direct=1
norandommap
group_reporting
randrepeat=1
random_generator=tausworthe64
bs=4k
rw=randread
numjobs=64
runtime=60
[socket0]
new_group
numa_mem_policy=bind:0
numa_cpu_nodes=0
filename=/dev/nvme0n1
filename=/dev/nvme1n1
filename=/dev/nvme2n1
filename=/dev/nvme3n1
filename=/dev/nvme4n1
filename=/dev/nvme5n1
filename=/dev/nvme6n1
filename=/dev/nvme7n1
filename=/dev/nvme8n1
filename=/dev/nvme9n1
filename=/dev/nvme10n1
filename=/dev/nvme11n1
[socket1]
new_group
numa_mem_policy=bind:1
numa_cpu_nodes=1
filename=/dev/nvme12n1
filename=/dev/nvme13n1
filename=/dev/nvme14n1
filename=/dev/nvme15n1
filename=/dev/nvme17n1
filename=/dev/nvme18n1
filename=/dev/nvme19n1
filename=/dev/nvme20n1
filename=/dev/nvme21n1
filename=/dev/nvme22n1
filename=/dev/nvme23n1
filename=/dev/nvme24n1
[socket0-md]
stonewall
numa_mem_policy=bind:0
numa_cpu_nodes=0
filename=/dev/md0
[socket1-md]
new_group
numa_mem_policy=bind:1
numa_cpu_nodes=1
filename=/dev/md1
iostat -xkz 1 with the drives raw:
avg-cpu: %user %nice %system %iowait %steal %idle
8.32 0.00 38.30 0.00 0.00 53.39
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme0n1 1317510.00 0.00 5270044.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00 411.95 4.00 0.00 0.00 100.40
nvme1n1 1317548.00 0.00 5270192.00 0.00 0.00 0.00 0.00 0.00 0.32 0.00 417.38 4.00 0.00 0.00 100.00
nvme2n1 1317578.00 0.00 5270316.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00 414.77 4.00 0.00 0.00 100.20
nvme3n1 1317554.00 0.00 5270216.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00 413.25 4.00 0.00 0.00 100.40
nvme7n1 1317559.00 0.00 5270236.00 0.00 0.00 0.00 0.00 0.00 0.33 0.00 430.03 4.00 0.00 0.00 100.40
nvme11n1 1317502.00 0.00 5269996.00 0.00 0.00 0.00 0.00 0.00 0.73 0.00 964.85 4.00 0.00 0.00 100.40
nvme10n1 1317656.00 0.00 5270624.00 0.00 0.00 0.00 0.00 0.00 0.80 0.00 1050.05 4.00 0.00 0.00 100.40
nvme14n1 1107632.00 0.00 4430528.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 307.52 4.00 0.00 0.00 100.40
nvme5n1 1317583.00 0.00 5270332.00 0.00 0.00 0.00 0.00 0.00 0.33 0.00 430.47 4.00 0.00 0.00 100.00
nvme8n1 1317617.00 0.00 5270468.00 0.00 0.00 0.00 0.00 0.00 0.74 0.00 972.52 4.00 0.00 0.00 101.00
nvme6n1 1317535.00 0.00 5270144.00 0.00 0.00 0.00 0.00 0.00 0.33 0.00 432.48 4.00 0.00 0.00 100.60
nvme9n1 1317582.00 0.00 5270328.00 0.00 0.00 0.00 0.00 0.00 0.75 0.00 992.82 4.00 0.00 0.00 100.40
nvme15n1 1107703.00 0.00 4430816.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 305.93 4.00 0.00 0.00 100.60
nvme20n1 1107712.00 0.00 4430848.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 306.72 4.00 0.00 0.00 100.20
nvme13n1 1107714.00 0.00 4430852.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 307.10 4.00 0.00 0.00 101.40
nvme18n1 1107674.00 0.00 4430696.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 306.04 4.00 0.00 0.00 100.20
nvme4n1 1317521.00 0.00 5270076.00 0.00 0.00 0.00 0.00 0.00 0.33 0.00 431.63 4.00 0.00 0.00 100.20
nvme21n1 1107714.00 0.00 4430856.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 309.11 4.00 0.00 0.00 100.40
nvme22n1 1107711.00 0.00 4430840.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 308.52 4.00 0.00 0.00 100.60
nvme24n1 1107441.00 0.00 4429768.00 0.00 0.00 0.00 0.00 0.00 3.86 0.00 4271.29 4.00 0.00 0.00 100.20
nvme12n1 1107733.00 0.00 4430932.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 307.70 4.00 0.00 0.00 100.40
nvme17n1 1107858.00 0.00 4431436.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 307.95 4.00 0.00 0.00 100.60
nvme19n1 1107766.00 0.00 4431064.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 307.17 4.00 0.00 0.00 100.40
nvme23n1 1108033.00 0.00 4432132.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00 340.62 4.00 0.00 0.00 100.00
iostat -xkz 1 with the md's
avg-cpu: %user %nice %system %iowait %steal %idle
0.56 0.00 49.94 0.00 0.00 49.51
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme0n1 114589.00 0.00 458356.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.54 4.00 0.00 0.01 100.00
nvme1n1 115284.00 0.00 461136.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.77 4.00 0.00 0.01 100.00
nvme2n1 114911.00 0.00 459644.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.61 4.00 0.00 0.01 100.00
nvme3n1 114538.00 0.00 458152.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.55 4.00 0.00 0.01 100.00
nvme7n1 114524.00 0.00 458096.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.53 4.00 0.00 0.01 100.00
nvme10n1 114934.00 0.00 459736.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.61 4.00 0.00 0.01 100.00
nvme14n1 97399.00 0.00 389596.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.41 4.00 0.00 0.01 100.00
nvme5n1 114929.00 0.00 459716.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.61 4.00 0.00 0.01 100.00
nvme8n1 114393.00 0.00 457572.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.40 4.00 0.00 0.01 99.90
nvme6n1 114731.00 0.00 458924.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.56 4.00 0.00 0.01 99.90
nvme9n1 114146.00 0.00 456584.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.37 4.00 0.00 0.01 99.90
nvme15n1 96960.00 0.00 387840.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.30 4.00 0.00 0.01 100.00
nvme20n1 97171.00 0.00 388684.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.36 4.00 0.00 0.01 100.00
nvme13n1 96874.00 0.00 387496.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.31 4.00 0.00 0.01 100.00
nvme18n1 96696.00 0.00 386784.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.16 4.00 0.00 0.01 100.00
nvme4n1 115220.00 0.00 460876.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.75 4.00 0.00 0.01 100.00
nvme21n1 96756.00 0.00 387024.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.24 4.00 0.00 0.01 100.00
nvme22n1 97352.00 0.00 389408.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.36 4.00 0.00 0.01 100.00
nvme12n1 96899.00 0.00 387596.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.22 4.00 0.00 0.01 100.20
nvme17n1 96748.00 0.00 386992.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.24 4.00 0.00 0.01 100.00
nvme19n1 97191.00 0.00 388764.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.30 4.00 0.00 0.01 100.00
nvme23n1 96787.00 0.00 387148.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 28.41 4.00 0.00 0.01 99.90
md1 1066812.00 0.00 4267248.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.00 0.00 0.00 0.00
md0 1262173.00 0.00 5048692.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.00 0.00 0.00 0.00
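One thing the md iostat sample allows is a check for read amplification: if each 4 KiB read on the md device turns into exactly one 4 KiB read on one member, the member r/s values should sum to the md r/s value. The sketch below does that sum with the numbers copied from the one-second sample above; the member-to-array grouping follows the /proc/mdstat listing, with the spares left out because they do not appear in the sample.

```python
# r/s per member, copied from the "iostat -xkz 1 with the md's" sample.
md0_members = [114589, 115284, 114911, 114538, 114524, 114934,
               114929, 114393, 114731, 114146, 115220]
md1_members = [97399, 96960, 97171, 96874, 96696, 96756,
               97352, 96899, 96748, 97191, 96787]

md0_rps = 1262173  # r/s reported for md0
md1_rps = 1066812  # r/s reported for md1

for name, members, md_rps in (("md0", md0_members, md0_rps),
                              ("md1", md1_members, md1_rps)):
    total = sum(members)
    print(f"{name}: members sum to {total}, md reports {md_rps}, "
          f"ratio {total / md_rps:.4f}")
```

The ratio comes out at essentially 1.0 for both arrays, so in terms of I/O count the reads do pass straight through. Each member is simply being driven at roughly 97k to 115k reads per second instead of the 1.1M to 1.3M it sustains raw.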
fio output:
socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket0-md: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1-md: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.26
Starting 256 processes
Jobs: 128 (f=128): [_(128),r(128)][1.6%][r=9103MiB/s][r=2330k IOPS][eta 02h:08m:00s]
socket0: (groupid=0, jobs=64): err= 0: pid=18344: Tue Jul 27 20:00:10 2021
read: IOPS=16.0M, BW=60.8GiB/s (65.3GB/s)(3651GiB/60003msec)
slat (nsec): min=1222, max=18033k, avg=2429.23, stdev=2975.48
clat (usec): min=24, max=20221, avg=510.51, stdev=336.57
lat (usec): min=30, max=20240, avg=513.01, stdev=336.58
clat percentiles (usec):
| 1.00th=[ 147], 5.00th=[ 194], 10.00th=[ 229], 20.00th=[ 281],
| 30.00th=[ 326], 40.00th=[ 367], 50.00th=[ 412], 60.00th=[ 469],
| 70.00th=[ 553], 80.00th=[ 676], 90.00th=[ 914], 95.00th=[ 1156],
| 99.00th=[ 1778], 99.50th=[ 2073], 99.90th=[ 2868], 99.95th=[ 3294],
| 99.99th=[ 4424]
bw ( MiB/s): min=52367, max=65429, per=32.81%, avg=62388.68, stdev=33.73, samples=7424
iops : min=13406054, max=16749890, avg=15971477.42, stdev=8635.86, samples=7424
lat (usec) : 50=0.01%, 100=0.02%, 250=13.89%, 500=50.33%, 750=19.72%
lat (usec) : 1000=8.24%
lat (msec) : 2=7.22%, 4=0.57%, 10=0.02%, 20=0.01%, 50=0.01%
cpu : usr=17.93%, sys=49.30%, ctx=21719222, majf=0, minf=9915
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=957111950,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
socket1: (groupid=1, jobs=64): err= 0: pid=18408: Tue Jul 27 20:00:10 2021
read: IOPS=13.5M, BW=51.4GiB/s (55.2GB/s)(3085GiB/60008msec)
slat (nsec): min=1232, max=1696.9k, avg=2580.28, stdev=2841.95
clat (usec): min=21, max=26808, avg=604.58, stdev=1211.79
lat (usec): min=26, max=26810, avg=607.23, stdev=1211.80
clat percentiles (usec):
| 1.00th=[ 124], 5.00th=[ 157], 10.00th=[ 184], 20.00th=[ 225],
| 30.00th=[ 258], 40.00th=[ 289], 50.00th=[ 318], 60.00th=[ 351],
| 70.00th=[ 388], 80.00th=[ 437], 90.00th=[ 586], 95.00th=[ 2769],
| 99.00th=[ 6587], 99.50th=[ 9372], 99.90th=[12649], 99.95th=[13829],
| 99.99th=[16712]
bw ( MiB/s): min=32950, max=67704, per=20.46%, avg=52713.11, stdev=106.96, samples=7424
iops : min=8435402, max=17332350, avg=13494532.64, stdev=27383.02, samples=7424
lat (usec) : 50=0.01%, 100=0.16%, 250=27.38%, 500=59.09%, 750=4.93%
lat (usec) : 1000=0.30%
lat (msec) : 2=0.60%, 4=5.67%, 10=1.47%, 20=0.39%, 50=0.01%
cpu : usr=14.86%, sys=45.29%, ctx=36050249, majf=0, minf=10046
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=808781317,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
socket0-md: (groupid=2, jobs=64): err= 0: pid=18479: Tue Jul 27 20:00:10 2021
read: IOPS=1263k, BW=4934MiB/s (5174MB/s)(289GiB/60001msec)
slat (nsec): min=1512, max=48037k, avg=49957.85, stdev=33615.19
clat (usec): min=176, max=51614, avg=6432.56, stdev=410.54
lat (usec): min=178, max=51639, avg=6482.58, stdev=412.23
clat percentiles (usec):
| 1.00th=[ 6128], 5.00th=[ 6259], 10.00th=[ 6325], 20.00th=[ 6325],
| 30.00th=[ 6390], 40.00th=[ 6390], 50.00th=[ 6456], 60.00th=[ 6456],
| 70.00th=[ 6521], 80.00th=[ 6521], 90.00th=[ 6587], 95.00th=[ 6587],
| 99.00th=[ 6652], 99.50th=[ 6718], 99.90th=[ 7635], 99.95th=[16909],
| 99.99th=[18220]
bw ( MiB/s): min= 4582, max= 5934, per=100.00%, avg=4938.25, stdev= 2.07, samples=7616
iops : min=1173219, max=1519297, avg=1264175.97, stdev=528.77, samples=7616
lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.34%, 10=99.57%, 20=0.08%, 50=0.01%
lat (msec) : 100=0.01%
cpu : usr=1.23%, sys=95.69%, ctx=2557, majf=0, minf=9064
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=75789817,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
socket1-md: (groupid=3, jobs=64): err= 0: pid=18543: Tue Jul 27 20:00:10 2021
read: IOPS=1071k, BW=4183MiB/s (4386MB/s)(245GiB/60002msec)
slat (nsec): min=1563, max=14080k, avg=59051.10, stdev=22401.39
clat (usec): min=179, max=20799, avg=7588.23, stdev=303.92
lat (usec): min=211, max=20853, avg=7647.34, stdev=305.26
clat percentiles (usec):
| 1.00th=[ 7111], 5.00th=[ 7373], 10.00th=[ 7439], 20.00th=[ 7504],
| 30.00th=[ 7504], 40.00th=[ 7570], 50.00th=[ 7570], 60.00th=[ 7635],
| 70.00th=[ 7635], 80.00th=[ 7701], 90.00th=[ 7767], 95.00th=[ 7767],
| 99.00th=[ 7898], 99.50th=[ 7898], 99.90th=[ 8586], 99.95th=[13304],
| 99.99th=[19006]
bw ( MiB/s): min= 3955, max= 4642, per=100.00%, avg=4186.20, stdev= 0.98, samples=7616
iops : min=1012714, max=1188416, avg=1071653.68, stdev=251.68, samples=7616
lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=99.94%, 20=0.05%, 50=0.01%
cpu : usr=1.06%, sys=95.70%, ctx=1980, majf=0, minf=9030
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=64246431,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
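The latency figures in these four groups can be cross-checked with Little's law (outstanding I/Os = throughput x latency). With numjobs=64 and iodepth=128, each group keeps up to 8192 requests in flight, so the mean latency should be close to 8192 divided by the IOPS. The sketch below uses the average IOPS and average total latency that fio printed above; the helper name is illustrative.

```python
NUMJOBS = 64
IODEPTH = 128
in_flight = NUMJOBS * IODEPTH  # 8192 requests outstanding per group

# group -> (avg IOPS from fio, avg "lat" in microseconds from fio)
groups = {
    "socket0":    (15971477.42, 513.01),
    "socket1":    (13494532.64, 607.23),
    "socket0-md": (1264175.97, 6482.58),
    "socket1-md": (1071653.68, 7647.34),
}


def predicted_latency_us(iops, outstanding=in_flight):
    """Mean latency implied by Little's law for a fully loaded queue."""
    return outstanding / iops * 1e6


for name, (iops, reported_us) in groups.items():
    pred = predicted_latency_us(iops)
    print(f"{name:11s} predicted {pred:8.1f} us   reported {reported_us:8.2f} us")
```

The predictions land within about 0.1% of the reported averages in all four cases. Read that way, the 6 to 8 ms md latency is what a full queue of 8192 requests looks like at a ~1.1M to 1.3M IOPS ceiling, not a separate symptom. Together with sys around 95.7% in the md groups, this hints (but does not prove) that the limit is CPU time spent in the kernel submission path.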
Run status group 0 (all jobs):
READ: bw=60.8GiB/s (65.3GB/s), 60.8GiB/s-60.8GiB/s (65.3GB/s-65.3GB/s), io=3651GiB (3920GB), run=60003-60003msec
Run status group 1 (all jobs):
READ: bw=51.4GiB/s (55.2GB/s), 51.4GiB/s-51.4GiB/s (55.2GB/s-55.2GB/s), io=3085GiB (3313GB), run=60008-60008msec
Run status group 2 (all jobs):
READ: bw=4934MiB/s (5174MB/s), 4934MiB/s-4934MiB/s (5174MB/s-5174MB/s), io=289GiB (310GB), run=60001-60001msec
Run status group 3 (all jobs):
READ: bw=4183MiB/s (4386MB/s), 4183MiB/s-4183MiB/s (4386MB/s-4386MB/s), io=245GiB (263GB), run=60002-60002msec
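To put a number on "a very small fraction", the sketch below compares the md bandwidth with the raw bandwidth per socket, using the decimal GB/s figures from the run status lines above. Note that the comparison is slightly uneven: the raw groups read from 12 drives per socket, while each array has 11 active members plus a spare.

```python
# (raw GB/s, md GB/s) per socket, from the "Run status" lines.
results = {
    "socket0": (65.3, 5.174),
    "socket1": (55.2, 4.386),
}

total_raw = sum(raw for raw, _ in results.values())
total_md = sum(md for _, md in results.values())

for name, (raw, md) in results.items():
    print(f"{name}: md delivers {md / raw:.1%} of raw")
print(f"total : {total_md:.2f} of {total_raw:.1f} GB/s = {total_md / total_raw:.1%}")
```

Both sockets come out at roughly 8% of the raw figure, so the shortfall scales the same way on each socket even though their absolute hero numbers differ.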
Disk stats (read/write):
nvme0n1: ios=79463384/0, merge=0/0, ticks=25148472/0, in_queue=25148472, util=98.78%
nvme1n1: ios=79463574/0, merge=0/0, ticks=25224784/0, in_queue=25224784, util=98.87%
nvme2n1: ios=79463699/0, merge=0/0, ticks=25305193/0, in_queue=25305193, util=98.96%
nvme3n1: ios=79463925/0, merge=0/0, ticks=25234093/0, in_queue=25234093, util=99.00%
nvme4n1: ios=79464135/0, merge=0/0, ticks=25396547/0, in_queue=25396547, util=99.06%
nvme5n1: ios=79464346/0, merge=0/0, ticks=25393624/0, in_queue=25393624, util=99.10%
nvme6n1: ios=79464535/0, merge=0/0, ticks=25330700/0, in_queue=25330700, util=99.19%
nvme7n1: ios=79464721/0, merge=0/0, ticks=25349171/0, in_queue=25349171, util=99.24%
nvme8n1: ios=79464029/0, merge=0/0, ticks=59063115/0, in_queue=59063115, util=99.32%
nvme9n1: ios=79464120/0, merge=0/0, ticks=59023913/0, in_queue=59023913, util=99.33%
nvme10n1: ios=79464799/0, merge=0/0, ticks=59136926/0, in_queue=59136927, util=99.39%
nvme11n1: ios=79465392/0, merge=0/0, ticks=59091104/0, in_queue=59091104, util=99.51%
nvme12n1: ios=67137057/0, merge=0/0, ticks=18685135/0, in_queue=18685136, util=99.60%
nvme13n1: ios=67137217/0, merge=0/0, ticks=18638940/0, in_queue=18638940, util=99.76%
nvme14n1: ios=67137341/0, merge=0/0, ticks=18663275/0, in_queue=18663275, util=99.70%
nvme15n1: ios=67137620/0, merge=0/0, ticks=18629947/0, in_queue=18629948, util=99.77%
nvme17n1: ios=67137778/0, merge=0/0, ticks=18709586/0, in_queue=18709585, util=99.80%
nvme18n1: ios=67137952/0, merge=0/0, ticks=18591798/0, in_queue=18591797, util=99.72%
nvme19n1: ios=67138199/0, merge=0/0, ticks=18669545/0, in_queue=18669545, util=99.86%
nvme20n1: ios=67138378/0, merge=0/0, ticks=18600128/0, in_queue=18600128, util=99.89%
nvme21n1: ios=67138562/0, merge=0/0, ticks=18720763/0, in_queue=18720763, util=100.00%
nvme22n1: ios=67138772/0, merge=0/0, ticks=18659716/0, in_queue=18659716, util=100.00%
nvme23n1: ios=67138982/0, merge=0/0, ticks=27862395/0, in_queue=27862395, util=100.00%
nvme24n1: ios=67134934/0, merge=0/0, ticks=241977879/0, in_queue=241977879, util=100.00%
md0: ios=75701982/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
md1: ios=64175011/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
I'm used to tuning interrupts, so here are the interrupts during the hero portion of the fio run and then during the mdraid portion. Without polling, they are just well-balanced IRQs across the different NVMe MQs.
[root@<server> jim]# ./top-irq.pl -k 1
reporting top 10 every 6 secs subject to thresh=10 kernel=1
CAL 532284 CPU146 Function call interrupts
CAL 529615 CPU154 Function call interrupts
CAL 526198 CPU162 Function call interrupts
CAL 524012 CPU142 Function call interrupts
CAL 521467 CPU174 Function call interrupts
CAL 520821 CPU178 Function call interrupts
CAL 518798 CPU176 Function call interrupts
CAL 518244 CPU166 Function call interrupts
CAL 517524 CPU180 Function call interrupts
CAL 514563 CPU136 Function call interrupts
reported top 10 (of 1885)
reported interrupts = 5223526 870587.7 per sec 6.8% of all interrupts
^C
[root@<server> jim]# !!
./top-irq.pl -k 1
reporting top 10 every 6 secs subject to thresh=10 kernel=1
CAL 63759 CPU15 Function call interrupts
CAL 63664 CPU178 Function call interrupts
CAL 63428 CPU142 Function call interrupts
CAL 63382 CPU51 Function call interrupts
CAL 63285 CPU140 Function call interrupts
CAL 63068 CPU150 Function call interrupts
CAL 63017 CPU148 Function call interrupts
CAL 62984 CPU144 Function call interrupts
CAL 62842 CPU25 Function call interrupts
CAL 62835 CPU37 Function call interrupts
reported top 10 (of 1885)
reported interrupts = 632264 105377.3 per sec 4.0% of all interrupts
Lastly, I can't make md0 and md1 each get ~2M IOPS at the same time. Sometimes the NUMA0 md is the fastest, sometimes the NUMA1 md is the fastest. I think there might be some sort of bottleneck/race somewhere. It stays that way until I stop them and reassemble, and then it may switch. I haven't done enough troubleshooting to notice the pattern.
I still have to work out why the socket0/socket1 hero numbers differ (16.0M vs. 13.5M); that is something I'll have to take up with HPE, or maybe there is a card slowing down the drives on socket1.
Any help is greatly appreciated. Criticism will be accepted, and worst case, IF I HAVEN'T MISSED SOMETHING SO UTTERLY SILLY, this becomes a de facto "where to start" for base users like me before the kernel-level experts get involved.
As an FYI, I have booted a 5.13 kernel and started using io_uring on a different server with GEN3 drives: no noticeable difference in md performance. I can raise my "hero numbers" when I have time to play, but right now, my job is to get protected IOPS.
Jim Finlayson
U.S. Department of Defense