* Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
@ 2021-07-27 20:32 Finlayson, James M CIV (USA)
2021-07-27 21:52 ` Chris Murphy
` (4 more replies)
0 siblings, 5 replies; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-07-27 20:32 UTC (permalink / raw)
To: 'linux-raid@vger.kernel.org'; +Cc: Finlayson, James M CIV (USA)
Sorry, this will be a long email with everything I find relevant. I can get over 110GB/s of 4kB random reads out of the individual NVMe SSDs, but I'm at a loss as to why mdraid can only do a very small fraction of that. I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful. Below is everything I do to a server to make the I/O crank. My role is that of a lab researcher/resident expert/consultant, and I'm just stumped why I can't do better. If there is a fine manual that somebody can point me to, I'm happy to read it.
I have tried both RAID5 and RAID6, trying to be highly cognizant of NUMAness. The Rome is set to one NUMA node per socket (NPS1), and the BIOS is set to maximize Infinity Fabric and PCIe performance per AMD's white papers. The NVMe drives are all Gen4 (I believe HPE-rebadged Samsung 1733a?); I can get the drives doing 1.45M 4KB random reads each if I try hard.
Everything I can think to share:
[root@<server> <server>]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)
[root@<server> <server>]# uname -r
4.18.0-305.el8.x86_64
[root@<server> ~]# modinfo raid6
filename: /lib/modules/4.18.0-305.el8.x86_64/kernel/drivers/md/raid456.ko.xz
alias: raid6
alias: raid5
alias: md-level-6
alias: md-raid6
alias: md-personality-8
alias: md-level-4
alias: md-level-5
alias: md-raid4
alias: md-raid5
alias: md-personality-4
description: RAID4/5/6 (striping with parity) personality for MD
license: GPL
rhelversion: 8.4
srcversion: FE86A53E1C1CDAE8F972CBA
depends: async_raid6_recov,async_pq,libcrc32c,raid6_pq,async_tx,async_memcpy,async_xor
intree: Y
name: raid456
vermagic: 4.18.0-305.el8.x86_64 SMP mod_unload modversions
sig_id: PKCS#7
signer: Red Hat Enterprise Linux kernel signing key
[root@<server> ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme16n1 259:0 0 1.8T 0 disk
├─nvme16n1p1 259:1 0 512M 0 part /boot/efi
├─nvme16n1p2 259:2 0 512M 0 part /boot
├─nvme16n1p3 259:3 0 49.4G 0 part [SWAP]
└─nvme16n1p4 259:4 0 1.7T 0 part /
nvme0n1 259:5 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme1n1 259:6 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme2n1 259:7 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme3n1 259:8 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme7n1 259:9 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme11n1 259:10 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme10n1 259:11 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme14n1 259:12 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme5n1 259:13 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme8n1 259:14 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme6n1 259:15 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme9n1 259:16 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme15n1 259:17 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme20n1 259:18 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme13n1 259:19 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme18n1 259:20 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme4n1 259:21 0 14T 0 disk
└─md0 9:0 0 139.7T 0 raid5
nvme21n1 259:22 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme22n1 259:23 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme24n1 259:24 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme12n1 259:25 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme17n1 259:26 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme19n1 259:27 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
nvme23n1 259:28 0 14T 0 disk
└─md1 9:1 0 139.7T 0 raid5
[root@<server> ~]# lsblk -o KNAME,MODEL,VENDOR
KNAME MODEL VENDOR
nvme0n1 MZXL515THALA-000H3
nvme1n1 MZXL515THALA-000H3
nvme2n1 MZXL515THALA-000H3
nvme3n1 MZXL515THALA-000H3
nvme7n1 MZXL515THALA-000H3
nvme11n1 MZXL515THALA-000H3
nvme10n1 MZXL515THALA-000H3
nvme14n1 MZXL515THALA-000H3
nvme5n1 MZXL515THALA-000H3
nvme8n1 MZXL515THALA-000H3
nvme6n1 MZXL515THALA-000H3
nvme9n1 MZXL515THALA-000H3
nvme15n1 MZXL515THALA-000H3
nvme20n1 MZXL515THALA-000H3
nvme13n1 MZXL515THALA-000H3
nvme18n1 MZXL515THALA-000H3
nvme4n1 MZXL515THALA-000H3
nvme21n1 MZXL515THALA-000H3
nvme22n1 MZXL515THALA-000H3
nvme24n1 MZXL515THALA-000H3
nvme12n1 MZXL515THALA-000H3
nvme17n1 MZXL515THALA-000H3
nvme19n1 MZXL515THALA-000H3
nvme23n1 MZXL515THALA-000H3
[root@<server> jim]# ./map_numa.sh (nvme16 is the boot drive; nvme0-11 are on NUMA 0, nvme12-24 on NUMA 1)
device: nvme8 numanode: 0
device: nvme9 numanode: 0
device: nvme10 numanode: 0
device: nvme11 numanode: 0
device: nvme4 numanode: 0
device: nvme5 numanode: 0
device: nvme6 numanode: 0
device: nvme7 numanode: 0
device: nvme2 numanode: 0
device: nvme3 numanode: 0
device: nvme0 numanode: 0
device: nvme1 numanode: 0
device: nvme21 numanode: 1
device: nvme22 numanode: 1
device: nvme23 numanode: 1
device: nvme24 numanode: 1
device: nvme16 numanode: 1
device: nvme17 numanode: 1
device: nvme18 numanode: 1
device: nvme19 numanode: 1
device: nvme20 numanode: 1
device: nvme14 numanode: 1
device: nvme15 numanode: 1
device: nvme12 numanode: 1
device: nvme13 numanode: 1
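map_numa.sh isn't shown; a minimal equivalent, assuming the stock sysfs layout, would be something like:
#!/bin/bash
# report the NUMA node each NVMe controller hangs off of
for dev in /sys/class/nvme/nvme*; do
    echo "device: $(basename "$dev") numanode: $(cat "$dev/device/numa_node")"
done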
[root@<server> jim]# cat /etc/udev/rules.d/99-abj.nr_32.rules
KERNEL=="nvme*[0-9]n*[0-9]",ATTRS{model}=="MZXL515THALA-000H3",ATTR{queue/io_poll}="1",ATTR{queue/io_poll_delay}="100000",ATTR{queue/nomerges}="2",ATTR{queue/nr_requests}="1023",ATTR{queue/rotational}="0",ATTR{queue/rq_affinity}="2",ATTR{queue/scheduler}="none",ATTR{queue/add_random}="0",ATTR{queue/max_sectors_kb}="4096",PROGRAM="/usr/sbin/nvme set-feature /dev/%k --feature-id 8 --value 522 " {coalesce up to 10 interrupts per device}
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="md*", ATTR{md/sync_speed_max}="2000000",ATTR{md/group_thread_cnt}="64", ATTR{md/stripe_cache_size}="8192",ATTR{queue/io_poll}="1",ATTR{queue/io_poll_delay}="100000",ATTR{queue/nomerges}="2",ATTR{queue/nr_requests}="1023",ATTR{queue/rotational}="0",ATTR{queue/rq_affinity}="2",ATTR{queue/scheduler}="none",ATTR{queue/add_random}="0",ATTR{queue/max_sectors_kb}="4096"
(I know the nr_requests=1023 doesn't take effect on the md devices, but it's there for reference.) We tune for max IOPS, not for latency, hence going hard at rq_affinity, nomerges, etc.
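A quick way to double-check that the rules actually landed (standard sysfs paths; one drive and one array shown):
for f in scheduler nr_requests nomerges rq_affinity max_sectors_kb io_poll; do
    printf '%-16s %s\n' "$f" "$(cat /sys/block/nvme0n1/queue/$f)"
done
grep . /sys/block/md0/md/{group_thread_cnt,stripe_cache_size}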
[root@<server> <server>]# cat /proc/mdstat (the 128K chunk is just something Fusion-io told me way back when and I never needed to change)
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 nvme11n1[11](S) nvme10n1[10] nvme9n1[9] nvme8n1[8] nvme7n1[7] nvme6n1[6] nvme5n1[5] nvme4n1[4] nvme3n1[3] nvme2n1[2] nvme1n1[1] nvme0n1[0]
150007961600 blocks super 1.2 level 5, 128k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
bitmap: 0/112 pages [0KB], 65536KB chunk
md1 : active raid5 nvme24n1[11](S) nvme23n1[10] nvme22n1[9] nvme21n1[8] nvme20n1[7] nvme19n1[6] nvme18n1[5] nvme17n1[4] nvme15n1[3] nvme14n1[2] nvme13n1[1] nvme12n1[0]
150007961600 blocks super 1.2 level 5, 128k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
bitmap: 0/112 pages [0KB], 65536KB chunk
unused devices: <none>
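For reference, an array matching the mdstat above (11 active devices plus one spare, 128K chunk) would have been created with something like the following - a reconstruction, not necessarily the exact command used:
mdadm --create /dev/md0 --level=5 --chunk=128 \
      --raid-devices=11 --spare-devices=1 \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1  /dev/nvme3n1 \
      /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1  /dev/nvme7n1 \
      /dev/nvme8n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1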
What troubles me: if mdraid checked parity on every read I could somewhat understand the slowdown, but I would think reads are nearly a pass-through.
[root@<server> /]# grep raid /var/log/messages
Jul 27 00:00:02 <server> rpmlist_verification[12745]: libblockdev-mdraid 2.24 Thu 22 Jul 2021 02:58:37 PM GMT
Jul 27 18:00:28 <server> kernel: raid6: sse2x1 gen() 9792 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x1 xor() 6436 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x2 gen() 11198 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x2 xor() 9546 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x4 gen() 14271 MB/s
Jul 27 18:00:29 <server> kernel: raid6: sse2x4 xor() 6354 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x1 gen() 22838 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x1 xor() 14069 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x2 gen() 26973 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x2 xor() 18380 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x4 gen() 26601 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x4 xor() 7025 MB/s
Jul 27 18:00:29 <server> kernel: raid6: using algorithm avx2x2 gen() 26973 MB/s
Jul 27 18:00:29 <server> kernel: raid6: .... xor() 18380 MB/s, rmw enabled
Jul 27 18:00:29 <server> kernel: raid6: using avx2x2 recovery algorithm
[root@<server> <server>]# cat fiojim.hpdl385.nps1
[global]
name=random
iodepth=128
ioengine=libaio
direct=1
norandommap
group_reporting
randrepeat=1
random_generator=tausworthe64
bs=4k
rw=randread
numjobs=64
runtime=60
[socket0]
new_group
numa_mem_policy=bind:0
numa_cpu_nodes=0
filename=/dev/nvme0n1
filename=/dev/nvme1n1
filename=/dev/nvme2n1
filename=/dev/nvme3n1
filename=/dev/nvme4n1
filename=/dev/nvme5n1
filename=/dev/nvme6n1
filename=/dev/nvme7n1
filename=/dev/nvme8n1
filename=/dev/nvme9n1
filename=/dev/nvme10n1
filename=/dev/nvme11n1
[socket1]
new_group
numa_mem_policy=bind:1
numa_cpu_nodes=1
filename=/dev/nvme12n1
filename=/dev/nvme13n1
filename=/dev/nvme14n1
filename=/dev/nvme15n1
filename=/dev/nvme17n1
filename=/dev/nvme18n1
filename=/dev/nvme19n1
filename=/dev/nvme20n1
filename=/dev/nvme21n1
filename=/dev/nvme22n1
filename=/dev/nvme23n1
filename=/dev/nvme24n1
[socket0-md]
stonewall
numa_mem_policy=bind:0
numa_cpu_nodes=0
filename=/dev/md0
[socket1-md]
new_group
numa_mem_policy=bind:1
numa_cpu_nodes=1
filename=/dev/md1
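Note the stonewall in [socket0-md]: it waits for all preceding jobs, so the two raw-device groups run first and then both md groups run together. The file is run as-is, e.g.:
# raw-device groups complete before the md groups start
fio fiojim.hpdl385.nps1 --output=fiojim.out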
iostat -xkz 1 with the drives raw:
avg-cpu: %user %nice %system %iowait %steal %idle
8.32 0.00 38.30 0.00 0.00 53.39
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme0n1 1317510.00 0.00 5270044.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00 411.95 4.00 0.00 0.00 100.40
nvme1n1 1317548.00 0.00 5270192.00 0.00 0.00 0.00 0.00 0.00 0.32 0.00 417.38 4.00 0.00 0.00 100.00
nvme2n1 1317578.00 0.00 5270316.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00 414.77 4.00 0.00 0.00 100.20
nvme3n1 1317554.00 0.00 5270216.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00 413.25 4.00 0.00 0.00 100.40
nvme7n1 1317559.00 0.00 5270236.00 0.00 0.00 0.00 0.00 0.00 0.33 0.00 430.03 4.00 0.00 0.00 100.40
nvme11n1 1317502.00 0.00 5269996.00 0.00 0.00 0.00 0.00 0.00 0.73 0.00 964.85 4.00 0.00 0.00 100.40
nvme10n1 1317656.00 0.00 5270624.00 0.00 0.00 0.00 0.00 0.00 0.80 0.00 1050.05 4.00 0.00 0.00 100.40
nvme14n1 1107632.00 0.00 4430528.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 307.52 4.00 0.00 0.00 100.40
nvme5n1 1317583.00 0.00 5270332.00 0.00 0.00 0.00 0.00 0.00 0.33 0.00 430.47 4.00 0.00 0.00 100.00
nvme8n1 1317617.00 0.00 5270468.00 0.00 0.00 0.00 0.00 0.00 0.74 0.00 972.52 4.00 0.00 0.00 101.00
nvme6n1 1317535.00 0.00 5270144.00 0.00 0.00 0.00 0.00 0.00 0.33 0.00 432.48 4.00 0.00 0.00 100.60
nvme9n1 1317582.00 0.00 5270328.00 0.00 0.00 0.00 0.00 0.00 0.75 0.00 992.82 4.00 0.00 0.00 100.40
nvme15n1 1107703.00 0.00 4430816.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 305.93 4.00 0.00 0.00 100.60
nvme20n1 1107712.00 0.00 4430848.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 306.72 4.00 0.00 0.00 100.20
nvme13n1 1107714.00 0.00 4430852.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 307.10 4.00 0.00 0.00 101.40
nvme18n1 1107674.00 0.00 4430696.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 306.04 4.00 0.00 0.00 100.20
nvme4n1 1317521.00 0.00 5270076.00 0.00 0.00 0.00 0.00 0.00 0.33 0.00 431.63 4.00 0.00 0.00 100.20
nvme21n1 1107714.00 0.00 4430856.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 309.11 4.00 0.00 0.00 100.40
nvme22n1 1107711.00 0.00 4430840.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 308.52 4.00 0.00 0.00 100.60
nvme24n1 1107441.00 0.00 4429768.00 0.00 0.00 0.00 0.00 0.00 3.86 0.00 4271.29 4.00 0.00 0.00 100.20
nvme12n1 1107733.00 0.00 4430932.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 307.70 4.00 0.00 0.00 100.40
nvme17n1 1107858.00 0.00 4431436.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 307.95 4.00 0.00 0.00 100.60
nvme19n1 1107766.00 0.00 4431064.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 307.17 4.00 0.00 0.00 100.40
nvme23n1 1108033.00 0.00 4432132.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00 340.62 4.00 0.00 0.00 100.00
iostat -xkz 1 with the md's
avg-cpu: %user %nice %system %iowait %steal %idle
0.56 0.00 49.94 0.00 0.00 49.51
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme0n1 114589.00 0.00 458356.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.54 4.00 0.00 0.01 100.00
nvme1n1 115284.00 0.00 461136.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.77 4.00 0.00 0.01 100.00
nvme2n1 114911.00 0.00 459644.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.61 4.00 0.00 0.01 100.00
nvme3n1 114538.00 0.00 458152.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.55 4.00 0.00 0.01 100.00
nvme7n1 114524.00 0.00 458096.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.53 4.00 0.00 0.01 100.00
nvme10n1 114934.00 0.00 459736.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.61 4.00 0.00 0.01 100.00
nvme14n1 97399.00 0.00 389596.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.41 4.00 0.00 0.01 100.00
nvme5n1 114929.00 0.00 459716.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.61 4.00 0.00 0.01 100.00
nvme8n1 114393.00 0.00 457572.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.40 4.00 0.00 0.01 99.90
nvme6n1 114731.00 0.00 458924.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.56 4.00 0.00 0.01 99.90
nvme9n1 114146.00 0.00 456584.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.37 4.00 0.00 0.01 99.90
nvme15n1 96960.00 0.00 387840.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.30 4.00 0.00 0.01 100.00
nvme20n1 97171.00 0.00 388684.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.36 4.00 0.00 0.01 100.00
nvme13n1 96874.00 0.00 387496.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.31 4.00 0.00 0.01 100.00
nvme18n1 96696.00 0.00 386784.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.16 4.00 0.00 0.01 100.00
nvme4n1 115220.00 0.00 460876.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.75 4.00 0.00 0.01 100.00
nvme21n1 96756.00 0.00 387024.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.24 4.00 0.00 0.01 100.00
nvme22n1 97352.00 0.00 389408.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.36 4.00 0.00 0.01 100.00
nvme12n1 96899.00 0.00 387596.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.22 4.00 0.00 0.01 100.20
nvme17n1 96748.00 0.00 386992.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.24 4.00 0.00 0.01 100.00
nvme19n1 97191.00 0.00 388764.00 0.00 0.00 0.00 0.00 0.00 0.30 0.00 29.30 4.00 0.00 0.01 100.00
nvme23n1 96787.00 0.00 387148.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 28.41 4.00 0.00 0.01 99.90
md1 1066812.00 0.00 4267248.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.00 0.00 0.00 0.00
md0 1262173.00 0.00 5048692.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.00 0.00 0.00 0.00
fio output:
socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket0-md: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1-md: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.26
Starting 256 processes
Jobs: 128 (f=128): [_(128),r(128)][1.6%][r=9103MiB/s][r=2330k IOPS][eta 02h:08m:00s]
socket0: (groupid=0, jobs=64): err= 0: pid=18344: Tue Jul 27 20:00:10 2021
read: IOPS=16.0M, BW=60.8GiB/s (65.3GB/s)(3651GiB/60003msec)
slat (nsec): min=1222, max=18033k, avg=2429.23, stdev=2975.48
clat (usec): min=24, max=20221, avg=510.51, stdev=336.57
lat (usec): min=30, max=20240, avg=513.01, stdev=336.58
clat percentiles (usec):
| 1.00th=[ 147], 5.00th=[ 194], 10.00th=[ 229], 20.00th=[ 281],
| 30.00th=[ 326], 40.00th=[ 367], 50.00th=[ 412], 60.00th=[ 469],
| 70.00th=[ 553], 80.00th=[ 676], 90.00th=[ 914], 95.00th=[ 1156],
| 99.00th=[ 1778], 99.50th=[ 2073], 99.90th=[ 2868], 99.95th=[ 3294],
| 99.99th=[ 4424]
bw ( MiB/s): min=52367, max=65429, per=32.81%, avg=62388.68, stdev=33.73, samples=7424
iops : min=13406054, max=16749890, avg=15971477.42, stdev=8635.86, samples=7424
lat (usec) : 50=0.01%, 100=0.02%, 250=13.89%, 500=50.33%, 750=19.72%
lat (usec) : 1000=8.24%
lat (msec) : 2=7.22%, 4=0.57%, 10=0.02%, 20=0.01%, 50=0.01%
cpu : usr=17.93%, sys=49.30%, ctx=21719222, majf=0, minf=9915
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=957111950,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
socket1: (groupid=1, jobs=64): err= 0: pid=18408: Tue Jul 27 20:00:10 2021
read: IOPS=13.5M, BW=51.4GiB/s (55.2GB/s)(3085GiB/60008msec)
slat (nsec): min=1232, max=1696.9k, avg=2580.28, stdev=2841.95
clat (usec): min=21, max=26808, avg=604.58, stdev=1211.79
lat (usec): min=26, max=26810, avg=607.23, stdev=1211.80
clat percentiles (usec):
| 1.00th=[ 124], 5.00th=[ 157], 10.00th=[ 184], 20.00th=[ 225],
| 30.00th=[ 258], 40.00th=[ 289], 50.00th=[ 318], 60.00th=[ 351],
| 70.00th=[ 388], 80.00th=[ 437], 90.00th=[ 586], 95.00th=[ 2769],
| 99.00th=[ 6587], 99.50th=[ 9372], 99.90th=[12649], 99.95th=[13829],
| 99.99th=[16712]
bw ( MiB/s): min=32950, max=67704, per=20.46%, avg=52713.11, stdev=106.96, samples=7424
iops : min=8435402, max=17332350, avg=13494532.64, stdev=27383.02, samples=7424
lat (usec) : 50=0.01%, 100=0.16%, 250=27.38%, 500=59.09%, 750=4.93%
lat (usec) : 1000=0.30%
lat (msec) : 2=0.60%, 4=5.67%, 10=1.47%, 20=0.39%, 50=0.01%
cpu : usr=14.86%, sys=45.29%, ctx=36050249, majf=0, minf=10046
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=808781317,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
socket0-md: (groupid=2, jobs=64): err= 0: pid=18479: Tue Jul 27 20:00:10 2021
read: IOPS=1263k, BW=4934MiB/s (5174MB/s)(289GiB/60001msec)
slat (nsec): min=1512, max=48037k, avg=49957.85, stdev=33615.19
clat (usec): min=176, max=51614, avg=6432.56, stdev=410.54
lat (usec): min=178, max=51639, avg=6482.58, stdev=412.23
clat percentiles (usec):
| 1.00th=[ 6128], 5.00th=[ 6259], 10.00th=[ 6325], 20.00th=[ 6325],
| 30.00th=[ 6390], 40.00th=[ 6390], 50.00th=[ 6456], 60.00th=[ 6456],
| 70.00th=[ 6521], 80.00th=[ 6521], 90.00th=[ 6587], 95.00th=[ 6587],
| 99.00th=[ 6652], 99.50th=[ 6718], 99.90th=[ 7635], 99.95th=[16909],
| 99.99th=[18220]
bw ( MiB/s): min= 4582, max= 5934, per=100.00%, avg=4938.25, stdev= 2.07, samples=7616
iops : min=1173219, max=1519297, avg=1264175.97, stdev=528.77, samples=7616
lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.34%, 10=99.57%, 20=0.08%, 50=0.01%
lat (msec) : 100=0.01%
cpu : usr=1.23%, sys=95.69%, ctx=2557, majf=0, minf=9064
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=75789817,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
socket1-md: (groupid=3, jobs=64): err= 0: pid=18543: Tue Jul 27 20:00:10 2021
read: IOPS=1071k, BW=4183MiB/s (4386MB/s)(245GiB/60002msec)
slat (nsec): min=1563, max=14080k, avg=59051.10, stdev=22401.39
clat (usec): min=179, max=20799, avg=7588.23, stdev=303.92
lat (usec): min=211, max=20853, avg=7647.34, stdev=305.26
clat percentiles (usec):
| 1.00th=[ 7111], 5.00th=[ 7373], 10.00th=[ 7439], 20.00th=[ 7504],
| 30.00th=[ 7504], 40.00th=[ 7570], 50.00th=[ 7570], 60.00th=[ 7635],
| 70.00th=[ 7635], 80.00th=[ 7701], 90.00th=[ 7767], 95.00th=[ 7767],
| 99.00th=[ 7898], 99.50th=[ 7898], 99.90th=[ 8586], 99.95th=[13304],
| 99.99th=[19006]
bw ( MiB/s): min= 3955, max= 4642, per=100.00%, avg=4186.20, stdev= 0.98, samples=7616
iops : min=1012714, max=1188416, avg=1071653.68, stdev=251.68, samples=7616
lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=99.94%, 20=0.05%, 50=0.01%
cpu : usr=1.06%, sys=95.70%, ctx=1980, majf=0, minf=9030
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=64246431,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
READ: bw=60.8GiB/s (65.3GB/s), 60.8GiB/s-60.8GiB/s (65.3GB/s-65.3GB/s), io=3651GiB (3920GB), run=60003-60003msec
Run status group 1 (all jobs):
READ: bw=51.4GiB/s (55.2GB/s), 51.4GiB/s-51.4GiB/s (55.2GB/s-55.2GB/s), io=3085GiB (3313GB), run=60008-60008msec
Run status group 2 (all jobs):
READ: bw=4934MiB/s (5174MB/s), 4934MiB/s-4934MiB/s (5174MB/s-5174MB/s), io=289GiB (310GB), run=60001-60001msec
Run status group 3 (all jobs):
READ: bw=4183MiB/s (4386MB/s), 4183MiB/s-4183MiB/s (4386MB/s-4386MB/s), io=245GiB (263GB), run=60002-60002msec
Disk stats (read/write):
nvme0n1: ios=79463384/0, merge=0/0, ticks=25148472/0, in_queue=25148472, util=98.78%
nvme1n1: ios=79463574/0, merge=0/0, ticks=25224784/0, in_queue=25224784, util=98.87%
nvme2n1: ios=79463699/0, merge=0/0, ticks=25305193/0, in_queue=25305193, util=98.96%
nvme3n1: ios=79463925/0, merge=0/0, ticks=25234093/0, in_queue=25234093, util=99.00%
nvme4n1: ios=79464135/0, merge=0/0, ticks=25396547/0, in_queue=25396547, util=99.06%
nvme5n1: ios=79464346/0, merge=0/0, ticks=25393624/0, in_queue=25393624, util=99.10%
nvme6n1: ios=79464535/0, merge=0/0, ticks=25330700/0, in_queue=25330700, util=99.19%
nvme7n1: ios=79464721/0, merge=0/0, ticks=25349171/0, in_queue=25349171, util=99.24%
nvme8n1: ios=79464029/0, merge=0/0, ticks=59063115/0, in_queue=59063115, util=99.32%
nvme9n1: ios=79464120/0, merge=0/0, ticks=59023913/0, in_queue=59023913, util=99.33%
nvme10n1: ios=79464799/0, merge=0/0, ticks=59136926/0, in_queue=59136927, util=99.39%
nvme11n1: ios=79465392/0, merge=0/0, ticks=59091104/0, in_queue=59091104, util=99.51%
nvme12n1: ios=67137057/0, merge=0/0, ticks=18685135/0, in_queue=18685136, util=99.60%
nvme13n1: ios=67137217/0, merge=0/0, ticks=18638940/0, in_queue=18638940, util=99.76%
nvme14n1: ios=67137341/0, merge=0/0, ticks=18663275/0, in_queue=18663275, util=99.70%
nvme15n1: ios=67137620/0, merge=0/0, ticks=18629947/0, in_queue=18629948, util=99.77%
nvme17n1: ios=67137778/0, merge=0/0, ticks=18709586/0, in_queue=18709585, util=99.80%
nvme18n1: ios=67137952/0, merge=0/0, ticks=18591798/0, in_queue=18591797, util=99.72%
nvme19n1: ios=67138199/0, merge=0/0, ticks=18669545/0, in_queue=18669545, util=99.86%
nvme20n1: ios=67138378/0, merge=0/0, ticks=18600128/0, in_queue=18600128, util=99.89%
nvme21n1: ios=67138562/0, merge=0/0, ticks=18720763/0, in_queue=18720763, util=100.00%
nvme22n1: ios=67138772/0, merge=0/0, ticks=18659716/0, in_queue=18659716, util=100.00%
nvme23n1: ios=67138982/0, merge=0/0, ticks=27862395/0, in_queue=27862395, util=100.00%
nvme24n1: ios=67134934/0, merge=0/0, ticks=241977879/0, in_queue=241977879, util=100.00%
md0: ios=75701982/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
md1: ios=64175011/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
I'm used to tuning interrupts, so here are the interrupts during the hero portion of the fio run, followed by the mdraid portion. Without polling they are just well-balanced IRQs across the different NVMe MQs.
[root@<server> jim]# ./top-irq.pl -k 1
reporting top 10 every 6 secs subject to thresh=10 kernel=1
CAL 532284 CPU146 Function call interrupts
CAL 529615 CPU154 Function call interrupts
CAL 526198 CPU162 Function call interrupts
CAL 524012 CPU142 Function call interrupts
CAL 521467 CPU174 Function call interrupts
CAL 520821 CPU178 Function call interrupts
CAL 518798 CPU176 Function call interrupts
CAL 518244 CPU166 Function call interrupts
CAL 517524 CPU180 Function call interrupts
CAL 514563 CPU136 Function call interrupts
reported top 10 (of 1885)
reported interrupts = 5223526 870587.7 per sec 6.8% of all interrupts
^C
[root@<server> jim]# !!
./top-irq.pl -k 1
reporting top 10 every 6 secs subject to thresh=10 kernel=1
CAL 63759 CPU15 Function call interrupts
CAL 63664 CPU178 Function call interrupts
CAL 63428 CPU142 Function call interrupts
CAL 63382 CPU51 Function call interrupts
CAL 63285 CPU140 Function call interrupts
CAL 63068 CPU150 Function call interrupts
CAL 63017 CPU148 Function call interrupts
CAL 62984 CPU144 Function call interrupts
CAL 62842 CPU25 Function call interrupts
CAL 62835 CPU37 Function call interrupts
reported top 10 (of 1885)
reported interrupts = 632264 105377.3 per sec 4.0% of all interrupts
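top-irq.pl is a local script; a rough one-shot equivalent for the CAL row (function-call IPIs), assuming the standard /proc/interrupts layout, is below. It prints counts since boot, whereas the script reports 6-second deltas:
awk '$1 == "CAL:" { for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) print $i, "CPU" (i - 2) }' \
    /proc/interrupts | sort -rn | head -10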
Lastly, I can't make md0 and md1 each get ~2M IOPS at the same time. Sometimes the NUMA0 md is the faster one, sometimes the NUMA1 md is - I think there might be some sort of bottleneck/race somewhere. It stays that way until I stop and reassemble the arrays, and then it may switch. I haven't troubleshot enough to notice the pattern.
The socket0/socket1 difference in hero numbers (16.0M vs 13.5M) is something I'll have to take up with HPE - maybe there is a card slowing down the drives on socket1.
Any help is greatly appreciated. Criticism will be accepted, and worst case, IF I HAVEN'T MISSED SOMETHING SO UTTERLY SILLY, this becomes a de facto "where to start" for base users like me before the kernel-level experts get involved.
As an FYI, I have booted a 5.13 kernel and started using io_uring - no noticeable difference in md performance on a different server with Gen3 drives. I can raise my "hero numbers" when I have time to play, but right now my job is to get protected IOPS.
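For the io_uring runs, the only required job-file change is the engine; the polling-related extras below are assumptions, not something from the original runs:
[global]
ioengine=io_uring
# optional extras, kernel and fio support permitting:
# hipri            (polled completions)
# registerfiles=1  (pre-register files with the kernel)
# fixedbufs=1      (pre-register IO buffers)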
Jim Finlayson
U.S. Department of Defense
* Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-27 20:32 Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing????? Finlayson, James M CIV (USA)
@ 2021-07-27 21:52 ` Chris Murphy
2021-07-27 22:42 ` Peter Grandi
` (3 subsequent siblings)
4 siblings, 0 replies; 28+ messages in thread
From: Chris Murphy @ 2021-07-27 21:52 UTC (permalink / raw)
To: linux-raid; +Cc: Finlayson, James M CIV (USA)
On Tue, Jul 27, 2021 at 2:40 PM Finlayson, James M CIV (USA)
<james.m.finlayson4.civ@mail.mil> wrote:
>
> [root@<server> <server>]# cat /etc/redhat-release
> Red Hat Enterprise Linux release 8.4 (Ootpa)
> [root@<server> <server>]# uname -r
> 4.18.0-305.el8.x86_64
I think you'll get a better response by opening a support ticket with
your distro. That's a distro kernel, and upstream has pretty much let
that kernel version set sail a long time ago; upstream is mainly
concerned with linux-next, mainline, and stable kernels. You could
retest with kernel-ml from elrepo.org; 5.13.5 has been up there for a
couple of days.
--
Chris Murphy
* Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-27 20:32 Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing????? Finlayson, James M CIV (USA)
2021-07-27 21:52 ` Chris Murphy
@ 2021-07-27 22:42 ` Peter Grandi
2021-07-28 10:31 ` Matt Wallis
` (2 subsequent siblings)
4 siblings, 0 replies; 28+ messages in thread
From: Peter Grandi @ 2021-07-27 22:42 UTC (permalink / raw)
To: list Linux RAID
[...]
> Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
> nvme0n1 1317510.00 0.00 5270044.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00 411.95 4.00 0.00 0.00 100.40
[...]
> Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
> nvme0n1 114589.00 0.00 458356.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.54 4.00 0.00 0.01 100.00
The obvious difference is the factor of 10 in "aqu-sz", which
corresponds to the factor of 10 in "r/s" and "rkB/s".
I have noticed that MD RAID does some weird things to the
queueing - it is not a "normal" block device - and this often
creates oddities (the same happens with DM/LVM2).
Try creating a filesystem on top of 'md0' and 'md1' and testing
that; things may be quite different.
* Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-27 20:32 Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing????? Finlayson, James M CIV (USA)
2021-07-27 21:52 ` Chris Murphy
2021-07-27 22:42 ` Peter Grandi
@ 2021-07-28 10:31 ` Matt Wallis
2021-07-28 10:43 ` [Non-DoD Source] " Finlayson, James M CIV (USA)
2021-08-01 11:21 ` Gal Ofri
[not found] ` <AS8PR04MB799205817C4647DAC740DE9A91EA9@AS8PR04MB7992.eurprd04.prod.outlook.com>
4 siblings, 1 reply; 28+ messages in thread
From: Matt Wallis @ 2021-07-28 10:31 UTC (permalink / raw)
To: Finlayson, James M CIV (USA); +Cc: linux-raid
Hi Jim,
> On 28 Jul 2021, at 06:32, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>
> Sorry, this will be a long email with everything I find to be relevant, but I can get over 110GB/s of 4kB random reads from individual NVMe SSDs, but I'm at a loss why mdraid can only do a very small fraction of it. I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful. This is everything I do to a server to make the I/O crank.....My role is that of a lab researcher/resident expert/consultant. I'm just stumped why I can't do better. If there is a fine manual that somebody can point me to, I'm happy to read it…
I am probably going to get corrected on some, if not all, of this, but from what I understand, and from my own little experiments on a similar Intel-based system…
1. NVMe is stupid fast; you need a good chunk of CPU performance to max it out.
2. Most block IO in the kernel is limited in terms of threading; it may even be essentially single-threaded. (This is where I will get corrected.)
3. AFAICT this includes mdraid; there's a single thread per RAID device handling all the RAID calculations (mdX_raid6).
What I did to get IOPS up in a system with 24 NVMe drives, split into 12 per NUMA domain (see the sketch below):
1. Create 8 partitions on each drive (this may be overkill; I just started here for some reason).
2. Create 8 RAID6 arrays with 1 partition per drive.
3. Use LVM to create a single striped logical volume over all 8 RAID arrays - RAID 0+6, as it were.
You now have an LVM thread that is basically doing nothing more than chunking the data as it comes in, then sending the chunks to 8 separate RAID devices, each with their own threads, buffers, queues, etc., all of which can be spread over more cores.
I saw a significant (for me, significant is >20%) increase in IOPS doing this.
You still have RAID6 protection, but you might want to write a couple of scripts to help you manage the arrays, because a single failed drive now needs to be removed from 8 RAID arrays.
There's not a lot of capacity lost doing this; I'm pretty sure I lost less than 100MB to the partitions and the RAID overhead.
You would never consider this on spinning disk, of course - it's already way too slow and you'd just make it slower. NVMe, as you noticed, has the IOPS to spare, so I'm pretty sure it's just that we're not able to get the data to it fast enough.
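For concreteness, a sketch of steps 1-3 (a reconstruction, not Matt's exact commands; 12 drives in one NUMA domain, with names and sizes illustrative):
drives=(/dev/nvme{0..11}n1)              # the 12 drives in one NUMA domain
# 1. eight roughly equal GPT partitions per drive
for d in "${drives[@]}"; do
    parted -s "$d" mklabel gpt
    for p in $(seq 1 8); do
        parted -s "$d" mkpart "part$p" "$(( (p - 1) * 12 ))%" "$(( p * 12 ))%"
    done
done
# 2. one RAID6 array per partition index, across all 12 drives
for p in $(seq 1 8); do
    mdadm --create "/dev/md$((100 + p))" --level=6 --raid-devices=12 \
          "${drives[@]/%/p$p}"
done
# 3. a single striped LV across the eight arrays (RAID 0+6)
pvcreate /dev/md10{1..8}
vgcreate vg_nvme /dev/md10{1..8}
lvcreate -n lv_nvme -i 8 -I 128 -l 100%FREE vg_nvme
lvcreate -i 8 stripes the LV across the eight arrays; the 128K stripe size (-I 128) is an assumption chosen to match the md chunk size used elsewhere in this thread.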
Matt
* RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-28 10:31 ` Matt Wallis
@ 2021-07-28 10:43 ` Finlayson, James M CIV (USA)
2021-07-29 0:54 ` [Non-DoD Source] " Matt Wallis
0 siblings, 1 reply; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-07-28 10:43 UTC (permalink / raw)
To: 'Matt Wallis'
Cc: 'linux-raid@vger.kernel.org', Finlayson, James M CIV (USA)
Matt,
I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about long-term sustainability. As a researcher I can do these cool science experiments, but I still have to hand designs to sustainment folks. I was also running into an issue with doing an mdraid RAID0 on top of the RAID6s (so I could put one xfs filesystem on top of each NUMA node's drives): the final RAID0 stripe over all of the RAID6s couldn't generate the queue depth needed. We even recompiled the kernel to change the mdraid nr_requests max from 128 to 1023.
I will have to try the LVM experiment. I'm an LVM neophyte, so it might take me the rest of today/tomorrow to get new results, as I tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time also. One might be able to argue that configuration isn't too much of a "Frankenstein's monster" for me to hand it off.
Thanks,
Jim
-----Original Message-----
From: Matt Wallis <mattw@madmonks.org>
Sent: Wednesday, July 28, 2021 6:32 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: linux-raid@vger.kernel.org
Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
Hi Jim,
> On 28 Jul 2021, at 06:32, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>
> Sorry, this will be a long email with everything I find to be relevant, but I can get over 110GB/s of 4kB random reads from individual NVMe SSDs, but I'm at a loss why mdraid can only do a very small fraction of it. I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful. This is everything I do to a server to make the I/O crank.....My role is that of a lab researcher/resident expert/consultant. I'm just stumped why I can't do better. If there is a fine manual that somebody can point me to, I'm happy to read it…
I am probably going to get corrected on some if not all of this, but from what I understand, and from my own little experiments on a similar Intel based system…
1. NVMe is stupid fast, you need a good chunk of CPU performance to max it out.
2. Most block IO in the kernel is limited in terms of threading, it may even be essentially single threaded. (This is where I will get corrected)
3. AFAICT, this includes mdraid, there’s a single thread per RAID device handling all the RAID calculations. (mdX_raid6)
What I did to get IOPs up in a system with 24 NVMe, split up into 12 per NUMA domain.
1. Create 8 partitions on each drive (this may be overkill, I just started here for some reason)
2. Create 8 RAID6 arrays with 1 partition per drive.
3. Use LVM to create a single striped logical volume over all 8 RAID volumes. RAID 0+6 as it were.
You now have an LVM thread that is basically doing nothing more than chunking the data as it comes in, then, sending the chunks to 8 separate RAID devices, each with their own threads, buffers, queues etc, all of which can be spread over more cores.
I saw a significant (for me, significant is >20%) increase in IOPs doing this.
You still have RAID6 protection, but you might want to write a couple of scripts to help you manage the arrays, because a single failed drive now needs to be removed from 8 RAID volumes.
There’s not a lot of capacity lost doing this, pretty sure I lost less than 100MB to the partitions and the RAID overhead.
You would never consider this on spinning disk of course, way to slow and you’re just going to make it slower, NVMe as you noticed has the IOPs to spare, so I’m pretty sure it’s just that we’re not able to get the data to it fast enough.
Matt
* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-28 10:43 ` [Non-DoD Source] " Finlayson, James M CIV (USA)
@ 2021-07-29 0:54 ` Matt Wallis
2021-07-29 16:35 ` Wols Lists
2021-07-29 22:05 ` Finlayson, James M CIV (USA)
0 siblings, 2 replies; 28+ messages in thread
From: Matt Wallis @ 2021-07-29 0:54 UTC (permalink / raw)
To: Finlayson, James M CIV (USA); +Cc: linux-raid
Hi Jim,
Totally get the Frankenstein’s monster aspect; I try not to build those where I can, but at the moment I don’t think there’s much that can be done about it.
Not sure if LVM is better than MDRAID 0; it just gives you more control over the volumes that can be created, instead of having it all in one big chunk. If you just need one big chunk, then MDRAID 0 is probably fine.
I think if you create a couple of scripts that allow the admin to fail a drive out of all the arrays it’s in at once, then it's not much worse than managing an MDRAID normally is.
Matt.
> On 28 Jul 2021, at 20:43, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>
> Matt,
> I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about our sustainability long term. As a researcher, I can do these cool science experiments, but I still have to hand designs to sustainment folks. I was also running into an issue of doing a mdraid RAID0 on top of the RAID6's so I could toss one xfs file system on top each of the numa node's drives and the last RAID0 stripe of all of the RAID6's couldn't generate the queue depth needed. We even recompiled the kernel to change the mdraid nr_request max from 128 to 1023.
>
> I will have to try the LVM experiment. I'm an LVM neophyte, so it might take me the rest of today/tomorrow to get new results as I tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time also. Once might be able to argue that configuration isn't too much of a "Frankenstein's monster" for me to hand it off.
>
> Thanks,
> Jim
>
>
>
>
> -----Original Message-----
> From: Matt Wallis <mattw@madmonks.org>
> Sent: Wednesday, July 28, 2021 6:32 AM
> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> Cc: linux-raid@vger.kernel.org
> Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
>
> Hi Jim,
>
>> On 28 Jul 2021, at 06:32, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>>
>> Sorry, this will be a long email with everything I find to be relevant, but I can get over 110GB/s of 4kB random reads from individual NVMe SSDs, but I'm at a loss why mdraid can only do a very small fraction of it. I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful. This is everything I do to a server to make the I/O crank.....My role is that of a lab researcher/resident expert/consultant. I'm just stumped why I can't do better. If there is a fine manual that somebody can point me to, I'm happy to read it…
>
> I am probably going to get corrected on some if not all of this, but from what I understand, and from my own little experiments on a similar Intel based system…
> 1. NVMe is stupid fast, you need a good chunk of CPU performance to max it out.
> 2. Most block IO in the kernel is limited in terms of threading, it may even be essentially single threaded. (This is where I will get corrected)
> 3. AFAICT, this includes mdraid, there’s a single thread per RAID device handling all the RAID calculations. (mdX_raid6)
>
> What I did to get IOPs up in a system with 24 NVMe, split up into 12 per NUMA domain.
> 1. Create 8 partitions on each drive (this may be overkill, I just started here for some reason)
> 2. Create 8 RAID6 arrays with 1 partition per drive.
> 3. Use LVM to create a single striped logical volume over all 8 RAID volumes. RAID 0+6 as it were.
>
> You now have an LVM thread that is basically doing nothing more than chunking the data as it comes in, then, sending the chunks to 8 separate RAID devices, each with their own threads, buffers, queues etc, all of which can be spread over more cores.
>
> I saw a significant (for me, significant is >20%) increase in IOPs doing this.
>
> You still have RAID6 protection, but you might want to write a couple of scripts to help you manage the arrays, because a single failed drive now needs to be removed from 8 RAID volumes.
>
> There’s not a lot of capacity lost doing this, pretty sure I lost less than 100MB to the partitions and the RAID overhead.
>
> You would never consider this on spinning disk of course, way to slow and you’re just going to make it slower, NVMe as you noticed has the IOPs to spare, so I’m pretty sure it’s just that we’re not able to get the data to it fast enough.
>
> Matt
* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-29 0:54 ` [Non-DoD Source] " Matt Wallis
@ 2021-07-29 16:35 ` Wols Lists
2021-07-29 18:12 ` Finlayson, James M CIV (USA)
2021-07-29 22:05 ` Finlayson, James M CIV (USA)
1 sibling, 1 reply; 28+ messages in thread
From: Wols Lists @ 2021-07-29 16:35 UTC (permalink / raw)
To: Matt Wallis, Finlayson, James M CIV (USA); +Cc: linux-raid
On 29/07/21 01:54, Matt Wallis wrote:
> Hi Jim,
>
> Totally get the Frankenstein’s monster aspect, I try not to build those where I can, but at the moment I don’t think there’s much that can be done about it.
> Not sure if LVM is better than MDRAID 0, it just gives you more control over the volumes that can be created, instead of having it all in one big chunk. If you just need one big chunk, then MDRAID 0 is probably fine.
>
Sticking raid 0 on top of raid 6 sounds like an extremely weird thing
to do. What I guess you might want to do instead is write a partition
table to the raid-6? That's perfectly normal, if, imho, a bit unusual.
And LVM would be MUCH better than raid-0, I'm sure, because it
addresses this very issue by design rather than by accident.
> I think if you can create a couple of scripts that allows the admin to fail a drive out of all the arrays that it’s in at once, then it's not that much worse than managing an MDRAID is normally.
Is that wise? KISS.
>
> Matt.
>
>> On 28 Jul 2021, at 20:43, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>>
>> Matt,
>> I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about our sustainability long term. As a researcher, I can do these cool science experiments, but I still have to hand designs to sustainment folks. I was also running into an issue of doing a mdraid RAID0 on top of the RAID6's so I could toss one xfs file system on top each of the numa node's drives and the last RAID0 stripe of all of the RAID6's couldn't generate the queue depth needed. We even recompiled the kernel to change the mdraid nr_request max from 128 to 1023.
>>
>> I will have to try the LVM experiment. I'm an LVM neophyte, so it might take me the rest of today/tomorrow to get new results as I tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time also. Once might be able to argue that configuration isn't too much of a "Frankenstein's monster" for me to hand it off.
>>
Do. If it solves what you want, then it's worth it. I'm moving my stuff
over to LVM.
To throw something else into the mix: you've gone for raid 6, which
lets you lose two drives, or survive corruption on one drive. Do you
need the two-drive redundancy? The calculations are a lot more
expensive than raid-5's if you're worried about write speed. I don't
know its impact, but I'm playing with dm-integrity, which provides
some protection against corruption.
>> Thanks,
>> Jim
>>
Cheers,
Wol
* RE: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-29 16:35 ` Wols Lists
@ 2021-07-29 18:12 ` Finlayson, James M CIV (USA)
0 siblings, 0 replies; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-07-29 18:12 UTC (permalink / raw)
To: 'Wols Lists', 'Matt Wallis'
Cc: 'linux-raid@vger.kernel.org'
Hi,
Actually, the RAID5/RAID6 mdraid implementations can't support the IOPS or the queue depths required from a single basic mdraid raid5/raid6 LUN. The partitions are just there to create more mdraid stripes, to allow for more threads to do the RAID parity work and to be able to issue more I/Os to the entirety of the NVMe SSDs through mdraid. Ultimately, I need one volume per NUMA domain comprised of RAIDed NVMe SSDs. We're just exploring creative workarounds to the NVMe mdraid IOPS issues to get the most IOPS out of a collection of SSDs. I still have to put an xfs filesystem on each volume for something useful to occur.
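The thread-per-array behavior is easy to see - each raid456 array gets one mdX_raidY kernel thread, plus optional group helper threads. A quick check (not from the original post):
# one mdX_raidY thread per array; more partitions => more arrays => more threads
ps -eo pid,psr,comm | grep -E 'md[0-9]+_raid[456]'
# per-array helper threads and stripe cache, as set in the udev rules earlier
cat /sys/block/md0/md/group_thread_cnt /sys/block/md0/md/stripe_cache_size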
Thanks,
Jim
-----Original Message-----
From: Wols Lists <antlists@youngman.org.uk>
Sent: Thursday, July 29, 2021 12:35 PM
To: Matt Wallis <mattw@madmonks.org>; Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: linux-raid@vger.kernel.org
Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
On 29/07/21 01:54, Matt Wallis wrote:
> Hi Jim,
>
> Totally get the Frankenstein’s monster aspect, I try not to build those where I can, but at the moment I don’t think there’s much that can be done about it.
> Not sure if LVM is better than MDRAID 0, it just gives you more control over the volumes that can be created, instead of having it all in one big chunk. If you just need one big chunk, then MDRAID 0 is probably fine.
>
sticking raid 0 on top of raid 6 sounds an extremely weird thing to do.
What I guess you might be wanting to do instead is write a partition table to the raid-6? That's perfectly normal if, imho, a bit unusual?
And LVM would be MUCH better than raid-0, I'm sure, because it addresses this very issue by design, rather than by accident.
> I think if you can create a couple of scripts that allows the admin to fail a drive out of all the arrays that it’s in at once, then it's not that much worse than managing an MDRAID is normally.
Is that wise? KISS.
>
> Matt.
>
>> On 28 Jul 2021, at 20:43, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>>
>> Matt,
>> I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about our sustainability long term. As a researcher, I can do these cool science experiments, but I still have to hand designs to sustainment folks. I was also running into an issue of doing a mdraid RAID0 on top of the RAID6's so I could toss one xfs file system on top each of the numa node's drives and the last RAID0 stripe of all of the RAID6's couldn't generate the queue depth needed. We even recompiled the kernel to change the mdraid nr_request max from 128 to 1023.
>>
>> I will have to try the LVM experiment. I'm an LVM neophyte, so it might take me the rest of today/tomorrow to get new results as I tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time also. Once might be able to argue that configuration isn't too much of a "Frankenstein's monster" for me to hand it off.
>>
Do. If it solves what you want, then it's worth it. I'm moving my stuff over to LVM.
To throw something else into the mix, you've gone for raid 6, which enables you to lose two drives, or corrupt one drive. Do you need the two-drive redundancy? The calculations are a lot more expensive than
raid-5 if you're worried over write speed. I don't know the impact of it but I'm playing with dm-integrity which provides some protection against corruption.
>> Thanks,
>> Jim
>>
Cheers,
Wol
* RE: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-29 0:54 ` [Non-DoD Source] " Matt Wallis
2021-07-29 16:35 ` Wols Lists
@ 2021-07-29 22:05 ` Finlayson, James M CIV (USA)
2021-07-30 8:28 ` Matt Wallis
1 sibling, 1 reply; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-07-29 22:05 UTC (permalink / raw)
To: 'Matt Wallis'
Cc: 'linux-raid@vger.kernel.org', Finlayson, James M CIV (USA)
Matt,
Thank you for the tip. I have put 32 partitions on each of my NVMe drives, made 32 raid5 stripes, and then made an LVM volume from the 32 raid5 stripes - one volume per socket, spanning the 10 NVMe drives on that socket. I believe I have successfully harnessed "alien technology" (inside joke) to create a Frankenstein's monster. The results are substantially better than the RAID0 stripe over partitioned md's I tried in the past. I whipped all of this together in two 15-minute sessions (last night and just now), so I might run some more extensive tests when I have the cycles. I didn't intend to leave the thread hanging.
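The per-socket assembly, roughly - a sketch assuming socket 0's 32 arrays are /dev/md100 through /dev/md131 (actual names may differ):
pvcreate /dev/md1{00..31}
vgcreate vg_socket0 /dev/md1{00..31}
# one 32-way striped LV per socket; the 128K stripe size matches the md chunk
lvcreate -n lv_socket0 -i 32 -I 128 -l 100%FREE vg_socket0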
BLUF - detailed fio output below:
9 drives per socket, raw:
socket0, 9 drives, raw 4K random reads: 13.6M IOPS; socket1, 9 drives: 12.3M IOPS
%Cpu(s): 4.4 us, 25.6 sy, 0.0 ni, 56.7 id, 0.0 wa, 13.1 hi, 0.2 si, 0.0 st
9 data drives per socket, RAID5/LVM (9+1):
socket0, 4K random reads through the LV: 8.57M IOPS; socket1: 8.57M IOPS
%Cpu(s): 7.0 us, 22.3 sy, 0.0 ni, 58.4 id, 0.0 wa, 12.1 hi, 0.2 si, 0.0 st
All,
I intend to test the 4.15 kernel patch next week. My SA would prefer that the patch had already made it into the kernel-ml stream, so he could install an rpm, but I'll get him to build the kernel if need be.
If the kernel patch doesn't alleviate the issues, I have a strong desire to see mdraid made better.
Quick fio results:
socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket0-lv: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1-lv: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.26
Starting 256 processes
socket0: (groupid=0, jobs=64): err= 0: pid=64032: Thu Jul 29 21:48:32 2021
read: IOPS=13.6M, BW=51.7GiB/s (55.6GB/s)(3105GiB/60003msec)
slat (nsec): min=1292, max=1376.7k, avg=2545.32, stdev=2696.74
clat (usec): min=36, max=71580, avg=600.68, stdev=361.36
lat (usec): min=38, max=71616, avg=603.30, stdev=361.38
clat percentiles (usec):
| 1.00th=[ 169], 5.00th=[ 231], 10.00th=[ 277], 20.00th=[ 347],
| 30.00th=[ 404], 40.00th=[ 457], 50.00th=[ 519], 60.00th=[ 594],
| 70.00th=[ 676], 80.00th=[ 791], 90.00th=[ 996], 95.00th=[ 1205],
| 99.00th=[ 1909], 99.50th=[ 2409], 99.90th=[ 3556], 99.95th=[ 3884],
| 99.99th=[ 5538]
bw ( MiB/s): min=39960, max=56660, per=20.94%, avg=53040.69, stdev=49.61, samples=7488
iops : min=10229946, max=14504941, avg=13578391.80, stdev=12699.61, samples=7488
lat (usec) : 50=0.01%, 100=0.01%, 250=6.83%, 500=40.06%, 750=29.92%
lat (usec) : 1000=13.42%
lat (msec) : 2=8.91%, 4=0.82%, 10=0.04%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%
cpu : usr=14.82%, sys=46.57%, ctx=35564249, majf=0, minf=9754
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=813909201,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
socket1: (groupid=1, jobs=64): err= 0: pid=64096: Thu Jul 29 21:48:32 2021
read: IOPS=12.3M, BW=46.9GiB/s (50.3GB/s)(2812GiB/60003msec)
slat (nsec): min=1292, max=1672.2k, avg=2672.21, stdev=2742.06
clat (usec): min=25, max=73526, avg=663.35, stdev=611.06
lat (usec): min=28, max=73545, avg=666.09, stdev=611.08
clat percentiles (usec):
| 1.00th=[ 143], 5.00th=[ 190], 10.00th=[ 227], 20.00th=[ 285],
| 30.00th=[ 338], 40.00th=[ 400], 50.00th=[ 478], 60.00th=[ 586],
| 70.00th=[ 725], 80.00th=[ 930], 90.00th=[ 1254], 95.00th=[ 1614],
| 99.00th=[ 3490], 99.50th=[ 4146], 99.90th=[ 6390], 99.95th=[ 6980],
| 99.99th=[ 8356]
bw ( MiB/s): min=28962, max=55326, per=12.70%, avg=48036.10, stdev=96.75, samples=7488
iops : min=7414327, max=14163615, avg=12297214.98, stdev=24768.82, samples=7488
lat (usec) : 50=0.01%, 100=0.03%, 250=13.75%, 500=38.84%, 750=18.71%
lat (usec) : 1000=11.55%
lat (msec) : 2=14.30%, 4=2.23%, 10=0.60%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%
cpu : usr=13.41%, sys=44.44%, ctx=39379711, majf=0, minf=9982
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=737168913,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
socket0-lv: (groupid=2, jobs=64): err= 0: pid=64166: Thu Jul 29 21:48:32 2021
read: IOPS=8570k, BW=32.7GiB/s (35.1GB/s)(1962GiB/60006msec)
slat (nsec): min=1873, max=11085k, avg=4694.47, stdev=4825.39
clat (usec): min=24, max=21739, avg=950.52, stdev=948.83
lat (usec): min=51, max=21743, avg=955.29, stdev=948.88
clat percentiles (usec):
| 1.00th=[ 155], 5.00th=[ 217], 10.00th=[ 265], 20.00th=[ 338],
| 30.00th=[ 404], 40.00th=[ 486], 50.00th=[ 594], 60.00th=[ 766],
| 70.00th=[ 1029], 80.00th=[ 1418], 90.00th=[ 2089], 95.00th=[ 2737],
| 99.00th=[ 4490], 99.50th=[ 5669], 99.90th=[ 8586], 99.95th=[ 9896],
| 99.99th=[12125]
bw ( MiB/s): min=24657, max=37453, per=-25.17%, avg=33516.00, stdev=35.32, samples=7616
iops : min=6312326, max=9588007, avg=8580076.03, stdev=9041.88, samples=7616
lat (usec) : 50=0.01%, 100=0.01%, 250=8.40%, 500=33.21%, 750=17.54%
lat (usec) : 1000=9.89%
lat (msec) : 2=19.93%, 4=9.58%, 10=1.40%, 20=0.05%, 50=0.01%
cpu : usr=9.01%, sys=51.48%, ctx=27829950, majf=0, minf=9028
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=514275323,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
socket1-lv: (groupid=3, jobs=64): err= 0: pid=64230: Thu Jul 29 21:48:32 2021
read: IOPS=8571k, BW=32.7GiB/s (35.1GB/s)(1962GiB/60006msec)
slat (nsec): min=1823, max=14362k, avg=4809.30, stdev=4940.42
clat (usec): min=50, max=22856, avg=950.31, stdev=948.13
lat (usec): min=54, max=22860, avg=955.19, stdev=948.19
clat percentiles (usec):
| 1.00th=[ 157], 5.00th=[ 221], 10.00th=[ 269], 20.00th=[ 343],
| 30.00th=[ 412], 40.00th=[ 490], 50.00th=[ 603], 60.00th=[ 766],
| 70.00th=[ 1029], 80.00th=[ 1418], 90.00th=[ 2089], 95.00th=[ 2737],
| 99.00th=[ 4293], 99.50th=[ 5604], 99.90th=[ 9503], 99.95th=[10683],
| 99.99th=[12649]
bw ( MiB/s): min=23434, max=36909, per=-25.17%, avg=33517.14, stdev=50.36, samples=7616
iops : min=5999220, max=9448818, avg=8580368.69, stdev=12892.93, samples=7616
lat (usec) : 100=0.01%, 250=7.88%, 500=33.09%, 750=18.09%, 1000=10.02%
lat (msec) : 2=19.91%, 4=9.75%, 10=1.16%, 20=0.08%, 50=0.01%
cpu : usr=9.14%, sys=51.94%, ctx=25524808, majf=0, minf=9037
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=514294010,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
READ: bw=51.7GiB/s (55.6GB/s), 51.7GiB/s-51.7GiB/s (55.6GB/s-55.6GB/s), io=3105GiB (3334GB), run=60003-60003msec
Run status group 1 (all jobs):
READ: bw=46.9GiB/s (50.3GB/s), 46.9GiB/s-46.9GiB/s (50.3GB/s-50.3GB/s), io=2812GiB (3019GB), run=60003-60003msec
Run status group 2 (all jobs):
READ: bw=32.7GiB/s (35.1GB/s), 32.7GiB/s-32.7GiB/s (35.1GB/s-35.1GB/s), io=1962GiB (2106GB), run=60006-60006msec
Run status group 3 (all jobs):
READ: bw=32.7GiB/s (35.1GB/s), 32.7GiB/s-32.7GiB/s (35.1GB/s-35.1GB/s), io=1962GiB (2107GB), run=60006-60006msec
Disk stats (read/write):
nvme0n1: ios=90336694/0, merge=0/0, ticks=45102163/0, in_queue=45102163, util=97.44%
nvme1n1: ios=90337153/0, merge=0/0, ticks=47422886/0, in_queue=47422887, util=97.81%
nvme2n1: ios=90337516/0, merge=0/0, ticks=46419782/0, in_queue=46419782, util=97.95%
nvme3n1: ios=90337843/0, merge=0/0, ticks=46256374/0, in_queue=46256374, util=97.95%
nvme4n1: ios=90337742/0, merge=0/0, ticks=59122226/0, in_queue=59122225, util=98.19%
nvme5n1: ios=90338813/0, merge=0/0, ticks=57811758/0, in_queue=57811758, util=98.33%
nvme6n1: ios=90339194/0, merge=0/0, ticks=57369337/0, in_queue=57369337, util=98.37%
nvme7n1: ios=90339048/0, merge=0/0, ticks=55791076/0, in_queue=55791076, util=98.78%
nvme8n1: ios=90340234/0, merge=0/0, ticks=44977001/0, in_queue=44977001, util=99.01%
nvme12n1: ios=81819608/0, merge=0/0, ticks=26788080/0, in_queue=26788079, util=99.24%
nvme13n1: ios=81819831/0, merge=0/0, ticks=26736682/0, in_queue=26736681, util=99.57%
nvme14n1: ios=81820006/0, merge=0/0, ticks=26772951/0, in_queue=26772951, util=99.67%
nvme15n1: ios=81820215/0, merge=0/0, ticks=26741532/0, in_queue=26741532, util=99.78%
nvme17n1: ios=81819922/0, merge=0/0, ticks=76459192/0, in_queue=76459192, util=99.84%
nvme18n1: ios=81820146/0, merge=0/0, ticks=86756309/0, in_queue=86756309, util=99.82%
nvme19n1: ios=81820481/0, merge=0/0, ticks=75008919/0, in_queue=75008919, util=100.00%
nvme20n1: ios=81819690/0, merge=0/0, ticks=91888274/0, in_queue=91888275, util=100.00%
nvme21n1: ios=81821809/0, merge=0/0, ticks=26653056/0, in_queue=26653057, util=100.00%
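A job file that produces output shaped like the above would look roughly as follows. This is a sketch with hypothetical device paths and CPU sets, not the exact job file used:

[global]
ioengine=libaio
direct=1
rw=randread
bs=4k
iodepth=128
numjobs=64
runtime=60
time_based
group_reporting

[socket0]
cpus_allowed=0-63                    # hypothetical: cores on socket 0
cpus_allowed_policy=split
filename=/dev/nvme0n1:/dev/nvme1n1   # colon-separated list of socket-0 drives

[socket0-lv]
new_group
cpus_allowed=0-63
filename=/dev/socket0vg/socket0lv    # hypothetical LV path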
-----Original Message-----
From: Matt Wallis <mattw@madmonks.org>
Sent: Wednesday, July 28, 2021 8:54 PM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: linux-raid@vger.kernel.org
Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
Hi Jim,
Totally get the Frankenstein’s monster aspect; I try not to build those where I can, but at the moment I don’t think there’s much that can be done about it.
Not sure if LVM is better than MDRAID 0; it just gives you more control over the volumes that can be created, instead of having it all in one big chunk. If you just need one big chunk, then MDRAID 0 is probably fine.
I think if you can create a couple of scripts that allow the admin to fail a drive out of all the arrays it’s in at once, then it's not that much worse than managing a normal MDRAID array.
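Something along these lines might do it, as a sketch only (untested), assuming the drive's partitions follow the usual nvmeXnYpZ naming:

#!/bin/bash
# Fail and remove every partition of one physical NVMe drive from
# whichever md arrays currently hold it.
drive=${1:?usage: $0 nvmeXn1}
for part in /dev/"$drive"p[0-9]*; do
    p=${part##*/}
    # /proc/mdstat lists members as e.g. "md0 : active raid5 nvme3n1p1[3] ..."
    md=$(awk -v p="${p}[" 'index($0, p) { print $1 }' /proc/mdstat)
    [ -n "$md" ] || continue
    mdadm "/dev/$md" --fail "$part" --remove "$part"
done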
Matt.
> On 28 Jul 2021, at 20:43, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>
> Matt,
> I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about our long-term sustainability. As a researcher, I can do these cool science experiments, but I still have to hand designs to sustainment folks. I was also running into an issue doing a mdraid RAID0 on top of the RAID6's so I could toss one xfs file system on top of each NUMA node's drives: the final RAID0 stripe over all of the RAID6's couldn't generate the queue depth needed. We even recompiled the kernel to change the mdraid nr_requests max from 128 to 1023.
>
> I will have to try the LVM experiment. I'm an LVM neophyte, so it might take me the rest of today/tomorrow to get new results; I tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time also. One might be able to argue that configuration isn't too much of a "Frankenstein's monster" for me to hand it off.
>
> Thanks,
> Jim
>
>
>
>
> -----Original Message-----
> From: Matt Wallis <mattw@madmonks.org>
> Sent: Wednesday, July 28, 2021 6:32 AM
> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> Cc: linux-raid@vger.kernel.org
> Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
>
> Hi Jim,
>
>> On 28 Jul 2021, at 06:32, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>>
>> Sorry, this will be a long email with everything I find to be relevant, but I can get over 110GB/s of 4kB random reads from individual NVMe SSDs, but I'm at a loss why mdraid can only do a very small fraction of it. I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful. This is everything I do to a server to make the I/O crank.....My role is that of a lab researcher/resident expert/consultant. I'm just stumped why I can't do better. If there is a fine manual that somebody can point me to, I'm happy to read it…
>
> I am probably going to get corrected on some if not all of this, but from what I understand, and from my own little experiments on a similar Intel-based system…
> 1. NVMe is stupid fast; you need a good chunk of CPU performance to max it out.
> 2. Most block IO in the kernel is limited in terms of threading; it may even be essentially single-threaded. (This is where I will get corrected.)
> 3. AFAICT, this includes mdraid: there’s a single thread per RAID device handling all the RAID calculations (mdX_raid6).
>
> What I did to get IOPS up in a system with 24 NVMe drives, split into 12 per NUMA domain (a command-level sketch follows after this message):
> 1. Create 8 partitions on each drive (this may be overkill, I just started here for some reason).
> 2. Create 8 RAID6 arrays with 1 partition per drive.
> 3. Use LVM to create a single striped logical volume over all 8 RAID volumes. RAID 0+6, as it were.
>
> You now have an LVM thread that is basically doing nothing more than chunking the data as it comes in, then sending the chunks to 8 separate RAID devices, each with their own threads, buffers, queues, etc., all of which can be spread over more cores.
>
> I saw a significant (for me, significant is >20%) increase in IOPS doing this.
>
> You still have RAID6 protection, but you might want to write a couple of scripts to help you manage the arrays, because a single failed drive now needs to be removed from 8 RAID volumes.
>
> There’s not a lot of capacity lost doing this; pretty sure I lost less than 100MB to the partitions and the RAID overhead.
>
> You would never consider this on spinning disk, of course: it’s way too slow and you’re just going to make it slower. NVMe, as you noticed, has the IOPS to spare, so I’m pretty sure it’s just that we’re not able to get the data to it fast enough.
>
> Matt
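In concrete terms, the layout described above can be built roughly like this. It is a hedged sketch with hypothetical device names, using 4 partitions per drive instead of 8 for brevity, not the exact commands used here:

# Partition 12 hypothetical drives into 4 equal slices each
for d in /dev/nvme{0..11}n1; do
    parted -s "$d" mklabel gpt
    for i in 1 2 3 4; do
        parted -s "$d" mkpart "raid$i" "$(( (i-1)*25 ))%" "$(( i*25 ))%"
    done
done
# One RAID6 array per partition index, 12 members each
for i in 1 2 3 4; do
    mdadm --create "/dev/md10$i" --level=6 --raid-devices=12 /dev/nvme{0..11}n1p"$i"
done
# Stripe a single logical volume across the four arrays (the "RAID 0+6" layer)
pvcreate /dev/md10{1..4}
vgcreate vg_nvme /dev/md10{1..4}
lvcreate -i 4 -I 512k -l 100%FREE -n lv_nvme vg_nvme

Each array then runs its own mdX_raid6 thread, which is the point of the exercise; top -H during a run should show the parity work spread across cores instead of pegging one.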
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-29 22:05 ` Finlayson, James M CIV (USA)
@ 2021-07-30 8:28 ` Matt Wallis
2021-07-30 8:45 ` Miao Wang
2021-07-30 9:54 ` Finlayson, James M CIV (USA)
0 siblings, 2 replies; 28+ messages in thread
From: Matt Wallis @ 2021-07-30 8:28 UTC (permalink / raw)
To: Finlayson, James M CIV (USA); +Cc: linux-raid
Hi Jim,
That’s significantly better than I expected; I need to see if I can get someone to send me the box I was using so I can spend some more time on it.
Good luck with the rest of it. The bit I was looking at as a next step was potentially tweaking stripe widths and the like to see how much difference they make on different workloads.
Matt.
> On 30 Jul 2021, at 08:05, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>
> Matt,
> Thank you for the tip. I have put 32 partitions on each of my NVMe drives, made 32 raid5 stripes, and then made an LVM volume from the 32 raid5 stripes, one volume per 10 NVMe drives on each socket. I believe I have successfully harnessed "alien technology" (inside joke) to create a Frankenstein's monster. These results are substantially better than doing a RAID0 stripe over the partitioned md's, as in the past. I whipped all of this together in two 15-minute sessions, last night and just now, so I might run some more extensive tests when I have the cycles. I didn't intend to leave the thread hanging.
> BLUF - fio detailed output below.....
> 9 drives per socket raw
> socket0, 9 drives, raw 4K random reads: 13.6M IOPS; socket1, 9 drives, raw 4K random reads: 12.3M IOPS
> %Cpu(s): 4.4 us, 25.6 sy, 0.0 ni, 56.7 id, 0.0 wa, 13.1 hi, 0.2 si, 0.0 st
>
> 9 data drives per socket, RAID5/LVM (9+1)
> socket0, 9 drives, 4K random reads: 8.57M IOPS; socket1, 9 drives, 4K random reads: 8.57M IOPS
> %Cpu(s): 7.0 us, 22.3 sy, 0.0 ni, 58.4 id, 0.0 wa, 12.1 hi, 0.2 si, 0.0 st
>
>
> All,
> I intend to test the 4.15 kernel patch next week. My SA would prefer that the patch had already made it into the kernel-ml stream so he could install an RPM, but I'll get him to build the kernel if need be.
> If the 4.15 kernel patch doesn't alleviate the issues, I have a strong desire to have mdraid made better.
>
>
> Quick fio results: [snip: full fio output and the earlier quoted messages trimmed; identical to the text earlier in the thread]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-30 8:28 ` Matt Wallis
@ 2021-07-30 8:45 ` Miao Wang
2021-07-30 9:59 ` Finlayson, James M CIV (USA)
2021-07-30 13:17 ` Peter Grandi
2021-07-30 9:54 ` Finlayson, James M CIV (USA)
1 sibling, 2 replies; 28+ messages in thread
From: Miao Wang @ 2021-07-30 8:45 UTC (permalink / raw)
To: Finlayson, James M CIV (USA); +Cc: Matt Wallis, linux-raid
Hi Jim,
Nice to hear about your findings on how to make Linux md work better on fast NVMe drives; I was previously stuck on a similar problem and finally gave up. Since it is very difficult to find such an environment with so many fast NVMe drives, I wonder if you have any interest in ZFS. Maybe you can set up a similar raidz configuration on those drives and see whether its performance is better or worse; a rough sketch follows.
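For example, something along these lines (hypothetical pool and device names; a sketch, not a tuned configuration):

zpool create -o ashift=12 tank raidz /dev/nvme{0..9}n1   # 10-wide raidz, roughly a 9+1 RAID5
zfs set recordsize=4k tank           # match the 4K random-read workload
zfs set primarycache=metadata tank   # optional: keep data reads from churning the ARC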
Cheers,
Miao Wang
> On 30 Jul 2021, at 16:28, Matt Wallis <mattw@madmonks.org> wrote:
> [snip: quoted thread trimmed; identical to the messages above]
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-30 8:45 ` Miao Wang
@ 2021-07-30 9:59 ` Finlayson, James M CIV (USA)
2021-07-30 14:03 ` Doug Ledford
2021-07-30 13:17 ` Peter Grandi
1 sibling, 1 reply; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-07-30 9:59 UTC (permalink / raw)
To: 'Miao Wang'
Cc: 'Matt Wallis', 'linux-raid@vger.kernel.org'
There is interest in ZFS. We're waiting for the direct I/O patches to settle in OpenZFS, because we couldn't find any way to get around the ARC (everything has to touch the ARC), and ZFS spins an entire CPU core or more worrying about which ARC entries it has to evict. I know who is doing the work; once it settles, I'll see if they are willing to publish to zfs-discuss.
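A quick way to watch that ARC housekeeping overhead on an OpenZFS-on-Linux box, as a rough sketch (kstat fields and thread names vary by OpenZFS version):

grep -E '^(size|evict_skip|memory_throttle_count)' /proc/spl/kstat/zfs/arcstats
top -b -H -n 1 | grep -E 'arc_(evict|prune|reclaim)'   # ARC eviction/pruning kernel threads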
-----Original Message-----
From: Miao Wang <shankerwangmiao@gmail.com>
Sent: Friday, July 30, 2021 4:46 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: Matt Wallis <mattw@madmonks.org>; linux-raid@vger.kernel.org
Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
Hi Jim,
[snip: quoted message trimmed; identical to Miao's message above]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-30 9:59 ` Finlayson, James M CIV (USA)
@ 2021-07-30 14:03 ` Doug Ledford
0 siblings, 0 replies; 28+ messages in thread
From: Doug Ledford @ 2021-07-30 14:03 UTC (permalink / raw)
To: Finlayson, James M CIV (USA); +Cc: Miao Wang, Matt Wallis, linux-raid
You can try btrfs in lieu of zfs. As long as metadata is raid1, data
can be raid5/6 and things are ok. The raid5 write issue only applies
to metadata.
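A minimal sketch of that layout, with hypothetical device names:

mkfs.btrfs -f -m raid1 -d raid5 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mount /dev/nvme0n1 /mnt/test    # any member device mounts the whole filesystem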
On Fri, Jul 30, 2021 at 6:09 AM Finlayson, James M CIV (USA)
<james.m.finlayson4.civ@mail.mil> wrote:
>
> There is interest in ZFS. We're waiting for the direct I/O patches to settle in Open ZFS because we couldn't find any way to get around the ARC (everything has to touch the ARC). ZFS spins an entire CPU core or more worrying about which ARC entries it has to evict. I know who is doing the work. Once it settles, I'll see if they are willing to publish to zfs-discuss.
>
> -----Original Message-----
> From: Miao Wang <shankerwangmiao@gmail.com>
> Sent: Friday, July 30, 2021 4:46 AM
> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> Cc: Matt Wallis <mattw@madmonks.org>; linux-raid@vger.kernel.org
> Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
>
> Hi Jim,
>
> Nice to hear about your findings on how to let linux md work better on fast nvme drives, because previously I was also stuck in a similar problem and finally gave up. Since it is very difficult to find such environment with so many fast nvme drives, I wonder if you have any interest in ZFS. Maybe you can set up a similar raidz configuration on those drives and see whether its performance is better or worse.
>
> Cheers,
>
> Miao Wang
>
> > On 30 Jul 2021, at 16:28, Matt Wallis <mattw@madmonks.org> wrote:
> >
> > Hi Jim,
> >
> > That’s significantly better than I expected, I need to see if I can get someone to send me the box I was using so I can spend some more time on it.
> >
> > Good luck with the rest of it, the bit I was looking for as a next step was going to be potentially tweaking stripe widths and the like to see how much difference it made on different workloads.
> >
> > Matt.
> >
> >> On 30 Jul 2021, at 08:05, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> >>
> >> Matt,
> >> Thank you for the tip. I have put 32 partitions on each of my NVMe drives, made 32 raid5 stripes and then made an LVM from the 32 raid5 stripes. I then created one physical volume per 10 NVMe drives on each socket. I believe I have successfully harnessed "alien technology" (inside joke) to create a Frankenstein's monster. These are substantially better than doing a RAID0 stripe over the partitioned md's in the past. I whipped all of this together in two 15 minute sessions last night and just right now, so I might run some more extensive tests when I have the cycles. I didn't intend to leave the thread hanging.
> >> BLUF - fio detailed output below.....
> >> 9 drives per socket, raw:
> >> socket0: 9 drives, raw 4K random reads, 13.6M IOPS
> >> socket1: 9 drives, raw 4K random reads, 12.3M IOPS
> >> %Cpu(s): 4.4 us, 25.6 sy, 0.0 ni, 56.7 id, 0.0 wa, 13.1 hi, 0.2 si, 0.0 st
> >>
> >> 9 data drives per socket, RAID5/LVM (9+1):
> >> socket0: 9 drives, 4K random reads, 8.57M IOPS
> >> socket1: 9 drives, 4K random reads, 8.57M IOPS
> >> %Cpu(s): 7.0 us, 22.3 sy, 0.0 ni, 58.4 id, 0.0 wa, 12.1 hi, 0.2 si, 0.0 st
> >>
> >>
> >> All,
> >> I intend to test the 4.15 kernel patch next week. My SA would prefer that the patch had already made it into the kernel-ml stream, so he could just install an RPM, but I'll get him to build the kernel if need be.
> >> If the 4.15 kernel patch doesn't alleviate the issues, I have a strong desire to have mdraid made better.
> >>
> >>
> >> Quick fio results:
> >> socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> >> 4096B-4096B, ioengine=libaio, iodepth=128 ...
> >> socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> >> 4096B-4096B, ioengine=libaio, iodepth=128 ...
> >> socket0-lv: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B,
> >> (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
> >> socket1-lv: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B,
> >> (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
> >> fio-3.26
> >> Starting 256 processes
> >>
> >> socket0: (groupid=0, jobs=64): err= 0: pid=64032: Thu Jul 29 21:48:32
> >> 2021
> >> read: IOPS=13.6M, BW=51.7GiB/s (55.6GB/s)(3105GiB/60003msec)
> >> slat (nsec): min=1292, max=1376.7k, avg=2545.32, stdev=2696.74
> >> clat (usec): min=36, max=71580, avg=600.68, stdev=361.36
> >> lat (usec): min=38, max=71616, avg=603.30, stdev=361.38
> >> clat percentiles (usec):
> >> | 1.00th=[ 169], 5.00th=[ 231], 10.00th=[ 277], 20.00th=[ 347],
> >> | 30.00th=[ 404], 40.00th=[ 457], 50.00th=[ 519], 60.00th=[ 594],
> >> | 70.00th=[ 676], 80.00th=[ 791], 90.00th=[ 996], 95.00th=[ 1205],
> >> | 99.00th=[ 1909], 99.50th=[ 2409], 99.90th=[ 3556], 99.95th=[ 3884],
> >> | 99.99th=[ 5538]
> >> bw ( MiB/s): min=39960, max=56660, per=20.94%, avg=53040.69, stdev=49.61, samples=7488
> >> iops : min=10229946, max=14504941, avg=13578391.80, stdev=12699.61, samples=7488
> >> lat (usec) : 50=0.01%, 100=0.01%, 250=6.83%, 500=40.06%, 750=29.92%
> >> lat (usec) : 1000=13.42%
> >> lat (msec) : 2=8.91%, 4=0.82%, 10=0.04%, 20=0.01%, 50=0.01%
> >> lat (msec) : 100=0.01%
> >> cpu : usr=14.82%, sys=46.57%, ctx=35564249, majf=0, minf=9754
> >> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> >> issued rwts: total=813909201,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> >> latency : target=0, window=0, percentile=100.00%, depth=128
> >> socket1: (groupid=1, jobs=64): err= 0: pid=64096: Thu Jul 29 21:48:32
> >> 2021
> >> read: IOPS=12.3M, BW=46.9GiB/s (50.3GB/s)(2812GiB/60003msec)
> >> slat (nsec): min=1292, max=1672.2k, avg=2672.21, stdev=2742.06
> >> clat (usec): min=25, max=73526, avg=663.35, stdev=611.06
> >> lat (usec): min=28, max=73545, avg=666.09, stdev=611.08
> >> clat percentiles (usec):
> >> | 1.00th=[ 143], 5.00th=[ 190], 10.00th=[ 227], 20.00th=[ 285],
> >> | 30.00th=[ 338], 40.00th=[ 400], 50.00th=[ 478], 60.00th=[ 586],
> >> | 70.00th=[ 725], 80.00th=[ 930], 90.00th=[ 1254], 95.00th=[ 1614],
> >> | 99.00th=[ 3490], 99.50th=[ 4146], 99.90th=[ 6390], 99.95th=[ 6980],
> >> | 99.99th=[ 8356]
> >> bw ( MiB/s): min=28962, max=55326, per=12.70%, avg=48036.10, stdev=96.75, samples=7488
> >> iops : min=7414327, max=14163615, avg=12297214.98, stdev=24768.82, samples=7488
> >> lat (usec) : 50=0.01%, 100=0.03%, 250=13.75%, 500=38.84%, 750=18.71%
> >> lat (usec) : 1000=11.55%
> >> lat (msec) : 2=14.30%, 4=2.23%, 10=0.60%, 20=0.01%, 50=0.01%
> >> lat (msec) : 100=0.01%
> >> cpu : usr=13.41%, sys=44.44%, ctx=39379711, majf=0, minf=9982
> >> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> >> issued rwts: total=737168913,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> >> latency : target=0, window=0, percentile=100.00%, depth=128
> >> socket0-lv: (groupid=2, jobs=64): err= 0: pid=64166: Thu Jul 29
> >> 21:48:32 2021
> >> read: IOPS=8570k, BW=32.7GiB/s (35.1GB/s)(1962GiB/60006msec)
> >> slat (nsec): min=1873, max=11085k, avg=4694.47, stdev=4825.39
> >> clat (usec): min=24, max=21739, avg=950.52, stdev=948.83
> >> lat (usec): min=51, max=21743, avg=955.29, stdev=948.88
> >> clat percentiles (usec):
> >> | 1.00th=[ 155], 5.00th=[ 217], 10.00th=[ 265], 20.00th=[ 338],
> >> | 30.00th=[ 404], 40.00th=[ 486], 50.00th=[ 594], 60.00th=[ 766],
> >> | 70.00th=[ 1029], 80.00th=[ 1418], 90.00th=[ 2089], 95.00th=[ 2737],
> >> | 99.00th=[ 4490], 99.50th=[ 5669], 99.90th=[ 8586], 99.95th=[ 9896],
> >> | 99.99th=[12125]
> >> bw ( MiB/s): min=24657, max=37453, per=-25.17%, avg=33516.00, stdev=35.32, samples=7616
> >> iops : min=6312326, max=9588007, avg=8580076.03, stdev=9041.88, samples=7616
> >> lat (usec) : 50=0.01%, 100=0.01%, 250=8.40%, 500=33.21%, 750=17.54%
> >> lat (usec) : 1000=9.89%
> >> lat (msec) : 2=19.93%, 4=9.58%, 10=1.40%, 20=0.05%, 50=0.01%
> >> cpu : usr=9.01%, sys=51.48%, ctx=27829950, majf=0, minf=9028
> >> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> >> issued rwts: total=514275323,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> >> latency : target=0, window=0, percentile=100.00%, depth=128
> >> socket1-lv: (groupid=3, jobs=64): err= 0: pid=64230: Thu Jul 29
> >> 21:48:32 2021
> >> read: IOPS=8571k, BW=32.7GiB/s (35.1GB/s)(1962GiB/60006msec)
> >> slat (nsec): min=1823, max=14362k, avg=4809.30, stdev=4940.42
> >> clat (usec): min=50, max=22856, avg=950.31, stdev=948.13
> >> lat (usec): min=54, max=22860, avg=955.19, stdev=948.19
> >> clat percentiles (usec):
> >> | 1.00th=[ 157], 5.00th=[ 221], 10.00th=[ 269], 20.00th=[ 343],
> >> | 30.00th=[ 412], 40.00th=[ 490], 50.00th=[ 603], 60.00th=[ 766],
> >> | 70.00th=[ 1029], 80.00th=[ 1418], 90.00th=[ 2089], 95.00th=[ 2737],
> >> | 99.00th=[ 4293], 99.50th=[ 5604], 99.90th=[ 9503], 99.95th=[10683],
> >> | 99.99th=[12649]
> >> bw ( MiB/s): min=23434, max=36909, per=-25.17%, avg=33517.14, stdev=50.36, samples=7616
> >> iops : min=5999220, max=9448818, avg=8580368.69, stdev=12892.93, samples=7616
> >> lat (usec) : 100=0.01%, 250=7.88%, 500=33.09%, 750=18.09%, 1000=10.02%
> >> lat (msec) : 2=19.91%, 4=9.75%, 10=1.16%, 20=0.08%, 50=0.01%
> >> cpu : usr=9.14%, sys=51.94%, ctx=25524808, majf=0, minf=9037
> >> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> >> issued rwts: total=514294010,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> >> latency : target=0, window=0, percentile=100.00%, depth=128
> >>
> >> Run status group 0 (all jobs):
> >> READ: bw=51.7GiB/s (55.6GB/s), 51.7GiB/s-51.7GiB/s
> >> (55.6GB/s-55.6GB/s), io=3105GiB (3334GB), run=60003-60003msec
> >>
> >> Run status group 1 (all jobs):
> >> READ: bw=46.9GiB/s (50.3GB/s), 46.9GiB/s-46.9GiB/s
> >> (50.3GB/s-50.3GB/s), io=2812GiB (3019GB), run=60003-60003msec
> >>
> >> Run status group 2 (all jobs):
> >> READ: bw=32.7GiB/s (35.1GB/s), 32.7GiB/s-32.7GiB/s
> >> (35.1GB/s-35.1GB/s), io=1962GiB (2106GB), run=60006-60006msec
> >>
> >> Run status group 3 (all jobs):
> >> READ: bw=32.7GiB/s (35.1GB/s), 32.7GiB/s-32.7GiB/s
> >> (35.1GB/s-35.1GB/s), io=1962GiB (2107GB), run=60006-60006msec
> >>
> >> Disk stats (read/write):
> >> nvme0n1: ios=90336694/0, merge=0/0, ticks=45102163/0,
> >> in_queue=45102163, util=97.44%
> >> nvme1n1: ios=90337153/0, merge=0/0, ticks=47422886/0,
> >> in_queue=47422887, util=97.81%
> >> nvme2n1: ios=90337516/0, merge=0/0, ticks=46419782/0,
> >> in_queue=46419782, util=97.95%
> >> nvme3n1: ios=90337843/0, merge=0/0, ticks=46256374/0,
> >> in_queue=46256374, util=97.95%
> >> nvme4n1: ios=90337742/0, merge=0/0, ticks=59122226/0,
> >> in_queue=59122225, util=98.19%
> >> nvme5n1: ios=90338813/0, merge=0/0, ticks=57811758/0,
> >> in_queue=57811758, util=98.33%
> >> nvme6n1: ios=90339194/0, merge=0/0, ticks=57369337/0,
> >> in_queue=57369337, util=98.37%
> >> nvme7n1: ios=90339048/0, merge=0/0, ticks=55791076/0,
> >> in_queue=55791076, util=98.78%
> >> nvme8n1: ios=90340234/0, merge=0/0, ticks=44977001/0,
> >> in_queue=44977001, util=99.01%
> >> nvme12n1: ios=81819608/0, merge=0/0, ticks=26788080/0,
> >> in_queue=26788079, util=99.24%
> >> nvme13n1: ios=81819831/0, merge=0/0, ticks=26736682/0,
> >> in_queue=26736681, util=99.57%
> >> nvme14n1: ios=81820006/0, merge=0/0, ticks=26772951/0,
> >> in_queue=26772951, util=99.67%
> >> nvme15n1: ios=81820215/0, merge=0/0, ticks=26741532/0,
> >> in_queue=26741532, util=99.78%
> >> nvme17n1: ios=81819922/0, merge=0/0, ticks=76459192/0,
> >> in_queue=76459192, util=99.84%
> >> nvme18n1: ios=81820146/0, merge=0/0, ticks=86756309/0,
> >> in_queue=86756309, util=99.82%
> >> nvme19n1: ios=81820481/0, merge=0/0, ticks=75008919/0,
> >> in_queue=75008919, util=100.00%
> >> nvme20n1: ios=81819690/0, merge=0/0, ticks=91888274/0,
> >> in_queue=91888275, util=100.00%
> >> nvme21n1: ios=81821809/0, merge=0/0, ticks=26653056/0,
> >> in_queue=26653057, util=100.00%
> >>
> >> -----Original Message-----
> >> From: Matt Wallis <mattw@madmonks.org>
> >> Sent: Wednesday, July 28, 2021 8:54 PM
> >> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> >> Cc: linux-raid@vger.kernel.org
> >> Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
> >>
> >> Hi Jim,
> >>
> >> Totally get the Frankenstein’s monster aspect, I try not to build those where I can, but at the moment I don’t think there’s much that can be done about it.
> >> Not sure if LVM is better than MDRAID 0; it just gives you more control over the volumes that can be created, instead of having it all in one big chunk. If you just need one big chunk, then MDRAID 0 is probably fine.
> >>
> >> I think if you can create a couple of scripts that allow the admin to fail a drive out of all the arrays it’s in at once, then it’s not that much worse than managing an MDRAID normally is (a sketch of such a script follows below).
> >>
> >> Matt.
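A sketch of the kind of helper script Matt mentions, assuming a hypothetical layout where /dev/mdN is built from partition N+1 of every drive:

  #!/bin/bash
  # fail_drive.sh <drive>: fail and remove every partition of one
  # physical NVMe drive from each of the 8 arrays it belongs to
  drive=$1   # e.g. nvme5n1
  for n in $(seq 0 7); do
      part=/dev/${drive}p$((n + 1))
      mdadm /dev/md$n --fail "$part" --remove "$part"
  done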
> >>
> >>> On 28 Jul 2021, at 20:43, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> >>>
> >>> Matt,
> >>> I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about our sustainability long term. As a researcher, I can do these cool science experiments, but I still have to hand designs to sustainment folks. I was also running into an issue doing a mdraid RAID0 on top of the RAID6's so I could toss one xfs file system on top of each NUMA node's drives: the last RAID0 stripe over all of the RAID6's couldn't generate the queue depth needed. We even recompiled the kernel to change the mdraid nr_request max from 128 to 1023.
> >>>
> >>> I will have to try the LVM experiment. I'm an LVM neophyte, so it might take me the rest of today/tomorrow to get new results, as I tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time also. One might be able to argue that configuration isn't too much of a "Frankenstein's monster" for me to hand it off.
> >>>
> >>> Thanks,
> >>> Jim
> >>>
> >>>
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Matt Wallis <mattw@madmonks.org>
> >>> Sent: Wednesday, July 28, 2021 6:32 AM
> >>> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> >>> Cc: linux-raid@vger.kernel.org
> >>> Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
> >>>
> >>> Hi Jim,
> >>>
> >>>> On 28 Jul 2021, at 06:32, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> >>>>
> >>>> Sorry, this will be a long email with everything I find to be relevant, but I can get over 110GB/s of 4kB random reads from individual NVMe SSDs, but I'm at a loss why mdraid can only do a very small fraction of it. I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful. This is everything I do to a server to make the I/O crank.....My role is that of a lab researcher/resident expert/consultant. I'm just stumped why I can't do better. If there is a fine manual that somebody can point me to, I'm happy to read it…
> >>>
> >>> I am probably going to get corrected on some if not all of this, but from what I understand, and from my own little experiments on a similar Intel based system…
> >>> 1. NVMe is stupid fast; you need a good chunk of CPU performance to max it out.
> >>> 2. Most block IO in the kernel is limited in terms of threading; it may even be essentially single threaded. (This is where I will get corrected)
> >>> 3. AFAICT, this includes mdraid: there’s a single thread per RAID device handling all the RAID calculations. (mdX_raid6)
> >>>
> >>> What I did to get IOPs up in a system with 24 NVMe, split up into 12 per NUMA domain:
> >>> 1. Create 8 partitions on each drive (this may be overkill, I just started here for some reason).
> >>> 2. Create 8 RAID6 arrays with 1 partition per drive.
> >>> 3. Use LVM to create a single striped logical volume over all 8 RAID volumes. RAID 0+6 as it were.
> >>>
> >>> You now have an LVM thread that is basically doing nothing more than chunking the data as it comes in, then, sending the chunks to 8 separate RAID devices, each with their own threads, buffers, queues etc, all of which can be spread over more cores.
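As a sketch, the build Matt describes might look like this (device names, drive count, and chunk size are assumptions):

  # one RAID6 array per partition index, 12 drives wide
  for n in $(seq 1 8); do
      mdadm --create /dev/md$((n - 1)) --level=6 --chunk=128 \
            --raid-devices=12 /dev/nvme{0..11}n1p$n
  done
  # then one striped LV over the 8 arrays (RAID 0+6), as in the
  # LVM sketch earlier in the thread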
> >>>
> >>> I saw a significant (for me, significant is >20%) increase in IOPs doing this.
> >>>
> >>> You still have RAID6 protection, but you might want to write a couple of scripts to help you manage the arrays, because a single failed drive now needs to be removed from 8 RAID volumes.
> >>>
> >>> There’s not a lot of capacity lost doing this; I’m pretty sure I lost less than 100MB to the partitions and the RAID overhead.
> >>>
> >>> You would never consider this on spinning disk of course; it’s way too slow and you’re just going to make it slower. NVMe, as you noticed, has the IOPS to spare, so I’m pretty sure it’s just that we’re not able to get the data to it fast enough.
> >>>
> >>> Matt
>
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-30 8:45 ` Miao Wang
2021-07-30 9:59 ` Finlayson, James M CIV (USA)
@ 2021-07-30 13:17 ` Peter Grandi
1 sibling, 0 replies; 28+ messages in thread
From: Peter Grandi @ 2021-07-30 13:17 UTC (permalink / raw)
To: list Linux RAID
>>> On Fri, 30 Jul 2021 16:45:32 +0800, Miao Wang
>>> <shankerwangmiao@gmail.com> said:
> [...] was also stuck in a similar problem and finally gave
> up. Since it is very difficult to find such environment with
> so many fast nvme drives, I wonder if you have any interest in
> ZFS. [...]
Or Btrfs or the new 'bcachefs', which is OK for simple
configurations (RAID10-like).
But part of the issue here with MD RAID is that it is in theory
mostly a translation layer like 'loop', but also sort of like a
virtual block device too, and weird things happen as IO requests
get reshaped and requeued.
My impression, as I mentioned in a previous message, is that the
critical detail is probably this:
>> Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
>> nvme0n1 1317510.00 0.00 5270044.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00 411.95 4.00 0.00 0.00 100.40
>> [...]
>> Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
>> nvme0n1 114589.00 0.00 458356.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.54 4.00 0.00 0.01 100.00
> The obvious difference is the factor of 10 in "aqu-sz" and that
> correspond to the factor of 10 in "r/s" and "rkB/s".
That may happen because the test is run directly on the 'md[01]'
block device, which can do odd things. Counterintuitively, much
bigger 'aqu-sz' and thus much better speed could be achieved by
doing the test using a suitable filesystem on top of the 'md[01]'
device.
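The two cases Peter contrasts could be compared with something like this (paths and sizes are hypothetical):

  # case 1: random reads directly against the md block device
  fio --name=md-raw --filename=/dev/md0 --rw=randread --bs=4k \
      --ioengine=libaio --iodepth=128 --numjobs=64 --direct=1 \
      --group_reporting --time_based --runtime=60

  # case 2: the same workload through a file on an XFS filesystem on md0
  fio --name=md-xfs --directory=/mnt/md0 --size=100G --rw=randread --bs=4k \
      --ioengine=libaio --iodepth=128 --numjobs=64 --direct=1 \
      --group_reporting --time_based --runtime=60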
With ZFS, since striping is integrated within ZFS itself, there is
a good chance that could happen too, especially on highly parallel
workloads.
There is however a huge warning: the test is run on IOPS with
4KiB blocks, and ZFS in COW mode does not work well with that
(especially for writes, but also for reads, if compression and
checksumming are enabled, for RAIDz), so I think that it should be
run with COW disabled, or perhaps on a 'zvol'.
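A sketch of the zvol variant Peter suggests (pool and volume names are hypothetical):

  # 4KiB-block zvol, exposed as /dev/zvol/tank/bench
  zfs create -V 500G -o volblocksize=4k tank/bench
  # keep the ARC from caching data blocks during a pure IOPS test
  zfs set primarycache=metadata tank/bench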
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-30 8:28 ` Matt Wallis
2021-07-30 8:45 ` Miao Wang
@ 2021-07-30 9:54 ` Finlayson, James M CIV (USA)
1 sibling, 0 replies; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-07-30 9:54 UTC (permalink / raw)
To: 'Matt Wallis'; +Cc: 'linux-raid@vger.kernel.org'
I just always used 128K: large enough that most small IOPS operations need only one read from a single drive to complete the I/O, and small enough that a 1MB O_DIRECT write will get us a full stripe write. Plus, I always plug the old "Fusion I/O" crowd. Best white papers ever: "here are our numbers, here's how we got them, here are the instructions for you to get them too".
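Spelled out, that full-stripe arithmetic assumes eight data drives (the drive count is my assumption; it is not stated above):

  full stripe = data drives × chunk = 8 × 128KiB = 1MiB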
-----Original Message-----
From: Matt Wallis <mattw@madmonks.org>
Sent: Friday, July 30, 2021 4:28 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: linux-raid@vger.kernel.org
Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
Hi Jim,
That’s significantly better than I expected; I need to see if I can get someone to send me the box I was using so I can spend some more time on it.
Good luck with the rest of it, the bit I was looking for as a next step was going to be potentially tweaking stripe widths and the like to see how much difference it made on different workloads.
Matt.
> On 30 Jul 2021, at 08:05, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> [...]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-07-27 20:32 Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing????? Finlayson, James M CIV (USA)
` (2 preceding siblings ...)
2021-07-28 10:31 ` Matt Wallis
@ 2021-08-01 11:21 ` Gal Ofri
2021-08-03 14:59 ` [Non-DoD Source] " Finlayson, James M CIV (USA)
[not found] ` <AS8PR04MB799205817C4647DAC740DE9A91EA9@AS8PR04MB7992.eurprd04.prod.outlook.com>
4 siblings, 1 reply; 28+ messages in thread
From: Gal Ofri @ 2021-08-01 11:21 UTC (permalink / raw)
To: Finlayson, James M CIV (USA); +Cc: 'linux-raid@vger.kernel.org'
Hey Jim,
Read iops (rand/seq) were addressed in a recent commit:
97ae27252f49 md/raid5: avoid device_lock in read_one_chunk()
https://github.com/torvalds/linux/commit/97ae27252f4962d0fcc38ee1d9f913d817a2024e
It was merged into 5.14, so you can either cherry-pick it or just use a
latest-master kernel.
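For instance (branch name and base tag are illustrative; the commit hash is the one cited above):

  # apply the read_one_chunk() fix on top of an older kernel tree
  git checkout -b raid5-iops v5.13
  git cherry-pick 97ae27252f49
  # then rebuild and install the kernel as usual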
Sounds like your environment is stronger than the one I used for the
testing, so please do share your benchmark if you manage to surpass the
results described in the commit message.
Cheers,
Gal Ofri,
Volumez (formerly storing.io)
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-08-01 11:21 ` Gal Ofri
@ 2021-08-03 14:59 ` Finlayson, James M CIV (USA)
2021-08-04 9:33 ` Gal Ofri
0 siblings, 1 reply; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-08-03 14:59 UTC (permalink / raw)
To: 'Gal Ofri', 'linux-raid@vger.kernel.org'
Gal,
My SA just gave me the server with the 5.14 RC4 kernel built. I have a two-pass preconditioning run going right now to get us maximum results. I expect to be able to run the tests hopefully by COB Wednesday. Preconditioning will unfortunately take 8 hours (15.36TB drives); I have to make BIOS changes for apples-to-apples "hero runs" and then get the mdraids created. In your opinion, if I bypass the initial formatting with mdadm --assume-clean, will that make a difference in the results? I usually let the format run, but I want to get you results as soon as possible.
Thanks,
Jim
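For reference, a sketch of the flag in question (the array geometry here is illustrative):

  # skip the initial resync; parity is only trustworthy once the
  # array has been fully written, so use with care
  mdadm --create /dev/md0 --level=5 --raid-devices=10 --chunk=128 \
        --assume-clean /dev/nvme{0..9}n1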
-----Original Message-----
From: Gal Ofri <gal.ofri@volumez.com>
Sent: Sunday, August 1, 2021 7:21 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: 'linux-raid@vger.kernel.org' <linux-raid@vger.kernel.org>
Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
Hey Jim,
Read iops (rand/seq) were addressed in a recent commit:
97ae27252f49 md/raid5: avoid device_lock in read_one_chunk() https://github.com/torvalds/linux/commit/97ae27252f4962d0fcc38ee1d9f913d817a2024e
It was merged into 5.14, so you can either cherry-pick it or just use a latest-master kernel.
Sounds like your environment is stronger than the one I used for the testing, so please do share your benchmark if you manage to surpass the results described in the commit message.
Cheers,
Gal Ofri,
Volumez (formerly storing.io)
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
2021-08-03 14:59 ` [Non-DoD Source] " Finlayson, James M CIV (USA)
@ 2021-08-04 9:33 ` Gal Ofri
0 siblings, 0 replies; 28+ messages in thread
From: Gal Ofri @ 2021-08-04 9:33 UTC (permalink / raw)
To: Finlayson, James M CIV (USA); +Cc: 'linux-raid@vger.kernel.org'
On Tue, 3 Aug 2021 14:59:45 +0000
"Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@mail.mil> wrote:
> My SA just gave me the server with the 5.14 RC4 kernel built. I have a two-pass preconditioning run going right now to get us maximum results. I expect to be able to run the tests hopefully by COB Wednesday. Preconditioning will unfortunately take 8 hours (15.36TB drives); I have to make BIOS changes for apples-to-apples "hero runs" and then get the mdraids created. In your opinion, if I bypass the initial formatting with mdadm --assume-clean, will that make a difference in the results? I usually let the format run, but I want to get you results as soon as possible.
You're running a reads workload, so it doesn't make sense to read stuff
without formatting first. IMO, you'd better wait for the formatting to
complete rather than try to skip/reduce it.
Also, note that in my tests I had to set up XFS over the raid in order to
avoid queueing issues (a sketch follows below). Feel free to ping me if
you want the setup script or the vdbench file(s) that I used.
Cheers,
Gal
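A sketch of an XFS aligned to the array's stripe geometry (the su/sw values assume a 128KiB chunk and 9 data drives):

  # align XFS stripe unit/width to the md layout
  mkfs.xfs -d su=128k,sw=9 /dev/md0
  mount /dev/md0 /mnt/md0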
^ permalink raw reply [flat|nested] 28+ messages in thread
[parent not found: <AS8PR04MB799205817C4647DAC740DE9A91EA9@AS8PR04MB7992.eurprd04.prod.outlook.com>]
end of thread, other threads:[~2021-08-18 19:59 UTC | newest]
Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-27 20:32 Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing????? Finlayson, James M CIV (USA)
2021-07-27 21:52 ` Chris Murphy
2021-07-27 22:42 ` Peter Grandi
2021-07-28 10:31 ` Matt Wallis
2021-07-28 10:43 ` [Non-DoD Source] " Finlayson, James M CIV (USA)
2021-07-29 0:54 ` [Non-DoD Source] " Matt Wallis
2021-07-29 16:35 ` Wols Lists
2021-07-29 18:12 ` Finlayson, James M CIV (USA)
2021-07-29 22:05 ` Finlayson, James M CIV (USA)
2021-07-30 8:28 ` Matt Wallis
2021-07-30 8:45 ` Miao Wang
2021-07-30 9:59 ` Finlayson, James M CIV (USA)
2021-07-30 14:03 ` Doug Ledford
2021-07-30 13:17 ` Peter Grandi
2021-07-30 9:54 ` Finlayson, James M CIV (USA)
2021-08-01 11:21 ` Gal Ofri
2021-08-03 14:59 ` [Non-DoD Source] " Finlayson, James M CIV (USA)
2021-08-04 9:33 ` Gal Ofri
[not found] ` <AS8PR04MB799205817C4647DAC740DE9A91EA9@AS8PR04MB7992.eurprd04.prod.outlook.com>
[not found] ` <5EAED86C53DED2479E3E145969315A2385856AD0@UMECHPA7B.easf.csd.disa.mil>
[not found] ` <5EAED86C53DED2479E3E145969315A2385856AF7@UMECHPA7B.easf.csd.disa.mil>
2021-08-05 19:52 ` Finlayson, James M CIV (USA)
2021-08-05 20:50 ` Finlayson, James M CIV (USA)
2021-08-05 21:10 ` Finlayson, James M CIV (USA)
2021-08-08 14:43 ` Gal Ofri
2021-08-09 19:01 ` Finlayson, James M CIV (USA)
2021-08-17 21:21 ` Finlayson, James M CIV (USA)
2021-08-18 0:45 ` [Non-DoD Source] " Matt Wallis
2021-08-18 10:20 ` Finlayson, James M CIV (USA)
2021-08-18 19:48 ` Doug Ledford
2021-08-18 19:59 ` Doug Ledford