* Can't get RAID5/RAID6  NVMe randomread  IOPS - AMD ROME what am I missing?????
@ 2021-07-27 20:32 Finlayson, James M CIV (USA)
  2021-07-27 21:52 ` Chris Murphy
                   ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-07-27 20:32 UTC (permalink / raw)
  To: 'linux-raid@vger.kernel.org'; +Cc: Finlayson, James M CIV (USA)

Sorry, this will be a long email with everything I find relevant. I can get over 110GB/s of 4kB random reads from the individual NVMe SSDs, but I'm at a loss as to why mdraid can only deliver a small fraction of that. I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful. Below is everything I do to a server to make the I/O crank. My role is that of a lab researcher/resident expert/consultant, and I'm simply stumped as to why I can't do better. If there is a fine manual somebody can point me to, I'm happy to read it.

I have tried both RAID5 and RAID6, trying to be highly cognizant of NUMAness. The ROME is set to one NUMA node per socket (NPS1), and the BIOS is set to maximize Infinity Fabric and PCIe performance per AMD's white papers. The NVMe drives are all Gen4 (I believe HPE rebadged the Samsung 1733a?); I can get each drive doing 1.45M 4kB random reads if I try hard.

Everything I can think to share:

[root@<server> <server>]# cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.4 (Ootpa)
[root@<server> <server>]# uname -r
4.18.0-305.el8.x86_64

[root@<server> ~]# modinfo raid6
filename:       /lib/modules/4.18.0-305.el8.x86_64/kernel/drivers/md/raid456.ko.xz
alias:          raid6
alias:          raid5
alias:          md-level-6
alias:          md-raid6
alias:          md-personality-8
alias:          md-level-4
alias:          md-level-5
alias:          md-raid4
alias:          md-raid5
alias:          md-personality-4
description:    RAID4/5/6 (striping with parity) personality for MD
license:        GPL
rhelversion:    8.4
srcversion:     FE86A53E1C1CDAE8F972CBA
depends:        async_raid6_recov,async_pq,libcrc32c,raid6_pq,async_tx,async_memcpy,async_xor
intree:         Y
name:           raid456
vermagic:       4.18.0-305.el8.x86_64 SMP mod_unload modversions 
sig_id:         PKCS#7
signer:         Red Hat Enterprise Linux kernel signing key

[root@<server> ~]# lsblk
NAME         MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme16n1     259:0    0   1.8T  0 disk  
├─nvme16n1p1 259:1    0   512M  0 part  /boot/efi
├─nvme16n1p2 259:2    0   512M  0 part  /boot
├─nvme16n1p3 259:3    0  49.4G  0 part  [SWAP]
└─nvme16n1p4 259:4    0   1.7T  0 part  /
nvme0n1      259:5    0    14T  0 disk  
└─md0          9:0    0 139.7T  0 raid5 
nvme1n1      259:6    0    14T  0 disk  
└─md0          9:0    0 139.7T  0 raid5 
nvme2n1      259:7    0    14T  0 disk  
└─md0          9:0    0 139.7T  0 raid5 
nvme3n1      259:8    0    14T  0 disk  
└─md0          9:0    0 139.7T  0 raid5 
nvme7n1      259:9    0    14T  0 disk  
└─md0          9:0    0 139.7T  0 raid5 
nvme11n1     259:10   0    14T  0 disk  
└─md0          9:0    0 139.7T  0 raid5 
nvme10n1     259:11   0    14T  0 disk  
└─md0          9:0    0 139.7T  0 raid5 
nvme14n1     259:12   0    14T  0 disk  
└─md1          9:1    0 139.7T  0 raid5 
nvme5n1      259:13   0    14T  0 disk  
└─md0          9:0    0 139.7T  0 raid5 
nvme8n1      259:14   0    14T  0 disk  
└─md0          9:0    0 139.7T  0 raid5 
nvme6n1      259:15   0    14T  0 disk  
└─md0          9:0    0 139.7T  0 raid5 
nvme9n1      259:16   0    14T  0 disk  
└─md0          9:0    0 139.7T  0 raid5 
nvme15n1     259:17   0    14T  0 disk  
└─md1          9:1    0 139.7T  0 raid5 
nvme20n1     259:18   0    14T  0 disk  
└─md1          9:1    0 139.7T  0 raid5 
nvme13n1     259:19   0    14T  0 disk  
└─md1          9:1    0 139.7T  0 raid5 
nvme18n1     259:20   0    14T  0 disk  
└─md1          9:1    0 139.7T  0 raid5 
nvme4n1      259:21   0    14T  0 disk  
└─md0          9:0    0 139.7T  0 raid5 
nvme21n1     259:22   0    14T  0 disk  
└─md1          9:1    0 139.7T  0 raid5 
nvme22n1     259:23   0    14T  0 disk  
└─md1          9:1    0 139.7T  0 raid5 
nvme24n1     259:24   0    14T  0 disk  
└─md1          9:1    0 139.7T  0 raid5 
nvme12n1     259:25   0    14T  0 disk  
└─md1          9:1    0 139.7T  0 raid5 
nvme17n1     259:26   0    14T  0 disk  
└─md1          9:1    0 139.7T  0 raid5
nvme19n1     259:27   0    14T  0 disk  
└─md1          9:1    0 139.7T  0 raid5 
nvme23n1     259:28   0    14T  0 disk  
└─md1          9:1    0 139.7T  0 raid5

[root@<server> ~]# lsblk -o KNAME,MODEL,VENDOR
KNAME      MODEL                                    VENDOR
nvme0n1    MZXL515THALA-000H3                       
nvme1n1    MZXL515THALA-000H3                       
nvme2n1    MZXL515THALA-000H3                       
nvme3n1    MZXL515THALA-000H3                       
nvme7n1    MZXL515THALA-000H3                       
nvme11n1   MZXL515THALA-000H3                       
nvme10n1   MZXL515THALA-000H3                       
nvme14n1   MZXL515THALA-000H3                       
nvme5n1    MZXL515THALA-000H3                       
nvme8n1    MZXL515THALA-000H3                       
nvme6n1    MZXL515THALA-000H3                       
nvme9n1    MZXL515THALA-000H3                       
nvme15n1   MZXL515THALA-000H3                       
nvme20n1   MZXL515THALA-000H3                       
nvme13n1   MZXL515THALA-000H3                       
nvme18n1   MZXL515THALA-000H3                       
nvme4n1    MZXL515THALA-000H3                       
nvme21n1   MZXL515THALA-000H3                       
nvme22n1   MZXL515THALA-000H3                       
nvme24n1   MZXL515THALA-000H3                       
nvme12n1   MZXL515THALA-000H3                       
nvme17n1   MZXL515THALA-000H3                       
nvme19n1   MZXL515THALA-000H3                       
nvme23n1   MZXL515THALA-000H3    

[root@<server> jim]# ./map_numa.sh       (nvme16 is the boot drive; nvme0-11 are on NUMA node 0, nvme12-24 on NUMA node 1)
device: nvme8 numanode: 0
device: nvme9 numanode: 0
device: nvme10 numanode: 0
device: nvme11 numanode: 0
device: nvme4 numanode: 0
device: nvme5 numanode: 0
device: nvme6 numanode: 0
device: nvme7 numanode: 0
device: nvme2 numanode: 0
device: nvme3 numanode: 0
device: nvme0 numanode: 0
device: nvme1 numanode: 0
device: nvme21 numanode: 1
device: nvme22 numanode: 1
device: nvme23 numanode: 1
device: nvme24 numanode: 1
device: nvme16 numanode: 1
device: nvme17 numanode: 1
device: nvme18 numanode: 1
device: nvme19 numanode: 1
device: nvme20 numanode: 1
device: nvme14 numanode: 1
device: nvme15 numanode: 1
device: nvme12 numanode: 1
device: nvme13 numanode: 1
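
For reference, map_numa.sh isn't included in the post. A minimal sketch of a script that would produce output in this format (my reconstruction; the exact sysfs path may differ on other systems) could be:

#!/bin/bash
# hypothetical map_numa.sh: print the NUMA node each NVMe controller is attached to
for ns in /sys/block/nvme*n1; do
    ctrl=$(basename "$ns" | sed 's/n[0-9]*$//')            # nvme0n1 -> nvme0
    node=$(cat "/sys/class/nvme/$ctrl/device/numa_node")   # NUMA node of the PCI parent
    echo "device: $ctrl numanode: $node"
done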

[root@<server> jim]# cat /etc/udev/rules.d/99-abj.nr_32.rules 
KERNEL=="nvme*[0-9]n*[0-9]",ATTRS{model}=="MZXL515THALA-000H3",ATTR{queue/io_poll}="1",ATTR{queue/io_poll_delay}="100000",ATTR{queue/nomerges}="2",ATTR{queue/nr_requests}="1023",ATTR{queue/rotational}="0",ATTR{queue/rq_affinity}="2",ATTR{queue/scheduler}="none",ATTR{queue/add_random}="0",ATTR{queue/max_sectors_kb}="4096",PROGRAM="/usr/sbin/nvme set-feature /dev/%k --feature-id 8 --value 522 "    {coalesce up to 10 interrupts per device}



SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="md*", ATTR{md/sync_speed_max}="2000000",ATTR{md/group_thread_cnt}="64", ATTR{md/stripe_cache_size}="8192",ATTR{queue/io_poll}="1",ATTR{queue/io_poll_delay}="100000",ATTR{queue/nomerges}="2",ATTR{queue/nr_requests}="1023",ATTR{queue/rotational}="0",ATTR{queue/rq_affinity}="2",ATTR{queue/scheduler}="none",ATTR{queue/add_random}="0",ATTR{queue/max_sectors_kb}="4096"

(I know the 1023 doesn't take effect on the md devices, but it's there for reference.) We tune for maximum IOPS, not for latency, hence going hard at rq_affinity, nomerges, etc.
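
As a sanity check (not in the original post), one way to confirm the udev settings actually landed might be:

[root@<server> ~]# for f in scheduler nr_requests nomerges rq_affinity add_random max_sectors_kb io_poll; do printf '%s: ' "$f"; cat /sys/block/nvme0n1/queue/$f; done
[root@<server> ~]# nvme get-feature /dev/nvme0 --feature-id 8    # read back the interrupt coalescing feature (0x08)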

[root@<server> <server>]# cat /proc/mdstat   (the 128k chunk is just something Fusion-io recommended way back when and I've never needed to change it)
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 nvme11n1[11](S) nvme10n1[10] nvme9n1[9] nvme8n1[8] nvme7n1[7] nvme6n1[6] nvme5n1[5] nvme4n1[4] nvme3n1[3] nvme2n1[2] nvme1n1[1] nvme0n1[0]
      150007961600 blocks super 1.2 level 5, 128k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
      bitmap: 0/112 pages [0KB], 65536KB chunk

md1 : active raid5 nvme24n1[11](S) nvme23n1[10] nvme22n1[9] nvme21n1[8] nvme20n1[7] nvme19n1[6] nvme18n1[5] nvme17n1[4] nvme15n1[3] nvme14n1[2] nvme13n1[1] nvme12n1[0]
      150007961600 blocks super 1.2 level 5, 128k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
      bitmap: 0/112 pages [0KB], 65536KB chunk

unused devices: <none>
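
(For completeness: the arrays above look as though they could have been created with something along these lines. This is a sketch inferred from /proc/mdstat, not the poster's actual commands.)

mdadm --create /dev/md0 --level=5 --chunk=128 --raid-devices=11 --spare-devices=1 /dev/nvme{0..11}n1
mdadm --create /dev/md1 --level=5 --chunk=128 --raid-devices=11 --spare-devices=1 /dev/nvme{12..15}n1 /dev/nvme{17..24}n1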

[root@<server> /]# grep raid /var/log/messages   (What troubles me: if mdraid checked parity on read I could somewhat understand the overhead, but I would think reads are nearly a pass-through.)
Jul 27 00:00:02 <server> rpmlist_verification[12745]: libblockdev-mdraid 2.24 Thu 22 Jul 2021 02:58:37 PM GMT
Jul 27 18:00:28 <server> kernel: raid6: sse2x1   gen()  9792 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x1   xor()  6436 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x2   gen() 11198 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x2   xor()  9546 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x4   gen() 14271 MB/s
Jul 27 18:00:29 <server> kernel: raid6: sse2x4   xor()  6354 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x1   gen() 22838 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x1   xor() 14069 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x2   gen() 26973 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x2   xor() 18380 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x4   gen() 26601 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x4   xor()  7025 MB/s
Jul 27 18:00:29 <server> kernel: raid6: using algorithm avx2x2 gen() 26973 MB/s
Jul 27 18:00:29 <server> kernel: raid6: .... xor() 18380 MB/s, rmw enabled
Jul 27 18:00:29 <server> kernel: raid6: using avx2x2 recovery algorithm

[root@<server> <server>]# cat fiojim.hpdl385.nps1
[global]
name=random
iodepth=128
ioengine=libaio
direct=1
norandommap
group_reporting
randrepeat=1
random_generator=tausworthe64
bs=4k
rw=randread
numjobs=64
runtime=60

[socket0]
new_group
numa_mem_policy=bind:0
numa_cpu_nodes=0
filename=/dev/nvme0n1
filename=/dev/nvme1n1
filename=/dev/nvme2n1
filename=/dev/nvme3n1
filename=/dev/nvme4n1
filename=/dev/nvme5n1
filename=/dev/nvme6n1
filename=/dev/nvme7n1
filename=/dev/nvme8n1
filename=/dev/nvme9n1
filename=/dev/nvme10n1
filename=/dev/nvme11n1
[socket1]
new_group
numa_mem_policy=bind:1
numa_cpu_nodes=1
filename=/dev/nvme12n1
filename=/dev/nvme13n1
filename=/dev/nvme14n1
filename=/dev/nvme15n1
filename=/dev/nvme17n1
filename=/dev/nvme18n1
filename=/dev/nvme19n1
filename=/dev/nvme20n1
filename=/dev/nvme21n1
filename=/dev/nvme22n1
filename=/dev/nvme23n1
filename=/dev/nvme24n1
[socket0-md]
stonewall
numa_mem_policy=bind:0
numa_cpu_nodes=0
filename=/dev/md0
[socket1-md]
new_group
numa_mem_policy=bind:1
numa_cpu_nodes=1
filename=/dev/md1
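
(The job file above would presumably be launched as-is; fio applies the per-group NUMA binding itself via numa_cpu_nodes/numa_mem_policy, assuming fio was built with libnuma support, so no external numactl wrapper should be needed.)

[root@<server> jim]# fio fiojim.hpdl385.nps1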

iostat -xkz 1 with the drives raw:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.32    0.00   38.30    0.00    0.00   53.39

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1       1317510.00    0.00 5270044.00      0.00     0.00     0.00   0.00   0.00    0.31    0.00 411.95     4.00     0.00   0.00 100.40
nvme1n1       1317548.00    0.00 5270192.00      0.00     0.00     0.00   0.00   0.00    0.32    0.00 417.38     4.00     0.00   0.00 100.00
nvme2n1       1317578.00    0.00 5270316.00      0.00     0.00     0.00   0.00   0.00    0.31    0.00 414.77     4.00     0.00   0.00 100.20
nvme3n1       1317554.00    0.00 5270216.00      0.00     0.00     0.00   0.00   0.00    0.31    0.00 413.25     4.00     0.00   0.00 100.40
nvme7n1       1317559.00    0.00 5270236.00      0.00     0.00     0.00   0.00   0.00    0.33    0.00 430.03     4.00     0.00   0.00 100.40
nvme11n1      1317502.00    0.00 5269996.00      0.00     0.00     0.00   0.00   0.00    0.73    0.00 964.85     4.00     0.00   0.00 100.40
nvme10n1      1317656.00    0.00 5270624.00      0.00     0.00     0.00   0.00   0.00    0.80    0.00 1050.05     4.00     0.00   0.00 100.40
nvme14n1      1107632.00    0.00 4430528.00      0.00     0.00     0.00   0.00   0.00    0.28    0.00 307.52     4.00     0.00   0.00 100.40
nvme5n1       1317583.00    0.00 5270332.00      0.00     0.00     0.00   0.00   0.00    0.33    0.00 430.47     4.00     0.00   0.00 100.00
nvme8n1       1317617.00    0.00 5270468.00      0.00     0.00     0.00   0.00   0.00    0.74    0.00 972.52     4.00     0.00   0.00 101.00
nvme6n1       1317535.00    0.00 5270144.00      0.00     0.00     0.00   0.00   0.00    0.33    0.00 432.48     4.00     0.00   0.00 100.60
nvme9n1       1317582.00    0.00 5270328.00      0.00     0.00     0.00   0.00   0.00    0.75    0.00 992.82     4.00     0.00   0.00 100.40
nvme15n1      1107703.00    0.00 4430816.00      0.00     0.00     0.00   0.00   0.00    0.28    0.00 305.93     4.00     0.00   0.00 100.60
nvme20n1      1107712.00    0.00 4430848.00      0.00     0.00     0.00   0.00   0.00    0.28    0.00 306.72     4.00     0.00   0.00 100.20
nvme13n1      1107714.00    0.00 4430852.00      0.00     0.00     0.00   0.00   0.00    0.28    0.00 307.10     4.00     0.00   0.00 101.40
nvme18n1      1107674.00    0.00 4430696.00      0.00     0.00     0.00   0.00   0.00    0.28    0.00 306.04     4.00     0.00   0.00 100.20
nvme4n1       1317521.00    0.00 5270076.00      0.00     0.00     0.00   0.00   0.00    0.33    0.00 431.63     4.00     0.00   0.00 100.20
nvme21n1      1107714.00    0.00 4430856.00      0.00     0.00     0.00   0.00   0.00    0.28    0.00 309.11     4.00     0.00   0.00 100.40
nvme22n1      1107711.00    0.00 4430840.00      0.00     0.00     0.00   0.00   0.00    0.28    0.00 308.52     4.00     0.00   0.00 100.60
nvme24n1      1107441.00    0.00 4429768.00      0.00     0.00     0.00   0.00   0.00    3.86    0.00 4271.29     4.00     0.00   0.00 100.20
nvme12n1      1107733.00    0.00 4430932.00      0.00     0.00     0.00   0.00   0.00    0.28    0.00 307.70     4.00     0.00   0.00 100.40
nvme17n1      1107858.00    0.00 4431436.00      0.00     0.00     0.00   0.00   0.00    0.28    0.00 307.95     4.00     0.00   0.00 100.60
nvme19n1      1107766.00    0.00 4431064.00      0.00     0.00     0.00   0.00   0.00    0.28    0.00 307.17     4.00     0.00   0.00 100.40
nvme23n1      1108033.00    0.00 4432132.00      0.00     0.00     0.00   0.00   0.00    0.31    0.00 340.62     4.00     0.00   0.00 100.00



iostat -xkz 1 with the md's
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.56    0.00   49.94    0.00    0.00   49.51

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1       114589.00    0.00 458356.00      0.00     0.00     0.00   0.00   0.00    0.29    0.00  33.54     4.00     0.00   0.01 100.00
nvme1n1       115284.00    0.00 461136.00      0.00     0.00     0.00   0.00   0.00    0.29    0.00  33.77     4.00     0.00   0.01 100.00
nvme2n1       114911.00    0.00 459644.00      0.00     0.00     0.00   0.00   0.00    0.29    0.00  33.61     4.00     0.00   0.01 100.00
nvme3n1       114538.00    0.00 458152.00      0.00     0.00     0.00   0.00   0.00    0.29    0.00  33.55     4.00     0.00   0.01 100.00
nvme7n1       114524.00    0.00 458096.00      0.00     0.00     0.00   0.00   0.00    0.29    0.00  33.53     4.00     0.00   0.01 100.00
nvme10n1      114934.00    0.00 459736.00      0.00     0.00     0.00   0.00   0.00    0.29    0.00  33.61     4.00     0.00   0.01 100.00
nvme14n1      97399.00    0.00 389596.00      0.00     0.00     0.00   0.00   0.00    0.30    0.00  29.41     4.00     0.00   0.01 100.00
nvme5n1       114929.00    0.00 459716.00      0.00     0.00     0.00   0.00   0.00    0.29    0.00  33.61     4.00     0.00   0.01 100.00
nvme8n1       114393.00    0.00 457572.00      0.00     0.00     0.00   0.00   0.00    0.29    0.00  33.40     4.00     0.00   0.01  99.90
nvme6n1       114731.00    0.00 458924.00      0.00     0.00     0.00   0.00   0.00    0.29    0.00  33.56     4.00     0.00   0.01  99.90
nvme9n1       114146.00    0.00 456584.00      0.00     0.00     0.00   0.00   0.00    0.29    0.00  33.37     4.00     0.00   0.01  99.90
nvme15n1      96960.00    0.00 387840.00      0.00     0.00     0.00   0.00   0.00    0.30    0.00  29.30     4.00     0.00   0.01 100.00
nvme20n1      97171.00    0.00 388684.00      0.00     0.00     0.00   0.00   0.00    0.30    0.00  29.36     4.00     0.00   0.01 100.00
nvme13n1      96874.00    0.00 387496.00      0.00     0.00     0.00   0.00   0.00    0.30    0.00  29.31     4.00     0.00   0.01 100.00
nvme18n1      96696.00    0.00 386784.00      0.00     0.00     0.00   0.00   0.00    0.30    0.00  29.16     4.00     0.00   0.01 100.00
nvme4n1       115220.00    0.00 460876.00      0.00     0.00     0.00   0.00   0.00    0.29    0.00  33.75     4.00     0.00   0.01 100.00
nvme21n1      96756.00    0.00 387024.00      0.00     0.00     0.00   0.00   0.00    0.30    0.00  29.24     4.00     0.00   0.01 100.00
nvme22n1      97352.00    0.00 389408.00      0.00     0.00     0.00   0.00   0.00    0.30    0.00  29.36     4.00     0.00   0.01 100.00
nvme12n1      96899.00    0.00 387596.00      0.00     0.00     0.00   0.00   0.00    0.30    0.00  29.22     4.00     0.00   0.01 100.20
nvme17n1      96748.00    0.00 386992.00      0.00     0.00     0.00   0.00   0.00    0.30    0.00  29.24     4.00     0.00   0.01 100.00
nvme19n1      97191.00    0.00 388764.00      0.00     0.00     0.00   0.00   0.00    0.30    0.00  29.30     4.00     0.00   0.01 100.00
nvme23n1      96787.00    0.00 387148.00      0.00     0.00     0.00   0.00   0.00    0.29    0.00  28.41     4.00     0.00   0.01  99.90
md1           1066812.00    0.00 4267248.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     4.00     0.00   0.00   0.00
md0           1262173.00    0.00 5048692.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     4.00     0.00   0.00   0.00

fio output:
socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket0-md: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1-md: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.26
Starting 256 processes
Jobs: 128 (f=128): [_(128),r(128)][1.6%][r=9103MiB/s][r=2330k IOPS][eta 02h:08m:00s]        
socket0: (groupid=0, jobs=64): err= 0: pid=18344: Tue Jul 27 20:00:10 2021
  read: IOPS=16.0M, BW=60.8GiB/s (65.3GB/s)(3651GiB/60003msec)
    slat (nsec): min=1222, max=18033k, avg=2429.23, stdev=2975.48
    clat (usec): min=24, max=20221, avg=510.51, stdev=336.57
     lat (usec): min=30, max=20240, avg=513.01, stdev=336.58
    clat percentiles (usec):
     |  1.00th=[  147],  5.00th=[  194], 10.00th=[  229], 20.00th=[  281],
     | 30.00th=[  326], 40.00th=[  367], 50.00th=[  412], 60.00th=[  469],
     | 70.00th=[  553], 80.00th=[  676], 90.00th=[  914], 95.00th=[ 1156],
     | 99.00th=[ 1778], 99.50th=[ 2073], 99.90th=[ 2868], 99.95th=[ 3294],
     | 99.99th=[ 4424]
   bw (  MiB/s): min=52367, max=65429, per=32.81%, avg=62388.68, stdev=33.73, samples=7424
   iops        : min=13406054, max=16749890, avg=15971477.42, stdev=8635.86, samples=7424
  lat (usec)   : 50=0.01%, 100=0.02%, 250=13.89%, 500=50.33%, 750=19.72%
  lat (usec)   : 1000=8.24%
  lat (msec)   : 2=7.22%, 4=0.57%, 10=0.02%, 20=0.01%, 50=0.01%
  cpu          : usr=17.93%, sys=49.30%, ctx=21719222, majf=0, minf=9915
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=957111950,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1: (groupid=1, jobs=64): err= 0: pid=18408: Tue Jul 27 20:00:10 2021
  read: IOPS=13.5M, BW=51.4GiB/s (55.2GB/s)(3085GiB/60008msec)
    slat (nsec): min=1232, max=1696.9k, avg=2580.28, stdev=2841.95
    clat (usec): min=21, max=26808, avg=604.58, stdev=1211.79
     lat (usec): min=26, max=26810, avg=607.23, stdev=1211.80
    clat percentiles (usec):
     |  1.00th=[  124],  5.00th=[  157], 10.00th=[  184], 20.00th=[  225],
     | 30.00th=[  258], 40.00th=[  289], 50.00th=[  318], 60.00th=[  351],
     | 70.00th=[  388], 80.00th=[  437], 90.00th=[  586], 95.00th=[ 2769],
     | 99.00th=[ 6587], 99.50th=[ 9372], 99.90th=[12649], 99.95th=[13829],
     | 99.99th=[16712]
   bw (  MiB/s): min=32950, max=67704, per=20.46%, avg=52713.11, stdev=106.96, samples=7424
   iops        : min=8435402, max=17332350, avg=13494532.64, stdev=27383.02, samples=7424
  lat (usec)   : 50=0.01%, 100=0.16%, 250=27.38%, 500=59.09%, 750=4.93%
  lat (usec)   : 1000=0.30%
  lat (msec)   : 2=0.60%, 4=5.67%, 10=1.47%, 20=0.39%, 50=0.01%
  cpu          : usr=14.86%, sys=45.29%, ctx=36050249, majf=0, minf=10046
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=808781317,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket0-md: (groupid=2, jobs=64): err= 0: pid=18479: Tue Jul 27 20:00:10 2021
  read: IOPS=1263k, BW=4934MiB/s (5174MB/s)(289GiB/60001msec)
    slat (nsec): min=1512, max=48037k, avg=49957.85, stdev=33615.19
    clat (usec): min=176, max=51614, avg=6432.56, stdev=410.54
     lat (usec): min=178, max=51639, avg=6482.58, stdev=412.23
    clat percentiles (usec):
     |  1.00th=[ 6128],  5.00th=[ 6259], 10.00th=[ 6325], 20.00th=[ 6325],
     | 30.00th=[ 6390], 40.00th=[ 6390], 50.00th=[ 6456], 60.00th=[ 6456],
     | 70.00th=[ 6521], 80.00th=[ 6521], 90.00th=[ 6587], 95.00th=[ 6587],
     | 99.00th=[ 6652], 99.50th=[ 6718], 99.90th=[ 7635], 99.95th=[16909],
     | 99.99th=[18220]
   bw (  MiB/s): min= 4582, max= 5934, per=100.00%, avg=4938.25, stdev= 2.07, samples=7616
   iops        : min=1173219, max=1519297, avg=1264175.97, stdev=528.77, samples=7616
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.34%, 10=99.57%, 20=0.08%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=1.23%, sys=95.69%, ctx=2557, majf=0, minf=9064
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=75789817,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1-md: (groupid=3, jobs=64): err= 0: pid=18543: Tue Jul 27 20:00:10 2021
  read: IOPS=1071k, BW=4183MiB/s (4386MB/s)(245GiB/60002msec)
    slat (nsec): min=1563, max=14080k, avg=59051.10, stdev=22401.39
    clat (usec): min=179, max=20799, avg=7588.23, stdev=303.92
     lat (usec): min=211, max=20853, avg=7647.34, stdev=305.26
    clat percentiles (usec):
     |  1.00th=[ 7111],  5.00th=[ 7373], 10.00th=[ 7439], 20.00th=[ 7504],
     | 30.00th=[ 7504], 40.00th=[ 7570], 50.00th=[ 7570], 60.00th=[ 7635],
     | 70.00th=[ 7635], 80.00th=[ 7701], 90.00th=[ 7767], 95.00th=[ 7767],
     | 99.00th=[ 7898], 99.50th=[ 7898], 99.90th=[ 8586], 99.95th=[13304],
     | 99.99th=[19006]
   bw (  MiB/s): min= 3955, max= 4642, per=100.00%, avg=4186.20, stdev= 0.98, samples=7616
   iops        : min=1012714, max=1188416, avg=1071653.68, stdev=251.68, samples=7616
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=99.94%, 20=0.05%, 50=0.01%
  cpu          : usr=1.06%, sys=95.70%, ctx=1980, majf=0, minf=9030
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=64246431,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=60.8GiB/s (65.3GB/s), 60.8GiB/s-60.8GiB/s (65.3GB/s-65.3GB/s), io=3651GiB (3920GB), run=60003-60003msec

Run status group 1 (all jobs):
   READ: bw=51.4GiB/s (55.2GB/s), 51.4GiB/s-51.4GiB/s (55.2GB/s-55.2GB/s), io=3085GiB (3313GB), run=60008-60008msec

Run status group 2 (all jobs):
   READ: bw=4934MiB/s (5174MB/s), 4934MiB/s-4934MiB/s (5174MB/s-5174MB/s), io=289GiB (310GB), run=60001-60001msec

Run status group 3 (all jobs):
   READ: bw=4183MiB/s (4386MB/s), 4183MiB/s-4183MiB/s (4386MB/s-4386MB/s), io=245GiB (263GB), run=60002-60002msec

Disk stats (read/write):
  nvme0n1: ios=79463384/0, merge=0/0, ticks=25148472/0, in_queue=25148472, util=98.78%
  nvme1n1: ios=79463574/0, merge=0/0, ticks=25224784/0, in_queue=25224784, util=98.87%
  nvme2n1: ios=79463699/0, merge=0/0, ticks=25305193/0, in_queue=25305193, util=98.96%
  nvme3n1: ios=79463925/0, merge=0/0, ticks=25234093/0, in_queue=25234093, util=99.00%
  nvme4n1: ios=79464135/0, merge=0/0, ticks=25396547/0, in_queue=25396547, util=99.06%
  nvme5n1: ios=79464346/0, merge=0/0, ticks=25393624/0, in_queue=25393624, util=99.10%
  nvme6n1: ios=79464535/0, merge=0/0, ticks=25330700/0, in_queue=25330700, util=99.19%
  nvme7n1: ios=79464721/0, merge=0/0, ticks=25349171/0, in_queue=25349171, util=99.24%
  nvme8n1: ios=79464029/0, merge=0/0, ticks=59063115/0, in_queue=59063115, util=99.32%
  nvme9n1: ios=79464120/0, merge=0/0, ticks=59023913/0, in_queue=59023913, util=99.33%
  nvme10n1: ios=79464799/0, merge=0/0, ticks=59136926/0, in_queue=59136927, util=99.39%
  nvme11n1: ios=79465392/0, merge=0/0, ticks=59091104/0, in_queue=59091104, util=99.51%
  nvme12n1: ios=67137057/0, merge=0/0, ticks=18685135/0, in_queue=18685136, util=99.60%
  nvme13n1: ios=67137217/0, merge=0/0, ticks=18638940/0, in_queue=18638940, util=99.76%
  nvme14n1: ios=67137341/0, merge=0/0, ticks=18663275/0, in_queue=18663275, util=99.70%
  nvme15n1: ios=67137620/0, merge=0/0, ticks=18629947/0, in_queue=18629948, util=99.77%
  nvme17n1: ios=67137778/0, merge=0/0, ticks=18709586/0, in_queue=18709585, util=99.80%
  nvme18n1: ios=67137952/0, merge=0/0, ticks=18591798/0, in_queue=18591797, util=99.72%
  nvme19n1: ios=67138199/0, merge=0/0, ticks=18669545/0, in_queue=18669545, util=99.86%
  nvme20n1: ios=67138378/0, merge=0/0, ticks=18600128/0, in_queue=18600128, util=99.89%
  nvme21n1: ios=67138562/0, merge=0/0, ticks=18720763/0, in_queue=18720763, util=100.00%
  nvme22n1: ios=67138772/0, merge=0/0, ticks=18659716/0, in_queue=18659716, util=100.00%
  nvme23n1: ios=67138982/0, merge=0/0, ticks=27862395/0, in_queue=27862395, util=100.00%
  nvme24n1: ios=67134934/0, merge=0/0, ticks=241977879/0, in_queue=241977879, util=100.00%
  md0: ios=75701982/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  md1: ios=64175011/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%


I'm used to tuning interrupts, so here are the interrupts during the hero (raw-drive) portion of the fio run and during the mdraid portion. Without polling they are just well-balanced IRQs across the different NVMe MQs.
[root@<server> jim]# ./top-irq.pl -k 1
reporting top 10 every 6 secs subject to thresh=10 kernel=1

CAL             532284   CPU146   Function call interrupts
CAL             529615   CPU154   Function call interrupts
CAL             526198   CPU162   Function call interrupts
CAL             524012   CPU142   Function call interrupts
CAL             521467   CPU174   Function call interrupts
CAL             520821   CPU178   Function call interrupts
CAL             518798   CPU176   Function call interrupts
CAL             518244   CPU166   Function call interrupts
CAL             517524   CPU180   Function call interrupts
CAL             514563   CPU136   Function call interrupts

  reported top 10 (of 1885)
  reported interrupts = 5223526  870587.7 per sec    6.8% of all interrupts
^C
[root@<server> jim]# !!
./top-irq.pl -k 1
reporting top 10 every 6 secs subject to thresh=10 kernel=1

CAL              63759   CPU15    Function call interrupts
CAL              63664   CPU178   Function call interrupts
CAL              63428   CPU142   Function call interrupts
CAL              63382   CPU51    Function call interrupts
CAL              63285   CPU140   Function call interrupts
CAL              63068   CPU150   Function call interrupts
CAL              63017   CPU148   Function call interrupts
CAL              62984   CPU144   Function call interrupts
CAL              62842   CPU25    Function call interrupts
CAL              62835   CPU37    Function call interrupts

  reported top 10 (of 1885)
  reported interrupts = 632264  105377.3 per sec    4.0% of all interrupts
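
(top-irq.pl isn't included in the post. Roughly the same information can be pulled straight from /proc/interrupts; the one-liner below is my sketch, not the poster's script. Run it twice a few seconds apart and diff the totals.)

awk '/CAL:/ { s = 0; for (i = 2; i <= NF; i++) s += $i + 0; print "total CAL:", s }' /proc/interrupts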


Lastly, I can't make md0 and md1 each reach ~2M IOPS at the same time. Sometimes the NUMA0 md is faster, sometimes the NUMA1 md is; I think there might be some sort of bottleneck/race somewhere. It stays that way until I stop and reassemble the arrays, and then it may switch. I haven't troubleshot it enough to notice the pattern.

Why there is a socket0/socket1 difference in the hero numbers (16.0M vs 13.5M IOPS) is something I'll have to take up with HPE; maybe there is a card slowing down the drives on socket1.


Any help is greatly appreciated. Criticism will be accepted, and worst case, IF I HAVEN'T MISSED SOMETHING UTTERLY SILLY, this becomes a de facto "where to start" for base users like me before the kernel-level experts get involved.

As an FYI, I have booted a 5.13 kernel and started using io_uring; there was no noticeable difference in md performance on a different server with Gen3 drives. I can raise my "hero numbers" when I have time to play, but right now my job is to get protected IOPS.


Jim Finlayson
U.S. Department of Defense



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-07-27 20:32 Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing????? Finlayson, James M CIV (USA)
@ 2021-07-27 21:52 ` Chris Murphy
  2021-07-27 22:42 ` Peter Grandi
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Chris Murphy @ 2021-07-27 21:52 UTC (permalink / raw)
  To: linux-raid; +Cc: Finlayson, James M CIV (USA)

On Tue, Jul 27, 2021 at 2:40 PM Finlayson, James M CIV (USA)
<james.m.finlayson4.civ@mail.mil> wrote:
>
> [root@<server> <server>]# cat /etc/redhat-release
> Red Hat Enterprise Linux release 8.4 (Ootpa)
> [root@<server> <server>]# uname -r
> 4.18.0-305.el8.x86_64

I think you'll get a better response by opening a support ticket with
your distro. That's a distro kernel, and upstream pretty much let that
kernel version set sail a long time ago; they are mainly concerned with
linux-next, mainline, and stable kernels. You could retest with
kernel-ml from elrepo.org; 5.13.5 has been up there for a couple of
days.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Can't get RAID5/RAID6  NVMe randomread  IOPS - AMD ROME what am I missing?????
  2021-07-27 20:32 Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing????? Finlayson, James M CIV (USA)
  2021-07-27 21:52 ` Chris Murphy
@ 2021-07-27 22:42 ` Peter Grandi
  2021-07-28 10:31 ` Matt Wallis
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Peter Grandi @ 2021-07-27 22:42 UTC (permalink / raw)
  To: list Linux RAID

[...]
> Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
> nvme0n1       1317510.00    0.00 5270044.00      0.00     0.00     0.00   0.00   0.00    0.31    0.00 411.95     4.00     0.00   0.00 100.40
[...]
> Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
> nvme0n1       114589.00    0.00 458356.00      0.00     0.00     0.00   0.00   0.00    0.29    0.00  33.54     4.00     0.00   0.01 100.00

The obvious difference is the factor of 10 in "aqu-sz", which
corresponds to the factor of 10 in "r/s" and "rkB/s".

I have noticed that MD RAID does some weird things to the queueing;
it is not a "normal" block device, and this often creates oddities
(the same happens with DM/LVM2).

Try creating a filesystem on top of 'md0' and 'md1' and testing
that; things may be quite different.
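
For example, something along these lines (paths and sizes are placeholders, not Peter's commands):

mkfs.xfs /dev/md0 && mkdir -p /mnt/md0 && mount /dev/md0 /mnt/md0
fio --name=fs-randread --directory=/mnt/md0 --size=10G --rw=randread --bs=4k --ioengine=libaio --iodepth=128 --numjobs=64 --direct=1 --runtime=60 --group_reporting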

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Can't get RAID5/RAID6  NVMe randomread  IOPS - AMD ROME what am I missing?????
  2021-07-27 20:32 Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing????? Finlayson, James M CIV (USA)
  2021-07-27 21:52 ` Chris Murphy
  2021-07-27 22:42 ` Peter Grandi
@ 2021-07-28 10:31 ` Matt Wallis
  2021-07-28 10:43   ` [Non-DoD Source] " Finlayson, James M CIV (USA)
  2021-08-01 11:21 ` Gal Ofri
       [not found] ` <AS8PR04MB799205817C4647DAC740DE9A91EA9@AS8PR04MB7992.eurprd04.prod.outlook.com>
  4 siblings, 1 reply; 28+ messages in thread
From: Matt Wallis @ 2021-07-28 10:31 UTC (permalink / raw)
  To: Finlayson, James M CIV (USA); +Cc: linux-raid

Hi Jim,

> On 28 Jul 2021, at 06:32, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> 
> Sorry, this will be a long email with everything I find to be relevant, but I can get over 110GB/s of 4kB random reads from individual NVMe SSDs, but I'm at a loss why mdraid can only do a very small  fraction of it.   I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful.     This is everything I do to a server to make the I/O crank.....My role is that of a lab researcher/resident expert/consultant.   I'm just stumped why I can't do better.   If there is a fine manual that somebody can point me to, I'm happy to read it…

I am probably going to get corrected on some if not all of this, but from what I understand, and from my own little experiments on a similar Intel based system…
1. NVMe is stupid fast, you need a good chunk of CPU performance to max it out.
2. Most block IO in the kernel is limited in terms of threading, it may even be essentially single threaded. (This is where I will get corrected)
3. AFAICT, this includes mdraid, there’s a single thread per RAID device handling all the RAID calculations. (mdX_raid6)

What I did to get IOPs up in a system with 24 NVMe, split up into 12 per NUMA domain.
1. Create 8 partitions on each drive (this may be overkill, I just started here for some reason)
2. Create 8 RAID6 arrays with 1 partition per drive.
3. Use LVM to create a single striped logical volume over all 8 RAID volumes. RAID 0+6 as it were.

You now have an LVM thread that is basically doing nothing more than chunking the data as it comes in, then, sending the chunks to 8 separate RAID devices, each with their own threads, buffers, queues etc, all of which can be spread over more cores.
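
A rough sketch of that layout (device names, partition sizing and chunk sizes here are my assumptions, not Matt's actual commands):

# slice each of the 12 drives on one NUMA node into 8 equal partitions
for d in /dev/nvme{0..11}n1; do
    parted -s "$d" mklabel gpt
    for p in $(seq 1 8); do
        parted -s "$d" mkpart "slice$p" "$(( (p-1)*12 ))%" "$(( p*12 ))%"
    done
done
# one RAID6 array per partition index, 12 members each
for p in $(seq 1 8); do
    mdadm --create "/dev/md/r6_$p" --level=6 --raid-devices=12 /dev/nvme{0..11}n1p"$p"
done
# stripe a single logical volume across the 8 arrays
pvcreate /dev/md/r6_{1..8}
vgcreate vg_r6 /dev/md/r6_{1..8}
lvcreate -i 8 -I 128 -l 100%FREE -n lv_r6 vg_r6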

I saw a significant (for me, significant is >20%) increase in IOPs doing this. 

You still have RAID6 protection, but you might want to write a couple of scripts to help you manage the arrays, because a single failed drive now needs to be removed from 8 RAID volumes. 

There’s not a lot of capacity lost doing this, pretty sure I lost less than 100MB to the partitions and the RAID overhead.

You would never consider this on spinning disk of course; way too slow, and you're just going to make it slower. NVMe, as you noticed, has the IOPs to spare, so I'm pretty sure it's just that we're not able to get the data to it fast enough.

Matt

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-07-28 10:31 ` Matt Wallis
@ 2021-07-28 10:43   ` Finlayson, James M CIV (USA)
  2021-07-29  0:54     ` [Non-DoD Source] " Matt Wallis
  0 siblings, 1 reply; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-07-28 10:43 UTC (permalink / raw)
  To: 'Matt Wallis'
  Cc: 'linux-raid@vger.kernel.org', Finlayson, James M CIV (USA)

Matt,
I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about our long-term sustainability. As a researcher I can do these cool science experiments, but I still have to hand designs off to sustainment folks. I was also running into an issue with putting an mdraid RAID0 on top of the RAID6's so I could toss one xfs file system on top of each NUMA node's drives: the final RAID0 stripe over all of the RAID6's couldn't generate the queue depth needed. We even recompiled the kernel to change the mdraid nr_requests max from 128 to 1023.

I will have to try the LVM experiment. I'm an LVM neophyte, so it might take me the rest of today/tomorrow to get new results, as I tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time as well. One might be able to argue that such a configuration isn't too much of a "Frankenstein's monster" for me to hand off.

Thanks,
Jim




-----Original Message-----
From: Matt Wallis <mattw@madmonks.org> 
Sent: Wednesday, July 28, 2021 6:32 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: linux-raid@vger.kernel.org
Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

Hi Jim,

> On 28 Jul 2021, at 06:32, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> 
> Sorry, this will be a long email with everything I find to be relevant, but I can get over 110GB/s of 4kB random reads from individual NVMe SSDs, but I'm at a loss why mdraid can only do a very small  fraction of it.   I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful.     This is everything I do to a server to make the I/O crank.....My role is that of a lab researcher/resident expert/consultant.   I'm just stumped why I can't do better.   If there is a fine manual that somebody can point me to, I'm happy to read it…

I am probably going to get corrected on some if not all of this, but from what I understand, and from my own little experiments on a similar Intel based system…
1. NVMe is stupid fast, you need a good chunk of CPU performance to max it out.
2. Most block IO in the kernel is limited in terms of threading, it may even be essentially single threaded. (This is where I will get corrected)
3. AFAICT, this includes mdraid, there’s a single thread per RAID device handling all the RAID calculations. (mdX_raid6)

What I did to get IOPs up in a system with 24 NVMe, split up into 12 per NUMA domain.
1. Create 8 partitions on each drive (this may be overkill, I just started here for some reason)
2. Create 8 RAID6 arrays with 1 partition per drive.
3. Use LVM to create a single striped logical volume over all 8 RAID volumes. RAID 0+6 as it were.

You now have an LVM thread that is basically doing nothing more than chunking the data as it comes in, then, sending the chunks to 8 separate RAID devices, each with their own threads, buffers, queues etc, all of which can be spread over more cores.

I saw a significant (for me, significant is >20%) increase in IOPs doing this. 

You still have RAID6 protection, but you might want to write a couple of scripts to help you manage the arrays, because a single failed drive now needs to be removed from 8 RAID volumes. 

There’s not a lot of capacity lost doing this, pretty sure I lost less than 100MB to the partitions and the RAID overhead.

You would never consider this on spinning disk of course, way to slow and you’re just going to make it slower, NVMe as you noticed has the IOPs to spare, so I’m pretty sure it’s just that we’re not able to get the data to it fast enough.

Matt

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-07-28 10:43   ` [Non-DoD Source] " Finlayson, James M CIV (USA)
@ 2021-07-29  0:54     ` Matt Wallis
  2021-07-29 16:35       ` Wols Lists
  2021-07-29 22:05       ` Finlayson, James M CIV (USA)
  0 siblings, 2 replies; 28+ messages in thread
From: Matt Wallis @ 2021-07-29  0:54 UTC (permalink / raw)
  To: Finlayson, James M CIV (USA); +Cc: linux-raid

Hi Jim,

Totally get the Frankenstein’s monster aspect; I try not to build those where I can, but at the moment I don’t think there’s much that can be done about it.
Not sure LVM is better than MDRAID 0; it just gives you more control over the volumes that can be created, instead of having it all in one big chunk. If you just need one big chunk, then MDRAID 0 is probably fine.

I think if you can create a couple of scripts that allow the admin to fail a drive out of all the arrays it's in at once, then it's not much worse than managing an MDRAID normally is.

Matt.

> On 28 Jul 2021, at 20:43, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> 
> Matt,
> I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about our sustainability long term.     As a researcher, I can do these cool science experiments, but I still have to hand designs  to sustainment folks.  I was also running into an issue of doing a mdraid RAID0 on top of the RAID6's so I could toss one  xfs file system on top each of the numa node's drives and the last RAID0 stripe of all of the RAID6's couldn't generate the queue depth needed.    We even recompiled the kernel to change the mdraid nr_request max from 128 to 1023.  
> 
> I will have to try the LVM experiment.  I'm an LVM  neophyte, so it might take me the rest of today/tomorrow to get new results as I tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time also.  Once might be able to argue that configuration isn't too much of a "Frankenstein's monster" for me to hand it off.
> 
> Thanks,
> Jim
> 
> 
> 
> 
> -----Original Message-----
> From: Matt Wallis <mattw@madmonks.org> 
> Sent: Wednesday, July 28, 2021 6:32 AM
> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> Cc: linux-raid@vger.kernel.org
> Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
> 
> Hi Jim,
> 
>> On 28 Jul 2021, at 06:32, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>> 
>> Sorry, this will be a long email with everything I find to be relevant, but I can get over 110GB/s of 4kB random reads from individual NVMe SSDs, but I'm at a loss why mdraid can only do a very small  fraction of it.   I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful.     This is everything I do to a server to make the I/O crank.....My role is that of a lab researcher/resident expert/consultant.   I'm just stumped why I can't do better.   If there is a fine manual that somebody can point me to, I'm happy to read it…
> 
> I am probably going to get corrected on some if not all of this, but from what I understand, and from my own little experiments on a similar Intel based system…
> 1. NVMe is stupid fast, you need a good chunk of CPU performance to max it out.
> 2. Most block IO in the kernel is limited in terms of threading, it may even be essentially single threaded. (This is where I will get corrected)
> 3. AFAICT, this includes mdraid, there’s a single thread per RAID device handling all the RAID calculations. (mdX_raid6)
> 
> What I did to get IOPs up in a system with 24 NVMe, split up into 12 per NUMA domain.
> 1. Create 8 partitions on each drive (this may be overkill, I just started here for some reason)
> 2. Create 8 RAID6 arrays with 1 partition per drive.
> 3. Use LVM to create a single striped logical volume over all 8 RAID volumes. RAID 0+6 as it were.
> 
> You now have an LVM thread that is basically doing nothing more than chunking the data as it comes in, then, sending the chunks to 8 separate RAID devices, each with their own threads, buffers, queues etc, all of which can be spread over more cores.
> 
> I saw a significant (for me, significant is >20%) increase in IOPs doing this. 
> 
> You still have RAID6 protection, but you might want to write a couple of scripts to help you manage the arrays, because a single failed drive now needs to be removed from 8 RAID volumes. 
> 
> There’s not a lot of capacity lost doing this, pretty sure I lost less than 100MB to the partitions and the RAID overhead.
> 
> You would never consider this on spinning disk of course, way to slow and you’re just going to make it slower, NVMe as you noticed has the IOPs to spare, so I’m pretty sure it’s just that we’re not able to get the data to it fast enough.
> 
> Matt

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-07-29  0:54     ` [Non-DoD Source] " Matt Wallis
@ 2021-07-29 16:35       ` Wols Lists
  2021-07-29 18:12         ` Finlayson, James M CIV (USA)
  2021-07-29 22:05       ` Finlayson, James M CIV (USA)
  1 sibling, 1 reply; 28+ messages in thread
From: Wols Lists @ 2021-07-29 16:35 UTC (permalink / raw)
  To: Matt Wallis, Finlayson, James M CIV (USA); +Cc: linux-raid

On 29/07/21 01:54, Matt Wallis wrote:
> Hi Jim,
> 
> Totally get the Frankenstein’s monster aspect, I try not to build those where I can, but at the moment I don’t think there’s much that can be done about it.
> Not sure if LVM is better than MDRAID 0, it just gives you more control over the volumes that can be created, instead of having it all in one big chunk. If you just need one big chunk, then MDRAID 0 is probably fine.
> 
Sticking raid 0 on top of raid 6 sounds like an extremely weird thing to do.

What I guess you might be wanting to do instead is write a partition
table to the raid-6? That's perfectly normal, if, imho, a bit unusual.

And LVM would be MUCH better than raid-0, I'm sure, because it addresses
this very issue by design, rather than by accident.

> I think if you can create a couple of scripts that allows the admin to fail a drive out of all the arrays that it’s in at once, then it's not that much worse than managing an MDRAID is normally. 

Is that wise? KISS.
> 
> Matt.
> 
>> On 28 Jul 2021, at 20:43, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>>
>> Matt,
>> I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about our sustainability long term.     As a researcher, I can do these cool science experiments, but I still have to hand designs  to sustainment folks.  I was also running into an issue of doing a mdraid RAID0 on top of the RAID6's so I could toss one  xfs file system on top each of the numa node's drives and the last RAID0 stripe of all of the RAID6's couldn't generate the queue depth needed.    We even recompiled the kernel to change the mdraid nr_request max from 128 to 1023.  
>>
>> I will have to try the LVM experiment.  I'm an LVM  neophyte, so it might take me the rest of today/tomorrow to get new results as I tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time also.  Once might be able to argue that configuration isn't too much of a "Frankenstein's monster" for me to hand it off.
>>
Do. If it solves what you want, then it's worth it. I'm moving my stuff
over to LVM.

To throw something else into the mix, you've gone for raid 6, which
enables you to lose two drives, or recover from corruption on one
drive. Do you need the two-drive redundancy? If you're worried about
write speed, the calculations are a lot more expensive than raid-5.
I don't know its impact yet, but I'm playing with dm-integrity, which
provides some protection against corruption.

>> Thanks,
>> Jim
>>
Cheers,
Wol


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-07-29 16:35       ` Wols Lists
@ 2021-07-29 18:12         ` Finlayson, James M CIV (USA)
  0 siblings, 0 replies; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-07-29 18:12 UTC (permalink / raw)
  To: 'Wols Lists', 'Matt Wallis'
  Cc: 'linux-raid@vger.kernel.org'

Hi,
Actually, the RAID5/RAID6 mdraid implementations can't support the IOPS or the queue depths required from a single basic mdraid RAID5/RAID6 LUN. The partitions are just there to create more mdraid stripes, so that more threads can do the RAID parity work and more I/Os can be issued to the entirety of the NVMe SSDs through mdraid. Ultimately I need one volume per NUMA domain comprised of RAIDed NVMe SSDs. We're just exploring creative workarounds to the NVMe mdraid IOPS issues to get the most IOPS out of a collection of SSDs. I still have to put an xfs file system on each volume for something useful to occur.
Thanks,
Jim



-----Original Message-----
From: Wols Lists <antlists@youngman.org.uk> 
Sent: Thursday, July 29, 2021 12:35 PM
To: Matt Wallis <mattw@madmonks.org>; Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: linux-raid@vger.kernel.org
Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

On 29/07/21 01:54, Matt Wallis wrote:
> Hi Jim,
> 
> Totally get the Frankenstein’s monster aspect, I try not to build those where I can, but at the moment I don’t think there’s much that can be done about it.
> Not sure if LVM is better than MDRAID 0, it just gives you more control over the volumes that can be created, instead of having it all in one big chunk. If you just need one big chunk, then MDRAID 0 is probably fine.
> 
sticking raid 0 on top of raid 6 sounds an extremely weird thing to do.

What I guess you might be wanting to do instead is write a partition table to the raid-6? That's perfectly normal if, imho, a bit unusual?

And LVM would be MUCH better than raid-0, I'm sure, because it addresses this very issue by design, rather than by accident.

> I think if you can create a couple of scripts that allows the admin to fail a drive out of all the arrays that it’s in at once, then it's not that much worse than managing an MDRAID is normally. 

Is that wise? KISS.
> 
> Matt.
> 
>> On 28 Jul 2021, at 20:43, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>>
>> Matt,
>> I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about our sustainability long term.     As a researcher, I can do these cool science experiments, but I still have to hand designs  to sustainment folks.  I was also running into an issue of doing a mdraid RAID0 on top of the RAID6's so I could toss one  xfs file system on top each of the numa node's drives and the last RAID0 stripe of all of the RAID6's couldn't generate the queue depth needed.    We even recompiled the kernel to change the mdraid nr_request max from 128 to 1023.  
>>
>> I will have to try the LVM experiment.  I'm an LVM  neophyte, so it might take me the rest of today/tomorrow to get new results as I tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time also.  Once might be able to argue that configuration isn't too much of a "Frankenstein's monster" for me to hand it off.
>>
Do. If it solves what you want, then it's worth it. I'm moving my stuff over to LVM.

To throw something else into the mix, you've gone for raid 6, which enables you to lose two drives, or corrupt one drive. Do you need the two-drive redundancy? The calculations are a lot more expensive than
raid-5 if you're worried over write speed. I don't know the impact of it but I'm playing with dm-integrity which provides some protection against corruption.

>> Thanks,
>> Jim
>>
Cheers,
Wol


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-07-29  0:54     ` [Non-DoD Source] " Matt Wallis
  2021-07-29 16:35       ` Wols Lists
@ 2021-07-29 22:05       ` Finlayson, James M CIV (USA)
  2021-07-30  8:28         ` Matt Wallis
  1 sibling, 1 reply; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-07-29 22:05 UTC (permalink / raw)
  To: 'Matt Wallis'
  Cc: 'linux-raid@vger.kernel.org', Finlayson, James M CIV (USA)

Matt,
Thank you for the tip. I have put 32 partitions on each of my NVMe drives, made 32 RAID5 stripes, and then made an LVM volume from the 32 RAID5 stripes. I then created one physical volume per 10 NVMe drives on each socket. I believe I have successfully harnessed "alien technology" (inside joke) to create a Frankenstein's monster. These results are substantially better than doing a RAID0 stripe over the partitioned md's, as I did in the past. I whipped all of this together in two 15-minute sessions (last night and just now), so I might run some more extensive tests when I have the cycles. I didn't intend to leave the thread hanging.
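
In case it is useful to anyone following along, the per-socket layout described above might look roughly like this (a sketch with assumed names; slicing each drive into 32 equal partitions is not shown):

# 32 RAID5 arrays, one per partition index, across the 10 data drives of socket 0
for p in $(seq 1 32); do
    mdadm --create "/dev/md/s0_$p" --level=5 --chunk=128 --raid-devices=10 /dev/nvme{0..9}n1p"$p"
done
pvcreate /dev/md/s0_*
vgcreate vg_socket0 /dev/md/s0_*
lvcreate -i 32 -I 128 -l 100%FREE -n lv_socket0 vg_socket0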
BLUF: detailed fio output below.
9 drives per socket, raw:
socket0, 9 drives, raw 4K random reads: 13.6M IOPS; socket1, 9 drives, raw 4K random reads: 12.3M IOPS
%Cpu(s):  4.4 us, 25.6 sy,  0.0 ni, 56.7 id,  0.0 wa, 13.1 hi,  0.2 si,  0.0 st

9 data drives per socket, RAID5/LVM, raw (9+1):
socket0, 9 drives, raw 4K random reads: 8.57M IOPS; socket1, 9 drives, raw 4K random reads: 8.57M IOPS
%Cpu(s):  7.0 us, 22.3 sy,  0.0 ni, 58.4 id,  0.0 wa, 12.1 hi,  0.2 si,  0.0 st


All,
I intend to test the 4.15 kernel patch next week. My SA would prefer that the patch had already made it into the kernel-ml stream so he could just install an rpm, but I'll get him to build the kernel if need be.
If the 4.15 kernel patch doesn't alleviate the issues, I have a strong desire to see mdraid made better.


Quick fio results:
socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket0-lv: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1-lv: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.26
Starting 256 processes

socket0: (groupid=0, jobs=64): err= 0: pid=64032: Thu Jul 29 21:48:32 2021
  read: IOPS=13.6M, BW=51.7GiB/s (55.6GB/s)(3105GiB/60003msec)
    slat (nsec): min=1292, max=1376.7k, avg=2545.32, stdev=2696.74
    clat (usec): min=36, max=71580, avg=600.68, stdev=361.36
     lat (usec): min=38, max=71616, avg=603.30, stdev=361.38
    clat percentiles (usec):
     |  1.00th=[  169],  5.00th=[  231], 10.00th=[  277], 20.00th=[  347],
     | 30.00th=[  404], 40.00th=[  457], 50.00th=[  519], 60.00th=[  594],
     | 70.00th=[  676], 80.00th=[  791], 90.00th=[  996], 95.00th=[ 1205],
     | 99.00th=[ 1909], 99.50th=[ 2409], 99.90th=[ 3556], 99.95th=[ 3884],
     | 99.99th=[ 5538]
   bw (  MiB/s): min=39960, max=56660, per=20.94%, avg=53040.69, stdev=49.61, samples=7488
   iops        : min=10229946, max=14504941, avg=13578391.80, stdev=12699.61, samples=7488
  lat (usec)   : 50=0.01%, 100=0.01%, 250=6.83%, 500=40.06%, 750=29.92%
  lat (usec)   : 1000=13.42%
  lat (msec)   : 2=8.91%, 4=0.82%, 10=0.04%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=14.82%, sys=46.57%, ctx=35564249, majf=0, minf=9754
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=813909201,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1: (groupid=1, jobs=64): err= 0: pid=64096: Thu Jul 29 21:48:32 2021
  read: IOPS=12.3M, BW=46.9GiB/s (50.3GB/s)(2812GiB/60003msec)
    slat (nsec): min=1292, max=1672.2k, avg=2672.21, stdev=2742.06
    clat (usec): min=25, max=73526, avg=663.35, stdev=611.06
     lat (usec): min=28, max=73545, avg=666.09, stdev=611.08
    clat percentiles (usec):
     |  1.00th=[  143],  5.00th=[  190], 10.00th=[  227], 20.00th=[  285],
     | 30.00th=[  338], 40.00th=[  400], 50.00th=[  478], 60.00th=[  586],
     | 70.00th=[  725], 80.00th=[  930], 90.00th=[ 1254], 95.00th=[ 1614],
     | 99.00th=[ 3490], 99.50th=[ 4146], 99.90th=[ 6390], 99.95th=[ 6980],
     | 99.99th=[ 8356]
   bw (  MiB/s): min=28962, max=55326, per=12.70%, avg=48036.10, stdev=96.75, samples=7488
   iops        : min=7414327, max=14163615, avg=12297214.98, stdev=24768.82, samples=7488
  lat (usec)   : 50=0.01%, 100=0.03%, 250=13.75%, 500=38.84%, 750=18.71%
  lat (usec)   : 1000=11.55%
  lat (msec)   : 2=14.30%, 4=2.23%, 10=0.60%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=13.41%, sys=44.44%, ctx=39379711, majf=0, minf=9982
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=737168913,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket0-lv: (groupid=2, jobs=64): err= 0: pid=64166: Thu Jul 29 21:48:32 2021
  read: IOPS=8570k, BW=32.7GiB/s (35.1GB/s)(1962GiB/60006msec)
    slat (nsec): min=1873, max=11085k, avg=4694.47, stdev=4825.39
    clat (usec): min=24, max=21739, avg=950.52, stdev=948.83
     lat (usec): min=51, max=21743, avg=955.29, stdev=948.88
    clat percentiles (usec):
     |  1.00th=[  155],  5.00th=[  217], 10.00th=[  265], 20.00th=[  338],
     | 30.00th=[  404], 40.00th=[  486], 50.00th=[  594], 60.00th=[  766],
     | 70.00th=[ 1029], 80.00th=[ 1418], 90.00th=[ 2089], 95.00th=[ 2737],
     | 99.00th=[ 4490], 99.50th=[ 5669], 99.90th=[ 8586], 99.95th=[ 9896],
     | 99.99th=[12125]
   bw (  MiB/s): min=24657, max=37453, per=-25.17%, avg=33516.00, stdev=35.32, samples=7616
   iops        : min=6312326, max=9588007, avg=8580076.03, stdev=9041.88, samples=7616
  lat (usec)   : 50=0.01%, 100=0.01%, 250=8.40%, 500=33.21%, 750=17.54%
  lat (usec)   : 1000=9.89%
  lat (msec)   : 2=19.93%, 4=9.58%, 10=1.40%, 20=0.05%, 50=0.01%
  cpu          : usr=9.01%, sys=51.48%, ctx=27829950, majf=0, minf=9028
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=514275323,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1-lv: (groupid=3, jobs=64): err= 0: pid=64230: Thu Jul 29 21:48:32 2021
  read: IOPS=8571k, BW=32.7GiB/s (35.1GB/s)(1962GiB/60006msec)
    slat (nsec): min=1823, max=14362k, avg=4809.30, stdev=4940.42
    clat (usec): min=50, max=22856, avg=950.31, stdev=948.13
     lat (usec): min=54, max=22860, avg=955.19, stdev=948.19
    clat percentiles (usec):
     |  1.00th=[  157],  5.00th=[  221], 10.00th=[  269], 20.00th=[  343],
     | 30.00th=[  412], 40.00th=[  490], 50.00th=[  603], 60.00th=[  766],
     | 70.00th=[ 1029], 80.00th=[ 1418], 90.00th=[ 2089], 95.00th=[ 2737],
     | 99.00th=[ 4293], 99.50th=[ 5604], 99.90th=[ 9503], 99.95th=[10683],
     | 99.99th=[12649]
   bw (  MiB/s): min=23434, max=36909, per=-25.17%, avg=33517.14, stdev=50.36, samples=7616
   iops        : min=5999220, max=9448818, avg=8580368.69, stdev=12892.93, samples=7616
  lat (usec)   : 100=0.01%, 250=7.88%, 500=33.09%, 750=18.09%, 1000=10.02%
  lat (msec)   : 2=19.91%, 4=9.75%, 10=1.16%, 20=0.08%, 50=0.01%
  cpu          : usr=9.14%, sys=51.94%, ctx=25524808, majf=0, minf=9037
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=514294010,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=51.7GiB/s (55.6GB/s), 51.7GiB/s-51.7GiB/s (55.6GB/s-55.6GB/s), io=3105GiB (3334GB), run=60003-60003msec

Run status group 1 (all jobs):
   READ: bw=46.9GiB/s (50.3GB/s), 46.9GiB/s-46.9GiB/s (50.3GB/s-50.3GB/s), io=2812GiB (3019GB), run=60003-60003msec

Run status group 2 (all jobs):
   READ: bw=32.7GiB/s (35.1GB/s), 32.7GiB/s-32.7GiB/s (35.1GB/s-35.1GB/s), io=1962GiB (2106GB), run=60006-60006msec

Run status group 3 (all jobs):
   READ: bw=32.7GiB/s (35.1GB/s), 32.7GiB/s-32.7GiB/s (35.1GB/s-35.1GB/s), io=1962GiB (2107GB), run=60006-60006msec

Disk stats (read/write):
  nvme0n1: ios=90336694/0, merge=0/0, ticks=45102163/0, in_queue=45102163, util=97.44%
  nvme1n1: ios=90337153/0, merge=0/0, ticks=47422886/0, in_queue=47422887, util=97.81%
  nvme2n1: ios=90337516/0, merge=0/0, ticks=46419782/0, in_queue=46419782, util=97.95%
  nvme3n1: ios=90337843/0, merge=0/0, ticks=46256374/0, in_queue=46256374, util=97.95%
  nvme4n1: ios=90337742/0, merge=0/0, ticks=59122226/0, in_queue=59122225, util=98.19%
  nvme5n1: ios=90338813/0, merge=0/0, ticks=57811758/0, in_queue=57811758, util=98.33%
  nvme6n1: ios=90339194/0, merge=0/0, ticks=57369337/0, in_queue=57369337, util=98.37%
  nvme7n1: ios=90339048/0, merge=0/0, ticks=55791076/0, in_queue=55791076, util=98.78%
  nvme8n1: ios=90340234/0, merge=0/0, ticks=44977001/0, in_queue=44977001, util=99.01%
  nvme12n1: ios=81819608/0, merge=0/0, ticks=26788080/0, in_queue=26788079, util=99.24%
  nvme13n1: ios=81819831/0, merge=0/0, ticks=26736682/0, in_queue=26736681, util=99.57%
  nvme14n1: ios=81820006/0, merge=0/0, ticks=26772951/0, in_queue=26772951, util=99.67%
  nvme15n1: ios=81820215/0, merge=0/0, ticks=26741532/0, in_queue=26741532, util=99.78%
  nvme17n1: ios=81819922/0, merge=0/0, ticks=76459192/0, in_queue=76459192, util=99.84%
  nvme18n1: ios=81820146/0, merge=0/0, ticks=86756309/0, in_queue=86756309, util=99.82%
  nvme19n1: ios=81820481/0, merge=0/0, ticks=75008919/0, in_queue=75008919, util=100.00%
  nvme20n1: ios=81819690/0, merge=0/0, ticks=91888274/0, in_queue=91888275, util=100.00%
  nvme21n1: ios=81821809/0, merge=0/0, ticks=26653056/0, in_queue=26653057, util=100.00%
-----Original Message-----
From: Matt Wallis <mattw@madmonks.org> 
Sent: Wednesday, July 28, 2021 8:54 PM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: linux-raid@vger.kernel.org
Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

Hi Jim,

Totally get the Frankenstein’s monster aspect, I try not to build those where I can, but at the moment I don’t think there’s much that can be done about it.
Not sure if LVM is better than MDRAID 0, it just gives you more control over the volumes that can be created, instead of having it all in one big chunk. If you just need one big chunk, then MDRAID 0 is probably fine.

I think if you can create a couple of scripts that allow the admin to fail a drive out of all the arrays it's in at once, then it's not that much worse than managing a normal MDRAID.
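
A minimal sketch of such a fail-out script (the drive name and the use of /proc/mdstat to enumerate the arrays are assumptions):

# fail and remove every partition of a dying drive from every md array it sits in
DRIVE=nvme3n1                       # assumed name of the failed drive
for md in $(awk '/^md/ {print "/dev/"$1}' /proc/mdstat); do
    for part in /dev/${DRIVE}p*; do
        mdadm "$md" --fail "$part" --remove "$part" 2>/dev/null
    done
done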

Matt.

> On 28 Jul 2021, at 20:43, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> 
> Matt,
> I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about our sustainability long term.   As a researcher, I can do these cool science experiments, but I still have to hand designs to sustainment folks.   I was also running into an issue doing a mdraid RAID0 on top of the RAID6's so I could toss one xfs file system on top of each NUMA node's drives - the final RAID0 stripe over all of the RAID6's couldn't generate the queue depth needed.   We even recompiled the kernel to change the mdraid nr_requests max from 128 to 1023.
> 
> I will have to try the LVM experiment.  I'm an LVM neophyte, so it might take me the rest of today/tomorrow to get new results, as I tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time also.  One might be able to argue that configuration isn't too much of a "Frankenstein's monster" for me to hand off.
> 
> Thanks,
> Jim
> 
> 
> 
> 
> -----Original Message-----
> From: Matt Wallis <mattw@madmonks.org> 
> Sent: Wednesday, July 28, 2021 6:32 AM
> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> Cc: linux-raid@vger.kernel.org
> Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
> 
> Hi Jim,
> 
>> On 28 Jul 2021, at 06:32, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>> 
>> Sorry, this will be a long email with everything I find to be relevant, but I can get over 110GB/s of 4kB random reads from individual NVMe SSDs, but I'm at a loss why mdraid can only do a very small  fraction of it.   I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful.     This is everything I do to a server to make the I/O crank.....My role is that of a lab researcher/resident expert/consultant.   I'm just stumped why I can't do better.   If there is a fine manual that somebody can point me to, I'm happy to read it…
> 
> I am probably going to get corrected on some if not all of this, but from what I understand, and from my own little experiments on a similar Intel based system…
> 1. NVMe is stupid fast, you need a good chunk of CPU performance to max it out.
> 2. Most block IO in the kernel is limited in terms of threading, it may even be essentially single threaded. (This is where I will get corrected)
> 3. AFAICT, this includes mdraid, there’s a single thread per RAID device handling all the RAID calculations. (mdX_raid6)
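> 
> (A quick, assumed-name way to watch that per-array thread and how busy it gets:
> 
>     ps -eLo pid,psr,pcpu,comm | grep md0_raid
>     top -H -p "$(pgrep -d, md0_raid)"
> )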
> 
> What I did to get IOPs up in a system with 24 NVMe, split up into 12 per NUMA domain:
> 1. Create 8 partitions on each drive (this may be overkill, I just started here for some reason).
> 2. Create 8 RAID6 arrays with 1 partition per drive.
> 3. Use LVM to create a single striped logical volume over all 8 RAID volumes. RAID 0+6 as it were.
> 
> You now have an LVM thread that is basically doing nothing more than chunking the data as it comes in, then, sending the chunks to 8 separate RAID devices, each with their own threads, buffers, queues etc, all of which can be spread over more cores.
> 
> I saw a significant (for me, significant is >20%) increase in IOPs doing this. 
> 
> You still have RAID6 protection, but you might want to write a couple of scripts to help you manage the arrays, because a single failed drive now needs to be removed from 8 RAID volumes. 
> 
> There’s not a lot of capacity lost doing this, pretty sure I lost less than 100MB to the partitions and the RAID overhead.
> 
> You would never consider this on spinning disk of course - way too slow, and you’re just going to make it slower. NVMe, as you noticed, has the IOPs to spare, so I’m pretty sure it’s just that we’re not able to get the data to it fast enough.
> 
> Matt

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-07-29 22:05       ` Finlayson, James M CIV (USA)
@ 2021-07-30  8:28         ` Matt Wallis
  2021-07-30  8:45           ` Miao Wang
  2021-07-30  9:54           ` Finlayson, James M CIV (USA)
  0 siblings, 2 replies; 28+ messages in thread
From: Matt Wallis @ 2021-07-30  8:28 UTC (permalink / raw)
  To: Finlayson, James M CIV (USA); +Cc: linux-raid

Hi Jim,

That’s significantly better than I expected. I need to see if I can get someone to send me the box I was using, so I can spend some more time on it.

Good luck with the rest of it. The bit I was looking at as a next step was potentially tweaking stripe widths and the like, to see how much difference they make on different workloads.
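
(At the LVM layer that is just a re-create with a different stripe size - a sketch, reusing the assumed names from the build sketch earlier in the thread:

    lvremove -y vg_socket0/lv_socket0
    lvcreate --type striped -i 32 -I 512k -l 100%FREE -n lv_socket0 vg_socket0
)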

Matt. 

> On 30 Jul 2021, at 08:05, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> 
> Matt,
> Thank you for the tip.   I have put 32 partitions on each of my NVMe drives, made 32 raid5 stripes and then made an LVM from the 32 raid5 stripes.   I then created one physical volume per 10 NVMe drives on each socket.    I believe I have successfully harnessed "alien technology" (inside joke) to create a Frankenstein's monster.   These are substantially better than doing a RAID0 stripe over the partitioned md's in the past.  I whipped all of this together in two 15 minute sessions last night and just right now, so I might run some more extensive tests when I have the cycles.   I didn't intend to leave the thread hanging.
> 9 drives per socket raw
> socket0, 9 drives, raw 4K random reads 13.6M IOPS , socket1 9 drives, raw 4K random reads , 12.3M IOPS
> %Cpu(s):  4.4 us, 25.6 sy,  0.0 ni, 56.7 id,  0.0 wa, 13.1 hi,  0.2 si,  0.0 st
> 
> 9 data drives per socket RAID5/LVM raw (9+1)
> socket0, 9 drives, raw 4K random reads 8.57M IOPS , socket1 9 drives, raw 4K random reads , 8.57M IOPS
> %Cpu(s):  7.0 us, 22.3 sy,  0.0 ni, 58.4 id,  0.0 wa, 12.1 hi,  0.2 si,  0.0 st

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-07-30  8:28         ` Matt Wallis
@ 2021-07-30  8:45           ` Miao Wang
  2021-07-30  9:59             ` Finlayson, James M CIV (USA)
  2021-07-30 13:17             ` Peter Grandi
  2021-07-30  9:54           ` Finlayson, James M CIV (USA)
  1 sibling, 2 replies; 28+ messages in thread
From: Miao Wang @ 2021-07-30  8:45 UTC (permalink / raw)
  To: Finlayson, James M CIV (USA); +Cc: Matt Wallis, linux-raid

Hi Jim,

Nice to hear about your findings on how to make Linux md work better on fast NVMe drives; I was previously stuck on a similar problem and finally gave up. Since it is very difficult to find an environment with so many fast NVMe drives, I wonder if you have any interest in ZFS. Maybe you can set up a similar raidz configuration on those drives and see whether its performance is better or worse.
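
(For comparison, a raidz pool over one socket's drives would look something like the sketch below - pool name, drive names and the raidz level are assumptions, and primarycache=metadata just keeps the ARC from caching the random-read data:

    zpool create -o ashift=12 -O recordsize=4k -O primarycache=metadata \
          tank raidz1 nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1 \
                      nvme5n1 nvme6n1 nvme7n1 nvme8n1 nvme9n1
    zfs create tank/bench
)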

Cheers,

Miao Wang

> On 30 Jul 2021, at 16:28, Matt Wallis <mattw@madmonks.org> wrote:
> 
> Hi Jim,
> 
> That’s significantly better than I expected, I need to see if I can get someone to send me the box I was using so I can spend some more time on it.
> 
> Good luck with the rest of it, the bit I was looking for as a next step was going to be potentially tweaking stripe widths and the like to see how much difference it made on different workloads.
> 
> Matt. 
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-07-30  8:28         ` Matt Wallis
  2021-07-30  8:45           ` Miao Wang
@ 2021-07-30  9:54           ` Finlayson, James M CIV (USA)
  1 sibling, 0 replies; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-07-30  9:54 UTC (permalink / raw)
  To: 'Matt Wallis'; +Cc: 'linux-raid@vger.kernel.org'

I just always used 128K - large enough that most small IOPS-style operations can be satisfied by a single read from one drive, and small enough that a 1MB O_DIRECT write gets us a full stripe write.   Plus, I always plug the old "Fusion I/O" crowd.   Best white papers ever - "here are our numbers, here's how we got them, and here are the instructions for you to get them too".
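
(Full stripe = chunk size x number of data drives, so for example an 8+1 RAID5 with a 128K chunk gives exactly a 1MB full stripe, while a 9+1 layout gives 1152K; the command below only illustrates the 8+1 case and is not the exact geometry here:

    mdadm --create /dev/md/r5 --level=5 --raid-devices=9 --chunk=128 /dev/nvme{0..8}n1p1
)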


-----Original Message-----
From: Matt Wallis <mattw@madmonks.org> 
Sent: Friday, July 30, 2021 4:28 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: linux-raid@vger.kernel.org
Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

Hi Jim,

That’s significantly better than I expected, I need to see if I can get someone to send me the box I was using so I can spend some more time on it.

Good luck with the rest of it, the bit I was looking for as a next step was going to be potentially tweaking stripe widths and the like to see how much difference it made on different workloads.

Matt. 

> On 30 Jul 2021, at 08:05, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> 
> Matt,
> Thank you for the tip.   I have put 32 partitions on each of my NVMe drives, made 32 raid5 stripes and then made an LVM from the 32 raid5 stripes.   I then created one physical volume per 10 NVMe drives on each socket.    I believe I have successfully harnessed "alien technology" (inside joke) to create a Frankenstein's monster.   These are substantially better than doing a RAID0 stripe over the partitioned md's in the past.  I whipped all of this together in two 15 minute sessions last night and just right now, so I might run some more extensive tests when I have the cycles.   I didn't intend to leave the thread hanging.
> BLUF -  fio detailed output below.....
> 9 drives per socket raw
> socket0, 9 drives, raw 4K random reads 13.6M IOPS , socket1 9 drives, 
> raw 4K random reads , 12.3M IOPS
> %Cpu(s):  4.4 us, 25.6 sy,  0.0 ni, 56.7 id,  0.0 wa, 13.1 hi,  0.2 
> si,  0.0 st
> 
> 9 data drives per socket RAID5/LVM raw (9+1) socket0, 9 drives, raw 4K 
> random reads 8.57M IOPS , socket1 9 drives, raw 4K random reads , 
> 8.57M IOPS
> %Cpu(s):  7.0 us, 22.3 sy,  0.0 ni, 58.4 id,  0.0 wa, 12.1 hi,  0.2 
> si,  0.0 st
> 
> 
> All,
> I intend to test the 4.15 kernel patch next week.   My SA would prefer if the patch has made it into the kernel-ml stream yet, so he could install an rpm, but I'll get him to build the kernel if need be.
> If the 4.15 kernel patch doesn't alleviate the issues, I have a strong desire to have mdraid made better.
> 
> 
> Quick fio results:
> socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
> 4096B-4096B, ioengine=libaio, iodepth=128 ...
> socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
> 4096B-4096B, ioengine=libaio, iodepth=128 ...
> socket0-lv: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, 
> (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
> socket1-lv: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, 
> (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
> fio-3.26
> Starting 256 processes
> 
> socket0: (groupid=0, jobs=64): err= 0: pid=64032: Thu Jul 29 21:48:32 
> 2021
>  read: IOPS=13.6M, BW=51.7GiB/s (55.6GB/s)(3105GiB/60003msec)
>    slat (nsec): min=1292, max=1376.7k, avg=2545.32, stdev=2696.74
>    clat (usec): min=36, max=71580, avg=600.68, stdev=361.36
>     lat (usec): min=38, max=71616, avg=603.30, stdev=361.38
>    clat percentiles (usec):
>     |  1.00th=[  169],  5.00th=[  231], 10.00th=[  277], 20.00th=[  347],
>     | 30.00th=[  404], 40.00th=[  457], 50.00th=[  519], 60.00th=[  594],
>     | 70.00th=[  676], 80.00th=[  791], 90.00th=[  996], 95.00th=[ 1205],
>     | 99.00th=[ 1909], 99.50th=[ 2409], 99.90th=[ 3556], 99.95th=[ 3884],
>     | 99.99th=[ 5538]
>   bw (  MiB/s): min=39960, max=56660, per=20.94%, avg=53040.69, stdev=49.61, samples=7488
>   iops        : min=10229946, max=14504941, avg=13578391.80, stdev=12699.61, samples=7488
>  lat (usec)   : 50=0.01%, 100=0.01%, 250=6.83%, 500=40.06%, 750=29.92%
>  lat (usec)   : 1000=13.42%
>  lat (msec)   : 2=8.91%, 4=0.82%, 10=0.04%, 20=0.01%, 50=0.01%
>  lat (msec)   : 100=0.01%
>  cpu          : usr=14.82%, sys=46.57%, ctx=35564249, majf=0, minf=9754
>  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>     issued rwts: total=813909201,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>     latency   : target=0, window=0, percentile=100.00%, depth=128
> socket1: (groupid=1, jobs=64): err= 0: pid=64096: Thu Jul 29 21:48:32 
> 2021
>  read: IOPS=12.3M, BW=46.9GiB/s (50.3GB/s)(2812GiB/60003msec)
>    slat (nsec): min=1292, max=1672.2k, avg=2672.21, stdev=2742.06
>    clat (usec): min=25, max=73526, avg=663.35, stdev=611.06
>     lat (usec): min=28, max=73545, avg=666.09, stdev=611.08
>    clat percentiles (usec):
>     |  1.00th=[  143],  5.00th=[  190], 10.00th=[  227], 20.00th=[  285],
>     | 30.00th=[  338], 40.00th=[  400], 50.00th=[  478], 60.00th=[  586],
>     | 70.00th=[  725], 80.00th=[  930], 90.00th=[ 1254], 95.00th=[ 1614],
>     | 99.00th=[ 3490], 99.50th=[ 4146], 99.90th=[ 6390], 99.95th=[ 6980],
>     | 99.99th=[ 8356]
>   bw (  MiB/s): min=28962, max=55326, per=12.70%, avg=48036.10, stdev=96.75, samples=7488
>   iops        : min=7414327, max=14163615, avg=12297214.98, stdev=24768.82, samples=7488
>  lat (usec)   : 50=0.01%, 100=0.03%, 250=13.75%, 500=38.84%, 750=18.71%
>  lat (usec)   : 1000=11.55%
>  lat (msec)   : 2=14.30%, 4=2.23%, 10=0.60%, 20=0.01%, 50=0.01%
>  lat (msec)   : 100=0.01%
>  cpu          : usr=13.41%, sys=44.44%, ctx=39379711, majf=0, minf=9982
>  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>     issued rwts: total=737168913,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>     latency   : target=0, window=0, percentile=100.00%, depth=128
> socket0-lv: (groupid=2, jobs=64): err= 0: pid=64166: Thu Jul 29 
> 21:48:32 2021
>  read: IOPS=8570k, BW=32.7GiB/s (35.1GB/s)(1962GiB/60006msec)
>    slat (nsec): min=1873, max=11085k, avg=4694.47, stdev=4825.39
>    clat (usec): min=24, max=21739, avg=950.52, stdev=948.83
>     lat (usec): min=51, max=21743, avg=955.29, stdev=948.88
>    clat percentiles (usec):
>     |  1.00th=[  155],  5.00th=[  217], 10.00th=[  265], 20.00th=[  338],
>     | 30.00th=[  404], 40.00th=[  486], 50.00th=[  594], 60.00th=[  766],
>     | 70.00th=[ 1029], 80.00th=[ 1418], 90.00th=[ 2089], 95.00th=[ 2737],
>     | 99.00th=[ 4490], 99.50th=[ 5669], 99.90th=[ 8586], 99.95th=[ 9896],
>     | 99.99th=[12125]
>   bw (  MiB/s): min=24657, max=37453, per=-25.17%, avg=33516.00, stdev=35.32, samples=7616
>   iops        : min=6312326, max=9588007, avg=8580076.03, stdev=9041.88, samples=7616
>  lat (usec)   : 50=0.01%, 100=0.01%, 250=8.40%, 500=33.21%, 750=17.54%
>  lat (usec)   : 1000=9.89%
>  lat (msec)   : 2=19.93%, 4=9.58%, 10=1.40%, 20=0.05%, 50=0.01%
>  cpu          : usr=9.01%, sys=51.48%, ctx=27829950, majf=0, minf=9028
>  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>     issued rwts: total=514275323,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>     latency   : target=0, window=0, percentile=100.00%, depth=128
> socket1-lv: (groupid=3, jobs=64): err= 0: pid=64230: Thu Jul 29 
> 21:48:32 2021
>  read: IOPS=8571k, BW=32.7GiB/s (35.1GB/s)(1962GiB/60006msec)
>    slat (nsec): min=1823, max=14362k, avg=4809.30, stdev=4940.42
>    clat (usec): min=50, max=22856, avg=950.31, stdev=948.13
>     lat (usec): min=54, max=22860, avg=955.19, stdev=948.19
>    clat percentiles (usec):
>     |  1.00th=[  157],  5.00th=[  221], 10.00th=[  269], 20.00th=[  343],
>     | 30.00th=[  412], 40.00th=[  490], 50.00th=[  603], 60.00th=[  766],
>     | 70.00th=[ 1029], 80.00th=[ 1418], 90.00th=[ 2089], 95.00th=[ 2737],
>     | 99.00th=[ 4293], 99.50th=[ 5604], 99.90th=[ 9503], 99.95th=[10683],
>     | 99.99th=[12649]
>   bw (  MiB/s): min=23434, max=36909, per=-25.17%, avg=33517.14, stdev=50.36, samples=7616
>   iops        : min=5999220, max=9448818, avg=8580368.69, stdev=12892.93, samples=7616
>  lat (usec)   : 100=0.01%, 250=7.88%, 500=33.09%, 750=18.09%, 1000=10.02%
>  lat (msec)   : 2=19.91%, 4=9.75%, 10=1.16%, 20=0.08%, 50=0.01%
>  cpu          : usr=9.14%, sys=51.94%, ctx=25524808, majf=0, minf=9037
>  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>     issued rwts: total=514294010,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>     latency   : target=0, window=0, percentile=100.00%, depth=128
> 
> Run status group 0 (all jobs):
>   READ: bw=51.7GiB/s (55.6GB/s), 51.7GiB/s-51.7GiB/s 
> (55.6GB/s-55.6GB/s), io=3105GiB (3334GB), run=60003-60003msec
> 
> Run status group 1 (all jobs):
>   READ: bw=46.9GiB/s (50.3GB/s), 46.9GiB/s-46.9GiB/s 
> (50.3GB/s-50.3GB/s), io=2812GiB (3019GB), run=60003-60003msec
> 
> Run status group 2 (all jobs):
>   READ: bw=32.7GiB/s (35.1GB/s), 32.7GiB/s-32.7GiB/s 
> (35.1GB/s-35.1GB/s), io=1962GiB (2106GB), run=60006-60006msec
> 
> Run status group 3 (all jobs):
>   READ: bw=32.7GiB/s (35.1GB/s), 32.7GiB/s-32.7GiB/s 
> (35.1GB/s-35.1GB/s), io=1962GiB (2107GB), run=60006-60006msec
> 
> Disk stats (read/write):
>  nvme0n1: ios=90336694/0, merge=0/0, ticks=45102163/0, 
> in_queue=45102163, util=97.44%
>  nvme1n1: ios=90337153/0, merge=0/0, ticks=47422886/0, 
> in_queue=47422887, util=97.81%
>  nvme2n1: ios=90337516/0, merge=0/0, ticks=46419782/0, 
> in_queue=46419782, util=97.95%
>  nvme3n1: ios=90337843/0, merge=0/0, ticks=46256374/0, 
> in_queue=46256374, util=97.95%
>  nvme4n1: ios=90337742/0, merge=0/0, ticks=59122226/0, 
> in_queue=59122225, util=98.19%
>  nvme5n1: ios=90338813/0, merge=0/0, ticks=57811758/0, 
> in_queue=57811758, util=98.33%
>  nvme6n1: ios=90339194/0, merge=0/0, ticks=57369337/0, 
> in_queue=57369337, util=98.37%
>  nvme7n1: ios=90339048/0, merge=0/0, ticks=55791076/0, 
> in_queue=55791076, util=98.78%
>  nvme8n1: ios=90340234/0, merge=0/0, ticks=44977001/0, 
> in_queue=44977001, util=99.01%
>  nvme12n1: ios=81819608/0, merge=0/0, ticks=26788080/0, 
> in_queue=26788079, util=99.24%
>  nvme13n1: ios=81819831/0, merge=0/0, ticks=26736682/0, 
> in_queue=26736681, util=99.57%
>  nvme14n1: ios=81820006/0, merge=0/0, ticks=26772951/0, 
> in_queue=26772951, util=99.67%
>  nvme15n1: ios=81820215/0, merge=0/0, ticks=26741532/0, 
> in_queue=26741532, util=99.78%
>  nvme17n1: ios=81819922/0, merge=0/0, ticks=76459192/0, 
> in_queue=76459192, util=99.84%
>  nvme18n1: ios=81820146/0, merge=0/0, ticks=86756309/0, 
> in_queue=86756309, util=99.82%
>  nvme19n1: ios=81820481/0, merge=0/0, ticks=75008919/0, 
> in_queue=75008919, util=100.00%
>  nvme20n1: ios=81819690/0, merge=0/0, ticks=91888274/0, 
> in_queue=91888275, util=100.00%
>  nvme21n1: ios=81821809/0, merge=0/0, ticks=26653056/0, 
> in_queue=26653057, util=100.00% -----Original Message-----
> From: Matt Wallis <mattw@madmonks.org>
> Sent: Wednesday, July 28, 2021 8:54 PM
> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> Cc: linux-raid@vger.kernel.org
> Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
> 
> Hi Jim,
> 
> Totally get the Frankenstein’s monster aspect, I try not to build those where I can, but at the moment I don’t think there’s much that can be done about it.
> Not sure if LVM is better than MDRAID 0, it just gives you more control over the volumes that can be created, instead of having it all in one big chunk. If you just need one big chunk, then MDRAID 0 is probably fine.
> 
> I think if you can create a couple of scripts that allow the admin to fail a drive out of all the arrays that it's in at once, then it's not that much worse than managing an MDRAID normally is. 
> 
> Matt.
> 
>> On 28 Jul 2021, at 20:43, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>> 
>> Matt,
>> I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about our sustainability long term.     As a researcher, I can do these cool science experiments, but I still have to hand designs to sustainment folks.  I was also running into an issue when doing an mdraid RAID0 on top of the RAID6's so I could toss one xfs file system on top of each NUMA node's drives: the last RAID0 stripe over all of the RAID6's couldn't generate the queue depth needed.    We even recompiled the kernel to change the mdraid nr_requests max from 128 to 1023.  
>> 
>> I will have to try the LVM experiment.  I'm an LVM neophyte, so it might take me the rest of today/tomorrow to get new results; I also tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time as well.  One might be able to argue that configuration isn't too much of a "Frankenstein's monster" for me to hand it off.
>> 
>> Thanks,
>> Jim
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Matt Wallis <mattw@madmonks.org>
>> Sent: Wednesday, July 28, 2021 6:32 AM
>> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
>> Cc: linux-raid@vger.kernel.org
>> Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
>> 
>> Hi Jim,
>> 
>>> On 28 Jul 2021, at 06:32, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>>> 
>>> Sorry, this will be a long email with everything I find to be relevant, but I can get over 110GB/s of 4kB random reads from individual NVMe SSDs, but I'm at a loss why mdraid can only do a very small  fraction of it.   I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful.     This is everything I do to a server to make the I/O crank.....My role is that of a lab researcher/resident expert/consultant.   I'm just stumped why I can't do better.   If there is a fine manual that somebody can point me to, I'm happy to read it…
>> 
>> I am probably going to get corrected on some if not all of this, but from what I understand, and from my own little experiments on a similar Intel based system… 1. NVMe is stupid fast, you need a good chunk of CPU performance to max it out.
>> 2. Most block IO in the kernel is limited in terms of threading, it 
>> may even be essentially single threaded. (This is where I will get 
>> corrected) 3. AFAICT, this includes mdraid, there’s a single thread 
>> per RAID device handling all the RAID calculations. (mdX_raid6)
>> 
>> What I did to get IOPs up in a system with 24 NVMe, split up into 12 per NUMA domain.
>> 1. Create 8 partitions on each drive (this may be overkill, I just started here for some reason) 2. Create 8 RAID6 arrays with 1 partition per drive.
>> 3. Use LVM to create a single striped logical volume over all 8 RAID volumes. RAID 0+6 as it were.
>> 
>> You now have an LVM thread that is basically doing nothing more than chunking the data as it comes in, then, sending the chunks to 8 separate RAID devices, each with their own threads, buffers, queues etc, all of which can be spread over more cores.
>> 
>> I saw a significant (for me, significant is >20%) increase in IOPs doing this. 
>> 
>> You still have RAID6 protection, but you might want to write a couple of scripts to help you manage the arrays, because a single failed drive now needs to be removed from 8 RAID volumes. 
>> 
>> There’s not a lot of capacity lost doing this, pretty sure I lost less than 100MB to the partitions and the RAID overhead.
>> 
>> You would never consider this on spinning disk of course - way too slow, and you're just going to make it slower. NVMe, as you noticed, has the IOPs to spare, so I'm pretty sure it's just that we're not able to get the data to it fast enough.
>> 
>> Matt

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-07-30  8:45           ` Miao Wang
@ 2021-07-30  9:59             ` Finlayson, James M CIV (USA)
  2021-07-30 14:03               ` Doug Ledford
  2021-07-30 13:17             ` Peter Grandi
  1 sibling, 1 reply; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-07-30  9:59 UTC (permalink / raw)
  To: 'Miao Wang'
  Cc: 'Matt Wallis', 'linux-raid@vger.kernel.org'

There is interest in ZFS.   We're waiting for the direct I/O patches to settle in Open ZFS because we couldn't find any way to get around the ARC (everything has to touch the ARC).  ZFS spins an entire CPU core or more worrying about which ARC entries it has to evict.      I know who is doing the work.   Once it settles, I'll see if they are willing to publish to zfs-discuss.

-----Original Message-----
From: Miao Wang <shankerwangmiao@gmail.com> 
Sent: Friday, July 30, 2021 4:46 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: Matt Wallis <mattw@madmonks.org>; linux-raid@vger.kernel.org
Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

Hi Jim,

Nice to hear about your findings on how to let Linux md work better on fast NVMe drives, because previously I was also stuck on a similar problem and finally gave up. Since it is very difficult to find such an environment with so many fast NVMe drives, I wonder if you have any interest in ZFS. Maybe you can set up a similar raidz configuration on those drives and see whether its performance is better or worse.

Cheers,

Miao Wang

> 2021年07月30日 16:28,Matt Wallis <mattw@madmonks.org> 写道:
> 
> Hi Jim,
> 
> That’s significantly better than I expected, I need to see if I can get someone to send me the box I was using so I can spend some more time on it.
> 
> Good luck with the rest of it, the bit I was looking for as a next step was going to be potentially tweaking stripe widths and the like to see how much difference it made on different workloads.
> 
> Matt. 
> 
>> On 30 Jul 2021, at 08:05, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>> 
>> Matt,
>> Thank you for the tip.   I have put 32 partitions on each of my NVMe drives, made 32 raid5 stripes and then made an LVM from the 32 raid5 stripes.   I then created one physical volume per 10 NVMe drives on each socket.    I believe I have successfully harnessed "alien technology" (inside joke) to create a Frankenstein's monster.   These are substantially better than doing a RAID0 stripe over the partitioned md's in the past.  I whipped all of this together in two 15 minute sessions last night and just right now, so I might run some more extensive tests when I have the cycles.   I didn't intend to leave the thread hanging.
>> BLUF -  fio detailed output below.....
>> 9 drives per socket raw
>> socket0, 9 drives, raw 4K random reads 13.6M IOPS , socket1 9 drives, 
>> raw 4K random reads , 12.3M IOPS
>> %Cpu(s):  4.4 us, 25.6 sy,  0.0 ni, 56.7 id,  0.0 wa, 13.1 hi,  0.2 
>> si,  0.0 st
>> 
>> 9 data drives per socket RAID5/LVM raw (9+1) socket0, 9 drives, raw 
>> 4K random reads 8.57M IOPS , socket1 9 drives, raw 4K random reads , 
>> 8.57M IOPS
>> %Cpu(s):  7.0 us, 22.3 sy,  0.0 ni, 58.4 id,  0.0 wa, 12.1 hi,  0.2 
>> si,  0.0 st
>> 
>> 
>> All,
>> I intend to test the 5.14 kernel patch next week.   My SA would prefer that the patch had already made it into the kernel-ml stream so he could just install an rpm, but I'll get him to build the kernel if need be.
>> If the 5.14 kernel patch doesn't alleviate the issues, I have a strong desire to have mdraid made better.
>> 
>> 
>> Quick fio results:
>> socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
>> 4096B-4096B, ioengine=libaio, iodepth=128 ...
>> socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
>> 4096B-4096B, ioengine=libaio, iodepth=128 ...
>> socket0-lv: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, 
>> (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
>> socket1-lv: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, 
>> (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
>> fio-3.26
>> Starting 256 processes
>> 
>> socket0: (groupid=0, jobs=64): err= 0: pid=64032: Thu Jul 29 21:48:32 
>> 2021
>> read: IOPS=13.6M, BW=51.7GiB/s (55.6GB/s)(3105GiB/60003msec)
>>   slat (nsec): min=1292, max=1376.7k, avg=2545.32, stdev=2696.74
>>   clat (usec): min=36, max=71580, avg=600.68, stdev=361.36
>>    lat (usec): min=38, max=71616, avg=603.30, stdev=361.38
>>   clat percentiles (usec):
>>    |  1.00th=[  169],  5.00th=[  231], 10.00th=[  277], 20.00th=[  347],
>>    | 30.00th=[  404], 40.00th=[  457], 50.00th=[  519], 60.00th=[  594],
>>    | 70.00th=[  676], 80.00th=[  791], 90.00th=[  996], 95.00th=[ 1205],
>>    | 99.00th=[ 1909], 99.50th=[ 2409], 99.90th=[ 3556], 99.95th=[ 3884],
>>    | 99.99th=[ 5538]
>>  bw (  MiB/s): min=39960, max=56660, per=20.94%, avg=53040.69, stdev=49.61, samples=7488
>>  iops        : min=10229946, max=14504941, avg=13578391.80, stdev=12699.61, samples=7488
>> lat (usec)   : 50=0.01%, 100=0.01%, 250=6.83%, 500=40.06%, 750=29.92%
>> lat (usec)   : 1000=13.42%
>> lat (msec)   : 2=8.91%, 4=0.82%, 10=0.04%, 20=0.01%, 50=0.01%
>> lat (msec)   : 100=0.01%
>> cpu          : usr=14.82%, sys=46.57%, ctx=35564249, majf=0, minf=9754
>> IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>    issued rwts: total=813909201,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>>    latency   : target=0, window=0, percentile=100.00%, depth=128
>> socket1: (groupid=1, jobs=64): err= 0: pid=64096: Thu Jul 29 21:48:32 
>> 2021
>> read: IOPS=12.3M, BW=46.9GiB/s (50.3GB/s)(2812GiB/60003msec)
>>   slat (nsec): min=1292, max=1672.2k, avg=2672.21, stdev=2742.06
>>   clat (usec): min=25, max=73526, avg=663.35, stdev=611.06
>>    lat (usec): min=28, max=73545, avg=666.09, stdev=611.08
>>   clat percentiles (usec):
>>    |  1.00th=[  143],  5.00th=[  190], 10.00th=[  227], 20.00th=[  285],
>>    | 30.00th=[  338], 40.00th=[  400], 50.00th=[  478], 60.00th=[  586],
>>    | 70.00th=[  725], 80.00th=[  930], 90.00th=[ 1254], 95.00th=[ 1614],
>>    | 99.00th=[ 3490], 99.50th=[ 4146], 99.90th=[ 6390], 99.95th=[ 6980],
>>    | 99.99th=[ 8356]
>>  bw (  MiB/s): min=28962, max=55326, per=12.70%, avg=48036.10, stdev=96.75, samples=7488
>>  iops        : min=7414327, max=14163615, avg=12297214.98, stdev=24768.82, samples=7488
>> lat (usec)   : 50=0.01%, 100=0.03%, 250=13.75%, 500=38.84%, 750=18.71%
>> lat (usec)   : 1000=11.55%
>> lat (msec)   : 2=14.30%, 4=2.23%, 10=0.60%, 20=0.01%, 50=0.01%
>> lat (msec)   : 100=0.01%
>> cpu          : usr=13.41%, sys=44.44%, ctx=39379711, majf=0, minf=9982
>> IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>    issued rwts: total=737168913,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>>    latency   : target=0, window=0, percentile=100.00%, depth=128
>> socket0-lv: (groupid=2, jobs=64): err= 0: pid=64166: Thu Jul 29 
>> 21:48:32 2021
>> read: IOPS=8570k, BW=32.7GiB/s (35.1GB/s)(1962GiB/60006msec)
>>   slat (nsec): min=1873, max=11085k, avg=4694.47, stdev=4825.39
>>   clat (usec): min=24, max=21739, avg=950.52, stdev=948.83
>>    lat (usec): min=51, max=21743, avg=955.29, stdev=948.88
>>   clat percentiles (usec):
>>    |  1.00th=[  155],  5.00th=[  217], 10.00th=[  265], 20.00th=[  338],
>>    | 30.00th=[  404], 40.00th=[  486], 50.00th=[  594], 60.00th=[  766],
>>    | 70.00th=[ 1029], 80.00th=[ 1418], 90.00th=[ 2089], 95.00th=[ 2737],
>>    | 99.00th=[ 4490], 99.50th=[ 5669], 99.90th=[ 8586], 99.95th=[ 9896],
>>    | 99.99th=[12125]
>>  bw (  MiB/s): min=24657, max=37453, per=-25.17%, avg=33516.00, stdev=35.32, samples=7616
>>  iops        : min=6312326, max=9588007, avg=8580076.03, stdev=9041.88, samples=7616
>> lat (usec)   : 50=0.01%, 100=0.01%, 250=8.40%, 500=33.21%, 750=17.54%
>> lat (usec)   : 1000=9.89%
>> lat (msec)   : 2=19.93%, 4=9.58%, 10=1.40%, 20=0.05%, 50=0.01%
>> cpu          : usr=9.01%, sys=51.48%, ctx=27829950, majf=0, minf=9028
>> IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>    issued rwts: total=514275323,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>>    latency   : target=0, window=0, percentile=100.00%, depth=128
>> socket1-lv: (groupid=3, jobs=64): err= 0: pid=64230: Thu Jul 29 
>> 21:48:32 2021
>> read: IOPS=8571k, BW=32.7GiB/s (35.1GB/s)(1962GiB/60006msec)
>>   slat (nsec): min=1823, max=14362k, avg=4809.30, stdev=4940.42
>>   clat (usec): min=50, max=22856, avg=950.31, stdev=948.13
>>    lat (usec): min=54, max=22860, avg=955.19, stdev=948.19
>>   clat percentiles (usec):
>>    |  1.00th=[  157],  5.00th=[  221], 10.00th=[  269], 20.00th=[  343],
>>    | 30.00th=[  412], 40.00th=[  490], 50.00th=[  603], 60.00th=[  766],
>>    | 70.00th=[ 1029], 80.00th=[ 1418], 90.00th=[ 2089], 95.00th=[ 2737],
>>    | 99.00th=[ 4293], 99.50th=[ 5604], 99.90th=[ 9503], 99.95th=[10683],
>>    | 99.99th=[12649]
>>  bw (  MiB/s): min=23434, max=36909, per=-25.17%, avg=33517.14, stdev=50.36, samples=7616
>>  iops        : min=5999220, max=9448818, avg=8580368.69, stdev=12892.93, samples=7616
>> lat (usec)   : 100=0.01%, 250=7.88%, 500=33.09%, 750=18.09%, 1000=10.02%
>> lat (msec)   : 2=19.91%, 4=9.75%, 10=1.16%, 20=0.08%, 50=0.01%
>> cpu          : usr=9.14%, sys=51.94%, ctx=25524808, majf=0, minf=9037
>> IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>    issued rwts: total=514294010,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>>    latency   : target=0, window=0, percentile=100.00%, depth=128
>> 
>> Run status group 0 (all jobs):
>>  READ: bw=51.7GiB/s (55.6GB/s), 51.7GiB/s-51.7GiB/s 
>> (55.6GB/s-55.6GB/s), io=3105GiB (3334GB), run=60003-60003msec
>> 
>> Run status group 1 (all jobs):
>>  READ: bw=46.9GiB/s (50.3GB/s), 46.9GiB/s-46.9GiB/s 
>> (50.3GB/s-50.3GB/s), io=2812GiB (3019GB), run=60003-60003msec
>> 
>> Run status group 2 (all jobs):
>>  READ: bw=32.7GiB/s (35.1GB/s), 32.7GiB/s-32.7GiB/s 
>> (35.1GB/s-35.1GB/s), io=1962GiB (2106GB), run=60006-60006msec
>> 
>> Run status group 3 (all jobs):
>>  READ: bw=32.7GiB/s (35.1GB/s), 32.7GiB/s-32.7GiB/s 
>> (35.1GB/s-35.1GB/s), io=1962GiB (2107GB), run=60006-60006msec
>> 
>> Disk stats (read/write):
>> nvme0n1: ios=90336694/0, merge=0/0, ticks=45102163/0, 
>> in_queue=45102163, util=97.44%
>> nvme1n1: ios=90337153/0, merge=0/0, ticks=47422886/0, 
>> in_queue=47422887, util=97.81%
>> nvme2n1: ios=90337516/0, merge=0/0, ticks=46419782/0, 
>> in_queue=46419782, util=97.95%
>> nvme3n1: ios=90337843/0, merge=0/0, ticks=46256374/0, 
>> in_queue=46256374, util=97.95%
>> nvme4n1: ios=90337742/0, merge=0/0, ticks=59122226/0, 
>> in_queue=59122225, util=98.19%
>> nvme5n1: ios=90338813/0, merge=0/0, ticks=57811758/0, 
>> in_queue=57811758, util=98.33%
>> nvme6n1: ios=90339194/0, merge=0/0, ticks=57369337/0, 
>> in_queue=57369337, util=98.37%
>> nvme7n1: ios=90339048/0, merge=0/0, ticks=55791076/0, 
>> in_queue=55791076, util=98.78%
>> nvme8n1: ios=90340234/0, merge=0/0, ticks=44977001/0, 
>> in_queue=44977001, util=99.01%
>> nvme12n1: ios=81819608/0, merge=0/0, ticks=26788080/0, 
>> in_queue=26788079, util=99.24%
>> nvme13n1: ios=81819831/0, merge=0/0, ticks=26736682/0, 
>> in_queue=26736681, util=99.57%
>> nvme14n1: ios=81820006/0, merge=0/0, ticks=26772951/0, 
>> in_queue=26772951, util=99.67%
>> nvme15n1: ios=81820215/0, merge=0/0, ticks=26741532/0, 
>> in_queue=26741532, util=99.78%
>> nvme17n1: ios=81819922/0, merge=0/0, ticks=76459192/0, 
>> in_queue=76459192, util=99.84%
>> nvme18n1: ios=81820146/0, merge=0/0, ticks=86756309/0, 
>> in_queue=86756309, util=99.82%
>> nvme19n1: ios=81820481/0, merge=0/0, ticks=75008919/0, 
>> in_queue=75008919, util=100.00%
>> nvme20n1: ios=81819690/0, merge=0/0, ticks=91888274/0, 
>> in_queue=91888275, util=100.00%
>> nvme21n1: ios=81821809/0, merge=0/0, ticks=26653056/0, 
>> in_queue=26653057, util=100.00%
>>
>> -----Original Message-----
>> From: Matt Wallis <mattw@madmonks.org>
>> Sent: Wednesday, July 28, 2021 8:54 PM
>> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
>> Cc: linux-raid@vger.kernel.org
>> Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
>> 
>> Hi Jim,
>> 
>> Totally get the Frankenstein’s monster aspect, I try not to build those where I can, but at the moment I don’t think there’s much that can be done about it.
>> Not sure if LVM is better than MDRAID 0, it just gives you more control over the volumes that can be created, instead of having it all in one big chunk. If you just need one big chunk, then MDRAID 0 is probably fine.
>> 
>> I think if you can create a couple of scripts that allow the admin to fail a drive out of all the arrays that it's in at once, then it's not that much worse than managing an MDRAID normally is. 
>> 
>> Matt.
>> 
>>> On 28 Jul 2021, at 20:43, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>>> 
>>> Matt,
>>> I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about our sustainability long term.     As a researcher, I can do these cool science experiments, but I still have to hand designs to sustainment folks.  I was also running into an issue when doing an mdraid RAID0 on top of the RAID6's so I could toss one xfs file system on top of each NUMA node's drives: the last RAID0 stripe over all of the RAID6's couldn't generate the queue depth needed.    We even recompiled the kernel to change the mdraid nr_requests max from 128 to 1023.  
>>> 
>>> I will have to try the LVM experiment.  I'm an LVM neophyte, so it might take me the rest of today/tomorrow to get new results; I also tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time as well.  One might be able to argue that configuration isn't too much of a "Frankenstein's monster" for me to hand it off.
>>> 
>>> Thanks,
>>> Jim
>>> 
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Matt Wallis <mattw@madmonks.org>
>>> Sent: Wednesday, July 28, 2021 6:32 AM
>>> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
>>> Cc: linux-raid@vger.kernel.org
>>> Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
>>> 
>>> Hi Jim,
>>> 
>>>> On 28 Jul 2021, at 06:32, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
>>>> 
>>>> Sorry, this will be a long email with everything I find to be relevant, but I can get over 110GB/s of 4kB random reads from individual NVMe SSDs, but I'm at a loss why mdraid can only do a very small  fraction of it.   I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful.     This is everything I do to a server to make the I/O crank.....My role is that of a lab researcher/resident expert/consultant.   I'm just stumped why I can't do better.   If there is a fine manual that somebody can point me to, I'm happy to read it…
>>> 
>>> I am probably going to get corrected on some if not all of this, but from what I understand, and from my own little experiments on a similar Intel based system… 1. NVMe is stupid fast, you need a good chunk of CPU performance to max it out.
>>> 2. Most block IO in the kernel is limited in terms of threading, it 
>>> may even be essentially single threaded. (This is where I will get 
>>> corrected) 3. AFAICT, this includes mdraid, there’s a single thread 
>>> per RAID device handling all the RAID calculations. (mdX_raid6)
>>> 
>>> What I did to get IOPs up in a system with 24 NVMe, split up into 12 per NUMA domain.
>>> 1. Create 8 partitions on each drive (this may be overkill, I just started here for some reason) 2. Create 8 RAID6 arrays with 1 partition per drive.
>>> 3. Use LVM to create a single striped logical volume over all 8 RAID volumes. RAID 0+6 as it were.
>>> 
>>> You now have an LVM thread that is basically doing nothing more than chunking the data as it comes in, then, sending the chunks to 8 separate RAID devices, each with their own threads, buffers, queues etc, all of which can be spread over more cores.
>>> 
>>> I saw a significant (for me, significant is >20%) increase in IOPs doing this. 
>>> 
>>> You still have RAID6 protection, but you might want to write a couple of scripts to help you manage the arrays, because a single failed drive now needs to be removed from 8 RAID volumes. 
>>> 
>>> There’s not a lot of capacity lost doing this, pretty sure I lost less than 100MB to the partitions and the RAID overhead.
>>> 
>>> You would never consider this on spinning disk of course - way too slow, and you're just going to make it slower. NVMe, as you noticed, has the IOPs to spare, so I'm pretty sure it's just that we're not able to get the data to it fast enough.
>>> 
>>> Matt


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-07-30  8:45           ` Miao Wang
  2021-07-30  9:59             ` Finlayson, James M CIV (USA)
@ 2021-07-30 13:17             ` Peter Grandi
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Grandi @ 2021-07-30 13:17 UTC (permalink / raw)
  To: list Linux RAID

>>> On Fri, 30 Jul 2021 16:45:32 +0800, Miao Wang
>>> <shankerwangmiao@gmail.com> said:

> [...] was also stuck in a similar problem and finally gave
> up. Since it is very difficult to find such environment with
> so many fast nvme drives, I wonder if you have any interest in
> ZFS. [...]

Or Btrfs, or the new 'bcachefs', which is OK for simple
configurations (RAID10-like).

But part of the issue here with MD RAID is that it is in theory
mostly a translation layer like 'loop', but also sort of a virtual
block device in its own right, and weird things happen as IO
requests get reshaped and requeued.

My impression, as I mentioned in a previous message, is that the
critical detail is probably this:


>> Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
>> nvme0n1       1317510.00    0.00 5270044.00      0.00     0.00     0.00   0.00   0.00    0.31    0.00 411.95     4.00     0.00   0.00 100.40
>> [...]
>> Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
>> nvme0n1       114589.00    0.00 458356.00      0.00     0.00     0.00   0.00   0.00    0.29    0.00  33.54     4.00     0.00   0.01 100.00

> The obvious difference is the factor of 10 in "aqu-sz" and that
> correspond to the factor of 10 in "r/s" and "rkB/s".

That may happen because the test is run directly on the 'md[01]'
block device, which can do odd things. Counterintuitively, a much
bigger 'aqu-sz', and thus much better speed, could be achieved by
running the test through a suitable filesystem on top of the
'md[01]' device.

With ZFS there is a good chance that could happen too, since
striping is integrated within ZFS, especially on highly parallel
workloads.

There is however a huge warning: the test is run on IOPS with
4KiB blocks, and ZFS in COW mode does not work well with that
(especially for writes, but also for reads, if compression and
checksumming are enabled, for RAIDz) so I think that it should be
run with COW disabled, or perhaps on a 'zvol'.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-07-30  9:59             ` Finlayson, James M CIV (USA)
@ 2021-07-30 14:03               ` Doug Ledford
  0 siblings, 0 replies; 28+ messages in thread
From: Doug Ledford @ 2021-07-30 14:03 UTC (permalink / raw)
  To: Finlayson, James M CIV (USA); +Cc: Miao Wang, Matt Wallis, linux-raid

You can try btrfs in lieu of zfs.  As long as metadata is raid1, data
can be raid5/6 and things are ok.  The raid5 write issue only applies
to metadata.
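
For what it's worth, that profile split is just the -d/-m options at
mkfs time; a short sketch with placeholder device names:

# data striped with parity, metadata mirrored
mkfs.btrfs -f -d raid5 -m raid1 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mkdir -p /mnt/r5
mount /dev/nvme0n1 /mnt/r5
btrfs filesystem usage /mnt/r5   # should report Data,RAID5 and Metadata,RAID1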

On Fri, Jul 30, 2021 at 6:09 AM Finlayson, James M CIV (USA)
<james.m.finlayson4.civ@mail.mil> wrote:
>
> There is interest in ZFS.   We're waiting for the direct I/O patches to settle in Open ZFS because we couldn't find any way to get around the ARC (everything has to touch the ARC).  ZFS spins an entire CPU core or more worrying about which ARC entries it has to evict.      I know who is doing the work.   Once it settles, I'll see if they are willing to publish to zfs-discuss.
>
> -----Original Message-----
> From: Miao Wang <shankerwangmiao@gmail.com>
> Sent: Friday, July 30, 2021 4:46 AM
> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> Cc: Matt Wallis <mattw@madmonks.org>; linux-raid@vger.kernel.org
> Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
>
> Hi Jim,
>
> Nice to hear about your findings on how to let Linux md work better on fast NVMe drives, because previously I was also stuck on a similar problem and finally gave up. Since it is very difficult to find such an environment with so many fast NVMe drives, I wonder if you have any interest in ZFS. Maybe you can set up a similar raidz configuration on those drives and see whether its performance is better or worse.
>
> Cheers,
>
> Miao Wang
>
> > 2021年07月30日 16:28,Matt Wallis <mattw@madmonks.org> 写道:
> >
> > Hi Jim,
> >
> > That’s significantly better than I expected, I need to see if I can get someone to send me the box I was using so I can spend some more time on it.
> >
> > Good luck with the rest of it, the bit I was looking for as a next step was going to be potentially tweaking stripe widths and the like to see how much difference it made on different workloads.
> >
> > Matt.
> >
> >> On 30 Jul 2021, at 08:05, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> >>
> >> Matt,
> >> Thank you for the tip.   I have put 32 partitions on each of my NVMe drives, made 32 raid5 stripes and then made an LVM from the 32 raid5 stripes.   I then created one physical volume per 10 NVMe drives on each socket.    I believe I have successfully harnessed "alien technology" (inside joke) to create a Frankenstein's monster.   These are substantially better than doing a RAID0 stripe over the partitioned md's in the past.  I whipped all of this together in two 15 minute sessions last night and just right now, so I might run some more extensive tests when I have the cycles.   I didn't intend to leave the thread hanging.
> >> BLUF -  fio detailed output below.....
> >> 9 drives per socket raw
> >> socket0, 9 drives, raw 4K random reads 13.6M IOPS , socket1 9 drives,
> >> raw 4K random reads , 12.3M IOPS
> >> %Cpu(s):  4.4 us, 25.6 sy,  0.0 ni, 56.7 id,  0.0 wa, 13.1 hi,  0.2
> >> si,  0.0 st
> >>
> >> 9 data drives per socket RAID5/LVM raw (9+1) socket0, 9 drives, raw
> >> 4K random reads 8.57M IOPS , socket1 9 drives, raw 4K random reads ,
> >> 8.57M IOPS
> >> %Cpu(s):  7.0 us, 22.3 sy,  0.0 ni, 58.4 id,  0.0 wa, 12.1 hi,  0.2
> >> si,  0.0 st
> >>
> >>
> >> All,
> >> I intend to test the 5.14 kernel patch next week.   My SA would prefer that the patch had already made it into the kernel-ml stream so he could just install an rpm, but I'll get him to build the kernel if need be.
> >> If the 5.14 kernel patch doesn't alleviate the issues, I have a strong desire to have mdraid made better.
> >>
> >>
> >> Quick fio results:
> >> socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> >> 4096B-4096B, ioengine=libaio, iodepth=128 ...
> >> socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> >> 4096B-4096B, ioengine=libaio, iodepth=128 ...
> >> socket0-lv: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B,
> >> (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
> >> socket1-lv: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B,
> >> (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
> >> fio-3.26
> >> Starting 256 processes
> >>
> >> socket0: (groupid=0, jobs=64): err= 0: pid=64032: Thu Jul 29 21:48:32
> >> 2021
> >> read: IOPS=13.6M, BW=51.7GiB/s (55.6GB/s)(3105GiB/60003msec)
> >>   slat (nsec): min=1292, max=1376.7k, avg=2545.32, stdev=2696.74
> >>   clat (usec): min=36, max=71580, avg=600.68, stdev=361.36
> >>    lat (usec): min=38, max=71616, avg=603.30, stdev=361.38
> >>   clat percentiles (usec):
> >>    |  1.00th=[  169],  5.00th=[  231], 10.00th=[  277], 20.00th=[  347],
> >>    | 30.00th=[  404], 40.00th=[  457], 50.00th=[  519], 60.00th=[  594],
> >>    | 70.00th=[  676], 80.00th=[  791], 90.00th=[  996], 95.00th=[ 1205],
> >>    | 99.00th=[ 1909], 99.50th=[ 2409], 99.90th=[ 3556], 99.95th=[ 3884],
> >>    | 99.99th=[ 5538]
> >>  bw (  MiB/s): min=39960, max=56660, per=20.94%, avg=53040.69, stdev=49.61, samples=7488
> >>  iops        : min=10229946, max=14504941, avg=13578391.80, stdev=12699.61, samples=7488
> >> lat (usec)   : 50=0.01%, 100=0.01%, 250=6.83%, 500=40.06%, 750=29.92%
> >> lat (usec)   : 1000=13.42%
> >> lat (msec)   : 2=8.91%, 4=0.82%, 10=0.04%, 20=0.01%, 50=0.01%
> >> lat (msec)   : 100=0.01%
> >> cpu          : usr=14.82%, sys=46.57%, ctx=35564249, majf=0, minf=9754
> >> IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> >>    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> >>    issued rwts: total=813909201,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> >>    latency   : target=0, window=0, percentile=100.00%, depth=128
> >> socket1: (groupid=1, jobs=64): err= 0: pid=64096: Thu Jul 29 21:48:32
> >> 2021
> >> read: IOPS=12.3M, BW=46.9GiB/s (50.3GB/s)(2812GiB/60003msec)
> >>   slat (nsec): min=1292, max=1672.2k, avg=2672.21, stdev=2742.06
> >>   clat (usec): min=25, max=73526, avg=663.35, stdev=611.06
> >>    lat (usec): min=28, max=73545, avg=666.09, stdev=611.08
> >>   clat percentiles (usec):
> >>    |  1.00th=[  143],  5.00th=[  190], 10.00th=[  227], 20.00th=[  285],
> >>    | 30.00th=[  338], 40.00th=[  400], 50.00th=[  478], 60.00th=[  586],
> >>    | 70.00th=[  725], 80.00th=[  930], 90.00th=[ 1254], 95.00th=[ 1614],
> >>    | 99.00th=[ 3490], 99.50th=[ 4146], 99.90th=[ 6390], 99.95th=[ 6980],
> >>    | 99.99th=[ 8356]
> >>  bw (  MiB/s): min=28962, max=55326, per=12.70%, avg=48036.10, stdev=96.75, samples=7488
> >>  iops        : min=7414327, max=14163615, avg=12297214.98, stdev=24768.82, samples=7488
> >> lat (usec)   : 50=0.01%, 100=0.03%, 250=13.75%, 500=38.84%, 750=18.71%
> >> lat (usec)   : 1000=11.55%
> >> lat (msec)   : 2=14.30%, 4=2.23%, 10=0.60%, 20=0.01%, 50=0.01%
> >> lat (msec)   : 100=0.01%
> >> cpu          : usr=13.41%, sys=44.44%, ctx=39379711, majf=0, minf=9982
> >> IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> >>    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> >>    issued rwts: total=737168913,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> >>    latency   : target=0, window=0, percentile=100.00%, depth=128
> >> socket0-lv: (groupid=2, jobs=64): err= 0: pid=64166: Thu Jul 29
> >> 21:48:32 2021
> >> read: IOPS=8570k, BW=32.7GiB/s (35.1GB/s)(1962GiB/60006msec)
> >>   slat (nsec): min=1873, max=11085k, avg=4694.47, stdev=4825.39
> >>   clat (usec): min=24, max=21739, avg=950.52, stdev=948.83
> >>    lat (usec): min=51, max=21743, avg=955.29, stdev=948.88
> >>   clat percentiles (usec):
> >>    |  1.00th=[  155],  5.00th=[  217], 10.00th=[  265], 20.00th=[  338],
> >>    | 30.00th=[  404], 40.00th=[  486], 50.00th=[  594], 60.00th=[  766],
> >>    | 70.00th=[ 1029], 80.00th=[ 1418], 90.00th=[ 2089], 95.00th=[ 2737],
> >>    | 99.00th=[ 4490], 99.50th=[ 5669], 99.90th=[ 8586], 99.95th=[ 9896],
> >>    | 99.99th=[12125]
> >>  bw (  MiB/s): min=24657, max=37453, per=-25.17%, avg=33516.00, stdev=35.32, samples=7616
> >>  iops        : min=6312326, max=9588007, avg=8580076.03, stdev=9041.88, samples=7616
> >> lat (usec)   : 50=0.01%, 100=0.01%, 250=8.40%, 500=33.21%, 750=17.54%
> >> lat (usec)   : 1000=9.89%
> >> lat (msec)   : 2=19.93%, 4=9.58%, 10=1.40%, 20=0.05%, 50=0.01%
> >> cpu          : usr=9.01%, sys=51.48%, ctx=27829950, majf=0, minf=9028
> >> IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> >>    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> >>    issued rwts: total=514275323,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> >>    latency   : target=0, window=0, percentile=100.00%, depth=128
> >> socket1-lv: (groupid=3, jobs=64): err= 0: pid=64230: Thu Jul 29
> >> 21:48:32 2021
> >> read: IOPS=8571k, BW=32.7GiB/s (35.1GB/s)(1962GiB/60006msec)
> >>   slat (nsec): min=1823, max=14362k, avg=4809.30, stdev=4940.42
> >>   clat (usec): min=50, max=22856, avg=950.31, stdev=948.13
> >>    lat (usec): min=54, max=22860, avg=955.19, stdev=948.19
> >>   clat percentiles (usec):
> >>    |  1.00th=[  157],  5.00th=[  221], 10.00th=[  269], 20.00th=[  343],
> >>    | 30.00th=[  412], 40.00th=[  490], 50.00th=[  603], 60.00th=[  766],
> >>    | 70.00th=[ 1029], 80.00th=[ 1418], 90.00th=[ 2089], 95.00th=[ 2737],
> >>    | 99.00th=[ 4293], 99.50th=[ 5604], 99.90th=[ 9503], 99.95th=[10683],
> >>    | 99.99th=[12649]
> >>  bw (  MiB/s): min=23434, max=36909, per=-25.17%, avg=33517.14, stdev=50.36, samples=7616
> >>  iops        : min=5999220, max=9448818, avg=8580368.69, stdev=12892.93, samples=7616
> >> lat (usec)   : 100=0.01%, 250=7.88%, 500=33.09%, 750=18.09%, 1000=10.02%
> >> lat (msec)   : 2=19.91%, 4=9.75%, 10=1.16%, 20=0.08%, 50=0.01%
> >> cpu          : usr=9.14%, sys=51.94%, ctx=25524808, majf=0, minf=9037
> >> IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> >>    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> >>    issued rwts: total=514294010,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> >>    latency   : target=0, window=0, percentile=100.00%, depth=128
> >>
> >> Run status group 0 (all jobs):
> >>  READ: bw=51.7GiB/s (55.6GB/s), 51.7GiB/s-51.7GiB/s
> >> (55.6GB/s-55.6GB/s), io=3105GiB (3334GB), run=60003-60003msec
> >>
> >> Run status group 1 (all jobs):
> >>  READ: bw=46.9GiB/s (50.3GB/s), 46.9GiB/s-46.9GiB/s
> >> (50.3GB/s-50.3GB/s), io=2812GiB (3019GB), run=60003-60003msec
> >>
> >> Run status group 2 (all jobs):
> >>  READ: bw=32.7GiB/s (35.1GB/s), 32.7GiB/s-32.7GiB/s
> >> (35.1GB/s-35.1GB/s), io=1962GiB (2106GB), run=60006-60006msec
> >>
> >> Run status group 3 (all jobs):
> >>  READ: bw=32.7GiB/s (35.1GB/s), 32.7GiB/s-32.7GiB/s
> >> (35.1GB/s-35.1GB/s), io=1962GiB (2107GB), run=60006-60006msec
> >>
> >> Disk stats (read/write):
> >> nvme0n1: ios=90336694/0, merge=0/0, ticks=45102163/0,
> >> in_queue=45102163, util=97.44%
> >> nvme1n1: ios=90337153/0, merge=0/0, ticks=47422886/0,
> >> in_queue=47422887, util=97.81%
> >> nvme2n1: ios=90337516/0, merge=0/0, ticks=46419782/0,
> >> in_queue=46419782, util=97.95%
> >> nvme3n1: ios=90337843/0, merge=0/0, ticks=46256374/0,
> >> in_queue=46256374, util=97.95%
> >> nvme4n1: ios=90337742/0, merge=0/0, ticks=59122226/0,
> >> in_queue=59122225, util=98.19%
> >> nvme5n1: ios=90338813/0, merge=0/0, ticks=57811758/0,
> >> in_queue=57811758, util=98.33%
> >> nvme6n1: ios=90339194/0, merge=0/0, ticks=57369337/0,
> >> in_queue=57369337, util=98.37%
> >> nvme7n1: ios=90339048/0, merge=0/0, ticks=55791076/0,
> >> in_queue=55791076, util=98.78%
> >> nvme8n1: ios=90340234/0, merge=0/0, ticks=44977001/0,
> >> in_queue=44977001, util=99.01%
> >> nvme12n1: ios=81819608/0, merge=0/0, ticks=26788080/0,
> >> in_queue=26788079, util=99.24%
> >> nvme13n1: ios=81819831/0, merge=0/0, ticks=26736682/0,
> >> in_queue=26736681, util=99.57%
> >> nvme14n1: ios=81820006/0, merge=0/0, ticks=26772951/0,
> >> in_queue=26772951, util=99.67%
> >> nvme15n1: ios=81820215/0, merge=0/0, ticks=26741532/0,
> >> in_queue=26741532, util=99.78%
> >> nvme17n1: ios=81819922/0, merge=0/0, ticks=76459192/0,
> >> in_queue=76459192, util=99.84%
> >> nvme18n1: ios=81820146/0, merge=0/0, ticks=86756309/0,
> >> in_queue=86756309, util=99.82%
> >> nvme19n1: ios=81820481/0, merge=0/0, ticks=75008919/0,
> >> in_queue=75008919, util=100.00%
> >> nvme20n1: ios=81819690/0, merge=0/0, ticks=91888274/0,
> >> in_queue=91888275, util=100.00%
> >> nvme21n1: ios=81821809/0, merge=0/0, ticks=26653056/0,
> >> in_queue=26653057, util=100.00%
> >>
> >> -----Original Message-----
> >> From: Matt Wallis <mattw@madmonks.org>
> >> Sent: Wednesday, July 28, 2021 8:54 PM
> >> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> >> Cc: linux-raid@vger.kernel.org
> >> Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
> >>
> >> Hi Jim,
> >>
> >> Totally get the Frankenstein’s monster aspect, I try not to build those where I can, but at the moment I don’t think there’s much that can be done about it.
> >> Not sure if LVM is better than MDRAID 0, it just gives you more control over the volumes that can be created, instead of having it all in one big chunk. If you just need one big chunk, then MDRAID 0 is probably fine.
> >>
> >> I think if you can create a couple of scripts that allow the admin to fail a drive out of all the arrays that it's in at once, then it's not that much worse than managing an MDRAID normally is.
> >>
> >> Matt.
> >>
> >>> On 28 Jul 2021, at 20:43, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> >>>
> >>> Matt,
> >>> I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about our sustainability long term.     As a researcher, I can do these cool science experiments, but I still have to hand designs to sustainment folks.  I was also running into an issue when doing an mdraid RAID0 on top of the RAID6's so I could toss one xfs file system on top of each NUMA node's drives: the last RAID0 stripe over all of the RAID6's couldn't generate the queue depth needed.    We even recompiled the kernel to change the mdraid nr_requests max from 128 to 1023.
> >>>
> >>> I will have to try the LVM experiment.  I'm an LVM neophyte, so it might take me the rest of today/tomorrow to get new results; I also tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time as well.  One might be able to argue that configuration isn't too much of a "Frankenstein's monster" for me to hand it off.
> >>>
> >>> Thanks,
> >>> Jim
> >>>
> >>>
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Matt Wallis <mattw@madmonks.org>
> >>> Sent: Wednesday, July 28, 2021 6:32 AM
> >>> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> >>> Cc: linux-raid@vger.kernel.org
> >>> Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
> >>>
> >>> Hi Jim,
> >>>
> >>>> On 28 Jul 2021, at 06:32, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> >>>>
> >>>> Sorry, this will be a long email with everything I find to be relevant, but I can get over 110GB/s of 4kB random reads from individual NVMe SSDs, but I'm at a loss why mdraid can only do a very small  fraction of it.   I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful.     This is everything I do to a server to make the I/O crank.....My role is that of a lab researcher/resident expert/consultant.   I'm just stumped why I can't do better.   If there is a fine manual that somebody can point me to, I'm happy to read it…
> >>>
> >>> I am probably going to get corrected on some if not all of this, but from what I understand, and from my own little experiments on a similar Intel based system… 1. NVMe is stupid fast, you need a good chunk of CPU performance to max it out.
> >>> 2. Most block IO in the kernel is limited in terms of threading, it
> >>> may even be essentially single threaded. (This is where I will get
> >>> corrected) 3. AFAICT, this includes mdraid, there’s a single thread
> >>> per RAID device handling all the RAID calculations. (mdX_raid6)
> >>>
> >>> What I did to get IOPs up in a system with 24 NVMe, split up into 12 per NUMA domain.
> >>> 1. Create 8 partitions on each drive (this may be overkill, I just started here for some reason) 2. Create 8 RAID6 arrays with 1 partition per drive.
> >>> 3. Use LVM to create a single striped logical volume over all 8 RAID volumes. RAID 0+6 as it were.
> >>>
> >>> You now have an LVM thread that is basically doing nothing more than chunking the data as it comes in, then, sending the chunks to 8 separate RAID devices, each with their own threads, buffers, queues etc, all of which can be spread over more cores.
> >>>
> >>> I saw a significant (for me, significant is >20%) increase in IOPs doing this.
> >>>
> >>> You still have RAID6 protection, but you might want to write a couple of scripts to help you manage the arrays, because a single failed drive now needs to be removed from 8 RAID volumes.
> >>>
> >>> There’s not a lot of capacity lost doing this, pretty sure I lost less than 100MB to the partitions and the RAID overhead.
> >>>
> >>> You would never consider this on spinning disk of course - way too slow, and you're just going to make it slower. NVMe, as you noticed, has the IOPs to spare, so I'm pretty sure it's just that we're not able to get the data to it fast enough.
> >>>
> >>> Matt
>


-- 
Doug Ledford <dledford@redhat.com>
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Can't get RAID5/RAID6  NVMe randomread  IOPS - AMD ROME what am I missing?????
  2021-07-27 20:32 Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing????? Finlayson, James M CIV (USA)
                   ` (2 preceding siblings ...)
  2021-07-28 10:31 ` Matt Wallis
@ 2021-08-01 11:21 ` Gal Ofri
  2021-08-03 14:59   ` [Non-DoD Source] " Finlayson, James M CIV (USA)
       [not found] ` <AS8PR04MB799205817C4647DAC740DE9A91EA9@AS8PR04MB7992.eurprd04.prod.outlook.com>
  4 siblings, 1 reply; 28+ messages in thread
From: Gal Ofri @ 2021-08-01 11:21 UTC (permalink / raw)
  To: Finlayson, James M CIV (USA); +Cc: 'linux-raid@vger.kernel.org'

Hey Jim,

Read iops (rand/seq) were addressed in a recent commit:
97ae27252f49 md/raid5: avoid device_lock in read_one_chunk()
https://github.com/torvalds/linux/commit/97ae27252f4962d0fcc38ee1d9f913d817a2024e
It was merged into 5.14, so you can either cherry-pick it or just use a
latest-master kernel.
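
For anyone following along, the cherry-pick route would look roughly
like this (the base tag and build steps are illustrative, not a tested
recipe for the RHEL config):

git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
git checkout -b raid5-read-iops v5.13
git cherry-pick 97ae27252f4962d0fcc38ee1d9f913d817a2024e
cp "/boot/config-$(uname -r)" .config && make olddefconfig
make -j"$(nproc)" && make modules_install && make install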

Sounds like your environment is stronger than the one I used for the
testing, so please do share your benchmark if you manage to surpass the
results described in the commit message.

Cheers,
Gal Ofri,
Volumez (formerly storing.io)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-08-01 11:21 ` Gal Ofri
@ 2021-08-03 14:59   ` Finlayson, James M CIV (USA)
  2021-08-04  9:33     ` Gal Ofri
  0 siblings, 1 reply; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-08-03 14:59 UTC (permalink / raw)
  To: 'Gal Ofri', 'linux-raid@vger.kernel.org'

Gal,
My SA just gave me the server with the 5.14 RC4 kernel built.   I have a two-pass preconditioning run going right now to get us maximum results.   I expect to be able to run the tests hopefully by COB Wednesday.   Preconditioning will unfortunately take 8 hours (15.36TB drives); I also have to make BIOS changes for apples-to-apples "hero runs" and then get the mdraids created.    In your opinion, if I bypass the initial formatting with mdadm --assume-clean, will that make a difference in the results?    I usually let the format run, but I want to get you results as soon as possible.
Thanks,
Jim
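
For context, the shortcut in question is just the --assume-clean flag at
create time; a hypothetical 10+1 example with placeholder device names
and chunk size, not the exact command used here:

# --assume-clean skips the initial resync; parity is only guaranteed
# consistent for stripes written or resynced afterwards (see Gal's reply
# below about letting the format complete before a read benchmark)
mdadm --create /dev/md0 --level=5 --chunk=64 --raid-devices=11 --assume-clean \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 \
      /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1 /dev/nvme10n1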




-----Original Message-----
From: Gal Ofri <gal.ofri@volumez.com> 
Sent: Sunday, August 1, 2021 7:21 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: 'linux-raid@vger.kernel.org' <linux-raid@vger.kernel.org>
Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

All active links contained in this email were disabled.  Please verify the identity of the sender, and confirm the authenticity of all links contained within the message prior to copying and pasting the address to a Web browser.  




----

Hey Jim,

Read iops (rand/seq) were addressed in a recent commit:
97ae27252f49 md/raid5: avoid device_lock in read_one_chunk() Caution-https://github.com/torvalds/linux/commit/97ae27252f4962d0fcc38ee1d9f913d817a2024e
It was merged into 5.14, so you can either cherry-pick it or just use a latest-master kernel.

Sounds like your environment is stronger than the one I used for the testing, so please do share your benchmark if you manage to surpass the results described in the commit message.

Cheers,
Gal Ofri,
Volumez (formerly storing.io)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-08-03 14:59   ` [Non-DoD Source] " Finlayson, James M CIV (USA)
@ 2021-08-04  9:33     ` Gal Ofri
  0 siblings, 0 replies; 28+ messages in thread
From: Gal Ofri @ 2021-08-04  9:33 UTC (permalink / raw)
  To: Finlayson, James M CIV (USA); +Cc: 'linux-raid@vger.kernel.org'

On Tue, 3 Aug 2021 14:59:45 +0000
"Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@mail.mil> wrote:

> My SA just gave me the server with the 5.14 RC4 kernel built.   I have a two pass preconditioning run  going right now to get us maximum results.   I expect to be able to run the tests hopefully by COB Wednesday.   Preconditioning will take 8 hours unfortunately (15.36TB drives), I have to make BIOS changes for apples to apples "hero runs" and then get the mdraid's created.    In your opinion, if I bypass the initial formatting with mdadm --assume-clean, will that make a difference in the results?    I usually let the format run, but I want to get you results as soon as possible.

You're running a read workload, so it doesn't make sense to read from
the array without formatting it first. IMO, you'd better wait for the
formatting to complete rather than try to skip/reduce it.
Also, note that in my tests I had to set up XFS over the raid in order to
avoid queueing issues. Feel free to ping me if you want the setup script
or the vdbench file(s) that I used.

Cheers,
Gal
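
The gist of such a setup, as a sketch only (mount options, file size and
job shape here are assumptions, not the actual script Gal used):

mkfs.xfs -f /dev/md0
mkdir -p /mnt/md0 && mount -o noatime /dev/md0 /mnt/md0
fio --name=xfs-randread --directory=/mnt/md0 --size=64G \
    --ioengine=libaio --direct=1 --bs=4k --rw=randread \
    --iodepth=128 --numjobs=64 --time_based --runtime=300 --group_reporting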

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
       [not found]     ` <5EAED86C53DED2479E3E145969315A2385856AF7@UMECHPA7B.easf.csd.disa.mil>
@ 2021-08-05 19:52       ` Finlayson, James M CIV (USA)
  2021-08-05 20:50         ` Finlayson, James M CIV (USA)
  0 siblings, 1 reply; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-08-05 19:52 UTC (permalink / raw)
  To: 'linux-raid@vger.kernel.org'
  Cc: 'Gal Ofri', Finlayson, James M CIV (USA)

Sorry - again..I sent HTML instead of plain text

Resend - mailing list bounce  
All, 
Sorry for the delay - both work and life got in the way.   Here is some feedback:

BLUF, with the 5.14rc3 kernel our SA built: md0, a 10+1+1 RAID5, 5.332M IOPS / 20.3GiB/s; md1, a 10+1+1 RAID5, 5.892M IOPS / 22.5GiB/s - the best hero numbers I've ever seen for mdraid RAID5 IOPS.   I think the kernel patch is good.  Prior was socket0 1.263M IOPS / 4934MiB/s, socket1 1.071M IOPS / 4183MiB/s.   I'm willing to help push this as hard as we can until we hit a bottleneck outside of our control.

I need to verify the RAW IOPS - admittedly this is a different server and I didn't do any regression testing before the kernel change, but my raw numbers were socket0 13.2M IOPS and socket1 13.5M IOPS.   Prior was socket0 16.0M IOPS and socket1 13.5M IOPS - there appears to be a regression in the socket0 "hero run", but since this is a different server I don't know whether I have a configuration management issue in my zealousness to test this patch or whether we have a real regression.   I was so excited to have the attention of kernel developers who needed my help that I borrowed another system, because I didn't want to tear apart my "Frankenstein's monster" 32-partition mdraid LVM mess.   If I can switch kernels and reboot before work and life get back in the way, I'll follow up.

I think I might have to give myself the action to run this to ground next week on the other server.   Without a doubt the mdraid lock improvement is worth taking forward.   I either have to find my error or point a finger, as my raw hero numbers got worse.   I tend to see one socket outrun the other - the way HPE allocates the NVMe drives to PCIe root complexes is not how I'd like to do it, so the drives are unbalanced across the root complexes (drives sit on 4 different root complexes on socket 0 and 3 on socket 1), so one would think socket0 will always be faster for hero runs.   An NPS4 NUMA mapping is the best way to show it:
[root@gremlin04 hornet05]# cat *nps4
#filename=/dev/nvme0n1 0
#filename=/dev/nvme1n1 0
#filename=/dev/nvme2n1 1
#filename=/dev/nvme3n1 1
#filename=/dev/nvme4n1 2
#filename=/dev/nvme5n1 2
#filename=/dev/nvme6n1 2
#filename=/dev/nvme7n1 2
#filename=/dev/nvme8n1 3
#filename=/dev/nvme9n1 3
#filename=/dev/nvme10n1 3
#filename=/dev/nvme11n1 3
#filename=/dev/nvme12n1 4
#filename=/dev/nvme13n1 4
#filename=/dev/nvme14n1 4
#filename=/dev/nvme15n1 4
#filename=/dev/nvme17n1 5
#filename=/dev/nvme18n1 5
#filename=/dev/nvme19n1 5
#filename=/dev/nvme20n1 5
#filename=/dev/nvme21n1 6
#filename=/dev/nvme22n1 6
#filename=/dev/nvme23n1 6
#filename=/dev/nvme24n1 6
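
The job file itself isn't reproduced in the thread; a cut-down sketch of
what fiojim.hpdl385.nps1 presumably looks like, reconstructed from the
output below (the filenames and NUMA pinning are assumptions, and
numa_cpu_nodes needs fio built with libnuma):

[global]
ioengine=libaio
direct=1
bs=4k
rw=randread
iodepth=128
numjobs=64
time_based
runtime=300
group_reporting

[socket0]
# raw drives on socket 0, one ':'-separated entry per drive
numa_cpu_nodes=0
filename=/dev/nvme0n1:/dev/nvme1n1

[socket1]
new_group
numa_cpu_nodes=1
filename=/dev/nvme12n1:/dev/nvme13n1

[socket0-md]
# in practice the raw and md groups would be serialized
# (stonewall, or separate fio invocations)
new_group
numa_cpu_nodes=0
filename=/dev/md0

[socket1-md]
new_group
numa_cpu_nodes=1
filename=/dev/md1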


fio fiojim.hpdl385.nps1
socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket0-md: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1-md: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.26
Starting 256 processes
Jobs: 128 (f=128): [_(128),r(128)][1.5%][r=42.8GiB/s][r=11.2M IOPS][eta 10h:40m:00s]        
socket0: (groupid=0, jobs=64): err= 0: pid=522428: Thu Aug  5 19:33:05 2021
  read: IOPS=13.2M, BW=50.2GiB/s (53.9GB/s)(14.7TiB/300005msec)
    slat (nsec): min=1312, max=8308.1k, avg=2206.72, stdev=1505.92
    clat (usec): min=14, max=42033, avg=619.56, stdev=671.45
     lat (usec): min=19, max=42045, avg=621.83, stdev=671.46
    clat percentiles (usec):
     |  1.00th=[  113],  5.00th=[  149], 10.00th=[  180], 20.00th=[  229],
     | 30.00th=[  273], 40.00th=[  310], 50.00th=[  351], 60.00th=[  408],
     | 70.00th=[  578], 80.00th=[  938], 90.00th=[ 1467], 95.00th=[ 1909],
     | 99.00th=[ 3163], 99.50th=[ 4178], 99.90th=[ 5800], 99.95th=[ 6390],
     | 99.99th=[ 8455]
   bw (  MiB/s): min=28741, max=61365, per=18.56%, avg=51489.80, stdev=82.09, samples=38016
   iops        : min=7357916, max=15709528, avg=13181362.22, stdev=21013.83, samples=38016
  lat (usec)   : 20=0.01%, 50=0.02%, 100=0.42%, 250=24.52%, 500=42.21%
  lat (usec)   : 750=7.94%, 1000=6.34%
  lat (msec)   : 2=14.26%, 4=3.74%, 10=0.54%, 20=0.01%, 50=0.01%
  cpu          : usr=14.58%, sys=47.48%, ctx=291912925, majf=0, minf=10492
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=3949519687,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1: (groupid=1, jobs=64): err= 0: pid=522492: Thu Aug  5 19:33:05 2021
  read: IOPS=13.6M, BW=51.8GiB/s (55.7GB/s)(15.2TiB/300004msec)
    slat (nsec): min=1323, max=4335.7k, avg=2242.27, stdev=1608.25
    clat (usec): min=14, max=41341, avg=600.15, stdev=726.62
     lat (usec): min=20, max=41358, avg=602.46, stdev=726.64
    clat percentiles (usec):
     |  1.00th=[  115],  5.00th=[  151], 10.00th=[  184], 20.00th=[  231],
     | 30.00th=[  269], 40.00th=[  306], 50.00th=[  347], 60.00th=[  400],
     | 70.00th=[  506], 80.00th=[  799], 90.00th=[ 1303], 95.00th=[ 1909],
     | 99.00th=[ 3589], 99.50th=[ 4424], 99.90th=[ 7111], 99.95th=[ 7767],
     | 99.99th=[10290]
   bw (  MiB/s): min=28663, max=71847, per=21.11%, avg=53145.09, stdev=111.29, samples=38016
   iops        : min=7337860, max=18392866, avg=13605117.00, stdev=28491.19, samples=38016
  lat (usec)   : 20=0.01%, 50=0.02%, 100=0.36%, 250=24.52%, 500=44.77%
  lat (usec)   : 750=8.90%, 1000=6.37%
  lat (msec)   : 2=10.52%, 4=3.87%, 10=0.66%, 20=0.01%, 50=0.01%
  cpu          : usr=14.86%, sys=49.40%, ctx=282634154, majf=0, minf=10276
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=4076360454,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket0-md: (groupid=2, jobs=64): err= 0: pid=524061: Thu Aug  5 19:33:05 2021
  read: IOPS=5332k, BW=20.3GiB/s (21.8GB/s)(6102GiB/300002msec)
    slat (nsec): min=1633, max=17043k, avg=11123.38, stdev=8694.61
    clat (usec): min=186, max=18705, avg=1524.87, stdev=115.29
     lat (usec): min=200, max=18743, avg=1536.08, stdev=115.90
    clat percentiles (usec):
     |  1.00th=[ 1270],  5.00th=[ 1336], 10.00th=[ 1369], 20.00th=[ 1418],
     | 30.00th=[ 1467], 40.00th=[ 1500], 50.00th=[ 1532], 60.00th=[ 1549],
     | 70.00th=[ 1582], 80.00th=[ 1631], 90.00th=[ 1680], 95.00th=[ 1713],
     | 99.00th=[ 1795], 99.50th=[ 1811], 99.90th=[ 1893], 99.95th=[ 1926],
     | 99.99th=[ 2089]
   bw (  MiB/s): min=19030, max=21969, per=100.00%, avg=20843.43, stdev= 5.35, samples=38272
   iops        : min=4871687, max=5624289, avg=5335900.01, stdev=1370.43, samples=38272
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=99.97%, 4=0.02%, 10=0.01%, 20=0.01%
  cpu          : usr=5.56%, sys=77.91%, ctx=8118, majf=0, minf=9018
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
    issued rwts: total=1599503201,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1-md: (groupid=3, jobs=64): err= 0: pid=524125: Thu Aug  5 19:33:05 2021
  read: IOPS=5892k, BW=22.5GiB/s (24.1GB/s)(6743GiB/300002msec)
    slat (nsec): min=1663, max=1274.1k, avg=9896.09, stdev=7939.50
    clat (usec): min=236, max=11102, avg=1379.86, stdev=148.64
     lat (usec): min=239, max=11110, avg=1389.84, stdev=149.54
    clat percentiles (usec):
     |  1.00th=[ 1106],  5.00th=[ 1172], 10.00th=[ 1205], 20.00th=[ 1254],
     | 30.00th=[ 1287], 40.00th=[ 1336], 50.00th=[ 1369], 60.00th=[ 1401],
     | 70.00th=[ 1434], 80.00th=[ 1500], 90.00th=[ 1582], 95.00th=[ 1663],
     | 99.00th=[ 1811], 99.50th=[ 1860], 99.90th=[ 1942], 99.95th=[ 1958],
     | 99.99th=[ 2040]
   bw (  MiB/s): min=20982, max=24535, per=-82.15%, avg=23034.61, stdev=15.46, samples=38272
   iops        : min=5371404, max=6281119, avg=5896843.14, stdev=3958.21, samples=38272
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=99.97%, 4=0.02%, 10=0.01%, 20=0.01%
  cpu          : usr=6.55%, sys=74.98%, ctx=9833, majf=0, minf=8956
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=1767618924,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=50.2GiB/s (53.9GB/s), 50.2GiB/s-50.2GiB/s (53.9GB/s-53.9GB/s), io=14.7TiB (16.2TB), run=300005-300005msec

Run status group 1 (all jobs):
   READ: bw=51.8GiB/s (55.7GB/s), 51.8GiB/s-51.8GiB/s (55.7GB/s-55.7GB/s), io=15.2TiB (16.7TB), run=300004-300004msec

Run status group 2 (all jobs):
   READ: bw=20.3GiB/s (21.8GB/s), 20.3GiB/s-20.3GiB/s (21.8GB/s-21.8GB/s), io=6102GiB (6552GB), run=300002-300002msec

Run status group 3 (all jobs):
   READ: bw=22.5GiB/s (24.1GB/s), 22.5GiB/s-22.5GiB/s (24.1GB/s-24.1GB/s), io=6743GiB (7240GB), run=300002-300002msec

Disk stats (read/write):
  nvme0n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme1n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme2n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme7n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme8n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme9n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme10n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme11n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme12n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme13n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme14n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme15n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme17n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme18n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme19n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme20n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme21n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme22n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme23n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme24n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  md0: ios=1599378656/0, merge=0/0, ticks=391992721/0, in_queue=391992721, util=100.00%
  md1: ios=1767484212/0, merge=0/0, ticks=427666887/0, in_queue=427666887, util=100.00%

From: Gal Ofri <gal.ofri@volumez.com> 
Sent: Wednesday, July 28, 2021 5:43 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>; 'linux-raid@vger.kernel.org' <linux-raid@vger.kernel.org>
Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

All active links contained in this email were disabled. Please verify the identity of the sender, and confirm the authenticity of all links contained within the message prior to copying and pasting the address to a Web browser. 
________________________________________

A recent commit raised the limit on raid5/6 read iops.
It's available in 5.14.
See Caution-https://github.com/torvalds/linux/commit/97ae27252f4962d0fcc38ee1d9f913d817a2024e < Caution-https://github.com/torvalds/linux/commit/97ae27252f4962d0fcc38ee1d9f913d817a2024e > 
commit 97ae27252f4962d0fcc38ee1d9f913d817a2024e
Author: Gal Ofri <gal.ofri@storing.io>
Date:   Mon Jun 7 14:07:03 2021 +0300
    md/raid5: avoid device_lock in read_one_chunk()
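
(For anyone checking whether a given tree already carries the fix - assuming you have a kernel git checkout handy - something like:

git tag --contains 97ae27252f4962d0fcc38ee1d9f913d817a2024e

will list the release tags that contain the commit.)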

Please do share if you reach more iops in your env than described in the commit.

Cheers,
Gal, 
Volumez (formerly storing.io)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-08-05 19:52       ` Finlayson, James M CIV (USA)
@ 2021-08-05 20:50         ` Finlayson, James M CIV (USA)
  2021-08-05 21:10           ` Finlayson, James M CIV (USA)
  0 siblings, 1 reply; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-08-05 20:50 UTC (permalink / raw)
  To: 'linux-raid@vger.kernel.org'
  Cc: 'Gal Ofri', Finlayson, James M CIV (USA)

As far as the slower hero numbers go - false alarm on my part - I rebooted with the 4.18 RHEL 8.4 kernel:
Socket0 hero - 13.2M IOPS, socket1 hero - 13.7M IOPS.   I still have to figure out the differences, either between my drives or my servers.   If I had to guess, the PCIe cards are, slot for slot, in different slots between the two servers....

As a major flag though - with the mdraid volumes I created under the 5.14rc3 kernel, I lock the system up solid when I try to access them under 4.18.   I'm not an expert at forcing NMIs and capturing the stack traces, so I might have to leave that to others.   After two lockups, I returned to the 5.14 kernel.   If I need to run something - you have seen the config I have - I'm willing.
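
If stack traces would help, this is roughly what I could try on the next lockup - a sketch only, assuming magic SysRq is enabled and a serial console or netconsole survives the hang long enough to log the output:

# allow all SysRq functions before reproducing
echo 1 > /proc/sys/kernel/sysrq
# once the array access wedges, from another terminal (or via the BREAK sequence on a serial console):
echo w > /proc/sysrq-trigger   # dump blocked (D-state) tasks
echo l > /proc/sysrq-trigger   # dump backtraces of all active CPUs
echo t > /proc/sysrq-trigger   # dump all task states (verbose)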

I'm willing to push as hard as I can and to run anything that can help, as long as it isn't urgent - I have a day job and some constraints as a civil servant, but I have the researcher's push-push-push mindset.   I really want to encourage the community to push as hard as possible on protected IOPS, and I'm willing to help however I can.   In my interactions with the processor and server OEMs, I'm encouraging them to get the biggest, baddest server/SSD combinations they have into the hands of the Linux I/O development leaders early in development.   I know they won't listen to me, but I'm trying to help.

For those of you on Rome servers, get with your server provider.   There are some things in the BIOS that can be tweaked for I/O.


-----Original Message-----
From: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> 
Sent: Thursday, August 5, 2021 3:52 PM
To: 'linux-raid@vger.kernel.org' <linux-raid@vger.kernel.org>
Cc: 'Gal Ofri' <gal.ofri@volumez.com>; Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Subject: RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

Sorry - again, I sent HTML instead of plain text

Resend - mailing list bounce
All,
Sorry for the delay - both work and life got in the way.   Here is some feedback:

BLUF, with the 5.14rc3 kernel that our SA built: md0, a 10+1+1 RAID5 - 5.332M IOPS, 20.3GiB/s; md1, a 10+1+1 RAID5 - 5.892M IOPS, 22.5GiB/s - the best hero numbers I've ever seen for mdraid RAID5 IOPS.   I think the kernel patch is good.   Prior was socket0 1.263M IOPS, 4934MiB/s and socket1 1.071M IOPS, 4183MiB/s.   I'm willing to help push this as hard as we can until we hit a bottleneck outside of our control.

I need to verify the raw IOPS - admittedly this is a different server and I didn't do any regression testing before the kernel swap, but my raw numbers were socket0 13.2M IOPS and socket1 13.5M IOPS.   Prior was socket0 16.0M IOPS and socket1 13.5M IOPS.   There appears to be a regression in the socket0 "hero run", but since this is a different server I don't know whether I have a configuration management issue from my zealousness to test this patch or whether we have a real regression.   I was so excited to have the attention of kernel developers who needed my help that I borrowed another system, because I didn't want to tear apart my "Frankenstein's monster" 32-partition mdraid/LVM mess.   If I can switch kernels and reboot before work and life get back in the way, I'll follow up.

I think I might have to give myself the action to run this to ground next week on the other server.   Without a doubt the mdraid lock improvement is worth taking forward.   I either have to find my error or point a finger, as my raw hero numbers got worse.   I tend to see one socket outrun the other - the way HPE allocates the NVMe drives to PCIe root complexes is not how I'd do it, so the drives are unbalanced across the root complexes (they span 4 root complexes on socket 0 and 3 on socket 1), so one would think socket0 will always be faster for hero runs.   An NPS4 NUMA mapping is the best way to show it:
[root@gremlin04 hornet05]# cat *nps4
#filename=/dev/nvme0n1 0
#filename=/dev/nvme1n1 0
#filename=/dev/nvme2n1 1
#filename=/dev/nvme3n1 1
#filename=/dev/nvme4n1 2
#filename=/dev/nvme5n1 2
#filename=/dev/nvme6n1 2
#filename=/dev/nvme7n1 2
#filename=/dev/nvme8n1 3
#filename=/dev/nvme9n1 3
#filename=/dev/nvme10n1 3
#filename=/dev/nvme11n1 3
#filename=/dev/nvme12n1 4
#filename=/dev/nvme13n1 4
#filename=/dev/nvme14n1 4
#filename=/dev/nvme15n1 4
#filename=/dev/nvme17n1 5
#filename=/dev/nvme18n1 5
#filename=/dev/nvme19n1 5
#filename=/dev/nvme20n1 5
#filename=/dev/nvme21n1 6
#filename=/dev/nvme22n1 6
#filename=/dev/nvme23n1 6
#filename=/dev/nvme24n1 6
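
For reference, the job file used below (fiojim.hpdl385.nps1) boils down to something like this.   This is a sketch, not the exact file: the device lists are truncated, the NUMA pinning assumes fio was built with libnuma, and the real run also serialized the raw-device groups and the md groups (visible in the job status line below), which I've left out here:

[global]
rw=randread
bs=4k
direct=1
ioengine=libaio
iodepth=128
numjobs=64
runtime=300
time_based=1
group_reporting=1
randrepeat=0

[socket0]
numa_cpu_nodes=0
numa_mem_policy=bind:0
# the real file lists all of the socket-0 drives here
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1

[socket1]
new_group
numa_cpu_nodes=1
numa_mem_policy=bind:1
# the real file lists all of the socket-1 drives here
filename=/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1

[socket0-md]
new_group
numa_cpu_nodes=0
numa_mem_policy=bind:0
filename=/dev/md0

[socket1-md]
new_group
numa_cpu_nodes=1
numa_mem_policy=bind:1
filename=/dev/md1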


fio fiojim.hpdl385.nps1
socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
socket0-md: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
socket1-md: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
fio-3.26
Starting 256 processes
Jobs: 128 (f=128): [_(128),r(128)][1.5%][r=42.8GiB/s][r=11.2M IOPS][eta 10h:40m:00s]
socket0: (groupid=0, jobs=64): err= 0: pid=522428: Thu Aug  5 19:33:05 2021
  read: IOPS=13.2M, BW=50.2GiB/s (53.9GB/s)(14.7TiB/300005msec)
    slat (nsec): min=1312, max=8308.1k, avg=2206.72, stdev=1505.92
    clat (usec): min=14, max=42033, avg=619.56, stdev=671.45
     lat (usec): min=19, max=42045, avg=621.83, stdev=671.46
    clat percentiles (usec):
     |  1.00th=[  113],  5.00th=[  149], 10.00th=[  180], 20.00th=[  229],
     | 30.00th=[  273], 40.00th=[  310], 50.00th=[  351], 60.00th=[  408],
     | 70.00th=[  578], 80.00th=[  938], 90.00th=[ 1467], 95.00th=[ 1909],
     | 99.00th=[ 3163], 99.50th=[ 4178], 99.90th=[ 5800], 99.95th=[ 6390],
     | 99.99th=[ 8455]
   bw (  MiB/s): min=28741, max=61365, per=18.56%, avg=51489.80, stdev=82.09, samples=38016
   iops        : min=7357916, max=15709528, avg=13181362.22, stdev=21013.83, samples=38016
  lat (usec)   : 20=0.01%, 50=0.02%, 100=0.42%, 250=24.52%, 500=42.21%
  lat (usec)   : 750=7.94%, 1000=6.34%
  lat (msec)   : 2=14.26%, 4=3.74%, 10=0.54%, 20=0.01%, 50=0.01%
  cpu          : usr=14.58%, sys=47.48%, ctx=291912925, majf=0, minf=10492
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=3949519687,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1: (groupid=1, jobs=64): err= 0: pid=522492: Thu Aug  5 19:33:05 2021
  read: IOPS=13.6M, BW=51.8GiB/s (55.7GB/s)(15.2TiB/300004msec)
    slat (nsec): min=1323, max=4335.7k, avg=2242.27, stdev=1608.25
    clat (usec): min=14, max=41341, avg=600.15, stdev=726.62
     lat (usec): min=20, max=41358, avg=602.46, stdev=726.64
    clat percentiles (usec):
     |  1.00th=[  115],  5.00th=[  151], 10.00th=[  184], 20.00th=[  231],
     | 30.00th=[  269], 40.00th=[  306], 50.00th=[  347], 60.00th=[  400],
     | 70.00th=[  506], 80.00th=[  799], 90.00th=[ 1303], 95.00th=[ 1909],
     | 99.00th=[ 3589], 99.50th=[ 4424], 99.90th=[ 7111], 99.95th=[ 7767],
     | 99.99th=[10290]
   bw (  MiB/s): min=28663, max=71847, per=21.11%, avg=53145.09, stdev=111.29, samples=38016
   iops        : min=7337860, max=18392866, avg=13605117.00, stdev=28491.19, samples=38016
  lat (usec)   : 20=0.01%, 50=0.02%, 100=0.36%, 250=24.52%, 500=44.77%
  lat (usec)   : 750=8.90%, 1000=6.37%
  lat (msec)   : 2=10.52%, 4=3.87%, 10=0.66%, 20=0.01%, 50=0.01%
  cpu          : usr=14.86%, sys=49.40%, ctx=282634154, majf=0, minf=10276
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=4076360454,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket0-md: (groupid=2, jobs=64): err= 0: pid=524061: Thu Aug  5 19:33:05 2021
  read: IOPS=5332k, BW=20.3GiB/s (21.8GB/s)(6102GiB/300002msec)
    slat (nsec): min=1633, max=17043k, avg=11123.38, stdev=8694.61
    clat (usec): min=186, max=18705, avg=1524.87, stdev=115.29
     lat (usec): min=200, max=18743, avg=1536.08, stdev=115.90
    clat percentiles (usec):
     |  1.00th=[ 1270],  5.00th=[ 1336], 10.00th=[ 1369], 20.00th=[ 1418],
     | 30.00th=[ 1467], 40.00th=[ 1500], 50.00th=[ 1532], 60.00th=[ 1549],
     | 70.00th=[ 1582], 80.00th=[ 1631], 90.00th=[ 1680], 95.00th=[ 1713],
     | 99.00th=[ 1795], 99.50th=[ 1811], 99.90th=[ 1893], 99.95th=[ 1926],
     | 99.99th=[ 2089]
   bw (  MiB/s): min=19030, max=21969, per=100.00%, avg=20843.43, stdev= 5.35, samples=38272
   iops        : min=4871687, max=5624289, avg=5335900.01, stdev=1370.43, samples=38272
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=99.97%, 4=0.02%, 10=0.01%, 20=0.01%
  cpu          : usr=5.56%, sys=77.91%, ctx=8118, majf=0, minf=9018
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
    issued rwts: total=1599503201,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1-md: (groupid=3, jobs=64): err= 0: pid=524125: Thu Aug  5 19:33:05 2021
  read: IOPS=5892k, BW=22.5GiB/s (24.1GB/s)(6743GiB/300002msec)
    slat (nsec): min=1663, max=1274.1k, avg=9896.09, stdev=7939.50
    clat (usec): min=236, max=11102, avg=1379.86, stdev=148.64
     lat (usec): min=239, max=11110, avg=1389.84, stdev=149.54
    clat percentiles (usec):
     |  1.00th=[ 1106],  5.00th=[ 1172], 10.00th=[ 1205], 20.00th=[ 1254],
     | 30.00th=[ 1287], 40.00th=[ 1336], 50.00th=[ 1369], 60.00th=[ 1401],
     | 70.00th=[ 1434], 80.00th=[ 1500], 90.00th=[ 1582], 95.00th=[ 1663],
     | 99.00th=[ 1811], 99.50th=[ 1860], 99.90th=[ 1942], 99.95th=[ 1958],
     | 99.99th=[ 2040]
   bw (  MiB/s): min=20982, max=24535, per=-82.15%, avg=23034.61, stdev=15.46, samples=38272
   iops        : min=5371404, max=6281119, avg=5896843.14, stdev=3958.21, samples=38272
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=99.97%, 4=0.02%, 10=0.01%, 20=0.01%
  cpu          : usr=6.55%, sys=74.98%, ctx=9833, majf=0, minf=8956
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=1767618924,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=50.2GiB/s (53.9GB/s), 50.2GiB/s-50.2GiB/s (53.9GB/s-53.9GB/s), io=14.7TiB (16.2TB), run=300005-300005msec

Run status group 1 (all jobs):
   READ: bw=51.8GiB/s (55.7GB/s), 51.8GiB/s-51.8GiB/s (55.7GB/s-55.7GB/s), io=15.2TiB (16.7TB), run=300004-300004msec

Run status group 2 (all jobs):
   READ: bw=20.3GiB/s (21.8GB/s), 20.3GiB/s-20.3GiB/s (21.8GB/s-21.8GB/s), io=6102GiB (6552GB), run=300002-300002msec

Run status group 3 (all jobs):
   READ: bw=22.5GiB/s (24.1GB/s), 22.5GiB/s-22.5GiB/s (24.1GB/s-24.1GB/s), io=6743GiB (7240GB), run=300002-300002msec

Disk stats (read/write):
  nvme0n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme1n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme2n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme7n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme8n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme9n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme10n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme11n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme12n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme13n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme14n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme15n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme17n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme18n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme19n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme20n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme21n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme22n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme23n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme24n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  md0: ios=1599378656/0, merge=0/0, ticks=391992721/0, in_queue=391992721, util=100.00%
  md1: ios=1767484212/0, merge=0/0, ticks=427666887/0, in_queue=427666887, util=100.00%

From: Gal Ofri <gal.ofri@volumez.com>
Sent: Wednesday, July 28, 2021 5:43 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>; 'linux-raid@vger.kernel.org' <linux-raid@vger.kernel.org>
Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

All active links contained in this email were disabled. Please verify the identity of the sender, and confirm the authenticity of all links contained within the message prior to copying and pasting the address to a Web browser. 
________________________________________

A recent commit raised the limit on raid5/6 read iops.
It's available in 5.14.
See Caution-https://github.com/torvalds/linux/commit/97ae27252f4962d0fcc38ee1d9f913d817a2024e < Caution-https://github.com/torvalds/linux/commit/97ae27252f4962d0fcc38ee1d9f913d817a2024e > commit 97ae27252f4962d0fcc38ee1d9f913d817a2024e
Author: Gal Ofri <gal.ofri@storing.io>
Date:   Mon Jun 7 14:07:03 2021 +0300
    md/raid5: avoid device_lock in read_one_chunk()

Please do share if you reach more iops in your env than described in the commit.

Cheers,
Gal,
Volumez (formerly storing.io)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-08-05 20:50         ` Finlayson, James M CIV (USA)
@ 2021-08-05 21:10           ` Finlayson, James M CIV (USA)
  2021-08-08 14:43             ` Gal Ofri
  0 siblings, 1 reply; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-08-05 21:10 UTC (permalink / raw)
  To: 'linux-raid@vger.kernel.org'
  Cc: 'Gal Ofri', Finlayson, James M CIV (USA)

Final spray from me for a few days.

In my strict NUMA adherence with mdraid, I see lots of variability between reboots/assembles.   Sometimes md0 wins, sometimes md1 wins, and in my earlier runs md0 and md1 were notionally balanced.   I change nothing but see this variance.   I just cranked up a week-long extended run of these 10+1+1s under the 5.14rc3 kernel, and right now md0 is doing 5M IOPS and md1 6.3M - still totaling in the low 11's, but quite the disparity.   Am I missing a tuning knob?   I shared everything I do and know in the earlier thread.   I just want to point this out while I have attention; I have seen this behavior over and over again.   The more I learn about AMD, the more I think I can't depend on the HPC profile provided and need to take full control of the BIOS.   I know there is a ton of power management going on under the covers, so maybe that is what I'm experiencing.   The more I type, the more I think I don't see it on Intel, but I don't have a modern Intel machine with modern SSDs to test.   I'll accept that there is nothing inherent in mdraid or the kernel to cause this and put my attention on the BIOS if the experts can confirm....
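
For anyone chasing the same variability, these are roughly the knobs I'm looking at - a sketch, and the governor/profile names depend on what the platform and cpufreq driver actually expose:

# what is driving the cores right now?
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# pin the governor and the RHEL tuned profile for the duration of a run
cpupower frequency-set -g performance
tuned-adm profile throughput-performance
# watch per-core frequency and C-state residency while fio runs
turbostat --interval 5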

-----Original Message-----
From: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> 
Sent: Thursday, August 5, 2021 4:50 PM
To: 'linux-raid@vger.kernel.org' <linux-raid@vger.kernel.org>
Cc: 'Gal Ofri' <gal.ofri@volumez.com>; Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Subject: RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

As far as the slower hero numbers - false alarm on my part - rebooted with 4.18 RHEL 8.4 kernel
Socket0 hero - 13.2M IOPS, Socket1 hero 13.7M IOPS.   I have to figure out the differences either between my drives or my server.  Chances are, slot for slot I have PCIe cards that are in different slots between the two servers if I had to guess....

As a major flag though - with mdraid volumes I created under the 5.14rc3 kernel, I lock the system up solid when I try to access them under 4.18.....I'm not an expert on forcing NMI's and getting the stack traces, so I might have to leave that to others.....After two lockups, I returned to the 5.14 kernel.   If I need to run something - you have seen the config I have - I'm willing.   

I'm willing to push as hard as I can and to run anything that can help as long as it isn't urgent - I have a day job and have some constraints as a civil servant, however, I have the researcher, push push push mindset.   I want to really encourage the community to push as hard as possible on protected IOPS and I'm willing to help however I can....In my interactions with the processor and server OEMs - I'm encouraging them to get the Linux leaders in I/O development, the biggest baddest Server/SSD combinations they have early in the development.   I know they won't listen to me but I'm trying to help.

For those of you on Rome server, get with your server provider.   There are some things in the BIOS that can be tweaked for I/O.   


-----Original Message-----
From: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> 
Sent: Thursday, August 5, 2021 3:52 PM
To: 'linux-raid@vger.kernel.org' <linux-raid@vger.kernel.org>
Cc: 'Gal Ofri' <gal.ofri@volumez.com>; Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Subject: RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

Sorry - again..I sent HTML instead of plain text

Resend - mailing list bounce
All,
Sorry for the delay - both work and life got into the way.   Here is some feedback:

BLUF upfront with 5.14rc3 kernel that our SA built - md0 a 10+1+1 RAID5 - 5.332 M IOPS 20.3GiB/s, md1 a 10+1+1 RAID5, 5.892M IOPS 22.5GiB/s  - best hero numbers I've ever seen on mdraid  RAID5 IOPS.   I think the kernel patch is good.  Prior was  socket0 1.263M IOPS 4934MiB/s, socket1 1.071M IOSP, 4183MiB/s....   I'm willing to help push this as hard as we can until we hit a bottleneck outside of our control.   

I need to verify the RAW IOPS - admittedly this is a different server and I didn't do any regression testing before the kernel, but my raw were  socket0: 13.2M IOPS and socket1  13.5M IOPS.   Prior was socket0 16.0M IOPS and socket1 13.5M IOPS.   - admittedly there appears to a regression in the socket0 "hero run" but what I don't know that since this is a different server, I don't know if I have a configuration management issue in my zealousness to test this patch or whether we have a regression.   I was so excited to have the attention of kernel developers that needed my help that I borrowed another system, because I didn't want to tear apart my "Frankenstein's monster" 32 partition mdraid LVM mess.   If I can switch kernels and reboot before work and life get back in the way, I'll follow  up..

I think I might have to give myself the action to run this to ground next week on the other server.   Without a doubt the mdraid lock improvement is worth taking forward.   I either have to find my error or point a finger as my raw hero numbers got worse.   I tend to see one socket outrun another -  the way HPE allocates the nvme drives to pcie root complexes  is not how I'd like to do it so the drives are unbalanced on the PCIe root complexes (drives are in 4 different root complexes on socket 0 and 3 on socket 1, so one would think socket0 will always be faster for hero runs  (an NPS4 numa mapping is the best way to show it:
[root@gremlin04 hornet05]# cat *nps4
#filename=/dev/nvme0n1 0
#filename=/dev/nvme1n1 0
#filename=/dev/nvme2n1 1
#filename=/dev/nvme3n1 1
#filename=/dev/nvme4n1 2
#filename=/dev/nvme5n1 2
#filename=/dev/nvme6n1 2
#filename=/dev/nvme7n1 2
#filename=/dev/nvme8n1 3
#filename=/dev/nvme9n1 3
#filename=/dev/nvme10n1 3
#filename=/dev/nvme11n1 3
#filename=/dev/nvme12n1 4
#filename=/dev/nvme13n1 4
#filename=/dev/nvme14n1 4
#filename=/dev/nvme15n1 4
#filename=/dev/nvme17n1 5
#filename=/dev/nvme18n1 5
#filename=/dev/nvme19n1 5
#filename=/dev/nvme20n1 5
#filename=/dev/nvme21n1 6
#filename=/dev/nvme22n1 6
#filename=/dev/nvme23n1 6
#filename=/dev/nvme24n1 6


fio fiojim.hpdl385.nps1
socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
socket0-md: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
socket1-md: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
fio-3.26
Starting 256 processes
Jobs: 128 (f=128): [_(128),r(128)][1.5%][r=42.8GiB/s][r=11.2M IOPS][eta 10h:40m:00s]
socket0: (groupid=0, jobs=64): err= 0: pid=522428: Thu Aug  5 19:33:05 2021
  read: IOPS=13.2M, BW=50.2GiB/s (53.9GB/s)(14.7TiB/300005msec)
    slat (nsec): min=1312, max=8308.1k, avg=2206.72, stdev=1505.92
    clat (usec): min=14, max=42033, avg=619.56, stdev=671.45
     lat (usec): min=19, max=42045, avg=621.83, stdev=671.46
    clat percentiles (usec):
     |  1.00th=[  113],  5.00th=[  149], 10.00th=[  180], 20.00th=[  229],
     | 30.00th=[  273], 40.00th=[  310], 50.00th=[  351], 60.00th=[  408],
     | 70.00th=[  578], 80.00th=[  938], 90.00th=[ 1467], 95.00th=[ 1909],
     | 99.00th=[ 3163], 99.50th=[ 4178], 99.90th=[ 5800], 99.95th=[ 6390],
     | 99.99th=[ 8455]
   bw (  MiB/s): min=28741, max=61365, per=18.56%, avg=51489.80, stdev=82.09, samples=38016
   iops        : min=7357916, max=15709528, avg=13181362.22, stdev=21013.83, samples=38016
  lat (usec)   : 20=0.01%, 50=0.02%, 100=0.42%, 250=24.52%, 500=42.21%
  lat (usec)   : 750=7.94%, 1000=6.34%
  lat (msec)   : 2=14.26%, 4=3.74%, 10=0.54%, 20=0.01%, 50=0.01%
  cpu          : usr=14.58%, sys=47.48%, ctx=291912925, majf=0, minf=10492
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=3949519687,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1: (groupid=1, jobs=64): err= 0: pid=522492: Thu Aug  5 19:33:05 2021
  read: IOPS=13.6M, BW=51.8GiB/s (55.7GB/s)(15.2TiB/300004msec)
    slat (nsec): min=1323, max=4335.7k, avg=2242.27, stdev=1608.25
    clat (usec): min=14, max=41341, avg=600.15, stdev=726.62
     lat (usec): min=20, max=41358, avg=602.46, stdev=726.64
    clat percentiles (usec):
     |  1.00th=[  115],  5.00th=[  151], 10.00th=[  184], 20.00th=[  231],
     | 30.00th=[  269], 40.00th=[  306], 50.00th=[  347], 60.00th=[  400],
     | 70.00th=[  506], 80.00th=[  799], 90.00th=[ 1303], 95.00th=[ 1909],
     | 99.00th=[ 3589], 99.50th=[ 4424], 99.90th=[ 7111], 99.95th=[ 7767],
     | 99.99th=[10290]
   bw (  MiB/s): min=28663, max=71847, per=21.11%, avg=53145.09, stdev=111.29, samples=38016
   iops        : min=7337860, max=18392866, avg=13605117.00, stdev=28491.19, samples=38016
  lat (usec)   : 20=0.01%, 50=0.02%, 100=0.36%, 250=24.52%, 500=44.77%
  lat (usec)   : 750=8.90%, 1000=6.37%
  lat (msec)   : 2=10.52%, 4=3.87%, 10=0.66%, 20=0.01%, 50=0.01%
  cpu          : usr=14.86%, sys=49.40%, ctx=282634154, majf=0, minf=10276
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=4076360454,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket0-md: (groupid=2, jobs=64): err= 0: pid=524061: Thu Aug  5 19:33:05 2021
  read: IOPS=5332k, BW=20.3GiB/s (21.8GB/s)(6102GiB/300002msec)
    slat (nsec): min=1633, max=17043k, avg=11123.38, stdev=8694.61
    clat (usec): min=186, max=18705, avg=1524.87, stdev=115.29
     lat (usec): min=200, max=18743, avg=1536.08, stdev=115.90
    clat percentiles (usec):
     |  1.00th=[ 1270],  5.00th=[ 1336], 10.00th=[ 1369], 20.00th=[ 1418],
     | 30.00th=[ 1467], 40.00th=[ 1500], 50.00th=[ 1532], 60.00th=[ 1549],
     | 70.00th=[ 1582], 80.00th=[ 1631], 90.00th=[ 1680], 95.00th=[ 1713],
     | 99.00th=[ 1795], 99.50th=[ 1811], 99.90th=[ 1893], 99.95th=[ 1926],
     | 99.99th=[ 2089]
   bw (  MiB/s): min=19030, max=21969, per=100.00%, avg=20843.43, stdev= 5.35, samples=38272
   iops        : min=4871687, max=5624289, avg=5335900.01, stdev=1370.43, samples=38272
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=99.97%, 4=0.02%, 10=0.01%, 20=0.01%
  cpu          : usr=5.56%, sys=77.91%, ctx=8118, majf=0, minf=9018
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
    issued rwts: total=1599503201,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1-md: (groupid=3, jobs=64): err= 0: pid=524125: Thu Aug  5 19:33:05 2021
  read: IOPS=5892k, BW=22.5GiB/s (24.1GB/s)(6743GiB/300002msec)
    slat (nsec): min=1663, max=1274.1k, avg=9896.09, stdev=7939.50
    clat (usec): min=236, max=11102, avg=1379.86, stdev=148.64
     lat (usec): min=239, max=11110, avg=1389.84, stdev=149.54
    clat percentiles (usec):
     |  1.00th=[ 1106],  5.00th=[ 1172], 10.00th=[ 1205], 20.00th=[ 1254],
     | 30.00th=[ 1287], 40.00th=[ 1336], 50.00th=[ 1369], 60.00th=[ 1401],
     | 70.00th=[ 1434], 80.00th=[ 1500], 90.00th=[ 1582], 95.00th=[ 1663],
     | 99.00th=[ 1811], 99.50th=[ 1860], 99.90th=[ 1942], 99.95th=[ 1958],
     | 99.99th=[ 2040]
   bw (  MiB/s): min=20982, max=24535, per=-82.15%, avg=23034.61, stdev=15.46, samples=38272
   iops        : min=5371404, max=6281119, avg=5896843.14, stdev=3958.21, samples=38272
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=99.97%, 4=0.02%, 10=0.01%, 20=0.01%
  cpu          : usr=6.55%, sys=74.98%, ctx=9833, majf=0, minf=8956
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=1767618924,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=50.2GiB/s (53.9GB/s), 50.2GiB/s-50.2GiB/s (53.9GB/s-53.9GB/s), io=14.7TiB (16.2TB), run=300005-300005msec

Run status group 1 (all jobs):
   READ: bw=51.8GiB/s (55.7GB/s), 51.8GiB/s-51.8GiB/s (55.7GB/s-55.7GB/s), io=15.2TiB (16.7TB), run=300004-300004msec

Run status group 2 (all jobs):
   READ: bw=20.3GiB/s (21.8GB/s), 20.3GiB/s-20.3GiB/s (21.8GB/s-21.8GB/s), io=6102GiB (6552GB), run=300002-300002msec

Run status group 3 (all jobs):
   READ: bw=22.5GiB/s (24.1GB/s), 22.5GiB/s-22.5GiB/s (24.1GB/s-24.1GB/s), io=6743GiB (7240GB), run=300002-300002msec

Disk stats (read/write):
  nvme0n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme1n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme2n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme7n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme8n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme9n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme10n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme11n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme12n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme13n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme14n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme15n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme17n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme18n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme19n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme20n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme21n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme22n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme23n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme24n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  md0: ios=1599378656/0, merge=0/0, ticks=391992721/0, in_queue=391992721, util=100.00%
  md1: ios=1767484212/0, merge=0/0, ticks=427666887/0, in_queue=427666887, util=100.00%

From: Gal Ofri <gal.ofri@volumez.com>
Sent: Wednesday, July 28, 2021 5:43 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>; 'linux-raid@vger.kernel.org' <linux-raid@vger.kernel.org>
Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

All active links contained in this email were disabled. Please verify the identity of the sender, and confirm the authenticity of all links contained within the message prior to copying and pasting the address to a Web browser. 
________________________________________

A recent commit raised the limit on raid5/6 read iops.
It's available in 5.14.
See Caution-https://github.com/torvalds/linux/commit/97ae27252f4962d0fcc38ee1d9f913d817a2024e < Caution-https://github.com/torvalds/linux/commit/97ae27252f4962d0fcc38ee1d9f913d817a2024e > commit 97ae27252f4962d0fcc38ee1d9f913d817a2024e
Author: Gal Ofri <gal.ofri@storing.io>
Date:   Mon Jun 7 14:07:03 2021 +0300
    md/raid5: avoid device_lock in read_one_chunk()

Please do share if you reach more iops in your env than described in the commit.

Cheers,
Gal,
Volumez (formerly storing.io)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-08-05 21:10           ` Finlayson, James M CIV (USA)
@ 2021-08-08 14:43             ` Gal Ofri
  2021-08-09 19:01               ` Finlayson, James M CIV (USA)
  0 siblings, 1 reply; 28+ messages in thread
From: Gal Ofri @ 2021-08-08 14:43 UTC (permalink / raw)
  To: Finlayson, James M CIV (USA); +Cc: 'linux-raid@vger.kernel.org'

On Thu, 5 Aug 2021 21:10:40 +0000
"Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@mail.mil> wrote:

> BLUF upfront with 5.14rc3 kernel that our SA built - md0 a 10+1+1 RAID5 - 5.332 M IOPS 20.3GiB/s, md1 a 10+1+1 RAID5, 5.892M IOPS 22.5GiB/s  - best hero numbers I've ever seen on mdraid  RAID5 IOPS.   I think the kernel patch is good.  Prior was  socket0 1.263M IOPS 4934MiB/s, socket1 1.071M IOSP, 4183MiB/s....   I'm willing to help push this as hard as we can until we hit a bottleneck outside of our control.   
That's great!
Thanks for sharing your results.
I'd appreciate it if you could run a sequential-reads workload (128k/256k) so
that we get a better sense of the throughput potential here.

> In my strict numa adherence with mdraid, I see lots of variability between reboots/assembles.    Sometimes md0 wins, sometimes md1 wins, and in my earlier runs md0 and md1 are notionally balanced.   I change nothing but see this variance.   I just cranked up a week long extended run of these 10+1+1s under the 5.14rc3 kernel and right now   md0 is doing 5M IOPS and md1 6.3M 
Given my humble experience with the code in question, I suspect that it is
not really optimized for NUMA awareness, so I find your findings quite
reasonable. I don't really have a good tip for that.

I'm focusing now on thin-provisioned logical volumes (LVM - it actually has a
much worse read bottleneck), but we have plans for researching
md/raid5 again soon to improve write workloads.
I'll ping you when I have a patch that might be relevant.

Cheers,
Gal

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-08-08 14:43             ` Gal Ofri
@ 2021-08-09 19:01               ` Finlayson, James M CIV (USA)
  2021-08-17 21:21                 ` Finlayson, James M CIV (USA)
  0 siblings, 1 reply; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-08-09 19:01 UTC (permalink / raw)
  To: 'Gal Ofri', 'linux-raid@vger.kernel.org'
  Cc: Finlayson, James M CIV (USA)

Sequential Performance:
BLUF: 1M sequential direct I/O reads at QD 128 - 85GiB/s across both 10+1+1 NUMA-aware, 128K-striped LUNs.   I still had the imbalance between NUMA 0 (44.5GiB/s) and NUMA 1 (39.4GiB/s), but that could still be drifting power management on the AMD Rome cores.   I tried a 1280K block size to get a full-stripe read, but Linux seems unfriendly to non-power-of-2 block sizes - performance decreased considerably (to roughly 20GiB/s) with the 10x128KB block size.   I think I ran for about 40 minutes with the 1M reads.
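
The sequential job was essentially the random-read job file with the pattern and block size changed - again a sketch (group/device names as before, NUMA pinning assumes fio was built with libnuma, 1M shown and 1280k being the full-stripe variant that fell apart):

[global]
rw=read
bs=1M
# bs=1280k   # the 10 x 128K full-stripe attempt that performed much worse
direct=1
ioengine=libaio
iodepth=128
numjobs=64
group_reporting=1

[socket0-md]
numa_cpu_nodes=0
numa_mem_policy=bind:0
filename=/dev/md0

[socket1-md]
new_group
numa_cpu_nodes=1
numa_mem_policy=bind:1
filename=/dev/md1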


socket0-md: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128
...
socket1-md: (g=1): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128
...
fio-3.26
Starting 128 processes

fio: terminating on signal 2

socket0-md: (groupid=0, jobs=64): err= 0: pid=1645360: Mon Aug  9 18:53:36 2021
  read: IOPS=45.6k, BW=44.5GiB/s (47.8GB/s)(114TiB/2626961msec)
    slat (usec): min=12, max=4463, avg=24.86, stdev=15.58
    clat (usec): min=249, max=1904.8k, avg=179674.12, stdev=138190.51
     lat (usec): min=295, max=1904.8k, avg=179699.07, stdev=138191.00
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    5], 10.00th=[    7], 20.00th=[   17],
     | 30.00th=[  106], 40.00th=[  116], 50.00th=[  209], 60.00th=[  226],
     | 70.00th=[  236], 80.00th=[  321], 90.00th=[  351], 95.00th=[  372],
     | 99.00th=[  472], 99.50th=[  481], 99.90th=[ 1267], 99.95th=[ 1401],
     | 99.99th=[ 1586]
   bw (  MiB/s): min=  967, max=114322, per=8.68%, avg=45897.69, stdev=330.42, samples=333433
   iops        : min=  929, max=114304, avg=45879.39, stdev=330.41, samples=333433
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.05%, 1000=0.06%
  lat (msec)   : 2=0.49%, 4=4.36%, 10=9.43%, 20=7.52%, 50=3.48%
  lat (msec)   : 100=2.70%, 250=47.39%, 500=24.25%, 750=0.09%, 1000=0.01%
  lat (msec)   : 2000=0.15%
  cpu          : usr=0.07%, sys=1.83%, ctx=77483816, majf=0, minf=37747
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=119750623,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1-md: (groupid=1, jobs=64): err= 0: pid=1645424: Mon Aug  9 18:53:36 2021
  read: IOPS=40.3k, BW=39.4GiB/s (42.3GB/s)(101TiB/2627054msec)
    slat (usec): min=12, max=57137, avg=23.77, stdev=27.80
    clat (usec): min=130, max=1746.1k, avg=203005.37, stdev=158045.10
     lat (usec): min=269, max=1746.1k, avg=203029.23, stdev=158045.27
    clat percentiles (usec):
     |  1.00th=[    570],  5.00th=[    693], 10.00th=[   2573],
     | 20.00th=[  21103], 30.00th=[ 102237], 40.00th=[ 143655],
     | 50.00th=[ 204473], 60.00th=[ 231736], 70.00th=[ 283116],
     | 80.00th=[ 320865], 90.00th=[ 421528], 95.00th=[ 455082],
     | 99.00th=[ 583009], 99.50th=[ 608175], 99.90th=[1061159],
     | 99.95th=[1166017], 99.99th=[1367344]
   bw (  MiB/s): min=  599, max=124821, per=-3.40%, avg=40571.79, stdev=319.36, samples=333904
   iops        : min=  568, max=124809, avg=40554.92, stdev=319.34, samples=333904
  lat (usec)   : 250=0.01%, 500=0.14%, 750=6.31%, 1000=2.60%
  lat (msec)   : 2=0.58%, 4=2.04%, 10=4.17%, 20=3.82%, 50=3.71%
  lat (msec)   : 100=5.91%, 250=32.86%, 500=33.81%, 750=3.81%, 1000=0.10%
  lat (msec)   : 2000=0.14%
  cpu          : usr=0.05%, sys=1.56%, ctx=71342745, majf=0, minf=37766
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=105992570,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=44.5GiB/s (47.8GB/s), 44.5GiB/s-44.5GiB/s (47.8GB/s-47.8GB/s), io=114TiB (126TB), run=2626961-2626961msec

Run status group 1 (all jobs):
   READ: bw=39.4GiB/s (42.3GB/s), 39.4GiB/s-39.4GiB/s (42.3GB/s-42.3GB/s), io=101TiB (111TB), run=2627054-2627054msec

Disk stats (read/write):
    md0: ios=960804546/0, merge=0/0, ticks=18446744072288672424/0, in_queue=18446744072288672424, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  nvme0n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme11n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme9n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme2n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme10n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme8n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme1n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme7n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
    md1: ios=850399203/0, merge=0/0, ticks=2118156441/0, in_queue=2118156441, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  nvme15n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme18n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme20n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme23n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme14n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme17n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme22n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme13n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme19n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme21n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme12n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme24n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

-----Original Message-----
From: Gal Ofri <gal.ofri@volumez.com> 
Sent: Sunday, August 8, 2021 10:44 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: 'linux-raid@vger.kernel.org' <linux-raid@vger.kernel.org>
Subject: Re: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

On Thu, 5 Aug 2021 21:10:40 +0000
"Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@mail.mil> wrote:

> BLUF upfront with 5.14rc3 kernel that our SA built - md0 a 10+1+1 
> RAID5 - 5.332 M IOPS 20.3GiB/s, md1 a 10+1+1 RAID5, 5.892M IOPS 22.5GiB/s  - best hero numbers I've ever seen on mdraid  RAID5 IOPS.   I think the kernel patch is good.  Prior was  socket0 1.263M IOPS 4934MiB/s, socket1 1.071M IOSP, 4183MiB/s....   I'm willing to help push this as hard as we can until we hit a bottleneck outside of our control.
That's great !
Thanks for sharing your results.
I'd appreciate if you could run a sequential-reads workload (128k/256k) so that we get a better sense of the throughput potential here.

> In my strict numa adherence with mdraid, I see lots of variability between reboots/assembles.    Sometimes md0 wins, sometimes md1 wins, and in my earlier runs md0 and md1 are notionally balanced.   I change nothing but see this variance.   I just cranked up a week long extended run of these 10+1+1s under the 5.14rc3 kernel and right now   md0 is doing 5M IOPS and md1 6.3M 
Given my humble experience with the code in question, I suspect that it is not really optimized for numa awareness, so I find your findings quite reasonable. I don't really have a good tip for that.

I'm focusing now on thin-provisioned logical volumes (lvm - it has a much worse reads bottleneck actually), but we have plans for researching
md/raid5 again soon to improve write workloads.
I'll ping you when I have a patch that might be relevant.

Cheers,
Gal

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-08-09 19:01               ` Finlayson, James M CIV (USA)
@ 2021-08-17 21:21                 ` Finlayson, James M CIV (USA)
  2021-08-18  0:45                   ` [Non-DoD Source] " Matt Wallis
  0 siblings, 1 reply; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-08-17 21:21 UTC (permalink / raw)
  To: 'linux-raid@vger.kernel.org'; +Cc: Finlayson, James M CIV (USA)

All,
A quick random-read performance update (this is the best I can do in "going for it" with all of the guidance from this list) - I'm thrilled.

5.14rc4 kernel, Gen 4 drives, all the AMD Rome BIOS tuning to keep I/O from power throttling, SMT turned on (off yielded higher performance but left no room for anything else), 15.36TB drives cut into 32 equal partitions, and 32 NUMA-aligned RAID5 9+1s built from the same partition index on NUMA0, with an LVM volume concatenating all 32 RAID5s into one.    I then do the exact same thing on NUMA1.
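
For anyone who wants to reproduce the layout, the build scripts reduce to something like this for the NUMA0 half.   A sketch only - the device names, the ~480GB partition size, the md numbering, and the VG/LV names are placeholders for what my scripts actually generate:

# 1) cut each NUMA0 drive into 32 equal partitions (roughly 480GB each on a 15.36TB drive)
for d in /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1; do    # the real script loops over all ten member drives
    parted -s $d mklabel gpt
    for i in $(seq 1 32); do
        parted -s $d mkpart p$i $(( (i-1)*480 ))GB $(( i*480 ))GB
    done
done
# 2) one 9+1 RAID5 per partition index, all members from NUMA node 0 drives
for i in $(seq 1 32); do
    mdadm --create /dev/md$i --level=5 --chunk=128 --raid-devices=10 /dev/nvme{0..9}n1p$i
done
# 3) concatenate the 32 arrays into one linear LV
pvcreate /dev/md{1..32}
vgcreate numa0 /dev/md{1..32}
lvcreate -n lv_numa0 -l 100%FREE numa0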

4K random reads, SMT off: sustained bandwidth of >90GB/s, sustained IOPS across both LVMs ~23M - the bad part is that only 7% of the system is left to do anything useful.
4K random reads, SMT on: sustained bandwidth of >84GB/s, sustained IOPS across both LVMs ~21M - 46.7% idle (0.73% user, 52.6% system time).
Takeaway - IMHO, there is no reason to turn off SMT; it helps way more than it hurts.

Without the partitioning and LVM shenanigans, with SMT on, the 5.14rc4 kernel, and most of the AMD BIOS tuning (not all), I'm at 46GB/s, 11.7M IOPS, 42.2% idle (3% user, 54.7% system time).

With the stock RHEL 8.4 4.18 kernel, SMT on, both the partitioning and LVM shenanigans, and most of the AMD BIOS tuning (not all), I'm at 81.5GB/s, 20.4M IOPS, 49% idle (5.5% user, 46.75% system time).

The question I have for the list: given my large drive sizes, it takes me a day to set up and build an mdraid/LVM configuration.    Has anybody found the "sweet spot" for how many partitions per drive?    I now have a script to generate the drive partitions, a script for building the mdraid volumes, and a procedure for unwinding from all of this and starting again.

If anybody knows the point of diminishing returns for the number of partitions per drive, it would save me a few days of letting 32 run for a day, then reconfiguring for 16, 8, 4, 2, 1.   I could just tear apart my LVMs and remake them with half as many RAID partitions, but depending upon how the NVMe drive is "RAINed" across its NAND chips, I might leave performance on the table.   The researcher in me says start over and don't make ANY assumptions.
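
The unwind procedure is, in sketch form (matching the placeholder names above), essentially the reverse of the build:

# tear down the NUMA0 half before re-partitioning with fewer partitions per drive
lvremove -y numa0/lv_numa0
vgremove -y numa0
pvremove -y /dev/md{1..32}
for i in $(seq 1 32); do
    mdadm --stop /dev/md$i
done
# clear md metadata from the old members, then wipe and re-partition
mdadm --zero-superblock /dev/nvme{0..9}n1p*
for d in /dev/nvme{0..9}n1; do
    wipefs -a $d
done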

As an aside, on this server I'm maintaining around 1.1M NUMA-aware IOPS per drive when hitting all 24 drives individually without RAID, so I'm thrilled with the performance ceiling of the RAID; I just have to find a way to make it something somebody would be willing to maintain.   Somewhere in there is a sweet spot between sustainability and performance.   Once I find it, I have to figure out whether there is something useful to do with this new toy.


Regards,
Jim




-----Original Message-----
From: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> 
Sent: Monday, August 9, 2021 3:02 PM
To: 'Gal Ofri' <gal.ofri@volumez.com>; 'linux-raid@vger.kernel.org' <linux-raid@vger.kernel.org>
Cc: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Subject: RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

Sequential Performance:
BLUF, 1M sequential, direct I/O  reads, QD 128  - 85GiB/s across both 10+1+1 NUMA aware 128K striped LUNS.   Had the imbalance between NUMA 0 44.5GiB/s and NUMA 1 39.4GiB/s but still could be drifting power management on the AMD Rome cores.    I tried a 1280K blocksize to try to get a full stripe read, but Linux seems so unfriendly to non-power of 2 blocksizes.... performance decreased considerably (20GiB/s ?) with the 10x128KB blocksize....   I think I ran for about 40 minutes with the 1M reads...


socket0-md: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128 ...
socket1-md: (g=1): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128 ...
fio-3.26
Starting 128 processes

fio: terminating on signal 2

socket0-md: (groupid=0, jobs=64): err= 0: pid=1645360: Mon Aug  9 18:53:36 2021
  read: IOPS=45.6k, BW=44.5GiB/s (47.8GB/s)(114TiB/2626961msec)
    slat (usec): min=12, max=4463, avg=24.86, stdev=15.58
    clat (usec): min=249, max=1904.8k, avg=179674.12, stdev=138190.51
     lat (usec): min=295, max=1904.8k, avg=179699.07, stdev=138191.00
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    5], 10.00th=[    7], 20.00th=[   17],
     | 30.00th=[  106], 40.00th=[  116], 50.00th=[  209], 60.00th=[  226],
     | 70.00th=[  236], 80.00th=[  321], 90.00th=[  351], 95.00th=[  372],
     | 99.00th=[  472], 99.50th=[  481], 99.90th=[ 1267], 99.95th=[ 1401],
     | 99.99th=[ 1586]
   bw (  MiB/s): min=  967, max=114322, per=8.68%, avg=45897.69, stdev=330.42, samples=333433
   iops        : min=  929, max=114304, avg=45879.39, stdev=330.41, samples=333433
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.05%, 1000=0.06%
  lat (msec)   : 2=0.49%, 4=4.36%, 10=9.43%, 20=7.52%, 50=3.48%
  lat (msec)   : 100=2.70%, 250=47.39%, 500=24.25%, 750=0.09%, 1000=0.01%
  lat (msec)   : 2000=0.15%
  cpu          : usr=0.07%, sys=1.83%, ctx=77483816, majf=0, minf=37747
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=119750623,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1-md: (groupid=1, jobs=64): err= 0: pid=1645424: Mon Aug  9 18:53:36 2021
  read: IOPS=40.3k, BW=39.4GiB/s (42.3GB/s)(101TiB/2627054msec)
    slat (usec): min=12, max=57137, avg=23.77, stdev=27.80
    clat (usec): min=130, max=1746.1k, avg=203005.37, stdev=158045.10
     lat (usec): min=269, max=1746.1k, avg=203029.23, stdev=158045.27
    clat percentiles (usec):
     |  1.00th=[    570],  5.00th=[    693], 10.00th=[   2573],
     | 20.00th=[  21103], 30.00th=[ 102237], 40.00th=[ 143655],
     | 50.00th=[ 204473], 60.00th=[ 231736], 70.00th=[ 283116],
     | 80.00th=[ 320865], 90.00th=[ 421528], 95.00th=[ 455082],
     | 99.00th=[ 583009], 99.50th=[ 608175], 99.90th=[1061159],
     | 99.95th=[1166017], 99.99th=[1367344]
   bw (  MiB/s): min=  599, max=124821, per=-3.40%, avg=40571.79, stdev=319.36, samples=333904
   iops        : min=  568, max=124809, avg=40554.92, stdev=319.34, samples=333904
  lat (usec)   : 250=0.01%, 500=0.14%, 750=6.31%, 1000=2.60%
  lat (msec)   : 2=0.58%, 4=2.04%, 10=4.17%, 20=3.82%, 50=3.71%
  lat (msec)   : 100=5.91%, 250=32.86%, 500=33.81%, 750=3.81%, 1000=0.10%
  lat (msec)   : 2000=0.14%
  cpu          : usr=0.05%, sys=1.56%, ctx=71342745, majf=0, minf=37766
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=105992570,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=44.5GiB/s (47.8GB/s), 44.5GiB/s-44.5GiB/s (47.8GB/s-47.8GB/s), io=114TiB (126TB), run=2626961-2626961msec

Run status group 1 (all jobs):
   READ: bw=39.4GiB/s (42.3GB/s), 39.4GiB/s-39.4GiB/s (42.3GB/s-42.3GB/s), io=101TiB (111TB), run=2627054-2627054msec

Disk stats (read/write):
    md0: ios=960804546/0, merge=0/0, ticks=18446744072288672424/0, in_queue=18446744072288672424, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  nvme0n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme11n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme9n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme2n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme10n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme8n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme1n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme7n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
    md1: ios=850399203/0, merge=0/0, ticks=2118156441/0, in_queue=2118156441, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  nvme15n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme18n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme20n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme23n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme14n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme17n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme22n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme13n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme19n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme21n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme12n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme24n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

-----Original Message-----
From: Gal Ofri <gal.ofri@volumez.com>
Sent: Sunday, August 8, 2021 10:44 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: 'linux-raid@vger.kernel.org' <linux-raid@vger.kernel.org>
Subject: Re: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

On Thu, 5 Aug 2021 21:10:40 +0000
"Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@mail.mil> wrote:

> BLUF upfront with 5.14rc3 kernel that our SA built - md0 a 10+1+1
> RAID5 - 5.332 M IOPS 20.3GiB/s, md1 a 10+1+1 RAID5, 5.892M IOPS 22.5GiB/s  - best hero numbers I've ever seen on mdraid  RAID5 IOPS.   I think the kernel patch is good.  Prior was  socket0 1.263M IOPS 4934MiB/s, socket1 1.071M IOSP, 4183MiB/s....   I'm willing to help push this as hard as we can until we hit a bottleneck outside of our control.
That's great !
Thanks for sharing your results.
I'd appreciate if you could run a sequential-reads workload (128k/256k) so that we get a better sense of the throughput potential here.

> In my strict numa adherence with mdraid, I see lots of variability between reboots/assembles.    Sometimes md0 wins, sometimes md1 wins, and in my earlier runs md0 and md1 are notionally balanced.   I change nothing but see this variance.   I just cranked up a week long extended run of these 10+1+1s under the 5.14rc3 kernel and right now   md0 is doing 5M IOPS and md1 6.3M 
Given my humble experience with the code in question, I suspect that it is not really optimized for numa awareness, so I find your findings quite reasonable. I don't really have a good tip for that.

I'm focusing now on thin-provisioned logical volumes (lvm - it has a much worse reads bottleneck actually), but we have plans for researching
md/raid5 again soon to improve write workloads.
I'll ping you when I have a patch that might be relevant.

Cheers,
Gal

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-08-17 21:21                 ` Finlayson, James M CIV (USA)
@ 2021-08-18  0:45                   ` Matt Wallis
  2021-08-18 10:20                     ` Finlayson, James M CIV (USA)
  0 siblings, 1 reply; 28+ messages in thread
From: Matt Wallis @ 2021-08-18  0:45 UTC (permalink / raw)
  To: Finlayson, James M CIV (USA); +Cc: linux-raid

Hi Jim,

Awesome stuff. I’m looking to get access back to a server I was using before for my tests so I can play some more myself.
I did wonder about your use case, and if you were planning to present the storage over a network to another server, or intended to use it as local storage for an application.

The problem is basically that we’re limited no matter what we do. There’s no way with current PCIe+networking to get that bandwidth outside the box, and you don’t have much compute left inside the box.

You could simplify the configuration a little bit by using a parallel file system like BeeGFS. Parallel file systems like to stripe data over multiple targets anyway, so you could remove the LVM layer, and simply present 64 RAID volumes for BeeGFS to write to.  

Normal parallel file system operation is to export the volumes over a network, but BeeGFS does have an alternate mode called BeeOND (BeeGFS On Demand), which builds dynamic file systems out of the local disks in multiple servers. You could potentially look at a single-server BeeOND configuration and see if that works, but I suspect you’d just be exchanging bottlenecks.

There’s a new parallel FS on the market that might also be of interest, called MadFS. It’s based on another parallel file system, but with certain parts rewritten in Rust, which significantly improved its ability to handle higher IOPS.

Hmm, just realised the box I had access to before won’t help, it was built on an older Intel platform so bottlenecked by PCIe lanes. I’ll have to see if I can get something newer.

Matt.

> On 18 Aug 2021, at 07:21, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> 
> All,
> A quick random performance update (this is the best I can do in "going for it" with all of the guidance from this list) - I'm thrilled.....
> 
> 5.14rc4 kernel Gen 4 drives, all AMD Rome BIOS tuning to keep I/O from power throttling,  SMT turned on (off yielded higher performance but left no room for anything else),  15.36TB drives cut into 32 equal partitions,  32 NUMA aligned raid5 9+1s from the same partition on NUMA0 combined with an LVM concatenating all 32 RAID5's into one volume.    I then do the exact same thing on NUMA1.
> 
> 4K random reads, SMT off, sustained bandwidth of > 90GB/s, sustained IOPS across both LVMs, ~23M - bad part, only 7% of the system left to do anything useful
> 4K random reads, SMT on, sustained bandwidth of > 84GB/s, sustained IOPS across both LVMs, ~21M - 46.7% idle (.73% users, 52.6% system time)
> Takeaway - IMHO, no reason to turn off SMT, it helps way more than it hurts...
> 
> Without the partitioning and lvm shenanigans, with SMT on, 5.14rc4 kernel, most AMD BIOS tuning (not all), I'm at 46GB/s, 11.7M IOPS , 42.2% idle (3% user, 54.7% system time)
> 
> With stock RHEL 8.4, 4.18 kernel, SMT on, both partitioning and LVM shenanigans, most AMD BIOS tuning (not all), I'm at 81.5GB/s, 20.4M IOPS, 49% idle (5.5% user, 46.75% system time)
> 
> The question I have for the list, given my large drive sizes, it takes me a day to set up and build an mdraid/lvm configuration.    Has anybody found the "sweet spot" for how many partitions per drive?    I now have a script to generate the drive partitions, a script for building the mdraid volumes, and a procedure for unwinding from all of this and starting again.    
> 
> If anybody knows the point of diminishing return for the number of partitions per drive to max out at, it would save me a few days of letting 32 run for a day, reconfiguring for 16, 8, 4, 2, 1....I could just tear apart my LVMs and remake them with half as many RAID partitions, but depending upon how the nvme drive is "RAINed" across NAND chips, I might leave performance on the table.   The researcher in me says, start over, don't make ANY assumptions.
> 
> As an aside, on the server, I'm maintaining around 1.1M  NUMA aware IOPS per drive, when hitting all 24 drives individually without RAID, so I'm thrilled with the performance ceiling with the RAID, I just have to find a way to make it something somebody would be willing to maintain.   Somewhere is a sweet spot between sustainability and performance.   Once I find that I have to figure out if there is something useful to do with this new toy.....
> 
> 
> Regards,
> Jim
> 
> 
> 
> 
> -----Original Message-----
> From: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> 
> Sent: Monday, August 9, 2021 3:02 PM
> To: 'Gal Ofri' <gal.ofri@volumez.com>; 'linux-raid@vger.kernel.org' <linux-raid@vger.kernel.org>
> Cc: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> Subject: RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
> 
> Sequential Performance:
> BLUF, 1M sequential, direct I/O  reads, QD 128  - 85GiB/s across both 10+1+1 NUMA aware 128K striped LUNS.   Had the imbalance between NUMA 0 44.5GiB/s and NUMA 1 39.4GiB/s but still could be drifting power management on the AMD Rome cores.    I tried a 1280K blocksize to try to get a full stripe read, but Linux seems so unfriendly to non-power of 2 blocksizes.... performance decreased considerably (20GiB/s ?) with the 10x128KB blocksize....   I think I ran for about 40 minutes with the 1M reads...
> 
> 
> socket0-md: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128 ...
> socket1-md: (g=1): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128 ...
> fio-3.26
> Starting 128 processes
> 
> fio: terminating on signal 2
> 
> socket0-md: (groupid=0, jobs=64): err= 0: pid=1645360: Mon Aug  9 18:53:36 2021
>  read: IOPS=45.6k, BW=44.5GiB/s (47.8GB/s)(114TiB/2626961msec)
>    slat (usec): min=12, max=4463, avg=24.86, stdev=15.58
>    clat (usec): min=249, max=1904.8k, avg=179674.12, stdev=138190.51
>     lat (usec): min=295, max=1904.8k, avg=179699.07, stdev=138191.00
>    clat percentiles (msec):
>     |  1.00th=[    3],  5.00th=[    5], 10.00th=[    7], 20.00th=[   17],
>     | 30.00th=[  106], 40.00th=[  116], 50.00th=[  209], 60.00th=[  226],
>     | 70.00th=[  236], 80.00th=[  321], 90.00th=[  351], 95.00th=[  372],
>     | 99.00th=[  472], 99.50th=[  481], 99.90th=[ 1267], 99.95th=[ 1401],
>     | 99.99th=[ 1586]
>   bw (  MiB/s): min=  967, max=114322, per=8.68%, avg=45897.69, stdev=330.42, samples=333433
>   iops        : min=  929, max=114304, avg=45879.39, stdev=330.41, samples=333433
>  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.05%, 1000=0.06%
>  lat (msec)   : 2=0.49%, 4=4.36%, 10=9.43%, 20=7.52%, 50=3.48%
>  lat (msec)   : 100=2.70%, 250=47.39%, 500=24.25%, 750=0.09%, 1000=0.01%
>  lat (msec)   : 2000=0.15%
>  cpu          : usr=0.07%, sys=1.83%, ctx=77483816, majf=0, minf=37747
>  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>     issued rwts: total=119750623,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>     latency   : target=0, window=0, percentile=100.00%, depth=128
> socket1-md: (groupid=1, jobs=64): err= 0: pid=1645424: Mon Aug  9 18:53:36 2021
>  read: IOPS=40.3k, BW=39.4GiB/s (42.3GB/s)(101TiB/2627054msec)
>    slat (usec): min=12, max=57137, avg=23.77, stdev=27.80
>    clat (usec): min=130, max=1746.1k, avg=203005.37, stdev=158045.10
>     lat (usec): min=269, max=1746.1k, avg=203029.23, stdev=158045.27
>    clat percentiles (usec):
>     |  1.00th=[    570],  5.00th=[    693], 10.00th=[   2573],
>     | 20.00th=[  21103], 30.00th=[ 102237], 40.00th=[ 143655],
>     | 50.00th=[ 204473], 60.00th=[ 231736], 70.00th=[ 283116],
>     | 80.00th=[ 320865], 90.00th=[ 421528], 95.00th=[ 455082],
>     | 99.00th=[ 583009], 99.50th=[ 608175], 99.90th=[1061159],
>     | 99.95th=[1166017], 99.99th=[1367344]
>   bw (  MiB/s): min=  599, max=124821, per=-3.40%, avg=40571.79, stdev=319.36, samples=333904
>   iops        : min=  568, max=124809, avg=40554.92, stdev=319.34, samples=333904
>  lat (usec)   : 250=0.01%, 500=0.14%, 750=6.31%, 1000=2.60%
>  lat (msec)   : 2=0.58%, 4=2.04%, 10=4.17%, 20=3.82%, 50=3.71%
>  lat (msec)   : 100=5.91%, 250=32.86%, 500=33.81%, 750=3.81%, 1000=0.10%
>  lat (msec)   : 2000=0.14%
>  cpu          : usr=0.05%, sys=1.56%, ctx=71342745, majf=0, minf=37766
>  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>     issued rwts: total=105992570,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>     latency   : target=0, window=0, percentile=100.00%, depth=128
> 
> Run status group 0 (all jobs):
>   READ: bw=44.5GiB/s (47.8GB/s), 44.5GiB/s-44.5GiB/s (47.8GB/s-47.8GB/s), io=114TiB (126TB), run=2626961-2626961msec
> 
> Run status group 1 (all jobs):
>   READ: bw=39.4GiB/s (42.3GB/s), 39.4GiB/s-39.4GiB/s (42.3GB/s-42.3GB/s), io=101TiB (111TB), run=2627054-2627054msec
> 
> Disk stats (read/write):
>    md0: ios=960804546/0, merge=0/0, ticks=18446744072288672424/0, in_queue=18446744072288672424, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>  nvme0n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme11n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme9n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme2n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme10n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme8n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme1n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme7n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>    md1: ios=850399203/0, merge=0/0, ticks=2118156441/0, in_queue=2118156441, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>  nvme15n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme18n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme20n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme23n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme14n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme17n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme22n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme13n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme19n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme21n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme12n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme24n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> 
> -----Original Message-----
> From: Gal Ofri <gal.ofri@volumez.com>
> Sent: Sunday, August 8, 2021 10:44 AM
> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> Cc: 'linux-raid@vger.kernel.org' <linux-raid@vger.kernel.org>
> Subject: Re: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
> 
> On Thu, 5 Aug 2021 21:10:40 +0000
> "Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@mail.mil> wrote:
> 
>> BLUF upfront with 5.14rc3 kernel that our SA built - md0 a 10+1+1
>> RAID5 - 5.332 M IOPS 20.3GiB/s, md1 a 10+1+1 RAID5, 5.892M IOPS 22.5GiB/s  - best hero numbers I've ever seen on mdraid  RAID5 IOPS.   I think the kernel patch is good.  Prior was  socket0 1.263M IOPS 4934MiB/s, socket1 1.071M IOSP, 4183MiB/s....   I'm willing to help push this as hard as we can until we hit a bottleneck outside of our control.
> That's great !
> Thanks for sharing your results.
> I'd appreciate if you could run a sequential-reads workload (128k/256k) so that we get a better sense of the throughput potential here.
> 
>> In my strict numa adherence with mdraid, I see lots of variability between reboots/assembles.    Sometimes md0 wins, sometimes md1 wins, and in my earlier runs md0 and md1 are notionally balanced.   I change nothing but see this variance.   I just cranked up a week long extended run of these 10+1+1s under the 5.14rc3 kernel and right now   md0 is doing 5M IOPS and md1 6.3M 
> Given my humble experience with the code in question, I suspect that it is not really optimized for numa awareness, so I find your findings quite reasonable. I don't really have a good tip for that.
> 
> I'm focusing now on thin-provisioned logical volumes (lvm - it has a much worse reads bottleneck actually), but we have plans for researching
> md/raid5 again soon to improve write workloads.
> I'll ping you when I have a patch that might be relevant.
> 
> Cheers,
> Gal

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-08-18  0:45                   ` [Non-DoD Source] " Matt Wallis
@ 2021-08-18 10:20                     ` Finlayson, James M CIV (USA)
  2021-08-18 19:48                       ` Doug Ledford
  2021-08-18 19:59                       ` Doug Ledford
  0 siblings, 2 replies; 28+ messages in thread
From: Finlayson, James M CIV (USA) @ 2021-08-18 10:20 UTC (permalink / raw)
  To: 'Matt Wallis', linux-raid

All,
I'm happy to be in the position to pioneer some of this "tuning" if nobody has done it before.   After updating this thread and then providing a status report to my leadership, it hit me that what we're really balancing is "how many mdraid kernel worker threads" it takes to hit max IOPS.  I'll go find that out.   If real-world testing becomes my contribution, so be it.   I was an O/S developer originally working in the I/O subsystem, but early in my career that effort was deprecated, so I've only been an integrator of COTS and open source for the last 30 years and my programming skills have dwindled to just perl and bash.   I don't have the skills necessary to make coding contributions.
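
For reference, the worker-thread count in question is exposed per array through sysfs, so it can be swept without rebuilding anything. A minimal sketch, with purely illustrative values:

   # raid456 exposes its stripe worker-thread count per array; 0 leaves the
   # work to the array's single raid5d thread.  The value below is only a
   # starting point for a sweep, not a recommendation.
   for md in md0 md1; do
       cat /sys/block/$md/md/group_thread_cnt        # current setting
       echo 8 > /sys/block/$md/md/group_thread_cnt   # try 8 worker threads
   done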

Where I'd like mdraid to get is such that we don't need to do this, but this is a marathon, not a sprint.

As far as PCIe lanes go, AMD has configurations where 160 Gen 4 lanes are available (by giving up one xGMI2 socket-to-socket interconnect link).   If you have NUMA awareness, the box seems highly capable.

Regards,
Jim



-----Original Message-----
From: Matt Wallis <mattw@madmonks.org> 
Sent: Tuesday, August 17, 2021 8:45 PM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: linux-raid@vger.kernel.org
Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

Hi Jim,

Awesome stuff. I’m looking to get access back to a server I was using before for my tests so I can play some more myself.
I did wonder about your use case, and if you were planning to present the storage over a network to another server, or intended to use it as local storage for an application.

The problem is basically that we’re limited no matter what we do. There’s no way with current PCIe+networking to get that bandwidth outside the box, and you don’t have much compute left inside the box.

You could simplify the configuration a little bit by using a parallel file system like BeeGFS. Parallel file systems like to stripe data over multiple targets anyway, so you could remove the LVM layer, and simply present 64 RAID volumes for BeeGFS to write to.  

Normal parallel file system operation is to export the volumes over a network, but BeeGFS does have an alternate mode called BeeOND, or BeeGFS on Demand, which builds up dynamic file systems using the local disks in multiple servers, you could potentially look at a single server BeeOND configuration and see if that worked, but I suspect you’d be exchanging bottlenecks.

There’s a new parallel FS on the market that might also be of interest, called MadFS. It’s based on another parallel file system but with certain parts re-written using the Rust language which significantly improved it’s ability to handle higher IOPs. 

Hmm, just realised the box I had access to before won’t help, it was built on an older Intel platform so bottlenecked by PCIe lanes. I’ll have to see if I can get something newer.

Matt.

> On 18 Aug 2021, at 07:21, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> 
> All,
> A quick random performance update (this is the best I can do in "going for it" with all of the guidance from this list) - I'm thrilled.....
> 
> 5.14rc4 kernel Gen 4 drives, all AMD Rome BIOS tuning to keep I/O from power throttling,  SMT turned on (off yielded higher performance but left no room for anything else),  15.36TB drives cut into 32 equal partitions,  32 NUMA aligned raid5 9+1s from the same partition on NUMA0 combined with an LVM concatenating all 32 RAID5's into one volume.    I then do the exact same thing on NUMA1.
> 
> 4K random reads, SMT off, sustained bandwidth of > 90GB/s, sustained 
> IOPS across both LVMs, ~23M - bad part, only 7% of the system left to 
> do anything useful 4K random reads, SMT on, sustained bandwidth of > 84GB/s, sustained IOPS across both LVMs, ~21M - 46.7% idle (.73% users, 52.6% system time) Takeaway - IMHO, no reason to turn off SMT, it helps way more than it hurts...
> 
> Without the partitioning and lvm shenanigans, with SMT on, 5.14rc4 
> kernel, most AMD BIOS tuning (not all), I'm at 46GB/s, 11.7M IOPS , 
> 42.2% idle (3% user, 54.7% system time)
> 
> With stock RHEL 8.4, 4.18 kernel, SMT on, both partitioning and LVM 
> shenanigans, most AMD BIOS tuning (not all), I'm at 81.5GB/s, 20.4M 
> IOPS, 49% idle (5.5% user, 46.75% system time)
> 
> The question I have for the list, given my large drive sizes, it takes me a day to set up and build an mdraid/lvm configuration.    Has anybody found the "sweet spot" for how many partitions per drive?    I now have a script to generate the drive partitions, a script for building the mdraid volumes, and a procedure for unwinding from all of this and starting again.    
> 
> If anybody knows the point of diminishing return for the number of partitions per drive to max out at, it would save me a few days of letting 32 run for a day, reconfiguring for 16, 8, 4, 2, 1....I could just tear apart my LVMs and remake them with half as many RAID partitions, but depending upon how the nvme drive is "RAINed" across NAND chips, I might leave performance on the table.   The researcher in me says, start over, don't make ANY assumptions.
> 
> As an aside, on the server, I'm maintaining around 1.1M  NUMA aware IOPS per drive, when hitting all 24 drives individually without RAID, so I'm thrilled with the performance ceiling with the RAID, I just have to find a way to make it something somebody would be willing to maintain.   Somewhere is a sweet spot between sustainability and performance.   Once I find that I have to figure out if there is something useful to do with this new toy.....
> 
> 
> Regards,
> Jim
> 
> 
> 
> 
> -----Original Message-----
> From: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> Sent: Monday, August 9, 2021 3:02 PM
> To: 'Gal Ofri' <gal.ofri@volumez.com>; 'linux-raid@vger.kernel.org' 
> <linux-raid@vger.kernel.org>
> Cc: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> Subject: RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
> 
> Sequential Performance:
> BLUF, 1M sequential, direct I/O  reads, QD 128  - 85GiB/s across both 10+1+1 NUMA aware 128K striped LUNS.   Had the imbalance between NUMA 0 44.5GiB/s and NUMA 1 39.4GiB/s but still could be drifting power management on the AMD Rome cores.    I tried a 1280K blocksize to try to get a full stripe read, but Linux seems so unfriendly to non-power of 2 blocksizes.... performance decreased considerably (20GiB/s ?) with the 10x128KB blocksize....   I think I ran for about 40 minutes with the 1M reads...
> 
> 
> socket0-md: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128 ...
> socket1-md: (g=1): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128 ...
> fio-3.26
> Starting 128 processes
> 
> fio: terminating on signal 2
> 
> socket0-md: (groupid=0, jobs=64): err= 0: pid=1645360: Mon Aug  9 
> 18:53:36 2021
>  read: IOPS=45.6k, BW=44.5GiB/s (47.8GB/s)(114TiB/2626961msec)
>    slat (usec): min=12, max=4463, avg=24.86, stdev=15.58
>    clat (usec): min=249, max=1904.8k, avg=179674.12, stdev=138190.51
>     lat (usec): min=295, max=1904.8k, avg=179699.07, stdev=138191.00
>    clat percentiles (msec):
>     |  1.00th=[    3],  5.00th=[    5], 10.00th=[    7], 20.00th=[   17],
>     | 30.00th=[  106], 40.00th=[  116], 50.00th=[  209], 60.00th=[  226],
>     | 70.00th=[  236], 80.00th=[  321], 90.00th=[  351], 95.00th=[  372],
>     | 99.00th=[  472], 99.50th=[  481], 99.90th=[ 1267], 99.95th=[ 1401],
>     | 99.99th=[ 1586]
>   bw (  MiB/s): min=  967, max=114322, per=8.68%, avg=45897.69, stdev=330.42, samples=333433
>   iops        : min=  929, max=114304, avg=45879.39, stdev=330.41, samples=333433
>  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.05%, 1000=0.06%
>  lat (msec)   : 2=0.49%, 4=4.36%, 10=9.43%, 20=7.52%, 50=3.48%
>  lat (msec)   : 100=2.70%, 250=47.39%, 500=24.25%, 750=0.09%, 1000=0.01%
>  lat (msec)   : 2000=0.15%
>  cpu          : usr=0.07%, sys=1.83%, ctx=77483816, majf=0, minf=37747
>  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>     issued rwts: total=119750623,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>     latency   : target=0, window=0, percentile=100.00%, depth=128
> socket1-md: (groupid=1, jobs=64): err= 0: pid=1645424: Mon Aug  9 
> 18:53:36 2021
>  read: IOPS=40.3k, BW=39.4GiB/s (42.3GB/s)(101TiB/2627054msec)
>    slat (usec): min=12, max=57137, avg=23.77, stdev=27.80
>    clat (usec): min=130, max=1746.1k, avg=203005.37, stdev=158045.10
>     lat (usec): min=269, max=1746.1k, avg=203029.23, stdev=158045.27
>    clat percentiles (usec):
>     |  1.00th=[    570],  5.00th=[    693], 10.00th=[   2573],
>     | 20.00th=[  21103], 30.00th=[ 102237], 40.00th=[ 143655],
>     | 50.00th=[ 204473], 60.00th=[ 231736], 70.00th=[ 283116],
>     | 80.00th=[ 320865], 90.00th=[ 421528], 95.00th=[ 455082],
>     | 99.00th=[ 583009], 99.50th=[ 608175], 99.90th=[1061159],
>     | 99.95th=[1166017], 99.99th=[1367344]
>   bw (  MiB/s): min=  599, max=124821, per=-3.40%, avg=40571.79, stdev=319.36, samples=333904
>   iops        : min=  568, max=124809, avg=40554.92, stdev=319.34, samples=333904
>  lat (usec)   : 250=0.01%, 500=0.14%, 750=6.31%, 1000=2.60%
>  lat (msec)   : 2=0.58%, 4=2.04%, 10=4.17%, 20=3.82%, 50=3.71%
>  lat (msec)   : 100=5.91%, 250=32.86%, 500=33.81%, 750=3.81%, 1000=0.10%
>  lat (msec)   : 2000=0.14%
>  cpu          : usr=0.05%, sys=1.56%, ctx=71342745, majf=0, minf=37766
>  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>     issued rwts: total=105992570,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>     latency   : target=0, window=0, percentile=100.00%, depth=128
> 
> Run status group 0 (all jobs):
>   READ: bw=44.5GiB/s (47.8GB/s), 44.5GiB/s-44.5GiB/s 
> (47.8GB/s-47.8GB/s), io=114TiB (126TB), run=2626961-2626961msec
> 
> Run status group 1 (all jobs):
>   READ: bw=39.4GiB/s (42.3GB/s), 39.4GiB/s-39.4GiB/s 
> (42.3GB/s-42.3GB/s), io=101TiB (111TB), run=2627054-2627054msec
> 
> Disk stats (read/write):
>    md0: ios=960804546/0, merge=0/0, ticks=18446744072288672424/0, 
> in_queue=18446744072288672424, util=100.00%, aggrios=0/0, 
> aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>  nvme0n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme11n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme9n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme2n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme10n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme8n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme1n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme7n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>    md1: ios=850399203/0, merge=0/0, ticks=2118156441/0, 
> in_queue=2118156441, util=100.00%, aggrios=0/0, aggrmerge=0/0, 
> aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>  nvme15n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme18n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme20n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme23n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme14n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme17n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme22n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme13n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme19n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme21n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme12n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme24n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> 
> -----Original Message-----
> From: Gal Ofri <gal.ofri@volumez.com>
> Sent: Sunday, August 8, 2021 10:44 AM
> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> Cc: 'linux-raid@vger.kernel.org' <linux-raid@vger.kernel.org>
> Subject: Re: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
> 
> On Thu, 5 Aug 2021 21:10:40 +0000
> "Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@mail.mil> wrote:
> 
>> BLUF upfront with 5.14rc3 kernel that our SA built - md0 a 10+1+1
>> RAID5 - 5.332 M IOPS 20.3GiB/s, md1 a 10+1+1 RAID5, 5.892M IOPS 22.5GiB/s  - best hero numbers I've ever seen on mdraid  RAID5 IOPS.   I think the kernel patch is good.  Prior was  socket0 1.263M IOPS 4934MiB/s, socket1 1.071M IOSP, 4183MiB/s....   I'm willing to help push this as hard as we can until we hit a bottleneck outside of our control.
> That's great !
> Thanks for sharing your results.
> I'd appreciate if you could run a sequential-reads workload (128k/256k) so that we get a better sense of the throughput potential here.
> 
>> In my strict numa adherence with mdraid, I see lots of variability between reboots/assembles.    Sometimes md0 wins, sometimes md1 wins, and in my earlier runs md0 and md1 are notionally balanced.   I change nothing but see this variance.   I just cranked up a week long extended run of these 10+1+1s under the 5.14rc3 kernel and right now   md0 is doing 5M IOPS and md1 6.3M 
> Given my humble experience with the code in question, I suspect that it is not really optimized for numa awareness, so I find your findings quite reasonable. I don't really have a good tip for that.
> 
> I'm focusing now on thin-provisioned logical volumes (lvm - it has a 
> much worse reads bottleneck actually), but we have plans for 
> researching
> md/raid5 again soon to improve write workloads.
> I'll ping you when I have a patch that might be relevant.
> 
> Cheers,
> Gal

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-08-18 10:20                     ` Finlayson, James M CIV (USA)
@ 2021-08-18 19:48                       ` Doug Ledford
  2021-08-18 19:59                       ` Doug Ledford
  1 sibling, 0 replies; 28+ messages in thread
From: Doug Ledford @ 2021-08-18 19:48 UTC (permalink / raw)
  To: Finlayson, James M CIV (USA), 'Matt Wallis', linux-raid

On Wed, 2021-08-18 at 10:20 +0000, Finlayson, James M CIV (USA) wrote:
> All,
> I'm happy to be in the position to pioneer some of this "tuning" if
> nobody has done this prior.

Here's what I would be interested to know: how does btrfs do using these
drives bare in raid5 mode?  You have to do the metadata in raid1 mode,
but you can tear down and retest btrfs filesystems on this in a matter
of minutes because it doesn't have to initialize the array.  So you
could try a btrfs per NUMA node, one big btrfs, or other configurations.
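
A minimal sketch of what one of those test setups could look like, purely as a benchmarking aid (the device names are placeholders for whichever NVMe drives sit on that NUMA node):

   # One btrfs per NUMA node: data in raid5, metadata in raid1.  No
   # initialization pass is needed, so teardown/rebuild takes minutes.
   mkfs.btrfs -f -d raid5 -m raid1 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
   mkdir -p /mnt/numa0
   mount /dev/nvme0n1 /mnt/numa0   # mounting any member device mounts the filesystem
   # ...run fio against /mnt/numa0, then umount and re-mkfs for the next layout.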

Now, let me explain why I think this would be interesting.  I'm a long
time user and developer on the MD raid stack, going all the way back to
the first SSE implementation of the raid5 xor operations.  I've always
used mdraid, and later lvm + mdraid, to build my boxes.  But I've come
to believe that there is an inherent weakness to the mdraid + lvm +
filesystem stack that btrfs (and zfs) overcome by building their raid
code into the filesystem itself.  The inherent weakness is that the
filesystem is the source of truth for what blocks on the device have or
have not been allocated, and what their contents should be.  The mdraid
stack therefore has to do things like initialize the array because it
doesn't know what's written and what isn't.  This also impacts
reconstruction and error recovery similarly.  But, more importantly, it
means that in an attempt to avoid always having huge latency penalties
caused by read-modify-write cycles, the mdraid subsystem maintains its
own cache layer (the stripe cache) separate from the official page cache
of the kernel.  Although I haven't instrumented things to see for sure
if I'm right, my suspicion is that the stripe cache sometimes gets
blocked up under memory pressure and stalls writes to the array.  The
symptom I see is that when I'm copying a large file to the server via
10Gig Ethernet, it will start at 900MB/s and may stay that fast for the
entire operation, but other times the copy will stall, sometimes going
all the way to 0MB/s, for a random period of time.  My suspicion is that
when this happens, there is memory pressure and the raid5 code is having
trouble reading in blocks for read-modify-write operations when the
write it needs to perform is not a full stripe wide write.  This is
avoided when the filesystem is aware of the multi drive layout and
issues the reads itself.  So I strongly suspect that when I build my
next iteration of my home server, it's going to be btrfs (in fact, I
have a test install on it already, but I haven't had the time to do all
the testing needed to confirm it actually solves the problem of the
previous generation of my server).
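
One way to check that suspicion without instrumenting anything: the stripe cache is visible through sysfs, so it can be watched while a large copy stalls. A sketch, with an illustrative size bump:

   # stripe_cache_active pinned at stripe_cache_size during a stall would
   # point at the stripe cache as the choke point.
   grep . /sys/block/md0/md/stripe_cache_size /sys/block/md0/md/stripe_cache_active
   echo 8192 > /sys/block/md0/md/stripe_cache_size   # illustrative value; costs RAM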

>    After updating this thread and then providing a status report to my
> leadership, it hit me on what we're really balancing is "how many
> mdraid kernel worker threads" it takes to hit max IOPS.  I'll go find
> that out.     If real world testing becomes my contribution , so be
> it.     I was an O/S developer originally working in the I/O
> subsystem, but early in my career that effort was deprecated, so I've
> only been an integrator of COTS and open source for the last 30 years
> and my programming skills have minimized to just perl and bash.   I
> don't have the skills necessary to make coding contributions.
> 
> Where I'd like mdraid to get is such that we don't need to do this,
> but this is a marathon, not a sprint.
> 
> As far as the PCIe lanes, AMD has situations where 160 gen 4 lanes are
> available (deleting 1 XGMI2 socket to socket interconnect).   If you
> have NUMA awareness, the box seems highly capable.    
> 
> Regards,
> Jim
> 
> 
> 
> -----Original Message-----
> From: Matt Wallis <mattw@madmonks.org> 
> Sent: Tuesday, August 17, 2021 8:45 PM
> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> Cc: linux-raid@vger.kernel.org
> Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread
> IOPS - AMD ROME what am I missing?????
> 
> Hi Jim,
> 
> Awesome stuff. I’m looking to get access back to a server I was using
> before for my tests so I can play some more myself.
> I did wonder about your use case, and if you were planning to present
> the storage over a network to another server, or intended to use it as
> local storage for an application.
> 
> The problem is basically that we’re limited no matter what we do.
> There’s no way with current PCIe+networking to get that bandwidth
> outside the box, and you don’t have much compute left inside the box.
> 
> You could simplify the configuration a little bit by using a parallel
> file system like BeeGFS. Parallel file systems like to stripe data
> over multiple targets anyway, so you could remove the LVM layer, and
> simply present 64 RAID volumes for BeeGFS to write to.  
> 
> Normal parallel file system operation is to export the volumes over a
> network, but BeeGFS does have an alternate mode called BeeOND, or
> BeeGFS on Demand, which builds up dynamic file systems using the local
> disks in multiple servers, you could potentially look at a single
> server BeeOND configuration and see if that worked, but I suspect
> you’d be exchanging bottlenecks.
> 
> There’s a new parallel FS on the market that might also be of
> interest, called MadFS. It’s based on another parallel file system but
> with certain parts re-written using the Rust language which
> significantly improved it’s ability to handle higher IOPs. 
> 
> Hmm, just realised the box I had access to before won’t help, it was
> built on an older Intel platform so bottlenecked by PCIe lanes. I’ll
> have to see if I can get something newer.
> 
> Matt.
> 
> > On 18 Aug 2021, at 07:21, Finlayson, James M CIV (USA) <
> > james.m.finlayson4.civ@mail.mil> wrote:
> > 
> > All,
> > A quick random performance update (this is the best I can do in
> > "going for it" with all of the guidance from this list) - I'm
> > thrilled.....
> > 
> > 5.14rc4 kernel Gen 4 drives, all AMD Rome BIOS tuning to keep I/O
> > from power throttling,  SMT turned on (off yielded higher
> > performance but left no room for anything else),  15.36TB drives cut
> > into 32 equal partitions,  32 NUMA aligned raid5 9+1s from the same
> > partition on NUMA0 combined with an LVM concatenating all 32 RAID5's
> > into one volume.    I then do the exact same thing on NUMA1.
> > 
> > 4K random reads, SMT off, sustained bandwidth of > 90GB/s, sustained
> > IOPS across both LVMs, ~23M - bad part, only 7% of the system left
> > to 
> > do anything useful 4K random reads, SMT on, sustained bandwidth of >
> > 84GB/s, sustained IOPS across both LVMs, ~21M - 46.7% idle (.73%
> > users, 52.6% system time) Takeaway - IMHO, no reason to turn off
> > SMT, it helps way more than it hurts...
> > 
> > Without the partitioning and lvm shenanigans, with SMT on, 5.14rc4 
> > kernel, most AMD BIOS tuning (not all), I'm at 46GB/s, 11.7M IOPS , 
> > 42.2% idle (3% user, 54.7% system time)
> > 
> > With stock RHEL 8.4, 4.18 kernel, SMT on, both partitioning and LVM 
> > shenanigans, most AMD BIOS tuning (not all), I'm at 81.5GB/s, 20.4M 
> > IOPS, 49% idle (5.5% user, 46.75% system time)
> > 
> > The question I have for the list, given my large drive sizes, it
> > takes me a day to set up and build an mdraid/lvm configuration.   
> > Has anybody found the "sweet spot" for how many partitions per
> > drive?    I now have a script to generate the drive partitions, a
> > script for building the mdraid volumes, and a procedure for
> > unwinding from all of this and starting again.    
> > 
> > If anybody knows the point of diminishing return for the number of
> > partitions per drive to max out at, it would save me a few days of
> > letting 32 run for a day, reconfiguring for 16, 8, 4, 2, 1....I
> > could just tear apart my LVMs and remake them with half as many RAID
> > partitions, but depending upon how the nvme drive is "RAINed" across
> > NAND chips, I might leave performance on the table.   The researcher
> > in me says, start over, don't make ANY assumptions.
> > 
> > As an aside, on the server, I'm maintaining around 1.1M  NUMA aware
> > IOPS per drive, when hitting all 24 drives individually without
> > RAID, so I'm thrilled with the performance ceiling with the RAID, I
> > just have to find a way to make it something somebody would be
> > willing to maintain.   Somewhere is a sweet spot between
> > sustainability and performance.   Once I find that I have to figure
> > out if there is something useful to do with this new toy.....
> > 
> > 
> > Regards,
> > Jim
> > 
> > 
> > 
> > 
> > -----Original Message-----
> > From: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> > Sent: Monday, August 9, 2021 3:02 PM
> > To: 'Gal Ofri' <gal.ofri@volumez.com>; 'linux-raid@vger.kernel.org' 
> > <linux-raid@vger.kernel.org>
> > Cc: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> > Subject: RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe
> > randomread IOPS - AMD ROME what am I missing?????
> > 
> > Sequential Performance:
> > BLUF, 1M sequential, direct I/O  reads, QD 128  - 85GiB/s across
> > both 10+1+1 NUMA aware 128K striped LUNS.   Had the imbalance
> > between NUMA 0 44.5GiB/s and NUMA 1 39.4GiB/s but still could be
> > drifting power management on the AMD Rome cores.    I tried a 1280K
> > blocksize to try to get a full stripe read, but Linux seems so
> > unfriendly to non-power of 2 blocksizes.... performance decreased
> > considerably (20GiB/s ?) with the 10x128KB blocksize....   I think I
> > ran for about 40 minutes with the 1M reads...
> > 
> > 
> > socket0-md: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-
> > 1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128 ...
> > socket1-md: (g=1): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-
> > 1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128 ...
> > fio-3.26
> > Starting 128 processes
> > 
> > fio: terminating on signal 2
> > 
> > socket0-md: (groupid=0, jobs=64): err= 0: pid=1645360: Mon Aug  9 
> > 18:53:36 2021
> >  read: IOPS=45.6k, BW=44.5GiB/s (47.8GB/s)(114TiB/2626961msec)
> >    slat (usec): min=12, max=4463, avg=24.86, stdev=15.58
> >    clat (usec): min=249, max=1904.8k, avg=179674.12, stdev=138190.51
> >     lat (usec): min=295, max=1904.8k, avg=179699.07, stdev=138191.00
> >    clat percentiles (msec):
> >     |  1.00th=[    3],  5.00th=[    5], 10.00th=[    7], 20.00th=[  
> > 17],
> >     | 30.00th=[  106], 40.00th=[  116], 50.00th=[  209], 60.00th=[ 
> > 226],
> >     | 70.00th=[  236], 80.00th=[  321], 90.00th=[  351], 95.00th=[ 
> > 372],
> >     | 99.00th=[  472], 99.50th=[  481], 99.90th=[ 1267], 99.95th=[
> > 1401],
> >     | 99.99th=[ 1586]
> >   bw (  MiB/s): min=  967, max=114322, per=8.68%, avg=45897.69,
> > stdev=330.42, samples=333433
> >   iops        : min=  929, max=114304, avg=45879.39, stdev=330.41,
> > samples=333433
> >  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.05%, 1000=0.06%
> >  lat (msec)   : 2=0.49%, 4=4.36%, 10=9.43%, 20=7.52%, 50=3.48%
> >  lat (msec)   : 100=2.70%, 250=47.39%, 500=24.25%, 750=0.09%,
> > 1000=0.01%
> >  lat (msec)   : 2000=0.15%
> >  cpu          : usr=0.07%, sys=1.83%, ctx=77483816, majf=0,
> > minf=37747
> >  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
> > >=64=100.0%
> >     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> > >=64=0.0%
> >     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> > >=64=0.1%
> >     issued rwts: total=119750623,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> >     latency   : target=0, window=0, percentile=100.00%, depth=128
> > socket1-md: (groupid=1, jobs=64): err= 0: pid=1645424: Mon Aug  9 
> > 18:53:36 2021
> >  read: IOPS=40.3k, BW=39.4GiB/s (42.3GB/s)(101TiB/2627054msec)
> >    slat (usec): min=12, max=57137, avg=23.77, stdev=27.80
> >    clat (usec): min=130, max=1746.1k, avg=203005.37, stdev=158045.10
> >     lat (usec): min=269, max=1746.1k, avg=203029.23, stdev=158045.27
> >    clat percentiles (usec):
> >     |  1.00th=[    570],  5.00th=[    693], 10.00th=[   2573],
> >     | 20.00th=[  21103], 30.00th=[ 102237], 40.00th=[ 143655],
> >     | 50.00th=[ 204473], 60.00th=[ 231736], 70.00th=[ 283116],
> >     | 80.00th=[ 320865], 90.00th=[ 421528], 95.00th=[ 455082],
> >     | 99.00th=[ 583009], 99.50th=[ 608175], 99.90th=[1061159],
> >     | 99.95th=[1166017], 99.99th=[1367344]
> >   bw (  MiB/s): min=  599, max=124821, per=-3.40%, avg=40571.79,
> > stdev=319.36, samples=333904
> >   iops        : min=  568, max=124809, avg=40554.92, stdev=319.34,
> > samples=333904
> >  lat (usec)   : 250=0.01%, 500=0.14%, 750=6.31%, 1000=2.60%
> >  lat (msec)   : 2=0.58%, 4=2.04%, 10=4.17%, 20=3.82%, 50=3.71%
> >  lat (msec)   : 100=5.91%, 250=32.86%, 500=33.81%, 750=3.81%,
> > 1000=0.10%
> >  lat (msec)   : 2000=0.14%
> >  cpu          : usr=0.05%, sys=1.56%, ctx=71342745, majf=0,
> > minf=37766
> >  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
> > >=64=100.0%
> >     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> > >=64=0.0%
> >     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> > >=64=0.1%
> >     issued rwts: total=105992570,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> >     latency   : target=0, window=0, percentile=100.00%, depth=128
> > 
> > Run status group 0 (all jobs):
> >   READ: bw=44.5GiB/s (47.8GB/s), 44.5GiB/s-44.5GiB/s 
> > (47.8GB/s-47.8GB/s), io=114TiB (126TB), run=2626961-2626961msec
> > 
> > Run status group 1 (all jobs):
> >   READ: bw=39.4GiB/s (42.3GB/s), 39.4GiB/s-39.4GiB/s 
> > (42.3GB/s-42.3GB/s), io=101TiB (111TB), run=2627054-2627054msec
> > 
> > Disk stats (read/write):
> >    md0: ios=960804546/0, merge=0/0, ticks=18446744072288672424/0, 
> > in_queue=18446744072288672424, util=100.00%, aggrios=0/0, 
> > aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
> >  nvme0n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme11n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme9n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme2n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme10n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme8n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme1n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme7n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >    md1: ios=850399203/0, merge=0/0, ticks=2118156441/0, 
> > in_queue=2118156441, util=100.00%, aggrios=0/0, aggrmerge=0/0, 
> > aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
> >  nvme15n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme18n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme20n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme23n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme14n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme17n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme22n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme13n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme19n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme21n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme12n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >  nvme24n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> > 
> > -----Original Message-----
> > From: Gal Ofri <gal.ofri@volumez.com>
> > Sent: Sunday, August 8, 2021 10:44 AM
> > To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> > Cc: 'linux-raid@vger.kernel.org' <linux-raid@vger.kernel.org>
> > Subject: Re: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe
> > randomread IOPS - AMD ROME what am I missing?????
> > 
> > On Thu, 5 Aug 2021 21:10:40 +0000
> > "Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@mail.mil>
> > wrote:
> > 
> > > BLUF upfront with 5.14rc3 kernel that our SA built - md0 a 10+1+1
> > > RAID5 - 5.332 M IOPS 20.3GiB/s, md1 a 10+1+1 RAID5, 5.892M IOPS
> > > 22.5GiB/s  - best hero numbers I've ever seen on mdraid  RAID5
> > > IOPS.   I think the kernel patch is good.  Prior was  socket0
> > > 1.263M IOPS 4934MiB/s, socket1 1.071M IOSP, 4183MiB/s....   I'm
> > > willing to help push this as hard as we can until we hit a
> > > bottleneck outside of our control.
> > That's great !
> > Thanks for sharing your results.
> > I'd appreciate if you could run a sequential-reads workload
> > (128k/256k) so that we get a better sense of the throughput
> > potential here.
> > 
> > > In my strict numa adherence with mdraid, I see lots of variability
> > > between reboots/assembles.    Sometimes md0 wins, sometimes md1
> > > wins, and in my earlier runs md0 and md1 are notionally
> > > balanced.   I change nothing but see this variance.   I just
> > > cranked up a week long extended run of these 10+1+1s under the
> > > 5.14rc3 kernel and right now   md0 is doing 5M IOPS and md1 6.3M 
> > Given my humble experience with the code in question, I suspect that
> > it is not really optimized for numa awareness, so I find your
> > findings quite reasonable. I don't really have a good tip for that.
> > 
> > I'm focusing now on thin-provisioned logical volumes (lvm - it has a
> > much worse reads bottleneck actually), but we have plans for 
> > researching
> > md/raid5 again soon to improve write workloads.
> > I'll ping you when I have a patch that might be relevant.
> > 
> > Cheers,
> > Gal

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
  2021-08-18 10:20                     ` Finlayson, James M CIV (USA)
  2021-08-18 19:48                       ` Doug Ledford
@ 2021-08-18 19:59                       ` Doug Ledford
  1 sibling, 0 replies; 28+ messages in thread
From: Doug Ledford @ 2021-08-18 19:59 UTC (permalink / raw)
  To: Finlayson, James M CIV (USA), 'Matt Wallis', linux-raid

> > The question I have for the list, given my large drive sizes, it
> > takes me a day to set up and build an mdraid/lvm configuration.   
> > Has anybody found the "sweet spot" for how many partitions per
> > drive?    I now have a script to generate the drive partitions, a
> > script for building the mdraid volumes, and a procedure for
> > unwinding from all of this and starting again.    

I don't have a feeling for the sweet spot on the number of partitions,
but if you put too many devices in a raid5/6 array, you virtually
guarantee all writes will have to be read-modify-write writes instead of
full stripe writes.

So, when dealing with keeping the parity on the array in sync, a full
stripe write allows you to simply write all blocks in the stripe,
calculate the parity as you do so, and then write the parity out.  For a
partial stripe write, you either have to read in the blocks you aren't
writing and then treat it as a full stripe write and calculate the
parity, or you have to read in the blocks being written and the current
parity block, xor the blocks being overwritten out of the existing
parity block and then xor the blocks you are writing over the old ones
into the parity block, then write the new blocks and new parity out.
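
A small worked example makes the cost difference concrete (assuming a hypothetical 10-data + 1-parity RAID5 with 128K chunks, so a 1280K full stripe; the numbers are illustrative, not taken from Jim's exact config):

   full-stripe write:   P = D0 xor D1 xor ... xor D9
                        -> 0 reads, 11 chunk writes for 1280K of new data
   partial write of D3 only (read-modify-write path):
                        P_new = P_old xor D3_old xor D3_new
                        -> 2 chunk reads, 2 chunk writes for 128K of new data

   The wider the array, the less likely an ordinary write covers a whole
   stripe, so the read-modify-write path ends up dominating.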

For that reason, I usually try to keep my arrays to no more than 7 or 8
members.  A lot of times, for streaming testing, really high numbers of
drives in a parity raid array will seem to perform fine, but when under
real world conditions might not do so well.  There are also several
filesystems that will optimize their metadata layout when put on an
mdraid device (xfs and ext4), but I'm pretty sure that gets blocked when
you put lvm between the filesystem and the mdraid device.
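
If LVM does hide the geometry, the stripe parameters can still be handed to the filesystem explicitly at mkfs time. A sketch, assuming a 128K chunk and 10 data disks (the VG/LV names are made up):

   # xfs: su = chunk size, sw = number of data disks in the stripe
   mkfs.xfs -d su=128k,sw=10 /dev/vg_socket0/lv_stripe
   # ext4 equivalent in 4K filesystem blocks: stride = 128K/4K = 32,
   # stripe-width = stride * 10 data disks = 320
   mkfs.ext4 -E stride=32,stripe-width=320 /dev/vg_socket0/lv_stripe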

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2021-08-18 19:59 UTC | newest]

Thread overview: 28+ messages
2021-07-27 20:32 Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing????? Finlayson, James M CIV (USA)
2021-07-27 21:52 ` Chris Murphy
2021-07-27 22:42 ` Peter Grandi
2021-07-28 10:31 ` Matt Wallis
2021-07-28 10:43   ` [Non-DoD Source] " Finlayson, James M CIV (USA)
2021-07-29  0:54     ` [Non-DoD Source] " Matt Wallis
2021-07-29 16:35       ` Wols Lists
2021-07-29 18:12         ` Finlayson, James M CIV (USA)
2021-07-29 22:05       ` Finlayson, James M CIV (USA)
2021-07-30  8:28         ` Matt Wallis
2021-07-30  8:45           ` Miao Wang
2021-07-30  9:59             ` Finlayson, James M CIV (USA)
2021-07-30 14:03               ` Doug Ledford
2021-07-30 13:17             ` Peter Grandi
2021-07-30  9:54           ` Finlayson, James M CIV (USA)
2021-08-01 11:21 ` Gal Ofri
2021-08-03 14:59   ` [Non-DoD Source] " Finlayson, James M CIV (USA)
2021-08-04  9:33     ` Gal Ofri
     [not found] ` <AS8PR04MB799205817C4647DAC740DE9A91EA9@AS8PR04MB7992.eurprd04.prod.outlook.com>
     [not found]   ` <5EAED86C53DED2479E3E145969315A2385856AD0@UMECHPA7B.easf.csd.disa.mil>
     [not found]     ` <5EAED86C53DED2479E3E145969315A2385856AF7@UMECHPA7B.easf.csd.disa.mil>
2021-08-05 19:52       ` Finlayson, James M CIV (USA)
2021-08-05 20:50         ` Finlayson, James M CIV (USA)
2021-08-05 21:10           ` Finlayson, James M CIV (USA)
2021-08-08 14:43             ` Gal Ofri
2021-08-09 19:01               ` Finlayson, James M CIV (USA)
2021-08-17 21:21                 ` Finlayson, James M CIV (USA)
2021-08-18  0:45                   ` [Non-DoD Source] " Matt Wallis
2021-08-18 10:20                     ` Finlayson, James M CIV (USA)
2021-08-18 19:48                       ` Doug Ledford
2021-08-18 19:59                       ` Doug Ledford
