From: Matt Wallis <mattw@madmonks.org>
To: "Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@mail.mil>
Cc: "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
Date: Wed, 18 Aug 2021 10:45:04 +1000	[thread overview]
Message-ID: <300042B9-F46F-42CF-8FD7-F1C2FE0965E5@madmonks.org> (raw)
In-Reply-To: <5EAED86C53DED2479E3E145969315A238585E0EF@UMECHPA7B.easf.csd.disa.mil>

Hi Jim,

Awesome stuff. I’m looking to get access back to a server I was using for earlier tests so I can play around some more myself.
I did wonder about your use case, and whether you were planning to present the storage over a network to another server, or intended to use it as local storage for an application.

The problem is basically that we’re limited no matter what we do. There’s no way with current PCIe+networking to get that bandwidth outside the box, and you don’t have much compute left inside the box.
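
(For a rough sense of scale, using the numbers Jim reports below: 23M 4KiB IOPS is about 23,000,000 x 4096 B ≈ 94 GB/s, or roughly 750 Gb/s - on the order of two 400GbE links running flat out, before any protocol overhead.)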

You could simplify the configuration a little by using a parallel file system like BeeGFS. Parallel file systems like to stripe data over multiple targets anyway, so you could remove the LVM layer and simply present the 64 RAID volumes for BeeGFS to write to.
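
A minimal sketch of what that could look like, assuming hypothetical md device names and mount points (the storage-target registration step is left generic; check the beegfs-setup-storage options for your BeeGFS version):

# Format each RAID5 md device with XFS and mount it as its own directory;
# each directory then becomes one BeeGFS storage target - no LVM layer needed.
for i in $(seq 0 63); do
    mkfs.xfs -f "/dev/md${i}"                     # hypothetical device names
    mkdir -p "/data/beegfs/target${i}"
    mount "/dev/md${i}" "/data/beegfs/target${i}"
done
# Then register each mount point as a storage target with the BeeGFS
# management daemon (e.g. via beegfs-setup-storage).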

Normal parallel file system operation is to export the volumes over a network, but BeeGFS does have an alternate mode called BeeOND (BeeGFS On Demand), which builds dynamic file systems from the local disks in multiple servers. You could potentially look at a single-server BeeOND configuration and see if that works, but I suspect you’d just be exchanging bottlenecks.

There’s a new parallel FS on the market that might also be of interest, called MadFS. It’s based on another parallel file system, but with certain parts rewritten in Rust, which significantly improved its ability to handle higher IOPS.

Hmm, I just realised the box I had access to before won’t help; it was built on an older Intel platform, so it’s bottlenecked by PCIe lanes. I’ll have to see if I can get something newer.

Matt.

> On 18 Aug 2021, at 07:21, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> 
> All,
> A quick random performance update (this is the best I can do in "going for it" with all of the guidance from this list) - I'm thrilled.....
> 
> 5.14rc4 kernel, Gen 4 drives, all AMD Rome BIOS tuning to keep I/O from power throttling, SMT turned on (off yielded higher performance but left no room for anything else), 15.36TB drives cut into 32 equal partitions, 32 NUMA-aligned RAID5 9+1s built from the same partition index on NUMA0, combined with an LVM volume concatenating all 32 RAID5s into one volume. I then do the exact same thing on NUMA1.
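
(An illustrative sketch of the layout described above, for one NUMA node, with hypothetical device names and "9+1" read as 10 member devices per array - not the actual scripts:)

# Partition i of each of the 10 NUMA0-local drives forms one RAID5 array;
# LVM then concatenates the 32 arrays into a single linear volume.
DRIVES="nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1 nvme6n1 nvme7n1 nvme8n1 nvme9n1"
for p in $(seq 1 32); do
    parts=""
    for d in $DRIVES; do parts="$parts /dev/${d}p${p}"; done
    mdadm --create "/dev/md/numa0_r5_${p}" --level=5 --raid-devices=10 $parts
done
pvcreate /dev/md/numa0_r5_*
vgcreate numa0vg /dev/md/numa0_r5_*
lvcreate -n numa0lv -l 100%FREE numa0vg    # default linear LV = concatenation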
> 
> 4K random reads, SMT off: sustained bandwidth of >90GB/s, sustained IOPS across both LVMs ~23M. The bad part: only 7% of the system left to do anything useful.
> 4K random reads, SMT on: sustained bandwidth of >84GB/s, sustained IOPS across both LVMs ~21M, with 46.7% idle (0.73% user, 52.6% system time).
> Takeaway - IMHO, no reason to turn off SMT; it helps way more than it hurts...
> 
> Without the partitioning and lvm shenanigans, with SMT on, 5.14rc4 kernel, most AMD BIOS tuning (not all), I'm at 46GB/s, 11.7M IOPS , 42.2% idle (3% user, 54.7% system time)
> 
> With stock RHEL 8.4, 4.18 kernel, SMT on, both partitioning and LVM shenanigans, most AMD BIOS tuning (not all), I'm at 81.5GB/s, 20.4M IOPS, 49% idle (5.5% user, 46.75% system time)
> 
> The question I have for the list: given my large drive sizes, it takes me a day to set up and build an mdraid/LVM configuration. Has anybody found the "sweet spot" for how many partitions per drive? I now have a script to generate the drive partitions, a script for building the mdraid volumes, and a procedure for unwinding all of this and starting again.
> 
> If anybody knows the point of diminishing returns for the number of partitions per drive, it would save me a few days of letting 32 run for a day, then reconfiguring for 16, 8, 4, 2, 1... I could just tear apart my LVMs and remake them with half as many RAID partitions, but depending upon how the NVMe drive is "RAINed" across NAND chips, I might leave performance on the table. The researcher in me says: start over, don't make ANY assumptions.
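
(A hedged sketch of how that partition-count sweep could be parameterized so each teardown/rebuild changes only one variable; the device name is hypothetical, and a real script would compute exact, aligned sector boundaries rather than rough percentages:)

N=${1:-32}                  # rerun with 32, 16, 8, 4, 2, 1 between teardowns
DEV=/dev/nvme0n1            # hypothetical device name
parted -s "$DEV" mklabel gpt
for i in $(seq 0 $((N-1))); do
    start=$(( i * 100 / N ))
    end=$(( (i + 1) * 100 / N ))
    parted -s "$DEV" mkpart "p$((i+1))" "${start}%" "${end}%"
done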
> 
> As an aside, on the server I'm maintaining around 1.1M NUMA-aware IOPS per drive when hitting all 24 drives individually without RAID, so I'm thrilled with the performance ceiling with the RAID; I just have to find a way to make it something somebody would be willing to maintain. Somewhere there is a sweet spot between sustainability and performance. Once I find that, I have to figure out if there is something useful to do with this new toy...
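
(For reference, a NUMA-pinned per-drive 4K random-read job along those lines might look roughly like this; the drive and node numbers are assumed, and fio needs to be built with libnuma support for the numa_* options:)

# 4K random reads against one NVMe drive, pinned to its local NUMA node.
fio --name=nvme0-randread --filename=/dev/nvme0n1 \
    --rw=randread --bs=4k --direct=1 --ioengine=libaio \
    --iodepth=128 --numjobs=8 --group_reporting \
    --numa_cpu_nodes=0 --numa_mem_policy=bind:0 \
    --time_based --runtime=300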
> 
> 
> Regards,
> Jim
> 
> 
> 
> 
> -----Original Message-----
> From: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> 
> Sent: Monday, August 9, 2021 3:02 PM
> To: 'Gal Ofri' <gal.ofri@volumez.com>; 'linux-raid@vger.kernel.org' <linux-raid@vger.kernel.org>
> Cc: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> Subject: RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
> 
> Sequential Performance:
> BLUF: 1M sequential, direct I/O reads, QD 128 - 85GiB/s across both 10+1+1 NUMA-aware 128K-striped LUNs. There was the imbalance between NUMA 0 (44.5GiB/s) and NUMA 1 (39.4GiB/s), but that could still be drifting power management on the AMD Rome cores. I tried a 1280K blocksize to try to get a full-stripe read, but Linux seems very unfriendly to non-power-of-2 blocksizes... performance decreased considerably (20GiB/s?) with the 10x128KB blocksize... I think I ran for about 40 minutes with the 1M reads...
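
(A sketch of a job file that would produce output shaped like the run below: two 64-process groups pinned to sockets 0 and 1, 1 MiB reads at queue depth 128. The LV paths and the runtime are assumptions; the actual run was stopped manually.)

cat > seqread.fio <<'EOF'
[global]
rw=read
bs=1M
direct=1
ioengine=libaio
iodepth=128
numjobs=64
time_based
runtime=2700
group_reporting

[socket0-md]
new_group
numa_cpu_nodes=0
filename=/dev/socket0vg/socket0lv

[socket1-md]
new_group
numa_cpu_nodes=1
filename=/dev/socket1vg/socket1lv
EOF
fio seqread.fio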
> 
> 
> socket0-md: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128 ...
> socket1-md: (g=1): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128 ...
> fio-3.26
> Starting 128 processes
> 
> fio: terminating on signal 2
> 
> socket0-md: (groupid=0, jobs=64): err= 0: pid=1645360: Mon Aug  9 18:53:36 2021
>  read: IOPS=45.6k, BW=44.5GiB/s (47.8GB/s)(114TiB/2626961msec)
>    slat (usec): min=12, max=4463, avg=24.86, stdev=15.58
>    clat (usec): min=249, max=1904.8k, avg=179674.12, stdev=138190.51
>     lat (usec): min=295, max=1904.8k, avg=179699.07, stdev=138191.00
>    clat percentiles (msec):
>     |  1.00th=[    3],  5.00th=[    5], 10.00th=[    7], 20.00th=[   17],
>     | 30.00th=[  106], 40.00th=[  116], 50.00th=[  209], 60.00th=[  226],
>     | 70.00th=[  236], 80.00th=[  321], 90.00th=[  351], 95.00th=[  372],
>     | 99.00th=[  472], 99.50th=[  481], 99.90th=[ 1267], 99.95th=[ 1401],
>     | 99.99th=[ 1586]
>   bw (  MiB/s): min=  967, max=114322, per=8.68%, avg=45897.69, stdev=330.42, samples=333433
>   iops        : min=  929, max=114304, avg=45879.39, stdev=330.41, samples=333433
>  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.05%, 1000=0.06%
>  lat (msec)   : 2=0.49%, 4=4.36%, 10=9.43%, 20=7.52%, 50=3.48%
>  lat (msec)   : 100=2.70%, 250=47.39%, 500=24.25%, 750=0.09%, 1000=0.01%
>  lat (msec)   : 2000=0.15%
>  cpu          : usr=0.07%, sys=1.83%, ctx=77483816, majf=0, minf=37747
>  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>     issued rwts: total=119750623,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>     latency   : target=0, window=0, percentile=100.00%, depth=128
> socket1-md: (groupid=1, jobs=64): err= 0: pid=1645424: Mon Aug  9 18:53:36 2021
>  read: IOPS=40.3k, BW=39.4GiB/s (42.3GB/s)(101TiB/2627054msec)
>    slat (usec): min=12, max=57137, avg=23.77, stdev=27.80
>    clat (usec): min=130, max=1746.1k, avg=203005.37, stdev=158045.10
>     lat (usec): min=269, max=1746.1k, avg=203029.23, stdev=158045.27
>    clat percentiles (usec):
>     |  1.00th=[    570],  5.00th=[    693], 10.00th=[   2573],
>     | 20.00th=[  21103], 30.00th=[ 102237], 40.00th=[ 143655],
>     | 50.00th=[ 204473], 60.00th=[ 231736], 70.00th=[ 283116],
>     | 80.00th=[ 320865], 90.00th=[ 421528], 95.00th=[ 455082],
>     | 99.00th=[ 583009], 99.50th=[ 608175], 99.90th=[1061159],
>     | 99.95th=[1166017], 99.99th=[1367344]
>   bw (  MiB/s): min=  599, max=124821, per=-3.40%, avg=40571.79, stdev=319.36, samples=333904
>   iops        : min=  568, max=124809, avg=40554.92, stdev=319.34, samples=333904
>  lat (usec)   : 250=0.01%, 500=0.14%, 750=6.31%, 1000=2.60%
>  lat (msec)   : 2=0.58%, 4=2.04%, 10=4.17%, 20=3.82%, 50=3.71%
>  lat (msec)   : 100=5.91%, 250=32.86%, 500=33.81%, 750=3.81%, 1000=0.10%
>  lat (msec)   : 2000=0.14%
>  cpu          : usr=0.05%, sys=1.56%, ctx=71342745, majf=0, minf=37766
>  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>     issued rwts: total=105992570,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>     latency   : target=0, window=0, percentile=100.00%, depth=128
> 
> Run status group 0 (all jobs):
>   READ: bw=44.5GiB/s (47.8GB/s), 44.5GiB/s-44.5GiB/s (47.8GB/s-47.8GB/s), io=114TiB (126TB), run=2626961-2626961msec
> 
> Run status group 1 (all jobs):
>   READ: bw=39.4GiB/s (42.3GB/s), 39.4GiB/s-39.4GiB/s (42.3GB/s-42.3GB/s), io=101TiB (111TB), run=2627054-2627054msec
> 
> Disk stats (read/write):
>    md0: ios=960804546/0, merge=0/0, ticks=18446744072288672424/0, in_queue=18446744072288672424, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>  nvme0n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme11n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme9n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme2n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme10n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme8n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme1n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme7n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>    md1: ios=850399203/0, merge=0/0, ticks=2118156441/0, in_queue=2118156441, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>  nvme15n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme18n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme20n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme23n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme14n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme17n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme22n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme13n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme19n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme21n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme12n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>  nvme24n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> 
> -----Original Message-----
> From: Gal Ofri <gal.ofri@volumez.com>
> Sent: Sunday, August 8, 2021 10:44 AM
> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
> Cc: 'linux-raid@vger.kernel.org' <linux-raid@vger.kernel.org>
> Subject: Re: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
> 
> On Thu, 5 Aug 2021 21:10:40 +0000
> "Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@mail.mil> wrote:
> 
>> BLUF, with the 5.14rc3 kernel that our SA built: md0, a 10+1+1
>> RAID5 - 5.332M IOPS, 20.3GiB/s; md1, a 10+1+1 RAID5 - 5.892M IOPS, 22.5GiB/s - the best hero numbers I've ever seen for mdraid RAID5 IOPS. I think the kernel patch is good. Prior was socket0 1.263M IOPS, 4934MiB/s and socket1 1.071M IOPS, 4183MiB/s... I'm willing to help push this as hard as we can until we hit a bottleneck outside of our control.
> That's great !
> Thanks for sharing your results.
> I'd appreciate if you could run a sequential-reads workload (128k/256k) so that we get a better sense of the throughput potential here.
> 
>> In my strict NUMA adherence with mdraid, I see lots of variability between reboots/assembles. Sometimes md0 wins, sometimes md1 wins, and in my earlier runs md0 and md1 were notionally balanced. I change nothing but see this variance. I just cranked up a week-long extended run of these 10+1+1s under the 5.14rc3 kernel, and right now md0 is doing 5M IOPS and md1 6.3M.
> Given my humble experience with the code in question, I suspect that it is not really optimized for NUMA awareness, so I find your findings quite reasonable. I don't really have a good tip for that.
> 
> I'm focusing now on thin-provisioned logical volumes (LVM actually has a much worse read bottleneck), but we have plans to research md/raid5 again soon to improve write workloads.
> I'll ping you when I have a patch that might be relevant.
> 
> Cheers,
> Gal

Thread overview: 28+ messages
2021-07-27 20:32 Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing????? Finlayson, James M CIV (USA)
2021-07-27 21:52 ` Chris Murphy
2021-07-27 22:42 ` Peter Grandi
2021-07-28 10:31 ` Matt Wallis
2021-07-28 10:43   ` [Non-DoD Source] " Finlayson, James M CIV (USA)
2021-07-29  0:54     ` [Non-DoD Source] " Matt Wallis
2021-07-29 16:35       ` Wols Lists
2021-07-29 18:12         ` Finlayson, James M CIV (USA)
2021-07-29 22:05       ` Finlayson, James M CIV (USA)
2021-07-30  8:28         ` Matt Wallis
2021-07-30  8:45           ` Miao Wang
2021-07-30  9:59             ` Finlayson, James M CIV (USA)
2021-07-30 14:03               ` Doug Ledford
2021-07-30 13:17             ` Peter Grandi
2021-07-30  9:54           ` Finlayson, James M CIV (USA)
2021-08-01 11:21 ` Gal Ofri
2021-08-03 14:59   ` [Non-DoD Source] " Finlayson, James M CIV (USA)
2021-08-04  9:33     ` Gal Ofri
     [not found] ` <AS8PR04MB799205817C4647DAC740DE9A91EA9@AS8PR04MB7992.eurprd04.prod.outlook.com>
     [not found]   ` <5EAED86C53DED2479E3E145969315A2385856AD0@UMECHPA7B.easf.csd.disa.mil>
     [not found]     ` <5EAED86C53DED2479E3E145969315A2385856AF7@UMECHPA7B.easf.csd.disa.mil>
2021-08-05 19:52       ` Finlayson, James M CIV (USA)
2021-08-05 20:50         ` Finlayson, James M CIV (USA)
2021-08-05 21:10           ` Finlayson, James M CIV (USA)
2021-08-08 14:43             ` Gal Ofri
2021-08-09 19:01               ` Finlayson, James M CIV (USA)
2021-08-17 21:21                 ` Finlayson, James M CIV (USA)
2021-08-18  0:45                   ` Matt Wallis [this message]
2021-08-18 10:20                     ` [Non-DoD Source] " Finlayson, James M CIV (USA)
2021-08-18 19:48                       ` Doug Ledford
2021-08-18 19:59                       ` Doug Ledford
