On Wed, 2021-08-18 at 10:20 +0000, Finlayson, James M CIV (USA) wrote:
> All,
> I'm happy to be in the position to pioneer some of this "tuning" if
> nobody has done this prior.

Here's what I would be interested to know: how does btrfs do using
these drives bare in raid5 mode? You have to do the metadata in raid1
mode, but you can tear down and retest btrfs filesystems on this in a
matter of minutes because it doesn't have to initialize the array. So
you could try a btrfs per NUMA node, one big btrfs, or other
configurations.

Now, let me explain why I think this would be interesting. I'm a
long-time user and developer on the MD raid stack, going all the way
back to the first SSE implementation of the raid5 xor operations. I've
always used mdraid, and later lvm + mdraid, to build my boxes. But
I've come to believe that there is an inherent weakness in the
mdraid + lvm + filesystem stack that btrfs (and zfs) overcome by
building their raid code into the filesystem itself.

The inherent weakness is that the filesystem is the source of truth
for which blocks on the device have or have not been allocated, and
what their contents should be. The mdraid stack therefore has to do
things like initialize the array because it doesn't know what's
written and what isn't. This similarly impacts reconstruction and
error recovery. But, more importantly, it means that in an attempt to
avoid always having huge latency penalties caused by read-modify-write
cycles, the mdraid subsystem maintains its own cache layer (the stripe
cache) separate from the official page cache of the kernel. Although I
haven't instrumented things to see for sure if I'm right, my suspicion
is that the stripe cache sometimes gets blocked up under memory
pressure and stalls writes to the array.
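(For what it's worth, the stripe cache can at least be observed and
tuned from userspace while a stall is happening; a rough sketch,
assuming an array at /dev/md0 -- the 8192 value is only an
illustration, not a recommendation.)

```shell
# Current stripe cache size, in units of one page (4K) per device.
cat /sys/block/md0/md/stripe_cache_size

# Number of stripes currently in use; if this pins at the cache size
# while a copy stalls, writers are waiting on free stripes.
cat /sys/block/md0/md/stripe_cache_active

# Enlarge the cache to give read-modify-write more headroom.
# Memory cost is roughly PAGE_SIZE * raid_disks * stripe_cache_size.
echo 8192 > /sys/block/md0/md/stripe_cache_size
```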
The symptom I see is that when I'm copying a large file to the server
via 10Gig Ethernet, it will start at 900MB/s and may stay that fast
for the entire operation, but other times the copy will stall,
sometimes dropping all the way to 0MB/s, for a random period of time.
My suspicion is that when this happens, there is memory pressure and
the raid5 code is having trouble reading in blocks for
read-modify-write operations when the write it needs to perform is not
a full-stripe-wide write. This is avoided when the filesystem is aware
of the multi-drive layout and issues the reads itself.

So I strongly suspect that when I build the next iteration of my home
server, it's going to be btrfs (in fact, I have a test install on it
already, but I haven't had the time to do all the testing needed to
confirm it actually solves the problem of the previous generation of
my server).

>    After updating this thread and then providing a status report to
> my leadership, it hit me that what we're really balancing is "how
> many mdraid kernel worker threads" it takes to hit max IOPS.  I'll
> go find that out.  If real-world testing becomes my contribution, so
> be it.  I was an O/S developer originally working in the I/O
> subsystem, but early in my career that effort was deprecated, so
> I've only been an integrator of COTS and open source for the last 30
> years and my programming skills have minimized to just perl and
> bash.  I don't have the skills necessary to make coding
> contributions.
>
> Where I'd like mdraid to get is such that we don't need to do this,
> but this is a marathon, not a sprint.
>
> As far as the PCIe lanes, AMD has situations where 160 gen 4 lanes
> are available (deleting 1 XGMI2 socket-to-socket interconnect).  If
> you have NUMA awareness, the box seems highly capable.
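(On the btrfs suggestion above: standing up a test filesystem is
essentially a one-liner, which is why retesting takes minutes rather
than a day; a minimal sketch, with placeholder device names.)

```shell
# Create a btrfs filesystem with raid5 data and raid1 metadata
# directly on the bare drives -- no partitions, no md, no lvm.
# /dev/nvme{0..11}n1 are placeholders for the drives on one node.
mkfs.btrfs -f -d raid5 -m raid1 /dev/nvme{0..11}n1

# Any member device can be given to mount; btrfs finds the rest.
mount /dev/nvme0n1 /mnt/test

# Tearing down is just umount + re-mkfs; there is no array
# initialization or resync step to wait out.
umount /mnt/test
```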
> Regards,
> Jim
>
>
> -----Original Message-----
> From: Matt Wallis
> Sent: Tuesday, August 17, 2021 8:45 PM
> To: Finlayson, James M CIV (USA)
> Cc: linux-raid@vger.kernel.org
> Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe random read
> IOPS - AMD ROME what am I missing?????
>
> Hi Jim,
>
> Awesome stuff. I’m looking to get access back to a server I was using
> before for my tests so I can play some more myself.
> I did wonder about your use case, and whether you were planning to
> present the storage over a network to another server, or intended to
> use it as local storage for an application.
>
> The problem is basically that we’re limited no matter what we do.
> There’s no way with current PCIe + networking to get that bandwidth
> outside the box, and you don’t have much compute left inside the box.
>
> You could simplify the configuration a little by using a parallel
> file system like BeeGFS. Parallel file systems like to stripe data
> over multiple targets anyway, so you could remove the LVM layer and
> simply present 64 RAID volumes for BeeGFS to write to.
>
> Normal parallel file system operation is to export the volumes over a
> network, but BeeGFS does have an alternate mode called BeeOND, or
> BeeGFS On Demand, which builds dynamic file systems using the local
> disks in multiple servers. You could potentially look at a
> single-server BeeOND configuration and see if that works, but I
> suspect you’d be exchanging bottlenecks.
>
> There’s a new parallel FS on the market that might also be of
> interest, called MadFS. It’s based on another parallel file system,
> but with certain parts rewritten in Rust, which significantly
> improved its ability to handle higher IOPS.
>
> Hmm, just realised the box I had access to before won’t help; it was
> built on an older Intel platform, so it's bottlenecked by PCIe lanes.
> I’ll have to see if I can get something newer.
>
> Matt.
> > On 18 Aug 2021, at 07:21, Finlayson, James M CIV (USA) <
> > james.m.finlayson4.civ@mail.mil> wrote:
> >
> > All,
> > A quick random performance update (this is the best I can do in
> > "going for it" with all of the guidance from this list) - I'm
> > thrilled.....
> >
> > 5.14rc4 kernel, Gen 4 drives, all AMD Rome BIOS tuning to keep I/O
> > from power throttling, SMT turned on (off yielded higher
> > performance but left no room for anything else), 15.36TB drives cut
> > into 32 equal partitions, 32 NUMA-aligned raid5 9+1s from the same
> > partition on NUMA0, combined with an LVM concatenating all 32
> > RAID5s into one volume.  I then do the exact same thing on NUMA1.
> >
> > 4K random reads, SMT off: sustained bandwidth >90GB/s, sustained
> > IOPS across both LVMs ~23M - bad part, only 7% of the system left
> > to do anything useful.  4K random reads, SMT on: sustained
> > bandwidth >84GB/s, sustained IOPS across both LVMs ~21M - 46.7%
> > idle (0.73% user, 52.6% system time).  Takeaway - IMHO, no reason
> > to turn off SMT, it helps way more than it hurts...
> >
> > Without the partitioning and lvm shenanigans, with SMT on, 5.14rc4
> > kernel, most AMD BIOS tuning (not all), I'm at 46GB/s, 11.7M IOPS,
> > 42.2% idle (3% user, 54.7% system time).
> >
> > With stock RHEL 8.4, 4.18 kernel, SMT on, both partitioning and LVM
> > shenanigans, most AMD BIOS tuning (not all), I'm at 81.5GB/s, 20.4M
> > IOPS, 49% idle (5.5% user, 46.75% system time).
> >
> > The question I have for the list: given my large drive sizes, it
> > takes me a day to set up and build an mdraid/lvm configuration.
> > Has anybody found the "sweet spot" for how many partitions per
> > drive?  I now have a script to generate the drive partitions, a
> > script for building the mdraid volumes, and a procedure for
> > unwinding from all of this and starting again.
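(For readers wanting to reproduce the layout described above, the
per-NUMA-node build could be scripted roughly like this. This is a
sketch only, not Jim's actual scripts; the device names, partition
count, and volume-group name are placeholders, and --assume-clean is
only acceptable because these are read benchmarks on scratch arrays.)

```shell
#!/bin/bash
# Sketch: carve each NVMe drive into 32 equal partitions, build one
# 9+1 RAID5 per partition index, then concatenate the 32 arrays into
# a single linear LVM volume for this NUMA node.
DRIVES=(/dev/nvme{0..9}n1)   # placeholder: the 10 drives on one node
NPARTS=32

for d in "${DRIVES[@]}"; do
    parted -s "$d" mklabel gpt
    for i in $(seq 1 "$NPARTS"); do
        start=$(( (i - 1) * 100 / NPARTS ))
        end=$(( i * 100 / NPARTS ))
        parted -s "$d" mkpart "p$i" "${start}%" "${end}%"
    done
done

for i in $(seq 1 "$NPARTS"); do
    # One 9+1 RAID5 built from partition $i of every drive.
    # --assume-clean skips the initial resync: fine for read
    # benchmarking on scratch data, NOT for production parity.
    mdadm --create "/dev/md$i" --level=5 --raid-devices=10 \
          --assume-clean "${DRIVES[@]/%/p$i}"
done

# Concatenate all 32 arrays into one linear logical volume.
pvcreate /dev/md{1..32}
vgcreate vg_numa0 /dev/md{1..32}
lvcreate -n lv0 -l 100%FREE vg_numa0
```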
> > If anybody knows the point of diminishing returns for the number
> > of partitions per drive to max out at, it would save me a few days
> > of letting 32 run for a day, reconfiguring for 16, 8, 4, 2, 1....
> > I could just tear apart my LVMs and remake them with half as many
> > RAID partitions, but depending upon how the nvme drive is "RAINed"
> > across NAND chips, I might leave performance on the table.  The
> > researcher in me says: start over, don't make ANY assumptions.
> >
> > As an aside, on the server I'm maintaining around 1.1M NUMA-aware
> > IOPS per drive when hitting all 24 drives individually without
> > RAID, so I'm thrilled with the performance ceiling with the RAID;
> > I just have to find a way to make it something somebody would be
> > willing to maintain.  Somewhere is a sweet spot between
> > sustainability and performance.  Once I find that, I have to
> > figure out if there is something useful to do with this new
> > toy.....
> >
> >
> > Regards,
> > Jim
> >
> >
> >
> >
> > -----Original Message-----
> > From: Finlayson, James M CIV (USA)
> > Sent: Monday, August 9, 2021 3:02 PM
> > To: 'Gal Ofri'; 'linux-raid@vger.kernel.org'
> > Cc: Finlayson, James M CIV (USA)
> > Subject: RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe
> > random read IOPS - AMD ROME what am I missing?????
> >
> > Sequential Performance:
> > BLUF, 1M sequential, direct I/O reads, QD 128 - 85GiB/s across
> > both 10+1+1 NUMA-aware 128K-striped LUNs.  Still had the imbalance
> > between NUMA 0 (44.5GiB/s) and NUMA 1 (39.4GiB/s), but that could
> > be drifting power management on the AMD Rome cores.  I tried a
> > 1280K blocksize to try to get a full-stripe read, but Linux seems
> > so unfriendly to non-power-of-2 blocksizes.... performance
> > decreased considerably (20GiB/s?) with the 10x128KB blocksize....
> > I think I ran for about 40 minutes with the 1M reads...
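(The output below appears consistent with a jobfile along these
lines. This is a reconstruction from the reported parameters, not the
actual jobfile; the LVM device paths are placeholders, and the NUMA
options require fio built with libnuma support.)

```ini
; Sketch: 1M sequential direct-I/O reads, 64 jobs per socket at
; iodepth 128, each reporting group pinned to its own NUMA node.
[global]
rw=read
bs=1M
direct=1
ioengine=libaio
iodepth=128
numjobs=64
group_reporting

[socket0-md]
filename=/dev/vg_numa0/lv0
numa_cpu_nodes=0
numa_mem_policy=bind:0

[socket1-md]
new_group
filename=/dev/vg_numa1/lv0
numa_cpu_nodes=1
numa_mem_policy=bind:1
```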
> >
> > socket0-md: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128
> > ...
> > socket1-md: (g=1): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128
> > ...
> > fio-3.26
> > Starting 128 processes
> >
> > fio: terminating on signal 2
> >
> > socket0-md: (groupid=0, jobs=64): err= 0: pid=1645360: Mon Aug  9 18:53:36 2021
> >   read: IOPS=45.6k, BW=44.5GiB/s (47.8GB/s)(114TiB/2626961msec)
> >     slat (usec): min=12, max=4463, avg=24.86, stdev=15.58
> >     clat (usec): min=249, max=1904.8k, avg=179674.12, stdev=138190.51
> >      lat (usec): min=295, max=1904.8k, avg=179699.07, stdev=138191.00
> >     clat percentiles (msec):
> >      |  1.00th=[    3],  5.00th=[    5], 10.00th=[    7], 20.00th=[   17],
> >      | 30.00th=[  106], 40.00th=[  116], 50.00th=[  209], 60.00th=[  226],
> >      | 70.00th=[  236], 80.00th=[  321], 90.00th=[  351], 95.00th=[  372],
> >      | 99.00th=[  472], 99.50th=[  481], 99.90th=[ 1267], 99.95th=[ 1401],
> >      | 99.99th=[ 1586]
> >    bw (  MiB/s): min=  967, max=114322, per=8.68%, avg=45897.69, stdev=330.42, samples=333433
> >    iops        : min=  929, max=114304, avg=45879.39, stdev=330.41, samples=333433
> >   lat (usec)   : 250=0.01%, 500=0.01%, 750=0.05%, 1000=0.06%
> >   lat (msec)   : 2=0.49%, 4=4.36%, 10=9.43%, 20=7.52%, 50=3.48%
> >   lat (msec)   : 100=2.70%, 250=47.39%, 500=24.25%, 750=0.09%, 1000=0.01%
> >   lat (msec)   : 2000=0.15%
> >   cpu          : usr=0.07%, sys=1.83%, ctx=77483816, majf=0, minf=37747
> >   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> >      issued rwts: total=119750623,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> >      latency   : target=0, window=0, percentile=100.00%, depth=128
> > socket1-md: (groupid=1, jobs=64): err= 0: pid=1645424: Mon Aug  9 18:53:36 2021
> >   read: IOPS=40.3k, BW=39.4GiB/s (42.3GB/s)(101TiB/2627054msec)
> >     slat (usec): min=12, max=57137, avg=23.77, stdev=27.80
> >     clat (usec): min=130, max=1746.1k, avg=203005.37, stdev=158045.10
> >      lat (usec): min=269, max=1746.1k, avg=203029.23, stdev=158045.27
> >     clat percentiles (usec):
> >      |  1.00th=[    570],  5.00th=[    693], 10.00th=[   2573],
> >      | 20.00th=[  21103], 30.00th=[ 102237], 40.00th=[ 143655],
> >      | 50.00th=[ 204473], 60.00th=[ 231736], 70.00th=[ 283116],
> >      | 80.00th=[ 320865], 90.00th=[ 421528], 95.00th=[ 455082],
> >      | 99.00th=[ 583009], 99.50th=[ 608175], 99.90th=[1061159],
> >      | 99.95th=[1166017], 99.99th=[1367344]
> >    bw (  MiB/s): min=  599, max=124821, per=-3.40%, avg=40571.79, stdev=319.36, samples=333904
> >    iops        : min=  568, max=124809, avg=40554.92, stdev=319.34, samples=333904
> >   lat (usec)   : 250=0.01%, 500=0.14%, 750=6.31%, 1000=2.60%
> >   lat (msec)   : 2=0.58%, 4=2.04%, 10=4.17%, 20=3.82%, 50=3.71%
> >   lat (msec)   : 100=5.91%, 250=32.86%, 500=33.81%, 750=3.81%, 1000=0.10%
> >   lat (msec)   : 2000=0.14%
> >   cpu          : usr=0.05%, sys=1.56%, ctx=71342745, majf=0, minf=37766
> >   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> >      issued rwts: total=105992570,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> >      latency   : target=0, window=0, percentile=100.00%, depth=128
> >
> > Run status group 0 (all jobs):
> >    READ: bw=44.5GiB/s (47.8GB/s), 44.5GiB/s-44.5GiB/s (47.8GB/s-47.8GB/s), io=114TiB (126TB), run=2626961-2626961msec
> >
> > Run status group 1 (all jobs):
> >    READ: bw=39.4GiB/s (42.3GB/s), 39.4GiB/s-39.4GiB/s (42.3GB/s-42.3GB/s), io=101TiB (111TB), run=2627054-2627054msec
> >
> > Disk stats (read/write):
> >     md0: ios=960804546/0, merge=0/0, ticks=18446744072288672424/0, in_queue=18446744072288672424, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
> >   nvme0n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme11n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme9n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme2n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme10n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme8n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme1n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme7n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >     md1: ios=850399203/0, merge=0/0, ticks=2118156441/0, in_queue=2118156441, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
> >   nvme15n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme18n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme20n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme23n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme14n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme17n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme22n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme13n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme19n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme21n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme12n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >   nvme24n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >
> >
> > -----Original Message-----
> > From: Gal Ofri
> > Sent: Sunday, August 8, 2021 10:44 AM
> > To: Finlayson, James M CIV (USA)
> > Cc: 'linux-raid@vger.kernel.org'
> > Subject: Re: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe
> > random read IOPS - AMD ROME what am I missing?????
> >
> > On Thu, 5 Aug 2021 21:10:40 +0000
> > "Finlayson, James M CIV (USA)" wrote:
> >
> > > BLUF with the 5.14rc3 kernel that our SA built - md0, a 10+1+1
> > > RAID5: 5.332M IOPS, 20.3GiB/s; md1, a 10+1+1 RAID5: 5.892M IOPS,
> > > 22.5GiB/s - best hero numbers I've ever seen on mdraid RAID5
> > > IOPS.  I think the kernel patch is good.  Prior was socket0
> > > 1.263M IOPS, 4934MiB/s; socket1 1.071M IOPS, 4183MiB/s....  I'm
> > > willing to help push this as hard as we can until we hit a
> > > bottleneck outside of our control.
> > That's great!
> > Thanks for sharing your results.
> > I'd appreciate it if you could run a sequential-reads workload
> > (128k/256k) so that we get a better sense of the throughput
> > potential here.
> >
> > > In my strict numa adherence with mdraid, I see lots of
> > > variability between reboots/assembles.  Sometimes md0 wins,
> > > sometimes md1 wins, and in my earlier runs md0 and md1 are
> > > notionally balanced.  I change nothing but see this variance.  I
> > > just cranked up a week-long extended run of these 10+1+1s under
> > > the 5.14rc3 kernel, and right now md0 is doing 5M IOPS and md1
> > > 6.3M.
> > Given my humble experience with the code in question, I suspect
> > that it is not really optimized for numa awareness, so I find your
> > findings quite reasonable. I don't really have a good tip for that.
> >
> > I'm focusing now on thin-provisioned logical volumes (lvm - it
> > actually has a much worse reads bottleneck), but we have plans for
> > researching md/raid5 again soon to improve write workloads.
> > I'll ping you when I have a patch that might be relevant.
> >
> > Cheers,
> > Gal

-- 
Doug Ledford
    GPG KeyID: B826A3330E572FDD
    Fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD