From: "Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@mail.mil>
To: 'Gal Ofri' <gal.ofri@volumez.com>,
	"'linux-raid@vger.kernel.org'" <linux-raid@vger.kernel.org>
Cc: "Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@mail.mil>
Subject: RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
Date: Mon, 9 Aug 2021 19:01:38 +0000	[thread overview]
Message-ID: <5EAED86C53DED2479E3E145969315A2385857258@UMECHPA7B.easf.csd.disa.mil> (raw)
In-Reply-To: <20210808174331.1e444db9@gofri-dell>

Sequential Performance:
BLUF: 1M sequential, direct I/O reads at QD 128 - 85GiB/s aggregate across both 10+1+1 NUMA-aware, 128K-striped LUNs. There is still an imbalance between NUMA 0 (44.5GiB/s) and NUMA 1 (39.4GiB/s), but that could be drifting power management on the AMD Rome cores. I tried a 1280K blocksize to get a full-stripe read, but Linux seems unfriendly to non-power-of-2 blocksizes; performance decreased considerably (roughly 20GiB/s?) with the 10x128KB blocksize. I ran for about 40 minutes with the 1M reads.
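The exact job file isn't in the thread, so here is a minimal sketch of an fio job that would produce output shaped like the results below. The device names (/dev/md0, /dev/md1), CPU ranges, and runtime are assumptions for illustration, not the actual settings used.

# sequential-read job sketch (assumed values noted in comments)
[global]
rw=read
bs=1024k
# bs=1280k would be the full-stripe (10x128k) variant that ran much slower
direct=1
ioengine=libaio
iodepth=128
numjobs=64
time_based=1
# assumed: ~40 minutes, per the description above
runtime=2400
group_reporting=1

[socket0-md]
# assumed device name for the NUMA-node-0 array
filename=/dev/md0
# assumed CPU list local to socket 0 on this two-socket Rome box
cpus_allowed=0-63
cpus_allowed_policy=split

[socket1-md]
new_group=1
# assumed device name for the NUMA-node-1 array
filename=/dev/md1
# assumed CPU list local to socket 1
cpus_allowed=64-127
cpus_allowed_policy=split

Running something like "fio seq-read.fio" with per-socket cpus_allowed pinning is what splits the two arrays into the two per-group reports (g=0 and g=1) shown below.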


socket0-md: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128
...
socket1-md: (g=1): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128
...
fio-3.26
Starting 128 processes

fio: terminating on signal 2

socket0-md: (groupid=0, jobs=64): err= 0: pid=1645360: Mon Aug  9 18:53:36 2021
  read: IOPS=45.6k, BW=44.5GiB/s (47.8GB/s)(114TiB/2626961msec)
    slat (usec): min=12, max=4463, avg=24.86, stdev=15.58
    clat (usec): min=249, max=1904.8k, avg=179674.12, stdev=138190.51
     lat (usec): min=295, max=1904.8k, avg=179699.07, stdev=138191.00
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    5], 10.00th=[    7], 20.00th=[   17],
     | 30.00th=[  106], 40.00th=[  116], 50.00th=[  209], 60.00th=[  226],
     | 70.00th=[  236], 80.00th=[  321], 90.00th=[  351], 95.00th=[  372],
     | 99.00th=[  472], 99.50th=[  481], 99.90th=[ 1267], 99.95th=[ 1401],
     | 99.99th=[ 1586]
   bw (  MiB/s): min=  967, max=114322, per=8.68%, avg=45897.69, stdev=330.42, samples=333433
   iops        : min=  929, max=114304, avg=45879.39, stdev=330.41, samples=333433
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.05%, 1000=0.06%
  lat (msec)   : 2=0.49%, 4=4.36%, 10=9.43%, 20=7.52%, 50=3.48%
  lat (msec)   : 100=2.70%, 250=47.39%, 500=24.25%, 750=0.09%, 1000=0.01%
  lat (msec)   : 2000=0.15%
  cpu          : usr=0.07%, sys=1.83%, ctx=77483816, majf=0, minf=37747
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=119750623,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1-md: (groupid=1, jobs=64): err= 0: pid=1645424: Mon Aug  9 18:53:36 2021
  read: IOPS=40.3k, BW=39.4GiB/s (42.3GB/s)(101TiB/2627054msec)
    slat (usec): min=12, max=57137, avg=23.77, stdev=27.80
    clat (usec): min=130, max=1746.1k, avg=203005.37, stdev=158045.10
     lat (usec): min=269, max=1746.1k, avg=203029.23, stdev=158045.27
    clat percentiles (usec):
     |  1.00th=[    570],  5.00th=[    693], 10.00th=[   2573],
     | 20.00th=[  21103], 30.00th=[ 102237], 40.00th=[ 143655],
     | 50.00th=[ 204473], 60.00th=[ 231736], 70.00th=[ 283116],
     | 80.00th=[ 320865], 90.00th=[ 421528], 95.00th=[ 455082],
     | 99.00th=[ 583009], 99.50th=[ 608175], 99.90th=[1061159],
     | 99.95th=[1166017], 99.99th=[1367344]
   bw (  MiB/s): min=  599, max=124821, per=-3.40%, avg=40571.79, stdev=319.36, samples=333904
   iops        : min=  568, max=124809, avg=40554.92, stdev=319.34, samples=333904
  lat (usec)   : 250=0.01%, 500=0.14%, 750=6.31%, 1000=2.60%
  lat (msec)   : 2=0.58%, 4=2.04%, 10=4.17%, 20=3.82%, 50=3.71%
  lat (msec)   : 100=5.91%, 250=32.86%, 500=33.81%, 750=3.81%, 1000=0.10%
  lat (msec)   : 2000=0.14%
  cpu          : usr=0.05%, sys=1.56%, ctx=71342745, majf=0, minf=37766
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=105992570,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=44.5GiB/s (47.8GB/s), 44.5GiB/s-44.5GiB/s (47.8GB/s-47.8GB/s), io=114TiB (126TB), run=2626961-2626961msec

Run status group 1 (all jobs):
   READ: bw=39.4GiB/s (42.3GB/s), 39.4GiB/s-39.4GiB/s (42.3GB/s-42.3GB/s), io=101TiB (111TB), run=2627054-2627054msec

Disk stats (read/write):
    md0: ios=960804546/0, merge=0/0, ticks=18446744072288672424/0, in_queue=18446744072288672424, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  nvme0n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme11n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme9n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme2n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme10n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme8n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme1n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme7n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
    md1: ios=850399203/0, merge=0/0, ticks=2118156441/0, in_queue=2118156441, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  nvme15n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme18n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme20n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme23n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme14n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme17n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme22n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme13n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme19n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme21n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme12n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme24n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

-----Original Message-----
From: Gal Ofri <gal.ofri@volumez.com> 
Sent: Sunday, August 8, 2021 10:44 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>
Cc: 'linux-raid@vger.kernel.org' <linux-raid@vger.kernel.org>
Subject: Re: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

On Thu, 5 Aug 2021 21:10:40 +0000
"Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@mail.mil> wrote:

> BLUF upfront with 5.14rc3 kernel that our SA built - md0 a 10+1+1 
> RAID5 - 5.332M IOPS 20.3GiB/s, md1 a 10+1+1 RAID5 - 5.892M IOPS 22.5GiB/s - best hero numbers I've ever seen on mdraid RAID5 IOPS. I think the kernel patch is good. Prior was socket0 1.263M IOPS 4934MiB/s, socket1 1.071M IOPS 4183MiB/s. I'm willing to help push this as hard as we can until we hit a bottleneck outside of our control.
That's great !
Thanks for sharing your results.
I'd appreciate it if you could run a sequential-reads workload (128k/256k) so that we get a better sense of the throughput potential here.

> In my strict NUMA adherence with mdraid, I see lots of variability between reboots/assembles. Sometimes md0 wins, sometimes md1 wins, and in my earlier runs md0 and md1 are notionally balanced. I change nothing but see this variance. I just cranked up a week-long extended run of these 10+1+1s under the 5.14rc3 kernel, and right now md0 is doing 5M IOPS and md1 6.3M
Given my humble experience with the code in question, I suspect that it is not really optimized for numa awareness, so I find your findings quite reasonable. I don't really have a good tip for that.

I'm focusing now on thin-provisioned logical volumes (lvm - it actually has a much worse read bottleneck), but we have plans for researching
md/raid5 again soon to improve write workloads.
I'll ping you when I have a patch that might be relevant.

Cheers,
Gal

Thread overview: 28+ messages
2021-07-27 20:32 Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing????? Finlayson, James M CIV (USA)
2021-07-27 21:52 ` Chris Murphy
2021-07-27 22:42 ` Peter Grandi
2021-07-28 10:31 ` Matt Wallis
2021-07-28 10:43   ` [Non-DoD Source] " Finlayson, James M CIV (USA)
2021-07-29  0:54     ` [Non-DoD Source] " Matt Wallis
2021-07-29 16:35       ` Wols Lists
2021-07-29 18:12         ` Finlayson, James M CIV (USA)
2021-07-29 22:05       ` Finlayson, James M CIV (USA)
2021-07-30  8:28         ` Matt Wallis
2021-07-30  8:45           ` Miao Wang
2021-07-30  9:59             ` Finlayson, James M CIV (USA)
2021-07-30 14:03               ` Doug Ledford
2021-07-30 13:17             ` Peter Grandi
2021-07-30  9:54           ` Finlayson, James M CIV (USA)
2021-08-01 11:21 ` Gal Ofri
2021-08-03 14:59   ` [Non-DoD Source] " Finlayson, James M CIV (USA)
2021-08-04  9:33     ` Gal Ofri
     [not found] ` <AS8PR04MB799205817C4647DAC740DE9A91EA9@AS8PR04MB7992.eurprd04.prod.outlook.com>
     [not found]   ` <5EAED86C53DED2479E3E145969315A2385856AD0@UMECHPA7B.easf.csd.disa.mil>
     [not found]     ` <5EAED86C53DED2479E3E145969315A2385856AF7@UMECHPA7B.easf.csd.disa.mil>
2021-08-05 19:52       ` Finlayson, James M CIV (USA)
2021-08-05 20:50         ` Finlayson, James M CIV (USA)
2021-08-05 21:10           ` Finlayson, James M CIV (USA)
2021-08-08 14:43             ` Gal Ofri
2021-08-09 19:01               ` Finlayson, James M CIV (USA) [this message]
2021-08-17 21:21                 ` Finlayson, James M CIV (USA)
2021-08-18  0:45                   ` [Non-DoD Source] " Matt Wallis
2021-08-18 10:20                     ` Finlayson, James M CIV (USA)
2021-08-18 19:48                       ` Doug Ledford
2021-08-18 19:59                       ` Doug Ledford
