From: Matt Wallis <mattw@madmonks.org>
To: "Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@mail.mil>
Cc: "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Subject: Re: Can't get RAID5/RAID6 NVMe random read IOPS - AMD ROME what am I missing?????
Date: Wed, 28 Jul 2021 20:31:44 +1000	[thread overview]
Message-ID: <07195088-7E4B-4586-BB45-04890265BD62@madmonks.org> (raw)
In-Reply-To: <5EAED86C53DED2479E3E145969315A2385841062@UMECHPA7B.easf.csd.disa.mil>

Hi Jim,

> On 28 Jul 2021, at 06:32, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil> wrote:
> 
> Sorry, this will be a long email with everything I find to be relevant. I can get over 110GB/s of 4kB random reads from the individual NVMe SSDs, but I'm at a loss as to why mdraid can only do a very small fraction of that. I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful. This is everything I do to a server to make the I/O crank. My role is that of a lab researcher/resident expert/consultant, and I'm just stumped as to why I can't do better. If there is a fine manual that somebody can point me to, I'm happy to read it…

I am probably going to get corrected on some if not all of this, but from what I understand, and from my own little experiments on a similar Intel-based system…
1. NVMe is stupid fast; you need a good chunk of CPU performance to max it out.
2. Most block IO in the kernel is limited in terms of threading; it may even be essentially single threaded. (This is where I will get corrected.)
3. AFAICT this includes mdraid: there's a single kernel thread per RAID device (mdX_raid6) handling all the RAID calculations. A quick way to see it is sketched below.
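
For example (a sketch, assuming an array named md0 that is under load while you look):

  # Per-thread CPU usage; if md0_raid6 sits near 100% of one core while
  # the drives are nowhere near their limits, that thread is the bottleneck.
  ps -eLo pid,comm,pcpu --sort=-pcpu | grep raid

  # Or watch it live in the threads view:
  top -H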

What I did to get IOPS up on a system with 24 NVMe drives, split into 12 per NUMA domain:
1. Create 8 partitions on each drive (this may be overkill; I just started here for some reason).
2. Create 8 RAID6 arrays with 1 partition per drive.
3. Use LVM to create a single striped logical volume over all 8 RAID volumes: RAID 0+6, as it were.

You now have an LVM layer that is basically doing nothing more than chunking the data as it comes in and sending the chunks to 8 separate RAID devices, each with its own threads, buffers, queues etc., all of which can be spread over more cores.
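
In case it's useful, here's roughly what that looks like as commands. This is a sketch for one NUMA domain's 12 drives; the device names, the 200G partition size and the vg_fast/lv_fast names are all hypothetical, so adjust for your hardware:

  # 1. Eight equal partitions per drive (assumes ~1.6TB drives)
  for d in /dev/nvme{0..11}n1; do
    sgdisk -Z "$d"                  # wipe any old partition table
    for p in $(seq 1 8); do
      sgdisk -n 0:0:+200G "$d"      # next free slot, 200GiB
    done
  done

  # 2. Eight 12-device RAID6 arrays, one partition from each drive
  for p in $(seq 1 8); do
    mdadm --create /dev/md$p --level=6 --raid-devices=12 \
      /dev/nvme{0..11}n1p$p
  done

  # 3. One striped LV across the eight arrays (the RAID 0 on top)
  pvcreate /dev/md{1..8}
  vgcreate vg_fast /dev/md{1..8}
  lvcreate -n lv_fast -i 8 -I 512 -l 100%FREE vg_fast

The 512KiB stripe size (-I) is just a starting point; tune it against your workload.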

I saw a significant (for me, significant is >20%) increase in IOPS doing this.
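
To compare before and after fairly, measure both the same way; something like this fio run (the LV path, job count and iodepth here are just examples):

  fio --name=randread --filename=/dev/vg_fast/lv_fast --direct=1 \
    --rw=randread --bs=4k --ioengine=libaio --iodepth=128 \
    --numjobs=16 --time_based --runtime=60 --group_reporting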

You still have RAID6 protection, but you might want to write a couple of scripts to help you manage the arrays, because a single failed drive now needs to be removed from 8 RAID volumes. 
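
A sketch of what those scripts boil down to (hypothetical: nvme3n1 is the dead drive):

  # Fail and remove the dead drive's partition from each array
  for p in $(seq 1 8); do
    mdadm /dev/md$p --fail /dev/nvme3n1p$p
    mdadm /dev/md$p --remove /dev/nvme3n1p$p
  done

  # After replacing and re-partitioning the drive:
  for p in $(seq 1 8); do
    mdadm /dev/md$p --add /dev/nvme3n1p$p
  done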

There's not a lot of capacity lost doing this; I'm pretty sure I lost less than 100MB to the partitions and the RAID overhead.

You would never consider this on spinning disk, of course; it's way too slow and you're just going to make it slower. NVMe, as you noticed, has the IOPS to spare, so I'm pretty sure it's just that we're not able to get the data to it fast enough.

Matt


Thread overview: 28+ messages
2021-07-27 20:32 Can't get RAID5/RAID6 NVMe random read IOPS - AMD ROME what am I missing????? Finlayson, James M CIV (USA)
2021-07-27 21:52 ` Chris Murphy
2021-07-27 22:42 ` Peter Grandi
2021-07-28 10:31 ` Matt Wallis [this message]
2021-07-28 10:43   ` [Non-DoD Source] " Finlayson, James M CIV (USA)
2021-07-29  0:54     ` [Non-DoD Source] " Matt Wallis
2021-07-29 16:35       ` Wols Lists
2021-07-29 18:12         ` Finlayson, James M CIV (USA)
2021-07-29 22:05       ` Finlayson, James M CIV (USA)
2021-07-30  8:28         ` Matt Wallis
2021-07-30  8:45           ` Miao Wang
2021-07-30  9:59             ` Finlayson, James M CIV (USA)
2021-07-30 14:03               ` Doug Ledford
2021-07-30 13:17             ` Peter Grandi
2021-07-30  9:54           ` Finlayson, James M CIV (USA)
2021-08-01 11:21 ` Gal Ofri
2021-08-03 14:59   ` [Non-DoD Source] " Finlayson, James M CIV (USA)
2021-08-04  9:33     ` Gal Ofri
     [not found] ` <AS8PR04MB799205817C4647DAC740DE9A91EA9@AS8PR04MB7992.eurprd04.prod.outlook.com>
     [not found]   ` <5EAED86C53DED2479E3E145969315A2385856AD0@UMECHPA7B.easf.csd.disa.mil>
     [not found]     ` <5EAED86C53DED2479E3E145969315A2385856AF7@UMECHPA7B.easf.csd.disa.mil>
2021-08-05 19:52       ` Finlayson, James M CIV (USA)
2021-08-05 20:50         ` Finlayson, James M CIV (USA)
2021-08-05 21:10           ` Finlayson, James M CIV (USA)
2021-08-08 14:43             ` Gal Ofri
2021-08-09 19:01               ` Finlayson, James M CIV (USA)
2021-08-17 21:21                 ` Finlayson, James M CIV (USA)
2021-08-18  0:45                   ` [Non-DoD Source] " Matt Wallis
2021-08-18 10:20                     ` Finlayson, James M CIV (USA)
2021-08-18 19:48                       ` Doug Ledford
2021-08-18 19:59                       ` Doug Ledford

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=07195088-7E4B-4586-BB45-04890265BD62@madmonks.org \
    --to=mattw@madmonks.org \
    --cc=james.m.finlayson4.civ@mail.mil \
    --cc=linux-raid@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line
before the message body.