From: Phil Turmel <philip@turmel.org>
To: Dallas Clement <dallas.a.clement@gmail.com>,
	John Stoffel <john@stoffel.org>
Cc: Mark Knecht <markknecht@gmail.com>,
	Linux-RAID <linux-raid@vger.kernel.org>
Subject: Re: best base / worst case RAID 5,6 write speeds
Date: Tue, 15 Dec 2015 14:22:00 -0500
Message-ID: <56706858.2040908@turmel.org>
In-Reply-To: <CAE9DZURepRB3k-pBnRg2Tx8GCzAr6zCzv+LJy4mK4CDdAkBYVQ@mail.gmail.com>

Hi Dallas,

On 12/15/2015 12:30 PM, Dallas Clement wrote:
> Thanks guys for all the ideas and help.
> 
> Phil,
> 
>> Very interesting indeed. I wonder if the extra I/O in flight at high
>> depths is consuming all available stripe cache space, possibly not
>> consistently. I'd raise and lower that in various combinations with
>> various iodepth settings.  Running out of stripe cache will cause
>> premature RMWs.
> 
> Okay, I'll play with that today.  I have to confess I'm not sure that
> I completely understand how the stripe cache works.  I think the idea
> is to batch I/Os into a complete stripe if possible and write out to
> the disks all in one go to avoid RMWs.  Other than alignment issues,
> I'm unclear on what triggers RMWs.  It seems, as Robert mentioned,
> that if the I/O block size is stripe aligned, there should never be
> RMWs.
>
> My stripe cache is 8192 btw.
>

Stripe cache is the kernel's workspace to compute parity or to recover
data from parity.  It works on 4k blocks.  Per "man md", the value is
the number of such blocks cached per device.  *The blocks in each cache
stripe are separated from each other on disk by the chunk size*.
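
For reference, the tunable lives in sysfs.  Here's a rough Python
sketch that reads it and estimates the memory the cache pins (count *
4k page * member devices); the md0 name is just a placeholder for
whatever your array is called:

#!/usr/bin/env python3
# Rough sketch: report stripe_cache_size and the memory it pins.
# Assumes the array is md0 -- substitute your own device name.

PAGE = 4096      # the stripe cache works on 4k blocks
DEVICES = 12     # member devices in the array

with open("/sys/block/md0/md/stripe_cache_size") as f:
    cache_stripes = int(f.read())

# One cache stripe holds one 4k page per member device.
memory = cache_stripes * DEVICES * PAGE
print(f"{cache_stripes} cache stripes ~= {memory / 2**20:.0f} MiB pinned")
# e.g. 8192 stripes * 12 devices * 4k ~= 384 MiB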

Let's examine some scenarios for your 128k chunk size, 12 devices.  You
have 8192 cache stripes of 12 blocks each:

1) Random write of 16k.  4 stripes will be allocated from the cache for
*all* of the devices, and filled for the devices written.  The raid5
state machine lets them sit briefly for a chance for more writes to the
other blocks in each stripe.

1a) If none come in, MD will request a read of the old data blocks and
the old parities.  When those arrive, it'll compute the new parities and
write both parities and new data blocks.  Total I/O: 32k read, 32k write.

1b) If other random writes come in for those stripes, chunk size spaced,
MD will wait a bit more.  Then it will read in any blocks that weren't
written, compute parity, and write all the new data and parity.  Total
I/O: 16k * n, possibly some reads, the rest writes.

2) Sequential write of stripe-aligned 1408k.  The first 128k allocates
32 cache stripes and fills their first block.  The next 128k fills the
second block of each cache stripe.  And so on, filling all the data
blocks in the cache stripes.  MD shortly notices a full cache stripe
write on each, so it just computes the parities and submits all of those
writes.

3) Sequential write of 256k, aligned or not.  As above, but you only
fill two blocks in each cache stripe.  MD then reads 1152k, computes
parity, and writes 384k.

4) Multiple back-to-back writes of 1408k aligned.  First grabs 32 cache
stripes and shortly queues all of those writes.  Next grabs another 32
cache stripes and queues more writes.  And then another 32 cache
stripes and writes.  The underlying layer, as its queue grows, notices
the adjacency of chunk writes from multiple top-level writes and starts
merging.  Stripe caches are still held, though, until each write is
completed.  If 256 top-level writes are in flight (8192/32), you've
exhausted your stripe cache (the sketch after these scenarios works
through the numbers).  Note that this is writes in flight in your
application *and* writes in flight from anything else.  Keep in mind
that merging might actually raise the completion latency of the earlier
writes.
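
To make the arithmetic in 1a, 3 and 4 easy to poke at, here's a small
back-of-envelope model in Python.  It isn't how MD schedules anything,
just the byte counting for your geometry (12-device raid5, 128k chunk,
4k blocks, stripe_cache_size 8192):

BLOCK   = 4 * 1024
CHUNK   = 128 * 1024
DEVICES = 12
DATA    = DEVICES - 1        # raid5: one chunk of parity per stripe
CACHE   = 8192               # cache stripes per device

def rmw_random_write(nbytes):
    """Scenario 1a: small random write, nothing else arrives."""
    blocks = nbytes // BLOCK
    reads  = blocks * BLOCK * 2     # old data + old parity
    writes = blocks * BLOCK * 2     # new data + new parity
    return reads, writes

def partial_stripe_write(chunks_written):
    """Scenario 3: sequential write covering part of one stripe."""
    reads  = (DATA - chunks_written) * CHUNK    # the missing data chunks
    writes = (chunks_written + 1) * CHUNK       # new data + parity
    return reads, writes

stripes_per_full_write = CHUNK // BLOCK         # scenario 4: 32 rows of 4k

print(rmw_random_write(16 * 1024))      # (32768, 32768): 32k read, 32k write
print(partial_stripe_write(2))          # (1179648, 393216): 1152k read, 384k write
print(CACHE // stripes_per_full_write)  # 256 full-stripe writes exhaust the cache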

I'm sure you can come up with more.  The key is that stripe parity
calculations must be performed on blocks separated on disk by the chunk
size.  Really big chunk sizes don't actually help parity raid, since
everything is broken down to 4k for the stripe cache, then re-merged
underneath it.
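
If the chunk-size separation isn't intuitive, here's a toy mapping from
an array byte offset to (data chunk slot, cache-stripe row).  It
ignores raid5's parity rotation entirely, so the slot numbers are
illustrative only, but it shows why writes spaced exactly one chunk
apart land in the same cache stripes:

BLOCK, CHUNK, DATA = 4096, 128 * 1024, 11    # 12-device raid5, 128k chunk

def locate(offset):
    stripe = offset // (CHUNK * DATA)         # which full stripe
    within = offset %  (CHUNK * DATA)
    slot   = within // CHUNK                  # which data chunk in that stripe
    row    = (within % CHUNK) // BLOCK        # which 4k row, i.e. cache stripe
    return slot, stripe * (CHUNK // BLOCK) + row

for off in (0, 128 * 1024, 256 * 1024):
    print(off, locate(off))    # slots 0, 1, 2 -- all in cache-stripe row 0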

> I wish this were for fun! ;)  Although this has been a fun discussion.
> I've learned a ton.  This effort is for work though.  I'd be all over
> the SSDs and caching otherwise.  I'm trying to characterize and then
> squeeze all of the performance I can out of a legacy NAS product.  I
> am constrained by the existing hardware.  Unfortunately I do not have
> the option of using SSDs or hardware RAID controllers.  I have to rely
> completely on Linux RAID.
> 
> I also need to optimize for large sequential writes (streaming video,
> audio, large file transfers), iSCSI (mostly used for hosting VMs), and
> random I/O (small and big files) as you would expect with a NAS.

On spinning rust, once you introduce any random writes, you've
effectively made the entire stack a random workload.  This is true for
all raid levels, but particularly true for parity raid due to the RMW
cycles.  If you really need great sequential performance, you can't
allow the VMs and the databases and small files on the same disks.

That said, I recommend a parity raid chunk size of 16k or 32k for all
workloads.  It greatly improves spatial locality for random writes,
reduces stripe cache hogging for sequential writes, and doesn't hurt
sequential reads too much.
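
As a rough illustration of that trade-off (same 12-device raid5
assumption as above), the same kind of counting shows how much smaller
a full-stripe write becomes, and how few cache stripes it holds, with a
small chunk:

BLOCK, DEVICES = 4096, 12
DATA = DEVICES - 1

for chunk in (16 * 1024, 32 * 1024, 128 * 1024):
    full_stripe   = chunk * DATA       # bytes needed for a full-stripe write
    cache_stripes = chunk // BLOCK     # cache stripes one such write holds
    print(f"{chunk // 1024:>4}k chunk: full stripe = {full_stripe // 1024}k, "
          f"holds {cache_stripes} cache stripes")

# 16k chunk: a 176k write already covers a full stripe and holds only 4
# cache stripes; with a 128k chunk you need 1408k and hold 32.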

Phil
