From: Dallas Clement <dallas.a.clement@gmail.com>
To: John Stoffel <john@stoffel.org>
Cc: Mark Knecht <markknecht@gmail.com>,
	Phil Turmel <philip@turmel.org>,
	Linux-RAID <linux-raid@vger.kernel.org>
Subject: Re: best base / worst case RAID 5,6 write speeds
Date: Tue, 15 Dec 2015 17:07:20 -0600
Message-ID: <CAE9DZUQNBPNXFs69JRU0Q82TQ4RjAgpsc7voMgEzuhSZhmDjig@mail.gmail.com>
In-Reply-To: <22128.35881.182823.556362@quad.stoffel.home>

On Tue, Dec 15, 2015 at 3:54 PM, John Stoffel <john@stoffel.org> wrote:
>>>>>> "Dallas" == Dallas Clement <dallas.a.clement@gmail.com> writes:
>
> Dallas> Thanks guys for all the ideas and help.
> Dallas> Phil,
>
>>> Very interesting indeed. I wonder if the extra I/O in flight at high
>>> depths is consuming all available stripe cache space, possibly not
>>> consistently. I'd raise and lower that in various combinations with
>>> various combinations of iodepth.  Running out of stripe cache will cause
>>> premature RMWs.
>
> Dallas> Okay, I'll play with that today.  I have to confess I'm not
> Dallas> sure that I completely understand how the stripe cache works.
> Dallas> I think the idea is to batch I/Os into a complete stripe if
> Dallas> possible and write out to the disks all in one go to avoid
> Dallas> RMWs.  Other than alignment issues, I'm unclear on what
> Dallas> triggers RMWs.  It seems, as Robert mentioned, that if the
> Dallas> I/O block size is stripe aligned, there should never be RMWs.
>
> Remember, there's a bounding limit on both how large the stripe cache
> is, and how long (timewise) it will let the cache sit around waiting
> for new blocks to come in.  That's probably what you're hitting at
> times with the high queue depth numbers.
>
> I assume the blocktrace info would tell you more, but I haven't really
> a clue how to interpret it.
>
>
> Dallas> My stripe cache is 8192 btw.
>
> Dallas> John,
>
>>> I suspect you've hit a known problem-ish area with Linux disk io, which is that big queue depths aren't optimal.
>
> Dallas> Yes, certainly looks that way.  But maybe as Phil indicated I might be
> Dallas> exceeding my stripe cache.  I am still surprised that there are so
> Dallas> many RMWs even if the stripe cache has been exhausted.
>
>>> As you can see, it peaks at a queue depth of 4, and then tends
>>> downward before falling off a cliff.  So now what I'd do is keep the
>>> queue depth at 4, but vary the block size and other parameters and see
>>> how things change there.
>
> Dallas> Why do you think there is a gradual drop off after queue depth
> Dallas> of 4 and before it falls off the cliff?
>
> I think because the in-kernel sizes start getting bigger, and so the
> kernel spends more time queuing and caching the data and moving it
> around, instead of just shoveling it down to the disks as quick as it
> can.
>
> Dallas> I wish this were for fun! ;) Although this has been a fun
> Dallas> discussion.  I've learned a ton.  This effort is for work
> Dallas> though.  I'd be all over the SSDs and caching otherwise.  I'm
> Dallas> trying to characterize and then squeeze all of the performance
> Dallas> I can out of a legacy NAS product.  I am constrained by the
> Dallas> existing hardware.  Unfortunately I do not have the option of
> Dallas> using SSDs or hardware RAID controllers.  I have to rely
> Dallas> completely on Linux RAID.
>
> Ah... in that case, you need to do your testing from the NAS side,
> don't bother going to this level.  I'd honestly now just set your
> queue depth to 4 and move on to testing the NAS side of things, where
> you have one, two, four, eight, or more test boxes hitting the NAS
> box.
>
> Dallas> I also need to optimize for large sequential writes (streaming
> Dallas> video, audio, large file transfers), iSCSI (mostly used for
> Dallas> hosting VMs), and random I/O (small and big files) as you
> Dallas> would expect with a NAS.
>
> So you want to do everything at all once.  Fun.  So really I'd move
> back to the Network side, because unless your NAS box has more than
> 1GigE interface, and supports Bonding/trunking, you've hit the
> performance wall.
>
> Also, even if you get a ton of performance with large streaming
> writes, when you sprinkle in a small set of random IO/s, you're going
> to hit the cliff much sooner.  And in that case... it's another set of
> optimizations.
>
> Are you going to use NFSv3?  TCP?  UDP?  1500 MTU, 9000 MTU?  How many
> clients?  How active?
>
> Can you give up disk space for IOP/s?  So get away from the RAID6 and
> move to RAID1 mirrors with a stripe atop it, so that you maximize how
> many IOPs you can get.
>

Hi John.

> Remember, there's a bounding limit on both how large the stripe cache
> is, and how long (timewise) it will let the cache sit around waiting
> for new blocks to come in.  That's probably what you're hitting at
> times with the high queue depth numbers.

Okay, good to know.  I did try doubling the size of the stripe cache
just to see if it would reduce the number of RMWs at iodepth >= 64.  It
did not, so it looks like the cache is timing out, as you mentioned.
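
For anyone following along, this is the knob I've been twiddling -- just
a sketch, with md0 standing in for the actual array device:

  # stripe_cache_size is per-array, in stripe entries; memory cost is
  # roughly stripe_cache_size * 4 KiB * number of member drives
  cat /sys/block/md0/md/stripe_cache_size           # currently 8192
  echo 16384 > /sys/block/md0/md/stripe_cache_size  # doubled for the test

  # watch how much of the cache is actually in use while fio is running
  watch -n1 cat /sys/block/md0/md/stripe_cache_active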

> So you want to do everything at all once.  Fun.  So really I'd move
> back to the Network side, because unless your NAS box has more than
> 1GigE interface, and supports Bonding/trunking, you've hit the
> performance wall.

I'm not sure I necessarily want to tune everything at once.
Surprisingly, this box does have 10 GigE interfaces.  I just want to get
RAID tuned the best I can before I start testing over the network.
With 10 GigE this box should be able to write roughly 1200 MB/s at most
(10 Gbps is 1250 MB/s raw, less protocol overhead).  But as reported
earlier, I'm not even able to get that with fio running locally on the
box writing to the RAID device.
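
For reference, the local test is basically a big sequential write with
fio, something along these lines (device name, size, and runtime are
placeholders, not the exact job I ran):

  # sequential 1M writes straight at the array, bypassing the page cache,
  # with the queue depth of 4 that looked best in the earlier runs
  fio --name=seqwrite --filename=/dev/md0 --rw=write --bs=1M \
      --ioengine=libaio --direct=1 --iodepth=4 --numjobs=1 \
      --size=32G --runtime=60 --time_based --group_reporting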

After Phil's explanation I now better understand what triggers the
RMWs.  Clearly I would like to minimize these to get the best
performance for both sequential and random patterns.  Messing with the
stripe cache size doesn't seem to change performance at all, so I will
probably play with a smaller chunk size next to see if that helps.
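
Something like the following is what I have in mind for the next
experiment (drive names and counts are placeholders, and recreating the
array wipes it, so this is on scratch disks only):

  # recreate the test array with a 64K chunk instead of mdadm's 512K default
  mdadm --stop /dev/md0
  mdadm --create /dev/md0 --level=6 --raid-devices=12 --chunk=64 /dev/sd[b-m]
  # full-stripe width becomes chunk * (raid-devices - 2), i.e. 64K * 10 = 640K,
  # so more writes can be full-stripe and skip the RMW path entirely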

> Also, even if you get a ton of performance with large streaming
> writes, when you sprinkle in a small set of random IO/s, you're going
> to hit the cliff much sooner.  And in that case... it's another set of
> optimizations.

Yes, I get that.  There are definitely some customers that do a little
of everything with these NAS boxes.  But from what I've seen, a lot of
them also use a NAS for just one thing - hosting VMs, streaming media,
serving files, or running a database / web app.

> Are you going to use NFSv3?  TCP?  UDP?  1500 MTU, 9000 MTU?

Yes, all of these are supported.

> How many clients?  How active?

These boxes tend to get used pretty hard.  Probably the biggest
application is iSCSI or Samba backups, and then hosting VMs.  For
backups it's usually a small number of clients but heavy volume.  For
VMs there can be quite a few.

One other consideration is that these kinds of products undergo lots of
benchmark testing with the usual suite of tools.  My main goal at this
point is to make the tests that focus primarily on sequential
throughput (small and large blocks) look as good as they can given the
hardware limitations.  10 Gbps iSCSI throughput is probably the most
important benchmark.  If I can somehow get the RAID 5,6 write speeds up
over 1 GB/s I would be very happy.  Right now 10 Gbps iSCSI write
performance is sadly limited by the RAID device.

> Can you give up disk space for IOP/s?  So get away from the RAID6 and
> move to RAID1 mirrors with a stripe atop it, so that you maximize how
> many IOPs you can get.

Yes, this box already supports RAID 0, 1, 5, 6, 10, 50, and 60.
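
If a benchmark turns out to be IOPS-bound rather than bandwidth-bound,
flipping a test array over to striped mirrors is easy enough to try,
e.g. (again, drive names and counts are just placeholders):

  # RAID10 (near-2 layout) gives up half the capacity but has no parity,
  # so random writes never hit the RMW cycle that RAID5/6 does
  mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=12 \
      --chunk=64 /dev/sd[b-m]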
