From: Doug Dumitru <doug@easyco.com>
To: Dallas Clement <dallas.a.clement@gmail.com>,
	Robert Kierski <rkierski@cray.com>
Cc: Linux-RAID <linux-raid@vger.kernel.org>
Subject: Re: best base / worst case RAID 5,6 write speeds
Date: Tue, 22 Dec 2015 10:33:24 -0800
Message-ID: <CAFx4rwQvTrUEskJxdc6CpvGu3uVYOJWEvHmMkqmNbmTgTVT0-Q@mail.gmail.com>
In-Reply-To: <CAE9DZUQoNh2uU1h4okY0Fz5wzVv6-ZTnet9tS-dNQRHsLnPvNg@mail.gmail.com>

Robert and Dallas,

The patch covers an astonishingly narrow special case and has a few
usage caveats.

First, it only works when IO is precisely aligned on stripe
boundaries.  If anything is off-aligned, or even if an aligned write
arrives while the stripe cache is not empty, the patch's special case
does not trigger.  Second, the patch assumes that your application
layer "makes sense" and will not try to read a block that is in the
middle of being written.

The patch is in use on production servers, though with still more
caveats.  It turns itself off if the array is not clean or if a
rebuild or check is in progress.

Here is "raid5.c" from CentOS 7 with the patch applied:

https://drive.google.com/file/d/0B3T4AZzjEGVkbUYzeVZqbkIzN1E/view?usp=sharing

The modified areas are all inside #ifdef EASYCO conditionals.  I did
not want to post this as a patch here because it is not appropriate
code for general use.
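
For flavor only, here is a minimal sketch of the kind of gate such a
special case needs before it can bypass the stripe cache.  The helper
name and simplified checks are hypothetical (3.10-era bio fields); the
real code is in the raid5.c linked above:

  #ifdef EASYCO
  /* Hypothetical sketch, not the actual patch: only take the fast
   * "compute parity and go" path for writes that start and end on
   * stripe boundaries against a clean, idle array. */
  static int easyco_fastpath_ok(struct r5conf *conf, struct bio *bi)
  {
          /* data sectors per full stripe = chunk size * data disks */
          sector_t stripe_sectors = conf->chunk_sectors *
                  (conf->raid_disks - conf->max_degraded);

          if (conf->mddev->degraded)
                  return 0;    /* array not clean */
          if (test_bit(MD_RECOVERY_RUNNING, &conf->mddev->recovery))
                  return 0;    /* rebuild or check in progress */
          if (bi->bi_sector % stripe_sectors)
                  return 0;    /* start not on a stripe boundary */
          if ((bi->bi_size >> 9) % stripe_sectors)
                  return 0;    /* not a whole number of stripes */
          /* the real patch also requires an empty stripe cache */
          return 1;
  }
  #endif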

-- Some comments on stripe cache --

The stripe cache adds a lot of overhead for this particular case, but
it still works quite well compared to the alternatives.  Most
benchmarks I see with high-end raid cards cannot reach 1 GB/sec on
either raid-5 or raid-6.

Moving away from the stripe cache, especially dynamically, might open
up a nasty set of locking semantics.

-- Some comments on the raid background thread --

With most "reasonable" disk sets, the single raid thread is fine for
raid-5 at 1.8 GB/sec.  If you want raid-6 to go faster, you need more
cores.  With my E5-1650 v3 I get just over 8 GB/sec with raid-6, most
of the CPU time going to the raid-6 parity compute code.  Multi-socket
E5s might do a little better, but NUMA throws all sorts of interesting
performance tuning issues at our proprietary layer that sits above
raid.
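
For scale, the (n-1)*bw ceiling discussed further down, taken with the
24-drive set of ~500 MB/sec SSDs quoted below, works out to

  (n - 1) * bw = (24 - 1) * ~500 MB/sec ~= 11.5 GB/sec

which lines up with the 11+ GB/sec the patched fast path reaches, and
shows how far below drive limits the stock 1.8 GB/sec number sits.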

-- Some comments on benchmarks --

If you run benchmarks like fio, you will get IO patterns that never
happen "in live datasets".  For example, a real file system will never
read a block that is being written.  This is a side effect of the file
system's use of pages as cache, with writes coming from dirty pages.
Benchmarks just pump random numbers, and overlaps are allowed.  This
means you must write code that survives the benchmarks, but optimizing
for a benchmark in some areas is dubious.
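
As an illustration (mine, not from the thread), here is roughly how
you would coax fio into the stripe-aligned pattern described here,
assuming the 24-drive / 32K-chunk geometry quoted below; /dev/md0 and
the job name are placeholders:

  # 23 data disks * 32K chunk = 736K per full stripe; sequential
  # 736K writes from offset 0 stay stripe-aligned.
  fio --name=fullstripe --filename=/dev/md0 --rw=write --bs=736k \
      --direct=1 --ioengine=libaio --iodepth=4 --numjobs=1

A single sequential job like this avoids the overlap problem; once you
add random jobs or multiple writers, all bets are off.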

-- Some comments on RMW and SSDs --

One reason I wrote this patch was to keep SSDs happy.  If you write to
SSDs "perfectly", they never degrade and stay at full performance.  If
you do any random writing, the SSDs eventually need to do some space
management (garbage collection).  Even the 2-3% of RMW that I see
without the patch is enough to cost 3x SSD wear with some drives.
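
Back-of-envelope, and assuming the aligned full-stripe writes see a
write amplification near 1, the 3x figure implies the small RMW
fraction is enormously expensive per write:

  overall wear = 0.97 * 1 + 0.03 * WA_rmw = 3
  =>  WA_rmw ~= (3 - 0.97) / 0.03 ~= 68

i.e. each RMW-induced random write can cost tens of flash writes once
the drive's garbage collection gets involved.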



Doug Dumitru
WildFire Storage


On Tue, Dec 22, 2015 at 8:48 AM, Dallas Clement
<dallas.a.clement@gmail.com> wrote:
> On Tue, Dec 22, 2015 at 12:15 AM, Doug Dumitru <doug@easyco.com> wrote:
>> My apologies for diving in so late.
>>
>> I routinely run 24 drive raid-5 sets with SSDs.  Chunk is set at 32K
>> and the application only writes "perfect" 736K "stripes".  The SSDs
>> are Samsung 850 Pros on dedicated LSI 3008 SAS ports and are at "new"
>> preconditioning (ie, they are at full speed of just over 500 MB/sec).
>> CPU is a single E5-1650 v3.
>>
>> With stock RAID-5 code, I get about 1.8 GB/sec, q=4.
>>
>> Now this application is writing from kernel space
>> (generic_make_request w/ q waiting for completion callback).  There
>> are a lot of RMW operations happening here.  I think the raid-5
>> background thread is waking up asynchronously when only a part of the
>> write has been buffered into stripe cache pages.  The bio going into
>> the raid layer is a single bio, so nothing is being carved up on the
>> request end.  The raid-5 helper thread also saturates a CPU core
>> (which is about as fast as you can get with an E5-1650).
>>
>> If I patch raid5.ko with special case code to avoid the stripe cache
>> and just compute parity and go, the write throughput goes up above
>> 11GB/sec.
>>
>> This is obviously an impossible IO pattern for most applications, but
>> does confirm that the upper limit of (n-1)*bw is "possible", but not
>> with the current stripe cache logic in the raid layer.
>>
>> Doug Dumitru
>> WildFire Storage
>
>
>> If I patch raid5.ko with special case code to avoid the stripe cache
>> and just compute parity and go, the write throughput goes up above
>> 11GB/sec.
>
> Hi Doug.  This is really quite astounding and encouraging!  Would you
> be willing to share your patch?  I am eager to give it a try for RAID
> 5 and 6.
>
>> Now this application is writing from kernel space
>> (generic_make_request w/ q waiting for completion callback).  There
>> are a lot of RMW operations happening here.  I think the raid-5
>> background thread is waking up asynchronously when only a part of the
>> write has been buffered into stripe cache pages.
>
> I am also anxious to hear from anyone who maintains the stripe cache
> code.  I am seeing similar behavior when I monitor writes of perfectly
> stripe-aligned blocks.  The number of RMWs is smallish and seems to
> vary, but still I do not expect to see any of them!



-- 
Doug Dumitru
EasyCo LLC
