* RAID10 Write Performance
From: Marc Smith @ 2015-12-18 18:43 UTC
  To: linux-raid

Hi,

I'm testing a (24) slot SSD array (Supermicro) with MD RAID. The setup
consists of the Supermicro chassis, (24) Pliant LB406M SAS SSD drives,
(3) Avago/LSI SAS3008 SAS HBAs, and (2) Intel Xeon E5-2660 2.60GHz
processors.

The (24) SSDs are directly connected (pass-through back-plane) to the
(3) SAS HBAs (eight drives per HBA) with no SAS expanders.

I'm planning to use RAID10 for this system. I started by playing with
some performance configurations; I'm specifically looking at random
I/O performance.

The test commands I've been using with fio are the following:
4K 100% random, 100% READ: fio --bs=4k --direct=1 --rw=randread
--ioengine=libaio --iodepth=16 --numjobs=16 --name=/dev/md0
--runtime=60
4K 100% random, 100% WRITE: fio --bs=4k --direct=1 --rw=randwrite
--ioengine=libaio --iodepth=16 --numjobs=16 --name=/dev/md0
--runtime=60
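
The same write test can be written with the target device given
explicitly via --filename (the usual way to point fio at a block
device; the job name is then arbitrary):
fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=16
--numjobs=16 --runtime=60 --name=md0-randwrite --filename=/dev/md0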

As a benchmark, I initially tested all twenty-four drives using RAID0
with an 8K chunk size; here are the numbers I got:
4K random read: 645,233 IOPS
4K random write: 309,879 IOPS

Not too shabby... obviously these numbers are just for benchmarking;
the plan is to use RAID10 for production.

So, I won't go into the specifics of all the tests, but I've tried
quite a few different RAID10 configurations:
- Nested RAID 10 (1+0): RAID 0 (stripe) built from RAID 1 (mirror) arrays
- Nested RAID 10 (0+1): RAID 1 (mirror) built from RAID 0 (stripe) arrays
- "Complex" MD RAID10: near layout, 2 copies

All of these yield very similar results using (12) of the disks spread
across the (3) HBAs. As an example:
Nested RAID 10 (0+1) - RAID 1 (mirror) built with RAID 0 (stripe) arrays
For the (2) stripe sets (2 disks per HBA, 6 total per set):
mdadm --create --verbose /dev/md0 --level=stripe --raid-devices=6
--chunk=64K /dev/sda1 /dev/sdb1 /dev/sdi1 /dev/sdj1 /dev/sdq1
/dev/sdr1
mdadm --create --verbose /dev/md1 --level=stripe --raid-devices=6
--chunk=64K /dev/sdc1 /dev/sdd1 /dev/sdk1 /dev/sdl1 /dev/sds1
/dev/sdt1
For the (1) mirror set (consisting of the 2 stripe sets):
mdadm --create --verbose /dev/md2 --level=mirror --raid-devices=2
/dev/md0 /dev/md1
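
For reference, the "complex" RAID 10 variant is a single MD array; a
near/2 layout over the same twelve partitions would look something
like this (device order illustrative):
mdadm --create --verbose /dev/md0 --level=10 --layout=n2
--raid-devices=12 --chunk=64K /dev/sda1 /dev/sdb1 /dev/sdi1 /dev/sdj1
/dev/sdq1 /dev/sdr1 /dev/sdc1 /dev/sdd1 /dev/sdk1 /dev/sdl1 /dev/sds1
/dev/sdt1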

Running the random 4K performance tests described above yields the
following results for the RAID10 array:
4K random read: 276,967 IOPS
4K random write: 643 IOPS


The read numbers seem in line with what I expected, but the writes
are absolutely dismal. I don't expect them to match the read numbers,
but this is really, really low! I must have something configured
incorrectly, right?

I've experimented with different chunk sizes, and haven't gotten much
of a change in the write numbers. Again, I've tried several different
variations of a "RAID10" configuration (nested 1+0, nested 0+1,
complex using near/2) and all yield very similar results: Good read
performance, extremely poor write performance.

Even the throughput when doing a sequential write test is not where
I'd expect it to be, so something definitely seems to be up when
mixing RAID levels 0 and 1. I didn't explore all the extremes of the
chunk sizes, so perhaps it's as simple as that? I haven't tested the
"far" and "offset" layouts of RAID10 yet, but I'm not hopeful they'll
be any different.
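
When I do test them, I assume it's just a matter of swapping the
layout at creation time, e.g. (same twelve partitions):
mdadm --create --verbose /dev/md3 --level=10 --layout=f2
--raid-devices=12 --chunk=64K <devices>
mdadm --create --verbose /dev/md3 --level=10 --layout=o2
--raid-devices=12 --chunk=64K <devices>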


Here is what I'm using:
Linux 3.14.57 (vanilla)
mdadm - v3.3.2 - 21st August 2014
fio-2.0.13


Any ideas or suggestions would be greatly appreciated. Just as a
simple test, I created a RAID5 volume using (4) of the SSDs and ran
the same random IO performance tests:
4K random read: 169,026 IOPS
4K random write: 12,682 IOPS

I'm not sure whether we get any write caching with the default RAID5
mdadm creation command, but we're getting ~12K IOPS with RAID5. Not
great, but compare that to the 643 IOPS with RAID10...
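
If I understand the md sysfs interface correctly, the RAID5 stripe
cache can be checked and enlarged like this (the value is in pages
per device, default 256; md0 here assumed to be the RAID5 array):
cat /sys/block/md0/md/stripe_cache_size
echo 4096 > /sys/block/md0/md/stripe_cache_size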


Thanks in advance!


--Marc


* Re: RAID10 Write Performance
From: Marc Smith @ 2015-12-22 19:36 UTC
  To: linux-raid

Solved... it appears it was the write-intent bitmap that caused the
performance issues. I discovered that if I left the test running
longer than 60 seconds, the performance would eventually climb to
where I'd expect it. I ran 'mdadm --grow --bitmap=none /dev/md0' and
now random write performance is high/good/stable right off the bat.

--Marc



* Re: RAID10 Write Performance
From: NeilBrown @ 2015-12-23  2:20 UTC
  To: Marc Smith, linux-raid

On Wed, Dec 23 2015, Marc Smith wrote:

> Solved... it appears it was the write-intent bitmap that caused the
> performance issues. I discovered that if I left the test running
> longer than 60 seconds, the performance would eventually climb to
> where I'd expect it. I ran 'mdadm --grow --bitmap=none /dev/md0' and
> now random write performance is high/good/stable right off the bat.

Keeping a write-intent bitmap really is a good idea.
Using a larger bitmap chunk size can reduce the performance penalty and
preserve much of the value.  It is easy enough to experiment with
different sizes.
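
Concretely, something along these lines; the bitmap has to be removed
and re-added to change its chunk, and 128M is just a starting point
to experiment from:
  mdadm --grow --bitmap=none /dev/md0
  mdadm --grow --bitmap=internal --bitmap-chunk=128M /dev/md0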

NeilBrown



* Re: RAID10 Write Performance
From: Marc Smith @ 2015-12-23 13:59 UTC
  To: NeilBrown; +Cc: linux-raid

Okay, thanks, I'll turn it back on and try some different chunk sizes.

For my own knowledge, why/what is taking place under the covers that
causes this behavior? When testing with fio, it sometimes takes 1-2
minutes of "ramp up time" before the performance numbers are
good/expected (when the write-intent bitmap is enabled).


Thanks,

Marc




* Re: RAID10 Write Performance
From: NeilBrown @ 2015-12-23 22:58 UTC
  To: Marc Smith; +Cc: linux-raid

On Thu, Dec 24 2015, Marc Smith wrote:

> Okay, thanks, I'll turn it back on and try some different chunk sizes.
>
> For my own knowledge, why/what is taking place under the covers that
> causes this behavior? When testing with fio, it sometimes takes 1-2
> minutes of "ramp up time" before the performance numbers are
> good/expected (when the write-intent bitmap is enabled).
>

Whenever md needs to write to a region (a bitmap-chunk) of the array
that it hasn't written to recently, it needs to set a bit and write out
the bitmap first.  It tries to gather multiple writes together and set
several bits at once, but a synchronous workload will defeat that.
Once the bit is set it will stay set until several seconds after the
last write.  I think it defaults to 5 seconds.
  mdadm -X /dev/some-component
will list it as 'daemon sleep'.

So the delay you are seeing is the time it takes to get all of those
bits set.  1 minute does sound like a long time, though if the writes
are synchronous that would easily explain it.
With a larger chunk size, there are fewer bits to set, so fewer times
that the drives need to seek to the other end of the disk to write
out the bitmap.

If you run "watch -n 0.1 mdadm -X /dev/something" in a window it will
report how many bits are set moment by moment.  That might give you some
feel for what is happening.
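
For example, to watch just the dirty count (component device name
illustrative):
  watch -n 0.1 "mdadm -X /dev/sda1 | grep -i dirty"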

NeilBrown


