* Optimizing small IO with md RAID
@ 2011-05-30  7:14 fibreraid
  2011-05-30 10:43 ` Stan Hoeppner
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: fibreraid @ 2011-05-30  7:14 UTC (permalink / raw)
  To: linux-raid, fibre raid

Hi all,

I am looking to optimize md RAID performance as much as possible.

I've managed to get some rather strong large (4M) IO performance, but
small (4K) IOPS are still rather subpar, given the hardware.

CPU: 2 x Intel Westmere 6-core 2.4GHz
RAM: 24GB DDR3 1066
SAS controllers: 3 x LSI SAS2008 (6 Gbps SAS)
Drives: 24 x SSDs
Kernel: 2.6.38 x64 kernel (home-grown)
Benchmarking Tool: fio 1.54

Here are the results. I used the following commands to perform these benchmarks:

4K READ: fio --bs=4k --direct=1 --rw=read --ioengine=libaio
--iodepth=512 --runtime=60 --name=/dev/md0
4K WRITE: fio --bs=4k --direct=1 --rw=write --ioengine=libaio
--iodepth=512 --runtime=60 --name=/dev/md0
4M READ: fio --bs=4m --direct=1 --rw=read --ioengine=libaio
--iodepth=64 --runtime=60 --name=/dev/md0
4M WRITE: fio --bs=4m --direct=1 --rw=write --ioengine=libaio
--iodepth=64 --runtime=60 --name=/dev/md0

In each case below, the md chunk size was 64K. In RAID 5 and RAID 6,
one hot-spare was specified.

            raid0 24 x SSD   raid5 23 x SSD   raid6 23 x SSD   raid0 (2 * (raid5 x 11 SSD))
4K read     179,923 IO/s     93,503 IO/s      116,866 IO/s     75,782 IO/s
4K write    168,027 IO/s     108,408 IO/s     120,477 IO/s     90,954 IO/s
4M read     4,576.7 MB/s     4,406.7 MB/s     4,052.2 MB/s     3,566.6 MB/s
4M write    3,146.8 MB/s     1,337.2 MB/s     1,259.9 MB/s     1,856.4 MB/s

Note that each individual SSD tests out as follows:

4k read: 56,342 IO/s
4k write: 33,792 IO/s
4M read: 231 MB/s
4M write: 130 MB/s


My concerns:

1. Given the above individual SSD performance, 24 SSDs in an md array
is at best getting the 4K read/write performance of 2-3 drives, which
seems very low. I would expect significantly better linear scaling.
2. On the other hand, 4M reads/writes are performing more like 10-15
drives, which is much better, though it still seems like there is room
for improvement.
3. 4K reads/writes look good for RAID 0, but drop off by over 40% with
RAID 5. While somewhat understandable on writes, why such a
significant hit on reads?
4. RAID 5 4M writes take a big hit compared to RAID 0, from 3,146 MB/s
to 1,337 MB/s. Despite the RAID 5 overhead, that still seems huge given
the CPUs at hand. Why?
5. Using a RAID 0 across two 11-SSD RAID 5s gives better RAID 5 4M
write performance, but worse reads and significantly worse 4K
reads/writes. Why?


Any thoughts would be greatly appreciated, especially patch ideas for
tweaking options. Thanks!

Best,
Tommy


* Re: Optimizing small IO with md RAID
  2011-05-30  7:14 Optimizing small IO with md RAID fibreraid
@ 2011-05-30 10:43 ` Stan Hoeppner
  2011-05-30 11:20 ` David Brown
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Stan Hoeppner @ 2011-05-30 10:43 UTC (permalink / raw)
  To: fibreraid; +Cc: linux-raid

On 5/30/2011 2:14 AM, fibreraid@gmail.com wrote:
> Hi all,
> 
> I am looking to optimize md RAID performance as much as possible.
> 
> I've managed to get some rather strong large 4M IOps performance, but
> small 4K IOps are still rather subpar, given the hardware.
> 
> CPU: 2 x Intel Westmere 6-core 2.4GHz
> RAM: 24GB DDR3 1066
> SAS controllers: 3 x LSI SAS2008 (6 Gbps SAS)
> Drives: 24 x SSD's
> Kernel: 2.6.38 x64 kernel (home-grown)
> Benchmarking Tool: fio 1.54
> 
> Here are the results.I used the following commands to perform these benchmarks:
> 
> 4K READ: fio --bs=4k --direct=1 --rw=read --ioengine=libaio
> --iodepth=512 --runtime=60 --name=/dev/md0
> 4K WRITE: fio --bs=4k --direct=1 --rw=write--ioengine=libaio
> --iodepth=512 --runtime=60 --name=/dev/md0
> 4M READ: fio --bs=4m --direct=1 --rw=read --ioengine=libaio
> --iodepth=64 --runtime=60 --name=/dev/md0
> 4M WRITE: fio --bs=4m --direct=1 --rw=read --ioengine=libaio
> --iodepth=64 --runtime=60 --name=/dev/md0

Did you test with buffered IO?  Unless you're running Oracle or a custom
app that only uses O_DIRECT, you should probably be testing buffered IO
as well, since it's the more real-world test case most of the time.
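
A buffered run could look something like this (just a sketch; since
buffered libaio is effectively synchronous, I'd use several jobs instead
of a deep iodepth -- adjust numjobs to taste):

fio --bs=4k --rw=read --ioengine=psync --numjobs=16 --runtime=60 \
    --group_reporting --name=buffered4k --filename=/dev/md0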

> In each case below, the md chunk size was 64K. In RAID 5 and RAID 6,
> one hot-spare was specified.

IOPS tuning and throughput tuning traditionally have an inverse
relationship.  It may prove difficult to tune for maximum performance in
both cases.

> 	raid0 24 x SSD	raid5 23 x SSD	raid6 23 x SSD	raid0 (2 * (raid5 x 11 SSD))						
> 4K read	179,923 IO/s	93,503 IO/s	116,866 IO/s	75,782 IO/s
> 4K write	168,027 IO/s	108,408 IO/s	120,477 IO/s	90,954 IO/s
> 4M read	4,576.7 MB/s	4,406.7 MB/s	4,052.2 MB/s	3,566.6 MB/s
> 4M write	3,146.8 MB/s	1,337.2 MB/s	1,259.9 MB/s	1,856.4 MB/s

> Note that each individual SSD tests out as follows:
> 
> 4k read: 56,342 IO/s
> 4k write: 33,792 IO/s
> 4M read: 231 MB/s
> 4M write: 130 MB/s

This looks like a filesystem limitation.

> My concerns:
> 
> 1. Given the above individual SSD performance, 24 SSD's in an md array
> is at best getting 4K read/write performance of 2-3 drives, which
> seems very low. I would expect significantly better linear scaling.
> 2. On the other hand, 4M read/write are performing more like 10-15
> drives, which is much better, though still seems like it could get
> better.
> 3. 4k read/write looks good for RAID 0, but drop off by over 40% with
> RAID 5. While somewhat understandable on writes, why such a
> significant hit on reads?
> 4. RAID 5 4M writes take a big hit compared to RAID 0, from 3146 MB/s
> to 1337 MB/s. Despite the RAID 5 overhead, that still seems huge given
> the CPU's at hand. Why?
> 5. Using a RAID 0 across two 11-SSD RAID 5's gives better RAID 5 4M
> write performance, but worse in reads and significantly worse in 4K
> reads/writes. Why?
> 
> Any thoughts would be greatly appreciated, especially patch ideas for
> tweaking options. Thanks!

Your filesystem interaction with mdraid levels (stripe/chunk meshing)
may be limiting your performance.  FIO does test files IIRC, not direct
block IO.  Are you using EXT3/4?  XFS?

I suggest you try the following.  Create an md raid *linear* array of
all 24 SSDs using a 4KB chunk size.  Format the resulting md device with
XFS, specifying 24 allocation groups and no other options.  Something like:

~# mdadm -C /dev/md0 -l linear -n 24 -c 4 /dev/sd[a-x]
~# mdadm -A /dev/md0 /dev/sd[a-x]      (to re-assemble it later, if needed)
~# mkfs.xfs -d agcount=24 /dev/md0

This setup will parallelize the IO load at the file level instead of at
the stripe or chunk level of the md RAID layer.  Each file in the test
will be wholly written to and read from only one SSD, but you'll get 24
parallel streams, one to/from each SSD.  (You can do the same thing with
RAID 10, 6, etc, but files will get striped across multiple drives,
which doesn't work well for small files)

Simply specify agcount=[number of actual data devices], not including
devices, or space, consumed by redundancy.  For example, in a 10 disk
RAID 10 you'd use agcount=5.  For a 10 disk RAID 6, agcount=8, and so on.

Since you're using 2.6.38 you'll want to enable XFS delayed logging,
which speeds up large metadata write loads substantially.  To do so,
simply add 'delaylog' to your fstab mount options, such as:

/dev/md0       /test           xfs     defaults,delaylog
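
or, for a quick test without touching fstab (assuming /test is the mount
point you're using):

~# mount -t xfs -o delaylog /dev/md0 /test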

I'm interested to see what kind of performance increase you get with
this setup.

-- 
Stan


* Re: Optimizing small IO with md RAID
  2011-05-30  7:14 Optimizing small IO with md RAID fibreraid
  2011-05-30 10:43 ` Stan Hoeppner
@ 2011-05-30 11:20 ` David Brown
  2011-05-30 11:57   ` John Robinson
  2011-05-31  3:23 ` Stefan /*St0fF*/ Hübner
  2011-05-31  3:48 ` Joe Landman
  3 siblings, 1 reply; 10+ messages in thread
From: David Brown @ 2011-05-30 11:20 UTC (permalink / raw)
  To: linux-raid

On 30/05/2011 09:14, fibreraid@gmail.com wrote:
> Hi all,
>
> I am looking to optimize md RAID performance as much as possible.
>
> I've managed to get some rather strong large 4M IOps performance, but
> small 4K IOps are still rather subpar, given the hardware.
>
> CPU: 2 x Intel Westmere 6-core 2.4GHz
> RAM: 24GB DDR3 1066
> SAS controllers: 3 x LSI SAS2008 (6 Gbps SAS)
> Drives: 24 x SSD's
> Kernel: 2.6.38 x64 kernel (home-grown)
> Benchmarking Tool: fio 1.54
>
> Here are the results.I used the following commands to perform these benchmarks:
>
> 4K READ: fio --bs=4k --direct=1 --rw=read --ioengine=libaio
> --iodepth=512 --runtime=60 --name=/dev/md0
> 4K WRITE: fio --bs=4k --direct=1 --rw=write--ioengine=libaio
> --iodepth=512 --runtime=60 --name=/dev/md0
> 4M READ: fio --bs=4m --direct=1 --rw=read --ioengine=libaio
> --iodepth=64 --runtime=60 --name=/dev/md0
> 4M WRITE: fio --bs=4m --direct=1 --rw=read --ioengine=libaio
> --iodepth=64 --runtime=60 --name=/dev/md0
>
> In each case below, the md chunk size was 64K. In RAID 5 and RAID 6,
> one hot-spare was specified.
>
> 	raid0 24 x SSD	raid5 23 x SSD	raid6 23 x SSD	raid0 (2 * (raid5 x 11 SSD))						
> 4K read	179,923 IO/s	93,503 IO/s	116,866 IO/s	75,782 IO/s
> 4K write	168,027 IO/s	108,408 IO/s	120,477 IO/s	90,954 IO/s
> 4M read	4,576.7 MB/s	4,406.7 MB/s	4,052.2 MB/s	3,566.6 MB/s
> 4M write	3,146.8 MB/s	1,337.2 MB/s	1,259.9 MB/s	1,856.4 MB/s
>
> Note that each individual SSD tests out as follows:
>
> 4k read: 56,342 IO/s
> 4k write: 33,792 IO/s
> 4M read: 231 MB/s
> 4M write: 130 MB/s
>
>
> My concerns:
>
> 1. Given the above individual SSD performance, 24 SSD's in an md array
> is at best getting 4K read/write performance of 2-3 drives, which
> seems very low. I would expect significantly better linear scaling.
> 2. On the other hand, 4M read/write are performing more like 10-15
> drives, which is much better, though still seems like it could get
> better.
> 3. 4k read/write looks good for RAID 0, but drop off by over 40% with
> RAID 5. While somewhat understandable on writes, why such a
> significant hit on reads?
> 4. RAID 5 4M writes take a big hit compared to RAID 0, from 3146 MB/s
> to 1337 MB/s. Despite the RAID 5 overhead, that still seems huge given
> the CPU's at hand. Why?
> 5. Using a RAID 0 across two 11-SSD RAID 5's gives better RAID 5 4M
> write performance, but worse in reads and significantly worse in 4K
> reads/writes. Why?
>
>
> Any thoughts would be greatly appreciated, especially patch ideas for
> tweaking options. Thanks!
>

(This is in addition to what Stan said about filesystems, etc.)

If my mental calculations are correct, writing 4M to this raid5/raid6 
setup takes about 1.5 stripes.  Typically that will mean two partial 
stripe writes (or even two partials and one full).  Partial stripe 
writes on raid5/6 mean reading in most of the old stripe, calculating 
the new parity, and writing out the new data and parity.  When you tried 
with a raid0 of two raid5 groups, this effect was smaller because more of 
the writes were full stripes.

With SSDs, you have very low latency between a read system call and the 
data being accessed - that's what gives them their high IOPS.  But it also 
means that layers of indirection, such as more complex raid or layered 
raid, have more of an effect.

Try your measurements with a raid10,far setup.  It costs more on data 
space, but should, I think, be quite a bit faster.
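
Something along these lines, as a sketch (f2 is the "far 2" layout; the 
device names are just placeholders):

~# mdadm -C /dev/md0 -l 10 -p f2 -n 24 -c 64 /dev/sd[a-x]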



* Re: Optimizing small IO with md RAID
  2011-05-30 11:20 ` David Brown
@ 2011-05-30 11:57   ` John Robinson
  2011-05-30 13:08     ` David Brown
  0 siblings, 1 reply; 10+ messages in thread
From: John Robinson @ 2011-05-30 11:57 UTC (permalink / raw)
  To: Linux RAID

On 30/05/2011 12:20, David Brown wrote:
> (This is in addition to what Stan said about filesystems, etc.)
[...]
> Try your measurements with a raid10,far setup. It costs more on data
> space, but should, I think, be quite a bit faster.

I'd also be interested in what performance is like with RAID60, e.g. 4 
6-drive RAID6 sets, combined into one RAID0. I suggest this arrangement 
because it gives slightly better data space (33% better than the RAID10 
arrangement), better redundancy (if that's a consideration[1]), and 
would keep all your stripe widths in powers of two, e.g. 64K chunk on 
the RAID6s would give a 256K stripe width and end up with an overall 
stripe width of 1M at the RAID0.
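
A rough sketch of that arrangement (hypothetical device names, 64K chunks 
throughout, and a 256K chunk on the outer RAID0 to match one RAID6 stripe):

~# mdadm -C /dev/md1 -l 6 -n 6 -c 64 /dev/sd[a-f]
~# mdadm -C /dev/md2 -l 6 -n 6 -c 64 /dev/sd[g-l]
~# mdadm -C /dev/md3 -l 6 -n 6 -c 64 /dev/sd[m-r]
~# mdadm -C /dev/md4 -l 6 -n 6 -c 64 /dev/sd[s-x]
~# mdadm -C /dev/md0 -l 0 -n 4 -c 256 /dev/md[1-4]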

You will probably always have relatively poor small write performance 
with any parity RAID for reasons both David and Stan already pointed 
out, though the above might be the least worst, if you see what I mean.

You could also try 3 8-drive RAID6s or 2 12-drive RAID6s but you'd 
definitely have to be careful - as Stan says - with your filesystem 
configuration because of the stripe widths, and the bigger your parity 
RAIDs the worse your small write and degraded performance becomes.

Cheers,

John.

[1] RAID6 lets you get away with sector errors while rebuilding after a 
disc failure. In addition, as it happens, setting up this arrangement 
with two drives on each controller for each of the RAID6s would mean you 
could tolerate a controller failure, albeit with horrible performance 
and you would have no redundancy left. You could configure smaller 
RAID6s or RAID10 to tolerate a controller failure too.



* Re: Optimizing small IO with md RAID
  2011-05-30 11:57   ` John Robinson
@ 2011-05-30 13:08     ` David Brown
  2011-05-30 15:24       ` fibreraid
  0 siblings, 1 reply; 10+ messages in thread
From: David Brown @ 2011-05-30 13:08 UTC (permalink / raw)
  To: linux-raid

On 30/05/2011 13:57, John Robinson wrote:
> On 30/05/2011 12:20, David Brown wrote:
>> (This is in addition to what Stan said about filesystems, etc.)
> [...]
>> Try your measurements with a raid10,far setup. It costs more on data
>> space, but should, I think, be quite a bit faster.
>
> I'd also be interested in what performance is like with RAID60, e.g. 4
> 6-drive RAID6 sets, combined into one RAID0. I suggest this arrangement
> because it gives slightly better data space (33% better than the RAID10
> arrangement), better redundancy (if that's a consideration[1]), and
> would keep all your stripe widths in powers of two, e.g. 64K chunk on
> the RAID6s would give a 256K stripe width and end up with an overall
> stripe width of 1M at the RAID0.
>

Power-of-two stripe widths may be better for xfs than non-power-of-two 
widths - perhaps Stan can answer that (he seems to know lots about xfs 
on raid).  But you have to be careful when testing and benchmarking - 
with power-of-two stripe widths, it's easy to get great 4 MB performance 
but terrible 5 MB performance.
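
To check for that effect here, it might be worth re-running the large-block 
test with a non-power-of-two size as well, e.g. (a sketch based on the 
commands from the original post):

fio --bs=5m --direct=1 --rw=write --ioengine=libaio --iodepth=64 \
    --runtime=60 --name=/dev/md0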


As for the redundancy of raid6 (or 60) vs. raid10, the redundancy is 
different but not necessarily better, depending on your failure types 
and requirements.  raid6 will tolerate any two drives failing, while 
raid10 will tolerate up to half the drives failing as long as you don't 
lose both halves of a pair.  Depending on the chances of a random disk 
failing, if you have enough disks then the chances of two disks in a 
pair failing are less than the chances of three disks in a raid6 setup 
failing.  And raid10 suffers much less from running in degraded mode 
than raid6, and recovery is faster and less stressful.  So which is 
"better" depends on the user.

Of course, there is no question about the differences in space 
efficiency - that's easy to calculate.

For greater paranoia, you can always go for raid15 or even raid16...

> You will probably always have relatively poor small write performance
> with any parity RAID for reasons both David and Stan already pointed
> out, though the above might be the least worst, if you see what I mean.
>
> You could also try 3 8-drive RAID6s or 2 12-drive RAID6s but you'd
> definitely have to be careful - as Stan says - with your filesystem
> configuration because of the stripe widths, and the bigger your parity
> RAIDs the worse your small write and degraded performance becomes.
>
> Cheers,
>
> John.
>
> [1] RAID6 lets you get away with sector errors while rebuilding after a
> disc failure. In addition, as it happens, setting up this arrangement
> with two drives on each controller for each of the RAID6s would mean you
> could tolerate a controller failure, albeit with horrible performance
> and you would have no redundancy left. You could configure smaller
> RAID6s or RAID10 to tolerate a controller failure too.
>



* Re: Optimizing small IO with md RAID
  2011-05-30 13:08     ` David Brown
@ 2011-05-30 15:24       ` fibreraid
  2011-05-30 16:56         ` David Brown
  2011-05-30 21:21         ` Stan Hoeppner
  0 siblings, 2 replies; 10+ messages in thread
From: fibreraid @ 2011-05-30 15:24 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

Hi All,

I appreciate the feedback, but most of it seems to be about filesystem
recommendations or switching to parity-less RAID, like RAID 10. In my
tests, there is no filesystem; I am testing the raw block device, as I
want to establish best-case numbers there before layering on the
filesystem.

-Tommy


On Mon, May 30, 2011 at 6:08 AM, David Brown <david@westcontrol.com> wrote:
> On 30/05/2011 13:57, John Robinson wrote:
>>
>> On 30/05/2011 12:20, David Brown wrote:
>>>
>>> (This is in addition to what Stan said about filesystems, etc.)
>>
>> [...]
>>>
>>> Try your measurements with a raid10,far setup. It costs more on data
>>> space, but should, I think, be quite a bit faster.
>>
>> I'd also be interested in what performance is like with RAID60, e.g. 4
>> 6-drive RAID6 sets, combined into one RAID0. I suggest this arrangement
>> because it gives slightly better data space (33% better than the RAID10
>> arrangement), better redundancy (if that's a consideration[1]), and
>> would keep all your stripe widths in powers of two, e.g. 64K chunk on
>> the RAID6s would give a 256K stripe width and end up with an overall
>> stripe width of 1M at the RAID0.
>>
>
> Power-of-two stripe widths may be better for xfs than non-power-of-two
> widths - perhaps Stan can answer that (he seems to know lots about xfs on
> raid).  But you have to be careful when testing and benchmarking - with
> power-of-two stripe widths, it's easy to get great 4 MB performance but
> terrible 5 MB performance.
>
>
> As for the redundancy of raid6 (or 60) vs. raid10, the redundancy is
> different but not necessarily better, depending on your failure types and
> requirements.  raid6 will tolerate any two drives failing, while raid10 will
> tolerate up to half the drives failing as long as you don't lose both halves
> of a pair.  Depending on the chances of a random disk failing, if you have
> enough disks then the chances of two disks in a pair failing are less than
> the chances of three disks in a raid6 setup failing.  And raid10 suffers
> much less from running in degraded mode than raid6, and recovery is faster
> and less stressful.  So which is "better" depends on the user.
>
> Of course, there is no question about the differences in space efficiency -
> that's easy to calculate.
>
> For greater paranoia, you can always go for raid15 or even raid16...
>
>> You will probably always have relatively poor small write performance
>> with any parity RAID for reasons both David and Stan already pointed
>> out, though the above might be the least worst, if you see what I mean.
>>
>> You could also try 3 8-drive RAID6s or 2 12-drive RAID6s but you'd
>> definitely have to be careful - as Stan says - with your filesystem
>> configuration because of the stripe widths, and the bigger your parity
>> RAIDs the worse your small write and degraded performance becomes.
>>
>> Cheers,
>>
>> John.
>>
>> [1] RAID6 lets you get away with sector errors while rebuilding after a
>> disc failure. In addition, as it happens, setting up this arrangement
>> with two drives on each controller for each of the RAID6s would mean you
>> could tolerate a controller failure, albeit with horrible performance
>> and you would have no redundancy left. You could configure smaller
>> RAID6s or RAID10 to tolerate a controller failure too.
>>
>


* Re: Optimizing small IO with md RAID
  2011-05-30 15:24       ` fibreraid
@ 2011-05-30 16:56         ` David Brown
  2011-05-30 21:21         ` Stan Hoeppner
  1 sibling, 0 replies; 10+ messages in thread
From: David Brown @ 2011-05-30 16:56 UTC (permalink / raw)
  To: linux-raid

On 30/05/11 17:24, fibreraid@gmail.com wrote:
> Hi All,
>
> I appreciate the feedback but most of it seems around File System
> recommendations or to change to parity-less RAID, like RAID 10. In my
> tests, there is no file system; I am testing the raw block device as I
> want to establish best-numbers there before layering on the file
> system.
>

I understand about testing the low-level speed before adding filesystem 
(and possibly lvm) layers, but what's wrong with parity-less RAID? 
RAID10,far has lower space efficiency than RAID5 or RAID6, but typically 
has performance close to RAID0, and it sounded like you were judging 
performance to be the most important factor.

Regards,

David


> -Tommy
>
>
> On Mon, May 30, 2011 at 6:08 AM, David Brown<david@westcontrol.com>  wrote:
>> On 30/05/2011 13:57, John Robinson wrote:
>>>
>>> On 30/05/2011 12:20, David Brown wrote:
>>>>
>>>> (This is in addition to what Stan said about filesystems, etc.)
>>>
>>> [...]
>>>>
>>>> Try your measurements with a raid10,far setup. It costs more on data
>>>> space, but should, I think, be quite a bit faster.
>>>
>>> I'd also be interested in what performance is like with RAID60, e.g. 4
>>> 6-drive RAID6 sets, combined into one RAID0. I suggest this arrangement
>>> because it gives slightly better data space (33% better than the RAID10
>>> arrangement), better redundancy (if that's a consideration[1]), and
>>> would keep all your stripe widths in powers of two, e.g. 64K chunk on
>>> the RAID6s would give a 256K stripe width and end up with an overall
>>> stripe width of 1M at the RAID0.
>>>
>>
>> Power-of-two stripe widths may be better for xfs than non-power-of-two
>> widths - perhaps Stan can answer that (he seems to know lots about xfs on
>> raid).  But you have to be careful when testing and benchmarking - with
>> power-of-two stripe widths, it's easy to get great 4 MB performance but
>> terrible 5 MB performance.
>>
>>
>> As for the redundancy of raid6 (or 60) vs. raid10, the redundancy is
>> different but not necessarily better, depending on your failure types and
>> requirements.  raid6 will tolerate any two drives failing, while raid10 will
>> tolerate up to half the drives failing as long as you don't lose both halves
>> of a pair.  Depending on the chances of a random disk failing, if you have
>> enough disks then the chances of two disks in a pair failing are less than
>> the chances of three disks in a raid6 setup failing.  And raid10 suffers
>> much less from running in degraded mode than raid6, and recovery is faster
>> and less stressful.  So which is "better" depends on the user.
>>
>> Of course, there is no question about the differences in space efficiency -
>> that's easy to calculate.
>>
>> For greater paranoia, you can always go for raid15 or even raid16...
>>
>>> You will probably always have relatively poor small write performance
>>> with any parity RAID for reasons both David and Stan already pointed
>>> out, though the above might be the least worst, if you see what I mean.
>>>
>>> You could also try 3 8-drive RAID6s or 2 12-drive RAID6s but you'd
>>> definitely have to be careful - as Stan says - with your filesystem
>>> configuration because of the stripe widths, and the bigger your parity
>>> RAIDs the worse your small write and degraded performance becomes.
>>>
>>> Cheers,
>>>
>>> John.
>>>
>>> [1] RAID6 lets you get away with sector errors while rebuilding after a
>>> disc failure. In addition, as it happens, setting up this arrangement
>>> with two drives on each controller for each of the RAID6s would mean you
>>> could tolerate a controller failure, albeit with horrible performance
>>> and you would have no redundancy left. You could configure smaller
>>> RAID6s or RAID10 to tolerate a controller failure too.
>>>
>>



* Re: Optimizing small IO with md RAID
  2011-05-30 15:24       ` fibreraid
  2011-05-30 16:56         ` David Brown
@ 2011-05-30 21:21         ` Stan Hoeppner
  1 sibling, 0 replies; 10+ messages in thread
From: Stan Hoeppner @ 2011-05-30 21:21 UTC (permalink / raw)
  To: fibreraid; +Cc: David Brown, linux-raid

On 5/30/2011 10:24 AM, fibreraid@gmail.com wrote:
> Hi All,
> 
> I appreciate the feedback but most of it seems around File System
> recommendations or to change to parity-less RAID, like RAID 10. In my
> tests, there is no file system; I am testing the raw block device as I
> want to establish best-numbers there before layering on the file
> system.

You're not performing a valid test case.  You will always have a
filesystem in production.  The performance of every md raid level plus
filesystem plus hardware combination will be different, and thus they
must be tuned together, not each in isolation, especially in the case of
SSDs.  Case in point: slap EXT2 on your current array setup and then
XFS.  Test each with file based IO.  You'll see XFS has radically
superior parallel IO performance compared to EXT2.  Tweaking the array
setup will not yield significant EXT2 speedup for parallel IO.

Disk striping was invented 2+ decades ago to increase performance of
slow spindles for large file reads and writes, but the performance is
very low for small file IO due to partial stripe width operations taking
many of your spindles out of play, decreasing parallelism.  Adding
parity to the striping exacerbates this problem.  This is the classic
trade off between performance and redundancy.

SSDs have no moving parts, and natively have extremely high IOPS and
throughput rates, each SSD having on the order of 150x the seek rate of
a mech drive, and 2-3x the streaming throughput rate.  Thus, striping is
irrelevant to SSD performance, and, as you've seen, will degrade small
file performance due to partial width writes etc.

If you truly want to maximize real world performance of those 24 SSDs,
take one of your striped RAID configurations and format it with XFS
using the defaults.  Then run FIO with highly parallel file based IO
tests, i.e. two to four worker threads per CPU core.  Then delete the
array and create the linear setup I previously recommended and run the
same tests.  When comparing the  results I think you'll begin to see why
I recommend this setup for both highly parallel small and large file IO.
 Your large file IO numbers may be a little smaller with this setup, but
you should be able to play with chunk size to achieve the best balance
with both small and large file IO.
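
A file-based parallel test could look something like this (only a sketch;
/test, the job name and the sizes are placeholders, and 24 jobs is two
per core on your 12 cores):

fio --directory=/test --numjobs=24 --size=1g --bs=4k --rw=randwrite \
    --ioengine=libaio --iodepth=32 --runtime=60 --group_reporting \
    --name=parallel4k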

Regardless of chunk size, you should still see better overall parallel
IOPS and throughput than with striping, especially parity striping.  If
you need redundancy and maximum parallel performance, and can afford to
'waste' SSDs, create 12 RAID1 devices and make a linear array of the 12,
giving XFS 12 allocation groups.  For parallel small file workloads this
will yield better performance than RAID10 for the same cost of device
space.  Large file parallel performance should be similar.
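
A sketch of that layout (hypothetical device names):

~# mdadm -C /dev/md1 -l 1 -n 2 /dev/sda /dev/sdb
   (create md2 through md12 from the remaining pairs the same way)
~# mdadm -C /dev/md0 -l linear -n 12 /dev/md[1-9] /dev/md1[0-2]
~# mkfs.xfs -d agcount=12 /dev/md0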

-- 
Stan


* Re: Optimizing small IO with md RAID
  2011-05-30  7:14 Optimizing small IO with md RAID fibreraid
  2011-05-30 10:43 ` Stan Hoeppner
  2011-05-30 11:20 ` David Brown
@ 2011-05-31  3:23 ` Stefan /*St0fF*/ Hübner
  2011-05-31  3:48 ` Joe Landman
  3 siblings, 0 replies; 10+ messages in thread
From: Stefan /*St0fF*/ Hübner @ 2011-05-31  3:23 UTC (permalink / raw)
  To: fibreraid; +Cc: linux-raid

Hi,

Are those LSI SAS2008 controllers in IR or IT mode?  Software RAID
performance on those controllers is really bad in IR mode, since a lot of
performance is thrown away; IR mode is made for the "integrated RAID"
types like RAID0, RAID1, RAID1E and RAID10.

We've seen much better software RAID performance on this controller in IT
(initiator) mode.  See the downloads section at
http://www.lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/internal/sas9211-8i/index.html#Product%20Brief

If your controller BIOS already says "SAS2008-IT" or "LSI 9211-IT" on
boot-up, then you already have IT firmware on it.  That would be the
moment I'd start thinking about a 9265 controller rather than software RAID.

I mean, with a Westmere board and CPU ... you spend enough money on the
hardware, but you want to save on the real bottleneck?  Sounds a bit
irrational to me...
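
One way to check what is flashed (assuming LSI's sas2flash utility is
installed; output details vary by version) would be something like:

~# sas2flash -listall

which lists the installed adapters and their current firmware, from which
you should be able to tell whether you have the IT or IR variant.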

Cheers,
Stefan

On 30.05.2011 09:14, fibreraid@gmail.com wrote:
> Hi all,
> 
> I am looking to optimize md RAID performance as much as possible.
> 
> I've managed to get some rather strong large 4M IOps performance, but
> small 4K IOps are still rather subpar, given the hardware.
> 
> CPU: 2 x Intel Westmere 6-core 2.4GHz
> RAM: 24GB DDR3 1066
> SAS controllers: 3 x LSI SAS2008 (6 Gbps SAS)
> Drives: 24 x SSD's
> Kernel: 2.6.38 x64 kernel (home-grown)
> Benchmarking Tool: fio 1.54
> 
> Here are the results.I used the following commands to perform these benchmarks:
> 
> 4K READ: fio --bs=4k --direct=1 --rw=read --ioengine=libaio
> --iodepth=512 --runtime=60 --name=/dev/md0
> 4K WRITE: fio --bs=4k --direct=1 --rw=write--ioengine=libaio
> --iodepth=512 --runtime=60 --name=/dev/md0
> 4M READ: fio --bs=4m --direct=1 --rw=read --ioengine=libaio
> --iodepth=64 --runtime=60 --name=/dev/md0
> 4M WRITE: fio --bs=4m --direct=1 --rw=read --ioengine=libaio
> --iodepth=64 --runtime=60 --name=/dev/md0
> 
> In each case below, the md chunk size was 64K. In RAID 5 and RAID 6,
> one hot-spare was specified.
> 
> 	raid0 24 x SSD	raid5 23 x SSD	raid6 23 x SSD	raid0 (2 * (raid5 x 11 SSD))						
> 4K read	179,923 IO/s	93,503 IO/s	116,866 IO/s	75,782 IO/s
> 4K write	168,027 IO/s	108,408 IO/s	120,477 IO/s	90,954 IO/s
> 4M read	4,576.7 MB/s	4,406.7 MB/s	4,052.2 MB/s	3,566.6 MB/s
> 4M write	3,146.8 MB/s	1,337.2 MB/s	1,259.9 MB/s	1,856.4 MB/s
> 
> Note that each individual SSD tests out as follows:
> 
> 4k read: 56,342 IO/s
> 4k write: 33,792 IO/s
> 4M read: 231 MB/s
> 4M write: 130 MB/s
> 
> 
> My concerns:
> 
> 1. Given the above individual SSD performance, 24 SSD's in an md array
> is at best getting 4K read/write performance of 2-3 drives, which
> seems very low. I would expect significantly better linear scaling.
> 2. On the other hand, 4M read/write are performing more like 10-15
> drives, which is much better, though still seems like it could get
> better.
> 3. 4k read/write looks good for RAID 0, but drop off by over 40% with
> RAID 5. While somewhat understandable on writes, why such a
> significant hit on reads?
> 4. RAID 5 4M writes take a big hit compared to RAID 0, from 3146 MB/s
> to 1337 MB/s. Despite the RAID 5 overhead, that still seems huge given
> the CPU's at hand. Why?
> 5. Using a RAID 0 across two 11-SSD RAID 5's gives better RAID 5 4M
> write performance, but worse in reads and significantly worse in 4K
> reads/writes. Why?
> 
> 
> Any thoughts would be greatly appreciated, especially patch ideas for
> tweaking options. Thanks!
> 
> Best,
> Tommy



* Re: Optimizing small IO with md RAID
  2011-05-30  7:14 Optimizing small IO with md RAID fibreraid
                   ` (2 preceding siblings ...)
  2011-05-31  3:23 ` Stefan /*St0fF*/ Hübner
@ 2011-05-31  3:48 ` Joe Landman
  3 siblings, 0 replies; 10+ messages in thread
From: Joe Landman @ 2011-05-31  3:48 UTC (permalink / raw)
  To: fibreraid; +Cc: linux-raid

On 05/30/2011 03:14 AM, fibreraid@gmail.com wrote:
> Hi all,
>
> I am looking to optimize md RAID performance as much as possible.
>
> I've managed to get some rather strong large 4M IOps performance, but
> small 4K IOps are still rather subpar, given the hardware.

Understand that much of what passes for a realistic test case for SSDs 
is ... well ... not that good.  Write something other than zeros, and 
turn off write caching on the SSDs.  Then you get results similar to 
what you are seeing here.
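
For example (a sketch only): hdparm -W0 disables the drive write cache 
(SAS-attached devices may need sdparm instead), and fio can write 
non-repeating data via refill_buffers, assuming your fio build has that 
option:

~# hdparm -W0 /dev/sda            (and likewise for the other SSDs)
~# fio --bs=4k --direct=1 --rw=write --ioengine=libaio --iodepth=512 \
      --refill_buffers --runtime=60 --name=/dev/md0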

[...]

> In each case below, the md chunk size was 64K. In RAID 5 and RAID 6,
> one hot-spare was specified.
>
> 	raid0 24 x SSD	raid5 23 x SSD	raid6 23 x SSD	raid0 (2 * (raid5 x 11 SSD))						
> 4K read	179,923 IO/s	93,503 IO/s	116,866 IO/s	75,782 IO/s
> 4K write	168,027 IO/s	108,408 IO/s	120,477 IO/s	90,954 IO/s

A 4k random read/write? Or a sequential?  The 4k sequential reads/writes 
will be merged into a larger size.

A 4k write is going to result in a read-modify-write cycle for this 
config.

These results suggest roughly 7k 4k-write IOPS and about 7.5k 4k-read 
IOPS per drive.  Are these Intel drives?  These numbers are in line 
with what I've measured for them.

> 4M read	4,576.7 MB/s	4,406.7 MB/s	4,052.2 MB/s	3,566.6 MB/s
> 4M write	3,146.8 MB/s	1,337.2 MB/s	1,259.9 MB/s	1,856.4 MB/s
>
> Note that each individual SSD tests out as follows:
>
> 4k read: 56,342 IO/s
> 4k write: 33,792 IO/s
> 4M read: 231 MB/s
> 4M write: 130 MB/s

Is write caching on in this case but not the other?

>
>
> My concerns:
>
> 1. Given the above individual SSD performance, 24 SSD's in an md array
> is at best getting 4K read/write performance of 2-3 drives, which
> seems very low. I would expect significantly better linear scaling.

You've got lots of RMW cycles going on on the write side; I wouldn't 
expect million-IOPS performance out of a system like this.


> 2. On the other hand, 4M read/write are performing more like 10-15
> drives, which is much better, though still seems like it could get
> better.

These controllers are often on PCIe x8 gen 2 ports.  That's 4 GB/s maximum 
in each direction.  After protocol overhead on the bus, you get about 86% 
of that bandwidth, which is roughly 3.4 GB/s.  So your 4+ GB/s results are 
either the result of caching, or of multiple controllers.  Since I see the 
direct=1, I am guessing multiple controllers.  Unless you have a single 
controller in a PCIe x16 gen 2 slot ...



-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

