* Successful RAID 6 setup
@ 2009-11-04 18:03 Andrew Dunn
  2009-11-04 18:40 ` Leslie Rhorer
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Andrew Dunn @ 2009-11-04 18:03 UTC (permalink / raw)
  To: linux-raid

I sent this a couple days ago, wondering if it fell through the cracks or if I am asking the wrong questions.

------

I will preface this by saying I only need about 100MB/s out of my array
because I access it via a gigabit crossover cable.

I am backing up all of my information right now (~4TB) with the
intention of re-creating this array with a larger chunk size and
possibly tweaking the file system a little bit.

My original array was a raid6 of 9 WD caviar black drives, the chunk
size was 64k. I use USAS-AOC-L8i controllers to address all of my drives
and the TLER setting on the drives is enabled for 7 seconds.

storrgie@ALEXANDRIA:~$ sudo mdadm -D /dev/md0
/dev/md0:
        Version : 00.90
  Creation Time : Wed Oct 14 19:59:46 2009
     Raid Level : raid6
     Array Size : 6837319552 (6520.58 GiB 7001.42 GB)
  Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
   Raid Devices : 9
  Total Devices : 9
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Mon Nov  2 16:58:43 2009
          State : active
 Active Devices : 9
Working Devices : 9
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 64K

           UUID : 53dadda1:c58785d5:613e2239:070da8c8 (local to host
ALEXANDRIA)
         Events : 0.649527

    Number   Major   Minor   RaidDevice State
       0       8       65        0      active sync   /dev/sde1
       1       8       81        1      active sync   /dev/sdf1
       2       8       97        2      active sync   /dev/sdg1
       3       8      113        3      active sync   /dev/sdh1
       4       8      129        4      active sync   /dev/sdi1
       5       8      145        5      active sync   /dev/sdj1
       6       8      161        6      active sync   /dev/sdk1
       7       8      177        7      active sync   /dev/sdl1
       8       8      193        8      active sync   /dev/sdm1

I have noticed slow rebuild times, starting when I first created the
array, and intermittent lockups while writing large data sets.

Based on some reading I was thinking of adjusting my chunk size to
1024k, and trying to figure out the extra steps required when creating
a file system on top of it.

Questions:

Should I have the TLER on my drives enabled? (WDTLER, seven seconds)

Is 1024k chunk size going to be a good choice for my purposes? (I use
this for storing files that are 4MiB to 16GiB.)

Is ext4 the ideal file system for my purposes?

Should I be investigating the file system stripe size and chunk size,
or let mkfs choose these for me? If I need to, please be kind enough
to point me in a good direction, as I am new to this lower-level file
system stuff.

Can I change the properties of my file system in place (ext4 or other)
so that I can tweak the stripe size when I add more drives and grow the
array?

Should I be asking any other questions?

Thanks a ton. This is the first mailing list I have ever subscribed
to; I am really excited to see what you all say.

-- 
Andrew Dunn
http://agdunn.net



* RE: Successful RAID 6 setup
  2009-11-04 18:03 Successful RAID 6 setup Andrew Dunn
@ 2009-11-04 18:40 ` Leslie Rhorer
  2009-11-07 18:35   ` Doug Ledford
  2009-11-06 21:22 ` Thomas Fjellstrom
  2009-11-08  7:07 ` Beolach
  2 siblings, 1 reply; 9+ messages in thread
From: Leslie Rhorer @ 2009-11-04 18:40 UTC (permalink / raw)
  To: linux-raid

> I will preface this by saying I only need about 100MB/s out of my array
> because I access it via a gigabit crossover cable.

	That's certainly within the capabilities of a good setup.

> I am backing up all of my information right now (~4TB) with the
> intention of re-creating this array with a larger chunk size and
> possibly tweaking the file system a little bit.
> 
> My original array was a raid6 of 9 WD caviar black drives, the chunk
> size was 64k. I use USAS-AOC-L8i controllers to address all of my drives
> and the TLER setting on the drives is enabled for 7 seconds.

	I would recommend a larger chunk size.  I'm using 256K, and even
512K or 1024K probably would not be excessive.

> storrgie@ALEXANDRIA:~$ sudo mdadm -D /dev/md0
> /dev/md0:
>         Version : 00.90

	I definitely recommend something other than 0.9, especially if this
array is to grow a lot.
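
	For illustration only, re-creating the array with a bigger chunk
and a newer superblock might look something like the following sketch
(device names and the exact chunk size are placeholders -- check the
mdadm man page before running anything destructive):

# DESTRUCTIVE: only after the data is backed up and the old array is
# stopped.  --metadata=1.2 selects a v1.x superblock; --chunk is in KiB.
mdadm --stop /dev/md0
mdadm --create /dev/md0 --level=6 --raid-devices=9 \
      --metadata=1.2 --chunk=512 /dev/sd[e-m]1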

> I have noticed slow rebuilding time when I first created the array and
> intermittent lockups while writing large data sets.

	Lock-ups are not good.  Investigate your kernel log.  A write-intent
bitmap is recommended to reduce rebuild time.
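
	If you want to try the bitmap, it can be added to a running array;
my understanding is that it mostly helps resyncs after an unclean
shutdown or after re-adding a member that dropped out briefly:

# add an internal write-intent bitmap to an existing array
mdadm --grow /dev/md0 --bitmap=internal
# remove it again if the write overhead turns out to be too costly
mdadm --grow /dev/md0 --bitmap=none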

> Is ext4 the ideal file system for my purposes?

	I'm using xfs.  YMMV.

> Should I be investigating into the file system stripe size and chunk
> size or let mkfs choose these for me? If I need to, please be kind to
> point me in a good direction as I am new to this lower level file system
> stuff.

	I don't know specifically about ext4, but xfs did a fine job of
assigning stripe and chunk size.
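
	As a rough sketch of what that looks like in practice (the md
device and mount point below are just examples), mkfs.xfs reads the
geometry from the md device and prints the sunit/swidth it picked:

# mkfs.xfs picks up the stripe geometry from md automatically
mkfs.xfs /dev/md0
# after mounting, xfs_info on the mount point shows the same values
xfs_info /mnt/array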

> Can I change the properties of my file system in place (ext4 or other)
> so that I can tweak the stripe size when I add more drives and grow the
> array?

	One can with xfs.  I expect ext4 may be the same.




* Re: Successful RAID 6 setup
  2009-11-04 18:03 Successful RAID 6 setup Andrew Dunn
  2009-11-04 18:40 ` Leslie Rhorer
@ 2009-11-06 21:22 ` Thomas Fjellstrom
  2009-11-08  7:07 ` Beolach
  2 siblings, 0 replies; 9+ messages in thread
From: Thomas Fjellstrom @ 2009-11-06 21:22 UTC (permalink / raw)
  To: Andrew Dunn; +Cc: linux-raid

On Wed November 4 2009, Andrew Dunn wrote:
> I sent this a couple days ago, wondering if it fell through the cracks or
>  if I am asking the wrong questions.
> 
> ------
> 
> I will preface this by saying I only need about 100MB/s out of my array
> because I access it via a gigabit crossover cable.
> 
> I am backing up all of my information right now (~4TB) with the
> intention of re-creating this array with a larger chunk size and
> possibly tweaking the file system a little bit.
> 
> My original array was a raid6 of 9 WD caviar black drives, the chunk
> size was 64k. I use USAS-AOC-L8i controllers to address all of my drives
> and the TLER setting on the drives is enabled for 7 seconds.
> 
> storrgie@ALEXANDRIA:~$ sudo mdadm -D /dev/md0
> /dev/md0:
>         Version : 00.90
>   Creation Time : Wed Oct 14 19:59:46 2009
>      Raid Level : raid6
>      Array Size : 6837319552 (6520.58 GiB 7001.42 GB)
>   Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
>    Raid Devices : 9
>   Total Devices : 9
> Preferred Minor : 0
>     Persistence : Superblock is persistent
> 
>     Update Time : Mon Nov  2 16:58:43 2009
>           State : active
>  Active Devices : 9
> Working Devices : 9
>  Failed Devices : 0
>   Spare Devices : 0
> 
>      Chunk Size : 64K
> 
>            UUID : 53dadda1:c58785d5:613e2239:070da8c8 (local to host
> ALEXANDRIA)
>          Events : 0.649527
> 
>     Number   Major   Minor   RaidDevice State
>        0       8       65        0      active sync   /dev/sde1
>        1       8       81        1      active sync   /dev/sdf1
>        2       8       97        2      active sync   /dev/sdg1
>        3       8      113        3      active sync   /dev/sdh1
>        4       8      129        4      active sync   /dev/sdi1
>        5       8      145        5      active sync   /dev/sdj1
>        6       8      161        6      active sync   /dev/sdk1
>        7       8      177        7      active sync   /dev/sdl1
>        8       8      193        8      active sync   /dev/sdm1
> 
> I have noticed slow rebuilding time when I first created the array and
> intermittent lockups while writing large data sets.
> 
> Per some reading I was thinking of adjusting my chunk size to 1024k, and
> trying to figure out the weird stuff required when creating a file
>  system.
> 
> Questions:
> 
> Should I have the TLER on my drives enabled? (WDTLER, seven seconds)

I've heard people claim that TLER is really only useful for a hardware
RAID card: the card itself can manage the failing sector a lot faster
than the drives can. With software RAID, you might as well let the
drive try its best, or md will want to fail the drive (I'm not sure how
hard it tries to recover a sector that has previously returned an
error).
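
For what it's worth, on drives and smartmontools versions that support
SCT ERC, the equivalent timeout can be queried or set from Linux; the
device name below is an example, and whether a given drive supports or
persists the setting varies:

# query the current SCT error recovery control values, if supported
smartctl -l scterc /dev/sde
# request 7 second read/write recovery limits (units are tenths of a second)
smartctl -l scterc,70,70 /dev/sde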

> Is 1024k chunk size going to be a good choice for my purposes? (I use
> this for storing files that are 4MiB to 16GiB.)
> 
> Is ext4 the ideal file system for my purposes?
> 
> Should I be investigating into the file system stripe size and chunk
> size or let mkfs choose these for me? If I need to, please be kind to
> point me in a good direction as I am new to this lower level file system
> stuff.
> 
> Can I change the properties of my file system in place (ext4 or other)
> so that I can tweak the stripe size when I add more drives and grow the
> array?
> 
> Should I be asking any other questions?
> 
> Thanks a ton, this is the first mailing list I have ever subscribed, I
> am really excited to see what you all say.
> 


-- 
Thomas Fjellstrom
tfjellstrom@shaw.ca


* Re: Successful RAID 6 setup
  2009-11-04 18:40 ` Leslie Rhorer
@ 2009-11-07 18:35   ` Doug Ledford
  2009-11-08  6:42     ` Beolach
  0 siblings, 1 reply; 9+ messages in thread
From: Doug Ledford @ 2009-11-07 18:35 UTC (permalink / raw)
  To: Leslie Rhorer; +Cc: linux-raid


On 11/04/2009 01:40 PM, Leslie Rhorer wrote:
>> I will preface this by saying I only need about 100MB/s out of my array
>> because I access it via a gigabit crossover cable.
> 
> 	That's certainly within the capabilities of a good setup.
> 
>> I am backing up all of my information right now (~4TB) with the
>> intention of re-creating this array with a larger chunk size and
>> possibly tweaking the file system a little bit.
>>
>> My original array was a raid6 of 9 WD caviar black drives, the chunk
>> size was 64k. I use USAS-AOC-L8i controllers to address all of my drives
>> and the TLER setting on the drives is enabled for 7 seconds.
> 
> 	I would recommend a larger chunk size.  I'm using 256K, and even
> 512K or 1024K probably would not be excessive.

OK, I've got some data that I'm not quite ready to send out yet, but it
maps out the relationship between max_sectors_kb (the largest request
size a disk can process, which varies based upon the scsi host adapter
in question, but for SATA adapters is capped at and defaults to 512KB
per request) and chunk size for a raid0 array across 4 or 5 disks (I
could run other array sizes too, and that's part of what I'm waiting on
before sending the data out).  The point here is that a raid0 array
will show up more of the md/lower-layer block device interactions,
whereas raid5/6 would muddy the waters with other stuff.  The results
of the tests I ran were pretty conclusive that the sweet spot is when
the chunk size equals max_sectors_kb, and since SATA is the predominant
interface today and it defaults to 512K, that gives a 512K chunk as the
sweet spot.  Given that the chunk size is generally about optimizing
block device operations at the command/queue level, it should transfer
directly to raid5/6 as well.
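
For anyone who wants to check what their own disks report, the value
lives in sysfs (the disk name is an example; the value can be lowered,
but not raised above the hardware ceiling):

# largest request size, in KiB, the block layer will send to this disk
cat /sys/block/sde/queue/max_sectors_kb
# the hardware/driver ceiling for the same value
cat /sys/block/sde/queue/max_hw_sectors_kb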

>> storrgie@ALEXANDRIA:~$ sudo mdadm -D /dev/md0
>> /dev/md0:
>>         Version : 00.90
> 
> 	I definitely recommend something other than 0.9, especially if this
> array is to grow a lot.
> 
>> I have noticed slow rebuilding time when I first created the array and
>> intermittent lockups while writing large data sets.
> 
> 	Lock-ups are not good.  Investigate your kernel log.  A write-intent
> bitmap is recommended to reduce rebuild time.
> 
>> Is ext4 the ideal file system for my purposes?
> 
> 	I'm using xfs.  YMMV.
> 
>> Should I be investigating into the file system stripe size and chunk
>> size or let mkfs choose these for me? If I need to, please be kind to
>> point me in a good direction as I am new to this lower level file system
>> stuff.
> 
> 	I don't know specifically about ext4, but xfs did a fine job of
> assigning stripe and chunk size.

xfs pulls this out all on its own; ext2/3/4 need to be told (and you
need very recent ext utils to tell it both stripe and stride sizes).
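
As a concrete (hypothetical) example for this 9-drive raid6 with a
512K chunk and 4K filesystem blocks: stride = 512/4 = 128 blocks, and
stripe-width = stride * data disks = 128 * (9 - 2) = 896 blocks:

# recent e2fsprogs: pass the RAID geometry explicitly at mkfs time
mkfs.ext4 -E stride=128,stripe-width=896 /dev/md0
# dumpe2fs will echo the recorded stride/stripe width back
dumpe2fs -h /dev/md0 | grep -iE 'stride|stripe'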

>> Can I change the properties of my file system in place (ext4 or other)
>> so that I can tweak the stripe size when I add more drives and grow the
>> array?
> 
> 	One can with xfs.  I expect ext4 may be the same.

Actually, this needs to be clarified somewhat.  You can tweak xfs in
terms of the sunit and swidth settings.  This will affect new
allocations *only*!  All of your existing data will still be wherever
it was, and if that happens to be not so well laid out for the new
array, too bad.  The ext filesystems use this information at filesystem
creation time to lay out their block groups, inode tables, etc. in such
a fashion that they are aligned to individual chunks and also so that
they are *not* exactly a stripe width apart from each other (which
forces the metadata to reside on different disks and avoids the
possible pathological case where the metadata blocks always fall on the
same disk in the array, making that one disk a huge bottleneck for the
rest of the array).  Once an ext filesystem is created, I don't think
it uses the data much any longer, but I could be wrong.  However, I
know that it won't be rearranged for your new layout, so you get what
you get after you grow the fs.
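
To make the xfs side concrete (values are an illustration only: sunit
and swidth are given in 512-byte sectors, so a 512K chunk is sunit=1024
and, with 7 data disks, swidth=7168):

# after growing the array, mount with updated geometry; only new
# allocations will honour the new values
mount -t xfs -o sunit=1024,swidth=7168 /dev/md0 /mnt/array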


-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband




* Re: Successful RAID 6 setup
  2009-11-07 18:35   ` Doug Ledford
@ 2009-11-08  6:42     ` Beolach
  2009-11-08 16:15       ` Doug Ledford
  2009-11-09 17:37       ` Bill Davidsen
  0 siblings, 2 replies; 9+ messages in thread
From: Beolach @ 2009-11-08  6:42 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Leslie Rhorer, linux-raid

On Sat, Nov 7, 2009 at 11:35, Doug Ledford <dledford@redhat.com> wrote:
> On 11/04/2009 01:40 PM, Leslie Rhorer wrote:
>>       I would recommend a larger chunk size.  I'm using 256K, and even
>> 512K or 1024K probably would not be excessive.
>
> OK, I've got some data that I'm not quite ready to send out yet, but it
> maps out the relationship between max_sectors_kb (largest request size a
> disk can process, which varies based upon scsi host adapter in question,
> but for SATA adapters is capped at and defaults to 512KB max per
> request) and chunk size for a raid0 array across 4 disks or 5 disks (I
> could run other array sizes too, and that's part of what I'm waiting on
> before sending the data out).  The point here being that a raid0 array
> will show up more of the md/lower layer block device interactions where
> as raid5/6 would muddy the waters with other stuff.  The results of the
> tests I ran were pretty conclusive that the sweet spot for chunk size is
> when chunk size is == max_sectors_kb, and since SATA is the predominant
> thing today and it defaults to 512K, that gives a 512K chunk as the
> sweet spot.  Given that the chunk size is generally about optimizing
> block device operations at the command/queue level, it should transfer
> directly to raid5/6 as well.
>

This only really applies to large sequential I/O loads, right?  I seem
to recall smaller chunk sizes being more effective for smaller random
I/O loads.

>>> Is ext4 the ideal file system for my purposes?
>>
>>       I'm using xfs.  YMMV.
>>

I'm also using XFS, and for now it's maybe safer than ext4 (in the
2.6.32-rc series there was recently an obscure bug that could cause
ext4 filesystem corruption - it's fixed now, but that type of thing
scares me).  But ext[234] have forward compatibility w/ btrfs (you can
convert an ext4 to a btrfs), so if you want to use btrfs when it
stabilizes, maybe you should go w/ ext3 or ext4 for now.  But that
might take a long time.  XFS can perform better than ext[234], but if
you're really only going to be accessing this over a 1Gbit/s network,
it won't matter.
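
If you do go that route some day, my understanding is that the
conversion tool in btrfs-progs is btrfs-convert, which keeps the old
filesystem image around so the conversion can be rolled back (device
name below is an example; convert only an unmounted filesystem):

# convert an existing ext filesystem in place
btrfs-convert /dev/md0
# roll back to the original ext filesystem if needed
btrfs-convert -r /dev/md0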

>>> Should I be investigating into the file system stripe size and chunk
>>> size or let mkfs choose these for me? If I need to, please be kind to
>>> point me in a good direction as I am new to this lower level file system
>>> stuff.
>>
>>       I don't know specifically about ext4, but xfs did a fine job of
>> assigning stripe and chunk size.
>
> xfs pulls this out all on it's own, ext2/3/4 need to be told (and you
> need very recent ext utils to tell it both stripe and stride sizes).
>
>>> Can I change the properties of my file system in place (ext4 or other)
>>> so that I can tweak the stripe size when I add more drives and grow the
>>> array?
>>
>>       One can with xfs.  I expect ext4 may be the same.
>
> Actually, this needs clarified somewhat.  You can tweak xfs in terms of
> the sunit and swidth settings.  This will effect new allocations *only*!
>  All of your existing data will still be wherever it was and if that
> happens to be not so well laid out for the new array, too bad.  For the
> ext filesystems, they use this information at filesystem creation time
> to lay out their block groups, inode tables, etc. in such a fashion that
> they are aligned to individual chunks and also so that they are *not*
> exactly stripe width apart from each other (which forces the metadata to
> reside on different disks and avoids the possible pathological case
> where you could accidentally end up with the metadata blocks always
> falling on the same disk in the array making that one disk a huge
> bottleneck to the rest of the array).  Once an ext filesystem is
> created, I don't think it uses the data much any longer, but I could be
> wrong.  However, I know that it won't be rearranged for your new layout,
> so you get what you get after you grow the fs.
>

You can change both the stride & stripe width extended options on an
existing ext[234] filesystem w/ tune2fs (w/ XFS, it's done w/ mount
options), and as I understand the tune2fs man page the block allocator
should use the new values, although stripe width seems to be the one
that matters more.
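
Something like the following sketch (the values assume a 512K chunk,
4K blocks and 7 data disks -- recompute for the real geometry, and
check the tune2fs man page for the exact extended-option spelling):

# update the recorded RAID geometry on an existing ext[234] filesystem
tune2fs -E stride=128,stripe-width=896 /dev/md0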


--
Conway S. Smith


* Re: Successful RAID 6 setup
  2009-11-04 18:03 Successful RAID 6 setup Andrew Dunn
  2009-11-04 18:40 ` Leslie Rhorer
  2009-11-06 21:22 ` Thomas Fjellstrom
@ 2009-11-08  7:07 ` Beolach
  2009-11-08 13:31   ` Andrew Dunn
  2 siblings, 1 reply; 9+ messages in thread
From: Beolach @ 2009-11-08  7:07 UTC (permalink / raw)
  To: Andrew Dunn; +Cc: linux-raid

On Wed, Nov 4, 2009 at 11:03, Andrew Dunn <andrew.g.dunn@gmail.com> wrote:
> I sent this a couple days ago, wondering if it fell through the cracks or if I am asking the wrong questions.
>
> ------
>
> I will preface this by saying I only need about 100MB/s out of my array
> because I access it via a gigabit crossover cable.
>
> I am backing up all of my information right now (~4TB) with the
> intention of re-creating this array with a larger chunk size and
> possibly tweaking the file system a little bit.
>
> My original array was a raid6 of 9 WD caviar black drives, the chunk
> size was 64k. I use USAS-AOC-L8i controllers to address all of my drives
> and the TLER setting on the drives is enabled for 7 seconds.
>
<snip mdadm -D>

That array should easily be able to meet your 100MB/s speed
requirement.  If you really are only accessing it over a 1Gbit/s link,
I wouldn't worry much about tweaking for performance.

>
> I have noticed slow rebuilding time when I first created the array and
> intermittent lockups while writing large data sets.
>
> Per some reading I was thinking of adjusting my chunk size to 1024k, and
> trying to figure out the weird stuff required when creating a file system.
>

Can you quantify "slow rebuilding time"?  And was that just when you first
created the array, or do you still see slowness when you check/repair the
array?  Using a write-intent bitmap might help here.
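
For quantifying it, the kernel reports rebuild/resync progress and
speed directly, and the md speed limits are tunable (the values below
are the usual defaults, not recommendations):

# current rebuild progress and speed for all md arrays
cat /proc/mdstat
# floor and ceiling for resync speed, in KiB/s per device
cat /proc/sys/dev/raid/speed_limit_min   # typically 1000
cat /proc/sys/dev/raid/speed_limit_max   # typically 200000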

I don't know that any of the options discussed so far are likely to help w/
intermittent lockups.  When you see them, are you writing your data over
the network?  How badly does it lock up your system?  What kernel version
are you using?


Good Luck,
--
Conway S. Smith


* Re: Successful RAID 6 setup
  2009-11-08  7:07 ` Beolach
@ 2009-11-08 13:31   ` Andrew Dunn
  0 siblings, 0 replies; 9+ messages in thread
From: Andrew Dunn @ 2009-11-08 13:31 UTC (permalink / raw)
  To: Beolach; +Cc: linux-raid

Thanks for the replies.

I began using mdadm on Ubuntu 8.04, which I believe was the 2.6.24
kernel.  The rebuild times under that kernel with six 500GiB drives
were very fast... I think like 100-200m, but that was with a raid5.

Using raid6 under Ubuntu 9.04, which was 2.6.28, I would get 20m
rebuild times on raid6 with 1TiB drives.  This would literally take
10-14 hours depending on my array size.

Now with Ubuntu 9.10, which is 2.6.31, I get like 50-75m on the same
exact raid6 rebuild.

I didn't realize that software/OS mattered so much for performance from
kernel to kernel.

I need to look more into this bitmap that people have mentioned.


Beolach wrote:
> On Wed, Nov 4, 2009 at 11:03, Andrew Dunn <andrew.g.dunn@gmail.com> wrote:
>   
>> I sent this a couple days ago, wondering if it fell through the cracks or if I am asking the wrong questions.
>>
>> ------
>>
>> I will preface this by saying I only need about 100MB/s out of my array
>> because I access it via a gigabit crossover cable.
>>
>> I am backing up all of my information right now (~4TB) with the
>> intention of re-creating this array with a larger chunk size and
>> possibly tweaking the file system a little bit.
>>
>> My original array was a raid6 of 9 WD caviar black drives, the chunk
>> size was 64k. I use USAS-AOC-L8i controllers to address all of my drives
>> and the TLER setting on the drives is enabled for 7 seconds.
>>
>>     
> <snip mdadm -D>
>
> That array should easily be able to fill your 100MB/s speed requirement.  If
> you really are only accessing over a 1Gib/s link, I wouldn't worry much
> about tweaking for performance.
>
>   
>> I have noticed slow rebuilding time when I first created the array and
>> intermittent lockups while writing large data sets.
>>
>> Per some reading I was thinking of adjusting my chunk size to 1024k, and
>> trying to figure out the weird stuff required when creating a file system.
>>
>>     
>
> Can you quantify "slow rebuilding time"?  And was that just when you first
> created the array, or do you still see slowness when you check/repair the
> array?  Using a write-intent bitmap might help here.
>
> I don't know that any of the options discussed so far are likely to help w/
> intermittent lockups.  When you see them, are you writing your data over
> the network?  How badly does it lock up your system?  What kernel version
> are you using?
>
>
> Good Luck,
> --
> Conway S. Smith
>   

-- 
Andrew Dunn
http://agdunn.net



* Re: Successful RAID 6 setup
  2009-11-08  6:42     ` Beolach
@ 2009-11-08 16:15       ` Doug Ledford
  2009-11-09 17:37       ` Bill Davidsen
  1 sibling, 0 replies; 9+ messages in thread
From: Doug Ledford @ 2009-11-08 16:15 UTC (permalink / raw)
  To: Beolach; +Cc: Leslie Rhorer, linux-raid


On 11/08/2009 01:42 AM, Beolach wrote:
> On Sat, Nov 7, 2009 at 11:35, Doug Ledford <dledford@redhat.com> wrote:
>> On 11/04/2009 01:40 PM, Leslie Rhorer wrote:
>>>       I would recommend a larger chunk size.  I'm using 256K, and even
>>> 512K or 1024K probably would not be excessive.
>>
>> OK, I've got some data that I'm not quite ready to send out yet, but it
>> maps out the relationship between max_sectors_kb (largest request size a
>> disk can process, which varies based upon scsi host adapter in question,
>> but for SATA adapters is capped at and defaults to 512KB max per
>> request) and chunk size for a raid0 array across 4 disks or 5 disks (I
>> could run other array sizes too, and that's part of what I'm waiting on
>> before sending the data out).  The point here being that a raid0 array
>> will show up more of the md/lower layer block device interactions where
>> as raid5/6 would muddy the waters with other stuff.  The results of the
>> tests I ran were pretty conclusive that the sweet spot for chunk size is
>> when chunk size is == max_sectors_kb, and since SATA is the predominant
>> thing today and it defaults to 512K, that gives a 512K chunk as the
>> sweet spot.  Given that the chunk size is generally about optimizing
>> block device operations at the command/queue level, it should transfer
>> directly to raid5/6 as well.
>>
> 
> This only really applies for large sequential io loads, right?  I seem
> to recall
> smaller chunk sizes being more effective for smaller random io loads.

Actually, no.  Small chunk sizes don't help with truly random I/O.
Assuming the I/O is truly random, your layout doesn't really matter
because no matter how you lay stuff out, you're still going to get random
I/O to each drive.  The only real reason to use small chunk sizes in the
past, and this reason is no longer true today, was to stream I/O across
all the platters simultaneously on even modest size sequential I/O in
order to be able to get the speed of all drives combined as your maximum
I/O speed.  This has always, and still does, hurt random I/O
performance.  But, back in the day when disks only did 5 to 10MB/s of
throughput, and the computer could do hundreds, it made a certain amount
of sense.  Now we can do hundreds per disk, and it doesn't.  Since the
sequential performance of even a single disk is probably good enough in
most cases, it's far preferable to optimize your array for random I/O.

Well, that means optimizing for seeks.  In any given array, your maximum
number of operations is equal to the maximum number of seeks that can be
performed (since with small random I/O you generally don't saturate the
bandwidth).  So, every time a command spans a chunk from one disk to the
next, that single command consumes one of the possible seeks on both
disks.  In order to optimize for seeks, you need to reduce the number of
seeks per command in your array as much as possible, and that means at
least attempting to keep each read/write, whether random or sequential,
on a single disk for as long as possible.  This gives the highest
probability that any given command will complete while only accessing a
single disk.  If that command completes while only touching a single
disk, all the other disks in your array are free to complete other
commands simultaneously.  So, in an optimal raid array for random I/O,
you want all of your disks handling a complete command at a time so that
your disks are effectively running in parallel.  When commands regularly
span across chunks to other disks you gain speed for a specific command
at the expense of consuming multiple seeks.
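
A quick back-of-the-envelope illustration (not a benchmark): the
worst-case number of chunks -- and therefore disks and seeks -- that a
single request can touch is roughly request/chunk + 1 when the request
is not aligned to a chunk boundary.

# worst-case chunks touched by one 512K request for two chunk sizes
req=512
for chunk in 64 512; do
    echo "chunk=${chunk}K: up to $(( req / chunk + 1 )) chunks per ${req}K request"
done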

Past testing has shown that this effect will produce increased
performance under random I/O even with chunk sizes going up to 4096k.
However, we reached a point of diminishing returns somewhere around
256k.  It seems that as soon as you reach a chunk size equal (or roughly
equal) to the max command size for a drive, then it doesn't do much
better to go any higher (less than 1% improvement for huge jumps in
chunk size).  We didn't fully analyse it, but I would venture to guess
that most likely the maximum command size is large enough that the read
ahead code in the filesystem might grab up to one command's worth of data
in a short enough period of time for it to be considered sequential by
the disk, but would take long enough before grabbing the next sequential
chunk that other intervening reads/writes would have essentially made
these two sequential operations have an intervening seek and therefore
be as random themselves (speaking of one large sequential I/O in the
middle of a bunch of random I/O).

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband




* Re: Successful RAID 6 setup
  2009-11-08  6:42     ` Beolach
  2009-11-08 16:15       ` Doug Ledford
@ 2009-11-09 17:37       ` Bill Davidsen
  1 sibling, 0 replies; 9+ messages in thread
From: Bill Davidsen @ 2009-11-09 17:37 UTC (permalink / raw)
  To: Beolach; +Cc: Doug Ledford, Leslie Rhorer, linux-raid

Beolach wrote:
> On Sat, Nov 7, 2009 at 11:35, Doug Ledford <dledford@redhat.com> wrote:
>   
>> On 11/04/2009 01:40 PM, Leslie Rhorer wrote:
>>     
>>>       I would recommend a larger chunk size.  I'm using 256K, and even
>>> 512K or 1024K probably would not be excessive.
>>>       
>> OK, I've got some data that I'm not quite ready to send out yet, but it
>> maps out the relationship between max_sectors_kb (largest request size a
>> disk can process, which varies based upon scsi host adapter in question,
>> but for SATA adapters is capped at and defaults to 512KB max per
>> request) and chunk size for a raid0 array across 4 disks or 5 disks (I
>> could run other array sizes too, and that's part of what I'm waiting on
>> before sending the data out).  The point here being that a raid0 array
>> will show up more of the md/lower layer block device interactions where
>> as raid5/6 would muddy the waters with other stuff.  The results of the
>> tests I ran were pretty conclusive that the sweet spot for chunk size is
>> when chunk size is == max_sectors_kb, and since SATA is the predominant
>> thing today and it defaults to 512K, that gives a 512K chunk as the
>> sweet spot.  Given that the chunk size is generally about optimizing
>> block device operations at the command/queue level, it should transfer
>> directly to raid5/6 as well.
>>
>>     
>
> This only really applies for large sequential io loads, right?  I seem
> to recall
> smaller chunk sizes being more effective for smaller random io loads.
>   

Not true now (if it ever was).  The operative limit here is seek time,
not transfer time.  Back in the day of old and slow drives hanging off
old and slow connections, the time to transfer the data was somewhat of
an issue.  Current SATA drives and controllers have higher transfer
rates, and until SSDs make seek times smaller, bigger is better, within
reason.

Related question: that said, why is a six-drive raid6 slower than a
four-drive one?  On a small write all the data chunks have to be read,
but that can be done in parallel, so the limit should stay at the seek
time of the slowest drive.  In practice it behaves as if the data
chunks were being read one at a time.  Is that real, or just fallout
from a test that wasn't long enough to smooth out the data?

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein



