linux-kernel.vger.kernel.org archive mirror
* raid0 slower than devices it is assembled of?
@ 2003-12-15 13:34 Witold Krecicki
  2003-12-15 15:44 ` Witold Krecicki
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Witold Krecicki @ 2003-12-15 13:34 UTC (permalink / raw)
  To: Linux Kernel Mailing List

I've got / on linux-raid0 on 2.6.0-t11-cset-20031209_2107:
<cite>
/dev/md/1:
        Version : 00.90.01
  Creation Time : Thu Sep 11 22:04:54 2003
     Raid Level : raid0
     Array Size : 232315776 (221.55 GiB 237.89 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Mon Dec 15 12:55:48 2003
          State : clean, no-errors
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 64K

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3
           UUID : b66633c2:ff11f60d:00119f8d:7bb9fc6c
         Events : 0.357
</cite>
The disks are two ST3120026AS drives connected to a SiI3112A controller, driven by 
sata_sil 'patched' so that no block-size limit is applied (the limit is not needed 
for this controller).

These are the results of hdparm -tT on the drives:
<cite>
/dev/md/1:
 Timing buffer-cache reads:   128 MB in  0.40 seconds =323.28 MB/sec
 Timing buffered disk reads:  64 MB in  1.75 seconds = 36.47 MB/sec
/dev/sda:
 Timing buffer-cache reads:   128 MB in  0.41 seconds =309.23 MB/sec
 Timing buffered disk reads:  64 MB in  1.46 seconds = 43.87 MB/sec
/dev/sdb:
 Timing buffer-cache reads:   128 MB in  0.41 seconds =315.32 MB/sec
 Timing buffered disk reads:  64 MB in  1.23 seconds = 52.04 MB/sec
</cite>
What seems strange to me is that the second drive is faster than the first one 
(the drives are identical and partitioned the same way; sd[a,b]2 is swap space 
(not in use at the time of the test), sd[a,b]1 is /boot (raid1)).
What is even stranger is that raid0, which should be faster than a single drive, 
is quite a bit slower - what is the reason for that?
-- 
Witold Kręcicki (adasi) adasi [at] culm.net
GPG key: 7AE20871
http://www.culm.net

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-15 13:34 raid0 slower than devices it is assembled of? Witold Krecicki
@ 2003-12-15 15:44 ` Witold Krecicki
  2003-12-16  4:01 ` jw schultz
  2003-12-16 21:25 ` jw schultz
  2 siblings, 0 replies; 27+ messages in thread
From: Witold Krecicki @ 2003-12-15 15:44 UTC (permalink / raw)
  To: Linux Kernel Mailing List

On Monday, 15 December 2003, at 14:34, Witold Krecicki wrote:
Also, something I ran into while investigating why one drive is slower than the 
other: is there a way to use SMART on a drive connected via SATA (libata)?

-- 
Witold Kręcicki (adasi) adasi [at] culm.net
GPG key: 7AE20871
http://www.culm.net

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-15 13:34 raid0 slower than devices it is assembled of? Witold Krecicki
  2003-12-15 15:44 ` Witold Krecicki
@ 2003-12-16  4:01 ` jw schultz
  2003-12-16 14:51   ` Helge Hafting
                     ` (2 more replies)
  2003-12-16 21:25 ` jw schultz
  2 siblings, 3 replies; 27+ messages in thread
From: jw schultz @ 2003-12-16  4:01 UTC (permalink / raw)
  To: Linux Kernel Mailing List

On Mon, Dec 15, 2003 at 02:34:54PM +0100, Witold Krecicki wrote:
> I've got / on linux-raid0 on 2.6.0-t11-cset-20031209_2107:
> <cite>
> /dev/md/1:
>         Version : 00.90.01
>   Creation Time : Thu Sep 11 22:04:54 2003
>      Raid Level : raid0
>      Array Size : 232315776 (221.55 GiB 237.89 GB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 1
>     Persistence : Superblock is persistent
> 
>     Update Time : Mon Dec 15 12:55:48 2003
>           State : clean, no-errors
>  Active Devices : 2
> Working Devices : 2
>  Failed Devices : 0
>   Spare Devices : 0
> 
>      Chunk Size : 64K
> 
>     Number   Major   Minor   RaidDevice State
>        0       8        3        0      active sync   /dev/sda3
>        1       8       19        1      active sync   /dev/sdb3
>            UUID : b66633c2:ff11f60d:00119f8d:7bb9fc6c
>          Events : 0.357
> </cite>
> Disks are two ST3120026AS connected to sii3112a controller, driven by sata_sil 
> 'patched' so no limit for block size is applied (it's not needed for it). 
> 
> Those are results of hdparm -tT on drives:
> <cite>
> /dev/md/1:
>  Timing buffer-cache reads:   128 MB in  0.40 seconds =323.28 MB/sec
>  Timing buffered disk reads:  64 MB in  1.75 seconds = 36.47 MB/sec
> /dev/sda:
>  Timing buffer-cache reads:   128 MB in  0.41 seconds =309.23 MB/sec
>  Timing buffered disk reads:  64 MB in  1.46 seconds = 43.87 MB/sec
> /dev/sdb:
>  Timing buffer-cache reads:   128 MB in  0.41 seconds =315.32 MB/sec
>  Timing buffered disk reads:  64 MB in  1.23 seconds = 52.04 MB/sec
> </cite>
> What seems strange to me is that second drive is faster than first one 
> (devices are symmetrical, sd[a,b]2 is swapspace (not mounted at time of 
> test), sd[a,b]1 is /boot (raid1)).
> What is even stranger is that raid0 which should be faster than single drive, 
> is pretty much slower- what's the reason of that?

Overhead+randomness would make an md stripe slower.

	This measurement is an indication of how fast the
	drive can sustain sequential data reads

No Linux [R]AID improves sequential performance.  How would
reading 65KB from two disks in alternation be faster than
reading continuously from one disk?

There used to be some HW raid controllers that might have
improved sequential performance by using stripe sizes of 512
bytes (every access hit all disks) but then you suffered
near worst case latency with every non-cached read.
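To put numbers on the alternation described above, here is a minimal sketch (not from the original mail), assuming only the array quoted earlier: two members, 64K chunks. A purely sequential read just ping-pongs between sda3 and sdb3 every 64K, so it only goes faster than one disk if the block layer keeps both members streaming at once.

/* Sketch (not md code): map a raid0 byte offset to a member disk and an
 * offset on that member, for the 2-disk, 64K-chunk array in this thread. */
#include <stdio.h>

#define CHUNK  (64 * 1024)      /* "Chunk Size : 64K" above */
#define NDISKS 2

static void map_offset(unsigned long long off,
                       int *disk, unsigned long long *disk_off)
{
    unsigned long long chunk_nr = off / CHUNK;           /* which chunk */
    *disk     = (int)(chunk_nr % NDISKS);                /* round-robin member */
    *disk_off = (chunk_nr / NDISKS) * CHUNK + off % CHUNK;
}

int main(void)
{
    /* Walk a sequential read in 64K steps: it alternates sda3/sdb3. */
    for (unsigned long long off = 0; off < 8ULL * CHUNK; off += CHUNK) {
        int disk;
        unsigned long long doff;

        map_offset(off, &disk, &doff);
        printf("array offset %4lluK -> sd%c3, offset %4lluK\n",
               off / 1024, 'a' + disk, doff / 1024);
    }
    return 0;
}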


-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw@pegasys.ws

		Remember Cernan and Schmitt

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-16  4:01 ` jw schultz
@ 2003-12-16 14:51   ` Helge Hafting
  2003-12-16 16:42     ` Linus Torvalds
  2003-12-16 20:51     ` Andre Hedrick
  2003-12-16 20:09   ` Witold Krecicki
  2003-12-16 21:11   ` Adam Kropelin
  2 siblings, 2 replies; 27+ messages in thread
From: Helge Hafting @ 2003-12-16 14:51 UTC (permalink / raw)
  To: jw schultz; +Cc: Linux Kernel Mailing List

jw schultz wrote:

> No Linux [R]AID improves sequential performance.  How would
> reading 65KB from two disks in alternation be faster than
> reading continuously from one disk?
> 
Raid-0 is ideally N times faster than a single disk, when
you have N disks, because you can read continuously from N
disks instead of from 1, thereby N-doubling the bandwidth.

Whether the current drivers manage that is of course another story.

Helge Hafting



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-16 14:51   ` Helge Hafting
@ 2003-12-16 16:42     ` Linus Torvalds
  2003-12-16 20:58       ` Mike Fedyk
                         ` (3 more replies)
  2003-12-16 20:51     ` Andre Hedrick
  1 sibling, 4 replies; 27+ messages in thread
From: Linus Torvalds @ 2003-12-16 16:42 UTC (permalink / raw)
  To: Helge Hafting; +Cc: jw schultz, Linux Kernel Mailing List



On Tue, 16 Dec 2003, Helge Hafting wrote:
>
> Raid-0 is ideally N times faster than a single disk, when
> you have N disks.

Well, that's a _really_ "ideal" world. Ideal to the point of being
unrealistic.

In most real-world situations, latency is at least as important as
throughput, and often dominates the story. At which point RAID-0 doesn't
improve performance one iota (it might make the seeks shorter, but since
seek latency tends to be dominated by things like rotational delay and
settle times, that's unlikely to be a really noticeable issue).

Latency is noticeable even on what appears to be "pure throughput" tests,
because not only do you seldom get perfect overlap (RAID-0 also increases
your required IO window size by a factor of N to get the N-time
improvement), but even "pure throughput" benchmarks often have small
serialized sections, and Amdahls law bites you in the ass _really_
quickly.

In fact, Amdahls law should be revered a hell of a lot more than Moore's
law. One is a conjecture, the other one is simple math.
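Since Amdahl's law gets invoked here, a small sketch of the formula (speedup = 1 / (s + (1 - s)/N) for a serialized fraction s on N devices) applied to a two-disk stripe; the serial fractions are illustrative assumptions, not measurements.

/* Sketch: Amdahl's law for a 2-disk stripe.
 * speedup(N) = 1 / (s + (1 - s) / N), s = serialized fraction. */
#include <stdio.h>

static double amdahl(double serial, int n)
{
    return 1.0 / (serial + (1.0 - serial) / n);
}

int main(void)
{
    double serial[] = { 0.0, 0.05, 0.10, 0.25 };

    for (int i = 0; i < 4; i++)
        printf("serial fraction %4.0f%% -> at most %.2fx of one disk\n",
               serial[i] * 100.0, amdahl(serial[i], 2));
    /* Even 10% of serialized work (CPU, bus, metadata) caps a two-disk
     * stripe at about 1.8x, before any seek or IO-window penalty. */
    return 0;
}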

Anyway, the serialized sections can be CPU or bus (quite common at the
point where a single disk can stream 50MB/s when accessed linearly), or it
can be things like fetching meta-data (ie indirect blocks).

> Whether the current drivers manage that is of course another story.

No. Please don't confuse limitations of RAID0 with limitations of "the
current drivers".

Yes, the drivers are a part of the picture, but they are a _small_ part of
a very fundamental issue.

The fact is, modern disks are GOOD at streaming data. They're _really_
good at it compared to just about anything else they ever do. The win you
get from even medium-sized stripes on RAID0 is likely to not be all that
noticeable, and you can definitely lose _big_ just because it tends to
hack your IO patterns to pieces.

My personal guess is that modern RAID0 stripes should be on the order of
several MEGABYTES in size rather than the few hundred kB that most people
use (not to mention the people who have 32kB stripes or smaller - they
just kill their IO access patterns with that, and put the CPU at
ridiculous strain).

Big stripes help because:

 - disks already do big transfers well, so you shouldn't split them up.
   Quite frankly, the kinds of access patterns that let you stream
   multiple streams of 50MB/s and get N-way throughput increases just
   don't exist in the real world outside of some very special niches (DoD
   satellite data backup, or whatever).

 - it makes it more likely that the disks in the array really have
   _independent_ IO patterns, ie if you access multiple files the disks
   may not seek around together, but instead one disk accesses one file.
   At this point RAID0 starts to potentially help _latency_, simply
   because by now it may help avoid physical seeking rather than just try
   to make throughput go up.

I may be wrong, of course. But I doubt it.

			Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-16  4:01 ` jw schultz
  2003-12-16 14:51   ` Helge Hafting
@ 2003-12-16 20:09   ` Witold Krecicki
  2003-12-16 21:11   ` Adam Kropelin
  2 siblings, 0 replies; 27+ messages in thread
From: Witold Krecicki @ 2003-12-16 20:09 UTC (permalink / raw)
  To: jw schultz, Linux Kernel Mailing List

On Tuesday, 16 December 2003, at 05:01, jw schultz wrote:
> No Linux [R]AID improves sequential performance.  How would
> reading 65KB from two disks in alternation be faster than
> reading continuously from one disk?
Well, at the beginning I was getting about 85-90MB/sec for buffered array 
reads. That was on 2.4.21-pre or even a patched 2.4.20 (on siimage - in its 
early stages, not the sata_sil driver). Now it's 3 times slower (checked with 
a preemptible kernel, it's even slower) - so something went bad.
-- 
Witold Kręcicki (adasi) adasi [at] culm.net
GPG key: 7AE20871
http://www.culm.net

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-16 14:51   ` Helge Hafting
  2003-12-16 16:42     ` Linus Torvalds
@ 2003-12-16 20:51     ` Andre Hedrick
  2003-12-16 21:04       ` Andre Hedrick
  1 sibling, 1 reply; 27+ messages in thread
From: Andre Hedrick @ 2003-12-16 20:51 UTC (permalink / raw)
  To: Helge Hafting; +Cc: jw schultz, Linux Kernel Mailing List


Helge,

Reads you may gain on writes only if all devices are on single ended mode.
Both pATA and pSCSI suck wind in writes, pSCSI should smoke pATA on reads.
It is all a matter of the physical protocol on the wire.

Only in SATA/SAS will you even reach close to ideal world.

Cheers,

Andre Hedrick
LAD Storage Consulting Group

On Tue, 16 Dec 2003, Helge Hafting wrote:

> jw schultz wrote:
> 
> > No Linux [R]AID improves sequential performance.  How would
> > reading 65KB from two disks in alternation be faster than
> > reading continuously from one disk?
> > 
> Raid-0 is ideally N times faster than a single disk, when
> you have N disks.  Because you can read continuously from N
> disks instead of from 1, thereby N-doubling the bandwidth.
> 
> Whether the current drivers manage that is of course another story.
> 
> Helge Hafting
> 
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-16 16:42     ` Linus Torvalds
@ 2003-12-16 20:58       ` Mike Fedyk
  2003-12-16 21:11         ` Linus Torvalds
  2003-12-17 19:22       ` Jamie Lokier
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Mike Fedyk @ 2003-12-16 20:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Helge Hafting, jw schultz, Linux Kernel Mailing List

On Tue, Dec 16, 2003 at 08:42:52AM -0800, Linus Torvalds wrote:
> My personal guess is that modern RAID0 stripes should be on the order of
> several MEGABYTES in size rather than the few hundred kB that most people
> use (not to mention the people who have 32kB stripes or smaller - they
> just kill their IO access patterns with that, and put the CPU at
> ridiculous strain).

Larger stripes may help in general, but I'd suggest that for raid5 (ie, not
raid0) the stripe size should not be enlarged as much.  On many
filesystems, a bitmap change or inode table update shouldn't require
reading a large stripe from several drives to complete the parity
calculations.

Probably finding the largest block of data the drive can return in one
command would be a good size for a raid5 stripe.  That's just speculation
though.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-16 20:51     ` Andre Hedrick
@ 2003-12-16 21:04       ` Andre Hedrick
  2003-12-16 21:46         ` Witold Krecicki
  0 siblings, 1 reply; 27+ messages in thread
From: Andre Hedrick @ 2003-12-16 21:04 UTC (permalink / raw)
  To: Helge Hafting; +Cc: jw schultz, Linux Kernel Mailing List



Size is MB, BlkSz is Bytes; Read, Write, and Seeks are MB/sec.

         File   Block  Num  Seq Read    Rand Read   Seq Write  Rand Write
  Dir    Size   Size   Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
------- ------ ------- --- ----------- ----------- ----------- -----------
   .     2018   4096    1  244.7 53.6% 8.847 2.26% 88.04 41.7% 5.594 7.16%
   .     2018   4096    2  281.6 65.6% 11.04 3.53% 89.86 69.5% 6.645 9.78%
   .     2018   4096    4  235.6 64.3% 13.91 4.45% 88.26 96.2% 7.647 12.7%
   .     2018   4096    8  231.2 68.0% 15.91 5.34% 85.59 105.% 7.557 10.3%

Two channels, six drives, three per channel.

U320 with 15K U160 drives in a software RAID 0.

Andre Hedrick
LAD Storage Consulting Group

On Tue, 16 Dec 2003, Andre Hedrick wrote:

> 
> Helge,
> 
> Reads you may gain on writes only if all devices are on single ended mode.
> Both pATA and pSCSI suck wind in writes, pSCSI should smoke pATA on reads.
> It is all a matter of the physical protocol on the wire.
> 
> Only in SATA/SAS will you even reach close to ideal world.
> 
> Cheers,
> 
> Andre Hedrick
> LAD Storage Consulting Group
> 
> On Tue, 16 Dec 2003, Helge Hafting wrote:
> 
> > jw schultz wrote:
> > 
> > > No Linux [R]AID improves sequential performance.  How would
> > > reading 65KB from two disks in alternation be faster than
> > > reading continuously from one disk?
> > > 
> > Raid-0 is ideally N times faster than a single disk, when
> > you have N disks.  Because you can read continuously from N
> > disks instead of from 1, thereby N-doubling the bandwidth.
> > 
> > Whether the current drivers manage that is of course another story.
> > 
> > Helge Hafting
> > 
> > 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-16 20:58       ` Mike Fedyk
@ 2003-12-16 21:11         ` Linus Torvalds
  2003-12-17 10:53           ` Jörn Engel
  2003-12-17 11:39           ` Peter Zaitsev
  0 siblings, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2003-12-16 21:11 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: Helge Hafting, jw schultz, Linux Kernel Mailing List



On Tue, 16 Dec 2003, Mike Fedyk wrote:
>
> On Tue, Dec 16, 2003 at 08:42:52AM -0800, Linus Torvalds wrote:
> > My personal guess is that modern RAID0 stripes should be on the order of
> > several MEGABYTES in size rather than the few hundred kB that most people
> > use (not to mention the people who have 32kB stripes or smaller - they
> > just kill their IO access patterns with that, and put the CPU at
> > ridiculous strain).
>
> Larger stripes may help in general, but I'd suggest that for raid5 (ie, not
> raid0), the stripe size should not be enlarged as much.  On many
> filesystems, a bitmap change, or inode table update shouldn't require
> > reading a large stripe from several drives to complete the parity
> calculations.

Oh, absolutely. I only made the argument as it works for RAID0, ie just
striping.  There the only downside of a large stripe is the potential for
a lack of parallelism, but as mentioned, I don't think that downside much
exists with modern disks - the platter density and throughput (once you've
seeked to the right place) are so high that there is no point to try to
parallelise it at the data transfer point.

The thing you should try to do in parallel is the seeking, not the media
throughput. And then small stripes hurt you, because they will end up
seeking in sync.

For RAID5, you have different issues since the error correction makes
updates be read-modify-write. At that point there are latency reasons to
make the blocking be small.

			Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-16  4:01 ` jw schultz
  2003-12-16 14:51   ` Helge Hafting
  2003-12-16 20:09   ` Witold Krecicki
@ 2003-12-16 21:11   ` Adam Kropelin
  2 siblings, 0 replies; 27+ messages in thread
From: Adam Kropelin @ 2003-12-16 21:11 UTC (permalink / raw)
  To: jw schultz, Linux Kernel Mailing List

On Mon, Dec 15, 2003 at 08:01:56PM -0800, jw schultz wrote:
> On Mon, Dec 15, 2003 at 02:34:54PM +0100, Witold Krecicki wrote:
> > Those are results of hdparm -tT on drives:
> > <cite>
> > /dev/md/1:
> >  Timing buffer-cache reads:   128 MB in  0.40 seconds =323.28 MB/sec
> >  Timing buffered disk reads:  64 MB in  1.75 seconds = 36.47 MB/sec
> > /dev/sda:
> >  Timing buffer-cache reads:   128 MB in  0.41 seconds =309.23 MB/sec
> >  Timing buffered disk reads:  64 MB in  1.46 seconds = 43.87 MB/sec
> > /dev/sdb:
> >  Timing buffer-cache reads:   128 MB in  0.41 seconds =315.32 MB/sec
> >  Timing buffered disk reads:  64 MB in  1.23 seconds = 52.04 MB/sec
> > </cite>
> 
> No Linux [R]AID improves sequential performance.  How would
> reading 65KB from two disks in alternation be faster than
> reading continuously from one disk?

Never say never:

/dev/sda:
 Timing buffer-cache reads:   128 MB in  3.34 seconds = 38.38 MB/sec
 Timing buffered disk reads:  64 MB in  8.60 seconds =  7.44 MB/sec
/dev/sdb:
 Timing buffer-cache reads:   128 MB in  3.40 seconds = 37.64 MB/sec
 Timing buffered disk reads:  64 MB in  8.60 seconds =  7.44 MB/sec

<...plus four more just like them...>

/dev/md0:
 Timing buffer-cache reads:   128 MB in  3.35 seconds = 38.17 MB/sec
 Timing buffered disk reads:  64 MB in  4.05 seconds = 15.79 MB/sec

md0 is a simple RAID0 of all six disks.

Yes, these disks are dirt slow to begin with (Andrew Morton once
mentioned he had pencils that wrote faster than my disks) but apparently
md manages to get some parallelism going, even for sequential reads.

(This is 2.6.0-test11-bk8, FWIW.)

--Adam


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-15 13:34 raid0 slower than devices it is assembled of? Witold Krecicki
  2003-12-15 15:44 ` Witold Krecicki
  2003-12-16  4:01 ` jw schultz
@ 2003-12-16 21:25 ` jw schultz
  2 siblings, 0 replies; 27+ messages in thread
From: jw schultz @ 2003-12-16 21:25 UTC (permalink / raw)
  To: Linux Kernel Mailing List

On Mon, Dec 15, 2003 at 02:34:54PM +0100, Witold Krecicki wrote:
> I've got / on linux-raid0 on 2.6.0-t11-cset-20031209_2107:
> <cite>
> /dev/md/1:
>         Version : 00.90.01
>   Creation Time : Thu Sep 11 22:04:54 2003
>      Raid Level : raid0
>      Array Size : 232315776 (221.55 GiB 237.89 GB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 1
>     Persistence : Superblock is persistent
> 
>     Update Time : Mon Dec 15 12:55:48 2003
>           State : clean, no-errors
>  Active Devices : 2
> Working Devices : 2
>  Failed Devices : 0
>   Spare Devices : 0
> 
>      Chunk Size : 64K
> 
[snip]

> Disks are two ST3120026AS connected to sii3112a controller, driven by sata_sil 
> 'patched' so no limit for block size is applied (it's not needed for it). 
> 
> Those are results of hdparm -tT on drives:
> <cite>
> /dev/md/1:
>  Timing buffer-cache reads:   128 MB in  0.40 seconds =323.28 MB/sec
>  Timing buffered disk reads:  64 MB in  1.75 seconds = 36.47 MB/sec
> /dev/sda:
>  Timing buffer-cache reads:   128 MB in  0.41 seconds =309.23 MB/sec
>  Timing buffered disk reads:  64 MB in  1.46 seconds = 43.87 MB/sec
> /dev/sdb:
>  Timing buffer-cache reads:   128 MB in  0.41 seconds =315.32 MB/sec
>  Timing buffered disk reads:  64 MB in  1.23 seconds = 52.04 MB/sec
> </cite>
> What seems strange to me is that second drive is faster than first one 
> (devices are symmetrical, sd[a,b]2 is swapspace (not mounted at time of 
> test), sd[a,b]1 is /boot (raid1)).

Possible reasons:

	internal differences on controller

	block remapping (even new disks have bad blocks)

	different firmware

	different physical geometry -- two production runs of
	the same make+model drive may have different
	geometry

	cable quality or routing differences, or interface
	variations that cause subtle timing differences


> What is even stranger is that raid0 which should be faster than single drive, 
> is pretty much slower- what's the reason of that?

You could try increasing the read ahead but that may slow
things down in real world use.

AID-0 isn't RAID (no R), but then again for many arrays the
I is also out of place.

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw@pegasys.ws

		Remember Cernan and Schmitt

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-16 21:04       ` Andre Hedrick
@ 2003-12-16 21:46         ` Witold Krecicki
  0 siblings, 0 replies; 27+ messages in thread
From: Witold Krecicki @ 2003-12-16 21:46 UTC (permalink / raw)
  To: Linux Kernel Mailing List

On Tuesday, 16 December 2003, at 22:04, Andre Hedrick wrote:
So, could you take a look at http://www.kernel.pl/~adasi/odd-tio.txt ?
It seems that the array is faster on writes than on reads - I was told that this 
is not normal - OR I'm reading those results wrong.
-- 
Witold Kręcicki (adasi) adasi [at] culm.net
GPG key: 7AE20871
http://www.culm.net

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-16 21:11         ` Linus Torvalds
@ 2003-12-17 10:53           ` Jörn Engel
  2003-12-17 11:39           ` Peter Zaitsev
  1 sibling, 0 replies; 27+ messages in thread
From: Jörn Engel @ 2003-12-17 10:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mike Fedyk, Helge Hafting, jw schultz, Linux Kernel Mailing List

On Tue, 16 December 2003 13:11:25 -0800, Linus Torvalds wrote:
> On Tue, 16 Dec 2003, Mike Fedyk wrote:
> >
> > Larger stripes may help in general, but I'd suggest that for raid5 (ie, not
> > raid0), the stripe size should not be enlarged as much.  On many
> > filesystems, a bitmap change, or inode table update shouldn't require
> > reading a large stripe from several drives to complete the parity
> > calculations.
> 
> Oh, absolutely. I only made the argument as it works for RAID0, ie just
> striping.  There the only downside of a large stripe is the potential for
> a lack of parallelism, but as mentioned, I don't think that downside much
> exists with modern disks - the platter density and throughput (once you've
> seeked to the right place) are so high that there is no point to try to
> parallelise it at the data transfer point.
> 
> The thing you should try to do in parallel is the seeking, not the media
> throughput. And then small stripes hurt you, because they will end up
> seeking in sync.
> 
> For RAID5, you have different issues since the error correction makes
> updates be read-modify-write. At that point there are latency reasons to
> make the blocking be small.

Maybe I don't get it, but shouldn't large stripes help RAID5 as well?
For any write, you do the rmw stuff, so you have two seeks on two
drives, one of which is the parity one.  For RAID0, the same access
would result in one seek on one drive, but no fundamental difference.

With more than three drives, you should be able to parallelize the
seeks on RAID5 as well, shouldn't you?  So the same reasoning wrt long
stripes should apply, unless I'm stupid again.

Jörn

-- 
The competent programmer is fully aware of the strictly limited size of
his own skull; therefore he approaches the programming task in full
humility, and among other things he avoids clever tricks like the plague. 
-- Edsger W. Dijkstra

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-16 21:11         ` Linus Torvalds
  2003-12-17 10:53           ` Jörn Engel
@ 2003-12-17 11:39           ` Peter Zaitsev
  2003-12-17 16:01             ` Linus Torvalds
  2003-12-17 17:02             ` bill davidsen
  1 sibling, 2 replies; 27+ messages in thread
From: Peter Zaitsev @ 2003-12-17 11:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mike Fedyk, Helge Hafting, jw schultz, Linux Kernel Mailing List

On Wed, 2003-12-17 at 00:11, Linus Torvalds wrote:

> >
> > Larger stripes may help in general, but I'd suggest that for raid5 (ie, not
> > raid0), the stripe size should not be enlarged as much.  On many
> > filesystems, a bitmap change, or inode table update shouldn't require
> > reading a large stripe from several drives to complete the parity
> > calculations.
> 
> Oh, absolutely. I only made the argument as it works for RAID0, ie just
> striping.  There the only downside of a large stripe is the potential for
> a lack of parallelism, but as mentioned, I don't think that downside much
> exists with modern disks - the platter density and throughput (once you've
> seeked to the right place) are so high that there is no point to try to
> parallelise it at the data transfer point.

I'm pretty curious about this argument.

In practice, since RAID5 uses XOR for the checksum computation, you do not
have to read the whole stripe to recompute the checksum.

If you have, let's say, a 1MB stripe but modify just a few bytes somewhere,
there is no reason why you can't read, say, the 4KB blocks from 2
devices and write the updated 4KB blocks back.
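A minimal sketch of the XOR update described above, with made-up buffer contents rather than anything read from md: to rewrite one block you only need the old data block and the old parity block, because new parity = old parity XOR old data XOR new data.

/* Sketch (not md code): raid5 small-write parity update. Only the old
 * data and old parity are needed, block by block, so the read-modify-
 * write touches two members regardless of stripe size. */
#include <stdio.h>
#include <string.h>

#define BLOCK 4096

static void xor_block(unsigned char *dst, const unsigned char *src)
{
    for (size_t i = 0; i < BLOCK; i++)
        dst[i] ^= src[i];
}

int main(void)
{
    static unsigned char old_data[BLOCK], new_data[BLOCK], parity[BLOCK];

    memset(old_data, 0xAA, BLOCK);    /* pretend on-disk data block  */
    memset(new_data, 0x5A, BLOCK);    /* block being rewritten       */
    memset(parity,   0x33, BLOCK);    /* pretend on-disk parity      */

    xor_block(parity, old_data);      /* strip the old data's contribution */
    xor_block(parity, new_data);      /* fold in the new data              */

    printf("updated parity byte: 0x%02x\n", parity[0]);  /* 0x33^0xAA^0x5A = 0xc3 */
    return 0;
}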

The problem is that some (many?) RAID controllers have a cache line
equal to the stripe size, so they work with whole stripes only. Some (at
least Mylex) however have separate settings for cache line size and
stripe size.

How does the Linux software RAID5 implementation handle this?


One more issue with smaller stripes, both for RAID5 and RAID0 (at least
for DBMS workloads): you normally want multi-block IO (ie fetching
many sequentially located pages) to be close in cost to reading a single
page, which is true for a single hard drive. However, with a small stripe
size you will hit many of the underlying devices, putting excessive,
unnecessary load on them.

I was also wondering: is there any way in Linux to make sure files are
aligned to the stripe size?  When performing IO in some particular page
size, you would not like those requests to land on a stripe border,
touching two devices instead of one.
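A small sketch of that boundary condition, with illustrative sizes (the 64K chunk from the array at the top of the thread, 16K pages): an I/O touches a second member exactly when it runs past the end of the chunk it starts in.

/* Sketch: does an I/O starting at a given raid0 offset stay inside one
 * chunk (one member disk) or spill onto the next member? */
#include <stdio.h>
#include <stdbool.h>

static bool crosses_chunk(unsigned long long off, unsigned long long size,
                          unsigned long long chunk)
{
    return (off % chunk) + size > chunk;
}

int main(void)
{
    unsigned long long chunk = 64 * 1024;

    /* A 16K page at a 16K-aligned offset always fits in a 64K chunk;
     * the same page at an odd offset can straddle two members. */
    printf("16K at 48K: %s\n",
           crosses_chunk(48 * 1024, 16 * 1024, chunk) ? "two disks" : "one disk");
    printf("16K at 56K: %s\n",
           crosses_chunk(56 * 1024, 16 * 1024, chunk) ? "two disks" : "one disk");
    return 0;
}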



-- 
Peter Zaitsev, Full-Time Developer
MySQL AB, www.mysql.com

Are you MySQL certified?  www.mysql.com/certification


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-17 11:39           ` Peter Zaitsev
@ 2003-12-17 16:01             ` Linus Torvalds
  2003-12-17 18:37               ` Mike Fedyk
  2003-12-17 21:55               ` bill davidsen
  2003-12-17 17:02             ` bill davidsen
  1 sibling, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2003-12-17 16:01 UTC (permalink / raw)
  To: Peter Zaitsev
  Cc: Mike Fedyk, Helge Hafting, jw schultz, Linux Kernel Mailing List



On Wed, 17 Dec 2003, Peter Zaitsev wrote:
> 
> I'm pretty curious about this argument,
> 
> Practically as RAID5 uses XOR for checksum computation you do not have
> to read the whole stripe to recompute the checksum.

Ahh, good point. Ignore my argument - large stripes should work well. Mea 
culpa, I forgot how simple the parity thing is, and that it is "local".

However, since seeking will be limited by the checksum drive anyway (for 
writing), the advantages of large stripes in trying to keep the disks 
independent aren't as one-sided. 

		Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-17 11:39           ` Peter Zaitsev
  2003-12-17 16:01             ` Linus Torvalds
@ 2003-12-17 17:02             ` bill davidsen
  2003-12-17 20:14               ` Peter Zaitsev
  1 sibling, 1 reply; 27+ messages in thread
From: bill davidsen @ 2003-12-17 17:02 UTC (permalink / raw)
  To: linux-kernel

In article <1071657159.2155.76.camel@abyss.local>,
Peter Zaitsev  <peter@mysql.com> wrote:

| One more issue with smaller stripes both for RAID5 and RAID0 (at least
| for DBMS workloads) is - you normally want multi-block IO (ie fetching
| many sequentially located pages) to be close in cost to reading single
| page, which is true for single hard drive. However with small stripe
| size you will hit many of underlying devices  putting excessive not
| necessary load. 

All this depends on what you're trying to optimize and the speed of the
drives. I spent several years running on software raid and got to look
harder than I wanted at the tuning.

If the read size is large enough for transfer time to matter, not hidden
in the latency, adjusting the stripe size so that you use many drives is
a win. You want to avoid having a user i/o generate more than one i/o
per drive if you can, which can lead to large stripe sizes.

Also, the read to write ratio is important. RAID-5 does poorly with
write, since the CRC needs to be recalculated and written each time. On
read, unless you are in fallback mode, you just read the data and the
performance is similar to RAID-0.

If you have (a) a high read to write load, and (b) a very heavy read
load, then RAID-1 works better, possibly with more than two copies of
the data to reduce head motion contention.
-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-17 16:01             ` Linus Torvalds
@ 2003-12-17 18:37               ` Mike Fedyk
  2003-12-17 21:55               ` bill davidsen
  1 sibling, 0 replies; 27+ messages in thread
From: Mike Fedyk @ 2003-12-17 18:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zaitsev, Helge Hafting, jw schultz, Linux Kernel Mailing List

On Wed, Dec 17, 2003 at 08:01:13AM -0800, Linus Torvalds wrote:
> However, since seeking will be limited by the checksum drive anyway (for 
> writing), the advantages of large stripes in trying to keep the disks 
> independent aren't as one-sided. 

It seems to me that you are referring to a single parity drive in the
array.  That is raid4, where only one drive holds the parity data for the
entire array regardless of array size.

Raid5 staggers the parity data (md has four different staggering layouts)
so that the data and its parity are on two different drives, and which
drive holds the parity varies based on stripe and layout.
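A sketch of the difference described above, using the classic 'left' parity rotation as an illustration (md offers four layout variants, so take the exact formula as an assumption rather than md's default): raid4 pins parity to one member, raid5 walks it around the members stripe by stripe.

/* Sketch: which member holds the parity for stripe N. */
#include <stdio.h>

static int raid4_parity(int stripe, int ndisks)
{
    (void)stripe;
    return ndisks - 1;                        /* dedicated parity disk */
}

static int raid5_left_parity(int stripe, int ndisks)
{
    return (ndisks - 1) - (stripe % ndisks);  /* rotates 3,2,1,0,3,... */
}

int main(void)
{
    int ndisks = 4;

    for (int stripe = 0; stripe < 8; stripe++)
        printf("stripe %d: raid4 parity on disk %d, raid5 parity on disk %d\n",
               stripe, raid4_parity(stripe, ndisks),
               raid5_left_parity(stripe, ndisks));
    return 0;
}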

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-16 16:42     ` Linus Torvalds
  2003-12-16 20:58       ` Mike Fedyk
@ 2003-12-17 19:22       ` Jamie Lokier
  2003-12-17 19:40         ` Linus Torvalds
  2003-12-18  2:47         ` jw schultz
  2003-12-17 22:29       ` bill davidsen
  2004-01-08  4:54       ` Greg Stark
  3 siblings, 2 replies; 27+ messages in thread
From: Jamie Lokier @ 2003-12-17 19:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Helge Hafting, jw schultz, Linux Kernel Mailing List

Linus Torvalds wrote:
> My personal guess is that modern RAID0 stripes should be on the order of
> several MEGABYTES in size rather than the few hundred kB that most people
> use (not to mention the people who have 32kB stripes or smaller - they
> just kill their IO access patterns with that, and put the CPU at
> ridiculous strain).

If a large fs-level I/O transaction is split into lots of 32k
transactions by the RAID layer, many of those 32k transactions will be
contiguous on the disks.

That doesn't mean they're contiguous from the fs point of view, but
given that all modern hardware does scatter-gather, shouldn't the
contiguous transactions be merged before being sent to the disk?

It may strain the CPU (splitting and merging in a different order lots
of requests), but I don't see why it should kill I/O access patterns,
as they can be as large as if you had large stripes in the first place.

-- Jamie

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-17 19:22       ` Jamie Lokier
@ 2003-12-17 19:40         ` Linus Torvalds
  2003-12-17 22:36           ` bill davidsen
  2003-12-18  2:47         ` jw schultz
  1 sibling, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2003-12-17 19:40 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Helge Hafting, jw schultz, Linux Kernel Mailing List



On Wed, 17 Dec 2003, Jamie Lokier wrote:
> 
> If a large fs-level I/O transaction is split into lots of 32k
> transactions by the RAID layer, many of those 32k transactions will be
> contiguous on the disks.

Yes.

> That doesn't mean they're contiguous from the fs point of view, but
> given that all modern hardware does scatter-gather, shouldn't the
> contiguous transactions be merged before being sent to the disk?

Yes, as long as the RAID layer (or lowlevel disk) doesn't try to avoid the 
elevator.

BUT (and this is a big but) - apart from wasting a lot of CPU time by
splitting and re-merging, the problem is more fundamental than that.

Let's say that you are striping four disks, with 32kB blocking. Not 
an unreasonable setup.

Now, let's say that the contiguous block IO from high up is 256kB in size. 
Again, this is not unreasonable, although it is actually larger than a lot 
of IO actually is (it is smaller than _some_ IO patterns, but on the whole 
I'm willing to bet that it's in the "high 1%" of the IO done).

Now, we can split that up in 32kB blocks (8 of them), and then merge it
back into 4 64kB blocks sent to disk. We can even avoid a lot of the CPU
overhead by not merging in the first place (and I think we largely do,
actually), and just generate 4 64kB requests in the first place.

But did you notice something?

In one scenario the disk got a 256kB request; in the other it got 64kB 
requests.

And guess what? The bigger request is likely to be more efficient.  
Normal disks these days have 8MB+ of cache on the disk, and do partial 
track buffering etc, and the bigger the requests are, the better.

> It may strain the CPU (splitting and merging in a different order lots
> of requests), but I don't see why it should kill I/O access patterns,
> as they can be as large as if you had large stripes in the first place.

But you _did_ kill the IO access patterns. You started out with a 256kB 
IO, and you ended up splitting it in four. You lose.

The thing is, in real life you do NOT have "infinite IO blocks" to start 
with. If that were true, splitting it up across the disks wouldn't cost 
you anything: infinite divided by four is still infinite. But in real life 
you have something that is already of a finite length and a few hundred kB 
is "good" in most normal loads - and splitting it in four is a BAD IDEA!

In contrast, imagine that you had a 1MB stripe. Most of the time the 256kB 
request wouldn't be split at all, and even in the worst case it would get 
split into just 2 requests.
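A short sketch of that arithmetic: carve a single 256kB request across four raid0 members for a 32kB chunk and for a 1MB chunk, and count what each member actually sees (the offsets are chosen purely for illustration).

/* Sketch: per-member share of one 256kB request on a 4-disk raid0,
 * for a 32kB chunk versus a 1MB chunk. */
#include <stdio.h>

#define NDISKS 4

static void split(unsigned long long off, unsigned long long len,
                  unsigned long long chunk)
{
    unsigned long long per_disk[NDISKS] = { 0 };
    unsigned long long start = off;

    while (len > 0) {
        unsigned long long chunk_nr = off / chunk;
        unsigned long long room  = chunk - off % chunk;   /* left in this chunk */
        unsigned long long piece = len < room ? len : room;

        per_disk[chunk_nr % NDISKS] += piece;
        off += piece;
        len -= piece;
    }
    printf("chunk %5lluK, request at %4lluK:", chunk / 1024, start / 1024);
    for (int d = 0; d < NDISKS; d++)
        printf("  disk%d=%3lluK", d, per_disk[d] / 1024);
    printf("\n");
}

int main(void)
{
    split(0, 256 * 1024, 32 * 1024);            /* 32K chunks: 64K to every disk  */
    split(0, 256 * 1024, 1024 * 1024);          /* 1MB chunk: stays on one disk   */
    split(960 * 1024, 256 * 1024, 1024 * 1024); /* 1MB chunk, worst case: 2 disks */
    return 0;
}

This reproduces the numbers above: 32kB chunks turn one 256kB access into four 64kB pieces, while a 1MB chunk leaves it whole except near a chunk boundary, where it splits in two.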

Yes, there are some loads where you can get largely "infinite" request 
sizes. But I'd claim that they are quite rare.

			Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-17 17:02             ` bill davidsen
@ 2003-12-17 20:14               ` Peter Zaitsev
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Zaitsev @ 2003-12-17 20:14 UTC (permalink / raw)
  To: bill davidsen; +Cc: linux-kernel

On Wed, 2003-12-17 at 20:02, bill davidsen wrote:
> In article <1071657159.2155.76.camel@abyss.local>,
> Peter Zaitsev  <peter@mysql.com> wrote:
> 
> | One more issue with smaller stripes both for RAID5 and RAID0 (at least
> | for DBMS workloads) is - you normally want multi-block IO (ie fetching
> | many sequentially located pages) to be close in cost to reading single
> | page, which is true for single hard drive. However with small stripe
> | size you will hit many of underlying devices  putting excessive not
> | necessary load. 
> 
> All this depends on what you're trying to optimize and the speed of the
> drives. I spent several years running on software raid and got to look
> harder than I wanted at the tuning.

Well, I'm obviously interested mainly in database workloads, both OLTP
(which is mainly random IO from many clients) and multi-user
concurrent "scans", which are typical for some OLAP applications.
Yes, of course the speed of the drive matters here. However, even lower-end
IDE drives can do some 40MB/sec now, which means (given the roughly 100
random requests/sec a drive can do) you can transfer 400KB in about the
same time as it takes to do a single random IO request.
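A quick check of that arithmetic, using the figures assumed above (40MB/sec streaming, roughly 100 random requests per second) rather than any measurement:

/* Sketch: sequential data moved in one average random-I/O time on the
 * drive described above. */
#include <stdio.h>

int main(void)
{
    double stream_mb_s  = 40.0;     /* assumed sequential rate, MB/sec    */
    double random_per_s = 100.0;    /* assumed random requests per second */

    double seek_s = 1.0 / random_per_s;                 /* ~10 ms */
    double bytes  = stream_mb_s * 1e6 * seek_s;

    printf("one random-I/O time is worth ~%.0f KB of sequential transfer\n",
           bytes / 1000.0);                             /* ~400 KB */
    return 0;
}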

> 
> If the read size is large enough for transfer time to matter, not hidden
> in the latency, adjusting the stripe size so that you use many drives is
> a win. You want to avoid having a user i/o generate more than one i/o
> per drive if you can, which can lead to large stripe sizes.

Yes, this is true.  However, following the same logic as below, the transfer
time starts to matter (compared to the seek time you needed to start it)
once 400KB+ is transferred from a single drive. Which means there is not
much sense in stripe sizes of less than 256-512KB if you're looking at a
single scan. If you have many concurrent scans you might even wish to have
larger blocks.


> 
> Also, the read to write ratio is important. RAID-5 does poorly with
> write, since the CRC needs to be recalculated and written each time. On
> read, unless you are in fallback mode, you just read the data and the
> performance is similar to RAID-0.

Yes, sure. Reads are sort of trivial unless you're running in degraded
mode. I was just wondering how write handling is implemented in the Linux
kernel.

RAID5 write speed is quite sensitive to the cache size.  Which cache does
Linux software RAID5 use (if any) for write optimization?

> 
> If you have (a) a high read to write load, and (b) a very heavy read
> load, then RAID-1 works better, possibly with more than two copies of
> the data to reduce head motion contention.

Yes. That is actually an interesting question. Let's take a look at the
read-only cases. This is obvious if you have few concurrent clients (or,
really, few concurrent IO requests): with RAID0 there is a chance of the
IO being unbalanced across the devices, while with RAID1 we have full
copies so we can always balance reads as we need.

However, with a growing number of concurrent clients the probability of
uneven device load decreases.  If I remember correctly, with 100 concurrent
clients I had quite similar performance from RAID0 and RAID1.

Yes, there is another risk with RAID0 - boundary reads, which would require
2 reads instead of one. I, however, used perfectly aligned reads in my
test, so that could not happen :)


-- 
Peter Zaitsev, Full-Time Developer
MySQL AB, www.mysql.com

Are you MySQL certified?  www.mysql.com/certification


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-17 16:01             ` Linus Torvalds
  2003-12-17 18:37               ` Mike Fedyk
@ 2003-12-17 21:55               ` bill davidsen
  1 sibling, 0 replies; 27+ messages in thread
From: bill davidsen @ 2003-12-17 21:55 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.4.58.0312170758220.8541@home.osdl.org>,
Linus Torvalds  <torvalds@osdl.org> wrote:
| 
| 
| On Wed, 17 Dec 2003, Peter Zaitsev wrote:
| > 
| > I'm pretty curious about this argument,
| > 
| > Practically as RAID5 uses XOR for checksum computation you do not have
| > to read the whole stripe to recompute the checksum.
| 
| Ahh, good point. Ignore my argument - large stripes should work well. Mea 
| culpa, I forgot how simple the parity thing is, and that it is "local".
| 
| However, since seeking will be limited by the checksum drive anyway (for 
| writing), the advantages of large stripes in trying to keep the disks 
| independent aren't as one-sided. 

There is no "the" parity drive, remember the RAID-5 parity is
distributed. A write takes two seeks, a read, a data write, and a parity
write, but the parity isn't a bottleneck, and as noted above the size
only need be the blocks containing the modified data.
-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-16 16:42     ` Linus Torvalds
  2003-12-16 20:58       ` Mike Fedyk
  2003-12-17 19:22       ` Jamie Lokier
@ 2003-12-17 22:29       ` bill davidsen
  2003-12-18  2:18         ` jw schultz
  2004-01-08  4:54       ` Greg Stark
  3 siblings, 1 reply; 27+ messages in thread
From: bill davidsen @ 2003-12-17 22:29 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.4.58.0312160825570.1599@home.osdl.org>,
Linus Torvalds  <torvalds@osdl.org> wrote:
| 
| 
| On Tue, 16 Dec 2003, Helge Hafting wrote:
| >
| > Raid-0 is ideally N times faster than a single disk, when
| > you have N disks.
| 
| Well, that's a _really_ "ideal" world. Ideal to the point of being
| unrealistic.
| 
| In most real-world situations, latency is at least as important as
| throughput, and often dominates the story. At which point RAID-0 doesn't
| improve performance one iota (it might make the seeks shorter, but since
| seek latency tends to be dominated by things like rotational delay and
| settle times, that's unlikely to be a really noticeable issue).

Don't forget time in o/s queues; once an array gets loaded, that may
dominate the mechanical latency and transfer times. If you call "access
time" the sum of all latency between syscall and the first data
transfer, then reading from multiple drives doesn't reliably help until
you get the transfer time from an i/o somewhere between 2 and 4x the
access time. So if the transfer time for a typical i/o is less than 2x
the typical access time, gains are unlikely. If you set the stripe size
high enough to make it likely that a typical i/o falls on a single drive
you usually win. And when the transfer time reaches 4x the access time,
you almost always win with a split.

So if you are copying 100MB elements you probably win by spreading the
i/o, but for more normal things it doesn't much matter.
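A sketch of that rule of thumb, with round numbers picked for illustration rather than measured in this thread (about 8 ms access time, 50MB/s streaming):

/* Sketch: how large a single I/O must be before its transfer time
 * reaches 2x (or 4x) the access time, i.e. before spreading it over
 * several drives can plausibly win. */
#include <stdio.h>

int main(void)
{
    double access_ms  = 8.0;     /* assumed seek + rotation + settle */
    double stream_mbs = 50.0;    /* assumed sequential rate, MB/sec  */

    for (int factor = 2; factor <= 4; factor += 2) {
        double io_bytes = stream_mbs * 1e6 * (factor * access_ms / 1000.0);
        printf("transfer time = %dx access time at about %.1f MB per I/O\n",
               factor, io_bytes / 1e6);
    }
    /* Roughly 0.8 MB and 1.6 MB here: anything much smaller is dominated
     * by access time, which striping does nothing to hide. */
    return 0;
}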

THERE'S ONE EXCEPTION: if you have a f/s type which puts the inodes at
the beginning of the space, and you are creating and deleting a LOT of
files, with a large stripe you will beat the snot out of one drive and
the system will bottleneck no end. In that one case you gain by using
small stripe size and spreading the head motion, even though the file
i/o itself may really rot. Makes me wish for a f/s which could put the
inodes in some distributed pattern.
| 
| Latency is noticeable even on what appears to be "prue throughput" tests,
| because not only do you seldom get perfect overlap (RAID-0 also increases
| your required IO window size by a factor of N to get the N-time
| improvement), but even "pure throughput" benchmarks often have small
| serialized sections, and Amdahls law bites you in the ass _really_
| quickly.
| 
| In fact, Amdahls law should be revered a hell of a lot more than Moore's
| law. One is a conjecture, the other one is simple math.
| 
| Anyway, the serialized sections can be CPU or bus (quite common at the
| point where a single disk can stream 50MB/s when accessed linearly), or it
| can be things like fetching meta-data (ie indirect blocks).
| 
| > Wether the current drivers manages that is of course another story.
| 
| No. Please don't confuse limitations of RAID0 with limitations of "the
| current drivers".
| 
| Yes, the drivers are a part of the picture, but they are a _small_ part of
| a very fundamental issue.
| 
| The fact is, modern disks are GOOD at streaming data. They're _really_
| good at it compared to just about anything else they ever do. The win you
| get from even medium-sized stripes on RAID0 are likely to not be all that
| noticeable, and you can definitely lose _big_ just because it tends to
| hack your IO patterns to pieces.
| 
| My personal guess is that modern RAID0 stripes should be on the order of
| several MEGABYTES in size rather than the few hundred kB that most people
| use (not to mention the people who have 32kB stripes or smaller - they
| just kill their IO access patterns with that, and put the CPU at
| ridiculous strain).

Yeah, yeah, what you said... all true.
| 
| Big stripes help because:
| 
|  - disks already do big transfers well, so you shouldn't split them up.
|    Quite frankly, the kinds of access patterns that let you stream
|    multiple streams of 50MB/s and get N-way throughput increases just
|    don't exist in the real world outside of some very special niches (DoD
|    satellite data backup, or whatever).
| 
|  - it makes it more likely that the disks in the array really have
|    _independent_ IO patterns, ie if you access multiple files the disks
|    may not seek around together, but instead one disk accesses one file.
|    At this point RAID0 starts to potentially help _latency_, simply
|    because by now it may help avoid physical seeking rather than just try
|    to make throughput go up.
| 
| I may be wrong, of course. But I doubt it.

Not that I can see, but I'm not sure that you had thought that the
inodes and the data may have very different usage patterns.

About six years ago most usenet software used one file per article to
hold data. Creating and deleting 10-20 files/sec put a severe load on
the directory and journal drives. Eventually (with AIX and JFS) I put
the journal file on a solid state drive to get performance up.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-17 19:40         ` Linus Torvalds
@ 2003-12-17 22:36           ` bill davidsen
  0 siblings, 0 replies; 27+ messages in thread
From: bill davidsen @ 2003-12-17 22:36 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.4.58.0312171129040.8541@home.osdl.org>,
Linus Torvalds  <torvalds@osdl.org> wrote:

| Let's say that you are striping four disks, with 32kB blocking. Not 
| an unreasonable setup.

Let me drop one of my pet complaints here, that the install programs of
many (most? all?) commercial releases don't give you a stripe size menu
to let the user make a decision based on intended use. Instead the
program uses the "one size fits all" approach and picks a size. As you
say here it's not unreasonable in terms of being typical, but for most
people it sucks for performance. As you noted elsewhere big stripes are
almost always better, and a default of 256k or so would work better for
most people.

Sorry, related flamage, but your comments welcome, since this does
affect the perception of performance of the o/s.
-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-17 22:29       ` bill davidsen
@ 2003-12-18  2:18         ` jw schultz
  0 siblings, 0 replies; 27+ messages in thread
From: jw schultz @ 2003-12-18  2:18 UTC (permalink / raw)
  To: linux-kernel

On Wed, Dec 17, 2003 at 10:29:18PM +0000, bill davidsen wrote:
> In article <Pine.LNX.4.58.0312160825570.1599@home.osdl.org>,
> Linus Torvalds  <torvalds@osdl.org> wrote:
> | 
> | 
> | On Tue, 16 Dec 2003, Helge Hafting wrote:
> | >
> | > Raid-0 is ideally N times faster than a single disk, when
> | > you have N disks.
> | 
> | Well, that's a _really_ "ideal" world. Ideal to the point of being
> | unrealistic.
> | 
> | In most real-world situations, latency is at least as important as
> | throughput, and often dominates the story. At which point RAID-0 doesn't
> | improve performance one iota (it might make the seeks shorter, but since
> | seek latency tends to be dominated by things like rotational delay and
> | settle times, that's unlikely to be a really noticeable issue).
> 
> Don't forget time in o/s queues, once an array get loaded that may
> dominate the mechanical latency and transfer times. If you call "access
> time" the sum of all latency between syscall and the first data
> transfer, then reading from multiple drives doesn't reliably help until
> you get the transfer time from an i/o somewhere between 2 and 4x the
> access time. So if the transfer time for a typical i/o is less than 2x
> the typical access time, gains are unlikely. If you set the stripe size
> high enough to make it likely that a typical i/o falls on a single drive
> you usually win. And when the transfer time reaches 4x the access time,
> you almost always win with a split.
> 
> So if you are copying 100MB elements you probably win by spreading the
> i/o, but for more normal things it doesn't much matter.
> 
> THERE'S ONE EXCEPTION: if you have a f/s type which puts the inodes at
> the beginning of the space, and you are creating and deleting a LOT of
> files, with a large stripe you will beat the snot out of one drive and
> the system will bottleneck no end. In that one case you gain by using
> small stripe size and spreading the head motion, even though the file
> i/o itself may really rot. Makes me wish for a f/s which could put the
> inodes in some distributed pattern.

If I recall correctly, ext2, like ufs, splits the inode table
up and puts part of it at the beginning of each cylinder or
block group.  Inode assignment is based on an allocation
rule that spreads inodes across the disk so that the file data
and inode will be near each other.  ext[23] also has

       -R raid-options
              Set  raid-related options for the filesystem.  Raid
              options are comma separated, and may take an argu-
              ment using the equals ('=') sign.  The following
              options are supported:

                   stride=stripe-size
                          Configure the  filesystem  for  a RAID
                          array   with   stripe-size filesystem
                          blocks per stripe.

The purpose of the stride option is so that the inode
table pieces won't all wind up on the same disk, as would
happen if the stripe size aligned with the block group size,
but will instead be staggered.
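A sketch of that alignment, assuming typical mke2fs defaults (4K blocks, 32768 blocks per block group; these values are assumptions, not taken from the thread) on the two-disk, 64K-chunk array from the start of the thread:

/* Sketch: which raid0 member the start of each ext2 block group (and so
 * its inode table area) lands on without any staggering. */
#include <stdio.h>

int main(void)
{
    unsigned long long block_size   = 4096;
    unsigned long long blocks_per_g = 32768;      /* 128MB per block group */
    unsigned long long chunk        = 64 * 1024;  /* md chunk size         */
    int ndisks = 2;

    for (int group = 0; group < 6; group++) {
        unsigned long long off = (unsigned long long)group
                                 * blocks_per_g * block_size;
        printf("block group %d starts on member disk %llu\n",
               group, (off / chunk) % ndisks);
    }
    /* Every group starts on disk 0: 128MB is a whole number of 2 x 64K
     * stripes, which is exactly the imbalance stride= is meant to break. */
    return 0;
}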

My recollection is that one or both of XFS and JFS store
the inode table in extents which are allocated on demand, so
I would hope they also make inode and file data locality a
priority.




-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw@pegasys.ws

		Remember Cernan and Schmitt

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-17 19:22       ` Jamie Lokier
  2003-12-17 19:40         ` Linus Torvalds
@ 2003-12-18  2:47         ` jw schultz
  1 sibling, 0 replies; 27+ messages in thread
From: jw schultz @ 2003-12-18  2:47 UTC (permalink / raw)
  To: Linux Kernel Mailing List

On Wed, Dec 17, 2003 at 07:22:44PM +0000, Jamie Lokier wrote:
> Linus Torvalds wrote:
> > My personal guess is that modern RAID0 stripes should be on the order of
> > several MEGABYTES in size rather than the few hundred kB that most people
> > use (not to mention the people who have 32kB stripes or smaller - they
> > just kill their IO access patterns with that, and put the CPU at
> > ridiculous strain).
> 
> If a large fs-level I/O transaction is split into lots of 32k
> transactions by the RAID layer, many of those 32k transactions will be
> contiguous on the disks.
> 
> That doesn't mean they're contiguous from the fs point of view, but
> given that all modern hardware does scatter-gather, shouldn't the
> contiguous transactions be merged before being sent to the disk?
> 
> It may strain the CPU (splitting and merging in a different order lots
> of requests), but I don't see why it should kill I/O access patterns,
> as they can be as large as if you had large stripes in the first place.

Only now instead of the latency of one disk seeking to
service the request you have the worst case latency of all
the disks.

Years ago i had a SCSI outboard HW RAID-5 array of 5 disks
on two chains.  The controller used a 512 byte chunk so a
stripe was 2KB.  A single 2KB read would flash lights on 4
drives simultaneously.  An aligned 2KB write would calculate
parity without any reads and write to all 5 at once.  Any
I/O 4KB or larger would engage all 5 drives in parallel.
Given that the OS in question had a 2KB page size and the
filesystems had a 2KB block size it worked pretty well.
When I spec'd the array I made sure the stripe size would
align with access -- one drive more or less and the whole
thing would have been a disaster.

At that time the xfer rate of the drives was a fraction of
what it is today, and this setup allowed the array to
saturate the SCSI connection to the host.  Which is
something the drives could not do individually.  However,
disk latency was the worst case of the drives, although since
they ran almost in lock-step it wasn't much longer than
single-drive latency.  This was just one step up from RAID-3.

Today xfer rates are an order of magnitude higher while
latency has not shrunk.  In fact, by reducing platter count
many drives today have worse latency.  I don't think I'd
ever recommend such a small stripe size today; the latency
of handshaking and the overhead of splitting and merging
would outweigh the bandwidth gains in all but a few rare
applications.


-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw@pegasys.ws

		Remember Cernan and Schmitt

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid0 slower than devices it is assembled of?
  2003-12-16 16:42     ` Linus Torvalds
                         ` (2 preceding siblings ...)
  2003-12-17 22:29       ` bill davidsen
@ 2004-01-08  4:54       ` Greg Stark
  3 siblings, 0 replies; 27+ messages in thread
From: Greg Stark @ 2004-01-08  4:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Helge Hafting, jw schultz, Linux Kernel Mailing List


Linus Torvalds <torvalds@osdl.org> writes:

> The fact is, modern disks are GOOD at streaming data. They're _really_
> good at it compared to just about anything else they ever do. The win you
> get from even medium-sized stripes on RAID0 are likely to not be all that
> noticeable, and you can definitely lose _big_ just because it tends to
> hack your IO patterns to pieces.

I'm not sure how you reach this conclusion. 50MB/s may sound like a lot; it's
sure a whole lot more than the 2MB/s I get on this 486 over here. But then the
hard drive I have that gets 50MB/s is also 250G and the one in the 486 is
425M, a factor of 588 difference in size. So as good as the drives are getting
at streaming data, the amount of data we want to stream is going up even
faster.

> My personal guess is that modern RAID0 stripes should be on the order of
> several MEGABYTES in size rather than the few hundred kB that most people
> use (not to mention the people who have 32kB stripes or smaller - they
> just kill their IO access patterns with that, and put the CPU at
> ridiculous strain).

> Big stripes help because:
> 
>  - disks already do big transfers well, so you shouldn't split them up.
>    Quite frankly, the kinds of access patterns that let you stream
>    multiple streams of 50MB/s and get N-way throughput increases just
>    don't exist in the real world outside of some very special niches (DoD
>    satellite data backup, or whatever).

Or just about any moderate sized SQL database. Virtually any large query will
cause what Oracle calls "full table scan"s or what postgres calls a
"sequential scan" precisely because reading sequential data is way faster than
random access. Often a single query will generate several such streams, and
often large on-disk sorts which have sequential access patterns as well.

It seems to me that having a stripe size of several megabytes will defeat the
read-ahead and essentially limit the database to 50MB/s, which, while it seems
like a lot, really isn't fast enough to keep up with the increase in the amount
of data being handled. Even a small database with tables around 1GB will
benefit enormously from being able to stream the data at 100MB/s or 150MB/s.

> I may be wrong, of course. But I doubt it.

Well it should be easy enough to test. It would be quite a radical change in
thinking. All the raidtools documentation suggests starting with 32kb and
experimenting -- largely with smaller stripe sizes. I've certainly never
considered anything much larger. It would be really interesting to know how
even a typical database query ran on raid arrays of varying stripe sizes.

-- 
greg


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread

Thread overview: 27+ messages
2003-12-15 13:34 raid0 slower than devices it is assembled of? Witold Krecicki
2003-12-15 15:44 ` Witold Krecicki
2003-12-16  4:01 ` jw schultz
2003-12-16 14:51   ` Helge Hafting
2003-12-16 16:42     ` Linus Torvalds
2003-12-16 20:58       ` Mike Fedyk
2003-12-16 21:11         ` Linus Torvalds
2003-12-17 10:53           ` Jörn Engel
2003-12-17 11:39           ` Peter Zaitsev
2003-12-17 16:01             ` Linus Torvalds
2003-12-17 18:37               ` Mike Fedyk
2003-12-17 21:55               ` bill davidsen
2003-12-17 17:02             ` bill davidsen
2003-12-17 20:14               ` Peter Zaitsev
2003-12-17 19:22       ` Jamie Lokier
2003-12-17 19:40         ` Linus Torvalds
2003-12-17 22:36           ` bill davidsen
2003-12-18  2:47         ` jw schultz
2003-12-17 22:29       ` bill davidsen
2003-12-18  2:18         ` jw schultz
2004-01-08  4:54       ` Greg Stark
2003-12-16 20:51     ` Andre Hedrick
2003-12-16 21:04       ` Andre Hedrick
2003-12-16 21:46         ` Witold Krecicki
2003-12-16 20:09   ` Witold Krecicki
2003-12-16 21:11   ` Adam Kropelin
2003-12-16 21:25 ` jw schultz
