* Suboptimal raid6 linear read speed
@ 2013-01-15 12:33 Peter Rabbitson
  2013-01-15 12:45 ` Mikael Abrahamsson
  2013-01-15 12:49 ` Phil Turmel
  0 siblings, 2 replies; 46+ messages in thread
From: Peter Rabbitson @ 2013-01-15 12:33 UTC (permalink / raw)
  To: linux-raid

Hello,

Apologies in advance if this question has been answered before - I 
perused the archives to no avail.

I am experiencing slow linear read from a bare raid6 device, while the 
underlying drives, read individually, deliver their full sequential 
throughput. I can't seem to find an explanation for this.

Regular parallel read from members:
===========
Imladris:~# echo 3 > /proc/sys/vm/drop_caches; sleep 3; for d in /dev/sd[abcd] ; do dd if=$d of=/dev/null bs=1M count=2048 & done
[1] 13953
[2] 13954
[3] 13955
[4] 13956
Imladris:~# 2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 17.331 s, 124 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 17.5719 s, 122 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 18.47 s, 116 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 18.6834 s, 115 MB/s
===========

Same sort of read from array itself:
===========
Imladris:~# echo 3 > /proc/sys/vm/drop_caches; sleep 3; dd if=/dev/md6 of=/dev/null bs=1M count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 34.9699 s, 246 MB/s
===========

I was expecting to see numbers in the neighborhood of 450 MB/s

My kernel: linux-image-3.2.0-4-rt-amd64 3.2.32-1 Linux 3.2 for 64-bit PCs, PREEMPT_RT

My raid6 parameters: http://paste.debian.net/224887/

My partition alignment: http://paste.debian.net/224888/
(rationale: 4k * 255 * 63, which keeps both 4k-sector drives aligned and
older partition utilities happy)

My disk parameters: http://paste.debian.net/224890/

Readaheads: http://paste.debian.net/224892/

Any ideas/suggestions welcome.

Cheers

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-15 12:33 Suboptimal raid6 linear read speed Peter Rabbitson
@ 2013-01-15 12:45 ` Mikael Abrahamsson
  2013-01-15 12:56   ` Peter Rabbitson
  2013-01-15 12:49 ` Phil Turmel
  1 sibling, 1 reply; 46+ messages in thread
From: Mikael Abrahamsson @ 2013-01-15 12:45 UTC (permalink / raw)
  To: Peter Rabbitson; +Cc: linux-raid

On Tue, 15 Jan 2013, Peter Rabbitson wrote:

> Any ideas/suggestions welcome.

I'm interested in seeing "iostat -x 5" output from when you're doing 
sustained reading.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-15 12:33 Suboptimal raid6 linear read speed Peter Rabbitson
  2013-01-15 12:45 ` Mikael Abrahamsson
@ 2013-01-15 12:49 ` Phil Turmel
  2013-01-15 12:55   ` Peter Rabbitson
  1 sibling, 1 reply; 46+ messages in thread
From: Phil Turmel @ 2013-01-15 12:49 UTC (permalink / raw)
  To: Peter Rabbitson; +Cc: linux-raid

On 01/15/2013 07:33 AM, Peter Rabbitson wrote:
> Hello,
> 
> Apologies in advance if this question has been answered before - I 
> perused the archives to no avail.
> 
> I am experiencing slow linear read from a bare raid6 device, while the 
> underlying drives, read individually, deliver their full sequential 
> throughput. I can't seem to find an explanation for this.
> 
> Regular parallel read from members:
> ===========
> Imladris:~# echo 3 > /proc/sys/vm/drop_caches; sleep 3; for d in /dev/sd[abcd] ; do dd if=$d of=/dev/null bs=1M count=2048 & done
> [1] 13953
> [2] 13954
> [3] 13955
> [4] 13956
> Imladris:~# 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 17.331 s, 124 MB/s
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 17.5719 s, 122 MB/s
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 18.47 s, 116 MB/s
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 18.6834 s, 115 MB/s
> ===========
> 
> Same sort of read from array itself:
> ===========
> Imladris:~# echo 3 > /proc/sys/vm/drop_caches; sleep 3; dd if=/dev/md6 of=/dev/null bs=1M count=8192
> 8192+0 records in
> 8192+0 records out
> 8589934592 bytes (8.6 GB) copied, 34.9699 s, 246 MB/s
> ===========
> 
> I was expecting to see numbers in the neighborhood of 450 MB/s

You are neglecting each drive's need to skip over parity blocks.  If the
array's chunk size is small, the drives won't have to seek, just wait
for the platter spin.  Larger chunks might need a seek.  Either way, you
won't get better than (single drive rate) * (n-2) where "n" is the
number of drives in your array. (Large sequential reads.)
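
As a rough sanity check against the numbers quoted above (treating ~120 MB/s
as a representative per-member rate from your dd runs): (4 - 2) * ~120 MB/s
comes to roughly 240 MB/s, which is right about the 246 MB/s the array
delivered.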

Phil

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-15 12:49 ` Phil Turmel
@ 2013-01-15 12:55   ` Peter Rabbitson
  2013-01-15 17:09     ` Charles Polisher
                       ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Peter Rabbitson @ 2013-01-15 12:55 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

On Tue, Jan 15, 2013 at 07:49:10AM -0500, Phil Turmel wrote:
> You are neglecting each drive's need to skip over parity blocks.  If the
> array's chunk size is small, the drives won't have to seek, just wait
> for the platter spin.  Larger chunks might need a seek.

> Either way, you
> won't get better than (single drive rate) * (n-2) where "n" is the
> number of drives in your array. (Large sequential reads.)

This can't be right. As far as I know the md layer is smarter than that, and
includes various anticipatory codepaths specifically to leverage multiple
drives in this fashion. Fwiw raid5 does give me the near-expected speed
(n * single drive).

Cheers


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-15 12:45 ` Mikael Abrahamsson
@ 2013-01-15 12:56   ` Peter Rabbitson
  2013-01-15 16:13     ` Mikael Abrahamsson
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Rabbitson @ 2013-01-15 12:56 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: linux-raid

On Tue, Jan 15, 2013 at 01:45:18PM +0100, Mikael Abrahamsson wrote:
> On Tue, 15 Jan 2013, Peter Rabbitson wrote:
> 
> >Any ideas/suggestions welcome.
> 
> I'm interested in seeing "iostat -x 5" output from when you're doing
> sustained reading.
> 

The entire run "from quiet to quiet" of

echo 3 > /proc/sys/vm/drop_caches; sleep 3; dd if=/dev/md6 of=/dev/null bs=1M count=8192

http://paste.debian.net/224907/

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-15 12:56   ` Peter Rabbitson
@ 2013-01-15 16:13     ` Mikael Abrahamsson
  0 siblings, 0 replies; 46+ messages in thread
From: Mikael Abrahamsson @ 2013-01-15 16:13 UTC (permalink / raw)
  To: Peter Rabbitson; +Cc: linux-raid

On Tue, 15 Jan 2013, Peter Rabbitson wrote:

> http://paste.debian.net/224907/

Looks like sda is slower than the rest of the drives for some reason (see 
last column):

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb           14909.80     0.00  117.40    0.00 120217.60     0.00  1024.00     0.47    3.99   3.99  46.80
sdc           14779.80     0.00  234.80    0.00 120115.20     0.00   511.56     0.53    2.26   2.03  47.76
sdd           14884.40     0.00  117.40    0.00 120217.60     0.00  1024.00     0.52    4.45   4.44  52.16
sda           14779.80     0.00  234.40    0.00 120012.80     0.00   512.00     1.79    7.65   4.17  97.84

Also peculiar how sdc and sda have twice the number of r/s compared to the 
other drives?
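
(The avgrq-sz column hints at why: sda and sdc are completing ~512-sector
requests versus ~1024-sector requests on sdb and sdd, so they need roughly
twice as many reads to move the same rsec/s.)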

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-15 12:55   ` Peter Rabbitson
@ 2013-01-15 17:09     ` Charles Polisher
  2013-01-15 19:57       ` keld
  2013-01-15 23:17     ` Phil Turmel
  2013-01-16  2:48     ` Stan Hoeppner
  2 siblings, 1 reply; 46+ messages in thread
From: Charles Polisher @ 2013-01-15 17:09 UTC (permalink / raw)
  To: Peter Rabbitson; +Cc: Phil Turmel, linux-raid

On Tue, Jan 15, 2013 at 11:55:07PM +1100, Peter Rabbitson wrote:
> On Tue, Jan 15, 2013 at 07:49:10AM -0500, Phil Turmel wrote:
> > You are neglecting each drive's need to skip over parity blocks.  If the
> > array's chunk size is small, the drives won't have to seek, just wait
> > for the platter spin.  Larger chunks might need a seek.
> 
> > Either way, you
> > won't get better than (single drive rate) * (n-2) where "n" is the
> > number of drives in your array. (Large sequential reads.)
> 
> This can't be right. As far as I know the md layer is smarter than that, and
> includes various anticipatory codepaths specifically to leverage multiple
> drives in this fashion. Fwiw raid5 does give me the near-expected speed
> (n * single drive).

Happen to be working with comparative benchmarks looking for
relative throughput, varying the number of active drives in the
array and the RAID level. Clearly in this data RAID6 sequential
writes are bottlenecked by the 2 parity stripes. RAID6 setup
increases from 2 non-parity drives in the 4 drive configuration
to 6 non-parity drives in the 8 drive configuration, so one
might hope for 3x advantage. Yet the data show an advantage of
only 1.83 for reads. My guess is the need to read the parity
stripes is again a limiting factor. Next benchmark will vary
stripe and stride.

                                        Advantage     Advantage
                                        vs 4 drives   vs RAID0
Config  Drives  Seq write   Seq  read   Write  Read   Write Read
------  ------  ----------  ----------  ----- -----   ----  ----
RAID0   4        8.1MB/sec   9.3MB/sec   1.00  1.00   1.00  1.00
RAID0   8       16.8MB/sec  15.0MB/sec   2.07  1.61   1.00  1.00
 
RAID1   4        2.1MB/sec   3.6MB/sec   1.00  1.00   0.25  0.38
RAID1   8        1.6MB/sec   3.6MB/sec   0.76  1.00   0.09  0.24

RAID5   4       16.8MB/sec   9.1MB/sec   1.00  1.00   2.07  0.97
RAID5   8       17.2MB/sec  14.9MB/sec   1.02  1.63   2.12  1.60

RAID6   4       12.6MB/sec   7.9MB/sec   1.00  1.00   1.55  0.84
RAID6   8       14.4MB/sec  14.5MB/sec   1.63  1.83   1.77  1.55

RAID10  4        4.0MB/sec   7.3MB/sec   1.00  1.00   0.49  0.78
RAID10  8        6.3MB/sec  13.4MB/sec   1.57  1.83   0.37  0.89

Yes, these drives are *really* slow (Connor CP 30548). 
The math doesn't change.
-- 
Charles



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-15 17:09     ` Charles Polisher
@ 2013-01-15 19:57       ` keld
  2013-01-16  4:43         ` Charles Polisher
  0 siblings, 1 reply; 46+ messages in thread
From: keld @ 2013-01-15 19:57 UTC (permalink / raw)
  To: Charles Polisher; +Cc: Peter Rabbitson, Phil Turmel, linux-raid

On Tue, Jan 15, 2013 at 09:09:38AM -0800, Charles Polisher wrote:
> On Tue, Jan 15, 2013 at 11:55:07PM +1100, Peter Rabbitson wrote:
> > On Tue, Jan 15, 2013 at 07:49:10AM -0500, Phil Turmel wrote:
> > > You are neglecting each drive's need to skip over parity blocks.  If the
> > > array's chunk size is small, the drives won't have to seek, just wait
> > > for the platter spin.  Larger chunks might need a seek.
> > 
> > > Either way, you
> > > won't get better than (single drive rate) * (n-2) where "n" is the
> > > number of drives in your array. (Large sequential reads.)
> > 
> > This can't be right. As far as I know the md layer is smarter than that, and
> > includes various anticipatory codepaths specifically to leverage multiple
> > drives in this fashion. Fwiw raid5 does give me the near-expected speed
> > (n * single drive).
> 
> Happen to be working with comparative benchmarks looking for
> relative throughput, varying the number of active drives in the
> array and the RAID level. Clearly in this data RAID6 sequential
> writes are bottlenecked by the 2 parity stripes. RAID6 setup
> increases from 2 non-parity drives in the 4 drive configuration
> to 6 non-parity drives in the 8 drive configuration, so one
> might hope for 3x advantage. Yet the data show an advantage of
> only 1.83 for reads. My guess is the need to read the parity
> stripes is again a limiting factor. Next benchmark will vary
> stripe and stride.
> 
>                                         Advantage     Advantage
>                                         vs 4 drives   vs RAID0
> Config  Drives  Seq write   Seq  read   Write  Read   Write Read
> ------  ------  ----------  ----------  ----- -----   ----  ----
> RAID0   4        8.1MB/sec   9.3MB/sec   1.00  1.00   1.00  1.00
> RAID0   8       16.8MB/sec  15.0MB/sec   2.07  1.61   1.00  1.00
>  
> RAID1   4        2.1MB/sec   3.6MB/sec   1.00  1.00   0.25  0.38
> RAID1   8        1.6MB/sec   3.6MB/sec   0.76  1.00   0.09  0.24
> 
> RAID5   4       16.8MB/sec   9.1MB/sec   1.00  1.00   2.07  0.97
> RAID5   8       17.2MB/sec  14.9MB/sec   1.02  1.63   2.12  1.60
> 
> RAID6   4       12.6MB/sec   7.9MB/sec   1.00  1.00   1.55  0.84
> RAID6   8       14.4MB/sec  14.5MB/sec   1.63  1.83   1.77  1.55
> 
> RAID10  4        4.0MB/sec   7.3MB/sec   1.00  1.00   0.49  0.78
> RAID10  8        6.3MB/sec  13.4MB/sec   1.57  1.83   0.37  0.89
> 
> Yes, these drives are *really* slow (Connor CP 30548). 
> The math doesn't change.
> -- 
> Charles

What layout are you using for RAID10?
Is it Linux MD RAID10?

Best regards
Keld

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-15 12:55   ` Peter Rabbitson
  2013-01-15 17:09     ` Charles Polisher
@ 2013-01-15 23:17     ` Phil Turmel
  2013-01-16  2:48     ` Stan Hoeppner
  2 siblings, 0 replies; 46+ messages in thread
From: Phil Turmel @ 2013-01-15 23:17 UTC (permalink / raw)
  To: Peter Rabbitson; +Cc: linux-raid

On 01/15/2013 07:55 AM, Peter Rabbitson wrote:
> On Tue, Jan 15, 2013 at 07:49:10AM -0500, Phil Turmel wrote:
>> You are neglecting each drive's need to skip over parity blocks.  If the
>> array's chunk size is small, the drives won't have to seek, just wait
>> for the platter spin.  Larger chunks might need a seek.
> 
>> Either way, you
>> won't get better than (single drive rate) * (n-2) where "n" is the
>> number of drives in your array. (Large sequential reads.)
> 
> This can't be right. As far as I know the md layer is smarter than that, and
> includes various anticipatory codepaths specifically to leverage multiple
> drives in this fashion. Fwiw raid5 does give me the near-expected speed
> (n * single drive).

Please look at the chunk layout for raid6.  There's parity P and Q
chunks evenly distributed amongst all drives.

http://en.wikipedia.org/wiki/Standard_RAID_levels

When not degraded, reading many chunks worth of sequential data from the
array, MD's requests to the drives will omit those parity blocks.  The
drive, if it was reading ahead, will have to discard that data, or if
not reading ahead, will have to seek past it.  This happens every N-2
chunks per drive.

Your test reads from the individual disks read contiguous sequential
blocks.  Sequential reads from a raid6 array will generate short
sequential reads on each drive, separated by skips over the unneeded
parity chunks.  This is true for raid5 as well, but only skipping one
chunk instead of two.
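
To make the skipping concrete, here is a rough sketch (Python, illustration
only; the exact rotation depends on the layout the array was created with)
of how P and Q move across a 4-drive raid6:

N = 4                          # drives in the array
for stripe in range(6):
    p = (N - 1 - stripe) % N   # drive holding P in this stripe (illustrative rotation)
    q = (p + 1) % N            # Q lands on the following drive
    cells = ["P" if d == p else "Q" if d == q else "D" for d in range(N)]
    print("stripe %d: %s" % (stripe, " ".join(cells)))

Every drive ends up holding P or Q in 2 of every N stripes, so a long
sequential read only ever touches (N-2)/N of each drive's surface; the rest
has to be skipped.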

MD doesn't have any secret sauce that'll let it magically avoid those
skips.  If you can't see that, I can't help you further.

Phil

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-15 12:55   ` Peter Rabbitson
  2013-01-15 17:09     ` Charles Polisher
  2013-01-15 23:17     ` Phil Turmel
@ 2013-01-16  2:48     ` Stan Hoeppner
  2013-01-16  2:58       ` Peter Rabbitson
  2 siblings, 1 reply; 46+ messages in thread
From: Stan Hoeppner @ 2013-01-16  2:48 UTC (permalink / raw)
  To: Peter Rabbitson; +Cc: Phil Turmel, linux-raid

On 1/15/2013 6:55 AM, Peter Rabbitson wrote:
> On Tue, Jan 15, 2013 at 07:49:10AM -0500, Phil Turmel wrote:
>> You are neglecting each drive's need to skip over parity blocks.  If the
>> array's chunk size is small, the drives won't have to seek, just wait
>> for the platter spin.  Larger chunks might need a seek.
> 
>> Either way, you
>> won't get better than (single drive rate) * (n-2) where "n" is the
>> number of drives in your array. (Large sequential reads.)
> 
> This can't be right. As far as I know the md layer is smarter than that, and
> includes various anticipatory codepaths specifically to leverage multiple
> drives in this fashion. Fwiw raid5 does give me the near-expected speed
> (n * single drive).

It is right.  You're likely confusing this with the "smarts" of RAID1/10
optimizations.  In that case you have more than one copy of each block
on more than one drive allowing for additional parallelism.  With a 4
drive RAID6 you only have one copy of each block on one drive.  Thus as
Phil states the best performance you can get here is 2 spindles of
throughput, which is why you're seeing a max of ~250MB/s for the array.

Unless you plan to expand this array in the future by adding more drives
and doing a reshape, I'd suggest you switch to RAID10.   It will give
you 3x or more write throughput with greatly reduced latency,
substantially faster rebuild times, and possibly a little extra read
throughput.

With only 4 drives RAID6 doesn't make sense as RAID10 is superior in
every way.

-- 
Stan


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-16  2:48     ` Stan Hoeppner
@ 2013-01-16  2:58       ` Peter Rabbitson
  2013-01-16 20:29         ` Stan Hoeppner
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Rabbitson @ 2013-01-16  2:58 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Phil Turmel, linux-raid

On Tue, Jan 15, 2013 at 08:48:23PM -0600, Stan Hoeppner wrote:
> With only 4 drives RAID6 doesn't make sense as RAID10 is superior in
> every way.

Except raid6 can lose any random 2 drives, while raid10 can't.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-15 19:57       ` keld
@ 2013-01-16  4:43         ` Charles Polisher
  2013-01-16  6:37           ` Tommy Apel Hansen
  2013-01-16  9:36           ` keld
  0 siblings, 2 replies; 46+ messages in thread
From: Charles Polisher @ 2013-01-16  4:43 UTC (permalink / raw)
  To: keld; +Cc: Peter Rabbitson, Phil Turmel, linux-raid

keld@keldix.com wrote:
> >                                         Advantage     Advantage
> >                                         vs 4 drives   vs RAID0
> > Config  Drives  Seq write   Seq  read   Write  Read   Write Read
> > ------  ------  ----------  ----------  ----- -----   ----  ----
> > RAID0   4        8.1MB/sec   9.3MB/sec   1.00  1.00   1.00  1.00
> > RAID0   8       16.8MB/sec  15.0MB/sec   2.07  1.61   1.00  1.00
> >  
> > RAID1   4        2.1MB/sec   3.6MB/sec   1.00  1.00   0.25  0.38
> > RAID1   8        1.6MB/sec   3.6MB/sec   0.76  1.00   0.09  0.24
> > 
> > RAID5   4       16.8MB/sec   9.1MB/sec   1.00  1.00   2.07  0.97
> > RAID5   8       17.2MB/sec  14.9MB/sec   1.02  1.63   2.12  1.60
> > 
> > RAID6   4       12.6MB/sec   7.9MB/sec   1.00  1.00   1.55  0.84
> > RAID6   8       14.4MB/sec  14.5MB/sec   1.63  1.83   1.77  1.55
> > 
> > RAID10  4        4.0MB/sec   7.3MB/sec   1.00  1.00   0.49  0.78
> > RAID10  8        6.3MB/sec  13.4MB/sec   1.57  1.83   0.37  0.89
> > 
> What layout are you using for RAID10?
> Is it Linux MD RAID10?

# cat /proc/mdstat 
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] 
md1 : active raid10 sdi[7] sdh[6] sdg[5] sdf[4] sde[3] sdd[2] sdc[1] sdb[0]
      2118656 blocks 64K chunks 2 near-copies [8/8] [UUUUUUUU]



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-16  4:43         ` Charles Polisher
@ 2013-01-16  6:37           ` Tommy Apel Hansen
  2013-01-16  9:36           ` keld
  1 sibling, 0 replies; 46+ messages in thread
From: Tommy Apel Hansen @ 2013-01-16  6:37 UTC (permalink / raw)
  To: Charles Polisher; +Cc: keld, Peter Rabbitson, Phil Turmel, linux-raid

Hello,
1) you shouldn't use more than 2 drives for RAID1 as performance will be impacted (* at least that's what my own tests show)
2) for RAID10 use "2 far-copies" (-p f2) as that seems to be the most optimal layout

/Tommy

On Tue, 2013-01-15 at 20:43 -0800, Charles Polisher wrote:
> keld@keldix.com wrote:
> > >                                         Advantage     Advantage
> > >                                         vs 4 drives   vs RAID0
> > > Config  Drives  Seq write   Seq  read   Write  Read   Write Read
> > > ------  ------  ----------  ----------  ----- -----   ----  ----
> > > RAID0   4        8.1MB/sec   9.3MB/sec   1.00  1.00   1.00  1.00
> > > RAID0   8       16.8MB/sec  15.0MB/sec   2.07  1.61   1.00  1.00
> > >  
> > > RAID1   4        2.1MB/sec   3.6MB/sec   1.00  1.00   0.25  0.38
> > > RAID1   8        1.6MB/sec   3.6MB/sec   0.76  1.00   0.09  0.24
> > > 
> > > RAID5   4       16.8MB/sec   9.1MB/sec   1.00  1.00   2.07  0.97
> > > RAID5   8       17.2MB/sec  14.9MB/sec   1.02  1.63   2.12  1.60
> > > 
> > > RAID6   4       12.6MB/sec   7.9MB/sec   1.00  1.00   1.55  0.84
> > > RAID6   8       14.4MB/sec  14.5MB/sec   1.63  1.83   1.77  1.55
> > > 
> > > RAID10  4        4.0MB/sec   7.3MB/sec   1.00  1.00   0.49  0.78
> > > RAID10  8        6.3MB/sec  13.4MB/sec   1.57  1.83   0.37  0.89
> > > 
> > What layout are you using for RAID10?
> > Is it Linux MD RAID10?
> 
> # cat /proc/mdstat 
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] 
> md1 : active raid10 sdi[7] sdh[6] sdg[5] sdf[4] sde[3] sdd[2] sdc[1] sdb[0]
>       2118656 blocks 64K chunks 2 near-copies [8/8] [UUUUUUUU]
> 
> 



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-16  4:43         ` Charles Polisher
  2013-01-16  6:37           ` Tommy Apel Hansen
@ 2013-01-16  9:36           ` keld
  2013-01-16 16:09             ` Charles Polisher
  1 sibling, 1 reply; 46+ messages in thread
From: keld @ 2013-01-16  9:36 UTC (permalink / raw)
  To: Charles Polisher; +Cc: Peter Rabbitson, Phil Turmel, linux-raid

Hi Charles

It really does not show below which layout you use for RAID10. This is quite
important as the different layouts of RAID10 have quite different 
performance characteristics. The 'far' layout tends to be the fastest.

Which command did you use to create the RAID10 array?

best regards
Keld

On Tue, Jan 15, 2013 at 08:43:35PM -0800, Charles Polisher wrote:
> keld@keldix.com wrote:
> > >                                         Advantage     Advantage
> > >                                         vs 4 drives   vs RAID0
> > > Config  Drives  Seq write   Seq  read   Write  Read   Write Read
> > > ------  ------  ----------  ----------  ----- -----   ----  ----
> > > RAID0   4        8.1MB/sec   9.3MB/sec   1.00  1.00   1.00  1.00
> > > RAID0   8       16.8MB/sec  15.0MB/sec   2.07  1.61   1.00  1.00
> > >  
> > > RAID1   4        2.1MB/sec   3.6MB/sec   1.00  1.00   0.25  0.38
> > > RAID1   8        1.6MB/sec   3.6MB/sec   0.76  1.00   0.09  0.24
> > > 
> > > RAID5   4       16.8MB/sec   9.1MB/sec   1.00  1.00   2.07  0.97
> > > RAID5   8       17.2MB/sec  14.9MB/sec   1.02  1.63   2.12  1.60
> > > 
> > > RAID6   4       12.6MB/sec   7.9MB/sec   1.00  1.00   1.55  0.84
> > > RAID6   8       14.4MB/sec  14.5MB/sec   1.63  1.83   1.77  1.55
> > > 
> > > RAID10  4        4.0MB/sec   7.3MB/sec   1.00  1.00   0.49  0.78
> > > RAID10  8        6.3MB/sec  13.4MB/sec   1.57  1.83   0.37  0.89
> > > 
> > What layout are you using for RAID10?
> > Is it Linux MD RAID10?
> 
> # cat /proc/mdstat 
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] 
> md1 : active raid10 sdi[7] sdh[6] sdg[5] sdf[4] sde[3] sdd[2] sdc[1] sdb[0]
>       2118656 blocks 64K chunks 2 near-copies [8/8] [UUUUUUUU]
> 
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-16  9:36           ` keld
@ 2013-01-16 16:09             ` Charles Polisher
  2013-01-16 20:40               ` EJ Vincent
  0 siblings, 1 reply; 46+ messages in thread
From: Charles Polisher @ 2013-01-16 16:09 UTC (permalink / raw)
  To: keld; +Cc: Peter Rabbitson, Phil Turmel, linux-raid

keld@keldix.com wrote:
> Hi Charles
> 
> It really does not show below which layout you use for RAID10. This is quite
> important as the different layouts of RAID10 have quite different 
> performance characteristics. The 'far' layout tends to be the fastest.

> Which command did you use to create the RAID10 array?
 

mdadm --create /dev/md1 --chunk=64 --level=10 \
      --raid-devices=8 --spare-devices=0      \
      --parity=n2                             \
      /dev/sd[abcdefgh]1

> > # cat /proc/mdstat 
> > Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] 
> > md1 : active raid10 sdi[7] sdh[6] sdg[5] sdf[4] sde[3] sdd[2] sdc[1] sdb[0]
> >       2118656 blocks 64K chunks 2 near-copies [8/8] [UUUUUUUU]
                                    ^^^^^^^^^^^^^
I will be testing far placement as well.

I very much appreciate your interest and guidance. To clarify my
goals, I am trying to demonstrate generalized RAID characteristics
using an experimental workbench in various configurations, tying
the observed experimental data to expected values. I'd like to
wring out obvious errors in my experimental technique.

The initial motivation was a glaringly poor choice of RAID level
for a performance-sensitive production system which I have to
live with for years to come. The most unsettling aspect was that
the selection process was stymied by the complexity of the
choices and an abundance of disinformation. I want to address
this pitfall by publishing an illustrated guide to performance,
cost, and reliability up and down the storage stack, helping
people to make reasoned choices in purchasing and configuring
storage. Varying the RAID level and configuration are key (but
not the only) parts of this work. I'm well along in the
literature search, workbench setup, and toolchain. I'm expecting
to put a year or two into the project and to open source
everything.
 
Best regards,
-- 
Charles


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-16  2:58       ` Peter Rabbitson
@ 2013-01-16 20:29         ` Stan Hoeppner
  2013-01-16 21:20           ` Roy Sigurd Karlsbakk
  2013-01-17 15:51           ` Mikael Abrahamsson
  0 siblings, 2 replies; 46+ messages in thread
From: Stan Hoeppner @ 2013-01-16 20:29 UTC (permalink / raw)
  To: Peter Rabbitson; +Cc: Phil Turmel, linux-raid

On 1/15/2013 8:58 PM, Peter Rabbitson wrote:
> On Tue, Jan 15, 2013 at 08:48:23PM -0600, Stan Hoeppner wrote:
>> With only 4 drives RAID6 doesn't make sense as RAID10 is superior in
>> every way.
> 
> Except raid6 can lose any random 2 drives, while raid10 can't.

This isn't a legitimate argument.  The probability of you being struck
by lightning is greater than two drives in the same mirror in a 4 drive
RAID10 dying before a rebuild completes.

I challenge you to do an exhaustive search for anyone, at any time in
history, who was managing the array properly, suffering such a two drive
failure and losing a RAID10 array, 4 drives or greater.  Note that
controller failures with all drives on one controller don't count, as
that failure mode will take down any array of any RAID level.

-- 
Stan


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-16 16:09             ` Charles Polisher
@ 2013-01-16 20:40               ` EJ Vincent
  0 siblings, 0 replies; 46+ messages in thread
From: EJ Vincent @ 2013-01-16 20:40 UTC (permalink / raw)
  Cc: linux-raid

On 1/16/2013 11:09 AM, Charles Polisher wrote:
> keld@keldix.com wrote:
>> Hi Charles
>>
>> It really does not show below which layout you use for RAID10. This is quite
>> important as the different layouts of RAID10 have quite different
>> performance characteristics. The 'far' layout tends to be the fastest.
>> Which command did you use to create the RAID10 array?
>   
>
> mdadm --create /dev/md1 --chunk=64 --level=10 \
>        --raid-devices=8 --spare-devices=0      \
>        --parity=n2                             \
>        /dev/sd[abcdefgh]1
>
>>> # cat /proc/mdstat
>>> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
>>> md1 : active raid10 sdi[7] sdh[6] sdg[5] sdf[4] sde[3] sdd[2] sdc[1] sdb[0]
>>>        2118656 blocks 64K chunks 2 near-copies [8/8] [UUUUUUUU]
>                                      ^^^^^^^^^^^^^
> I will be testing far placement as well.
>
> I very much appreciate your interest and guidance. To clarify my
> goals, I am trying to demonstrate generalized RAID characteristics
> using an experimental workbench in various configurations, tying
> the observed experimental data to expected values. I'd like to
> wring out obvious errors in my experimental technique.
>
> The initial motivation was a glaringly poor choice of RAID level
> for a performance-sensitive production system which I have to
> live with for years to come. The most unsettling aspect was that
> the selection process was stymied by the complexity of the
> choices and an abundance of disinformation. I want to address
> this pitfall by publishing an illustrated guide to performance,
> cost, and reliability up and down the storage stack, helping
> people to make reasoned choices in purchasing and configuring
> storage. Varying the RAID level and configuration are key (but
> not the only) parts of this work. I'm well along in the
> literature search, workbench setup, and toolchain. I'm expecting
> to put a year or two into the project and to open source
> everything.
>   
> Best regards,

I am excited to see the results of this project. Good luck! I'll be 
waiting patiently.

-- 
EJ Vincent
ej@ejane.org


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-16 20:29         ` Stan Hoeppner
@ 2013-01-16 21:20           ` Roy Sigurd Karlsbakk
  2013-01-17 15:51           ` Mikael Abrahamsson
  1 sibling, 0 replies; 46+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-01-16 21:20 UTC (permalink / raw)
  To: stan; +Cc: Phil Turmel, linux-raid, Peter Rabbitson

> This isn't a legitimate argument. The probability of you being struck
> by lightning is greater than two drives in the same mirror in a 4
> drive RAID10 dying before a rebuild completes.

No, it's not!

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of xenotypic etymology. In most cases adequate and relevant synonyms exist in Norwegian.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-16 20:29         ` Stan Hoeppner
  2013-01-16 21:20           ` Roy Sigurd Karlsbakk
@ 2013-01-17 15:51           ` Mikael Abrahamsson
  2013-01-18  8:31             ` Stan Hoeppner
  1 sibling, 1 reply; 46+ messages in thread
From: Mikael Abrahamsson @ 2013-01-17 15:51 UTC (permalink / raw)
  To: linux-raid

On Wed, 16 Jan 2013, Stan Hoeppner wrote:

> This isn't a legitimate argument.  The probability of you being struck 
> by lightning is greater than two drives in the same mirror in a 4 drive 
> RAID10 dying before a rebuild completes.

The probability of getting struck by lightning is a lot less than being 
struck by a read error when rebuilding from the only remaining mirror when 
one drive failed and you've replaced it.

> I challenge you to do an exhaustive search for anyone, at any time in 
> history, who was managing the array properly, suffering such a two drive 
> failure and losing a RAID10 array, 4 drives or greater.  Note that 
> controller failures with all drives on one controller don't count, as 
> that failure mode will take down any array of any RAID level.

<http://www.zdnet.com/blog/storage/why-raid-5-stops-working-in-2009/162> 
is applicable to RAID1 and RAID10 as well as RAID5.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-17 15:51           ` Mikael Abrahamsson
@ 2013-01-18  8:31             ` Stan Hoeppner
  2013-01-18  9:18               ` Mikael Abrahamsson
  0 siblings, 1 reply; 46+ messages in thread
From: Stan Hoeppner @ 2013-01-18  8:31 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: linux-raid

On 1/17/2013 9:51 AM, Mikael Abrahamsson wrote:

> The probability of getting struck by lightning is a lot less than being
> struck by a read error when rebuilding from the only remaining mirror
> when one drive failed and you've replaced it.

The probability of a URE during rebuild increases with the number and
size of the source drives being read to rebuild the failed drive.  Thus
the probability of encountering a URE in the 1:1 drive scenario is
extremely low, close to zero if you believe manufacturer specs.

> <http://www.zdnet.com/blog/storage/why-raid-5-stops-working-in-2009/162>
> is applicable to RAID1 and RAID10 as well as RAID5.

In Robin's example we're reading 12TB of sectors from 6 drives to
complete the rebuild of one failed drive, so the overall probability of a
URE is less than that of a single drive.  With RAID1/10 we're only
reading 2TB, well below the URE rates for single drives.

So, no, the "URE scare" being propagated these days doesn't affect
RAID1/10.  If/when individual drive capacities exceed 10TB in the
future, and if at that time the URE rates per drive do not improve,
-then- this phenomenon will affect RAID1/10.  But it does not currently
with today's drives.

-- 
Stan


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-18  8:31             ` Stan Hoeppner
@ 2013-01-18  9:18               ` Mikael Abrahamsson
  2013-01-18 22:56                 ` Stan Hoeppner
  0 siblings, 1 reply; 46+ messages in thread
From: Mikael Abrahamsson @ 2013-01-18  9:18 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: linux-raid

On Fri, 18 Jan 2013, Stan Hoeppner wrote:

> The probability of a URE during rebuild increases with the number and
> size of the source drives being read to rebuild the failed drive.  Thus
> the probability of encountering a URE in the 1:1 drive scenario is
> extremely low, close to zero if you believe manufacturer specs.

For a 2TB drive and BER 10^-14 (common for non-enterprise drives), the 
probability is 1/6 of a single URE for a read of the entire drive.

> So, no, the "URE scare" being propagated these days doesn't affect 
> RAID1/10.  If/when individual drive capacities exceed 10TB in the 
> future, and if at that time the URE rates per drive do not improve, 
> -then- this phenomenon will affect RAID1/10.  But it does not currently 
> with today's drives.

Let's agree to disagree.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-18  9:18               ` Mikael Abrahamsson
@ 2013-01-18 22:56                 ` Stan Hoeppner
  2013-01-19  7:43                   ` Mikael Abrahamsson
  2013-01-19 13:21                   ` Roy Sigurd Karlsbakk
  0 siblings, 2 replies; 46+ messages in thread
From: Stan Hoeppner @ 2013-01-18 22:56 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: linux-raid

On 1/18/2013 3:18 AM, Mikael Abrahamsson wrote:
> On Fri, 18 Jan 2013, Stan Hoeppner wrote:
> 
>> The probability of a URE during rebuild increases with the number and
>> size of the source drives being read to rebuild the failed drive.  Thus
>> the probability of encountering a URE in the 1:1 drive scenario is
>> extremely low, close to zero if you believe manufacturer specs.
> 
> For a 2TB drive and BER 10^-14 (common for non-enterprise drives), the
> probability is 1/6 of a single URE for a read of the entire drive.

If my math is correct, with a URE rate of 10E14, that's one URE for
every ~12.5TB read.  So theoretically one would have to read the entire
2TB drive more than 6 times before hitting the first URE.  So it seems
unlikely that one would hit a URE during a mirror rebuild with such a
2TB drive.
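
(For reference, that figure is just the unit conversion: 10^14 bits / 8 =
1.25 * 10^13 bytes, i.e. ~12.5 TB.)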

>> So, no, the "URE scare" being propagated these days doesn't affect
>> RAID1/10.  If/when individual drive capacities exceed 10TB in the
>> future, and if at that time the URE rates per drive do not improve,
>> -then- this phenomenon will affect RAID1/10.  But it does not
>> currently with today's drives.
> 
> Let's agree to disagree.

This is math so there is no room for disagreement--there is one right
answer.  Either mine is correct or yours is.  If my math is incorrect
I'd certainly appreciate it if you, or anyone else, would explain where
I'm in error, so I don't disseminate incorrect information in the
future.  But given that the articles I've read on this subject agree
with my math, I don't believe I'm in error.

I made the point in a previous post that I use the smallest drives I can
get away with for a given array/workload/capacity, as performance is
generally better and rebuild times much lower.  Potential URE issues
provide yet another reason to use a higher count of smaller drives,
though again, this doesn't tend to affect most RAID1/10 users, yet.

-- 
Stan


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-18 22:56                 ` Stan Hoeppner
@ 2013-01-19  7:43                   ` Mikael Abrahamsson
  2013-01-19 22:48                     ` Stan Hoeppner
                                       ` (2 more replies)
  2013-01-19 13:21                   ` Roy Sigurd Karlsbakk
  1 sibling, 3 replies; 46+ messages in thread
From: Mikael Abrahamsson @ 2013-01-19  7:43 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: linux-raid

On Fri, 18 Jan 2013, Stan Hoeppner wrote:

> If my math is correct, with a URE rate of 10E14, that's one URE for 
> every ~12.5TB read.  So theoretically one would have to read the entire 
> 2TB drive more than 6 times before hitting the first URE.  So it seems 
> unlikely that one would hit a URE during a mirror rebuild with such a 
> 2TB drive.

Unlikely yes, but it also means one in 6 rebuilds (statistically) will 
fail with URE. I'm not willing to take that chance, thus I use RAID6.

Usually, with scrubbing etc I'd imagine that the probability is better 
than 1 in 6, but it's still a substantial risk.

> This is math so there is no room for disagreement--there is one right 
> answer.  Either mine is correct or yours is.  If my math is incorrect 
> I'd certainly appreciate it if you, or anyone else, would explain where 
> I'm in error, so I don't disseminate incorrect information in the 
> future.  But given that the articles I've read on this subject agree 
> with my math, I don't believe I'm in error.

With a BER of 10^-14 you have a 16% risk of getting URE when reading an 
entire 2TB drive. We both agree on that. You compared the risk to "getting 
hit by lightning", which I strongly disagree with.

> I made the point in a previous post that I use the smallest drives I can 
> get away with for a given array/workload/capacity, as performance is 
> generally better and rebuild times much lower.  Potential URE issues 
> provide yet another reason to use a higher count of smaller drives, 
> though again, this doesn't tend to affect most RAID1/10 users, yet.

It's important that people make informed decisions. With enterprise drives 
(better BER) that are smaller (let's say 300 GB), the risk is greatly 
reduced.
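
(For example, at the 1 in 10^15 rate typically quoted for enterprise drives,
i.e. roughly one URE per 125 TB read, a 300 GB mirror rebuild works out to
about a 0.3/125 = 0.24% chance of hitting one, versus the ~15-16% discussed
above for a 2 TB consumer drive.)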

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-18 22:56                 ` Stan Hoeppner
  2013-01-19  7:43                   ` Mikael Abrahamsson
@ 2013-01-19 13:21                   ` Roy Sigurd Karlsbakk
  1 sibling, 0 replies; 46+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-01-19 13:21 UTC (permalink / raw)
  To: stan; +Cc: linux-raid, Mikael Abrahamsson

> >> The probability of a URE during rebuild increases with the number
> >> and
> >> size of the source drives being read to rebuild the failed drive.
> >> Thus
> >> the probability of encountering a URE in the 1:1 drive scenario is
> >> extremely low, close to zero if you believe manufacturer specs.
> >
> > For a 2TB drive and BER 10^-14 (common for non-enterprise drives),
> > the
> > probability is 1/6 of a single URE for a read of the entire drive.
> 
> If my math is correct, with a URE rate of 10E14, that's one URE for
> every ~12.5TB read. So theoretically one would have to read the entire
> 2TB drive more than 6 times before hitting the first URE. So it seems
> unlikely that one would hit a URE during a mirror rebuild with such a
> 2TB drive.

ok, perhaps, maybe, but then it's 17% chance of losing data after a mirror or raid-5 rebuild with 2TB drives, or double that if using 4TB drives. That's not very amusing…

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of xenotypic etymology. In most cases adequate and relevant synonyms exist in Norwegian.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-19  7:43                   ` Mikael Abrahamsson
@ 2013-01-19 22:48                     ` Stan Hoeppner
  2013-01-19 23:51                       ` Maarten
  2013-01-19 23:53                       ` Phil Turmel
  2013-01-20  9:04                     ` Wolfgang Denk
  2013-01-20 19:28                     ` Peter Grandi
  2 siblings, 2 replies; 46+ messages in thread
From: Stan Hoeppner @ 2013-01-19 22:48 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: linux-raid

On 1/19/2013 1:43 AM, Mikael Abrahamsson wrote:

> With a BER of 10^-14 you have a 16% risk of getting URE when reading an
> entire 2TB drive.

On 1/19/2013 7:21 AM, Roy Sigurd Karlsbakk wrote:

> ok, perhaps, maybe, but then it's 17% chance of losing data after a
> mirror or raid-5 rebuild with 2TB drives...


Where are you guys coming up with this 16-17% chance of URE on any
single full read of this 2TB, 10E14 drive?  The URE rate here is 1 bit
for every 12.5 trillion bytes.  Thus, statistically, one must read this
drive more than 6 times to encounter a URE.  Given that, how is any
single full read between the 1st and the 6th going to have a 16-17%
chance of encountering a URE for that one full read?  That doesn't make
sense.

-- 
Stan


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-19 22:48                     ` Stan Hoeppner
@ 2013-01-19 23:51                       ` Maarten
  2013-01-20  0:16                         ` Chris Murphy
  2013-01-19 23:53                       ` Phil Turmel
  1 sibling, 1 reply; 46+ messages in thread
From: Maarten @ 2013-01-19 23:51 UTC (permalink / raw)
  To: linux-raid

On 01/19/13 23:48, Stan Hoeppner wrote:
> On 1/19/2013 1:43 AM, Mikael Abrahamsson wrote:
>
>> With a BER of 10^-14 you have a 16% risk of getting URE when reading an
>> entire 2TB drive.
> On 1/19/2013 7:21 AM, Roy Sigurd Karlsbakk wrote:
>
>> ok, perhaps, maybe, but then it's 17% chance of losing data after a
>> mirror or raid-5 rebuild with 2TB drives...
> Where are you guys coming up with this 16-17% chance of URE on any
> single full read of this 2TB, 10E14 drive?  The URE rate here is 1 bit
> for every 12.5 trillion bytes.  Thus, statistically, one must read this
> drive more than 6 times to encounter a URE.  Given that, how is any
> single full read between the 1st and the 6th going to have a 16-17%
> chance of encountering a URE for that one full read?  That doesn't make
> sense.
Sorry but now I have to speak up too. Of course that 16-17% figure is
right! Did you miss out on math classes ? It is all statistics. There is
a chance of '1.0' to get one URE reading 12.5 TB. That URE may be
encountered at the very start of the first TB, or it may not come at
all, because that is how statistics work. But *on*average*, you'll get
1.0 URE per 12.5 TB, ergo, 0.16 per 2.0 TB. Basic simple math... jeez.

Can we now give it a rest? Or do I need to unsubscribe ?

Cheers,
Maarten


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-19 22:48                     ` Stan Hoeppner
  2013-01-19 23:51                       ` Maarten
@ 2013-01-19 23:53                       ` Phil Turmel
  1 sibling, 0 replies; 46+ messages in thread
From: Phil Turmel @ 2013-01-19 23:53 UTC (permalink / raw)
  To: stan; +Cc: Mikael Abrahamsson, linux-raid

On 01/19/2013 05:48 PM, Stan Hoeppner wrote:
> On 1/19/2013 1:43 AM, Mikael Abrahamsson wrote:
> 
>> With a BER of 10^-14 you have a 16% risk of getting URE when reading an
>> entire 2TB drive.
> 
> On 1/19/2013 7:21 AM, Roy Sigurd Karlsbakk wrote:
> 
>> ok, perhaps, maybe, but then it's 17% chance of losing data after a
>> mirror or raid-5 rebuild with 2TB drives...
> 
> 
> Where are you guys coming up with this 16-17% chance of URE on any
> single full read of this 2TB, 10E14 drive?  The URE rate here is 1 bit
> for every 12.5 trillion bytes.  Thus, statistically, one must read this
> drive more than 6 times to encounter a URE.  Given that, how is any
> single full read between the 1st and the 6th going to have a 16-17%
> chance of encountering a URE for that one full read?  That doesn't make
> sense.

2TB/12.5TB == .16 == 16%.

It's not quite right, though.  A more precise prediction is to use the
Poisson distribution[1], as UREs are generally statistically independent
of each other (independent of the time since the previous one).

For 2TB in a 1:10^14 spec'd drive, it works out to ~ 14.8%.

Probability of zero errors in 2TB == P(0, 2TB/12.5TB) == 0.8521.

Note that the probability of reading a 12TB array without error given
1:10^14 spec'd drives is P(0, 12TB/12.5TB) ==> 38.29%, not 4%.  You
can't just scale the error rate by the size of the data to be read.

Similarly, the odds of reading through it twice ==> P(0, 24TB/12.5TB)
==> 14.66%.  It's not linear.

Of course, drives don't have a constant average error rate through their
whole life, but they behave as one through most of it.
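
If anyone wants to reproduce these figures, a quick sketch (assuming, as
above, one expected URE per 12.5 TB read for a 1:10^14 drive):

import math

def p_no_ure(tb_read, tb_per_ure=12.5):
    # Poisson probability of zero UREs while reading tb_read terabytes
    return math.exp(-tb_read / tb_per_ure)

print(p_no_ure(2))     # ~0.852, i.e. ~14.8% chance of at least one URE
print(p_no_ure(12))    # ~0.383
print(p_no_ure(24))    # ~0.147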

HTH,

Phil

[1] http://stattrek.com/probability-distributions/poisson.aspx



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-19 23:51                       ` Maarten
@ 2013-01-20  0:16                         ` Chris Murphy
  2013-01-20  0:49                           ` Maarten
  2013-01-20  6:26                           ` Mikael Abrahamsson
  0 siblings, 2 replies; 46+ messages in thread
From: Chris Murphy @ 2013-01-20  0:16 UTC (permalink / raw)
  To: Maarten; +Cc: linux-raid


On Jan 19, 2013, at 4:51 PM, Maarten <maarten@ultratux.net> wrote:

> On 01/19/13 23:48, Stan Hoeppner wrote:
>> On 1/19/2013 1:43 AM, Mikael Abrahamsson wrote:
>> 
>>> With a BER of 10^-14 you have a 16% risk of getting URE when reading an
>>> entire 2TB drive.
>> On 1/19/2013 7:21 AM, Roy Sigurd Karlsbakk wrote:
>> 
>>> ok, perhaps, maybe, but then it's 17% chance of losing data after a
>>> mirror or raid-5 rebuild with 2TB drives...
>> Where are you guys coming up with this 16-17% chance of URE on any
>> single full read of this 2TB, 10E14 drive?  The URE rate here is 1 bit
>> for every 12.5 trillion bytes.  Thus, statistically, one must read this
>> drive more than 6 times to encounter a URE.  Given that, how is any
>> single full read between the 1st and the 6th going to have a 16-17%
>> chance of encountering a URE for that one full read?  That doesn't make
>> sense.
> Sorry but now I have to speak up too. Of course that 16-17% figure is
> right! Did you miss out on math classes ? It is all statistics. There is
> a chance of '1.0' to get one URE reading 12.5 TB. That URE may be
> encountered at the very start of the first TB, or it may not come at
> all, because that is how statistics work. But *on*average*, you'll get
> 1.0 URE per 12.5 TB, ergo, 0.16 per 2.0 TB. Basic simple math… jeez.

Please explain this basic, simple math, where a URE is equivalent to 1 bit of information. And also, explain the simple math where bit of error is equal to a URE. And please explain the simple math in the context of a conventional HDD 512 byte sector, which is 4096 bits.

If you have a URE, you have lost not 1 bit. You have lost 4096 bits. A loss of 4096 bits in 12.5TB (not 12.5TiB) is an error rate of 1 bit of error in ~2.44 x 10^10 bits. That is a gross difference from published error rates.

And then explain how the manufacturer spec does not actually report the URE in anything approaching "on average" terms, but *less than* 1 bit in 10^14. If you propose the manufacturers are incorrectly reporting the error rate, realize you're basically accusing them of a rather massive fraud because less than 1 bit of error in X, is a significantly different thing than "on average" 1 bit of error in X. This could be up to, but not including, a full order magnitude higher error rate than the published spec. It's not an insignificant difference.


Chris Murphy

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-20  0:16                         ` Chris Murphy
@ 2013-01-20  0:49                           ` Maarten
  2013-01-20  1:37                             ` Phil Turmel
  2013-01-20  9:44                             ` Chris Murphy
  2013-01-20  6:26                           ` Mikael Abrahamsson
  1 sibling, 2 replies; 46+ messages in thread
From: Maarten @ 2013-01-20  0:49 UTC (permalink / raw)
  To: linux-raid

On 01/20/13 01:16, Chris Murphy wrote:
> 
> On Jan 19, 2013, at 4:51 PM, Maarten <maarten@ultratux.net> wrote:
> 
>> On 01/19/13 23:48, Stan Hoeppner wrote:
>>> On 1/19/2013 1:43 AM, Mikael Abrahamsson wrote:
>>>
>>>> With a BER of 10^-14 you have a 16% risk of getting URE when reading an
>>>> entire 2TB drive.
>>> On 1/19/2013 7:21 AM, Roy Sigurd Karlsbakk wrote:
>>>
>>>> ok, perhaps, maybe, but then it's 17% chance of losing data after a
>>>> mirror or raid-5 rebuild with 2TB drives...
>>> Where are you guys coming up with this 16-17% chance of URE on any
>>> single full read of this 2TB, 10E14 drive?  The URE rate here is 1 bit
>>> for every 12.5 trillion bytes.  Thus, statistically, one must read this
>>> drive more than 6 times to encounter a URE.  Given that, how is any
>>> single full read between the 1st and the 6th going to have a 16-17%
>>> chance of encountering a URE for that one full read?  That doesn't make
>>> sense.
>> Sorry but now I have to speak up too. Of course that 16-17% figure is
>> right! Did you miss out on math classes ? It is all statistics. There is
>> a chance of '1.0' to get one URE reading 12.5 TB. That URE may be
>> encountered at the very start of the first TB, or it may not come at
>> all, because that is how statistics work. But *on*average*, you'll get
>> 1.0 URE per 12.5 TB, ergo, 0.16 per 2.0 TB. Basic simple math… jeez.
> 
> Please explain this basic, simple math, where a URE is equivalent to 1 bit of information. And also, explain the simple math where bit of error is equal to a URE. And please explain the simple math in the context of a conventional HDD 512 byte sector, which is 4096 bits.
> 
> If you have a URE, you have lost not 1 bit. You have lost 4096 bits. A loss of 4096 bits in 12.5TB (not 12.5TiB) is an error rate of 1 bit of error in 2.44^10 bits. That is a gross difference from published error rates.
> 
> And then explain how the manufacturer spec does not actually report the URE in anything approaching "on average" terms, but *less than* 1 bit in 10^14. If you propose the manufacturers are incorrectly reporting the error rate, realize you're basically accusing them of a rather massive fraud because less than 1 bit of error in X, is a significantly different thing than "on average" 1 bit of error in X. This could be up to, but not including, a full order magnitude higher error rate than the published spec. It's not an insignificant difference.

All very nice, but that is not the point, is it. The point is, to
calculate (or rather: estimate) the odds of an URE encounter when
reading 2TB, based on the figure one has for reading 12,5 TB. Whether
that 12,5 figure is correct or not, whether endorsed by manufacturers or
not, is totally irrelevant.  It simply boils down to, if there are 10
X's in every 10G Y's, then there are 2 X's in every 2G Y's. Yes ?

cheers,
Maarten

> 
> Chris Murphy
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-20  0:49                           ` Maarten
@ 2013-01-20  1:37                             ` Phil Turmel
  2013-01-20  9:44                             ` Chris Murphy
  1 sibling, 0 replies; 46+ messages in thread
From: Phil Turmel @ 2013-01-20  1:37 UTC (permalink / raw)
  To: Maarten; +Cc: linux-raid

On 01/19/2013 07:49 PM, Maarten wrote:
> On 01/20/13 01:16, Chris Murphy wrote:
>> 
>> On Jan 19, 2013, at 4:51 PM, Maarten <maarten@ultratux.net> wrote:
>> 
>>> On 01/19/13 23:48, Stan Hoeppner wrote:
>>>> On 1/19/2013 1:43 AM, Mikael Abrahamsson wrote:
>>>> 
>>>>> With a BER of 10^-14 you have a 16% risk of getting URE when
>>>>> reading an entire 2TB drive.

>>>> On 1/19/2013 7:21 AM, Roy Sigurd Karlsbakk wrote:
>>>> 
>>>>> ok, perhaps, maybe, but then it's 17% chance of losing data
>>>>> after a mirror or raid-5 rebuild with 2TB drives...

>>>> Where are you guys coming up with this 16-17% chance of URE on
>>>> any single full read of this 2TB, 10E14 drive?  The URE rate
>>>> here is 1 bit for every 12.5 trillion bytes.  Thus,
>>>> statistically, one must read this drive more than 6 times to
>>>> encounter a URE.  Given that, how is any single full read
>>>> between the 1st and the 6th going to have a 16-17% chance of
>>>> encountering a URE for that one full read?  That doesn't make 
>>>> sense.

>>> Sorry but now I have to speak up too. Of course that 16-17%
>>> figure is right! Did you miss out on math classes ? It is all
>>> statistics. There is a chance of '1.0' to get one URE reading
>>> 12.5 TB. That URE may be encountered at the very start of the
>>> first TB, or it may not come at all, because that is how
>>> statistics work. But *on*average*, you'll get 1.0 URE per 12.5
>>> TB, ergo, 0.16 per 2.0 TB. Basic simple math… jeez.
>> 
>> Please explain this basic, simple math, where a URE is equivalent
>> to 1 bit of information. And also, explain the simple math where
>> bit of error is equal to a URE. And please explain the simple math
>> in the context of a conventional HDD 512 byte sector, which is 4096
>> bits.
>> 
>> If you have a URE, you have lost not 1 bit. You have lost 4096
>> bits. A loss of 4096 bits in 12.5TB (not 12.5TiB) is an error rate
>> of 1 bit of error in 2.44x10^10 bits. That is a gross difference from
>> published error rates.
>> 
>> And then explain how the manufacturer spec does not actually report
>> the URE in anything approaching "on average" terms, but *less than*
>> 1 bit in 10^14. If you propose the manufacturers are incorrectly
>> reporting the error rate, realize you're basically accusing them of
>> a rather massive fraud because less than 1 bit of error in X, is a
>> significantly different thing than "on average" 1 bit of error in
>> X. This could be up to, but not including, a full order of magnitude
>> higher error rate than the published spec. It's not an
>> insignificant difference.
> 
> All very nice, but that is not the point, is it. The point is, to 
> calculate (or rather: estimate) the odds of an URE encounter when 
> reading 2TB, based on the figure one has for reading 12,5 TB.
> Whether that 12,5 figure is correct or not, whether endorsed by
> manufacturers or not, is totally irrelevant.  It simply boils down
> to, if there are 10 X's in every 10G Y's, then there are 2 X's in
> every 2G Y's. Yes ?

On *average* !

The odds of an error within a given period of reading is *not* a linear
function of the average.  With your simplistic math, the odds of an
error while reading 25TB would be 200% !  Ummm, no.  Probability goes
from 0 to 100%.

It would be nice if statistics were simple, as they are very useful in
understanding the world around us.  Unfortunately, statistics aren't simple.

Please see my other post on the Poisson distribution.
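
As a rough numeric illustration of that point, here is a minimal sketch
of the Poisson estimate under the nominal 1-in-1E14 spec discussed in
this thread (illustrative code, not taken from the post Phil refers to):

    import math

    BER = 1e-14                        # nominal "< 1 error in 1E14 bits"

    def p_at_least_one_ure(bytes_read, ber=BER):
        lam = bytes_read * 8 * ber     # expected number of UREs (Poisson mean)
        return 1 - math.exp(-lam)

    for tb in (2, 12.5, 25):
        p = p_at_least_one_ure(tb * 1e12)
        print("%5.1f TB read -> expected UREs %.2f, P(>=1) %4.1f%%"
              % (tb, tb * 1e12 * 8 * BER, 100 * p))

    # 2 TB    -> 0.16 expected, ~14.8% chance of at least one URE
    # 12.5 TB -> 1.00 expected, ~63.2%
    # 25 TB   -> 2.00 expected, ~86.5% (not "200%")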

Phil
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-20  0:16                         ` Chris Murphy
  2013-01-20  0:49                           ` Maarten
@ 2013-01-20  6:26                           ` Mikael Abrahamsson
  2013-01-20  9:39                             ` Chris Murphy
  1 sibling, 1 reply; 46+ messages in thread
From: Mikael Abrahamsson @ 2013-01-20  6:26 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Maarten, linux-raid

On Sat, 19 Jan 2013, Chris Murphy wrote:

> Please explain this basic, simple math, where a URE is equivalent to 1 
> bit of information. And also, explain the simple math where bit of error 
> is equal to a URE. And please explain the simple math in the context of 
> a conventional HDD 512 byte sector, which is 4096 bits.
>
> If you have a URE, you have lost not 1 bit. You have lost 4096 bits. A 
> loss of 4096 bits in 12.5TB (not 12.5TiB) is an error rate of 1 bit of 
> error in 2.44x10^10 bits. That is a gross difference from published error 
> rates.

I have seen your point of view posted in other discussions, and I don't
buy it. I believe the manufacturers are talking about how many bits are
read before there is one or more bit errors (i.e. the drive can't
error-correct the bit errors on that sector, so the whole sector becomes
a URE). Since the sector is an atomic unit, the drive can't report a
single bit error (even though that's probably what it is); it'll URE the
whole 4k-byte sector. The manufacturer is still talking about what's on
the platter, not what the OS sees.

Your view on how this works would mean that drives would read more than
10^3 times more data before a URE, which from my empirical data isn't
right. Also, by your logic a 4k-sector drive would have an 8 times better
BER, which I don't believe either.
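
For concreteness, a small sketch of the two readings of the spec being
argued over here, assuming a conventional 512-byte sector (my own
illustration, not either poster's wording):

    # Two readings of "< 1 error in 1E14 bits" for a 512-byte-sector drive.
    SPEC_BITS = 1e14
    SECTOR_BITS = 512 * 8                      # 4096 bits per sector

    # Reading A: one unreadable *sector* per 1E14 bits read.
    tb_per_ure_a = SPEC_BITS / 8 / 1e12        # ~12.5 TB between UREs

    # Reading B: one *bit* of lost data per 1E14 bits read, i.e. one
    # 4096-bit sector lost per 4096 * 1E14 bits read.
    tb_per_ure_b = SECTOR_BITS * SPEC_BITS / 8 / 1e12   # ~51200 TB

    print("Reading A: one URE per ~%.1f TB read" % tb_per_ure_a)
    print("Reading B: one URE per ~%.0f TB read" % tb_per_ure_b)
    # The two readings differ by a factor of 4096 (~10^3.6), the
    # "more than 10^3" gap referred to above.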

I believe Phil was spot on when it comes to how it works. His post "19 Jan 
2013 18:53:41" is exactly how I believe things work.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-19  7:43                   ` Mikael Abrahamsson
  2013-01-19 22:48                     ` Stan Hoeppner
@ 2013-01-20  9:04                     ` Wolfgang Denk
  2013-01-20 19:28                     ` Peter Grandi
  2 siblings, 0 replies; 46+ messages in thread
From: Wolfgang Denk @ 2013-01-20  9:04 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: Stan Hoeppner, linux-raid

Dear Mikael Abrahamsson,

In message <alpine.DEB.2.00.1301190838440.12098@uplift.swm.pp.se> you wrote:
> 
> > If my math is correct, with a URE rate of 10E14, that's one URE for 
> > every ~12.5TB read.  So theoretically one would have to read the entire 
> > 2TB drive more than 6 times before hitting the first URE.  So it seems 
> > unlikely that one would hit a URE during a mirror rebuild with such a 
> > 2TB drive.
> 
> Unlikely yes, but it also means one in 6 rebuilds (statistically) will 
> fail with URE. I'm not willing to take that chance, thus I use RAID6.

Me too, and actually it will probably be more than one out of six
failing.  The URE rate as published in the drive's documentation is
only true under specific conditions.  These conditions may not be met
during extended periods of more or less continuous operation of the
drive, like during backups or RAID array rebuilds.

For example, years ago we had repeated cases of double errors taking
down RAID 5 arrays with Maxtor MaXLine Plus II 7Y250M0; the pattern
was always the same: a disk error during a backup run, followed by
another disk error during rebuild.  This specific drive type gets
extremely hot under continuous operation, which greatly shifts the URE
rate for the worse.

So even if the failure rate appears to be acceptable in theory, it may
bite you hard when you lose your data in reality.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
Alliance: In international politics, the union  of  two  thieves  who
have  their hands so deeply inserted in each other's pocket that they
cannot separately plunder a third.                   - Ambrose Bierce

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-20  6:26                           ` Mikael Abrahamsson
@ 2013-01-20  9:39                             ` Chris Murphy
  2013-01-20 16:55                               ` Mikael Abrahamsson
  0 siblings, 1 reply; 46+ messages in thread
From: Chris Murphy @ 2013-01-20  9:39 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: Maarten, linux-raid


On Jan 19, 2013, at 11:26 PM, Mikael Abrahamsson <swmike@swm.pp.se> wrote:

> On Sat, 19 Jan 2013, Chris Murphy wrote:
> 
>> Please explain this basic, simple math, where a URE is equivalent to 1 bit of information. And also, explain the simple math where bit of error is equal to a URE. And please explain the simple math in the context of a conventional HDD 512 byte sector, which is 4096 bits.
>> 
>> If you have a URE, you have lost not 1 bit. You have lost 4096 bits. A loss of 4096 bits in 12.5TB (not 12.5TiB) is an error rate of 1 bit of error in 2.44x10^10 bits. That is a gross difference from published error rates.
> 
> I have seen your point of view posted in other discussions, and I don't buy it. I believe the manufacturers are talking about how many bits read before there is one or more bit error (ie can't error correct the bit errors on that sector, so now the whole sector is URE. Since the sector is an atomic unit, the drive can't report a single bit error (even though that's probably what it is), it'll URE the whole 4k bytes. The manufacturer is still talking about what's on the platter, not what the OS sees.

You haven't said a single thing that contradicts what I've said. I'm not talking at all about the OS. I am in fact referring to bits read before there is a bit of error. You simply aren't going to get a URE every 12.5TB, with a disk purporting to have "less than 1 bit error in 1E14 bits" because such a rate of error is *NOT* 1 bit in 1E14 bits. It's like arguing 2+2=5 and then blabbing on for 10 minutes asserting your belief it's true.

> 
> Your view on how this works would mean that drives would read more than 10^3 more data before an URE, which from my empirical data isn't right. Also, for a 4k sector drive, with your logic, would have 8 times better BER ratio which I don't believe either.

In fact the whole point of the 4K sector size is to improve ECC and reduce the error rate. This is a stated, yet de-emphasized, goal of AF disks. Yet it's still consistent with the "less than 1 bit in X bits" language, which is a minimum expected performance criterion. We don't know what the top end is.
> 
> I believe Phil was spot on when it comes to how it works. His post "19 Jan 2013 18:53:41" is exactly how I believe things work.


This explanation is identical to religious belief explanations. You believe it because you believe it. It's circular. There is no new information here, at all. Mere disagreement with my position is useless.

Chris Murphy

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-20  0:49                           ` Maarten
  2013-01-20  1:37                             ` Phil Turmel
@ 2013-01-20  9:44                             ` Chris Murphy
  1 sibling, 0 replies; 46+ messages in thread
From: Chris Murphy @ 2013-01-20  9:44 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org Raid


On Jan 19, 2013, at 5:49 PM, Maarten <maarten@ultratux.net> wrote:

> On 01/20/13 01:16, Chris Murphy wrote:
>> 
>> And then explain how the manufacturer spec does not actually report the URE in anything approaching "on average" terms, but *less than* 1 bit in 10^14. If you propose the manufacturers are incorrectly reporting the error rate, realize you're basically accusing them of a rather massive fraud, because less than 1 bit of error in X is a significantly different thing than "on average" 1 bit of error in X. This could be up to, but not including, a full order of magnitude higher error rate than the published spec. It's not an insignificant difference.
> 
> All very nice, but that is not the point, is it.

It is the point. You stated simple math and yet the math appears to be wrong and you haven't explained otherwise.

> The point is, to
> calculate (or rather: estimate) the odds of an URE encounter when
> reading 2TB, based on the figure one has for reading 12,5 TB. Whether
> that 12,5 figure is correct or not, whether endorsed by manufacturers or
> not, is totally irrelevant.  It simply boils down to, if there are 10
> X's in every 10G Y's, then there are 2 X's in every 2G Y's. Yes ?

So when in doubt just change the entire argument? That's absurd.

What is a bit?
What is a URE?
How much data, in bits, is lost when a URE occurs?
How often does a URE occur?


Chris Murphy

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-20  9:39                             ` Chris Murphy
@ 2013-01-20 16:55                               ` Mikael Abrahamsson
  2013-01-20 17:15                                 ` Chris Murphy
  0 siblings, 1 reply; 46+ messages in thread
From: Mikael Abrahamsson @ 2013-01-20 16:55 UTC (permalink / raw)
  To: linux-raid

On Sun, 20 Jan 2013, Chris Murphy wrote:

> bits" because such a rate of error is *NOT* 1 bit in 1E14 bits. It's 
> like arguing 2+2=5 and then blabbing on for 10 minutes asserting your 
> belief it's true.

Thanks, insult taken.

If you're not even trying to be civil, there is no point in continuing 
this discussion.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-20 16:55                               ` Mikael Abrahamsson
@ 2013-01-20 17:15                                 ` Chris Murphy
  2013-01-20 17:17                                   ` Mikael Abrahamsson
  0 siblings, 1 reply; 46+ messages in thread
From: Chris Murphy @ 2013-01-20 17:15 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org Raid


On Jan 20, 2013, at 9:55 AM, Mikael Abrahamsson <swmike@swm.pp.se> wrote:

> On Sun, 20 Jan 2013, Chris Murphy wrote:
> 
>> bits" because such a rate of error is *NOT* 1 bit in 1E14 bits. It's like arguing 2+2=5 and then blabbing on for 10 minutes asserting your belief it's true.
> 
> Thanks, insult taken.

If you literally can't count, then insult intended. Otherwise, grow up, and explain your assertions, and how I'm wrong rather than just repeating yourself.

> 
> If you're not even trying to be civil, there is no point in continuing this discussion.


Classic ad hominem attack. You are making this discussion about me, about my civility, rather than about the argument. Can you demonstrate the flaw in what I've said?

Chris Murphy

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-20 17:15                                 ` Chris Murphy
@ 2013-01-20 17:17                                   ` Mikael Abrahamsson
  2013-01-20 17:20                                     ` Chris Murphy
  0 siblings, 1 reply; 46+ messages in thread
From: Mikael Abrahamsson @ 2013-01-20 17:17 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-raid@vger.kernel.org Raid

On Sun, 20 Jan 2013, Chris Murphy wrote:

> Classic ad hominem attack. You are making this discussion about me, 
> about my civility, rather than about the argument. Can you demonstrate 
> the flaw in what I've said?

I tried, and you called it blabbing without saying what in it was wrong.

So yes, I'm done talking to you.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-20 17:17                                   ` Mikael Abrahamsson
@ 2013-01-20 17:20                                     ` Chris Murphy
  0 siblings, 0 replies; 46+ messages in thread
From: Chris Murphy @ 2013-01-20 17:20 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: linux-raid@vger.kernel.org Raid


On Jan 20, 2013, at 10:17 AM, Mikael Abrahamsson <swmike@swm.pp.se> wrote:

> On Sun, 20 Jan 2013, Chris Murphy wrote:
> 
>> Classic ad hominem attack. You are making this discussion about me, about my civility, rather than about the argument. Can you demonstrate the flaw in what I've said?
> 
> I tried, and you called it blabbing without saying what in it was wrong.

I wrote in response:

"You haven't said a single thing that contradicts what I've said."

That's what was wrong. You said nothing relevant. You're just saying you don't buy it. Well, who the f cares what you don't or won't buy? Make your case. Saying what you believe is utterly ridiculously useless. Let's see the logical or mathematical flaw in what I've presented.



Chris Murphy


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-19  7:43                   ` Mikael Abrahamsson
  2013-01-19 22:48                     ` Stan Hoeppner
  2013-01-20  9:04                     ` Wolfgang Denk
@ 2013-01-20 19:28                     ` Peter Grandi
  2013-01-20 21:09                       ` Mikael Abrahamsson
  2013-01-21 14:40                       ` Peter Rabbitson
  2 siblings, 2 replies; 46+ messages in thread
From: Peter Grandi @ 2013-01-20 19:28 UTC (permalink / raw)
  To: Linux RAID

[ ... the original question on 2+2 RAID delivering 2x linear
transfers of 1x linear transfers ... ]

The original question was based on the (euphemism) very peculiar
belief that skipping over P/Q blocks has negligible cost. An
interesting detail is that this might be actually the case with
SSD devices, and perhaps even with flash SSD ones.

[ ... on whether 2+2 RAID6 or 2x(1+1) RAID10 is more likely to
fail and errors during rebuilds ... ]

>> If my math is correct, with a URE rate of 10E14, that's one
>> URE for every ~12.5TB read.  So theoretically one would have
>> to read the entire 2TB drive more than 6 times before hitting
>> the first URE.  So it seems unlikely that one would hit a URE
>> during a mirror rebuild with such a 2TB drive.

> Unlikely yes, but it also means one in 6 rebuilds
> (statistically) will fail with URE. I'm not willing to take
> that chance, thus I use RAID6.  Usually, with scrubbing etc
> I'd imagine that the probability is better than 1 in 6, but
> it's still a substantial risk.

Most of this discussion seems to me based on (euphemism) amusing
misconceptions of failure statistics and failure modes. The UREs
manufacturers quote are baselines "all other things being equal",
and in a steady state, etc. etc.; translating these to actual
failure probabilities and intervals by simple arithmetic is
(euphemism) futile.

In practice what matters is measured failure rates per unit of
time (generally reported as 2-4% per year) and taking into account
common modes of failure and environmental factors such as:

  * Whether all the members of a RAID set are of the same brand
    and model with (nearly) consecutive serial numbers.

  * Whether the same members are all in the same enclosure
    subject to the same electrical, vibration and thermal
    conditions.

  * Whether the very act of rebuilding is likely to increase
    electrical, vibration or thermal stress on the members.

  * What is the age and the age-related robustness to stress of
    the members.

It so happens that the vast majority of RAID sets are built by
people like the (euphemism) contributors to this thread and are
(euphemism) designed to maximize common modes of failure.

It is very convenient to build RAID sets that are all made from
drives of the same brand, model, and with consecutive serial
numbers all drawn from the same shipping carton, all screwed into
the same enclosure with the same power supply, cooling system, and
vibrating in resonance with the same chassis and each other, and
to choose RAID modes like RAID6 which extend the stress of
rebuilding to all members of the set, and on sets with members
mostly of the same age.

But that is the way bankers work, creating phenomenally correlated
risks, because it works very well when things go well, even if it
tends to fail catastrophically, rather than gracefully, when
something fails. But then ideally it has become someone else's
problem :-), otherwise "who could have known" is the eternal
refrain.

As StorageMojo.com pointed out, none of the large scale web
storage infrastructures is based on within-machine RAID; they are
all based on something like distributed chunk mirroring (as a
rule, 3-way) across very different infrastructures. Interesting...

  I once read with great (euphemism) amusement a proposal to
  replace intersite mirroring with intersite erasure codes, which
  seemed based on (euphemism) optimism about latencies.

Getting back to RAID, I feel (euphemism) dismayed when I read
(euphemism) superficialities like:

  "raid6 can lose any random 2 drives, while raid10 can't."

because they are based on the (euphemism) disregard of the very
many differences between the two, and that what matters is the
level of reliability and performance achievable with the same
budget. Because ultimately it is reliability/performance per
budget that matters, not (euphemism) uninformed issues of mere
geometry.

Anyhow if one wants that arbitrary "lose any random 2 drives" goal
regardless of performance or budget, on purely geometric grounds,
it is very easy to set up a 2x(1+1+1) RAID10.

And as to the issue of performance/reliability vs. budget that
seems to be so (euphemism) unimportant in most of this thread,
there are some nontrivial issues with comparing a 2+2 RAID6 with a
2x(1+1) RAID10, because of their very different properties under
differently shaped workloads, but some considerations are:

* A 2+2 RAID6 delivers down to half the read "speed" of a 2x(1+1)
  RAID10 when complete (depending on whether single or multi
  threaded), and equivalent or less for many cases of writing
  especially if unaligned.

* On small-transaction workloads RAID6 requires that each
  transaction complete only when *all* the relevant data blocks
  (for reads) or all the blocks of the stripe (for writes) have
  been transferred, and that usually involves 1/2 of the rotational
  latency of the drives as dead time, because the drives are not
  synchronized, and this involves difficult chunk size tradeoffs.
  RAID10 only requires reads or writes on one member of each
  mirror set to complete the operation, and the RAID0 chunk size
  matters, but less.

* When incomplete, RAID6 can have even worse aggregate transfer
  rates during reading, because of the need for whole stripe
  reads whenever the missing drive supplies a non-P/Q block
  in the stripe, which for a 2+2 RAID6 is 50% of stripes; this
  also means that on an incomplete RAID6 stress (electrical,
  vibration and temperature) becomes worse in a highly
  correlated way exactly at the worst moment, when one drive
  is already missing.

* When rebuilding, RAID6 impacts the speed of *all* drives in
  the RAID set, and also causes greatly increased stress on all
  the drives, making them hotter, vibrate more, and draw more
  current, and all at the same time and in exactly the same way,
  and just after one of them has failed, and they often are
  all the same brand, model and taken out of the same carton.

So for example let's try to compare like for like, as much as is
plausible, and say we want a RAID set with a capacity of 4TB; we would need a
RAID6 set of at least 3+2 or really 4+2 2TB drives, each drive to
be kept half-empty, to get equivalent read speeds in many
workloads to a 2x(1+1) RAID10.

Then if the RAID10 were allowed to have 6x 2TB drives we could
have a set of 2x(1+1+1) drives which would still be faster *and*
rather more resilient than the 4+2 RAID6.

Note: The RAID6 could be 4+2 1TB drives and still deliver 4TB of
  capacity, at a lower, but not proportionally lower cost, but
  it would still suck on unaligned writes, suffer a big impact
  when incomplete (66% of stripes need a full stripe read) or
  rebuilding, and still be likely less reliable than a 2x(1+1+1)
  of 1TB drives.
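
For what it's worth, a minimal sketch of where figures like the 50% and
66% above come from, assuming one failed drive and a rotating layout in
which the failed drive holds a data chunk (rather than P/Q) in N out of
every N+M stripes:

    # Fraction of stripes that need a full-stripe reconstruction read
    # when one drive of an N+M parity RAID set is missing.
    def degraded_full_stripe_fraction(n_data, m_parity):
        return n_data / float(n_data + m_parity)

    for n, m in ((2, 2), (4, 2), (4, 1)):
        print("%d+%d: %.0f%% of stripes need a full-stripe read"
              % (n, m, 100 * degraded_full_stripe_fraction(n, m)))
    # 2+2: 50%, 4+2: ~67%, 4+1: 80%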

Again, comparisons between RAID levels, and especially between parity
RAID and non-parity RAID, are very difficult because their
performance (speed, reliability, value) envelopes are rather
differently shaped, but the issue of:

  "raid6 can lose any random 2 drives, while raid10 can't."

and associated rebuild error probability cannot be discussed in a
(euphemism) simplistic way.

NB: while in general I think that most (euphemism) less informed
people should use only RAID10, there are a few narrow cases where
the rather skewed performance envelopes of RAID5 and even of RAID6
match workload and budget requirements. But it takes apparently
unusual insight to recognize these cases, so just use RAID10 even
if you suspect it is one of those narrow cases.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-20 19:28                     ` Peter Grandi
@ 2013-01-20 21:09                       ` Mikael Abrahamsson
  2013-01-20 21:50                         ` Peter Grandi
  2013-01-21 14:40                       ` Peter Rabbitson
  1 sibling, 1 reply; 46+ messages in thread
From: Mikael Abrahamsson @ 2013-01-20 21:09 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

On Sun, 20 Jan 2013, Peter Grandi wrote:

> NB: while in general I think that most (euphemism) less informed
> people should use only RAID10, there are a few narrow cases where
> the rather skewed performance envelopes of RAID5 and even of RAID6
> match workload and budget requirements. But it takes apparently
> unusual insight to recognize these cases, so just use RAID10 even
> if you suspect it is one of those narrow cases.

In your whole post you never touched on URE rates (well you did, but you
didn't seem to think this was a problem).

I'm using RAID6 because I don't really care about performance, but I do 
want to be able to fail one drive and have scattered URE handled while 
rebuilding. I have had scattered URE hit me numerous times over the past 
10 years. With RAID6 they are handled nicely even with a failed drive.

If I cared about performance, I would either do what was discussed earlier
in the thread (use smaller enterprise drives with better BER) in RAID10,
or I would use three-way mirror RAID1 and use LVM to group several RAID1s
into one VG.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-20 21:09                       ` Mikael Abrahamsson
@ 2013-01-20 21:50                         ` Peter Grandi
  2013-01-21  5:24                           ` Mikael Abrahamsson
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Grandi @ 2013-01-20 21:50 UTC (permalink / raw)
  To: Linux RAID

[ ... ]

>> [ ... ] there are a few narrow cases where the rather skewed
>> performance envelopes of RAID5 and even of RAID6 match
>> workload and budget requirements. But it takes apparently
>> unusual insight to recognize these cases, so just use RAID10
>> even if you suspect it is one of those narrow cases.

> In your whole post you never touched on URE rates (well you
> did, but you didn't seem to think this was a problem).

They are a big problem, especially because in a typical RAID set
they are not uncorrelated, either with each other or the
environment, and I wrote a lot about that.

The *absolute* level of URE rates matters less, but some people
on this thread have noticed that they are not that low, compared
with whole-disk reading, even the perhaps somewhat optimistic
ones quoted by manufacturers.

> I'm using RAID6 because I don't really care about performance,

That's a pretty unusual case, but perhaps that falls under the
"know better" qualification above, except that:

> but I do want to be able to fail one drive and have scattered
> URE handled while rebuilding.

RAID6 is not appropriate for this either.

Perhaps I have not been clear in my earlier comment, but the URE
rate is not a constant you can just (euphemism) uncritically
read from a spec sheet.

Let's try shouting:

  * THE URE RATE DEPENDS ON ENVIRONMENTAL FACTORS AND COMMON
    MODES OF FAILURE (INCLUDING THE AGE OF THE DRIVE).

  * In a typical incomplete or rebuilding RAID6 the aggregate
    URE rate is much higher than the single drive URE rate or
    even the RAID10 URE rate.

I also sometimes suspect that manufacturers quote ideal numbers;
for example I just had a look at some user manuals for a few
"enterprise" and "desktop" Seagate (they do very detailed
manuals) and their "annualized return rate" is usually around
0.4-0.7%, and many large sites report annual failure rates of
around 2-4%.

> I have had scattered URE hit me numerous times over the past
> 10 years.

That is indeed a big problem, and it is rare and good that you
don't underestimate it.

> With RAID6 they are handled nicely even with a failed drive.

Unfortunately RAID6 correlates failure modes across drives
because of the !"£$%^ parity, and increases them by stressing
all of them hard while the RAID6 set is incomplete or syncing.

Incomplete or syncing RAID6 not only has pretty bad speed, which
you don't care about, but drives up error rates, and you should
care about that.

Sure, most of the time one can replace the failed drive and sync
before something bad happens, but an incomplete or syncing RAID6
has at some point in the life of the RAID set a much higher
chance of 2 or more failures...

BTW important qualification as to all this: when people mention
BERs they really imply that this discussion is about *sector*
UREs, and *single*-sector ones in particular, which matters a
fair bit.

Because a typical RAID6 under the stress of being incomplete or
syncing can have something far worse than single-sector UREs, it
can trigger much more brutal mechanical or electronic failures.

While single-sector UREs are in most cases not such a big deal,
as usually losing a single sector's content allows for nearly
complete recovery; some/most drives IIRC can return the sector
content that has been reconstructed, which often is wrong in
only a few bits.

> If I cared about performance, I would either do what was
> discussed earlier in the thread (use smaller enterprise drives
> with better BER) in RAID10,

Whether "enterprise" drives effectively have a better URE is an
experimental question that is difficult to settle.

> or I would use threeway mirror RAID1 and use lvm to vg several
> RAID1:s together.

This particular case is not caring much about performance either
because linear (concat) is not as quick for most workloads as a
RAID0.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-20 21:50                         ` Peter Grandi
@ 2013-01-21  5:24                           ` Mikael Abrahamsson
  0 siblings, 0 replies; 46+ messages in thread
From: Mikael Abrahamsson @ 2013-01-21  5:24 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

On Sun, 20 Jan 2013, Peter Grandi wrote:

> complete recovery; some/most drives IIRC can return the sector
> content that has been reconstructed, which often is wrong in
> only a few bits.

This is not my experience. When I get UREs on a 4k drive, I get read error 
on 8 consecutive 512 byte blocks. This is the first time I've ever heard 
about someone claiming drives will give back information that is just a 
little bit wrong. Perhaps there should be such a command to tell the drive 
to "give me what you've got", but most of the time, this is undesireable.

Going back to the BER discussion.

I'm a network engineer. We count BER as the "flipped bits rate", which is
detected using CRC (Ethernet does this). "Under" this one can do G.709 FEC
(forward error correction), which can detect and correct flipped bits and
still return a correct result to the CRC checksummer, because there is
enough extra information (~10 percent overhead) to reconstruct the
original data, thus passing the CRC check.

Typical rated Ethernet BER is 10^-12. This means that if, in every 10^12
bits sent on the wire, a bit is flipped and a whole packet is thus lost,
it's still within specification. Normally, when one does things right,
the BER is way better than 10^-12; the norm is to have a 10GE link
running for months without a single user-detectable bit error.
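
As a small aside on what such a ceiling means in practice, a bit of
illustrative arithmetic (my own, assuming a fully loaded link):

    # At the rated ceiling (BER 1e-12), a fully loaded 10GE link would see
    # on the order of one bit error every ~100 seconds; links that run for
    # months without an error are operating far below the rated ceiling.
    line_rate_bps = 10e9
    ber_ceiling = 1e-12
    errors_per_second = line_rate_bps * ber_ceiling
    print("~%.2f errors/s at full load, i.e. one roughly every %.0f s"
          % (errors_per_second, 1 / errors_per_second))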

In the articles I have read about how hard drives work, they all state
that HDD manufacturers do very similar things. They store bits on the media
with extra information so the drive can do FEC, they have a checksum, if 
the checksum doesn't match then the block is re-read, and if after a while 
no correct checksum block can be served, an URE is reported and the OS 
reports the read as failed. The advantage with 4k block drives is that FEC 
is more effective on larger blocks because errors usually turn up in 
bursts (one gets several flipped bits in a row), so having larger blocks 
means more flipped bits in a row can get corrected. ADSL2+ works in a 
similar way when one turns on 16ms interleaving: it smears out the bits
over a longer time, so a 0.1ms disturbance (complete, no bits are correct) 
can be corrected using FEC.

Also, I do agree with you that RAID6 puts mechanical stress on the drives 
but my main failure scenario (own experience) is still single drive 
failure and then scattered UREs when reading from the other drives, which 
can be corrected by RAID6 parity during the resync. RAID6 is economical 
when using it with 10-12 drives, and fits my storage needs (as long as I 
get ~30 megabyte/s or better large file read/write performance from the 
array, I'm fine). Other workloads, as you say, might have other 
requirements.

A lot of people I see coming in on the IRC channel use RAID5 for their 
storage needs and they come in with UREs when reconstructing a failed 
drive. I'd say taking the array offline and using dd_rescue is something 
we have to recommend several times a month. This is why I keep 
recommending RAID6 over RAID5.

More reading:

http://www.high-rely.com/hr_66/blog/why-raid-5-stops-working-in-2009-not/

www.zdnet.com/blog/storage/why-raid-5-stops-working-in-2009/162

I still don't understand how these two articles seem to come to wildly
different conclusions while the first one still claims the second one is
correct :P Well, the zdnet one matches my own experience and that of
other people I talk to.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-20 19:28                     ` Peter Grandi
  2013-01-20 21:09                       ` Mikael Abrahamsson
@ 2013-01-21 14:40                       ` Peter Rabbitson
  2013-01-21 20:32                         ` Peter Grandi
  2013-01-21 22:00                         ` Peter Grandi
  1 sibling, 2 replies; 46+ messages in thread
From: Peter Rabbitson @ 2013-01-21 14:40 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

Thank you for the thorough reply. While I agree with *most* of what
you say I have a comment and a followup question below.

On Sun, Jan 20, 2013 at 07:28:13PM +0000, Peter Grandi wrote:
> [ ... the original question on 2+2 RAID delivering 2x linear
> transfers of 1x linear transfers ... ]
> 
> The original question was based on the (euphemism) very peculiar
> belief that skipping over P/Q blocks has negligible cost.

I was indeed very surprised to find out that the skipping is *not* free. 
I am planning to do some research on whether it is possible to use 
specific chunksizes so that when laid out on top of the physical media 
the "skip penalty" is minimized. This will probably take me a while, but 
I will come back to this thread with the results eventually.
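
For reference, a back-of-the-envelope sketch of the bounds involved (my
own framing, assuming an N+M rotating-parity layout and that in the worst
case skipping a parity chunk costs roughly as much as reading it; the
100 MB/s per-drive rate is a hypothetical placeholder):

    # Linear read bounds for an N+M parity RAID with per-drive rate R:
    #   - if skipping parity chunks were free:        ~ (N+M) * R
    #   - if a skip costs as much as reading a chunk: ~  N    * R
    def linear_read_bounds(n_data, m_parity, drive_rate_mb_s):
        worst = n_data * drive_rate_mb_s               # full-cost skips
        best = (n_data + m_parity) * drive_rate_mb_s   # free skips
        return worst, best

    worst, best = linear_read_bounds(2, 2, 100)        # hypothetical drives
    print("2+2 linear read: between ~%d and ~%d MB/s" % (worst, best))
    # Where a real array lands depends on chunk size, read-ahead and how
    # the drives handle the skipped ranges.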

> Getting back to RAID, I feel (euphemism) dismayed when I read
> (euphemism) superficialities like:
> 
>   "raid6 can lose any random 2 drives, while raid10 can't."
> 
> because they are based on the (euphemism) disregard of the very
> many differences between the two, and that what matters is the
> level of reliability and performance achievable with the same
> budget. Because ultimately it is reliability/performance per
> budget that matters, not (euphemism) uninformed issues of mere
> geometry.

I am not sure what you are saying... I see raid as a way for me to keep 
a higher layer "online", while some of the physical drives fall on the 
floor. In the case of 4 drives (very typical for mom&pop+consultant 
shops with near-sufficient expertise but far-insufficient funds) a raid6 
is the more obvious choice as it provides the array size of 2xdrives, 
with reasonable redundancy (*ANY* 2 drives), and reasonable-ish 
read/write rates in normal operation. With the prospect of minimizing the 
skip-penalty the read rate (which is what matters, again, in most cases) 
will go even higher.

By the way "normal operation" is what I am basing my observation on, 
because a degraded raid does not run for years without being taken care 
of. If it does - someone is doing it wrong. Besides, with raid6 the
degradation of operational speeds will be a contributing factor to
repairing the array *sooner*.

Compare to raid10, which has better read characteristics, but in order 
to reach the "any 2 drives" bounty, one needs to assemble a -l 10 -n 4 
-p f3, which isn't... very optimal (mom&pop just went from 2xsize to 
1.3xsize).

From your comment above I gather you disagree with this. Can you 
elaborate more on the economics of mom&pop installations, and how my 
assessment is (euphemism) wrong? :)

Cheers


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-21 14:40                       ` Peter Rabbitson
@ 2013-01-21 20:32                         ` Peter Grandi
  2013-01-21 20:55                           ` Peter Grandi
  2013-01-21 22:00                         ` Peter Grandi
  1 sibling, 1 reply; 46+ messages in thread
From: Peter Grandi @ 2013-01-21 20:32 UTC (permalink / raw)
  To: Linux RAID

[ ... RAID6 reading suffering from "skipping" over parity blocks
... ]

> I was indeed very surprised to find out that the skipping is
> *not* free.

Uhmmm :-).

> I am planning to do some research on whether it is possible to
> use specific chunksizes so that when laid out on top of the
> physical media the "skip penalty" is minimized. [ ... ]

Oh no, that does not address the issue. The issue is that
whatever happens in an N+M RAID[56...], a fraction M/(N+M) of the
blocks is not data and is irrelevant for reading, and that with a
simple mapping the equivalent of M of the N+M drives won't be
"active" at any one point.  Since usually (and unfortunately) N>>M
(that is, very wide RAID[56...] sets) not many people worry about that.

The only "solution" that seems practical to me is a "far" layout
as in the MD RAID10, generalizing it, that is laying data and
"parity" blocks not across the drives, but along them; for the
trivial and not so clever case, to put the "parity" blocks on
the same drive(s) as the stripe they relate to. But obviously
that does not have redundancy, so like in the MD RAID10 "far"
layout, the idea is to stagger/diagonalize them onto the next
drive.

For example, in a 2+2 layout like yours, each drive is divided
in two regions, the top one for data, the bottom one for parity,
and the first two stripes are laid out like this:

   A       B       C       D
----------------------------
[0:0     0:1]   [1:0     1:1]
[2:0     2:1]   [3:0     3:1]
....    ....    ....    ....
....    ....    ....    ....
----------------------------
....    ....    ....    ....
....    ....    ....    ....
[3:P     3:Q]   [2:P     2:Q]
[1:P     1:Q]   [0:P     0:Q]
----------------------------

and the 4+1 case would be:

   A       B       C       D       E
------------------------------------
[0:0     0:1     0:2     0:3]   [1:0
 1:1     1:2     1:3]   [2:0     2:1
 2:2     2:3]   [3:0     3:1     3:2
 3:3]   [4:0     4:1     4:2     4:3]
....    ....    ....    ....    ....
....    ....    ....    ....    ....
....    ....    ....    ....    ....
....    ....    ....    ....    ....
------------------------------------
....    ....    ....    ....    ....
[4:P]   [3:P]   [2:P]  [1:P]    [0:P]
------------------------------------

and for "fun" this is the 3+3 case:

  A       B       C       D       E       F
--------------------------------------------
[0:0     0:1     0:2]   [1:0     1:1     1:2]
[2:0     2:1     2:2]   [3:0     3:1     3:2]
....    ....    ....    ....    ....    ....
....    ....    ....    ....    ....    ....
--------------------------------------------
....    ....    ....    ....    ....    ....
....    ....    ....    ....    ....    ....
[3:P     3:Q     3:R]  [2:P     2:Q     2:R]
[1:P     1:Q     1:R]  [0:P     0:Q     0:R]
--------------------------------------------

More generally, given N data blocks and M "parity" blocks per
stripe, each drive is divided into two areas, a data area covering
N/(N+M) of the disk capacity and a "parity" area covering M/(N+M)
of the disk capacity, and:

  - The N blocks long data parts of each stripe are written
    consecutively in the data areas across the N+M RAID set
    members.
  - The M blocks long "parity" parts of each stripe are written
    consecutively *backwards* from the *end* of the "parity"
    areas across the N+M RAID set members.

This ensures that the N-block data part and the M-block "parity" part
of each stripe are written on different disks, staggered neatly.
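
A small sketch of that placement rule as described above (it just
reproduces the diagrams for given N, M and stripe count; an illustration
of the proposed layout, not an existing MD personality):

    # Generalized "far" layout: data blocks fill the data area forwards
    # from the top; each stripe's parity group is placed backwards from
    # the end of the parity area, group by group, with the blocks inside
    # a group kept in forward order.  Handles up to 3 parity blocks (P/Q/R).
    def far_layout(n_data, m_parity, stripes):
        width = n_data + m_parity
        drows = (stripes * n_data + width - 1) // width
        prows = (stripes * m_parity + width - 1) // width
        data = [["...."] * width for _ in range(drows)]
        parity = [["...."] * width for _ in range(prows)]
        for s in range(stripes):
            for b in range(n_data):                       # forwards
                i = s * n_data + b
                data[i // width][i % width] = "%d:%d" % (s, b)
            for q in range(m_parity):                     # backwards
                slot = s * m_parity + (m_parity - 1 - q)  # 0 = very last block
                row = prows - 1 - slot // width
                col = width - 1 - slot % width
                parity[row][col] = "%d:%s" % (s, "PQR"[q])
        for row in data + [["----"] * width] + parity:
            print("  ".join("%4s" % c for c in row))

    far_layout(2, 2, 4)    # reproduces the 2+2 diagram above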

The ordinary layout is one in which one divides the stripes in
one section only. One can generalize with other stripe and block
within stripe distribution functions too, including those that
might subdivide a stripe (and thus each disk in the RAID set) in
more than 2 parts, but I don't see much point in that, except
for RAID1 1+N where for example each of the N can be considered
an indepedent "parity".

  Note: RAID10 "far" should also write the mirror from the end,
  and arguably RAID1 should mirror backwards too (woudl give
  nicely uniform average IO rates depsite the speed difference
  between inner and outer tracks. Of course one could write the
  "parity" section of each stripe backwards in each row starting
  with the first row, but I like better the fully inverted
  layout...  But it may not be practical with disk scheduling
  algorithms that probably prefer forwards seeking.

But just as "far" RAID10 pays for better single threaded reading
with slower writing, "far" RAID[56...] (or any other, as the
"far" layout is easy to generalize) pays for greater reading
speed in the optimal case with even more terrible writing or
incomplete or resync reading, because reading two consecutive
stripes requires seeking across some of the disks.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-21 20:32                         ` Peter Grandi
@ 2013-01-21 20:55                           ` Peter Grandi
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Grandi @ 2013-01-21 20:55 UTC (permalink / raw)
  To: Linux RAID


> [ ... ] More generally, given a N data blocks and M "parity"
> blocks per stripe, [ ... ]

As an aside, N or M (not both...) can be 0. When M is zero
that's RAID0, and when N is zero all or (more usefully) a subset
of M "parity" blocks are needed to reconstruct any one data
block; for example M is 5, there are 3 data blocks encoded
across them, and any data block can be reconstructed given any 4
"parity" blocks -- numbers made up BTW.

In general a logical RAID is a matrix of rows of W data and/or
parity blocks times S the number of stripes, and RAID
implementations remap that onto another matrix of D storage
devices each with C blocks; where both the stripe matrix and
the storage matrix can be subdivided, by row or column
(usually the stripe matrix column-wise and the device matrix
row-wise).

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Suboptimal raid6 linear read speed
  2013-01-21 14:40                       ` Peter Rabbitson
  2013-01-21 20:32                         ` Peter Grandi
@ 2013-01-21 22:00                         ` Peter Grandi
  1 sibling, 0 replies; 46+ messages in thread
From: Peter Grandi @ 2013-01-21 22:00 UTC (permalink / raw)
  To: Linux RAID

[ ... RAID6 vs. RAID10 ... ]

> I am not sure what you are saying... I see raid as a way for
> me to keep a higher layer "online", while some of the physical
> drives fall on the floor. In the case of 4 drives (very
> typical for mom&pop+consultant shops with near-sufficient
> expertise but far-insufficient funds)

If it is literally true that «far-insufficient funds» then these
people are making a business bet that they will be lucky.

Something like "with this computing setup I have 1 chance in 10
to go bankrupt in 5 years because of loss of data, but I hope
that I will be one of the 9 that don't and also the business
will last less than 5 years". That's a common business strategy,
and a very defensible one in many cases.

> a raid6 is the more obvious choice as it provides the array
> size of 2xdrives,

Despite my aversion to parity RAID in all its forms (as shared
with the BAARF.com supporters), in some cases it does match
particular requirements, and RAID6 4+2 and RAID5 2+1 or 4+1 seem
to be the most plausible, because they have narrow stripes
(minimizing alignment issues) and a decent degree of
redundancy. My notes on this:

  http://www.sabi.co.uk/blog/0709sep.html#070923b
  http://www.sabi.co.uk/blog/12-two.html#120218

To me RAID6 2+2 seems a bit crazy because RAID10 2x(1+1) has
much better (read) speed or RAID10 2x(1+2) has much better
resilience too, for not much higher price in the second case.

But at least it has a decent amount of redundancy, 50% like
RAID10, just distributed differently. I really dislike wide
RAID5 and RAID6 instead because of this:

> with reasonable redundancy (*ANY* 2 drives),

This point has been made to me for decades and I was never able
to understand it because it sounded quite absurd; only relatively
recently did I realize that it is based on two assumptions: that
the probability of failure is independent of RAID set size, and
that it is not correlated across drives in the same RAID set.

  http://www.sabi.co.uk/blog/1104Apr.html#110401

Unfortunately once a drive has failed in a RAID set the chances
that more drives will fail are proportional to the size of the
RAID set, and larger still the more similar the RAID set members
and their environment are.

> and reasonable-ish read/write rates in normal operation.

Ahhhhhh but you yourself were complaining that from a 2+2 you
only get 2x sequential read transfer rates, so you can't use
that argument.

[ ... ]

> By the way "normal operation" is what I am basing my
> observation on, because a degraded raid does not run for years
> without being taken care of. If it does - someone is doing it
> wrong.

Wish that were true, but the problem is not "years", it is the
"hours"/"days" spent incomplete or resyncing, during which a RAID6
gets much more stressed than a RAID10, and usually failures are
very correlated.

When a disk starts failing, it is often because of age or design or
vibration, and usually all drives in the same RAID set have the
same age, design or are subject to the same vibrations; or it
fails because of some external power or thermal or mechanical
shock, and as a rule such a shock affects all drives at the same
time; some may fail outright, some may be instead partially
affected, and the stressful incomplete/resyncing load of RAID6
can drive them to fail too.

Sure, a RAID set can work for 2 to 5 years without much of a
problem if one is lucky. As a friend said, it is like saying

  "I did not change the oil in my car for 3 years and it is
  still running, that means I don't need to change it for
  another 3 years".

Sure it can be like that if one has a "golden" engine and uses
it not that much.

Admittedly I have seen cases of 10+2 and 14+2 RAID6 sets "working"
for years, but in the better cases they were part of a system
with a responsive support team, they were very lightly loaded,
on nearly entirely read-only data, they were well backed up, the
disks had 512B sectors and were relatively small, and the system
components were pretty high quality "enterprise" ones that fail
relatively rarely.

Also, the workload was mostly running in memory, with occasional
persist/backup style writing, and as to writing, RAID5 and RAID6
can work fairly well if they are used mostly for bulk streaming
writes.

> Besides with raid6 the degradation of operational speeds will
> be a contributing factor to repair the array *sooner*.

That does not seem to me like a positive argument...

> Compare to raid10, which has better read characteristics, but
> in order to reach the "any 2 drives" bounty,

But it can also lose more than 2 drives and still work, as long
as they are not in the same pair. That can be pretty valuable.

> one needs to assemble a -l 10 -n 4 -p f3, which isn't... very
> optimal (mom&pop just went from 2xsize to 1.3xsize).

Well, at parity of target capacity (2TB) and (single-threaded)
read speed you can use a RAID10 of 6x1TB drives instead of a
RAID6 of 4x2TB drives.

And we are not talking about *huge* amounts of money for a small
RAID set here.

> From your comment above I gather you disagree with this. Can
> you elaborate more on the economics of mom&pop installations,
> and how my assesment is (euphemism) wrong? :)

It is not necessarily wrong, it is that you are comparing two
very different performance envelopes and being hypnotized by
geometry considerations without regard to effective failure
probability and speed.

At least you are using a RAID6 2+2 which has a decent percent of
redundancy, 50% like a RAID10.

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2013-01-21 22:00 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-15 12:33 Suboptimal raid6 linear read speed Peter Rabbitson
2013-01-15 12:45 ` Mikael Abrahamsson
2013-01-15 12:56   ` Peter Rabbitson
2013-01-15 16:13     ` Mikael Abrahamsson
2013-01-15 12:49 ` Phil Turmel
2013-01-15 12:55   ` Peter Rabbitson
2013-01-15 17:09     ` Charles Polisher
2013-01-15 19:57       ` keld
2013-01-16  4:43         ` Charles Polisher
2013-01-16  6:37           ` Tommy Apel Hansen
2013-01-16  9:36           ` keld
2013-01-16 16:09             ` Charles Polisher
2013-01-16 20:40               ` EJ Vincent
2013-01-15 23:17     ` Phil Turmel
2013-01-16  2:48     ` Stan Hoeppner
2013-01-16  2:58       ` Peter Rabbitson
2013-01-16 20:29         ` Stan Hoeppner
2013-01-16 21:20           ` Roy Sigurd Karlsbakk
2013-01-17 15:51           ` Mikael Abrahamsson
2013-01-18  8:31             ` Stan Hoeppner
2013-01-18  9:18               ` Mikael Abrahamsson
2013-01-18 22:56                 ` Stan Hoeppner
2013-01-19  7:43                   ` Mikael Abrahamsson
2013-01-19 22:48                     ` Stan Hoeppner
2013-01-19 23:51                       ` Maarten
2013-01-20  0:16                         ` Chris Murphy
2013-01-20  0:49                           ` Maarten
2013-01-20  1:37                             ` Phil Turmel
2013-01-20  9:44                             ` Chris Murphy
2013-01-20  6:26                           ` Mikael Abrahamsson
2013-01-20  9:39                             ` Chris Murphy
2013-01-20 16:55                               ` Mikael Abrahamsson
2013-01-20 17:15                                 ` Chris Murphy
2013-01-20 17:17                                   ` Mikael Abrahamsson
2013-01-20 17:20                                     ` Chris Murphy
2013-01-19 23:53                       ` Phil Turmel
2013-01-20  9:04                     ` Wolfgang Denk
2013-01-20 19:28                     ` Peter Grandi
2013-01-20 21:09                       ` Mikael Abrahamsson
2013-01-20 21:50                         ` Peter Grandi
2013-01-21  5:24                           ` Mikael Abrahamsson
2013-01-21 14:40                       ` Peter Rabbitson
2013-01-21 20:32                         ` Peter Grandi
2013-01-21 20:55                           ` Peter Grandi
2013-01-21 22:00                         ` Peter Grandi
2013-01-19 13:21                   ` Roy Sigurd Karlsbakk
