* proactive disk replacement
@ 2017-03-20 12:47 Jeff Allison
  2017-03-20 13:25 ` Reindl Harald
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Jeff Allison @ 2017-03-20 12:47 UTC (permalink / raw)
  To: linux-raid

Hi all, I’ve had a poke around but have yet to find anything definitive.

I have a raid 5 array of 4 disks amounting to approx 5.5tb. These disks are getting a bit long in the tooth, so before I run into problems I’ve bought 4 new disks to replace them.

I have a backup so if it all goes west I’m covered. So I’m looking for suggestions.

My current plan is just to replace the 2tb drives with the new 3tb drives and move on. I’d like to do it online without having to trash the array and start again, so does anyone have a game plan for doing that?

Or is a 9tb raid 5 array the wrong thing to be doing? Should I be doing something else instead, a 6tb raid 10 or something? I’m open to suggestions.

Cheers Jeff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-20 12:47 proactive disk replacement Jeff Allison
@ 2017-03-20 13:25 ` Reindl Harald
  2017-03-20 14:59 ` Adam Goryachev
  2017-03-22 14:51 ` John Stoffel
  2 siblings, 0 replies; 34+ messages in thread
From: Reindl Harald @ 2017-03-20 13:25 UTC (permalink / raw)
  To: Jeff Allison, linux-raid



Am 20.03.2017 um 13:47 schrieb Jeff Allison:
> Hi all I’ve had a poke around but am yet to find something definitive.
>
> I have a raid 5 array of 4 disks amounting to approx 5.5tb. Now this disks are getting a bit long in the tooth so before I get into problems I’ve bought 4 new disks to replace them.
>
> I have a backup so if it all goes west I’m covered. So I’m looking for suggestions.
>
> My current plan is just to replace the 2tb drives with the new 3tb drives and move on, I’d like to do it on line with out having to trash the array and start again, so does anyone have a game plan for doing that.
>
> Or is a 9tb raid 5 array the wrong thing to be doing and should I be doing something else 6tb raid 10 or something I’m open to suggestions.

you just manually fail them and replace them the same way as if they
had died unexpectedly - done that multiple times

on machines without hot-swap bays I just power off, replace a disk and
then clone the MBR and add the partitions, the same way as I do when one
dies (partitions in case you didn't use the whole drives for the array)

http://bencane.com/2011/07/06/mdadm-manually-fail-a-drive/
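
A minimal sketch of that approach, assuming the array is /dev/md0, the member
being swapped is /dev/sdb1, and the disks are partitioned (all names here are
placeholders):

  mdadm /dev/md0 --fail /dev/sdb1
  mdadm /dev/md0 --remove /dev/sdb1
  # power off if needed, swap the physical drive, then clone the partition
  # table from a surviving disk (sfdisk for MBR; sgdisk -R for GPT)
  sfdisk -d /dev/sda | sfdisk /dev/sdb
  mdadm /dev/md0 --add /dev/sdb1
  cat /proc/mdstat        # watch the rebuild

Keep in mind the array runs degraded from the --fail until the rebuild
completes.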


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-20 12:47 proactive disk replacement Jeff Allison
  2017-03-20 13:25 ` Reindl Harald
@ 2017-03-20 14:59 ` Adam Goryachev
  2017-03-20 15:04   ` Reindl Harald
  2017-03-21  2:33   ` Jeff Allison
  2017-03-22 14:51 ` John Stoffel
  2 siblings, 2 replies; 34+ messages in thread
From: Adam Goryachev @ 2017-03-20 14:59 UTC (permalink / raw)
  To: Jeff Allison, linux-raid



On 20/3/17 23:47, Jeff Allison wrote:
> Hi all I’ve had a poke around but am yet to find something definitive.
>
> I have a raid 5 array of 4 disks amounting to approx 5.5tb. Now this disks are getting a bit long in the tooth so before I get into problems I’ve bought 4 new disks to replace them.
>
> I have a backup so if it all goes west I’m covered. So I’m looking for suggestions.
>
> My current plan is just to replace the 2tb drives with the new 3tb drives and move on, I’d like to do it on line with out having to trash the array and start again, so does anyone have a game plan for doing that.
Yes, do not fail a disk and then replace it; use the newer replace
method (it keeps redundancy in the array).
Even better would be to add a disk, and convert to RAID6, then add a 
second disk (using replace), and so on, then remove the last disk, grow 
the array to fill the 3TB, and then reduce the number of disks in the raid.
This way, you end up with RAID6...
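
A minimal sketch of the replace mechanism itself, assuming the array is
/dev/md0, an old member is /dev/sdb1 and a new 3TB drive shows up as /dev/sde
(names are placeholders; this shows the individual steps, not the whole
sequence above):

  mdadm /dev/md0 --add /dev/sde1
  mdadm /dev/md0 --replace /dev/sdb1 --with /dev/sde1
  # the old member stays active until the copy completes, then is marked faulty
  mdadm /dev/md0 --remove /dev/sdb1
  # once every member sits on a 3TB partition:
  mdadm --grow /dev/md0 --size=max
  # RAID5 -> RAID6 needs a spare device present and may want a backup file:
  mdadm --grow /dev/md0 --level=6 --raid-devices=5 --backup-file=/root/md0.bak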
> Or is a 9tb raid 5 array the wrong thing to be doing and should I be doing something else 6tb raid 10 or something I’m open to suggestions.
I'd feel safer with RAID6, but it depends on your requirements. RAID10
is also a nice option, but it depends...

Regards,
Adam


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-20 14:59 ` Adam Goryachev
@ 2017-03-20 15:04   ` Reindl Harald
  2017-03-20 15:23     ` Adam Goryachev
  2017-03-21  2:33   ` Jeff Allison
  1 sibling, 1 reply; 34+ messages in thread
From: Reindl Harald @ 2017-03-20 15:04 UTC (permalink / raw)
  To: Adam Goryachev, Jeff Allison, linux-raid



Am 20.03.2017 um 15:59 schrieb Adam Goryachev:
> On 20/3/17 23:47, Jeff Allison wrote:
>> Hi all I’ve had a poke around but am yet to find something definitive.
>>
>> I have a raid 5 array of 4 disks amounting to approx 5.5tb. Now this
>> disks are getting a bit long in the tooth so before I get into
>> problems I’ve bought 4 new disks to replace them.
>>
>> I have a backup so if it all goes west I’m covered. So I’m looking for
>> suggestions.
>>
>> My current plan is just to replace the 2tb drives with the new 3tb
>> drives and move on, I’d like to do it on line with out having to trash
>> the array and start again, so does anyone have a game plan for doing
>> that.
> Yes, do not fail a disk and then replace it, use the newer replace
> method (it keeps redundancy in the array)

how should it keep redundancy when you have to remove a disk anyway,
unless you have enough slots to at least temporarily add an additional one?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-20 15:04   ` Reindl Harald
@ 2017-03-20 15:23     ` Adam Goryachev
  2017-03-20 16:19       ` Wols Lists
  0 siblings, 1 reply; 34+ messages in thread
From: Adam Goryachev @ 2017-03-20 15:23 UTC (permalink / raw)
  To: Reindl Harald, Jeff Allison, linux-raid



On 21/3/17 02:04, Reindl Harald wrote:
>
>
> Am 20.03.2017 um 15:59 schrieb Adam Goryachev:
>> On 20/3/17 23:47, Jeff Allison wrote:
>>> Hi all I’ve had a poke around but am yet to find something definitive.
>>>
>>> I have a raid 5 array of 4 disks amounting to approx 5.5tb. Now this
>>> disks are getting a bit long in the tooth so before I get into
>>> problems I’ve bought 4 new disks to replace them.
>>>
>>> I have a backup so if it all goes west I’m covered. So I’m looking for
>>> suggestions.
>>>
>>> My current plan is just to replace the 2tb drives with the new 3tb
>>> drives and move on, I’d like to do it on line with out having to trash
>>> the array and start again, so does anyone have a game plan for doing
>>> that.
>> Yes, do not fail a disk and then replace it, use the newer replace
>> method (it keeps redundancy in the array)
>
> how should it keep redundancy when you have to remove a disk anyways 
> except you have enough slots to at least temporary add a additional one?
Yes, assuming you can (at least temporarily) add an additional disk,
then you will not lose redundancy by using the replace method instead of
the fail/add method.

Regards,
Adam

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-20 15:23     ` Adam Goryachev
@ 2017-03-20 16:19       ` Wols Lists
  0 siblings, 0 replies; 34+ messages in thread
From: Wols Lists @ 2017-03-20 16:19 UTC (permalink / raw)
  To: Adam Goryachev, Reindl Harald, Jeff Allison, linux-raid

On 20/03/17 15:23, Adam Goryachev wrote:
> 
> 
> On 21/3/17 02:04, Reindl Harald wrote:
>>
>>
>> Am 20.03.2017 um 15:59 schrieb Adam Goryachev:
>>> On 20/3/17 23:47, Jeff Allison wrote:
>>>> Hi all I’ve had a poke around but am yet to find something definitive.
>>>>
>>>> I have a raid 5 array of 4 disks amounting to approx 5.5tb. Now this
>>>> disks are getting a bit long in the tooth so before I get into
>>>> problems I’ve bought 4 new disks to replace them.
>>>>
>>>> I have a backup so if it all goes west I’m covered. So I’m looking for
>>>> suggestions.
>>>>
>>>> My current plan is just to replace the 2tb drives with the new 3tb
>>>> drives and move on, I’d like to do it on line with out having to trash
>>>> the array and start again, so does anyone have a game plan for doing
>>>> that.
>>> Yes, do not fail a disk and then replace it, use the newer replace
>>> method (it keeps redundancy in the array)
>>
>> how should it keep redundancy when you have to remove a disk anyways
>> except you have enough slots to at least temporary add a additional one?
> Yes, assuming you can (at least temporarily) add an additional disk,
> then you will not lose redundancy by using the replace instead of
> fail/add method.
> 
Take a look at the raid wiki. Especially this page ...

https://raid.wiki.kernel.org/index.php/Replacing_a_failed_drive

Okay, it's my work (unless people have come in since and edited it) but
I make a point of asking "the people who should know" to check my work
if I'm at all unsure. So this will have been looked over for mistakes by
various people on the list who either write the code or provide advice
and support.

And yes, as you can see from that page, I'd say add a new disk then
--replace it into the array. And upgrading the array to raid6 is a good
idea. But with Adam's way I think you need two extra temporary drive slots.
What I think you can do is this: on the new drives, make the underlying
partition the full 3TB. You can then replace all four drives. So long as
2*3TB >= 3*2TB (don't laugh - it might not be!!!) you should be able to
reduce the number of drives to three and then add the fourth back to give
raid6.
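
A rough sketch of that order of operations (all device names and sizes are
placeholders; shrink the filesystem before reducing the device count, and
check the mdadm man page for the --backup-file requirements of your version):

  # replace each old 2TB member with a full-size 3TB partition, one at a time
  mdadm /dev/md0 --add /dev/sde1
  mdadm /dev/md0 --replace /dev/sda1 --with /dev/sde1
  # ...repeat for the other three drives...
  mdadm --grow /dev/md0 --size=max                  # use the full 3TB components
  # make sure the used data fits on two 3TB data disks, then shrink and reshape:
  mdadm --grow /dev/md0 --array-size=<new-size>
  mdadm --grow /dev/md0 --raid-devices=3 --backup-file=/root/md0.bak
  mdadm --grow /dev/md0 --level=6 --raid-devices=4 --backup-file=/root/md0.bak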

The other thing is, if you've got the space for Adam's method, you could
always temporarily create a 4TB drive by combining 2*2TB in a raid0 -
probably best striped rather than linear.
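
A sketch of that, assuming two of the freed 2TB drives appear as /dev/sdf and
/dev/sdg (placeholders):

  mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdf1 /dev/sdg1
  mdadm /dev/md0 --add /dev/md1     # use the striped pair as one ~4TB member

Just remember the temporary member has no redundancy of its own, so treat it
strictly as a stopgap.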

Cheers,
Wol


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-20 14:59 ` Adam Goryachev
  2017-03-20 15:04   ` Reindl Harald
@ 2017-03-21  2:33   ` Jeff Allison
  2017-03-21  9:54     ` Reindl Harald
  1 sibling, 1 reply; 34+ messages in thread
From: Jeff Allison @ 2017-03-21  2:33 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid

I don't have a spare SATA slot; I do however have a spare USB carrier.
Is that fast enough to be used temporarily?

On 21 March 2017 at 01:59, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:
>
>
> On 20/3/17 23:47, Jeff Allison wrote:
>>
>> Hi all I’ve had a poke around but am yet to find something definitive.
>>
>> I have a raid 5 array of 4 disks amounting to approx 5.5tb. Now this disks
>> are getting a bit long in the tooth so before I get into problems I’ve
>> bought 4 new disks to replace them.
>>
>> I have a backup so if it all goes west I’m covered. So I’m looking for
>> suggestions.
>>
>> My current plan is just to replace the 2tb drives with the new 3tb drives
>> and move on, I’d like to do it on line with out having to trash the array
>> and start again, so does anyone have a game plan for doing that.
>
> Yes, do not fail a disk and then replace it, use the newer replace method
> (it keeps redundancy in the array).
> Even better would be to add a disk, and convert to RAID6, then add a second
> disk (using replace), and so on, then remove the last disk, grow the array
> to fill the 3TB, and then reduce the number of disks in the raid.
> This way, you end up with RAID6...
>>
>> Or is a 9tb raid 5 array the wrong thing to be doing and should I be doing
>> something else 6tb raid 10 or something I’m open to suggestions.
>
> I'd feel safer with RAID6, but it depends on your requirements. RAID10 is
> also a nice option, but, it depends...
>
> Regards,
> Adam
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21  2:33   ` Jeff Allison
@ 2017-03-21  9:54     ` Reindl Harald
  2017-03-21 10:54       ` Adam Goryachev
  2017-03-21 13:02       ` David Brown
  0 siblings, 2 replies; 34+ messages in thread
From: Reindl Harald @ 2017-03-21  9:54 UTC (permalink / raw)
  To: Jeff Allison, Adam Goryachev; +Cc: linux-raid



Am 21.03.2017 um 03:33 schrieb Jeff Allison:
> I don't have a spare SATA slot I do however have a spare USB carrier,
> is that fast enough to be used temporarily?

USB3 yes; USB2 is no fun, because the speed of the array depends on
the slowest disk in the set

and about RAID5/RAID6 versus RAID10: both RAID5 and RAID6 suffer from
the same problems - during a rebuild you have a lot of random-IO load on
all remaining disks, which leads to bad performance and makes it more
likely that another disk fails before the rebuild is finished; RAID6
produces even more random IO because of the double parity, and if you
have an Unrecoverable-Read-Error on RAID5 you are dead, RAID6 is not much
better here, and a URE becomes more likely with larger disks

RAID10: little to zero performance impact during a rebuild and no
random IO caused by the rebuild, it's just "read a disk from start to end
and write the data to another disk linearly", while the only head movement
on your disks is the normal workload on the array

with disks of 2 TB or larger you can draw the conclusion "do not use
RAID5/6 anymore, and when you do, be prepared that you won't survive a
rebuild caused by a failed disk"

> On 21 March 2017 at 01:59, Adam Goryachev
> <mailinglists@websitemanagers.com.au> wrote:
>>
>>
>> On 20/3/17 23:47, Jeff Allison wrote:
>>>
>>> Hi all I’ve had a poke around but am yet to find something definitive.
>>>
>>> I have a raid 5 array of 4 disks amounting to approx 5.5tb. Now this disks
>>> are getting a bit long in the tooth so before I get into problems I’ve
>>> bought 4 new disks to replace them.
>>>
>>> I have a backup so if it all goes west I’m covered. So I’m looking for
>>> suggestions.
>>>
>>> My current plan is just to replace the 2tb drives with the new 3tb drives
>>> and move on, I’d like to do it on line with out having to trash the array
>>> and start again, so does anyone have a game plan for doing that.
>>
>> Yes, do not fail a disk and then replace it, use the newer replace method
>> (it keeps redundancy in the array).
>> Even better would be to add a disk, and convert to RAID6, then add a second
>> disk (using replace), and so on, then remove the last disk, grow the array
>> to fill the 3TB, and then reduce the number of disks in the raid.
>> This way, you end up with RAID6...
>>>
>>> Or is a 9tb raid 5 array the wrong thing to be doing and should I be doing
>>> something else 6tb raid 10 or something I’m open to suggestions.
>>
>> I'd feel safer with RAID6, but it depends on your requirements. RAID10 is
>> also a nice option, but, it depends...

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21  9:54     ` Reindl Harald
@ 2017-03-21 10:54       ` Adam Goryachev
  2017-03-21 11:03         ` Reindl Harald
  2017-03-21 11:55         ` Gandalf Corvotempesta
  2017-03-21 13:02       ` David Brown
  1 sibling, 2 replies; 34+ messages in thread
From: Adam Goryachev @ 2017-03-21 10:54 UTC (permalink / raw)
  To: Reindl Harald, Jeff Allison; +Cc: linux-raid



On 21/3/17 20:54, Reindl Harald wrote:
>
>
> Am 21.03.2017 um 03:33 schrieb Jeff Allison:
>> I don't have a spare SATA slot I do however have a spare USB carrier,
>> is that fast enough to be used temporarily?
>
> USB3 yes, USB2 don't make fun because the speed of the array depends 
> on the slowest disk in the spindle
>
> and about RAID5/RAID6 versus RAID10: both RAID5 and RAID6 suffer from 
> the same problems - due rebuild you have a lot of random-IO load on 
> all remaining disks which leads in bad performance and make it more 
> likely that before the rebuild is finished another disk fails, RAID6 
> produces even more random IO because of the double parity and if you 
> have a Unrecoverable-Read-Error on RAID5 you are dead, RAID6 is not 
> much better here and the probability of a URE becomes more likely with 
> larger disks
>
> RAID10: less to zero performance impact due rebuild and no random-IO 
> caused by the rebuild, it's just "read a disk from start to end and 
> write the data on another disk linear" while the only head moves on 
> your disks is the normal workload on the array
>
> with disks 2 TB or larger you can make the conclusion "do not use 
> RAID5/6 anymore and when you do be prepared that you won't survive a 
> rebuild caused by a failed disk"
>
I can't say I'm an expert in this, but in actual fact, I disagree with 
both your arguments against RAID6...
You say recovery on a RAID10 is a simple linear read from one drive (the 
surviving member of the RAID1 portion) and a linear write on the other 
(the replaced drive). You also declare that there is no random IO with 
normal work load + recovery. I think you have forgotten that the "normal 
workload" is probably random IO, but certainly once combined with the 
recovery IO then it will be random IO.

In addition, you claim that a drive larger than 2TB is almost certainly 
going to suffer from a URE during recovery, yet this is exactly the 
situation you will be in when trying to recover a RAID10 with member 
devices 2TB or larger. A single URE on the surviving portion of the 
RAID1 will cause you to lose the entire RAID10 array. On the other hand, 
3 URE's on the three remaining members of the RAID6 will not cause more 
than a hiccup (as long as no more than one URE on the same stripe, which 
I would argue is ... exceptionally unlikely).

In addition, with a 4 disk RAID6 you have a 100% chance of surviving a 2 
drive failure without data loss, yet with 4 disk RAID10 you have a 50% 
chance of surviving a 2 drive failure.
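
(As a rough check of the numbers, assuming the usual near-2 layout with fixed
mirror pairs: there are C(4,2) = 6 ways to lose two disks out of four, and only
2 of those combinations take out both halves of the same pair, so a 4-disk
RAID10 survives a random 2-drive failure about 2 times in 3 - while the 4-disk
RAID6 survives it every time.)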

Sure, there are other things to consider (performance, cost, etc) but on 
a reliability point, RAID6 seems to be the far better option.

Regards,
Adam
>> On 21 March 2017 at 01:59, Adam Goryachev
>> <mailinglists@websitemanagers.com.au> wrote:
>>>
>>>
>>> On 20/3/17 23:47, Jeff Allison wrote:
>>>>
>>>> Hi all I’ve had a poke around but am yet to find something definitive.
>>>>
>>>> I have a raid 5 array of 4 disks amounting to approx 5.5tb. Now 
>>>> this disks
>>>> are getting a bit long in the tooth so before I get into problems I’ve
>>>> bought 4 new disks to replace them.
>>>>
>>>> I have a backup so if it all goes west I’m covered. So I’m looking for
>>>> suggestions.
>>>>
>>>> My current plan is just to replace the 2tb drives with the new 3tb 
>>>> drives
>>>> and move on, I’d like to do it on line with out having to trash the 
>>>> array
>>>> and start again, so does anyone have a game plan for doing that.
>>>
>>> Yes, do not fail a disk and then replace it, use the newer replace 
>>> method
>>> (it keeps redundancy in the array).
>>> Even better would be to add a disk, and convert to RAID6, then add a 
>>> second
>>> disk (using replace), and so on, then remove the last disk, grow the 
>>> array
>>> to fill the 3TB, and then reduce the number of disks in the raid.
>>> This way, you end up with RAID6...
>>>>
>>>> Or is a 9tb raid 5 array the wrong thing to be doing and should I 
>>>> be doing
>>>> something else 6tb raid 10 or something I’m open to suggestions.
>>>
>>> I'd feel safer with RAID6, but it depends on your requirements. 
>>> RAID10 is
>>> also a nice option, but, it depends...


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 10:54       ` Adam Goryachev
@ 2017-03-21 11:03         ` Reindl Harald
  2017-03-21 11:34           ` Andreas Klauer
                             ` (2 more replies)
  2017-03-21 11:55         ` Gandalf Corvotempesta
  1 sibling, 3 replies; 34+ messages in thread
From: Reindl Harald @ 2017-03-21 11:03 UTC (permalink / raw)
  To: Adam Goryachev, Jeff Allison; +Cc: linux-raid



Am 21.03.2017 um 11:54 schrieb Adam Goryachev:
> On 21/3/17 20:54, Reindl Harald wrote:
>> Am 21.03.2017 um 03:33 schrieb Jeff Allison:
>>> I don't have a spare SATA slot I do however have a spare USB carrier,
>>> is that fast enough to be used temporarily?
>>
>> USB3 yes, USB2 don't make fun because the speed of the array depends
>> on the slowest disk in the spindle
>>
>> and about RAID5/RAID6 versus RAID10: both RAID5 and RAID6 suffer from
>> the same problems - due rebuild you have a lot of random-IO load on
>> all remaining disks which leads in bad performance and make it more
>> likely that before the rebuild is finished another disk fails, RAID6
>> produces even more random IO because of the double parity and if you
>> have a Unrecoverable-Read-Error on RAID5 you are dead, RAID6 is not
>> much better here and the probability of a URE becomes more likely with
>> larger disks
>>
>> RAID10: less to zero performance impact due rebuild and no random-IO
>> caused by the rebuild, it's just "read a disk from start to end and
>> write the data on another disk linear" while the only head moves on
>> your disks is the normal workload on the array
>>
>> with disks 2 TB or larger you can make the conclusion "do not use
>> RAID5/6 anymore and when you do be prepared that you won't survive a
>> rebuild caused by a failed disk"
>>
> I can't say I'm an expert in this, but in actual fact, I disagree with
> both your arguments against RAID6...
> You say recovery on a RAID10 is a simple linear read from one drive (the
> surviving member of the RAID1 portion) and a linear write on the other
> (the replaced drive). You also declare that there is no random IO with
> normal work load + recovery. I think you have forgotten that the "normal
> workload" is probably random IO, but certainly once combined with the
> recovery IO then it will be random IO.

but the point is that with RAID5/6 the recovery itself is *heavy random
IO*, and that gets *combined* with the random IO of the normal workload,
and that means *heavy load on the disks*

> In addition, you claim that a drive larger than 2TB is almost certainly
> going to suffer from a URE during recovery, yet this is exactly the
> situation you will be in when trying to recover a RAID10 with member
> devices 2TB or larger. A single URE on the surviving portion of the
> RAID1 will cause you to lose the entire RAID10 array. On the other hand,
> 3 URE's on the three remaining members of the RAID6 will not cause more
> than a hiccup (as long as no more than one URE on the same stripe, which
> I would argue is ... exceptionally unlikely).

given that when your disks have the same age, errors on another disk
become more likely once one has failed, and that the heavy disk IO during
recovery of a RAID6 takes *many hours* with heavy IO on *all disks*,
compared with a way faster restore of RAID1/10 - guess in which case a URE
is more likely

additionally, why should the whole array fail just because a single block
gets lost? there is no parity which needs to be calculated, you just lost
a single block somewhere - RAID1/10 are way simpler in their implementation

> In addition, with a 4 disk RAID6 you have a 100% chance of surviving a 2
> drive failure without data loss, yet with 4 disk RAID10 you have a 50%
> chance of surviving a 2 drive failure.

yeah, and you *need that* when it takes many hours or a few days until
your 8 TB RAID6 is resynced, while the whole time *all disks* are under
heavy stress

> Sure, there are other things to consider (performance, cost, etc) but on
> a reliability point, RAID6 seems to be the far better option

*no* - it takes twice as long to recalculate from parity and stresses
the remaining disks twice as hard as RAID5, so you pretty soon end up
having lost both of the disks you can lose without the array going down,
while you still have many hours of recovery time remaining

here you go: http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 11:03         ` Reindl Harald
@ 2017-03-21 11:34           ` Andreas Klauer
  2017-03-21 12:03             ` Reindl Harald
  2017-03-21 11:56           ` Adam Goryachev
  2017-03-21 13:13           ` David Brown
  2 siblings, 1 reply; 34+ messages in thread
From: Andreas Klauer @ 2017-03-21 11:34 UTC (permalink / raw)
  To: Reindl Harald; +Cc: Adam Goryachev, Jeff Allison, linux-raid

On Tue, Mar 21, 2017 at 12:03:51PM +0100, Reindl Harald wrote:
> but the point is that with RAID5/6 the recovery itself is *heavy random 
> IO* and that get *combined* with the random IO auf the normal workload 
> and that means *heavy load on the disks*

Where do you get that random I/O idea from? Rebuild is linear.
Or what do you mean by random I/O in this context? (RAID rebuilds)
What kind of random things do you think the RAID is doing?

If you see read errors during rebuild, the most common cause is 
that the rebuild also happens to be the first read test since forever. 
(Happens to be the case for people who don't do any disk monitoring.)
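
A minimal sketch of such a regular read test (array name is a placeholder;
many distributions already ship this as a monthly cron job, e.g. Debian's
checkarray script):

  echo check > /sys/block/md0/md/sync_action
  cat /sys/block/md0/md/mismatch_cnt    # non-zero after a check deserves a look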

> here you go: http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/

This is just wrong.

Regards
Andreas Klauer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 10:54       ` Adam Goryachev
  2017-03-21 11:03         ` Reindl Harald
@ 2017-03-21 11:55         ` Gandalf Corvotempesta
  1 sibling, 0 replies; 34+ messages in thread
From: Gandalf Corvotempesta @ 2017-03-21 11:55 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Reindl Harald, Jeff Allison, linux-raid

2017-03-21 11:54 GMT+01:00 Adam Goryachev <mailinglists@websitemanagers.com.au>:
> I can't say I'm an expert in this, but in actual fact, I disagree with both
> your arguments against RAID6...
> You say recovery on a RAID10 is a simple linear read from one drive (the
> surviving member of the RAID1 portion) and a linear write on the other (the
> replaced drive). You also declare that there is no random IO with normal
> work load + recovery. I think you have forgotten that the "normal workload"
> is probably random IO, but certainly once combined with the recovery IO then
> it will be random IO.
>
> In addition, you claim that a drive larger than 2TB is almost certainly
> going to suffer from a URE during recovery, yet this is exactly the
> situation you will be in when trying to recover a RAID10 with member devices
> 2TB or larger. A single URE on the surviving portion of the RAID1 will cause
> you to lose the entire RAID10 array. On the other hand, 3 URE's on the three
> remaining members of the RAID6 will not cause more than a hiccup (as long as
> no more than one URE on the same stripe, which I would argue is ...
> exceptionally unlikely).
>
> In addition, with a 4 disk RAID6 you have a 100% chance of surviving a 2
> drive failure without data loss, yet with 4 disk RAID10 you have a 50%
> chance of surviving a 2 drive failure.
>
> Sure, there are other things to consider (performance, cost, etc) but on a
> reliability point, RAID6 seems to be the far better option.

Totally agree

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 11:03         ` Reindl Harald
  2017-03-21 11:34           ` Andreas Klauer
@ 2017-03-21 11:56           ` Adam Goryachev
  2017-03-21 12:10             ` Reindl Harald
  2017-03-21 13:13           ` David Brown
  2 siblings, 1 reply; 34+ messages in thread
From: Adam Goryachev @ 2017-03-21 11:56 UTC (permalink / raw)
  To: Reindl Harald, Jeff Allison; +Cc: linux-raid

Sorry, but I'm just seeing scaremongering and things that don't compute.
Possibly I'm just not seeing it, but I don't see your advice being given
by a majority of "experts" either on this list or elsewhere. I'll try to
refrain from responding beyond this one, and return to lurking and
hopefully learning more.

Also, please note that the quoting / attribution seems to be wrong 
(inverted).

On 21/3/17 22:03, Reindl Harald wrote:
>
> Am 21.03.2017 um 11:54 schrieb Adam Goryachev:
>> On 21/3/17 20:54, Reindl Harald wrote:
>>> and about RAID5/RAID6 versus RAID10: both RAID5 and RAID6 suffer from
>>> the same problems - due rebuild you have a lot of random-IO load on
>>> all remaining disks which leads in bad performance and make it more
>>> likely that before the rebuild is finished another disk fails, RAID6
>>> produces even more random IO because of the double parity and if you
>>> have a Unrecoverable-Read-Error on RAID5 you are dead, RAID6 is not
>>> much better here and the probability of a URE becomes more likely with
>>> larger disks
>>>
>>> RAID10: less to zero performance impact due rebuild and no random-IO
>>> caused by the rebuild, it's just "read a disk from start to end and
>>> write the data on another disk linear" while the only head moves on
>>> your disks is the normal workload on the array
>>>
>>> with disks 2 TB or larger you can make the conclusion "do not use
>>> RAID5/6 anymore and when you do be prepared that you won't survive a
>>> rebuild caused by a failed disk"
>>>
>> I can't say I'm an expert in this, but in actual fact, I disagree with
>> both your arguments against RAID6...
>> You say recovery on a RAID10 is a simple linear read from one drive (the
>> surviving member of the RAID1 portion) and a linear write on the other
>> (the replaced drive). You also declare that there is no random IO with
>> normal work load + recovery. I think you have forgotten that the "normal
>> workload" is probably random IO, but certainly once combined with the
>> recovery IO then it will be random IO.
>
> but the point is that with RAID5/6 the recovery itself is *heavy 
> random IO* and that get *combined* with the random IO auf the normal 
> workload and that means *heavy load on the disks*
random IO is the same as random IO, regardless of the "cause" of making 
the IO random.
In most systems, you won't be running anywhere near the IO limits, so 
allowing your recovery some portion of IO is not an issue.
>
>> In addition, you claim that a drive larger than 2TB is almost certainly
>> going to suffer from a URE during recovery, yet this is exactly the
>> situation you will be in when trying to recover a RAID10 with member
>> devices 2TB or larger. A single URE on the surviving portion of the
>> RAID1 will cause you to lose the entire RAID10 array. On the other hand,
>> 3 URE's on the three remaining members of the RAID6 will not cause more
>> than a hiccup (as long as no more than one URE on the same stripe, which
>> I would argue is ... exceptionally unlikely).
>
> given that when your disks have the same age errors on another disk 
> become more likely when one failed and the heavy disk IO due recovery 
> of a RAID6 with takes *many hours* where you have heavy IO on *all 
> disks* compared with a way faster restore of RAID1/10 guess in which 
> case a URE is more likely
>
URE's are based on the amount of data read, and that isn't cumulative;
every block read starts again with the same chance. If winning the lottery
is a chance of 100:1, it doesn't mean you will win at least once if you
buy 100 tickets. So reading 200,000,000 blocks also doesn't ensure you
will see a URE (equally, you just might be lucky and win the lottery more
than once, and get more than one URE).
In any case, if you only have a single source of data, then you are more 
likely to lose it (this is one of the reasons for RAID and backups). So 
RAID6 which stores your data in more than one location (during a drive 
failure event) is better.
BTW, just because you say that you will suffer a URE under heavy load 
doesn't make it true. The load factor doesn't change the frequency of a 
URE (even though it sounds possible).
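As a rough illustration of the odds involved (assuming the commonly quoted
consumer rating of one URE per 1e14 bits read, and treating errors as
independent):

  one full read of a 2TB disk = 1.6e13 bits
  expected UREs               = 1.6e13 / 1e14 = 0.16
  P(at least one URE)         = 1 - (1 - 1e-14)^(1.6e13), roughly 15%

So a full-disk read is likely to succeed, and each new read starts from the
same odds rather than accumulating towards a certain failure.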
> additionally why should the whole array fail just because a single 
> block get lost? the is no parity which needs to be calculated, you 
> just lost a single block somewhere - RAID1/10 are way easier in their 
> implementation
Equally, in the worst case where you have multiple UREs on the same
stripe, RAID6 only loses a single stripe (ok, a stripe is bigger than a
block, but that is still much less likely to occur anyway).
>
>> In addition, with a 4 disk RAID6 you have a 100% chance of surviving a 2
>> drive failure without data loss, yet with 4 disk RAID10 you have a 50%
>> chance of surviving a 2 drive failure.
>
> yeah and you *need that* when it takes many hours ot a few days until 
> your 8 TB RAID6 is resynced while the whole time *all disks* are under 
> heavy stress
Why are all disks under heavy stress? Again, you don't operate (under 
normal conditions) at a heavy stress level, you need room to grow, and 
also peak load is going to be higher but for short duration. Normal 
activity might be 50% of maximum, degraded performance together with 
recovery might push that to 80%, but disks (decent ones) are not going 
to have a problem doing simple read/write activity, that is what they 
are designed for right?
>
>> Sure, there are other things to consider (performance, cost, etc) but on
>> a reliability point, RAID6 seems to be the far better option
>
> *no* - it takes twice as long to recalculate from parity and stresses 
> the remaining disks twice as hard as RAID5 and so you pretty soon end 
> with lost both of the disk you can lose without the array goes down 
> while you still have many hours remaining recovery time
>
> here you go: 
> http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/
That was written in 2010; 2019 is only 2 years away (unless you meant
2029 and it was a typo), and I don't see evidence of that being true nor
becoming true in such a short time. We don't see many (any?) people
trying to recover their RAID6 arrays with double URE failures.

You say it takes twice as long to recalculate from parity for RAID6 
compared to RAID5, but with CPU performance, this is still faster than 
the drive speed (unless you have NVMe or some SSD's, but then I assume 
the whole URE issue is different there anyway). Also, why do you think 
it stresses the disks twice as hard as RAID5? To recover a RAID5 you 
need a full read of all surviving drives, that's 100% read. To recover a 
RAID6 you need a full read of all remaining drives minus one, so that is 
less than 100% read. So why are you "stressing the remaining disks twice 
as hard"? Also, why does a URE equal losing a disk, all you do is read 
that block from another member in the array, and fix the URE at the same 
time.

If anything, you might suggest triple mirror RAID (what is that called? 
RAID110?)
If I was to believe you, then that is the only sensible option, with 
triple mirror, when you lose any one drive, then you may recover by 
simply reading from the surviving members, and you are no worse off 
under any scenario. Even losing any two drives and you are still 
protected, potentially you can lose up to 4 drives without data loss 
(assuming a minimum of 6 drives). However, cost is a factor here.

Finally, other than RAID110 (really, what is this called?) do you have 
any other sensible suggestions? RAID10 just doesn't seem to be it, and 
zfs doesn't seem to be mainstream enough either, same with btrfs and 
other FS's which can do various checksum/redundant data storage.

PS, In case you are wondering, I am still running 8 drive RAID5 in real 
life workloads, and don't have any problems with data loss (albeit, I do 
use DRBD to replicate the data between two systems with RAID5 each, so 
you can call that RAID51 perhaps, but the point remains, I've never 
(yet) lost an entire RAID5 array due to multiple drive failure or URE's).


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 11:34           ` Andreas Klauer
@ 2017-03-21 12:03             ` Reindl Harald
  2017-03-21 12:41               ` Andreas Klauer
  0 siblings, 1 reply; 34+ messages in thread
From: Reindl Harald @ 2017-03-21 12:03 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: Adam Goryachev, Jeff Allison, linux-raid



Am 21.03.2017 um 12:34 schrieb Andreas Klauer:
> On Tue, Mar 21, 2017 at 12:03:51PM +0100, Reindl Harald wrote:
>> but the point is that with RAID5/6 the recovery itself is *heavy random
>> IO* and that get *combined* with the random IO auf the normal workload
>> and that means *heavy load on the disks*
>
> Where do you get that random I/O idea from? Rebuild is linear.
> Or what do you mean by random I/O in this context? (RAID rebuilds)
> What kind of random things do you think the RAID is doing?

the IO of a RAID5/6 rebuild is hardly linear, because the information
(data + parity) is spread all over the disks, while in the case of
RAID1/10 it is really linear

> If you see read errors during rebuild, the most common cause is
> that the rebuild also happens to be the first read test since forever.
> (Happens to be the case for people who don't do any disk monitoring.)
>
>> here you go: http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/
>
> This is just wrong

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 11:56           ` Adam Goryachev
@ 2017-03-21 12:10             ` Reindl Harald
  0 siblings, 0 replies; 34+ messages in thread
From: Reindl Harald @ 2017-03-21 12:10 UTC (permalink / raw)
  To: Adam Goryachev, Jeff Allison; +Cc: linux-raid


Am 21.03.2017 um 12:56 schrieb Adam Goryachev:
> Sorry, but I'm just seeing scaremongering and things that don't compute.
> Possibly I'm just not seeing it, but I don't see your advise being given
> by a majority of "experts" either on this list or elsewhere. I'll try to
> refrain from responding beyond this one, and return to lurking and
> hopefully learning more.
>
> Also, please note that the quoting / attribution seems to be wrong
> (inverted).

only in your mail client

> On 21/3/17 22:03, Reindl Harald wrote:
>> Am 21.03.2017 um 11:54 schrieb Adam Goryachev:
>> but the point is that with RAID5/6 the recovery itself is *heavy
>> random IO* and that get *combined* with the random IO auf the normal
>> workload and that means *heavy load on the disks*
> random IO is the same as random IO, regardless of the "cause" of making
> the IO random

no - it's a matter of *how much* random IO you have - when the rebuild
process needs to seek for parity and the remaining data blocks, and hence
produces heavy head movement all the time, this is added to the IO of the
normal workload

in the case of a RAID1/10 rebuild the rebuild process itself is just a
linear read, and the only head movement on the disks is the normal
workload on the array

> In most systems, you won't be running anywhere near the IO limits, so
> allowing your recovery some portion of IO is not an issue

IO limits don't matter here when we talk about IOPS and the drive head
moving around heavily all the time, because the parity and data blocks for
the restore are spread all over the disk *and* the requested workload data
is also somewhere else

in the case of a RAID1/10 rebuild you have linear IO the whole time, only
occasionally interrupted by the workload on the array - that's a completely
different stress level for a disk compared with seeking for hours and days
for the parity and data needed to restore the failed disk

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 12:03             ` Reindl Harald
@ 2017-03-21 12:41               ` Andreas Klauer
  2017-03-22  4:16                 ` NeilBrown
  0 siblings, 1 reply; 34+ messages in thread
From: Andreas Klauer @ 2017-03-21 12:41 UTC (permalink / raw)
  To: Reindl Harald; +Cc: Adam Goryachev, Jeff Allison, linux-raid

On Tue, Mar 21, 2017 at 01:03:22PM +0100, Reindl Harald wrote:
> the IO of a RAID5/6 rebuild is hardly linear beause the informations 
> (data + parity) are spread all over the disks

It's not "randomly" spread all over. The blocks are always where they belong.

https://en.wikipedia.org/wiki/Standard_RAID_levels#/media/File:RAID_6.svg

It's AAAA, BBBB, CCCC, DDDD. Not DBCA, BADC, ADBC, ...

There is no random I/O involved here; at worst it will decide not to read
a parity block because it's not needed, but that does not cause huge/random
jumps for the HDD read heads.

> while in case of RAID1/10 it is really linear

Actually RAID 10 has the most interesting layout choices... 
to this day mdadm is unable to grow/convert some of these.

In a RAID 10 rebuild the HDD might have to jump from end to start.

Of course if you consider metadata updates (progress has to be 
recorded somewhere?) then ALL rebuilds regardless of RAID level 
are random I/O in a way.

But such is the fate of a HDD, it's their bread and butter. 
Any server that does anything other than "idle" does random I/O 24/7.

If there was no other I/O (because the RAID is live during rebuild) 
and no metadata updates (or external metadata) you could totally do 
RAID0/1/5/6 rebuilds with tape drives. That's how random it is.
RAID10 might need a rewind in between.

Regards
Andreas Klauer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21  9:54     ` Reindl Harald
  2017-03-21 10:54       ` Adam Goryachev
@ 2017-03-21 13:02       ` David Brown
  2017-03-21 13:26         ` Gandalf Corvotempesta
                           ` (2 more replies)
  1 sibling, 3 replies; 34+ messages in thread
From: David Brown @ 2017-03-21 13:02 UTC (permalink / raw)
  To: Reindl Harald, Jeff Allison, Adam Goryachev; +Cc: linux-raid

On 21/03/17 10:54, Reindl Harald wrote:
> 
> 
> Am 21.03.2017 um 03:33 schrieb Jeff Allison:
>> I don't have a spare SATA slot I do however have a spare USB carrier,
>> is that fast enough to be used temporarily?
> 
> USB3 yes, USB2 don't make fun because the speed of the array depends on
> the slowest disk in the spindle

When you are turning your RAID5 into RAID6, you can use a non-standard
layout with the external drive being the second parity.  That way you
don't need to re-write the data on the existing drives, and the access
to the external drive will all be writes of the Q parity - the system
will not read from that drive unless it has to recover from a two drive
failure.  This will reduce stress on all the disks, and make the limited
USB2 bandwidth less of an issue.
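
A sketch of that with mdadm, assuming the USB disk shows up as /dev/sde
(a placeholder; see the GROW MODE / --layout section of your mdadm man page,
since --layout=preserve and --layout=normalise are what make this work):

  mdadm /dev/md0 --add /dev/sde1
  mdadm --grow /dev/md0 --level=6 --raid-devices=5 --layout=preserve
  # later, with all members back on SATA, rewrite to a standard RAID6 layout
  # (this reshape does move data and may need a backup file):
  mdadm --grow /dev/md0 --layout=normalise --backup-file=/root/md0.bak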

If you have to use two USB carriers for the whole process, try to make
sure they are connected to separate root hubs so that they don't share
the bandwidth.  This is not always just a matter of using two USB ports
- sometimes two adjacent USB ports on a PC share an internal hub.

> 
> and about RAID5/RAID6 versus RAID10: both RAID5 and RAID6 suffer from
> the same problems - due rebuild you have a lot of random-IO load on all
> remaining disks which leads in bad performance and make it more likely
> that before the rebuild is finished another disk fails, RAID6 produces
> even more random IO because of the double parity and if you have a
> Unrecoverable-Read-Error on RAID5 you are dead, RAID6 is not much better
> here and the probability of a URE becomes more likely with larger disks

Rebuilds are done using streamed linear access - the only random access
is the mix of rebuild transfers with normal usage of the array.  This
applies to RAID5 and RAID6 as well as RAID1 or RAID10.

With RAID5 or two-disk RAID1, if you get an URE on a read then you can
recover the data without loss.  This is the case for normal
(non-degraded) use, or if you are using "replace" to duplicate an
existing disk before replacement.  If you have failed a drive (manually,
or due to a serious disk failure), then any single URE means lost data
in that stripe.

With RAID6 (or three-disk RAID1), you can tolerate /two/ URE's on the
same stripe.  If you have failed a disk for replacement, you can
tolerate one URE.

Note that to cause failure in non-degraded RAID5 (or degraded RAID6),
your two URE's need to be on the same stripe in order to cause data
loss.  The chances of getting an URE somewhere on the disk are roughly
proportional to the size of the disk - but the chance of getting an URE
on the same stripe as another URE on another disk are basically
independent of the disk size, and it is extraordinarily small.

> 
> RAID10: less to zero performance impact due rebuild and no random-IO
> caused by the rebuild, it's just "read a disk from start to end and
> write the data on another disk linear" while the only head moves on your
> disks is the normal workload on the array

RAID1 (and RAID0) rebuilds are a little more efficient than RAID5 or
RAID6 rebuilds - but not hugely so.  Depending on factors such as IO
structures, cpu speed and loading, number of disks in the array,
concurrent access to other data, etc., they can be something like 25% to
50% faster.  They do not involve noticeably more or less linear access
than a RAID5/RAID6 rebuild, but they avoid heavy access to disks other
than those in the RAID1 pair being rebuilt.

> 
> with disks 2 TB or larger you can make the conclusion "do not use
> RAID5/6 anymore and when you do be prepared that you won't survive a
> rebuild caused by a failed disk"

No, you cannot.  Your conclusion here is based on several totally
incorrect assumptions:

1. You think that RAID5/RAID6 recovery is more stressful, because the
parity is "all over the place".  This is wrong.

2. You think that random IO has higher chance of getting an URE than
linear IO.  This is wrong.

3. You think that getting an URE on one disk, then getting an URE on a
second disk, counts as a double failure that will break an single-parity
redundancy (RAID5, RAID1, RAID6 in degraded mode).  This is wrong - it
is only a problem if the two UREs are in the same stripe, which is quite
literally a one in a million chance.


There are certainly good reasons to prefer RAID10 systems to RAID5/RAID6
- for some types of loads, it can be significantly faster, and even
though the rebuild time is not as much faster as you think, it is still
faster.  Linux supports a range of different RAID types for good reason
- it is not a "one size fits all" problem.  But you should learn the
differences and make your choices and recommendations based on facts,
rather than articles written by people trying to sell their own "solutions".

mvh.,

David


> 
>> On 21 March 2017 at 01:59, Adam Goryachev
>> <mailinglists@websitemanagers.com.au> wrote:
>>>
>>>
>>> On 20/3/17 23:47, Jeff Allison wrote:
>>>>
>>>> Hi all I’ve had a poke around but am yet to find something definitive.
>>>>
>>>> I have a raid 5 array of 4 disks amounting to approx 5.5tb. Now this
>>>> disks
>>>> are getting a bit long in the tooth so before I get into problems I’ve
>>>> bought 4 new disks to replace them.
>>>>
>>>> I have a backup so if it all goes west I’m covered. So I’m looking for
>>>> suggestions.
>>>>
>>>> My current plan is just to replace the 2tb drives with the new 3tb
>>>> drives
>>>> and move on, I’d like to do it on line with out having to trash the
>>>> array
>>>> and start again, so does anyone have a game plan for doing that.
>>>
>>> Yes, do not fail a disk and then replace it, use the newer replace
>>> method
>>> (it keeps redundancy in the array).
>>> Even better would be to add a disk, and convert to RAID6, then add a
>>> second
>>> disk (using replace), and so on, then remove the last disk, grow the
>>> array
>>> to fill the 3TB, and then reduce the number of disks in the raid.
>>> This way, you end up with RAID6...
>>>>
>>>> Or is a 9tb raid 5 array the wrong thing to be doing and should I be
>>>> doing
>>>> something else 6tb raid 10 or something I’m open to suggestions.
>>>
>>> I'd feel safer with RAID6, but it depends on your requirements.
>>> RAID10 is
>>> also a nice option, but, it depends...


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 11:03         ` Reindl Harald
  2017-03-21 11:34           ` Andreas Klauer
  2017-03-21 11:56           ` Adam Goryachev
@ 2017-03-21 13:13           ` David Brown
  2017-03-21 13:24             ` Reindl Harald
  2 siblings, 1 reply; 34+ messages in thread
From: David Brown @ 2017-03-21 13:13 UTC (permalink / raw)
  To: Reindl Harald, Adam Goryachev, Jeff Allison; +Cc: linux-raid

On 21/03/17 12:03, Reindl Harald wrote:
> 
> 
> Am 21.03.2017 um 11:54 schrieb Adam Goryachev:
<snip>
> 
>> In addition, you claim that a drive larger than 2TB is almost certainly
>> going to suffer from a URE during recovery, yet this is exactly the
>> situation you will be in when trying to recover a RAID10 with member
>> devices 2TB or larger. A single URE on the surviving portion of the
>> RAID1 will cause you to lose the entire RAID10 array. On the other hand,
>> 3 URE's on the three remaining members of the RAID6 will not cause more
>> than a hiccup (as long as no more than one URE on the same stripe, which
>> I would argue is ... exceptionally unlikely).
> 
> given that when your disks have the same age errors on another disk
> become more likely when one failed and the heavy disk IO due recovery of
> a RAID6 with takes *many hours* where you have heavy IO on *all disks*
> compared with a way faster restore of RAID1/10 guess in which case a URE
> is more likely
> 
> additionally why should the whole array fail just because a single block
> get lost? the is no parity which needs to be calculated, you just lost a
> single block somewhere - RAID1/10 are way easier in their implementation

If you have RAID1, and you have an URE, then the data can be recovered
from the other half of that RAID1 pair.  If you have had a disk failure
(manual for replacement, or a real failure), and you get an URE on the
other half of that pair, then you lose data.

With RAID6, you need an additional failure (either another full disk
failure or an URE in the /same/ stripe) to lose data.  RAID6 has higher
redundancy than two-way RAID1 - of this there is /no/ doubt.

> 
>> In addition, with a 4 disk RAID6 you have a 100% chance of surviving a 2
>> drive failure without data loss, yet with 4 disk RAID10 you have a 50%
>> chance of surviving a 2 drive failure.
> 
> yeah and you *need that* when it takes many hours ot a few days until
> your 8 TB RAID6 is resynced while the whole time *all disks* are under
> heavy stress
> 
>> Sure, there are other things to consider (performance, cost, etc) but on
>> a reliability point, RAID6 seems to be the far better option
> 
> *no* - it takes twice as long to recalculate from parity and stresses
> the remaining disks twice as hard as RAID5 and so you pretty soon end
> with lost both of the disk you can lose without the array goes down
> while you still have many hours remaining recovery time

For RAID5 and RAID6, you read the same data - the full data stripe.  For
RAID5, you calculate and write a single parity block, while for RAID6
you calculate and write an additional parity block.  The disk reads are
the same in both cases, but you write out twice as many blocks.  You do
not stress the disks noticeably harder with RAID6 than with RAID5.

> 
> here you go: http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/

This is an article heavily based on a Sun engineer trying to promote his
own alternative using scaremongering.

It is, however, correct in suggesting that RAID6 is more reliable than
RAID5.  And triple-parity raid (or additional layered RAID) is more
reliable than RAID6.  Nowhere does it suggest that RAID1 is more
reliable than RAID6.

It all boils down to the redundancy level.  Two-drive RAID1 pairs have a
single drive redundancy.  RAID5 has a single drive redundancy.  RAID6
has two drive redundancy - thus it is more reliable and will tolerate
more failures before losing data.  If this is not enough, and you don't
have triple parity RAID (it is not yet implemented in md - one day,
perhaps), you can use more mirrors on RAID1 or use layers such as a
RAID5 array built on RAID1 pairs.
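
A sketch of the layered variant (all device names are placeholders):

  # three RAID1 pairs...
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sde1 /dev/sdf1
  # ...with RAID5 striped across the pairs
  mdadm --create /dev/md10 --level=5 --raid-devices=3 /dev/md1 /dev/md2 /dev/md3

  # or simply more mirrors per set:
  mdadm --create /dev/md1 --level=1 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1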



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 13:13           ` David Brown
@ 2017-03-21 13:24             ` Reindl Harald
  2017-03-21 14:15               ` David Brown
  0 siblings, 1 reply; 34+ messages in thread
From: Reindl Harald @ 2017-03-21 13:24 UTC (permalink / raw)
  To: David Brown, Adam Goryachev, Jeff Allison; +Cc: linux-raid



Am 21.03.2017 um 14:13 schrieb David Brown:
> On 21/03/17 12:03, Reindl Harald wrote:
>>
>> Am 21.03.2017 um 11:54 schrieb Adam Goryachev:
> <snip>
>>
>>> In addition, you claim that a drive larger than 2TB is almost certainly
>>> going to suffer from a URE during recovery, yet this is exactly the
>>> situation you will be in when trying to recover a RAID10 with member
>>> devices 2TB or larger. A single URE on the surviving portion of the
>>> RAID1 will cause you to lose the entire RAID10 array. On the other hand,
>>> 3 URE's on the three remaining members of the RAID6 will not cause more
>>> than a hiccup (as long as no more than one URE on the same stripe, which
>>> I would argue is ... exceptionally unlikely).
>>
>> given that when your disks have the same age errors on another disk
>> become more likely when one failed and the heavy disk IO due recovery of
>> a RAID6 with takes *many hours* where you have heavy IO on *all disks*
>> compared with a way faster restore of RAID1/10 guess in which case a URE
>> is more likely
>>
>> additionally why should the whole array fail just because a single block
>> get lost? the is no parity which needs to be calculated, you just lost a
>> single block somewhere - RAID1/10 are way easier in their implementation
>
> If you have RAID1, and you have an URE, then the data can be recovered
> from the other have of that RAID1 pair.  If you have had a disk failure
> (manual for replacement, or a real failure), and you get an URE on the
> other half of that pair, then you lose data.
>
> With RAID6, you need an additional failure (either another full disk
> failure or an URE in the /same/ stripe) to lose data.  RAID6 has higher
> redundancy than two-way RAID1 - of this there is /no/ doubt

yes, but with RAID5/RAID6 *all disks* are involved in the rebuild, while
with a 10-disk RAID10 only one disk needs to be read and the data written
to the new one - all the other disks are not involved in the resync at all

for most arrays the disks have a similar age and usage pattern, so when
the first one fails it becomes likely that it won't take too long for
another one, and so load and recovery time matter

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 13:02       ` David Brown
@ 2017-03-21 13:26         ` Gandalf Corvotempesta
  2017-03-21 14:26           ` David Brown
  2017-03-21 15:29         ` Wols Lists
  2017-03-21 16:55         ` Phil Turmel
  2 siblings, 1 reply; 34+ messages in thread
From: Gandalf Corvotempesta @ 2017-03-21 13:26 UTC (permalink / raw)
  To: David Brown; +Cc: Reindl Harald, Jeff Allison, Adam Goryachev, linux-raid

2017-03-21 14:02 GMT+01:00 David Brown <david.brown@hesbynett.no>:
> Note that to cause failure in non-degraded RAID5 (or degraded RAID6),
> your two URE's need to be on the same stripe in order to cause data
> loss.  The chances of getting an URE somewhere on the disk are roughly
> proportional to the size of the disk - but the chance of getting an URE
> on the same stripe as another URE on another disk are basically
> independent of the disk size, and it is extraordinarily small.

A little bit OT:
is this the same even for HW RAID controllers like LSI MegaRAID,
or do they tend to fail the rebuild in case of multiple UREs even in
different stripes?

> No, you cannot.  Your conclusion here is based on several totally
> incorrect assumptions:
>
> 1. You think that RAID5/RAID6 recovery is more stressful, because the
> parity is "all over the place".  This is wrong.
>
> 2. You think that random IO has higher chance of getting an URE than
> linear IO.  This is wrong.

Totally agree.

> 3. You think that getting an URE on one disk, then getting an URE on a
> second disk, counts as a double failure that will break an single-parity
> redundancy (RAID5, RAID1, RAID6 in degraded mode).  This is wrong - it
> is only a problem if the two UREs are in the same stripe, which is quite
> literally a one in a million chance.

I'm not sure about this.
The posted paper is talking about "standard" RAID made with HW RAID
controllers, and I'm not sure whether they are able to finish a rebuild in
case of a double URE even if the errors come from different stripes.

I think they fail the whole rebuild.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 13:24             ` Reindl Harald
@ 2017-03-21 14:15               ` David Brown
  2017-03-21 15:25                 ` Wols Lists
  0 siblings, 1 reply; 34+ messages in thread
From: David Brown @ 2017-03-21 14:15 UTC (permalink / raw)
  To: Reindl Harald, Adam Goryachev, Jeff Allison; +Cc: linux-raid

On 21/03/17 14:24, Reindl Harald wrote:
> 
> 
> Am 21.03.2017 um 14:13 schrieb David Brown:
>> On 21/03/17 12:03, Reindl Harald wrote:
>>>
>>> Am 21.03.2017 um 11:54 schrieb Adam Goryachev:
>> <snip>
>>>
>>>> In addition, you claim that a drive larger than 2TB is almost certainly
>>>> going to suffer from a URE during recovery, yet this is exactly the
>>>> situation you will be in when trying to recover a RAID10 with member
>>>> devices 2TB or larger. A single URE on the surviving portion of the
>>>> RAID1 will cause you to lose the entire RAID10 array. On the other
>>>> hand,
>>>> 3 URE's on the three remaining members of the RAID6 will not cause more
>>>> than a hiccup (as long as no more than one URE on the same stripe,
>>>> which
>>>> I would argue is ... exceptionally unlikely).
>>>
>>> given that your disks have the same age, errors on another disk become
>>> more likely once one has failed, and the recovery of a RAID6 takes
>>> *many hours* with heavy IO on *all disks*, compared with a much faster
>>> restore of RAID1/10 - guess in which case a URE is more likely
>>>
>>> additionally, why should the whole array fail just because a single
>>> block gets lost? there is no parity which needs to be calculated, you
>>> just lost a single block somewhere - RAID1/10 are way simpler in their
>>> implementation
>>
>> If you have RAID1, and you have an URE, then the data can be recovered
>> from the other half of that RAID1 pair.  If you have had a disk failure
>> (manually failed for replacement, or a real failure), and you get an URE
>> on the other half of that pair, then you lose data.
>>
>> With RAID6, you need an additional failure (either another full disk
>> failure or an URE in the /same/ stripe) to lose data.  RAID6 has higher
>> redundancy than two-way RAID1 - of this there is /no/ doubt
> 
> yes, but with RAID5/RAID6 *all disks* are involved in the rebuild, while
> with a 10 disk RAID10 only one disk needs to be read and the data written
> to the new one - all other disks are not involved in the resync at all

True...

> 
> for most arrays the disks have a similar age and usage pattern, so when
> the first one fails it becomes likely that it won't take long for another
> one to fail - so load and recovery time matter

False.  There is no reason to suspect that - certainly not to within the
hours or day it takes to rebuild your array.  Disk failure pattern shows
a peak within the first month or so (failures due to manufacturing or
handling), then a very low error rate for a few years, then a gradually
increasing rate after that.  There is not a very significant correlation
between drive failures within the same system, nor is there a very
significant correlation between usage and failures.  It might seem
reasonable to suspect that a drive is more likely to fail during a
rebuild since the disk is being heavily used, but that does not appear
to be the case in practice.  You will /spot/ more errors at that point - simply
because you don't see errors in parts of the disk that are not read -
but the rebuilding does not cause them.

And even if it /were/ true, the key point is whether there is an error
that causes data loss.  An error during reading for a RAID1 rebuild
means lost data.  An error during reading for a RAID6 rebuild means you
have to read an extra sector from another disk and correct the mistake.
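
To put rough numbers on the URE side of that (a back-of-envelope sketch,
assuming the usual spec-sheet rate of 1 URE per 1e14 bits read and treating
reads as independent - real drives can be better or worse than their spec):

# Back-of-envelope: chance of at least one URE while reading a whole disk
# end-to-end during a rebuild.  Assumes the spec-sheet rate of 1 URE per
# 1e14 bits read and independent bit reads - both simplifications.

import math

def p_ure_full_read(disk_bytes, ure_per_bit=1e-14):
    bits = disk_bytes * 8
    return -math.expm1(bits * math.log1p(-ure_per_bit))   # 1 - (1-p)^bits

for tb in (2, 3, 6):
    print(f"{tb} TB read end-to-end: P(>=1 URE) ~ {p_ure_full_read(tb * 10**12):.1%}")

# A degraded RAID1 pair loses data on any such URE; a singly-degraded RAID6
# can still repair it from the remaining parity, so the same event is only
# fatal if it coincides with a second URE at the same offset.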



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 13:26         ` Gandalf Corvotempesta
@ 2017-03-21 14:26           ` David Brown
  2017-03-21 15:31             ` Wols Lists
  0 siblings, 1 reply; 34+ messages in thread
From: David Brown @ 2017-03-21 14:26 UTC (permalink / raw)
  To: Gandalf Corvotempesta
  Cc: Reindl Harald, Jeff Allison, Adam Goryachev, linux-raid

On 21/03/17 14:26, Gandalf Corvotempesta wrote:
> 2017-03-21 14:02 GMT+01:00 David Brown <david.brown@hesbynett.no>:
>> Note that to cause failure in non-degraded RAID5 (or degraded RAID6),
>> your two URE's need to be on the same stripe in order to cause data
>> loss.  The chances of getting an URE somewhere on the disk are roughly
>> proportional to the size of the disk - but the chance of getting an URE
>> on the same stripe as another URE on another disk are basically
>> independent of the disk size, and it is extraordinarily small.
> 
> A little bit OT:
> is this the same even for HW RAID controllers like LSI Megaraid,
> or do they tend to fail the rebuild in case of multiple UREs even on
> different stripes?

It should be true, for decent HW RAID setups.  One possible problem is
the famous re-read timeout issue - if you use a consumer hard drive with long
re-read timeouts, and have not (or cannot) configure it to have a short
timeout, then a hardware RAID controller might consider a drive to be
completely dead while the drive is simply spending 30 seconds re-trying
its read.  If the raid controller drops the drive, then it is like an
URE in /all/ stripes at once!

> 
>> No, you cannot.  Your conclusion here is based on several totally
>> incorrect assumptions:
>>
>> 1. You think that RAID5/RAID6 recovery is more stressful, because the
>> parity is "all over the place".  This is wrong.
>>
>> 2. You think that random IO has higher chance of getting an URE than
>> linear IO.  This is wrong.
> 
> Totally agree.
> 
>> 3. You think that getting an URE on one disk, then getting an URE on a
>> second disk, counts as a double failure that will break a single-parity
>> redundancy (RAID5, RAID1, RAID6 in degraded mode).  This is wrong - it
>> is only a problem if the two UREs are in the same stripe, which is quite
>> literally a one in a million chance.
> 
> I'm not sure about this.
> The posted paper is talking about "standard" raid made with hw raid
> controllers, and I'm not sure whether they are able to finish a rebuild
> in case of a double URE, even when the UREs come from different stripes.
> 
> I think they fail the whole rebuild.
> 

I cannot imagine why that would be the case.

Suppose you have seven drive RAID6, with data blocks ABCDE and parities
PQ.  To make it simpler, assume that on this particular stripe, the
order is ABCDEPQ.  If drive 5 has failed and you are rebuilding, the
RAID system will read in ABCD-P-.  It will not read from drive 5 (since
you are rebuilding it), and it will not bother reading drive 7 because
it doesn't need the Q parity (it /might/ read it in as part of a
streamed read).  It calculates E from ABCD and P, and writes it out.
If, for example, drive 3 gets an URE at this point then it will read the
Q parity and calculate C and E from ABD P and Q.  It will write out E to
the rebuild drive, and also C to the drive with the URE - the drive will
handle sector relocation as needed.  The result is that the stripe
ABCDEPQ is correct on the disk.  The drive with the URE will not be
dropped from the array.

Then it moves on to the next stripe, and repeats the process.  An URE
here is independent of an URE in the previous stripe, and errors can
again be corrected.

It is possible that if there are a large number of UREs from a drive,
that the RAID system will consider the whole drive bad and drop it.  But
other than that, UREs will be treated independently.
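
A minimal sketch of that per-stripe loop (toy Python, not md code; it only
shows the XOR/P-parity part - recovering a /second/ missing block from Q
needs GF(2^8) arithmetic, which is omitted here):

# Toy model of rebuilding one failed member of a parity array, stripe by
# stripe.  Each "chunk" is a bytes object; P is the XOR of the data chunks.
# A real RAID6 would fall back to the Q parity (Reed-Solomon over GF(2^8))
# if a *second* chunk in the same stripe turns out to be unreadable.

from functools import reduce

def xor_chunks(chunks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

def rebuild_stripe(surviving_chunks, parity_chunk):
    """Recompute the missing chunk from the surviving data chunks and P."""
    return xor_chunks(surviving_chunks + [parity_chunk])

# Example stripe: data chunks A..D, failed chunk E, parity P = A^B^C^D^E
A, B, C, D, E = (bytes([x]) * 4 for x in (1, 2, 3, 4, 5))
P = xor_chunks([A, B, C, D, E])

restored_E = rebuild_stripe([A, B, C, D], P)
assert restored_E == E

# A URE on, say, C during this read means two unknowns in the stripe (C and
# E); that is exactly the case where the Q parity is needed, and why RAID6
# survives a URE during a single-disk rebuild while RAID5 does not.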


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 14:15               ` David Brown
@ 2017-03-21 15:25                 ` Wols Lists
  2017-03-21 15:41                   ` David Brown
  0 siblings, 1 reply; 34+ messages in thread
From: Wols Lists @ 2017-03-21 15:25 UTC (permalink / raw)
  To: David Brown, Reindl Harald, Adam Goryachev, Jeff Allison; +Cc: linux-raid

On 21/03/17 14:15, David Brown wrote:
>> for most arrays the disks have a similar age and usage pattern, so when
>> the first one fails it becomes likely that it won't take long for another
>> one to fail - so load and recovery time matter

> False.  There is no reason to suspect that - certainly not to within the
> hours or day it takes to rebuild your array.  Disk failure pattern shows
> a peak within the first month or so (failures due to manufacturing or
> handling), then a very low error rate for a few years, then a gradually
> increasing rate after that.  There is not a very significant correlation
> between drive failures within the same system, nor is there a very
> significant correlation between usage and failures.

Except your argument and the claim don't match. You're right - disk
failures follow the pattern you describe. BUT.

If the array was created from completely new disks, then the usage
patterns will be very similar, therefore there will be a statistical
correlation between failures as compared to the population as a whole.
(Bit like a false DNA match is much higher in an inbred town, than in a
cosmopolitan city of immigrants.)

EVEN WORSE. The probability of all the drives coming off the same batch,
and sharing the same systematic defects, is much much higher. One only
has to look at the Seagate 3TB Barracuda mess to see a perfect example.

In other words, IFF your array is built of a bunch of identical drives
all bought at the same time, the risk of multiple failure is
significantly higher. How significant that is I don't know, but it is a
very valid reason for replacing your drives at semi-random intervals.

(Completely off topic :-) but a real-world demonstrable example is
couples' initials. "Like chooses like" and if you compare a couple's
first initials against what you would expect from a random sample, there
is a VERY significant spike in couples that share the same initial.)

To put it bluntly, if your array consists of disks with near-identical
characteristics (including manufacturing batch), then your chances of
random multiple failure are noticeably increased. Is it worth worrying
about? If you can do something about it, of course!

Cheers,
Wol

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 13:02       ` David Brown
  2017-03-21 13:26         ` Gandalf Corvotempesta
@ 2017-03-21 15:29         ` Wols Lists
  2017-03-21 16:55         ` Phil Turmel
  2 siblings, 0 replies; 34+ messages in thread
From: Wols Lists @ 2017-03-21 15:29 UTC (permalink / raw)
  To: David Brown, Reindl Harald, Jeff Allison, Adam Goryachev; +Cc: linux-raid

On 21/03/17 13:02, David Brown wrote:
> If you have to use two USB carriers for the whole process, try to make
> sure they are connected to separate root hubs so that they don't share
> the bandwidth.  This is not always just a matter of using two USB ports
> - sometimes two adjacent USB ports on a PC share an internal hub.

Having built a bunch of desktop pcs from parts, I'd say adjacent ports
almost certainly share an internal hub. Typically, a single mobo header
will run a wire to a double slot at the front, or a double slot at the
back. So plugging one in at the front, and one at the back, will get
round this unless it's actually just one hub in the ?northbridge.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 14:26           ` David Brown
@ 2017-03-21 15:31             ` Wols Lists
  2017-03-21 17:00               ` Phil Turmel
  0 siblings, 1 reply; 34+ messages in thread
From: Wols Lists @ 2017-03-21 15:31 UTC (permalink / raw)
  To: David Brown, Gandalf Corvotempesta
  Cc: Reindl Harald, Jeff Allison, Adam Goryachev, linux-raid

On 21/03/17 14:26, David Brown wrote:
> It is possible that if there are a large number of UREs from a drive,
> that the RAID system will consider the whole drive bad and drop it.  But
> other than that, UREs will be treated independently.

Doesn't mdadm have a setting that does exactly that? Too many UREs and
the drive gets dropped? I'm sure I've come across that interfering with
rebuilds.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 15:25                 ` Wols Lists
@ 2017-03-21 15:41                   ` David Brown
  2017-03-21 16:49                     ` Phil Turmel
  0 siblings, 1 reply; 34+ messages in thread
From: David Brown @ 2017-03-21 15:41 UTC (permalink / raw)
  To: Wols Lists, Reindl Harald, Adam Goryachev, Jeff Allison; +Cc: linux-raid

On 21/03/17 16:25, Wols Lists wrote:
> On 21/03/17 14:15, David Brown wrote:
>>> for most arrays the disks have a similar age and usage pattern, so when
>>> the first one fails it becomes likely that it won't take long for another
>>> one to fail - so load and recovery time matter
> 
>> False.  There is no reason to suspect that - certainly not to within the
>> hours or day it takes to rebuild your array.  Disk failure pattern shows
>> a peak within the first month or so (failures due to manufacturing or
>> handling), then a very low error rate for a few years, then a gradually
>> increasing rate after that.  There is not a very significant correlation
>> between drive failures within the same system, nor is there a very
>> significant correlation between usage and failures.
> 
> Except your argument and the claim don't match. You're right - disk
> failures follow the pattern you describe. BUT.
> 
> If the array was created from completely new disks, then the usage
> patterns will be very similar, therefore there will be a statistical
> correlation between failures as compared to the population as a whole.
> (Bit like a false DNA match is much higher in an inbred town, than in a
> cosmopolitan city of immigrants.)
> 
> EVEN WORSE. The probability of all the drives coming off the same batch,
> and sharing the same systematic defects, is much much higher. One only
> has to look at the Seagate 3TB Barracuda mess to see a perfect example.
> 
> In other words, IFF your array is built of a bunch of identical drives
> all bought at the same time, the risk of multiple failure is
> significantly higher. How significant that is I don't know, but it is a
> very valid reason for replacing your drives at semi-random intervals.
> 

There /is/ a bit of correlation for early-fail drives coming from the
same batch.  But there is little correlation for normal lifetime drives.

If you roll three dice and sum them, the expected sum will follow a nice
Bell curve distribution.  If you pick another three dice and roll them,
they will follow the same distribution for the expected sum.  But there
is no correlation between the sums.

Similarly, maybe you figure out that there is a 10% chance of the drive
dying in the first month, 10% chance of it dying in the next three
years, then 30% for the fourth year, 40% for the fifth year, and 10%
spread out over the following years.  Multiple drives of the same type
bought at the same time, and run in the same conditions (usage patterns,
heat, humidity, etc.) will have the same expected lifetime curves.  But
if one drive fails in its fourth year, that does not affect the
probability of a second drive also failing in the same year - it is
basically independent.

Now, there will be a little bit of correlation, especially if there are
factors that may significantly affect reliability (such as someone
bumping the server).  But you are still extremely unlikely to find that
after one drive dies, a second drive dies on the same day or so (during
the rebuild) - it is possible, but it is very bad luck.  There is no
statistical basis for thinking that when one drive dies, it is likely
that another one will die too.
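
To put numbers on "basically independent" (a rough sketch built on
illustrative assumptions - an exponential failure model and a 3% annualised
failure rate, neither of which is a measurement of any real drive):

# Rough estimate: if drive failures are independent with a constant hazard
# rate, how likely is a *second* whole-drive failure inside the rebuild
# window?  The 3% AFR and 12-hour window are illustrative assumptions.

import math

def p_fail_within(hours, afr=0.03):
    lam = -math.log(1.0 - afr) / (365.25 * 24)   # per-hour hazard rate
    return -math.expm1(-lam * hours)

surviving_drives = 3
p_one = p_fail_within(12)
p_any = -math.expm1(surviving_drives * math.log1p(-p_one))
print(f"per-drive chance in a 12h rebuild: {p_one:.2e}")
print(f"chance any of {surviving_drives} survivors fails: {p_any:.2e}")

# Even a generous correlation factor (say 10x) leaves this far below the
# chance of a URE during the same rebuild - which RAID6 can absorb anyway.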

Of course, some types of failures can affect several drives - a
motherboard failure, power supply problem, or similar event could kill
all your disks at the same time.  RAID does not avoid the need for backups!

Also early death failures can be correlated with a bad production batch
- mixing different batches helps reduce the risk of total failure.
Similarly, mixing different disk types reduces the risk of total
failures due to systematic errors such as firmware bugs.


> (Completely off topic :-) but a real-world demonstrable example is
> couples' initials. "Like chooses like" and if you compare a couple's
> first initials against what you would expect from a random sample, there
> is a VERY significant spike in couples that share the same initial.)
> 
> To put it bluntly, if your array consists of disks with near-identical
> characteristics (including manufacturing batch), then your chances of
> random multiple failure are noticeably increased. Is it worth worrying
> about? If you can do something about it, of course!
> 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 15:41                   ` David Brown
@ 2017-03-21 16:49                     ` Phil Turmel
  2017-03-22 13:53                       ` Gandalf Corvotempesta
  0 siblings, 1 reply; 34+ messages in thread
From: Phil Turmel @ 2017-03-21 16:49 UTC (permalink / raw)
  To: David Brown, Wols Lists, Reindl Harald, Adam Goryachev, Jeff Allison
  Cc: linux-raid

On 03/21/2017 11:41 AM, David Brown wrote:

> There /is/ a bit of correlation for early-fail drives coming from
> the same batch.  But there is little correlation for normal lifetime
> drives.
> 
> If you roll three dice and sum them, the expected sum will follow a
> nice Bell curve distribution.  If you pick another three dice and
> roll them, they will follow the same distribution for the expected
> sum.  But there is no correlation between the sums.

Let me add to this:

The correlation is effectively immaterial in a non-degraded raid5 and
singly-degraded raid6 because recovery will succeed as long as any two
errors are in different 4k block/sector locations.  And for non-degraded
raid6, all three UREs must occur in the same block/sector to lose
data. Some participants in this discussion need to read the statistical
description of this stuff here:

http://marc.info/?l=linux-raid&m=139050322510249&w=2

As long as you are 'check' scrubbing every so often (I scrub weekly),
the odds of catastrophe on raid6 are the odds of something *else* taking
out the machine or controller, not the odds of simultaneous drive
failures.
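
For reference, kicking off such a scrub by hand is just a sysfs write (a
minimal sketch - 'md0' is an example array name, needs root, and your
distro very likely ships an equivalent weekly cron/systemd job already):

# Start a 'check' scrub on an md array and report the mismatch count when
# it finishes.  /sys/block/<array>/md/sync_action and mismatch_cnt are the
# standard md sysfs files; adjust the array name to your system.

import pathlib, time

md = pathlib.Path("/sys/block/md0/md")

(md / "sync_action").write_text("check\n")

while (md / "sync_action").read_text().strip() != "idle":
    time.sleep(60)                      # scrub still running

print("mismatch_cnt:", (md / "mismatch_cnt").read_text().strip())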

Phil


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 13:02       ` David Brown
  2017-03-21 13:26         ` Gandalf Corvotempesta
  2017-03-21 15:29         ` Wols Lists
@ 2017-03-21 16:55         ` Phil Turmel
  2 siblings, 0 replies; 34+ messages in thread
From: Phil Turmel @ 2017-03-21 16:55 UTC (permalink / raw)
  To: David Brown, Reindl Harald, Jeff Allison, Adam Goryachev; +Cc: linux-raid

On 03/21/2017 09:02 AM, David Brown wrote:

> With RAID6 (or three-disk RAID1), you can tolerate /two/ URE's on
> the same stripe.  If you have failed a disk for replacement, you can 
> tolerate one URE.

One nit to pick here:  The UREs have to be in the same 4k block/sector,
not just in the same stripe.  The stripe cache and all parity
calculations are done on strips of 4k blocks, not whole N*chunk stripes.

That makes the odds against losing data even longer.
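
As a quick illustration of how long those odds are (a toy calculation that
assumes each URE lands uniformly at random on a 2 TB member - purely
illustrative numbers):

# Given one URE somewhere on each of two 2 TB member disks, the chance that
# they land at the *same* 4 KiB block offset (the only case that defeats
# the remaining redundancy here) is one in the number of 4 KiB blocks.

disk_bytes = 2 * 10**12
blocks = disk_bytes // 4096
print(f"4 KiB blocks per disk: {blocks:.2e}")
print(f"P(same block | one URE on each disk): {1 / blocks:.2e}")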

Phil

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 15:31             ` Wols Lists
@ 2017-03-21 17:00               ` Phil Turmel
  0 siblings, 0 replies; 34+ messages in thread
From: Phil Turmel @ 2017-03-21 17:00 UTC (permalink / raw)
  To: Wols Lists, David Brown, Gandalf Corvotempesta
  Cc: Reindl Harald, Jeff Allison, Adam Goryachev, linux-raid

On 03/21/2017 11:31 AM, Wols Lists wrote:
> On 21/03/17 14:26, David Brown wrote:
>> It is possible that if there are a large number of UREs from a
>> drive, that the RAID system will consider the whole drive bad and
>> drop it.  But other than that, UREs will be treated independently.
> 
> Doesn't mdadm have a setting that does exactly that? Too many UREs
> and the drive gets dropped? I'm sure I've come across that
> interfering with rebuilds.

Yes.  MD maintains a per-member-device counter of read errors and drops
the device when the counter reaches 20 (twenty).  The counter is
decremented by 10 (ten) once an hour.  A short burst of less than 20
read errors will be tolerated, as long as they don't continue at more
than 10/hour.

Last I checked, this behavior is hard-coded.
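
A toy model of that policy as described above (an illustration only, not
the kernel's actual code):

# Toy model of the per-member read-error counter: each read error bumps the
# counter, it decays by 10 once an hour, and the member is kicked from the
# array when the counter reaches 20.

class MemberErrorCounter:
    LIMIT, DECAY = 20, 10

    def __init__(self):
        self.count = 0
        self.failed = False

    def on_read_error(self):
        self.count += 1
        if self.count >= self.LIMIT:
            self.failed = True          # md would kick the device here

    def hourly_decay(self):
        self.count = max(0, self.count - self.DECAY)

m = MemberErrorCounter()
for _ in range(15):                     # burst of 15 read errors in one hour
    m.on_read_error()
m.hourly_decay()
assert not m.failed and m.count == 5    # tolerated, as described above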

Phil


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 12:41               ` Andreas Klauer
@ 2017-03-22  4:16                 ` NeilBrown
  0 siblings, 0 replies; 34+ messages in thread
From: NeilBrown @ 2017-03-22  4:16 UTC (permalink / raw)
  To: Andreas Klauer, Reindl Harald; +Cc: Adam Goryachev, Jeff Allison, linux-raid

[-- Attachment #1: Type: text/plain, Size: 2267 bytes --]

On Tue, Mar 21 2017, Andreas Klauer wrote:

> On Tue, Mar 21, 2017 at 01:03:22PM +0100, Reindl Harald wrote:
>> the IO of a RAID5/6 rebuild is hardly linear because the information
>> (data + parity) is spread all over the disks
>
> It's not "randomly" spread all over. The blocks are always where they belong.
>
> https://en.wikipedia.org/wiki/Standard_RAID_levels#/media/File:RAID_6.svg
>
> It's AAAA, BBBB, CCCC, DDDD. Not DBCA, BADC, ADBC, ...
>
> There is no random I/O involved here, at worst it will decide to not read 
> a parity block because it's not needed but that does not cause huge/random
> jumps for the HDD read heads.

RAID5 resync (after an unclean shutdown) does read the parity.
It reads all devices in parallel and checks parity.  Normally all the
parity is correct so it doesn't write at all.
Occasionally there might be incorrect parity, in which case the head
will seek back and write the correct parity.

RAID5 recovery (when a device was removed and a new device is added)
reads all the *other* devices in parallel, calculates the missing block
(parity or data) and writes it out to the replacement device.  All reads and
writes are sequential.
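
Expressed as a toy sketch (illustrative Python, not md's kernel code), the
resync pass is just a sequential read-and-verify with an occasional parity
write-back:

# Toy model of a RAID5 resync pass: read every stripe sequentially, check
# that data XOR parity is all zeroes, and only rewrite the parity when a
# mismatch is found.

from functools import reduce

def xor(chunks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

def resync(stripes):
    """stripes: list of [data..., parity] chunk lists; returns repairs done."""
    repairs = 0
    for stripe in stripes:
        data, parity = stripe[:-1], stripe[-1]
        if any(xor(data + [parity])):   # non-zero => parity mismatch
            stripe[-1] = xor(data)      # seek back and rewrite parity
            repairs += 1
    return repairs

stripes = [[b"\x01\x02", b"\x03\x04", b"\x02\x06"],    # parity correct
           [b"\x01\x02", b"\x03\x04", b"\x00\x00"]]    # parity stale
assert resync(stripes) == 1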

NeilBrown


>
>> while in case of RAID1/10 it is really linear
>
> Actually RAID 10 has the most interesting layout choices... 
> to this day mdadm is unable to grow/convert some of these.
>
> In a RAID 10 rebuild the HDD might have to jump from end to start.
>
> Of course if you consider metadata updates (progress has to be 
> recorded somewhere?) then ALL rebuilds regardless of RAID level 
> are random I/O in a way.
>
> But such is the fate of a HDD, it's their bread and butter. 
> Any server that does anything other than "idle" does random I/O 24/7.
>
> If there was no other I/O (because the RAID is live during rebuild) 
> and no metadata updates (or external metadata) you could totally do 
> RAID0/1/5/6 rebuilds with tape drives. That's how random it is.
> RAID10 might need a rewind in between.
>
> Regards
> Andreas Klauer
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-21 16:49                     ` Phil Turmel
@ 2017-03-22 13:53                       ` Gandalf Corvotempesta
  2017-03-22 14:12                         ` David Brown
  2017-03-22 14:32                         ` Phil Turmel
  0 siblings, 2 replies; 34+ messages in thread
From: Gandalf Corvotempesta @ 2017-03-22 13:53 UTC (permalink / raw)
  To: Phil Turmel
  Cc: David Brown, Wols Lists, Reindl Harald, Adam Goryachev,
	Jeff Allison, linux-raid

2017-03-21 17:49 GMT+01:00 Phil Turmel <philip@turmel.org>:
> The correlation is effectively immaterial in a non-degraded raid5 and
> singly-degraded raid6 because recovery will succeed as long as any two
> errors are in different 4k block/sector locations.  And for non-degraded
> raid6, all three UREs must occur in the same block/sector to lose
> data. Some participants in this discussion need to read the statistical
> description of this stuff here:
>
> http://marc.info/?l=linux-raid&m=139050322510249&w=2
>
> As long as you are 'check' scrubbing every so often (I scrub weekly),
> the odds of catastrophe on raid6 are the odds of something *else* taking
> out the machine or controller, not the odds of simultaneous drive
> failures.

This is true, but disk failures happen much more often than multiple UREs
on the same stripe.
I think that in a RAID6 it is much easier to lose data due to multiple
disk failures.

Last year I lost a server due to 4 (of 6) disk failures in less
than an hour during a rebuild.

The first failure was detected in the middle of the night. It was a
disconnection/reconnection of a single disk.
The reconnection triggered a resync. During the resync another disk
failed. RAID6 recovered even from this double failure,
but at about 60% of the rebuild the third disk failed, bringing the whole raid down.

I was woken up by our monitoring system and, looking at the server,
there was also a fourth disk down :)

4 disks down in less than an hour. All disks were enterprise: SAS 15K,
not desktop drives.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-22 13:53                       ` Gandalf Corvotempesta
@ 2017-03-22 14:12                         ` David Brown
  2017-03-22 14:32                         ` Phil Turmel
  1 sibling, 0 replies; 34+ messages in thread
From: David Brown @ 2017-03-22 14:12 UTC (permalink / raw)
  To: Gandalf Corvotempesta, Phil Turmel
  Cc: Wols Lists, Reindl Harald, Adam Goryachev, Jeff Allison, linux-raid

On 22/03/17 14:53, Gandalf Corvotempesta wrote:
> 2017-03-21 17:49 GMT+01:00 Phil Turmel <philip@turmel.org>:
>> The correlation is effectively immaterial in a non-degraded raid5 and
>> singly-degraded raid6 because recovery will succeed as long as any two
>> errors are in different 4k block/sector locations.  And for non-degraded
>> raid6, all three UREs must occur in the same block/sector to lose
>> data. Some participants in this discussion need to read the statistical
>> description of this stuff here:
>>
>> http://marc.info/?l=linux-raid&m=139050322510249&w=2
>>
>> As long as you are 'check' scrubbing every so often (I scrub weekly),
>> the odds of catastrophe on raid6 are the odds of something *else* taking
>> out the machine or controller, not the odds of simultaneous drive
>> failures.
> 
> This is true, but disk failures happen much more often than multiple UREs
> on the same stripe.
> I think that in a RAID6 it is much easier to lose data due to multiple
> disk failures.

Certainly multiple disk failures are an easy way to lose data in /any/
storage system (or at least, to lose data written since the last backup).

The issue here is whether it is more or less likely to be a problem in
RAID6 than other raid arrangements.  And the answer is that complete
disk failures are not more likely during a RAID6 rebuild than during
other raid rebuilds, and a RAID6 will tolerate more failures than RAID1
or RAID5.

Of course, multiple disk failures /do/ occur.  There can be a common
cause of failure.  I have had a few raid systems die completely over the
years.  The causes I can remember include:

1. The SAS controller card died - and I didn't have a replacement.  The
data on the disks is probably still fine.

2. The whole computer died in some unknown way.  The data on the disks
was fine - I put them in another cabinet and re-assembled the md array.

3. A hardware raid card died.  The data may have been on the disks, but
the hardware raid was in a proprietary format.

4. I knocked a disk cabinet off its shelf.  This led to multiple
simultaneous drive failures.

Based on these, my policy is:

1. Stick to SATA drives that are easily available, easily replaced, and
easily read from any system.

2. Avoid hardware raid - use md raid and/or btrfs raid.

3. Do a lot of backups - on independent systems, and with off-site
copies.  Raid does not prevent loss from fire or theft, or a UPS going
bananas, or a user deleting the wrong file.

4. Mount your equipment securely, and turn round slowly!

> 
> Last year I lost a server due to 4 (of 6) disk failures in less
> than an hour during a rebuild.
> 
> The first failure was detected in the middle of the night. It was a
> disconnection/reconnection of a single disk.
> The reconnection triggered a resync. During the resync another disk
> failed. RAID6 recovered even from this double failure,
> but at about 60% of the rebuild the third disk failed, bringing the whole raid down.
> 
> I was woken up by our monitoring system and, looking at the server,
> there was also a fourth disk down :)
> 
> 4 disks down in less than an hour. All disks were enterprise: SAS 15K,
> not desktop drives.
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-22 13:53                       ` Gandalf Corvotempesta
  2017-03-22 14:12                         ` David Brown
@ 2017-03-22 14:32                         ` Phil Turmel
  1 sibling, 0 replies; 34+ messages in thread
From: Phil Turmel @ 2017-03-22 14:32 UTC (permalink / raw)
  To: Gandalf Corvotempesta
  Cc: David Brown, Wols Lists, Reindl Harald, Adam Goryachev,
	Jeff Allison, linux-raid

On 03/22/2017 09:53 AM, Gandalf Corvotempesta wrote:

> Last year I lost a server due to 4 (of 6) disk failures in less
> than an hour during a rebuild.
> 
> The first failure was detected in the middle of the night. It was a
> disconnection/reconnection of a single disk.
> The reconnection triggered a resync. During the resync another disk
> failed. RAID6 recovered even from this double failure,
> but at about 60% of the rebuild the third disk failed, bringing the whole raid down.
> 
> I was woken up by our monitoring system and, looking at the server,
> there was also a fourth disk down :)
> 
> 4 disks down in less than an hour. All disks were enterprise: SAS 15K,
> not desktop drives.

You should win a prize, Gandalf.  In the several years I've participated
on this mailing list, you are the first to describe such a catastrophe
where the drives really were at fault, instead of timeout mismatch,
power supplies, cables, or controllers.

All four disks had permanent "FAILED" smartctl status after this, yes?

Phil

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: proactive disk replacement
  2017-03-20 12:47 proactive disk replacement Jeff Allison
  2017-03-20 13:25 ` Reindl Harald
  2017-03-20 14:59 ` Adam Goryachev
@ 2017-03-22 14:51 ` John Stoffel
  2 siblings, 0 replies; 34+ messages in thread
From: John Stoffel @ 2017-03-22 14:51 UTC (permalink / raw)
  To: Jeff Allison; +Cc: linux-raid


Jeff> Hi all I’ve had a poke around but am yet to find something
Jeff> definitive.  I have a raid 5 array of 4 disks amounting to
Jeff> approx 5.5tb. Now this disks are getting a bit long in the tooth
Jeff> so before I get into problems I’ve bought 4 new disks to replace
Jeff> them.

Can I suggest that you buy another disk and convert to a RAID6 setup
for even more resiliency?  Especially with that much data (great that you
have backups!) the peace of mind of an extra disk is well worth the
cost in my mind.

Personally, I just go with RAID1 mirrors on large disks like this for
my home system.  I don't have *that* much stuff... though my disks too
are getting long in the tooth.

Jeff> I have a backup so if it all goes west I’m covered. So I’m
Jeff> looking for suggestions.

Jeff> My current plan is just to replace the 2tb drives with the new
Jeff> 3tb drives and move on, I’d like to do it on line with out
Jeff> having to trash the array and start again, so does anyone have a
Jeff> game plan for doing that.

You don't say how your system is set up - whether or not you have LVM on
top of the MD RAID5 array.  If you do, you could simply do:

1. Build a new RAID6 array with five disks (buying another one like I
   suggest above).

2. Add this into your VG with the 4x2tb disks.

3. pvmove all your data onto the new PVs:

   pvmove -b <VG> <old-raid5-PV>

And once it's done, you can then remove that PV from the VG and pull
the old disks from the system.  Or turn them into scratch space until
they die...

Jeff> Or is a 9tb raid 5 array the wrong thing to be doing and should
Jeff> I be doing something else 6tb raid 10 or something I’m open to
Jeff> suggestions.

Depends on how good your backups are and how critical it is that this
data stay online.

John

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2017-03-22 14:51 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-20 12:47 proactive disk replacement Jeff Allison
2017-03-20 13:25 ` Reindl Harald
2017-03-20 14:59 ` Adam Goryachev
2017-03-20 15:04   ` Reindl Harald
2017-03-20 15:23     ` Adam Goryachev
2017-03-20 16:19       ` Wols Lists
2017-03-21  2:33   ` Jeff Allison
2017-03-21  9:54     ` Reindl Harald
2017-03-21 10:54       ` Adam Goryachev
2017-03-21 11:03         ` Reindl Harald
2017-03-21 11:34           ` Andreas Klauer
2017-03-21 12:03             ` Reindl Harald
2017-03-21 12:41               ` Andreas Klauer
2017-03-22  4:16                 ` NeilBrown
2017-03-21 11:56           ` Adam Goryachev
2017-03-21 12:10             ` Reindl Harald
2017-03-21 13:13           ` David Brown
2017-03-21 13:24             ` Reindl Harald
2017-03-21 14:15               ` David Brown
2017-03-21 15:25                 ` Wols Lists
2017-03-21 15:41                   ` David Brown
2017-03-21 16:49                     ` Phil Turmel
2017-03-22 13:53                       ` Gandalf Corvotempesta
2017-03-22 14:12                         ` David Brown
2017-03-22 14:32                         ` Phil Turmel
2017-03-21 11:55         ` Gandalf Corvotempesta
2017-03-21 13:02       ` David Brown
2017-03-21 13:26         ` Gandalf Corvotempesta
2017-03-21 14:26           ` David Brown
2017-03-21 15:31             ` Wols Lists
2017-03-21 17:00               ` Phil Turmel
2017-03-21 15:29         ` Wols Lists
2017-03-21 16:55         ` Phil Turmel
2017-03-22 14:51 ` John Stoffel
