* RAID6 dead on the water after Controller failure
@ 2014-02-14 16:19 Florian Lampel
  2014-02-14 20:35 ` Phil Turmel
  0 siblings, 1 reply; 13+ messages in thread
From: Florian Lampel @ 2014-02-14 16:19 UTC (permalink / raw)
  To: linux-raid

Greetings,

The title says it all: two days ago my RAID6 lost an HDD (sdh). Not a problem, I thought - just let it rebuild and be done with it.

Unfortunately, my mainboard controller didn't seem to like that, and about 2 hours into the rebuild it showed me that the array was missing 5 drives (4 from the mainboard controller plus the one that went south before).
Having been an admin for quite a while, I did not panic, and I have not yet issued a single command that writes to the RAID in any form.

Having read the wiki page about broken RAID arrays and some messages on the list, it became obvious that I should check with you guys before I do anything. The server is still running, but I intend to restart it after unplugging a SATA cable that I assume to be faulty.

Here are the relevant logs and outputs of mdadm as requested on the Wiki:

h__p://pastebin.com/1xweaLYG

cat /proc/mdstat:
root@Lserve:~# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10] 
md0 : active raid6 sdh1[12](S) sdc1[10](F) sdb1[9](F) sda1[8](F) sdd1[11](F) sdf1[5] sdk1[1] sdl1[2] sdg1[6] sde1[4] sdm1[3] sdj1[0]
      19535129600 blocks super 1.0 level 6, 512k chunk, algorithm 2 [12/7] [UUUUUUU_____]
      
unused devices: <none>

sda, sdb, sdc and sdd can't be reached anymore by any means. I believe a restart might fix this, but I am not sure.

2) I assume that I should do the following, in this order: 

2.1) restart the machine and check all the cables etc.
---> and hope that /dev/sda, sdb, sdc and sdd will talk to me again.

2.2) mdadm --assemble --scan 
---> and hope for the best. I don't think it will work.

2.3) mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 (since the event count is the same) /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1
--> I don't believe this one will work either. When using --force, is the order of the HDDs in the command important?

2.4) mdadm --create --assume-clean --chunk=512 --metadata=1.0 --level 6 --raid-devices=12 --size=1953512960 /dev/md0 /dev/sdj1 /dev/sdk1 /dev/sdl1 etc. (using the sequence numbers of the /proc/mdstat pasted above)

--> That should do it, right?

Thanks in advance,
Florian

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RAID6 dead on the water after Controller failure
  2014-02-14 16:19 RAID6 dead on the water after Controller failure Florian Lampel
@ 2014-02-14 20:35 ` Phil Turmel
  2014-02-15 12:31   ` Florian Lampel
  0 siblings, 1 reply; 13+ messages in thread
From: Phil Turmel @ 2014-02-14 20:35 UTC (permalink / raw)
  To: Florian Lampel, linux-raid

Hi Florian,

On 02/14/2014 11:19 AM, Florian Lampel wrote:
> Greetings,
> 
> The title says it all: two days ago my RAID6 lost an HDD (sdh). Not a problem, I thought - just let it rebuild and be done with it.
> 
> Unfortunately, my mainboard controller didn't seem to like that, and about 2 hours into the rebuild it showed me that the array was missing 5 drives (4 from the mainboard controller plus the one that went south before).
> Having been an admin for quite a while, I did not panic, and I have not yet issued a single command that writes to the RAID in any form.
> 
> Having read the wiki page about broken RAID arrays and some messages on the list, it became obvious that I should check with you guys before I do anything. The server is still running, but I intend to restart it after unplugging a SATA cable that I assume to be faulty.
> 
> Here are the relevant logs and outputs of mdadm as requested on the Wiki:
> 
> h__p://pastebin.com/1xweaLYG

Good report.  It even includes the mapping of serial numbers to devices!

To consolidate some critical parts:

sda1: WD-WMC300595645 probably device 8
sdb1: WD-WMC300314217 probably device 9
sdc1: WD-WMC300595957 probably device 10
sdd1: WD-WMC300313432 probably device 11
sde1: WD-WMC300595440 Active device 4
sdf1: WD-WMC300595880 Active device 5
sdg1: WD-WMC1T1521826 Active device 6
sdh1: WD-WMC300314126 spare, incomplete device 7
sdj1: WD-WMC300312702 Active device 0
sdk1: WD-WMC300248734 Active device 1
sdl1: WD-WMC300314248 Active device 2
sdm1: WD-WMC300585843 Active device 3

> sda, sdb, sdc and sdd can't be reached anymore by any means. I believe a restart might fix this, but I am not sure.
> 
> 2) I assume that I should do the following, in this order: 
> 
> 2.1) restart the machine and check all the cables etc.
> ---> and hope that /dev/sda, sdb, sdc and sdd will talk to me again.

Keep replacing controllers, cables, power supplies (anything except the
drives) until you can communicate with all of them.

Except /dev/sdh.  It wasn't finished syncing, so it's no help.

Figure out what went wrong with the hardware.  After you get them all
talking, show us the missing mdadm --examine data and an exhaustive
smartctl report:

mdadm -E /dev/sd[abcd]1 >pastebin.txt

for x in /dev/sd[a-z] ; do echo $x : ; smartctl -x $x ; done >>pastebin.txt

> 2.2) mdadm --assemble --scan 
> ---> and hope for the best. I don't think it will work.

Don't bother.  It certainly won't work now that four drives will have
different event counts.  "--scan" is less than useful in these cases, too.

> 2.3) mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 (since the event count is the same) /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1
> --> I don't believe this one will work either. When using --force, is the order of the HDDs in the command important?

This is the right tool.  Order doesn't matter, as the metadata carries
the member ID.  Leave out /dev/sdh1 (or wherever WD-WMC300314126 ends up).

mdadm -Afv /dev/md0 /dev/sd[abcdefgjklm]1

If it fails, show us the output.

> 2.4) mdadm --create --assume-clean --chunk=512 --metadata=1.0 --level 6 --raid-devices=12 --size=1953512960 /dev/md0 /dev/sdj1 /dev/sdk1 /dev/sdl1 etc. (using the sequence numbers of the /proc/mdstat pasted above)

Do *not* do this!  You have metadata.  You have enough drives to run the
array.  Re-creating the array is *madness*.

HTH,

Phil

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RAID6 dead on the water after Controller failure
  2014-02-14 20:35 ` Phil Turmel
@ 2014-02-15 12:31   ` Florian Lampel
  2014-02-15 15:12     ` Phil Turmel
       [not found]     ` <CADNH=7EiY18TJDBDQsT6LDtw+Ft_2XCFaP30uK7uJb_e7xKhsQ@mail.gmail.com>
  0 siblings, 2 replies; 13+ messages in thread
From: Florian Lampel @ 2014-02-15 12:31 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

Greetings,

first of all - thanks to Phil Turmel for pointing me in the right direction. I checked all the cables, and sure enough, the shielding on the system SSD's cable was halfway peeled off.

Anyway, the current state is as follows:

*) The missing HDDs came up right after the reboot, and I had to use the "bootdegraded=true" kernel option.
*) All 12 drives are functional.

Here is a link to the requested output of 

--- mdadm -E /dev/sd[abcd]1 ---
--- for x in /dev/sd[a-z] ; do echo $x : ; smartctl -x $x ; done ----

as well as

---- mdadm --examine /dev/sd[abcdefghijklmnop]1 ------


Link:
h__p://pastebin.com/v6yzn3KX

My findings:
The event counts do differ, but not by much. As my next step, I would follow Phil Turmel's advice and reassemble the array using the --force option, to be precise:

mdadm -Afv /dev/md0 /dev/sd[abcdefgjklm]1

Could you please advise me whether this next step is all right to do now that we have new logs etc.?

Thanks in advance,
Florian Lampel

PS: Thanks again to Phil for pointing out that --create would be madness.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RAID6 dead on the water after Controller failure
  2014-02-15 12:31   ` Florian Lampel
@ 2014-02-15 15:12     ` Phil Turmel
  2014-02-15 18:52       ` Florian Lampel
  2014-02-15 22:04       ` Jon Nelson
       [not found]     ` <CADNH=7EiY18TJDBDQsT6LDtw+Ft_2XCFaP30uK7uJb_e7xKhsQ@mail.gmail.com>
  1 sibling, 2 replies; 13+ messages in thread
From: Phil Turmel @ 2014-02-15 15:12 UTC (permalink / raw)
  To: Florian Lampel; +Cc: linux-raid

Good morning Florian,

On 02/15/2014 07:31 AM, Florian Lampel wrote:
> Greetings,
> 
> first of all - thanks to Phil Turmel for pointing me in the right direction. I checked all the cables, and sure enough, the shielding on the system SSD's cable was halfway peeled off.

Very good.

> Anyway, the current state is as follows:
> 
> *) The missing HDDs came up right after the reboot, and I had to use the "bootdegraded=true" kernel option.
> *) All 12 drives are functional.
> 
> Here is a link to the requested output of 
> 
> --- mdadm -E /dev/sd[abcd]1 ---
> --- for x in /dev/sd[a-z] ; do echo $x : ; smartctl -x $x ; done ----
> 
> as well as
> 
> ---- mdadm --examine /dev/sd[abcdefghijklmnop]1 ------
> 
> Link:
> h__p://pastebin.com/v6yzn3KX

Device order has changed, summary:

/dev/sda1: WD-WMC300595440 Device #4 @442
/dev/sdb1: WD-WMC300595880 Device #5 @442
/dev/sdc1: WD-WMC1T1521826 Device #6 @442
/dev/sdd1: WD-WMC300314126 spare
/dev/sde1: WD-WMC300595645 Device #8 @435
/dev/sdf1: WD-WMC300314217 Device #9 @435
/dev/sdg1: WD-WMC300595957 Device #10 @435
/dev/sdh1: WD-WMC300313432 Device #11 @435
/dev/sdj1: WD-WMC300312702 Device #0 @442
/dev/sdk1: WD-WMC300248734 Device #1 @442
/dev/sdl1: WD-WMC300314248 Device #2 @442
/dev/sdm1: WD-WMC300585843 Device #3 @442

and your SSD is now /dev/sdi.

> My findings:
> The event counts do differ, but not by much. As my next step, I would follow Phil Turmel's advice and reassemble the array using the --force option, to be precise:
> 
> mdadm -Afv /dev/md0 /dev/sd[abcdefgjklm]1

Not quite.  What was 'h' is now 'd'.  Use:

mdadm -Afv /dev/md0 /dev/sd[abcefghjklm]1

> Could you please advise me whether this next step is all right to do now that we have new logs etc.?

Yes.  You may also need "mdadm --stop /dev/md0" first if your boot
process partially assembled the array already.

After assembly, your array will be single-degraded but fully functional.
 That would be a good time to backup any critical data that isn't
already in a backup.

Then you can add /dev/sdd1 back into the array and let it rebuild.
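
A minimal sketch of that whole sequence, assuming the array is /dev/md0 and
the partially rebuilt drive stays at /dev/sdd1 (adjust the device names to
whatever your system actually assigns):

mdadm --stop /dev/md0                       # clear any partial assembly left over from boot
mdadm -Afv /dev/md0 /dev/sd[abcefghjklm]1   # forced assembly of the 11 trustworthy members
# ...check the filesystem and refresh backups here...
mdadm --add /dev/md0 /dev/sdd1              # re-add the 12th drive; the rebuild starts
cat /proc/mdstat                            # watch the rebuild progress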

> Thanks in advance,
> Florian Lampel
> 
> PS: Thanks again to Phil for pointing out that --create would be madness.

One more thing:  your drives report never having a self-test run.  You
should have a cron job that triggers a long background self-test on a
regular basis.  Weekly, perhaps.

Similarly, you should have a cron job trigger an occasional "check"
scrub on the array, too.  Not at the same time as the self-tests,
though.  (I understand some distributions have this already.)
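
As a rough illustration only (the schedule and device list are assumptions,
and your distribution may already ship equivalents via smartd or a checkarray
cron job), entries in an /etc/cron.d file along these lines would cover both:

# long SMART self-test on every array member, Sundays at 03:00
0 3 * * 0   root   for x in /dev/sd[a-h] /dev/sd[j-m]; do smartctl -t long $x; done
# md "check" scrub on the 1st of each month at 03:00 (not the same day as the self-tests)
0 3 1 * *   root   echo check > /sys/block/md0/md/sync_action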

HTH,

Phil

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RAID6 dead on the water after Controller failure
  2014-02-15 15:12     ` Phil Turmel
@ 2014-02-15 18:52       ` Florian Lampel
  2014-02-15 19:00         ` Phil Turmel
  2014-02-15 22:04       ` Jon Nelson
  1 sibling, 1 reply; 13+ messages in thread
From: Florian Lampel @ 2014-02-15 18:52 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

On 15.02.2014 at 16:12, Phil Turmel <philip@turmel.org> wrote:

> Good morning Florian,

Good Evening - it's 19:37 here in Austria.

> Device order has changed, summary:
> 
> /dev/sda1: WD-WMC300595440 Device #4 @442
> /dev/sdb1: WD-WMC300595880 Device #5 @442
> /dev/sdc1: WD-WMC1T1521826 Device #6 @442
> /dev/sdd1: WD-WMC300314126 spare
> /dev/sde1: WD-WMC300595645 Device #8 @435
> /dev/sdf1: WD-WMC300314217 Device #9 @435
> /dev/sdg1: WD-WMC300595957 Device #10 @435
> /dev/sdh1: WD-WMC300313432 Device #11 @435
> /dev/sdj1: WD-WMC300312702 Device #0 @442
> /dev/sdk1: WD-WMC300248734 Device #1 @442
> /dev/sdl1: WD-WMC300314248 Device #2 @442
> /dev/sdm1: WD-WMC300585843 Device #3 @442
> 
> and your SSD is now /dev/sdi.

Thank you again for going through all those logs and helping me. 

> Not quite.  What was 'h' is now 'd'.  Use:
> 
> mdadm -Afv /dev/md0 /dev/sd[abcefghjklm]1

Well, that did not go as well as I had hoped. Here is what happened:

root@Lserve:~# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
root@Lserve:~# mdadm -Afv /dev/md0 /dev/sd[abcefghjklm]1
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda1 is identified as a member of /dev/md0, slot 4.
mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 5.
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 6.
mdadm: /dev/sde1 is identified as a member of /dev/md0, slot 8.
mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 9.
mdadm: /dev/sdg1 is identified as a member of /dev/md0, slot 10.
mdadm: /dev/sdh1 is identified as a member of /dev/md0, slot 11.
mdadm: /dev/sdj1 is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdk1 is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdl1 is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdm1 is identified as a member of /dev/md0, slot 3.
mdadm: forcing event count in /dev/sde1(8) from 435 upto 442
mdadm: forcing event count in /dev/sdf1(9) from 435 upto 442
mdadm: forcing event count in /dev/sdg1(10) from 435 upto 442
mdadm: forcing event count in /dev/sdh1(11) from 435 upto 442
mdadm: clearing FAULTY flag for device 3 in /dev/md0 for /dev/sde1
mdadm: clearing FAULTY flag for device 4 in /dev/md0 for /dev/sdf1
mdadm: clearing FAULTY flag for device 5 in /dev/md0 for /dev/sdg1
mdadm: clearing FAULTY flag for device 6 in /dev/md0 for /dev/sdh1
mdadm: Marking array /dev/md0 as 'clean'
mdadm: added /dev/sdk1 to /dev/md0 as 1
mdadm: added /dev/sdl1 to /dev/md0 as 2
mdadm: added /dev/sdm1 to /dev/md0 as 3
mdadm: added /dev/sda1 to /dev/md0 as 4
mdadm: added /dev/sdb1 to /dev/md0 as 5
mdadm: added /dev/sdc1 to /dev/md0 as 6
mdadm: no uptodate device for slot 7 of /dev/md0
mdadm: added /dev/sde1 to /dev/md0 as 8
mdadm: added /dev/sdf1 to /dev/md0 as 9
mdadm: added /dev/sdg1 to /dev/md0 as 10
mdadm: added /dev/sdh1 to /dev/md0 as 11
mdadm: added /dev/sdj1 to /dev/md0 as 0
mdadm: /dev/md0 assembled from 11 drives - not enough to start the array.

AND:

cat /proc/mdstat:

cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : inactive sdj1[0](S) sdh1[11](S) sdg1[10](S) sdf1[9](S) sde1[8](S) sdc1[6](S) sdb1[5](S) sda1[4](S) sdm1[3](S) sdl1[2](S) sdk1[1](S)
      21488646696 blocks super 1.0
       
unused devices: <none>

Seems like every HDD got marked as a spare. Why would mdadm do this, and how can I convince mdadm that they are not spares?


> That would be a good time to backup any critical data that isn't
> already in a backup.

Crashplan was at about 30% before this happened. 20 TB is a lot to upload.

> One more thing:  your drives report never having a self-test run.  You
> should have a cron job that triggers a long background self-test on a
> regular basis.  Weekly, perhaps.
> 
> Similarly, you should have a cron job trigger an occasional "check"
> scrub on the array, too.  Not at the same time as the self-tests,
> though.  (I understand some distributions have this already.)

I will certainly do so in the future.

Thanks again everyone, and I hope this will all end well.

Thanks,
Florian Lampel


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RAID6 dead on the water after Controller failure
       [not found]     ` <CADNH=7EiY18TJDBDQsT6LDtw+Ft_2XCFaP30uK7uJb_e7xKhsQ@mail.gmail.com>
@ 2014-02-15 18:56       ` Florian Lampel
  0 siblings, 0 replies; 13+ messages in thread
From: Florian Lampel @ 2014-02-15 18:56 UTC (permalink / raw)
  To: Mathias Burén; +Cc: linux-raid


On 15.02.2014 at 15:00, Mathias Burén <mathias.buren@gmail.com> wrote:

> Hi,
> 
> Judging by the CRC errors on many drives, I'm inclined to ask you if
> you use a backplane? I would at the very least replace the SAS/SATA
> cable(s).
> 
> Good luck,
> Mathias

Why yes - I do use a backplane that houses four 3.5" drives in three 5.25" slots.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RAID6 dead on the water after Controller failure
  2014-02-15 18:52       ` Florian Lampel
@ 2014-02-15 19:00         ` Phil Turmel
  2014-02-15 19:01           ` Phil Turmel
  2014-02-15 19:09           ` Bakk. Florian Lampel
  0 siblings, 2 replies; 13+ messages in thread
From: Phil Turmel @ 2014-02-15 19:00 UTC (permalink / raw)
  To: Florian Lampel; +Cc: linux-raid

Hi Florian,

On 02/15/2014 01:52 PM, Florian Lampel wrote:
> Well, that did not go as well as I had hoped. Here is what happened:
> 
> root@Lserve:~# mdadm --stop /dev/md0
> mdadm: stopped /dev/md0
> root@Lserve:~# mdadm -Afv /dev/md0 /dev/sd[abcefghjklm]1
> mdadm: looking for devices for /dev/md0
> mdadm: /dev/sda1 is identified as a member of /dev/md0, slot 4.
> mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 5.
> mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 6.
> mdadm: /dev/sde1 is identified as a member of /dev/md0, slot 8.
> mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 9.
> mdadm: /dev/sdg1 is identified as a member of /dev/md0, slot 10.
> mdadm: /dev/sdh1 is identified as a member of /dev/md0, slot 11.
> mdadm: /dev/sdj1 is identified as a member of /dev/md0, slot 0.
> mdadm: /dev/sdk1 is identified as a member of /dev/md0, slot 1.
> mdadm: /dev/sdl1 is identified as a member of /dev/md0, slot 2.
> mdadm: /dev/sdm1 is identified as a member of /dev/md0, slot 3.
> mdadm: forcing event count in /dev/sde1(8) from 435 upto 442
> mdadm: forcing event count in /dev/sdf1(9) from 435 upto 442
> mdadm: forcing event count in /dev/sdg1(10) from 435 upto 442
> mdadm: forcing event count in /dev/sdh1(11) from 435 upto 442
> mdadm: clearing FAULTY flag for device 3 in /dev/md0 for /dev/sde1
> mdadm: clearing FAULTY flag for device 4 in /dev/md0 for /dev/sdf1
> mdadm: clearing FAULTY flag for device 5 in /dev/md0 for /dev/sdg1
> mdadm: clearing FAULTY flag for device 6 in /dev/md0 for /dev/sdh1
> mdadm: Marking array /dev/md0 as 'clean'
> mdadm: added /dev/sdk1 to /dev/md0 as 1
> mdadm: added /dev/sdl1 to /dev/md0 as 2
> mdadm: added /dev/sdm1 to /dev/md0 as 3
> mdadm: added /dev/sda1 to /dev/md0 as 4
> mdadm: added /dev/sdb1 to /dev/md0 as 5
> mdadm: added /dev/sdc1 to /dev/md0 as 6
> mdadm: no uptodate device for slot 7 of /dev/md0
> mdadm: added /dev/sde1 to /dev/md0 as 8
> mdadm: added /dev/sdf1 to /dev/md0 as 9
> mdadm: added /dev/sdg1 to /dev/md0 as 10
> mdadm: added /dev/sdh1 to /dev/md0 as 11
> mdadm: added /dev/sdj1 to /dev/md0 as 0
> mdadm: /dev/md0 assembled from 11 drives - not enough to start the array.
> 
> AND:
> 
> cat /proc/mdstat:
> 
> cat /proc/mdstat 
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md0 : inactive sdj1[0](S) sdh1[11](S) sdg1[10](S) sdf1[9](S) sde1[8](S) sdc1[6](S) sdb1[5](S) sda1[4](S) sdm1[3](S) sdl1[2](S) sdk1[1](S)
>       21488646696 blocks super 1.0
>        
> unused devices: <none>
> 
> Seems like every HDD got marked as a spare. Why would mdadm do this, and how can I convince mdadm that they are not spares?

Ok.  It seems you also need "--run".

Try:

mdadm --stop /dev/md0
mdadm -AfRv /dev/md0 /dev/sd[abcefghjklm]1

Also, what kernel version and mdadm version are you using?

Phil

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RAID6 dead on the water after Controller failure
  2014-02-15 19:00         ` Phil Turmel
@ 2014-02-15 19:01           ` Phil Turmel
  2014-02-15 19:09           ` Bakk. Florian Lampel
  1 sibling, 0 replies; 13+ messages in thread
From: Phil Turmel @ 2014-02-15 19:01 UTC (permalink / raw)
  To: Florian Lampel; +Cc: linux-raid

On 02/15/2014 02:00 PM, Phil Turmel wrote:
> Hi Florian,
> 
> mdadm --stop /dev/md0
> mdadm -AfRv /dev/md0 /dev/sd[abcefghjklm]1

Actually, you can probably just issue "mdadm --run /dev/md0" instead of
stopping and re-assembling.

Phil


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RAID6 dead on the water after Controller failure
  2014-02-15 19:00         ` Phil Turmel
  2014-02-15 19:01           ` Phil Turmel
@ 2014-02-15 19:09           ` Bakk. Florian Lampel
  1 sibling, 0 replies; 13+ messages in thread
From: Bakk. Florian Lampel @ 2014-02-15 19:09 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

Wow, that was fast. Thanks so much, Guys!

The RAID is running now, and I'm copying everything off it.

...yay...

Florian Lampel


Sent from my iPad

> On 15.02.2014 at 20:00, Phil Turmel <philip@turmel.org> wrote:
> 
> Hi Florian,
> 
>> On 02/15/2014 01:52 PM, Florian Lampel wrote:
>> Well, that did not go as well as I had hoped. Here is what happened:
>> 
>> root@Lserve:~# mdadm --stop /dev/md0
>> mdadm: stopped /dev/md0
>> root@Lserve:~# mdadm -Afv /dev/md0 /dev/sd[abcefghjklm]1
>> mdadm: looking for devices for /dev/md0
>> mdadm: /dev/sda1 is identified as a member of /dev/md0, slot 4.
>> mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 5.
>> mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 6.
>> mdadm: /dev/sde1 is identified as a member of /dev/md0, slot 8.
>> mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 9.
>> mdadm: /dev/sdg1 is identified as a member of /dev/md0, slot 10.
>> mdadm: /dev/sdh1 is identified as a member of /dev/md0, slot 11.
>> mdadm: /dev/sdj1 is identified as a member of /dev/md0, slot 0.
>> mdadm: /dev/sdk1 is identified as a member of /dev/md0, slot 1.
>> mdadm: /dev/sdl1 is identified as a member of /dev/md0, slot 2.
>> mdadm: /dev/sdm1 is identified as a member of /dev/md0, slot 3.
>> mdadm: forcing event count in /dev/sde1(8) from 435 upto 442
>> mdadm: forcing event count in /dev/sdf1(9) from 435 upto 442
>> mdadm: forcing event count in /dev/sdg1(10) from 435 upto 442
>> mdadm: forcing event count in /dev/sdh1(11) from 435 upto 442
>> mdadm: clearing FAULTY flag for device 3 in /dev/md0 for /dev/sde1
>> mdadm: clearing FAULTY flag for device 4 in /dev/md0 for /dev/sdf1
>> mdadm: clearing FAULTY flag for device 5 in /dev/md0 for /dev/sdg1
>> mdadm: clearing FAULTY flag for device 6 in /dev/md0 for /dev/sdh1
>> mdadm: Marking array /dev/md0 as 'clean'
>> mdadm: added /dev/sdk1 to /dev/md0 as 1
>> mdadm: added /dev/sdl1 to /dev/md0 as 2
>> mdadm: added /dev/sdm1 to /dev/md0 as 3
>> mdadm: added /dev/sda1 to /dev/md0 as 4
>> mdadm: added /dev/sdb1 to /dev/md0 as 5
>> mdadm: added /dev/sdc1 to /dev/md0 as 6
>> mdadm: no uptodate device for slot 7 of /dev/md0
>> mdadm: added /dev/sde1 to /dev/md0 as 8
>> mdadm: added /dev/sdf1 to /dev/md0 as 9
>> mdadm: added /dev/sdg1 to /dev/md0 as 10
>> mdadm: added /dev/sdh1 to /dev/md0 as 11
>> mdadm: added /dev/sdj1 to /dev/md0 as 0
>> mdadm: /dev/md0 assembled from 11 drives - not enough to start the array.
>> 
>> AND:
>> 
>> cat /proc/mdstat:
>> 
>> cat /proc/mdstat 
>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
>> md0 : inactive sdj1[0](S) sdh1[11](S) sdg1[10](S) sdf1[9](S) sde1[8](S) sdc1[6](S) sdb1[5](S) sda1[4](S) sdm1[3](S) sdl1[2](S) sdk1[1](S)
>>      21488646696 blocks super 1.0
>> 
>> unused devices: <none>
>> 
>> Seems like every HDD got marked as a spare. Why would mdadm do this, and how can I convince mdadm that they are not spares?
> 
> Ok.  It seems you also need "--run".
> 
> Try:
> 
> mdadm --stop /dev/md0
> mdadm -AfRv /dev/md0 /dev/sd[abcefghjklm]1
> 
> Also, what kernel version and mdadm version are you using?
> 
> Phil

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RAID6 dead on the water after Controller failure
  2014-02-15 15:12     ` Phil Turmel
  2014-02-15 18:52       ` Florian Lampel
@ 2014-02-15 22:04       ` Jon Nelson
  2014-02-15 23:04         ` Mikael Abrahamsson
  1 sibling, 1 reply; 13+ messages in thread
From: Jon Nelson @ 2014-02-15 22:04 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Florian Lampel, LinuxRaid

On Sat, Feb 15, 2014 at 9:12 AM, Phil Turmel <philip@turmel.org> wrote:
> Good morning Florian,

I'm very pleased to have observed this interaction. What an excellent
example, and I'm glad you've got your raid running again.

However, I was hoping I might have some questions answered.

...
> Device order has changed, summary:
>
> /dev/sda1: WD-WMC300595440 Device #4 @442
> /dev/sdb1: WD-WMC300595880 Device #5 @442
> /dev/sdc1: WD-WMC1T1521826 Device #6 @442
> /dev/sdd1: WD-WMC300314126 spare
> /dev/sde1: WD-WMC300595645 Device #8 @435
> /dev/sdf1: WD-WMC300314217 Device #9 @435
> /dev/sdg1: WD-WMC300595957 Device #10 @435
> /dev/sdh1: WD-WMC300313432 Device #11 @435
> /dev/sdj1: WD-WMC300312702 Device #0 @442
> /dev/sdk1: WD-WMC300248734 Device #1 @442
> /dev/sdl1: WD-WMC300314248 Device #2 @442
> /dev/sdm1: WD-WMC300585843 Device #3 @442

So there are 7 drives with event count 442, 4 drives with event count
435, and a single spare.
...

> After assembly, your array will be single-degraded but fully functional.
>  That would be a good time to backup any critical data that isn't
> already in a backup.

Out of 12 drives, I thought RAID6 only offered a total of *2* failed devices.
It seems to me that you have 7 devices in sync and 4 *almost* in sync.
It's this "almost" part that has me confused. How can the raid run if
the event count doesn't match? Wouldn't at least 10 out of 12 drives
have to have the same event count to avoid data loss?


-- 
Jon

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RAID6 dead on the water after Controller failure
  2014-02-15 22:04       ` Jon Nelson
@ 2014-02-15 23:04         ` Mikael Abrahamsson
  2014-02-15 23:23           ` Jon Nelson
  0 siblings, 1 reply; 13+ messages in thread
From: Mikael Abrahamsson @ 2014-02-15 23:04 UTC (permalink / raw)
  To: Jon Nelson; +Cc: Phil Turmel, Florian Lampel, LinuxRaid

On Sat, 15 Feb 2014, Jon Nelson wrote:

> Out of 12 drives, I thought RAID6 only offered a total of *2* failed 
> devices. It seems to me that you have 7 devices in sync and 4 *almost* 
> in sync. It's this "almost" part that has me confused. How can the raid 
> run if the event count doesn't match? Wouldn't at least 10 out of 12 
> drives have to have the same event count to avoid data loss?

Correct. When you use --assemble --force you're basically telling mdadm "I 
know what I'm doing and I'll take the risk of data loss or corruption". If 
you assemble in with a kicked drive that was kicked long ago that has a 
really far off event count, you can really really screw things up.

Unless you use --force, mdadm won't assemble an array where the event 
count doesn't match up.
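
A quick way to compare the per-member event counts before deciding whether
--force is safe (a sketch; adjust the partition list to your setup):

mdadm -E /dev/sd[a-m]1 | grep -E '^/dev|Events'   # prints each member and its event count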

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RAID6 dead on the water after Controller failure
  2014-02-15 23:04         ` Mikael Abrahamsson
@ 2014-02-15 23:23           ` Jon Nelson
  2014-02-16  3:49             ` Phil Turmel
  0 siblings, 1 reply; 13+ messages in thread
From: Jon Nelson @ 2014-02-15 23:23 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: Phil Turmel, Florian Lampel, LinuxRaid

On Sat, Feb 15, 2014 at 5:04 PM, Mikael Abrahamsson <swmike@swm.pp.se> wrote:
> On Sat, 15 Feb 2014, Jon Nelson wrote:
>
>> Out of 12 drives, I thought RAID6 only offered a total of *2* failed
>> devices. It seems to me that you have 7 devices in sync and 4 *almost* in
>> sync. It's this "almost" part that has me confused. How can the raid run if
>> the event count doesn't match? Wouldn't at least 10 out of 12 drives have to
>> have the same event count to avoid data loss?
>
>
> Correct. When you use --assemble --force you're basically telling mdadm "I
> know what I'm doing and I'll take the risk of data loss or corruption". If
> you assemble with a drive that was kicked long ago and has a really far-off
> event count, you can really screw things up.
>
> Unless you use --force, mdadm won't assemble an array where the event count
> doesn't match up.

Aha. So if you ran a check in this case, it would find some number of
blocks that don't match up. What does MD do in that case?

-- 
Jon

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RAID6 dead on the water after Controller failure
  2014-02-15 23:23           ` Jon Nelson
@ 2014-02-16  3:49             ` Phil Turmel
  0 siblings, 0 replies; 13+ messages in thread
From: Phil Turmel @ 2014-02-16  3:49 UTC (permalink / raw)
  To: Jon Nelson, Mikael Abrahamsson; +Cc: Florian Lampel, LinuxRaid

On 02/15/2014 06:23 PM, Jon Nelson wrote:
> On Sat, Feb 15, 2014 at 5:04 PM, Mikael Abrahamsson <swmike@swm.pp.se> wrote:
>> On Sat, 15 Feb 2014, Jon Nelson wrote:
>>
>>> Out of 12 drives, I thought RAID6 only offered a total of *2* failed
>>> devices. It seems to me that you have 7 devices in sync and 4 *almost* in
>>> sync. It's this "almost" part that has me confused. How can the raid run if
>>> the event count doesn't match? Wouldn't at least 10 out of 12 drives have to
>>> have the same event count to avoid data loss?
>>
>>
>> Correct. When you use --assemble --force you're basically telling mdadm "I
>> know what I'm doing and I'll take the risk of data loss or corruption". If
>> you assemble with a drive that was kicked long ago and has a really far-off
>> event count, you can really screw things up.
>>
>> Unless you use --force, mdadm won't assemble an array where the event count
>> doesn't match up.
> 
> Aha. So if you ran a check in this case, it would find some number of
> blocks that don't match up. What does MD do in that case?

Nothing other than reporting it in the mismatch_cnt.  If you then
perform a "repair" scrub, it will regenerate P&Q from the data blocks,
period.
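
For reference, a minimal sketch of driving this through sysfs (assuming the
array is md0):

echo check > /sys/block/md0/md/sync_action    # read-only scrub; only counts mismatches
cat /sys/block/md0/md/mismatch_cnt            # how many sectors disagreed
echo repair > /sys/block/md0/md/sync_action   # regenerate P/Q from the data blocks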

If you think about it, five of the seven events are easily explained:
four dropped drives, plus cancellation of the rebuild onto what was then
/dev/sdh1.  So the opportunity for other corruption was small.  This is
precisely the scenario where forced assembly makes sense.  The drives
were failed for reasons other than actual drive failure.  The use of
--force tells mdadm that those drives aren't really bad.  But I left out
the partially rebuilt drive because it really couldn't be trusted.

HTH,

Phil



^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2014-02-16  3:49 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-14 16:19 RAID6 dead on the water after Controller failure Florian Lampel
2014-02-14 20:35 ` Phil Turmel
2014-02-15 12:31   ` Florian Lampel
2014-02-15 15:12     ` Phil Turmel
2014-02-15 18:52       ` Florian Lampel
2014-02-15 19:00         ` Phil Turmel
2014-02-15 19:01           ` Phil Turmel
2014-02-15 19:09           ` Bakk. Florian Lampel
2014-02-15 22:04       ` Jon Nelson
2014-02-15 23:04         ` Mikael Abrahamsson
2014-02-15 23:23           ` Jon Nelson
2014-02-16  3:49             ` Phil Turmel
     [not found]     ` <CADNH=7EiY18TJDBDQsT6LDtw+Ft_2XCFaP30uK7uJb_e7xKhsQ@mail.gmail.com>
2014-02-15 18:56       ` Florian Lampel
