* RAID10 failure(s)
@ 2011-02-14 16:09 Mark Keisler
  2011-02-14 20:33 ` Mark Keisler
  0 siblings, 1 reply; 11+ messages in thread
From: Mark Keisler @ 2011-02-14 16:09 UTC (permalink / raw)
  To: linux-raid

Sorry in advance for the long email :)


I had a RAID10 array set up using 4 WD 1TB Caviar Black drives (SATA3)
on a 64-bit 2.6.36 kernel, using mdadm 3.1.4.  I noticed last night
that one drive had faulted out of the array.  It had a bunch of errors
like so:

Feb  8 03:39:48 samsara kernel: [41330.835285] ata3.00: exception
Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Feb  8 03:39:48 samsara kernel: [41330.835288] ata3.00: irq_stat 0x40000008
Feb  8 03:39:48 samsara kernel: [41330.835292] ata3.00: failed
command: READ FPDMA QUEUED
Feb  8 03:39:48 samsara kernel: [41330.835297] ata3.00: cmd
60/f8:00:f8:9a:45/00:00:04:00:00/40 tag 0 ncq 126976 in
Feb  8 03:39:48 samsara kernel: [41330.835297]          res
41/40:00:70:9b:45/00:00:04:00:00/40 Emask 0x409 (media error) <F>
Feb  8 03:39:48 samsara kernel: [41330.835300] ata3.00: status: { DRDY ERR }
Feb  8 03:39:48 samsara kernel: [41330.835301] ata3.00: error: { UNC }
Feb  8 03:39:48 samsara kernel: [41330.839776] ata3.00: configured for UDMA/133
Feb  8 03:39:48 samsara kernel: [41330.839788] ata3: EH complete
....

Feb  8 03:39:58 samsara kernel: [41340.423236] sd 2:0:0:0: [sdc]
Unhandled sense code
Feb  8 03:39:58 samsara kernel: [41340.423238] sd 2:0:0:0: [sdc]
Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb  8 03:39:58 samsara kernel: [41340.423240] sd 2:0:0:0: [sdc]
Sense Key : Medium Error [current] [descriptor]
Feb  8 03:39:58 samsara kernel: [41340.423243] Descriptor sense data
with sense descriptors (in hex):
Feb  8 03:39:58 samsara kernel: [41340.423244]         72 03 11 04 00
00 00 0c 00 0a 80 00 00 00 00 00
Feb  8 03:39:58 samsara kernel: [41340.423249]         04 45 9b 70
Feb  8 03:39:58 samsara kernel: [41340.423251] sd 2:0:0:0: [sdc]  Add.
Sense: Unrecovered read error - auto reallocate failed
Feb  8 03:39:58 samsara kernel: [41340.423254] sd 2:0:0:0: [sdc] CDB:
Read(10): 28 00 04 45 9a f8 00 00 f8 00
Feb  8 03:39:58 samsara kernel: [41340.423259] end_request: I/O error,
dev sdc, sector 71670640
Feb  8 03:39:58 samsara kernel: [41340.423262] md/raid10:md0: sdc1:
rescheduling sector 143332600
....
Feb  8 03:40:10 samsara kernel: [41351.940796] md/raid10:md0: read
error corrected (8 sectors at 2168 on sdc1)
Feb  8 03:40:10 samsara kernel: [41351.954972] md/raid10:md0: sdb1:
redirecting sector 143332600 to another mirror

and so on until:
Feb  8 03:55:01 samsara kernel: [42243.609414] md/raid10:md0: sdc1:
Raid device exceeded read_error threshold [cur 21:max 20]
Feb  8 03:55:01 samsara kernel: [42243.609417] md/raid10:md0: sdc1:
Failing raid device
Feb  8 03:55:01 samsara kernel: [42243.609419] md/raid10:md0: Disk
failure on sdc1, disabling device.
Feb  8 03:55:01 samsara kernel: [42243.609420] <1>md/raid10:md0:
Operation continuing on 3 devices.
Feb  8 03:55:01 samsara kernel: [42243.609423] md/raid10:md0: sdb1:
redirecting sector 143163888 to another mirror
Feb  8 03:55:01 samsara kernel: [42243.609650] md/raid10:md0: sdb1:
redirecting sector 143164416 to another mirror
Feb  8 03:55:01 samsara kernel: [42243.610095] md/raid10:md0: sdb1:
redirecting sector 143164664 to another mirror
Feb  8 03:55:01 samsara kernel: [42243.633814] RAID10 conf printout:
Feb  8 03:55:01 samsara kernel: [42243.633817]  --- wd:3 rd:4
Feb  8 03:55:01 samsara kernel: [42243.633820]  disk 0, wo:0, o:1, dev:sdb1
Feb  8 03:55:01 samsara kernel: [42243.633821]  disk 1, wo:1, o:0, dev:sdc1
Feb  8 03:55:01 samsara kernel: [42243.633823]  disk 2, wo:0, o:1, dev:sdd1
Feb  8 03:55:01 samsara kernel: [42243.633824]  disk 3, wo:0, o:1, dev:sde1
Feb  8 03:55:01 samsara kernel: [42243.645880] RAID10 conf printout:
Feb  8 03:55:01 samsara kernel: [42243.645883]  --- wd:3 rd:4
Feb  8 03:55:01 samsara kernel: [42243.645885]  disk 0, wo:0, o:1, dev:sdb1
Feb  8 03:55:01 samsara kernel: [42243.645887]  disk 2, wo:0, o:1, dev:sdd1
Feb  8 03:55:01 samsara kernel: [42243.645888]  disk 3, wo:0, o:1, dev:sde1


This seemed weird as the machine is only a week or two old.  I powered
down to open it up and get the serial number off the drive for an RMA.
I powered back up and mdadm had automatically removed the drive from
the RAID.  Fine.  The RAID had already been running on just 3 disks
since the 8th.  For some reason, I decided to add the drive back into
the array to see if it would fail out again, figuring that worst case
I'd just be back to a degraded RAID10.  So I added it back in, checked
on it with mdadm --detail a little while later, and found this:
samsara log # mdadm --detail /dev/md0
/dev/md0:
       Version : 1.2
 Creation Time : Sat Feb  5 22:00:52 2011
    Raid Level : raid10
    Array Size : 1953519104 (1863.02 GiB 2000.40 GB)
 Used Dev Size : 976759552 (931.51 GiB 1000.20 GB)
  Raid Devices : 4
 Total Devices : 4
   Persistence : Superblock is persistent

   Update Time : Mon Feb 14 00:04:46 2011
         State : clean, FAILED, recovering
 Active Devices : 2
Working Devices : 2
 Failed Devices : 2
 Spare Devices : 0

        Layout : near=2
    Chunk Size : 256K

 Rebuild Status : 99% complete

          Name : samsara:0  (local to host samsara)
          UUID : 26804ec8:a20a4365:bc7d5b4e:653ade03
        Events : 30348

   Number   Major   Minor   RaidDevice State
      0       8       17        0      faulty spare rebuilding   /dev/sdb1
      1       8       33        1      faulty spare rebuilding   /dev/sdc1
      2       8       49        2      active sync   /dev/sdd1
      3       8       65        3      active sync   /dev/sde1
samsara log # exit

It had faulted drive 0 also during the rebuild.
[ 1177.064359] RAID10 conf printout:
[ 1177.064362]  --- wd:2 rd:4
[ 1177.064365]  disk 0, wo:1, o:0, dev:sdb1
[ 1177.064367]  disk 1, wo:1, o:0, dev:sdc1
[ 1177.064368]  disk 2, wo:0, o:1, dev:sdd1
[ 1177.064370]  disk 3, wo:0, o:1, dev:sde1
[ 1177.073325] RAID10 conf printout:
[ 1177.073328]  --- wd:2 rd:4
[ 1177.073330]  disk 0, wo:1, o:0, dev:sdb1
[ 1177.073332]  disk 2, wo:0, o:1, dev:sdd1
[ 1177.073333]  disk 3, wo:0, o:1, dev:sde1
[ 1177.073340] RAID10 conf printout:
[ 1177.073341]  --- wd:2 rd:4
[ 1177.073342]  disk 0, wo:1, o:0, dev:sdb1
[ 1177.073343]  disk 2, wo:0, o:1, dev:sdd1
[ 1177.073344]  disk 3, wo:0, o:1, dev:sde1
[ 1177.083323] RAID10 conf printout:
[ 1177.083326]  --- wd:2 rd:4
[ 1177.083329]  disk 2, wo:0, o:1, dev:sdd1
[ 1177.083330]  disk 3, wo:0, o:1, dev:sde1


So the RAID ended up being marked "clean, FAILED."  Gee, glad it is
clean at least ;).  I'm wondering what went wrong and whether a double
disk failure like that actually makes sense.  I can't even
force-assemble the array anymore:
 # mdadm --assemble --verbose --force /dev/md0
mdadm: looking for devices for /dev/md0
mdadm: cannot open device /dev/sde1: Device or resource busy
mdadm: /dev/sde1 has wrong uuid.
mdadm: cannot open device /dev/sdd1: Device or resource busy
mdadm: /dev/sdd1 has wrong uuid.
mdadm: cannot open device /dev/sdc1: Device or resource busy
mdadm: /dev/sdc1 has wrong uuid.
mdadm: cannot open device /dev/sdb1: Device or resource busy
mdadm: /dev/sdb1 has wrong uuid.
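
(The "Device or resource busy" / "wrong uuid" errors usually mean the member
devices are still claimed by a partially assembled /dev/md0.  A minimal sketch
of freeing them before retrying; the device names are the ones above, the
rest is an assumption rather than something tried in this thread:)

 # cat /proc/mdstat           # is md0 still listed, perhaps as inactive?
 # mdadm --stop /dev/md0      # release the member devices
 # mdadm --assemble --verbose --force /dev/md0 /dev/sd[bcde]1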

Am I totally SOL?  Thanks for any suggestions or things to try.

--
Mark
Tact is the ability to tell a man he has an open mind when he has a
hole in his head.


* Re: RAID10 failure(s)
  2011-02-14 16:09 RAID10 failure(s) Mark Keisler
@ 2011-02-14 20:33 ` Mark Keisler
  2011-02-14 22:29   ` Stan Hoeppner
  2011-02-14 22:48   ` NeilBrown
  0 siblings, 2 replies; 11+ messages in thread
From: Mark Keisler @ 2011-02-14 20:33 UTC (permalink / raw)
  To: linux-raid

Sorry for the double-post on the original.
I also left out the fact that I rebooted after drive 0 also reported a
fault, and now mdadm won't start the array at all.  I'm not sure how to
tell which members were in the two RAID0 groups.  I would think that if
I have a RAID0 pair left from the RAID10, I should be able to recover
somehow.  Not sure if that was drives 0 and 2, 1 and 3, or 0 and 1,
2 and 3.

Anyway, the drives do still show the correct array UUID when queried
with mdadm -E, but they disagree about the state of the array:
# mdadm -E /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 | grep 'Array State'
   Array State : AAAA ('A' == active, '.' == missing)
   Array State : .AAA ('A' == active, '.' == missing)
   Array State : ..AA ('A' == active, '.' == missing)
   Array State : ..AA ('A' == active, '.' == missing)

sdc still shows a recovery offset, too:

/dev/sdb1:
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
/dev/sdc1:
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
Recovery Offset : 2 sectors
/dev/sdd1:
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
/dev/sde1:
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
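
(One hedged way to judge which members have the freshest superblocks before
deciding what to reassemble is to compare per-device event counts and update
times; this is a general mdadm technique, not something run in this thread:)

 # mdadm -E /dev/sd[bcde]1 | grep -E 'Events|Update Time'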

I did some searching on the "READ FPDMA QUEUED" error message that my
drive was reporting and found that there seems to be a correlation
between it and having AHCI (NCQ in particular) enabled.  I've now set
my BIOS back to Native IDE (which was the default anyway) instead of
AHCI for the SATA setting.  I'm hoping that was the issue.
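
(If NCQ is the suspect, its depth can also be checked and limited per drive
from Linux without switching the controller mode; a sketch assuming the
standard SCSI sysfs knob, where a depth of 1 effectively disables NCQ:)

 # cat /sys/block/sdc/device/queue_depth
 # echo 1 > /sys/block/sdc/device/queue_depth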

Still wondering if there is some magic to be done to get at my data again :)

--
Mark
Tact is the ability to tell a man he has an open mind when he has a
hole in his head.



On Mon, Feb 14, 2011 at 10:09 AM, Mark Keisler <grimm26@gmail.com> wrote:
> [...]

* Re: RAID10 failure(s)
  2011-02-14 20:33 ` Mark Keisler
@ 2011-02-14 22:29   ` Stan Hoeppner
       [not found]     ` <AANLkTikyjuUo=X3m5Q6L9x3vXbgV1KC+TfOaLN2z4Keo@mail.gmail.com>
  2011-02-14 22:48   ` NeilBrown
  1 sibling, 1 reply; 11+ messages in thread
From: Stan Hoeppner @ 2011-02-14 22:29 UTC (permalink / raw)
  To: Mark Keisler; +Cc: linux-raid

Mark Keisler put forth on 2/14/2011 2:33 PM:

> Still wondering if there is some magic to be done to get at my data again :)
>>
>> Am I totally SOL?  Thanks for any suggestions or things to try.
>>
>> --
>> Mark
>> Tact is the ability to tell a man he has an open mind when he has a
>> hole in his head.

Interesting, and ironically appropriate, sig, Mark.

No magic is required.  Simply wipe each disk by writing all zeros with dd.  You
can do all 4 in parallel.  This will take a while with 1TB drives.  If there are
still SATA/NCQ/etc issues they should pop up while wiping the drives.  If not,
and all dd operations complete successfully, simply create a new RAID 10 array
and format it with your favorite filesystem.
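
(A minimal sketch of that wipe, assuming the members are still sdb through
sde as in the original post; the block size and direct I/O flag are just
reasonable defaults, and dmesg/syslog should be watched for ATA errors while
it runs:)

 # for d in sdb sdc sdd sde; do
 >   dd if=/dev/zero of=/dev/$d bs=1M oflag=direct &   # one writer per disk, all four in parallel
 > done; wait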

Then restore all your files from your backups.[1]

[1] Tact is the ability to tell a man he has an open mind when he has a hole in
his head.

-- 
Stan


* Re: RAID10 failure(s)
  2011-02-14 20:33 ` Mark Keisler
  2011-02-14 22:29   ` Stan Hoeppner
@ 2011-02-14 22:48   ` NeilBrown
  2011-02-14 23:08     ` Mark Keisler
  1 sibling, 1 reply; 11+ messages in thread
From: NeilBrown @ 2011-02-14 22:48 UTC (permalink / raw)
  To: Mark Keisler; +Cc: linux-raid

On Mon, 14 Feb 2011 14:33:03 -0600 Mark Keisler <grimm26@gmail.com> wrote:

> [...]
> Still wondering if there is some magic to be done to get at my data again :)

No need for magic here .. but you better stand back, as
  I'm going to try ... Science.
(or is that Engineering...)

 mdadm -S /dev/md0
 mdadm -C /dev/md0 -l10 -n4 -c256 missing /dev/sdc1 /dev/sdd1 /dev/sde1
 mdadm --wait /dev/md0
 mdadm /dev/md0 --add /dev/sdb1

(but be really sure that the devices really are working before you try this).
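
(A hedged sketch of one way to do that check before recreating: run SMART
self-tests on each member and read the results back.  smartctl from
smartmontools is assumed here; it is not part of the recipe above:)

 # for d in /dev/sd[bcde]; do smartctl -t short "$d"; done   # kick off short self-tests
 # smartctl -l selftest /dev/sdb                             # read a result once it completes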

BTW, for a near=2, Raid-disks=4 arrangement, the first and second devices
contain the same data, and the third and fourth devices also contain the
same data as each other (but obviously different from the first and second).

NeilBrown



* Re: RAID10 failure(s)
  2011-02-14 22:48   ` NeilBrown
@ 2011-02-14 23:08     ` Mark Keisler
  2011-02-14 23:20       ` NeilBrown
  0 siblings, 1 reply; 11+ messages in thread
From: Mark Keisler @ 2011-02-14 23:08 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Mon, Feb 14, 2011 at 4:48 PM, NeilBrown <neilb@suse.de> wrote:
> [...]
Ah, that's the kind of info that I was looking for.  So, the third and
fourth disks are a complete RAID0 set and the entire RAID10 should be
able to rebuild from them if I replace the first two disks with new
ones (hence being sure the devices are working)?  Or do I need to hope
the originals will hold up to a rebuild?

Thanks for the info, Neil, and all your work in FOSS :)

* Re: RAID10 failure(s)
  2011-02-14 23:08     ` Mark Keisler
@ 2011-02-14 23:20       ` NeilBrown
  2011-02-15  0:49         ` Mark Keisler
  0 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2011-02-14 23:20 UTC (permalink / raw)
  To: Mark Keisler; +Cc: linux-raid

On Mon, 14 Feb 2011 17:08:45 -0600 Mark Keisler <grimm26@gmail.com> wrote:

> [...]
> Ah, that's the kind of info that I was looking for.  So, the third and
> fourth disks are a complete RAID0 set and the entire RAID10 should be
> able to rebuild from them if I replace the first two disks with new
> ones (hence being sure the devices are working)?  Or do I need to hope
> the originals will hold up to a rebuild?

No.

Third and fourth are like a RAID1 set, not a RAID0 set.

First and second are a RAID1 pair.  Third and fourth are a RAID1 pair.

First and third
first and fourth
second and third
second and fourth

can each be seen as a RAID0 pair which contains all of the data.
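
(To make that concrete, a small illustrative sketch of the near=2, 4-disk
chunk placement; the arithmetic just restates the description above and is
not mdadm output:)

 $ for chunk in 0 1 2 3; do
 >   a=$(( (chunk % 2) * 2 )); b=$(( a + 1 ))   # copies land on an adjacent device pair
 >   echo "chunk $chunk -> disk $a and disk $b"
 > done
 chunk 0 -> disk 0 and disk 1
 chunk 1 -> disk 2 and disk 3
 chunk 2 -> disk 0 and disk 1
 chunk 3 -> disk 2 and disk 3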

NeilBrown




* Re: RAID10 failure(s)
       [not found]     ` <AANLkTikyjuUo=X3m5Q6L9x3vXbgV1KC+TfOaLN2z4Keo@mail.gmail.com>
@ 2011-02-15  0:40       ` Stan Hoeppner
  0 siblings, 0 replies; 11+ messages in thread
From: Stan Hoeppner @ 2011-02-15  0:40 UTC (permalink / raw)
  To: Mark Keisler, Linux RAID

Mark Keisler put forth on 2/14/2011 4:39 PM:
> On Mon, Feb 14, 2011 at 4:29 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> [...]
> 
> Well, that was completely unhelpful and devoid of any information.
> Backups don't keep a RAID from failing and that's what my question was
> about.  I don't want to spend all of my time rebuilding an array and
> restoring from backup every week.

"Backups don't keep a RAID from failing" -- good sig material ;)

It seems you lack the sense of humor implied by your signature.  Given that
fact, I can understand the defensiveness.  However, note that there are some
very helpful suggestions in my reply.  IIRC, your question wasn't "how to keep a
RAID from failing" but "why did one drive in my RAID 10 fail, and then another
during rebuild".  My suggestions could help you answer the first, possibly the
second.

You still don't know whether the first dropped drive is actually bad.  Zeroing
it with dd may very well reveal whether there is a real problem with it.
Checking your logs and SMART data during and afterward should tell you.
Zeroing all of them with dd gives you a clean slate for further troubleshooting.
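
(A sketch of what checking the logs and SMART data might look like while the
dd runs; the commands are assumptions, not quoted from anyone in the thread:)

 # dmesg | grep -iE 'ata[0-9]|medium error|i/o error'   # kernel-side ATA / media errors
 # smartctl -H /dev/sdc                                 # overall SMART health verdict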

You don't currently have a full backup.  While the reminder of such may have
irritated you, it is nonetheless very relevant, and useful, especially for other
list OPs not donning a Jimmy hat.  RAID is not a replacement for a proper backup
procedure.  You (re)discovered that fact here, or you simply believe, foolishly,
the opposite.

My reply was full of useful information.  Apparently just not useful to someone
who wants to cut corners without having to face the potential negative consequences.

RAID won't save you from massive filesystem corruption.  A proper backup can.
And if this scenario would have turned dire (or still does) it could save you
here as well.  Again, you need a proper backup solution.

-- 
Stan

Backups don't keep a RAID from failing. --Mark Keisler


* Re: RAID10 failure(s)
  2011-02-14 23:20       ` NeilBrown
@ 2011-02-15  0:49         ` Mark Keisler
  2011-02-15  0:57           ` NeilBrown
  0 siblings, 1 reply; 11+ messages in thread
From: Mark Keisler @ 2011-02-15  0:49 UTC (permalink / raw)
  To: linux-raid

On Mon, Feb 14, 2011 at 5:20 PM, NeilBrown <neilb@suse.de> wrote:
> [...]
Oh, duh, was thinking in 0+1 instead of 10.  I'm still wondering why
you made mention of "but be really sure that the devices really are
working before you try this."  If trying to bring the RAID back fails,
I'm just back to not having access to the data which is where I am now
:).

* Re: RAID10 failure(s)
  2011-02-15  0:49         ` Mark Keisler
@ 2011-02-15  0:57           ` NeilBrown
  2011-02-15 17:47             ` Mark Keisler
  0 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2011-02-15  0:57 UTC (permalink / raw)
  To: Mark Keisler; +Cc: linux-raid

On Mon, 14 Feb 2011 18:49:03 -0600 Mark Keisler <grimm26@gmail.com> wrote:

> Oh, duh, was thinking in 0+1 instead of 10.  I'm still wondering why
> you made mention of "but be really sure that the devices really are
> working before you try this."  If trying to bring the RAID back fails,
> I'm just back to not having access to the data which is where I am now
> :).

If you try reconstructing the array before you are sure you have resolved the
original problem (be it a BIOS setting, bad cables, a dodgy controller or even
a bad disk drive) then you risk compounding your problems and, at the very
least, are likely to waste time.
Sometimes people are in such a hurry to get access to their data that they
cut corners to their detriment.  I don't know if you are such a person, but
I mentioned it anyway just in case.

NeilBrown



* Re: RAID10 failure(s)
  2011-02-15  0:57           ` NeilBrown
@ 2011-02-15 17:47             ` Mark Keisler
  0 siblings, 0 replies; 11+ messages in thread
From: Mark Keisler @ 2011-02-15 17:47 UTC (permalink / raw)
  To: linux-raid

On Mon, Feb 14, 2011 at 6:57 PM, NeilBrown <neilb@suse.de> wrote:
> [...]
After checking things over, SMART tests were showing quite a few
Offline_Uncorrectable errors and a high Current_Pending_Sector count
on the two drives that had failed out of the array.  So, based on
that, I figured I had nothing to lose in trying to create the array
again.  I just went with the array in a degraded state on 3 drives
and was able to activate the volumes on it and pull off the data
that wasn't backed up yet before it failed again.
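
(For reference, those counters can be read straight from the SMART attribute
table; a sketch with an assumed device name:)

 # smartctl -A /dev/sdc | grep -E 'Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable'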

Stan's dd-zero idea also confirmed it, through its output and the logs:
 # dd if=/dev/zero of=/dev/sdb
dd: writing to `/dev/sdb': Input/output error
9368201+0 records in
9368200+0 records out
4796518400 bytes (4.8 GB) copied, 229.128 s, 20.9 MB/s


So: RMA the drives, keep smartd running, rebuild the array, load some
data, and monitor :).  Thanks for the help, guys.

