From: Mark Keisler <grimm26@gmail.com>
To: linux-raid@vger.kernel.org
Subject: Re: RAID10 failure(s)
Date: Mon, 14 Feb 2011 14:33:03 -0600
Message-ID: <AANLkTinfaYYcq1++_vx0nNMDftiv9Eqg_eDR2Q0QfoS7@mail.gmail.com>
In-Reply-To: <AANLkTimqZqjJuwmC+zX3Uyx_xrWC0i07y5idrnaQXyWF@mail.gmail.com>

Sorry for the double-post on the original.
I realize that I also left out the fact that I rebooted, since drive 0
also reported a fault and mdadm won't start the array at all.  I'm not
sure how to tell which members were in the two RAID0 groups.  I would
think that if I still have a complete RAID0 pair left from the RAID10,
I should be able to recover somehow.  Not sure if the groups were
(0 and 2, 1 and 3) or (0 and 1, 2 and 3).
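
(For what it's worth, my understanding is that with the default near=2
layout on 4 devices, adjacent device roles are copies of each other
(0 mirrors 1, and 2 mirrors 3), so a full copy of the data needs one
surviving drive from {0,1} and one from {2,3}.  I'm assuming the role
each partition held is still readable from the superblocks with
something like:

# mdadm -E /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 | grep 'Device Role'

which should print "Active device N" for each member, if I'm reading
the man page right.)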

Anyway, the drives do still show the correct array UUID when queried
with mdadm -E, but they disagree about the state of the array:
# mdadm -E /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 | grep 'Array State'
   Array State : AAAA ('A' == active, '.' == missing)
   Array State : .AAA ('A' == active, '.' == missing)
   Array State : ..AA ('A' == active, '.' == missing)
   Array State : ..AA ('A' == active, '.' == missing)

sdc still shows a recovery offset, too:

/dev/sdb1:
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
/dev/sdc1:
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
Recovery Offset : 2 sectors
/dev/sdd1:
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
/dev/sde1:
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
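
For what it's worth, comparing the event counters on the members should
show which superblock is the most recent (assuming the Events and
Update Time fields mean what I think they do):

# mdadm -E /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 | grep -E 'Events|Update Time'

The member with the highest event count should have the most current
view of the array.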

I did some searching on the "READ FPDMA QUEUED" error message that my
drive was reporting and found that there seems to be a correlation
between it and having AHCI (NCQ in particular) enabled.  I've now set
my BIOS back to Native IDE (which was the default anyway) instead of
AHCI for the SATA setting.  I'm hoping that was the issue.
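
If anyone knows a less drastic way to rule NCQ out, I gather it can
also be turned off per drive through sysfs without touching the BIOS,
something along these lines (assuming libata exposes queue_depth for
these disks):

# cat /sys/block/sdc/device/queue_depth
# echo 1 > /sys/block/sdc/device/queue_depth

A depth of 1 effectively disables NCQ for that drive.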

Still wondering if there is some magic to be done to get at my data again :)
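
The one thing I haven't tried yet is stopping whatever half-assembled
md0 is left over and then forcing an assembly with the members listed
explicitly; the "Device or resource busy" errors quoted below make me
think the devices are still claimed by the old array.  I'm guessing at
something like this, so please correct me if it risks making things
worse:

# mdadm --stop /dev/md0
# mdadm --assemble --force --verbose /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1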

--
Mark
Tact is the ability to tell a man he has an open mind when he has a
hole in his head.



On Mon, Feb 14, 2011 at 10:09 AM, Mark Keisler <grimm26@gmail.com> wrote:
>
> Sorry in advance for the long email :)
>
>
> I had a RAID10 array set up using 4 WD 1TB Caviar Black drives (SATA3)
> on a 64-bit 2.6.36 kernel using mdadm 3.1.4.  I noticed last night
> that one drive had faulted out of the array.  It had a bunch of errors
> like so:
>
> Feb  8 03:39:48 samsara kernel: [41330.835285] ata3.00: exception
> Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
> Feb  8 03:39:48 samsara kernel: [41330.835288] ata3.00: irq_stat 0x40000008
> Feb  8 03:39:48 samsara kernel: [41330.835292] ata3.00: failed
> command: READ FPDMA QUEUED
> Feb  8 03:39:48 samsara kernel: [41330.835297] ata3.00: cmd
> 60/f8:00:f8:9a:45/00:00:04:00:00/40 tag 0 ncq 126976 in
> Feb  8 03:39:48 samsara kernel: [41330.835297]          res
> 41/40:00:70:9b:45/00:00:04:00:00/40 Emask 0x409 (media error) <F>
> Feb  8 03:39:48 samsara kernel: [41330.835300] ata3.00: status: { DRDY ERR }
> Feb  8 03:39:48 samsara kernel: [41330.835301] ata3.00: error: { UNC }
> Feb  8 03:39:48 samsara kernel: [41330.839776] ata3.00: configured for UDMA/133
> Feb  8 03:39:48 samsara kernel: [41330.839788] ata3: EH complete
> ....
>
> Feb  8 03:39:58 samsara kernel: [41340.423236] sd 2:0:0:0: [sdc]
> Unhandled sense code
> Feb  8 03:39:58 samsara kernel: [41340.423238] sd 2:0:0:0: [sdc]
> Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Feb  8 03:39:58 samsara kernel: [41340.423240] sd 2:0:0:0: [sdc]
> Sense Key : Medium Error [current] [descriptor]
> Feb  8 03:39:58 samsara kernel: [41340.423243] Descriptor sense data
> with sense descriptors (in hex):
> Feb  8 03:39:58 samsara kernel: [41340.423244]         72 03 11 04 00
> 00 00 0c 00 0a 80 00 00 00 00 00
> Feb  8 03:39:58 samsara kernel: [41340.423249]         04 45 9b 70
> Feb  8 03:39:58 samsara kernel: [41340.423251] sd 2:0:0:0: [sdc]  Add.
> Sense: Unrecovered read error - auto reallocate failed
> Feb  8 03:39:58 samsara kernel: [41340.423254] sd 2:0:0:0: [sdc] CDB:
> Read(10): 28 00 04 45 9a f8 00 00 f8 00
> Feb  8 03:39:58 samsara kernel: [41340.423259] end_request: I/O error,
> dev sdc, sector 71670640
> Feb  8 03:39:58 samsara kernel: [41340.423262] md/raid10:md0: sdc1:
> rescheduling sector 143332600
> ....
> Feb  8 03:40:10 samsara kernel: [41351.940796] md/raid10:md0: read
> error corrected (8 sectors at 2168 on sdc1)
> Feb  8 03:40:10 samsara kernel: [41351.954972] md/raid10:md0: sdb1:
> redirecting sector 143332600 to another mirror
>
> and so on until:
> Feb  8 03:55:01 samsara kernel: [42243.609414] md/raid10:md0: sdc1:
> Raid device exceeded read_error threshold [cur 21:max 20]
> Feb  8 03:55:01 samsara kernel: [42243.609417] md/raid10:md0: sdc1:
> Failing raid device
> Feb  8 03:55:01 samsara kernel: [42243.609419] md/raid10:md0: Disk
> failure on sdc1, disabling device.
> Feb  8 03:55:01 samsara kernel: [42243.609420] <1>md/raid10:md0:
> Operation continuing on 3 devices.
> Feb  8 03:55:01 samsara kernel: [42243.609423] md/raid10:md0: sdb1:
> redirecting sector 143163888 to another mirror
> Feb  8 03:55:01 samsara kernel: [42243.609650] md/raid10:md0: sdb1:
> redirecting sector 143164416 to another mirror
> Feb  8 03:55:01 samsara kernel: [42243.610095] md/raid10:md0: sdb1:
> redirecting sector 143164664 to another mirror
> Feb  8 03:55:01 samsara kernel: [42243.633814] RAID10 conf printout:
> Feb  8 03:55:01 samsara kernel: [42243.633817]  --- wd:3 rd:4
> Feb  8 03:55:01 samsara kernel: [42243.633820]  disk 0, wo:0, o:1, dev:sdb1
> Feb  8 03:55:01 samsara kernel: [42243.633821]  disk 1, wo:1, o:0, dev:sdc1
> Feb  8 03:55:01 samsara kernel: [42243.633823]  disk 2, wo:0, o:1, dev:sdd1
> Feb  8 03:55:01 samsara kernel: [42243.633824]  disk 3, wo:0, o:1, dev:sde1
> Feb  8 03:55:01 samsara kernel: [42243.645880] RAID10 conf printout:
> Feb  8 03:55:01 samsara kernel: [42243.645883]  --- wd:3 rd:4
> Feb  8 03:55:01 samsara kernel: [42243.645885]  disk 0, wo:0, o:1, dev:sdb1
> Feb  8 03:55:01 samsara kernel: [42243.645887]  disk 2, wo:0, o:1, dev:sdd1
> Feb  8 03:55:01 samsara kernel: [42243.645888]  disk 3, wo:0, o:1, dev:sde1
>
>
> This seemed weird as the machine is only a week or two old.  I powered
> down to open it up and get the serial number off the drive for an RMA.
> I powered back up and mdadm had automatically removed the drive from
> the RAID.  Fine.  The RAID had already been running on just 3 disks
> since the 8th.  For some reason, I thought I'd add the drive back into
> the array to see if it failed out again, figuring that worst case I'd
> just be back to a degraded RAID10.  So I added it back in, did an
> mdadm --detail to check on it after a little while, and found this:
> samsara log # mdadm --detail /dev/md0
> /dev/md0:
>       Version : 1.2
>  Creation Time : Sat Feb  5 22:00:52 2011
>    Raid Level : raid10
>    Array Size : 1953519104 (1863.02 GiB 2000.40 GB)
>  Used Dev Size : 976759552 (931.51 GiB 1000.20 GB)
>  Raid Devices : 4
>  Total Devices : 4
>   Persistence : Superblock is persistent
>
>   Update Time : Mon Feb 14 00:04:46 2011
>         State : clean, FAILED, recovering
>  Active Devices : 2
> Working Devices : 2
>  Failed Devices : 2
>  Spare Devices : 0
>
>        Layout : near=2
>    Chunk Size : 256K
>
>  Rebuild Status : 99% complete
>
>          Name : samsara:0  (local to host samsara)
>          UUID : 26804ec8:a20a4365:bc7d5b4e:653ade03
>        Events : 30348
>
>   Number   Major   Minor   RaidDevice State
>      0       8       17        0      faulty spare rebuilding   /dev/sdb1
>      1       8       33        1      faulty spare rebuilding   /dev/sdc1
>      2       8       49        2      active sync   /dev/sdd1
>      3       8       65        3      active sync   /dev/sde1
> samsara log # exit
>
> It had also faulted drive 0 during the rebuild.
> [ 1177.064359] RAID10 conf printout:
> [ 1177.064362]  --- wd:2 rd:4
> [ 1177.064365]  disk 0, wo:1, o:0, dev:sdb1
> [ 1177.064367]  disk 1, wo:1, o:0, dev:sdc1
> [ 1177.064368]  disk 2, wo:0, o:1, dev:sdd1
> [ 1177.064370]  disk 3, wo:0, o:1, dev:sde1
> [ 1177.073325] RAID10 conf printout:
> [ 1177.073328]  --- wd:2 rd:4
> [ 1177.073330]  disk 0, wo:1, o:0, dev:sdb1
> [ 1177.073332]  disk 2, wo:0, o:1, dev:sdd1
> [ 1177.073333]  disk 3, wo:0, o:1, dev:sde1
> [ 1177.073340] RAID10 conf printout:
> [ 1177.073341]  --- wd:2 rd:4
> [ 1177.073342]  disk 0, wo:1, o:0, dev:sdb1
> [ 1177.073343]  disk 2, wo:0, o:1, dev:sdd1
> [ 1177.073344]  disk 3, wo:0, o:1, dev:sde1
> [ 1177.083323] RAID10 conf printout:
> [ 1177.083326]  --- wd:2 rd:4
> [ 1177.083329]  disk 2, wo:0, o:1, dev:sdd1
> [ 1177.083330]  disk 3, wo:0, o:1, dev:sde1
>
>
> So the RAID ended up being marked "clean, FAILED."  Gee, glad it is
> clean at least ;).  I'm wondering wtf went wrong and whether it
> actually makes sense that I had a double disk failure like that.  I
> can't even force-assemble the RAID anymore:
>  # mdadm --assemble --verbose --force /dev/md0
> mdadm: looking for devices for /dev/md0
> mdadm: cannot open device /dev/sde1: Device or resource busy
> mdadm: /dev/sde1 has wrong uuid.
> mdadm: cannot open device /dev/sdd1: Device or resource busy
> mdadm: /dev/sdd1 has wrong uuid.
> mdadm: cannot open device /dev/sdc1: Device or resource busy
> mdadm: /dev/sdc1 has wrong uuid.
> mdadm: cannot open device /dev/sdb1: Device or resource busy
> mdadm: /dev/sdb1 has wrong uuid.
>
> Am I totally SOL?  Thanks for any suggestions or things to try.
>
> --
> Mark
> Tact is the ability to tell a man he has an open mind when he has a
> hole in his head.
