* RAID 6 Failure follow up
@ 2009-11-08 14:07 Andrew Dunn
2009-11-08 14:23 ` Roger Heflin
0 siblings, 1 reply; 23+ messages in thread
From: Andrew Dunn @ 2009-11-08 14:07 UTC (permalink / raw)
To: linux-raid list
This is kind of interesting:
storrgie@ALEXANDRIA:~$ sudo mdadm --assemble --force /dev/md0
mdadm: no devices found for /dev/md0
All of the devices are there in /dev, so I wanted to examine them:
storrgie@ALEXANDRIA:~$ sudo mdadm --examine /dev/sde1
/dev/sde1:
Magic : a92b4efc
Version : 00.90.00
UUID : 397e0b3f:34cbe4cc:613e2239:070da8c8 (local to host
ALEXANDRIA)
Creation Time : Fri Nov 6 07:06:34 2009
Raid Level : raid6
Used Dev Size : 976759808 (931.51 GiB 1000.20 GB)
Array Size : 6837318656 (6520.58 GiB 7001.41 GB)
Raid Devices : 9
Total Devices : 9
Preferred Minor : 0
Update Time : Sun Nov 8 08:57:04 2009
State : clean
Active Devices : 5
Working Devices : 5
Failed Devices : 4
Spare Devices : 0
Checksum : 4ff41c5f - correct
Events : 43
Chunk Size : 1024K
Number Major Minor RaidDevice State
this 0 8 65 0 active sync /dev/sde1
0 0 8 65 0 active sync /dev/sde1
1 1 8 81 1 active sync /dev/sdf1
2 2 8 97 2 active sync /dev/sdg1
3 3 8 113 3 active sync /dev/sdh1
4 4 0 0 4 faulty removed
5 5 0 0 5 faulty removed
6 6 0 0 6 faulty removed
7 7 0 0 7 faulty removed
8 8 8 193 8 active sync /dev/sdm1
The first raid device's superblock shows the failures:
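A possible next step, sketched here as an assumption rather than something from the thread: `mdadm --assemble /dev/md0` with no device list and no --scan consults only mdadm.conf, so "no devices found" usually means the array has no ARRAY entry there. Naming the member partitions explicitly (device names taken from the superblock listing) sidesteps that:

```shell
# Assumed invocation sketch: list the member partitions explicitly so mdadm
# does not depend on an mdadm.conf ARRAY entry.
sudo mdadm --assemble --force /dev/md0 /dev/sd[e-m]1

# Alternatively, let mdadm scan every partition's superblock for the array:
sudo mdadm --assemble --scan --force
```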
One of the 'removed' devices:
storrgie@ALEXANDRIA:~$ sudo mdadm --examine /dev/sdi1
/dev/sdi1:
Magic : a92b4efc
Version : 00.90.00
UUID : 397e0b3f:34cbe4cc:613e2239:070da8c8 (local to host
ALEXANDRIA)
Creation Time : Fri Nov 6 07:06:34 2009
Raid Level : raid6
Used Dev Size : 976759808 (931.51 GiB 1000.20 GB)
Array Size : 6837318656 (6520.58 GiB 7001.41 GB)
Raid Devices : 9
Total Devices : 9
Preferred Minor : 0
Update Time : Sun Nov 8 08:53:30 2009
State : active
Active Devices : 9
Working Devices : 9
Failed Devices : 0
Spare Devices : 0
Checksum : 4ff41b2f - correct
Events : 21
Chunk Size : 1024K
Number Major Minor RaidDevice State
this 4 8 129 4 active sync /dev/sdi1
0 0 8 65 0 active sync /dev/sde1
1 1 8 81 1 active sync /dev/sdf1
2 2 8 97 2 active sync /dev/sdg1
3 3 8 113 3 active sync /dev/sdh1
4 4 8 129 4 active sync /dev/sdi1
5 5 8 145 5 active sync /dev/sdj1
6 6 8 161 6 active sync /dev/sdk1
7 7 8 177 7 active sync /dev/sdl1
8 8 8 193 8 active sync /dev/sdm1
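Note how the two superblocks above disagree: Events 43 on sde1 versus 21 on sdi1, with update times a few minutes apart. md uses that counter to decide which members are stale. A small sketch (mine, not from the thread) of pulling the Events field out of --examine output, demonstrated on captured text so it runs anywhere:

```shell
# Extract the Events counter from mdadm --examine output; members whose
# counter lags the rest are the ones md treats as stale.
extract_events() {
    awk -F':' '/^[[:space:]]*Events/ { gsub(/ /, "", $2); print $2 }'
}

# Fed from fragments of the two superblock dumps above:
printf '  State : clean\n  Events : 43\n' | extract_events   # -> 43
printf '  State : active\n  Events : 21\n' | extract_events  # -> 21

# Against the live devices it would be (needs root, device names assumed):
#   for d in /dev/sd[e-m]1; do echo "$d: $(mdadm --examine $d | extract_events)"; done
```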
--
Andrew Dunn
http://agdunn.net
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RAID 6 Failure follow up
2009-11-08 14:07 RAID 6 Failure follow up Andrew Dunn
@ 2009-11-08 14:23 ` Roger Heflin
2009-11-08 14:30 ` Andrew Dunn
2009-11-08 14:36 ` Andrew Dunn
0 siblings, 2 replies; 23+ messages in thread
From: Roger Heflin @ 2009-11-08 14:23 UTC (permalink / raw)
To: Andrew Dunn; +Cc: linux-raid list
Andrew Dunn wrote:
> This is kind of interesting:
>
> storrgie@ALEXANDRIA:~$ sudo mdadm --assemble --force /dev/md0
> mdadm: no devices found for /dev/md0
>
> [mdadm --examine output for /dev/sde1 and /dev/sdi1 snipped]
Did you check dmesg and see if there were errors on those disks?
* Re: RAID 6 Failure follow up
2009-11-08 14:23 ` Roger Heflin
@ 2009-11-08 14:30 ` Andrew Dunn
2009-11-08 18:01 ` Richard Scobie
2009-11-08 14:36 ` Andrew Dunn
1 sibling, 1 reply; 23+ messages in thread
From: Andrew Dunn @ 2009-11-08 14:30 UTC (permalink / raw)
To: Roger Heflin, robin; +Cc: linux-raid list
storrgie@ALEXANDRIA:~$ dmesg | grep sdi
[ 31.019358] sd 11:0:0:0: [sdi] 1953525168 512-byte logical blocks:
(1.00 TB/931 GiB)
[ 31.032233] sd 11:0:0:0: [sdi] Write Protect is off
[ 31.032235] sd 11:0:0:0: [sdi] Mode Sense: 73 00 00 08
[ 31.037483] sd 11:0:0:0: [sdi] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[ 31.066991] sdi:
[ 31.075719] sdi1
[ 31.124713] sd 11:0:0:0: [sdi] Attached SCSI disk
[ 31.147407] md: bind<sdi1>
[ 31.712366] raid5: device sdi1 operational as raid disk 4
[ 31.713153] disk 4, o:1, dev:sdi1
[ 33.112975] disk 4, o:1, dev:sdi1
[ 297.528544] sd 11:0:0:0: [sdi] Sense Key : Recovered Error [current]
[descriptor]
[ 297.528573] sd 11:0:0:0: [sdi] Add. Sense: ATA pass through
information available
[ 297.591382] sd 11:0:0:0: [sdi] Sense Key : Recovered Error [current]
[descriptor]
[ 297.591407] sd 11:0:0:0: [sdi] Add. Sense: ATA pass through
information available
I don't see anything glaring.
> You should be able to force an assembly anyway (using the --force flag)
> but I'd make sure you know exactly what the issue is first, otherwise
> this is likely to happen again.
Do you think that the controller is dropping out? I know that I have 4
drives on one controller (AOC-USAS-L8i) and 5 drives on the other
controller (same make/model), but I think they are sequentially
connected, as in sd[efghi] should be on one device and sd[jklm] should
be on the other. Any easy way to verify?
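One way to check, sketched under the assumption of a standard 2.6 sysfs layout: each disk's sysfs device path embeds its SCSI host number (10 or 11 here) and the PCI address of its HBA, so the controller grouping can be read straight from /sys:

```shell
# SCSI host per disk: the "10:0:x:0" / "11:0:x:0" targets are the two HBAs.
ls -l /sys/block/sd[e-m]/device

# Resolve each disk to the PCI path of its parent controller:
for d in /sys/block/sd[e-m]; do
    echo "$d -> $(readlink -f "$d/device")"
done

# If lsscsi is installed, plain "lsscsi" prints the same
# [host:channel:target:lun] mapping in tabular form.
```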
Roger Heflin wrote:
> [mdadm --examine output snipped]
>
> Did you check dmesg and see if there were errors on those disks?
--
Andrew Dunn
http://agdunn.net
* Re: RAID 6 Failure follow up
2009-11-08 14:23 ` Roger Heflin
2009-11-08 14:30 ` Andrew Dunn
@ 2009-11-08 14:36 ` Andrew Dunn
2009-11-08 14:56 ` Roger Heflin
1 sibling, 1 reply; 23+ messages in thread
From: Andrew Dunn @ 2009-11-08 14:36 UTC (permalink / raw)
To: Roger Heflin; +Cc: linux-raid list
[10:0:0:0] disk ATA WDC WD1001FALS-0 0K05 /dev/sde
[10:0:1:0] disk ATA WDC WD1001FALS-0 0K05 /dev/sdf
[10:0:2:0] disk ATA WDC WD1001FALS-0 0K05 /dev/sdg
[10:0:3:0] disk ATA WDC WD1001FALS-0 0K05 /dev/sdh
[11:0:0:0] disk ATA WDC WD1001FALS-0 0K05 /dev/sdi
[11:0:1:0] disk ATA WDC WD1001FALS-0 0K05 /dev/sdj
[11:0:2:0] disk ATA WDC WD1001FALS-0 0K05 /dev/sdk
[11:0:3:0] disk ATA WDC WD1001FALS-0 0K05 /dev/sdl
[11:0:4:0] disk ATA WDC WD1001FALS-0 0K05 /dev/sdm
So 4 drives dropped out on the second controller. But why didn't sdm go
with them?
Roger Heflin wrote:
> [mdadm --examine output snipped]
>
> Did you check dmesg and see if there were errors on those disks?
--
Andrew Dunn
http://agdunn.net
* Re: RAID 6 Failure follow up
2009-11-08 14:36 ` Andrew Dunn
@ 2009-11-08 14:56 ` Roger Heflin
2009-11-08 17:08 ` Andrew Dunn
0 siblings, 1 reply; 23+ messages in thread
From: Roger Heflin @ 2009-11-08 14:56 UTC (permalink / raw)
To: Andrew Dunn; +Cc: linux-raid list
Andrew Dunn wrote:
> [10:0:0:0] disk ATA WDC WD1001FALS-0 0K05 /dev/sde
> [10:0:1:0] disk ATA WDC WD1001FALS-0 0K05 /dev/sdf
> [10:0:2:0] disk ATA WDC WD1001FALS-0 0K05 /dev/sdg
> [10:0:3:0] disk ATA WDC WD1001FALS-0 0K05 /dev/sdh
> [11:0:0:0] disk ATA WDC WD1001FALS-0 0K05 /dev/sdi
> [11:0:1:0] disk ATA WDC WD1001FALS-0 0K05 /dev/sdj
> [11:0:2:0] disk ATA WDC WD1001FALS-0 0K05 /dev/sdk
> [11:0:3:0] disk ATA WDC WD1001FALS-0 0K05 /dev/sdl
> [11:0:4:0] disk ATA WDC WD1001FALS-0 0K05 /dev/sdm
>
> So 4 drives dropped out on the second controller. But why didn't sdm go
> with them?
>
>
It is possible that by the time it got to checking the last drive the
errors had cleared up, so sdm was OK when it was checked.
Is this on a port multiplier?
* Re: RAID 6 Failure follow up
2009-11-08 14:56 ` Roger Heflin
@ 2009-11-08 17:08 ` Andrew Dunn
0 siblings, 0 replies; 23+ messages in thread
From: Andrew Dunn @ 2009-11-08 17:08 UTC (permalink / raw)
To: Roger Heflin; +Cc: linux-raid list
No multiplier, though they are on backplanes: 2 on one backplane, 3 on
another... but only 2 of the 3 dropped off that one.
I looked through dmesg some more; maybe you will see something of
significance. I don't think this is from when it happened, but it might
shed light on the issue. I will continue to sift through the log.
[ 19.021969] scsi10 : ioc0: LSISAS1068E B3, FwRev=011a0000h, Ports=1,
MaxQ=478, IRQ=16
[ 19.061176] mptsas: ioc0: attaching sata device: fw_channel 0, fw_id
0, phy 0, sas_addr 0x1221000000000000
[ 19.063708] scsi 10:0:0:0: Direct-Access ATA WDC
WD1001FALS-0 0K05 PQ: 0 ANSI: 5
[ 19.065473] sd 10:0:0:0: Attached scsi generic sg4 type 0
[ 19.067322] mptsas: ioc0: attaching sata device: fw_channel 0, fw_id
1, phy 1, sas_addr 0x1221000001000000
[ 19.068074] sd 10:0:0:0: [sde] 1953523055 512-byte logical blocks:
(1.00 TB/931 GiB)
[ 19.070474] scsi 10:0:1:0: Direct-Access ATA WDC
WD1001FALS-0 0K05 PQ: 0 ANSI: 5
[ 19.072797] sd 10:0:1:0: Attached scsi generic sg5 type 0
[ 19.074994] mptsas: ioc0: attaching sata device: fw_channel 0, fw_id
4, phy 4, sas_addr 0x1221000004000000
[ 19.076025] sd 10:0:1:0: [sdf] 1953525168 512-byte logical blocks:
(1.00 TB/931 GiB)
[ 19.078091] scsi 10:0:2:0: Direct-Access ATA WDC
WD1001FALS-0 0K05 PQ: 0 ANSI: 5
[ 19.080417] sd 10:0:2:0: Attached scsi generic sg6 type 0
[ 19.082589] mptsas: ioc0: attaching sata device: fw_channel 0, fw_id
5, phy 5, sas_addr 0x1221000005000000
[ 19.082966] sd 10:0:0:0: [sde] Write Protect is off
[ 19.082970] sd 10:0:0:0: [sde] Mode Sense: 73 00 00 08
[ 19.084186] sd 10:0:2:0: [sdg] 1953525168 512-byte logical blocks:
(1.00 TB/931 GiB)
[ 19.086521] sd 10:0:0:0: [sde] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[ 19.087036] scsi 10:0:3:0: Direct-Access ATA WDC
WD1001FALS-0 0K05 PQ: 0 ANSI: 5
[ 19.088389] sd 10:0:1:0: [sdf] Write Protect is off
[ 19.088393] sd 10:0:1:0: [sdf] Mode Sense: 73 00 00 08
[ 19.089642] sd 10:0:3:0: Attached scsi generic sg7 type 0
[ 19.092400] mptsas 0000:02:00.0: PCI INT A -> GSI 16 (level, low) ->
IRQ 16
[ 19.092525] mptbase: ioc1: Initiating bringup
[ 19.093974] sd 10:0:3:0: [sdh] 1953525168 512-byte logical blocks:
(1.00 TB/931 GiB)
[ 19.095129] sd 10:0:1:0: [sdf] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[ 19.101887] sd 10:0:2:0: [sdg] Write Protect is off
[ 19.101891] sd 10:0:2:0: [sdg] Mode Sense: 73 00 00 08
[ 19.104250] sd 10:0:2:0: [sdg] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[ 19.107231] sd 10:0:3:0: [sdh] Write Protect is off
[ 19.107236] sd 10:0:3:0: [sdh] Mode Sense: 73 00 00 08
[ 19.109398] sde:
[ 19.111301] sd 10:0:3:0: [sdh] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[ 19.111659] sdf:
[ 19.118664] sdg: sdf1
[ 19.122127] sde1
[ 19.126192] sdh: sdg1
[ 19.137786] sd 10:0:1:0: [sdf] Attached SCSI disk
[ 19.143743] sdh1
[ 19.146360] sd 10:0:0:0: [sde] Attached SCSI disk
[ 19.148589] sd 10:0:2:0: [sdg] Attached SCSI disk
[ 19.158613] sd 10:0:3:0: [sdh] Attached SCSI disk
[ 20.780022] ioc1: LSISAS1068E B3: Capabilities={Initiator}
[ 20.780035] mptsas 0000:02:00.0: setting latency timer to 64
[ 30.971934] scsi11 : ioc1: LSISAS1068E B3, FwRev=011a0000h, Ports=1,
MaxQ=478, IRQ=16
[ 31.012437] mptsas: ioc1: attaching sata device: fw_channel 0, fw_id
0, phy 0, sas_addr 0x1221000000000000
[ 31.015009] scsi 11:0:0:0: Direct-Access ATA WDC
WD1001FALS-0 0K05 PQ: 0 ANSI: 5
[ 31.016755] sd 11:0:0:0: Attached scsi generic sg8 type 0
[ 31.018603] mptsas: ioc1: attaching sata device: fw_channel 0, fw_id
1, phy 1, sas_addr 0x1221000001000000
[ 31.019358] sd 11:0:0:0: [sdi] 1953525168 512-byte logical blocks:
(1.00 TB/931 GiB)
[ 31.021753] scsi 11:0:1:0: Direct-Access ATA WDC
WD1001FALS-0 0K05 PQ: 0 ANSI: 5
[ 31.024075] sd 11:0:1:0: Attached scsi generic sg9 type 0
[ 31.026273] mptsas: ioc1: attaching sata device: fw_channel 0, fw_id
4, phy 4, sas_addr 0x1221000004000000
[ 31.027302] sd 11:0:1:0: [sdj] 1953525168 512-byte logical blocks:
(1.00 TB/931 GiB)
[ 31.029693] scsi 11:0:2:0: Direct-Access ATA WDC
WD1001FALS-0 0K05 PQ: 0 ANSI: 5
[ 31.032004] sd 11:0:2:0: Attached scsi generic sg10 type 0
[ 31.032233] sd 11:0:0:0: [sdi] Write Protect is off
[ 31.032235] sd 11:0:0:0: [sdi] Mode Sense: 73 00 00 08
[ 31.034133] mptsas: ioc1: attaching sata device: fw_channel 0, fw_id
5, phy 5, sas_addr 0x1221000005000000
[ 31.035571] sd 11:0:2:0: [sdk] 1953525168 512-byte logical blocks:
(1.00 TB/931 GiB)
[ 31.037483] sd 11:0:0:0: [sdi] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[ 31.038793] scsi 11:0:3:0: Direct-Access ATA WDC
WD1001FALS-0 0K05 PQ: 0 ANSI: 5
[ 31.041160] sd 11:0:3:0: Attached scsi generic sg11 type 0
[ 31.043506] mptsas: ioc1: attaching sata device: fw_channel 0, fw_id
6, phy 6, sas_addr 0x1221000006000000
[ 31.043884] sd 11:0:1:0: [sdj] Write Protect is off
[ 31.043887] sd 11:0:1:0: [sdj] Mode Sense: 73 00 00 08
[ 31.046683] sd 11:0:3:0: [sdl] 1953525168 512-byte logical blocks:
(1.00 TB/931 GiB)
[ 31.047038] sd 11:0:1:0: [sdj] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[ 31.050845] scsi 11:0:4:0: Direct-Access ATA WDC
WD1001FALS-0 0K05 PQ: 0 ANSI: 5
[ 31.054206] sd 11:0:4:0: Attached scsi generic sg12 type 0
[ 31.056125] sd 11:0:2:0: [sdk] Write Protect is off
[ 31.056129] sd 11:0:2:0: [sdk] Mode Sense: 73 00 00 08
[ 31.059805] sd 11:0:4:0: [sdm] 1953525168 512-byte logical blocks:
(1.00 TB/931 GiB)
[ 31.061019] sd 11:0:2:0: [sdk] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[ 31.065705] sd 11:0:3:0: [sdl] Write Protect is off
[ 31.065710] sd 11:0:3:0: [sdl] Mode Sense: 73 00 00 08
[ 31.066991] sdi:
[ 31.069131] sdj:
[ 31.070087] sd 11:0:3:0: [sdl] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[ 31.073259] sd 11:0:4:0: [sdm] Write Protect is off
[ 31.073262] sd 11:0:4:0: [sdm] Mode Sense: 73 00 00 08
[ 31.074045] sdj1
[ 31.075719] sdi1
[ 31.077424] sd 11:0:4:0: [sdm] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[ 31.083141] sdk:
[ 31.090760] sdl: sdk1
[ 31.099798] sdm: sdl1
[ 31.115614] sdm1
[ 31.122247] sd 11:0:1:0: [sdj] Attached SCSI disk
[ 31.124713] sd 11:0:0:0: [sdi] Attached SCSI disk
[ 31.131908] sd 11:0:2:0: [sdk] Attached SCSI disk
[ 31.141444] md: bind<sdj1>
[ 31.143383] sd 11:0:3:0: [sdl] Attached SCSI disk
[ 31.147407] md: bind<sdi1>
[ 31.153910] sd 11:0:4:0: [sdm] Attached SCSI disk
[ 31.159932] md: bind<sdl1>
[ 31.176695] md: bind<sdm1>
[ 31.265544] md: bind<sde1>
[ 31.354001] md: bind<sdk1>
[ 31.467249] md: bind<sdh1>
[ 31.476153] md: bind<sdg1>
[ 31.670444] md: bind<sdf1>
[ 31.672643] md: kicking non-fresh sdk1 from array!
[ 31.672652] md: unbind<sdk1>
[ 31.711286] md: export_rdev(sdk1)
[ 31.712356] raid5: device sdf1 operational as raid disk 1
[ 31.712358] raid5: device sdg1 operational as raid disk 2
[ 31.712360] raid5: device sdh1 operational as raid disk 3
[ 31.712362] raid5: device sde1 operational as raid disk 0
[ 31.712363] raid5: device sdm1 operational as raid disk 8
[ 31.712365] raid5: device sdl1 operational as raid disk 7
[ 31.712366] raid5: device sdi1 operational as raid disk 4
[ 31.712368] raid5: device sdj1 operational as raid disk 5
[ 31.712962] raid5: allocated 9540kB for md0
[ 31.713094] raid5: raid level 6 set md0 active with 8 out of 9
devices, algorithm 2
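For what it's worth, a hedged recovery sketch for the state shown above: sdk1 was kicked as non-fresh (its event count lagged) and the array came up degraded with 8 of 9 members. Once the drop-outs are explained, the stale member can be re-added, which with 0.90 metadata triggers a full resync:

```shell
# Re-add the kicked member (full resync with 0.90 superblocks):
sudo mdadm /dev/md0 --add /dev/sdk1

# Watch the rebuild progress:
cat /proc/mdstat
```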
Roger Heflin wrote:
> [lsscsi listing snipped]
>
> It is possible that by the time it got to checking the last drive the
> errors had cleared up, so sdm was OK when it was checked.
>
> Is this on a port multiplier?
--
Andrew Dunn
http://agdunn.net
* Re: RAID 6 Failure follow up
2009-11-08 14:30 ` Andrew Dunn
@ 2009-11-08 18:01 ` Richard Scobie
2009-11-08 18:22 ` Andrew Dunn
2009-11-08 22:09 ` Andrew Dunn
0 siblings, 2 replies; 23+ messages in thread
From: Richard Scobie @ 2009-11-08 18:01 UTC (permalink / raw)
To: Andrew Dunn; +Cc: Roger Heflin, robin, linux-raid list
Andrew Dunn wrote:
> Do you think that the controller is dropping out? I know that I have 4
> drives on one controller (AOC-USAS-L8i) and 5 drives on the other
> controller (SAME make/model). but I think they are sequentially
> connected... as in sd[efghi] should be on one device and sd[jklm] should
> be on the other... any easy way to verify?
If you are running smartd, cease doing so, and do not use the smartctl
command on drives attached to these controllers: using it causes drives
to be offlined.
It appears that smartctl is broken with LSISAS1068E-based controllers.
See:
https://bugzilla.redhat.com/show_bug.cgi?id=452389
and
http://marc.info/?l=linux-scsi&m=125673590221135&w=2
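A workaround sketch along those lines (service name and config path are assumptions; adjust per distro): stop smartd and keep it from ever probing the drives behind the LSISAS1068E controllers:

```shell
# Stop the periodic SMART poller:
sudo /etc/init.d/smartd stop

# In /etc/smartd.conf, replace the catch-all DEVICESCAN line with explicit
# entries naming only disks that are NOT behind the affected controllers,
# e.g.:
#   /dev/sda -a
```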
Regards,
Richard
* Re: RAID 6 Failure follow up
2009-11-08 18:01 ` Richard Scobie
@ 2009-11-08 18:22 ` Andrew Dunn
2009-11-08 18:34 ` Joe Landman
2009-11-08 22:09 ` Andrew Dunn
1 sibling, 1 reply; 23+ messages in thread
From: Andrew Dunn @ 2009-11-08 18:22 UTC (permalink / raw)
To: Richard Scobie; +Cc: Roger Heflin, robin, linux-raid list, nfbrown
New data now: I got this from dmesg when it went down again. Hopefully
it is of some significance.
> [14269.650381] sd 10:0:3:0: rejecting I/O to offline device
> [14269.650453] sd 10:0:3:0: rejecting I/O to offline device
> [14269.650524] sd 10:0:3:0: rejecting I/O to offline device
> [14269.650595] sd 10:0:3:0: rejecting I/O to offline device
> [14269.650672] sd 10:0:3:0: [sdh] Unhandled error code
> [14269.650675] sd 10:0:3:0: [sdh] Result: hostbyte=DID_NO_CONNECT
driverbyte=DRIVER_OK
> [14269.650680] end_request: I/O error, dev sdh, sector 1435085631
> [14269.650749] raid5:md0: read error not correctable (sector
1435085568 on sdh1).
> [14269.650753] raid5: Disk failure on sdh1, disabling device.
> [14269.650754] raid5: Operation continuing on 7 devices.
> [14269.650886] raid5:md0: read error not correctable (sector
1435085576 on sdh1).
> [14269.650890] raid5:md0: read error not correctable (sector
1435085584 on sdh1).
> [14269.650894] raid5:md0: read error not correctable (sector
1435085592 on sdh1).
> [14269.650898] raid5:md0: read error not correctable (sector
1435085600 on sdh1).
> [14269.650902] raid5:md0: read error not correctable (sector
1435085608 on sdh1).
> [14269.650905] raid5:md0: read error not correctable (sector
1435085616 on sdh1).
> [14269.650909] raid5:md0: read error not correctable (sector
1435085624 on sdh1).
> [14269.650913] raid5:md0: read error not correctable (sector
1435085632 on sdh1).
> [14269.650917] raid5:md0: read error not correctable (sector
1435085640 on sdh1).
> [14269.650943] sd 10:0:3:0: [sdh] Unhandled error code
> [14269.650946] sd 10:0:3:0: [sdh] Result: hostbyte=DID_NO_CONNECT
driverbyte=DRIVER_OK
> [14269.650950] end_request: I/O error, dev sdh, sector 1435085887
> [14269.651049] sd 10:0:3:0: [sdh] Unhandled error code
> [14269.651051] sd 10:0:3:0: [sdh] Result: hostbyte=DID_NO_CONNECT
driverbyte=DRIVER_OK
> [14269.651055] end_request: I/O error, dev sdh, sector 1435086143
> [14269.651151] sd 10:0:3:0: [sdh] Unhandled error code
> [14269.651153] sd 10:0:3:0: [sdh] Result: hostbyte=DID_NO_CONNECT
driverbyte=DRIVER_OK
> [14269.651157] end_request: I/O error, dev sdh, sector 1435086399
> [14269.651253] sd 10:0:3:0: [sdh] Unhandled error code
> [14269.651255] sd 10:0:3:0: [sdh] Result: hostbyte=DID_NO_CONNECT
driverbyte=DRIVER_OK
> [14269.651259] end_request: I/O error, dev sdh, sector 1435086655
> [14269.651358] sd 10:0:3:0: [sdh] Unhandled error code
> [14269.651361] sd 10:0:3:0: [sdh] Result: hostbyte=DID_NO_CONNECT
driverbyte=DRIVER_OK
> [14269.651364] end_request: I/O error, dev sdh, sector 1435086911
> [14269.651461] sd 10:0:3:0: [sdh] Unhandled error code
> [14269.651463] sd 10:0:3:0: [sdh] Result: hostbyte=DID_NO_CONNECT
driverbyte=DRIVER_OK
> [14269.651467] end_request: I/O error, dev sdh, sector 1435087167
> [14269.651565] sd 10:0:3:0: [sdh] Unhandled error code
> [14269.651568] sd 10:0:3:0: [sdh] Result: hostbyte=DID_NO_CONNECT
driverbyte=DRIVER_OK
> [14269.651571] end_request: I/O error, dev sdh, sector 1435087423
> [14269.670675] end_request: I/O error, dev sdf, sector 1953519935
> [14269.670739] md: super_written gets error=-5, uptodate=0
> [14269.670743] raid5: Disk failure on sdf1, disabling device.
> [14269.670745] raid5: Operation continuing on 6 devices.
> [14269.672525] end_request: I/O error, dev sdg, sector 1953519935
> [14269.672598] md: super_written gets error=-5, uptodate=0
> [14269.672603] raid5: Disk failure on sdg1, disabling device.
> [14269.672605] raid5: Operation continuing on 5 devices.
> [14269.674402] end_request: I/O error, dev sde, sector 1953519935
> [14269.674474] md: super_written gets error=-5, uptodate=0
> [14269.674478] raid5: Disk failure on sde1, disabling device.
> [14269.674480] raid5: Operation continuing on 4 devices.
> [14269.769991] sd 11:0:0:0: [sdi] Sense Key : Recovered Error
[current] [descriptor]
> [14269.769997] Descriptor sense data with sense descriptors (in hex):
> [14269.770000] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14269.770012] 00 4f 00 c2 00 50
> [14269.770018] sd 11:0:0:0: [sdi] Add. Sense: ATA pass through
information available
> [14269.800245] md: md0: recovery done.
> [14269.869990] sd 11:0:0:0: [sdi] Sense Key : Recovered Error
[current] [descriptor]
> [14269.869997] Descriptor sense data with sense descriptors (in hex):
> [14269.870008] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14269.870019] 00 4f 00 c2 00 50
> [14269.870025] sd 11:0:0:0: [sdi] Add. Sense: ATA pass through
information available
> [14269.905144] RAID5 conf printout:
> [14269.905148] --- rd:9 wd:4
> [14269.905152] disk 0, o:0, dev:sde1
> [14269.905155] disk 1, o:0, dev:sdf1
> [14269.905157] disk 2, o:0, dev:sdg1
> [14269.905160] disk 3, o:0, dev:sdh1
> [14269.905162] disk 4, o:1, dev:sdi1
> [14269.905165] disk 5, o:1, dev:sdj1
> [14269.905167] disk 6, o:1, dev:sdk1
> [14269.905169] disk 7, o:1, dev:sdl1
> [14269.905172] disk 8, o:1, dev:sdm1
> [14269.941265] RAID5 conf printout:
> [14269.941269] --- rd:9 wd:4
> [14269.941273] disk 0, o:0, dev:sde1
> [14269.941276] disk 1, o:0, dev:sdf1
> [14269.941278] disk 2, o:0, dev:sdg1
> [14269.941281] disk 3, o:0, dev:sdh1
> [14269.941283] disk 4, o:1, dev:sdi1
> [14269.941286] disk 5, o:1, dev:sdj1
> [14269.941289] disk 7, o:1, dev:sdl1
> [14269.941291] disk 8, o:1, dev:sdm1
> [14269.941300] RAID5 conf printout:
> [14269.941302] --- rd:9 wd:4
> [14269.941304] disk 0, o:0, dev:sde1
> [14269.941307] disk 1, o:0, dev:sdf1
> [14269.941309] disk 2, o:0, dev:sdg1
> [14269.941311] disk 3, o:0, dev:sdh1
> [14269.941314] disk 4, o:1, dev:sdi1
> [14269.941316] disk 5, o:1, dev:sdj1
> [14269.941318] disk 7, o:1, dev:sdl1
> [14269.941321] disk 8, o:1, dev:sdm1
> [14269.981260] RAID5 conf printout:
> [14269.981263] --- rd:9 wd:4
> [14269.981265] disk 0, o:0, dev:sde1
> [14269.981268] disk 2, o:0, dev:sdg1
> [14269.981270] disk 3, o:0, dev:sdh1
> [14269.981273] disk 4, o:1, dev:sdi1
> [14269.981275] disk 5, o:1, dev:sdj1
> [14269.981277] disk 7, o:1, dev:sdl1
> [14269.981280] disk 8, o:1, dev:sdm1
> [14269.981284] RAID5 conf printout:
> [14269.981286] --- rd:9 wd:4
> [14269.981289] disk 0, o:0, dev:sde1
> [14269.981291] disk 2, o:0, dev:sdg1
> [14269.981293] disk 3, o:0, dev:sdh1
> [14269.981296] disk 4, o:1, dev:sdi1
> [14269.981298] disk 5, o:1, dev:sdj1
> [14269.981300] disk 7, o:1, dev:sdl1
> [14269.981302] disk 8, o:1, dev:sdm1
> [14270.003316] sd 11:0:0:0: [sdi] Sense Key : Recovered Error
[current] [descriptor]
> [14270.003324] Descriptor sense data with sense descriptors (in hex):
> [14270.003327] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14270.003338] 00 4f 00 c2 00 50
> [14270.003344] sd 11:0:0:0: [sdi] Add. Sense: ATA pass through
information available
> [14270.021260] RAID5 conf printout:
> [14270.021263] --- rd:9 wd:4
> [14270.021266] disk 0, o:0, dev:sde1
> [14270.021269] disk 3, o:0, dev:sdh1
> [14270.021271] disk 4, o:1, dev:sdi1
> [14270.021274] disk 5, o:1, dev:sdj1
> [14270.021276] disk 7, o:1, dev:sdl1
> [14270.021278] disk 8, o:1, dev:sdm1
> [14270.021283] RAID5 conf printout:
> [14270.021285] --- rd:9 wd:4
> [14270.021287] disk 0, o:0, dev:sde1
> [14270.021289] disk 3, o:0, dev:sdh1
> [14270.021292] disk 4, o:1, dev:sdi1
> [14270.021294] disk 5, o:1, dev:sdj1
> [14270.021296] disk 7, o:1, dev:sdl1
> [14270.021298] disk 8, o:1, dev:sdm1
> [14270.061261] RAID5 conf printout:
> [14270.061264] --- rd:9 wd:4
> [14270.061266] disk 0, o:0, dev:sde1
> [14270.061269] disk 4, o:1, dev:sdi1
> [14270.061272] disk 5, o:1, dev:sdj1
> [14270.061274] disk 7, o:1, dev:sdl1
> [14270.061276] disk 8, o:1, dev:sdm1
> [14270.061281] RAID5 conf printout:
> [14270.061283] --- rd:9 wd:4
> [14270.061285] disk 0, o:0, dev:sde1
> [14270.061287] disk 4, o:1, dev:sdi1
> [14270.061289] disk 5, o:1, dev:sdj1
> [14270.061292] disk 7, o:1, dev:sdl1
> [14270.061294] disk 8, o:1, dev:sdm1
> [14270.061647] sd 11:0:0:0: [sdi] Sense Key : Recovered Error
[current] [descriptor]
> [14270.061653] Descriptor sense data with sense descriptors (in hex):
> [14270.061656] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14270.061667] 00 4f 00 c2 00 50
> [14270.061672] sd 11:0:0:0: [sdi] Add. Sense: ATA pass through
information available
> [14270.091263] RAID5 conf printout:
> [14270.091267] --- rd:9 wd:4
> [14270.091271] disk 4, o:1, dev:sdi1
> [14270.091274] disk 5, o:1, dev:sdj1
> [14270.091276] disk 7, o:1, dev:sdl1
> [14270.091279] disk 8, o:1, dev:sdm1
> [14270.153319] sd 11:0:0:0: [sdi] Sense Key : Recovered Error
[current] [descriptor]
> [14270.153325] Descriptor sense data with sense descriptors (in hex):
> [14270.153328] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14270.153340] 00 4f 00 c2 00 50
> [14270.153346] sd 11:0:0:0: [sdi] Add. Sense: ATA pass through
information available
> [14270.211651] sd 11:0:0:0: [sdi] Sense Key : Recovered Error
[current] [descriptor]
> [14270.211657] Descriptor sense data with sense descriptors (in hex):
> [14270.211660] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14270.211671] 00 4f 00 c2 00 50
> [14270.211677] sd 11:0:0:0: [sdi] Add. Sense: ATA pass through
information available
> [14270.324057] sd 11:0:1:0: [sdj] Sense Key : Recovered Error
[current] [descriptor]
> [14270.324065] Descriptor sense data with sense descriptors (in hex):
> [14270.324067] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14270.324079] 00 4f 00 c2 00 50
> [14270.324085] sd 11:0:1:0: [sdj] Add. Sense: ATA pass through
information available
> [14270.382390] sd 11:0:1:0: [sdj] Sense Key : Recovered Error
[current] [descriptor]
> [14270.382396] Descriptor sense data with sense descriptors (in hex):
> [14270.382399] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14270.382410] 00 4f 00 c2 00 50
> [14270.382416] sd 11:0:1:0: [sdj] Add. Sense: ATA pass through
information available
> [14270.474060] sd 11:0:1:0: [sdj] Sense Key : Recovered Error
[current] [descriptor]
> [14270.474068] Descriptor sense data with sense descriptors (in hex):
> [14270.474071] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14270.474083] 00 4f 00 c2 00 50
> [14270.474089] sd 11:0:1:0: [sdj] Add. Sense: ATA pass through
information available
> [14270.532394] sd 11:0:1:0: [sdj] Sense Key : Recovered Error
[current] [descriptor]
> [14270.532401] Descriptor sense data with sense descriptors (in hex):
> [14270.532404] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14270.532415] 00 4f 00 c2 00 50
> [14270.532421] sd 11:0:1:0: [sdj] Add. Sense: ATA pass through
information available
> [14270.632394] sd 11:0:1:0: [sdj] Sense Key : Recovered Error
[current] [descriptor]
> [14270.632402] Descriptor sense data with sense descriptors (in hex):
> [14270.632405] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14270.632417] 00 4f 00 c2 00 50
> [14270.632423] sd 11:0:1:0: [sdj] Add. Sense: ATA pass through
information available
> [14270.690729] sd 11:0:1:0: [sdj] Sense Key : Recovered Error
[current] [descriptor]
> [14270.690736] Descriptor sense data with sense descriptors (in hex):
> [14270.690739] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14270.690751] 00 4f 00 c2 00 50
> [14270.690757] sd 11:0:1:0: [sdj] Add. Sense: ATA pass through
information available
> [14270.804065] sd 11:0:2:0: [sdk] Sense Key : Recovered Error
[current] [descriptor]
> [14270.804073] Descriptor sense data with sense descriptors (in hex):
> [14270.804076] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14270.804088] 00 4f 00 c2 00 50
> [14270.804094] sd 11:0:2:0: [sdk] Add. Sense: ATA pass through
information available
> [14270.862400] sd 11:0:2:0: [sdk] Sense Key : Recovered Error
[current] [descriptor]
> [14270.862406] Descriptor sense data with sense descriptors (in hex):
> [14270.862409] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14270.862420] 00 4f 00 c2 00 50
> [14270.862426] sd 11:0:2:0: [sdk] Add. Sense: ATA pass through
information available
> [14270.954070] sd 11:0:2:0: [sdk] Sense Key : Recovered Error
[current] [descriptor]
> [14270.954079] Descriptor sense data with sense descriptors (in hex):
> [14270.954081] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14270.954093] 00 4f 00 c2 00 50
> [14270.954099] sd 11:0:2:0: [sdk] Add. Sense: ATA pass through
information available
> [14271.012399] sd 11:0:2:0: [sdk] Sense Key : Recovered Error
[current] [descriptor]
> [14271.012406] Descriptor sense data with sense descriptors (in hex):
> [14271.012408] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14271.012420] 00 4f 00 c2 00 50
> [14271.012426] sd 11:0:2:0: [sdk] Add. Sense: ATA pass through
information available
> [14271.104072] sd 11:0:2:0: [sdk] Sense Key : Recovered Error
[current] [descriptor]
> [14271.104080] Descriptor sense data with sense descriptors (in hex):
> [14271.104083] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14271.104094] 00 4f 00 c2 00 50
> [14271.104100] sd 11:0:2:0: [sdk] Add. Sense: ATA pass through
information available
> [14271.162400] sd 11:0:2:0: [sdk] Sense Key : Recovered Error
[current] [descriptor]
> [14271.162407] Descriptor sense data with sense descriptors (in hex):
> [14271.162410] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14271.162422] 00 4f 00 c2 00 50
> [14271.162428] sd 11:0:2:0: [sdk] Add. Sense: ATA pass through
information available
> [14271.278147] sd 11:0:3:0: [sdl] Sense Key : Recovered Error
[current] [descriptor]
> [14271.278155] Descriptor sense data with sense descriptors (in hex):
> [14271.278157] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14271.278169] 00 4f 00 c2 00 50
> [14271.278175] sd 11:0:3:0: [sdl] Add. Sense: ATA pass through
information available
> [14271.336487] sd 11:0:3:0: [sdl] Sense Key : Recovered Error
[current] [descriptor]
> [14271.336495] Descriptor sense data with sense descriptors (in hex):
> [14271.336498] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14271.336509] 00 4f 00 c2 00 50
> [14271.336515] sd 11:0:3:0: [sdl] Add. Sense: ATA pass through
information available
> [14271.428148] sd 11:0:3:0: [sdl] Sense Key : Recovered Error
[current] [descriptor]
> [14271.428156] Descriptor sense data with sense descriptors (in hex):
> [14271.428158] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14271.428170] 00 4f 00 c2 00 50
> [14271.428176] sd 11:0:3:0: [sdl] Add. Sense: ATA pass through
information available
> [14271.486485] sd 11:0:3:0: [sdl] Sense Key : Recovered Error
[current] [descriptor]
> [14271.486493] Descriptor sense data with sense descriptors (in hex):
> [14271.486496] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14271.486508] 00 4f 00 c2 00 50
> [14271.486514] sd 11:0:3:0: [sdl] Add. Sense: ATA pass through
information available
> [14271.586482] sd 11:0:3:0: [sdl] Sense Key : Recovered Error
[current] [descriptor]
> [14271.586489] Descriptor sense data with sense descriptors (in hex):
> [14271.586492] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14271.586503] 00 4f 00 c2 00 50
> [14271.586509] sd 11:0:3:0: [sdl] Add. Sense: ATA pass through
information available
> [14271.644813] sd 11:0:3:0: [sdl] Sense Key : Recovered Error
[current] [descriptor]
> [14271.644819] Descriptor sense data with sense descriptors (in hex):
> [14271.644822] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14271.644833] 00 4f 00 c2 00 50
> [14271.644839] sd 11:0:3:0: [sdl] Add. Sense: ATA pass through
information available
> [14271.762812] sd 11:0:4:0: [sdm] Sense Key : Recovered Error
[current] [descriptor]
> [14271.762820] Descriptor sense data with sense descriptors (in hex):
> [14271.762823] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14271.762834] 00 4f 00 c2 00 50
> [14271.762841] sd 11:0:4:0: [sdm] Add. Sense: ATA pass through
information available
> [14271.821145] sd 11:0:4:0: [sdm] Sense Key : Recovered Error
[current] [descriptor]
> [14271.821152] Descriptor sense data with sense descriptors (in hex):
> [14271.821154] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14271.821166] 00 4f 00 c2 00 50
> [14271.821172] sd 11:0:4:0: [sdm] Add. Sense: ATA pass through
information available
> [14271.912816] sd 11:0:4:0: [sdm] Sense Key : Recovered Error
[current] [descriptor]
> [14271.912824] Descriptor sense data with sense descriptors (in hex):
> [14271.912827] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14271.912838] 00 4f 00 c2 00 50
> [14271.912844] sd 11:0:4:0: [sdm] Add. Sense: ATA pass through
information available
> [14271.971152] sd 11:0:4:0: [sdm] Sense Key : Recovered Error
[current] [descriptor]
> [14271.971161] Descriptor sense data with sense descriptors (in hex):
> [14271.971163] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14271.971175] 00 4f 00 c2 00 50
> [14271.971181] sd 11:0:4:0: [sdm] Add. Sense: ATA pass through
information available
> [14272.071150] sd 11:0:4:0: [sdm] Sense Key : Recovered Error
[current] [descriptor]
> [14272.071157] Descriptor sense data with sense descriptors (in hex):
> [14272.071160] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14272.071172] 00 4f 00 c2 00 50
> [14272.071178] sd 11:0:4:0: [sdm] Add. Sense: ATA pass through
information available
> [14272.129485] sd 11:0:4:0: [sdm] Sense Key : Recovered Error
[current] [descriptor]
> [14272.129494] Descriptor sense data with sense descriptors (in hex):
> [14272.129497] 72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [14272.129508] 00 4f 00 c2 00 50
> [14272.129514] sd 11:0:4:0: [sdm] Add. Sense: ATA pass through
information available
> [14365.066847] Aborting journal on device md0:8.
> [14365.066946] __ratelimit: 246 callbacks suppressed
> [14365.066949] Buffer I/O error on device md0, logical block 854622208
> [14365.067018] lost page write due to I/O error on md0
> [14365.067023] JBD2: I/O error detected when updating journal
superblock for md0:8.
> [14382.768622] EXT4-fs error (device md0): ext4_find_entry: reading
directory #6879966 offset 0
> [14382.820264] Buffer I/O error on device md0, logical block 0
> [14382.820332] lost page write due to I/O error on md0
> [14401.997859] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6891861, block=27267765
> [14401.998043] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.041639] Buffer I/O error on device md0, logical block 0
> [14402.041708] lost page write due to I/O error on md0
> [14402.042025] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892055, block=27267777
> [14402.042189] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.042337] Buffer I/O error on device md0, logical block 0
> [14402.042404] lost page write due to I/O error on md0
> [14402.042615] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6891691, block=27267754
> [14402.042780] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.042927] Buffer I/O error on device md0, logical block 0
> [14402.042994] lost page write due to I/O error on md0
> [14402.043204] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6891589, block=27267748
> [14402.043369] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.043514] Buffer I/O error on device md0, logical block 0
> [14402.043581] lost page write due to I/O error on md0
> [14402.045186] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892719, block=27267818
> [14402.045351] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.045500] Buffer I/O error on device md0, logical block 0
> [14402.045569] lost page write due to I/O error on md0
> [14402.061829] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6891914, block=27267768
> [14402.061983] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.062117] Buffer I/O error on device md0, logical block 0
> [14402.062175] lost page write due to I/O error on md0
> [14402.062495] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6891136, block=27267719
> [14402.062651] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.062793] Buffer I/O error on device md0, logical block 0
> [14402.062859] lost page write due to I/O error on md0
> [14402.063053] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6893036, block=27267838
> [14402.063217] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.063357] Buffer I/O error on device md0, logical block 0
> [14402.063423] lost page write due to I/O error on md0
> [14402.063624] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892357, block=27267796
> [14402.063793] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.063935] Buffer I/O error on device md0, logical block 0
> [14402.064001] lost page write due to I/O error on md0
> [14402.064193] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6891032, block=27267713
> [14402.064355] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.064496] Buffer I/O error on device md0, logical block 0
> [14402.064561] lost page write due to I/O error on md0
> [14402.064741] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6891782, block=27267760
> [14402.064906] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.065232] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892332, block=27267794
> [14402.065395] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.065714] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892755, block=27267821
> [14402.065878] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.066197] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892235, block=27267788
> [14402.066362] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.066675] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892552, block=27267808
> [14402.066840] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.067156] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892123, block=27267781
> [14402.067321] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.067635] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892256, block=27267789
> [14402.067800] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.068114] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892532, block=27267807
> [14402.068278] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.068594] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892318, block=27267793
> [14402.068758] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.069069] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892845, block=27267826
> [14402.069233] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.069543] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892980, block=27267835
> [14402.069707] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.074971] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6890993, block=27267711
> [14402.075140] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.075540] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6891750, block=27267758
> [14402.075686] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.076028] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6891642, block=27267751
> [14402.076174] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.076543] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892605, block=27267811
> [14402.076689] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.077059] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892136, block=27267782
> [14402.077223] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.077567] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892717, block=27267818
> [14402.077732] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.078080] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6890999, block=27267711
> [14402.078243] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.078593] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892362, block=27267796
> [14402.080842] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.083259] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892867, block=27267828
> [14402.083423] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.083798] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6891361, block=27267734
> [14402.083963] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.084315] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892012, block=27267774
> [14402.084480] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.084852] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6891626, block=27267750
> [14402.085014] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.085365] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6891320, block=27267731
> [14402.085530] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.085880] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6891588, block=27267748
> [14402.086044] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.086390] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892584, block=27267810
> [14402.086556] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.086901] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6892891, block=27267829
> [14402.087066] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.087416] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6891118, block=27267718
> [14402.087579] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.087930] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6891559, block=27267746
> [14402.088094] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.088445] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6891212, block=27267724
> [14402.088609] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.091550] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6890993, block=27267711
> [14402.091718] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.106045] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6890999, block=27267711
> [14402.106212] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.141662] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6889579, block=27267622
> [14402.141829] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.142185] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6889980, block=27267647
> [14402.142350] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.142703] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6889704, block=27267630
> [14402.142868] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.143318] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6889340, block=27267607
> [14402.143483] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.143826] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6889563, block=27267621
> [14402.143990] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.144341] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6889220, block=27267600
> [14402.144506] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.144869] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6889358, block=27267608
> [14402.145034] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.145379] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6889530, block=27267619
> [14402.145542] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.145890] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6889721, block=27267631
> [14402.146054] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.146398] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6889621, block=27267625
> [14402.146562] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.146900] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6889920, block=27267643
> [14402.147047] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.147390] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6889508, block=27267618
> [14402.147536] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.147869] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6889673, block=27267628
> [14402.148015] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.153911] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6889220, block=27267600
> [14402.154075] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.155819] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6889340, block=27267607
> [14402.155987] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.261374] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6876431, block=27266800
> [14402.261522] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.261981] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6875468, block=27266740
> [14402.262128] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.262587] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6878644, block=27266939
> [14402.262753] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.263223] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6875990, block=27266773
> [14402.263388] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.263741] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6875325, block=27266731
> [14402.263908] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.264259] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6875779, block=27266760
> [14402.264424] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.264808] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6879881, block=27267016
> [14402.264972] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.265325] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6878489, block=27266929
> [14402.265491] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.265842] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6878591, block=27266935
> [14402.266005] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.266357] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6876138, block=27266782
> [14402.266520] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.266876] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6875310, block=27266730
> [14402.267042] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.267396] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6876274, block=27266791
> [14402.267560] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.267907] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6875648, block=27266751
> [14402.268071] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.268422] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6876119, block=27266781
> [14402.268586] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.269056] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6877808, block=27266886
> [14402.269219] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.269573] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6878101, block=27266905
> [14402.269738] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.270088] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6877967, block=27266896
> [14402.270264] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.270614] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6877835, block=27266888
> [14402.270793] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.271146] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6875370, block=27266734
> [14402.271323] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.271679] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6877955, block=27266896
> [14402.271854] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.272214] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6876218, block=27266787
> [14402.272391] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.272745] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6875340, block=27266732
> [14402.272922] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.273281] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6878031, block=27266900
> [14402.273452] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.273919] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6874892, block=27266704
> [14402.274097] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.274454] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6878014, block=27266899
> [14402.274628] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.274987] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6876066, block=27266778
> [14402.275146] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.275488] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6879778, block=27267010
> [14402.275646] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.275996] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6878310, block=27266918
> [14402.276151] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.276624] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6877450, block=27266864
> [14402.276793] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.277148] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6875397, block=27266736
> [14402.277315] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.277778] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6877004, block=27266836
> [14402.277943] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.278295] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6875606, block=27266749
> [14402.280543] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.283306] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6874892, block=27266704
> [14402.283472] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.285354] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6875310, block=27266730
> [14402.285519] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.302533] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6874640, block=27266688
> [14402.302698] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.304480] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6874640, block=27266688
> [14402.304629] EXT4-fs (md0): previous I/O error to superblock detected
> [14402.752437] EXT4-fs error (device md0): ext4_journal_start_sb:
Detected aborted journal
> [14402.752606] EXT4-fs (md0): Remounting filesystem read-only
> [14419.267133] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6874640, block=27266688
> [14419.297937] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6874892, block=27266704
> [14419.301517] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6875310, block=27266730
> [14419.332861] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6889220, block=27267600
> [14419.335590] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6889340, block=27267607
> [14419.341744] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6890993, block=27267711
> [14419.343458] EXT4-fs error (device md0): __ext4_get_inode_loc:
unable to read inode block - inode=6890999, block=27267711
Richard Scobie wrote:
> Andrew Dunn wrote:
>
>> Do you think that the controller is dropping out? I know that I have 4
>> drives on one controller (AOC-USAS-L8i) and 5 drives on the other
>> controller (same make/model), but I think they are sequentially
>> connected... as in sd[efghi] should be on one device and sd[jklm] should
>> be on the other... any easy way to verify?
>
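One quick way to answer the quoted question, mapping each sd device to its SCSI host (controller), is to read the sysfs `device` symlink; this is a sketch, assuming the usual Linux sysfs layout:

```shell
# Print which SCSI host each block device hangs off. The symlink target
# ends in host:bus:target:lun (e.g. 11:0:3:0); disks that share the
# first number sit behind the same controller.
for dev in /sys/block/*; do
    link=$(readlink "$dev/device" 2>/dev/null) || continue
    printf '%s -> %s\n' "${dev##*/}" "${link##*/}"
done
```

The dmesg excerpts earlier in the thread already hint at the split: sdh reports as `sd 10:0:3:0` while sdi through sdm report as `sd 11:0:x:0`, i.e. host 10 vs. host 11. `lsscsi`, if installed, prints a similar mapping.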
> If you are running smartd, stop it, and do not use the smartctl
> command on drives attached to these controllers - its use causes the
> drives to be taken offline.
>
> It appears that smartctl is broken with LSISAS 1068E-based controllers.
>
> See:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=452389
>
> and
>
> http://marc.info/?l=linux-scsi&m=125673590221135&w=2
>
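Until the underlying bug is fixed, one interim approach is to make sure smartd never opens the affected drives at all. A hypothetical `/etc/smartd.conf` fragment (device names here are examples; `-a` is smartd's "monitor everything" directive):

```
# /etc/smartd.conf - hypothetical fragment. With DEVICESCAN commented
# out, smartd monitors only the devices listed explicitly, so it never
# touches the drives behind the LSISAS 1068E controllers (sd[e-m]).
# DEVICESCAN
/dev/sda -a    # system disk: still monitored
/dev/sdb -a
```

Restart smartd after editing so the new device list takes effect.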
> Regards,
>
> Richard
>
--
Andrew Dunn
http://agdunn.net
* Re: RAID 6 Failure follow up
2009-11-08 18:22 ` Andrew Dunn
@ 2009-11-08 18:34 ` Joe Landman
0 siblings, 0 replies; 23+ messages in thread
From: Joe Landman @ 2009-11-08 18:34 UTC (permalink / raw)
To: Andrew Dunn; +Cc: Richard Scobie, Roger Heflin, robin, linux-raid list, nfbrown
Andrew Dunn wrote:
> New data: I got this from dmesg when it went down again. Hopefully
> it is significant to you guys.
>
>> [14269.650381] sd 10:0:3:0: rejecting I/O to offline device
>> [14269.650453] sd 10:0:3:0: rejecting I/O to offline device
>> [14269.650524] sd 10:0:3:0: rejecting I/O to offline device
>> [14269.650595] sd 10:0:3:0: rejecting I/O to offline device
>> [14269.650672] sd 10:0:3:0: [sdh] Unhandled error code
>> [14269.650675] sd 10:0:3:0: [sdh] Result: hostbyte=DID_NO_CONNECT
> driverbyte=DRIVER_OK
>> [14269.650680] end_request: I/O error, dev sdh, sector 1435085631
>> [14269.650749] raid5:md0: read error not correctable (sector
> 1435085568 on sdh1).
>> [14269.650753] raid5: Disk failure on sdh1, disabling device.
>> [14269.650754] raid5: Operation continuing on 7 devices.
>> [14269.650886] raid5:md0: read error not correctable (sector
> 1435085576 on sdh1).
>> [14269.650890] raid5:md0: read error not correctable (sector
> 1435085584 on sdh1).
>> [14269.650894] raid5:md0: read error not correctable (sector
> 1435085592 on sdh1).
[...]
I am not convinced this is a drive failure (yet). You have sdh, sdi, sdj,
sdk, sdl, and sdm all reporting errors or error recovery.
This sounds like a physical backplane failure (is this on an expander
system? we have seen this/had this happen before), a cable to the SATA
card failing (we have seen this/had this happen before), or a power
supply issue (can't handle all the drives in constant operation, which
we have seen before as well).
Driver issues are possible, but it is following normal failure code
paths, so unless the driver is tickling the removal code on its own ...
SMART could be offlining the drive, leaving it non-responsive.
Something else could be doing that as well (vibration, power quality, ...)
What does
hdparm -I /dev/sdh
tell us?
If nothing, we need to use sdparm to get some information.
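As a side note, Andrew's earlier question about which controller each drive
sits on can be answered from sysfs. A minimal sketch (assuming the
/sys/block layout of 2.6-era kernels; `map_host` is just a helper name for
illustration):

```shell
# map_host: extract the SCSI host (HBA) name from a resolved sysfs device path
map_host() {
    printf '%s\n' "$1" | grep -o 'host[0-9]*' | head -n 1
}

# Walk all sd devices (if any) and show which host adapter owns each one,
# so you can see whether the failing drives all share a controller.
for dev in /sys/block/sd*; do
    [ -e "$dev" ] || continue        # no sd devices present on this box
    host=$(map_host "$(readlink -f "$dev/device")")
    printf '%s -> %s\n' "${dev##*/}" "${host:-unknown}"
done
```

Drives sharing one "hostN" hang off the same HBA, so a cluster of failures
on a single host number would point at the controller or its cabling.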
Joe
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
* Re: RAID 6 Failure follow up
2009-11-08 18:01 ` Richard Scobie
2009-11-08 18:22 ` Andrew Dunn
@ 2009-11-08 22:09 ` Andrew Dunn
2009-11-08 22:59 ` Richard Scobie
1 sibling, 1 reply; 23+ messages in thread
From: Andrew Dunn @ 2009-11-08 22:09 UTC (permalink / raw)
To: Richard Scobie; +Cc: Roger Heflin, robin, linux-raid list
I am not, but this is quite interesting. What versions are affected? I
am on Ubuntu 9.10.
Richard Scobie wrote:
> Andrew Dunn wrote:
>
>> Do you think that the controller is dropping out? I know that I have 4
>> drives on one controller (AOC-USAS-L8i) and 5 drives on the other
>> controller (SAME make/model). but I think they are sequentially
>> connected... as in sd[efghi] should be on one device and sd[jklm] should
>> be on the other... any easy way to verify?
>
> If you are running smartd, cease doing so and do not use the smartctl
> command on drives attached to these controllers - use causes drives to
> be offlined.
>
> It appears the smartctl is broken with LSISAS 1068E based controllers.
>
> See:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=452389
>
> and
>
> http://marc.info/?l=linux-scsi&m=125673590221135&w=2
>
> Regards,
>
> Richard
>
--
Andrew Dunn
http://agdunn.net
* Re: RAID 6 Failure follow up
2009-11-08 22:09 ` Andrew Dunn
@ 2009-11-08 22:59 ` Richard Scobie
2009-11-09 2:45 ` Ryan Wagoner
0 siblings, 1 reply; 23+ messages in thread
From: Richard Scobie @ 2009-11-08 22:59 UTC (permalink / raw)
To: Andrew Dunn, Linux RAID Mailing List
Andrew Dunn wrote:
> I am not, but this is quite interesting. What versions are affected? I
> am in ubuntu 9.10.
To my knowledge, there is no smartmontools version safe for use on these
LSI based controllers.
You may run smartctl commands a few times and get away with it, but
eventually it will bite you.
Losing 14 drives on a 16 drive array as a result, is no fun...
Hopefully it will be fixed one day.
Regards,
Richard
* Re: RAID 6 Failure follow up
2009-11-08 22:59 ` Richard Scobie
@ 2009-11-09 2:45 ` Ryan Wagoner
2009-11-09 2:57 ` Richard Scobie
2009-11-09 8:09 ` Gabor Gombas
0 siblings, 2 replies; 23+ messages in thread
From: Ryan Wagoner @ 2009-11-09 2:45 UTC (permalink / raw)
To: Richard Scobie; +Cc: Andrew Dunn, Linux RAID Mailing List
This is interesting to hear, as I have been using smartmontools on my
Supermicro LSI 1068E controller with the target firmware for 2 years
now on CentOS 5. I have 3 RAID 1 arrays across 2 drives, a RAID 5
array across 3 drives, and a RAID 0 across 2 drives.
I routinely query smartctl with something like
for i in a b c d e f; do smartctl -a /dev/sd$i | grep Reallocated; done
or
for i in a b c d e f; do smartctl -a /dev/sd$i | grep Temperature; done
Here are the system details:
cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [raid0]
md0 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]
md2 : active raid1 sdb3[1] sda3[0]
2032128 blocks [2/2] [UU]
md3 : active raid0 sdd1[1] sdc1[0]
625137152 blocks 64k chunks
md4 : active raid5 sdg1[2] sdf1[1] sde1[0]
1953519872 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
md1 : active raid1 sdb2[1] sda2[0]
154151616 blocks [2/2] [UU]
lspci
02:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E
PCI-Express Fusion-MPT SAS (rev 08)
modprobe.conf
alias scsi_hostadapter mptbase
alias scsi_hostadapter1 mptsas
rpm -qa | grep smartmontools
smartmontools-5.38-2.el5
uname -r
2.6.18-128.4.1.el5
On Sun, Nov 8, 2009 at 5:59 PM, Richard Scobie <richard@sauce.co.nz> wrote:
> Andrew Dunn wrote:
>>
>> I am not, but this is quite interesting. What versions are affected? I
>> am in ubuntu 9.10.
>
> To my knowledge, there is no smartmontools version safe for use on these LSI
> based controllers.
>
> You may run smartctl commands a few times and get away with it, but
> eventually it will bite you.
>
> Losing 14 drives on a 16 drive array as a result, is no fun...
>
> Hopefully it will be fixed one day.
>
> Regards,
>
> Richard
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
* Re: RAID 6 Failure follow up
2009-11-09 2:45 ` Ryan Wagoner
@ 2009-11-09 2:57 ` Richard Scobie
2009-11-09 8:09 ` Gabor Gombas
1 sibling, 0 replies; 23+ messages in thread
From: Richard Scobie @ 2009-11-09 2:57 UTC (permalink / raw)
To: Ryan Wagoner; +Cc: Andrew Dunn, Linux RAID Mailing List
Ryan Wagoner wrote:
> This is interesting to hear as I have been using smartmontools on my
> Supermicro LSI 1068E controller with the target firmware for 2 years
> now on CentOS 5. I have 3 RAID 1 arrays across 2 drives, a RAID 5
> drive across 3 drives, and a RAID 0 across 2 drives.
I have 3 boxes using 1068E controllers attached to 16-drive, port
expander based chassis, built over the last 2.5 years, and they all
react badly.
In fact the latest one, put together a month ago (which is using more
recent controller IT firmware and kernel than the other two), will not
tolerate a single smartctl command, where the other two will maybe 50%
of the time.
Something is not right here, and others running different drive setups
(direct attached and port multiplier based) are seeing the same thing.
Suffice it to say, I would recommend heavy testing before putting these
into production, and I personally have no confidence in them currently.
Regards,
Richard
* Re: RAID 6 Failure follow up
2009-11-09 2:45 ` Ryan Wagoner
2009-11-09 2:57 ` Richard Scobie
@ 2009-11-09 8:09 ` Gabor Gombas
2009-11-09 10:08 ` Andrew Dunn
1 sibling, 1 reply; 23+ messages in thread
From: Gabor Gombas @ 2009-11-09 8:09 UTC (permalink / raw)
To: Ryan Wagoner; +Cc: Richard Scobie, Andrew Dunn, Linux RAID Mailing List
On Sun, Nov 08, 2009 at 09:45:40PM -0500, Ryan Wagoner wrote:
> This is interesting to hear as I have been using smartmontools on my
> Supermicro LSI 1068E controller with the target firmware for 2 years
> now on CentOS 5. I have 3 RAID 1 arrays across 2 drives, a RAID 5
> drive across 3 drives, and a RAID 0 across 2 drives.
[...]
> uname -r
> 2.6.18-128.4.1.el5
Kernel version matters. With 2.6.22 we only got occasional complaints
that the drives were not capable of SMART checks; these were untrue,
but otherwise harmless. With 2.6.26 and 2.6.30, the controller offlines
the disks.
Gabor
--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------
* Re: RAID 6 Failure follow up
2009-11-09 8:09 ` Gabor Gombas
@ 2009-11-09 10:08 ` Andrew Dunn
2009-11-09 11:34 ` Gabor Gombas
0 siblings, 1 reply; 23+ messages in thread
From: Andrew Dunn @ 2009-11-09 10:08 UTC (permalink / raw)
To: Gabor Gombas; +Cc: Ryan Wagoner, Richard Scobie, Linux RAID Mailing List
Does it momentarily offline the disks, i.e. do they re-appear in /dev
within moments? That would be similar to the behavior I am
experiencing: the disks drop from the array, but they are back in /dev
by the time I get a chance to look at them.
I am, however, not running smartd to my knowledge. smartmontools is
installed and I access it through the Webmin module, but checking the
drives that way and the array failures have not happened at the same time.
Gabor Gombas wrote:
> On Sun, Nov 08, 2009 at 09:45:40PM -0500, Ryan Wagoner wrote:
>
>
>> This is interesting to hear as I have been using smartmontools on my
>> Supermicro LSI 1068E controller with the target firmware for 2 years
>> now on CentOS 5. I have 3 RAID 1 arrays across 2 drives, a RAID 5
>> drive across 3 drives, and a RAID 0 across 2 drives.
>>
>
> [...]
>
>
>> uname -r
>> 2.6.18-128.4.1.el5
>>
>
> Kernel version matters. With 2.6.22 we only got occassional complaints
> that the drives are not capable of SMART checks that were not true but
> were otherwise harmless. With 2.6.26 and 2.6.30, the controller offlines
> the disks.
>
> Gabor
>
>
--
Andrew Dunn
http://agdunn.net
* Re: RAID 6 Failure follow up
2009-11-09 10:08 ` Andrew Dunn
@ 2009-11-09 11:34 ` Gabor Gombas
2009-11-09 22:04 ` Andrew Dunn
2009-11-10 10:55 ` Andrew Dunn
0 siblings, 2 replies; 23+ messages in thread
From: Gabor Gombas @ 2009-11-09 11:34 UTC (permalink / raw)
To: Andrew Dunn; +Cc: Ryan Wagoner, Richard Scobie, Linux RAID Mailing List
On Mon, Nov 09, 2009 at 05:08:23AM -0500, Andrew Dunn wrote:
> does it momentarily offline the disks? like they re-appear in /dev
> within moments? That would be similar behavior to what I am
> experiencing, the disks drop from the array, but they are in /dev by the
> time I get a chance to see them.
No, either the disks need to be physically removed and re-inserted, or
the machine needs to be rebooted.
Gabor
--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------
* Re: RAID 6 Failure follow up
2009-11-09 11:34 ` Gabor Gombas
@ 2009-11-09 22:04 ` Andrew Dunn
2009-11-10 10:55 ` Andrew Dunn
1 sibling, 0 replies; 23+ messages in thread
From: Andrew Dunn @ 2009-11-09 22:04 UTC (permalink / raw)
To: Gabor Gombas; +Cc: Ryan Wagoner, Richard Scobie, Linux RAID Mailing List
Then I am not experiencing this issue. My devices are still in /dev after
the RAID dropout, and I can run SMART scans on them without issue as well.
Gabor Gombas wrote:
> On Mon, Nov 09, 2009 at 05:08:23AM -0500, Andrew Dunn wrote:
>
>
>> does it momentarily offline the disks? like they re-appear in /dev
>> within moments? That would be similar behavior to what I am
>> experiencing, the disks drop from the array, but they are in /dev by the
>> time I get a chance to see them.
>>
>
> No, either the disks need to be physically removed and re-inserted, or
> the machine needs to be rebooted.
>
> Gabor
>
>
--
Andrew Dunn
http://agdunn.net
* Re: RAID 6 Failure follow up
2009-11-09 11:34 ` Gabor Gombas
2009-11-09 22:04 ` Andrew Dunn
@ 2009-11-10 10:55 ` Andrew Dunn
2009-11-10 11:34 ` Vincent Schut
2009-11-10 12:45 ` Ryan Wagoner
1 sibling, 2 replies; 23+ messages in thread
From: Andrew Dunn @ 2009-11-10 10:55 UTC (permalink / raw)
To: Gabor Gombas; +Cc: Ryan Wagoner, Richard Scobie, Linux RAID Mailing List
I am able to reproduce this SMART error now. I have done it twice, so
maybe other things are causing this as well.
When I scanned the devices this morning with smartctl via Webmin, I lost
8 of the 9 drives. They are, however, still in /dev.
Now, I sent out my logs from the first failure last night; smartctl was
on the system... I don't know if Ubuntu Server's default smartd
configuration does periodic scans, because I didn't change anything.
I would hate to move back to 9.10 and see this problem again.
Should I just not install smartmontools? That seems like a bad solution,
because then I won't be able to check the drives for impending failures.
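If the goal is just to stop smartd from poking the LSI-attached drives
while still monitoring the rest, a smartd.conf sketch like the following
might be a middle ground (hedged: `-d ignore` is the smartd.conf directive
for skipping a device, and the device names here are examples, not taken
from this system):

```
# /etc/smartd.conf sketch: monitor some drives, skip the LSI-attached ones
/dev/sda -a -m root       # full monitoring on a drive NOT behind the 1068E
/dev/sde -d ignore        # leave the 1068E-attached drives alone
/dev/sdf -d ignore
```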
Have you installed LSI's Linux drivers? Some people say this solves
their issue.
From the logs I sent out last night, do you think it could be something else?
Thanks a ton,
Gabor Gombas wrote:
> On Mon, Nov 09, 2009 at 05:08:23AM -0500, Andrew Dunn wrote:
>
>
>> does it momentarily offline the disks? like they re-appear in /dev
>> within moments? That would be similar behavior to what I am
>> experiencing, the disks drop from the array, but they are in /dev by the
>> time I get a chance to see them.
>>
>
> No, either the disks need to be physically removed and re-inserted, or
> the machine needs to be rebooted.
>
> Gabor
>
>
--
Andrew Dunn
http://agdunn.net
* Re: RAID 6 Failure follow up
2009-11-10 10:55 ` Andrew Dunn
@ 2009-11-10 11:34 ` Vincent Schut
2009-11-11 12:34 ` Andrew Dunn
2009-11-17 8:40 ` Vincent Schut
2009-11-10 12:45 ` Ryan Wagoner
1 sibling, 2 replies; 23+ messages in thread
From: Vincent Schut @ 2009-11-10 11:34 UTC (permalink / raw)
To: Andrew Dunn
Cc: Gabor Gombas, Ryan Wagoner, Richard Scobie, Linux RAID Mailing List
Andrew Dunn wrote:
> I am able to reproduce this smart error now. I have done it twice, so
> maybe other things are causing this also.
>
> When I scanned the devices this morning with smartctl via webmin I lost
> 8 of the 9 drives. They are howerver still in my /dev folder.
>
> Now I sent out my logs from the first failure last night, smartctl was
> on the system... I dont know if ubuntu server's default smartd
> configuration makes it do periodic scans because I didnt change anything.
>
> I would hate to move back to 9.10 and see this problem again.
>
> Should I just not install smartmontools? This seems like a bad solution
> because now I wont be able to check the drives in advance for failures.
>
> Have you installed LSI's linux drivers? Some people say this solves
> their issue.
>
> From the logs sent out last night do you think it could be something else?
>
> Thanks a ton,
FWIW, I encountered the same issue, and seem to have found a viable
workaround by accessing the SATA disks on that LSI backplane as SCSI
devices, i.e. by adding '-d scsi' to my smartctl/smartd.conf lines. No
more errors in the logs, no more drives being kicked out.
Not as much info is available that way as when using the SAT driver
('-d sat', or automatic detection); temperature, for example, is
unavailable. It does, however, allow me to initiate the self-tests and
get their results, and to monitor the generic SMART status of the
drives. Quite enough for me.
YMMV, though.
Vincent.
>
> Gabor Gombas wrote:
>> On Mon, Nov 09, 2009 at 05:08:23AM -0500, Andrew Dunn wrote:
>>
>>
>>> does it momentarily offline the disks? like they re-appear in /dev
>>> within moments? That would be similar behavior to what I am
>>> experiencing, the disks drop from the array, but they are in /dev by the
>>> time I get a chance to see them.
>>>
>> No, either the disks need to be physically removed and re-inserted, or
>> the machine needs to be rebooted.
>>
>> Gabor
>>
>>
>
* Re: RAID 6 Failure follow up
2009-11-10 10:55 ` Andrew Dunn
2009-11-10 11:34 ` Vincent Schut
@ 2009-11-10 12:45 ` Ryan Wagoner
1 sibling, 0 replies; 23+ messages in thread
From: Ryan Wagoner @ 2009-11-10 12:45 UTC (permalink / raw)
To: Andrew Dunn; +Cc: Linux RAID Mailing List
Boot up a CentOS 5 LiveCD; it should detect your arrays, and then try
running smartctl. In my experience with different distros, Red Hat
spends a good amount of time making sure enterprise hardware is stable
on their systems, while Ubuntu seems to focus more on desktops.
Ryan
On Tue, Nov 10, 2009 at 5:55 AM, Andrew Dunn <andrew.g.dunn@gmail.com> wrote:
> I am able to reproduce this smart error now. I have done it twice, so
> maybe other things are causing this also.
>
> When I scanned the devices this morning with smartctl via webmin I lost
> 8 of the 9 drives. They are howerver still in my /dev folder.
>
> Now I sent out my logs from the first failure last night, smartctl was
> on the system... I dont know if ubuntu server's default smartd
> configuration makes it do periodic scans because I didnt change anything.
>
> I would hate to move back to 9.10 and see this problem again.
>
> Should I just not install smartmontools? This seems like a bad solution
> because now I wont be able to check the drives in advance for failures.
>
> Have you installed LSI's linux drivers? Some people say this solves
> their issue.
>
> From the logs sent out last night do you think it could be something else?
>
> Thanks a ton,
>
> Gabor Gombas wrote:
>> On Mon, Nov 09, 2009 at 05:08:23AM -0500, Andrew Dunn wrote:
>>
>>
>>> does it momentarily offline the disks? like they re-appear in /dev
>>> within moments? That would be similar behavior to what I am
>>> experiencing, the disks drop from the array, but they are in /dev by the
>>> time I get a chance to see them.
>>>
>>
>> No, either the disks need to be physically removed and re-inserted, or
>> the machine needs to be rebooted.
>>
>> Gabor
>>
>>
>
> --
> Andrew Dunn
> http://agdunn.net
>
>
* Re: RAID 6 Failure follow up
2009-11-10 11:34 ` Vincent Schut
@ 2009-11-11 12:34 ` Andrew Dunn
2009-11-11 12:46 ` Vincent Schut
2009-11-17 8:40 ` Vincent Schut
1 sibling, 1 reply; 23+ messages in thread
From: Andrew Dunn @ 2009-11-11 12:34 UTC (permalink / raw)
To: Vincent Schut
Cc: Gabor Gombas, Ryan Wagoner, Richard Scobie, Linux RAID Mailing List
Thanks for your help. So far, without smartmontools installed, I have
had no issues... but it has only been about 12 hours.
Could you send me your smartd.conf?
Vincent Schut wrote:
> Andrew Dunn wrote:
>> I am able to reproduce this smart error now. I have done it twice, so
>> maybe other things are causing this also.
>>
>> When I scanned the devices this morning with smartctl via webmin I lost
>> 8 of the 9 drives. They are howerver still in my /dev folder.
>>
>> Now I sent out my logs from the first failure last night, smartctl was
>> on the system... I dont know if ubuntu server's default smartd
>> configuration makes it do periodic scans because I didnt change
>> anything.
>>
>> I would hate to move back to 9.10 and see this problem again.
>>
>> Should I just not install smartmontools? This seems like a bad solution
>> because now I wont be able to check the drives in advance for failures.
>>
>> Have you installed LSI's linux drivers? Some people say this solves
>> their issue.
>>
>> From the logs sent out last night do you think it could be something
>> else?
>>
>> Thanks a ton,
>
> FWIW, I encountered the same issue, and seem to have found a viable
> workaround by accessing the SATA disks on that LSI backplane as scsi
> devices, e.g. by adding '-d scsi' to my smartctl/smartd.conf lines. No
> more errors in the logs, no more drives being kicked out.
> Though not as much info is available that way as when using de sata
> driver ('-d sat', or automatically), like temperature is unavailable,
> it does allow me to initiate the selftests and get their result, and
> to monitor generic smart status of the drives. Quite enough for me.
>
> YMMV, though.
>
> Vincent.
>>
>> Gabor Gombas wrote:
>>> On Mon, Nov 09, 2009 at 05:08:23AM -0500, Andrew Dunn wrote:
>>>
>>>
>>>> does it momentarily offline the disks? like they re-appear in /dev
>>>> within moments? That would be similar behavior to what I am
>>>> experiencing, the disks drop from the array, but they are in /dev
>>>> by the
>>>> time I get a chance to see them.
>>>>
>>> No, either the disks need to be physically removed and re-inserted, or
>>> the machine needs to be rebooted.
>>>
>>> Gabor
>>>
>>>
>>
>
>
--
Andrew Dunn
http://agdunn.net
* Re: RAID 6 Failure follow up
2009-11-11 12:34 ` Andrew Dunn
@ 2009-11-11 12:46 ` Vincent Schut
0 siblings, 0 replies; 23+ messages in thread
From: Vincent Schut @ 2009-11-11 12:46 UTC (permalink / raw)
To: linux-raid
Andrew Dunn wrote:
> Thanks for your help, so far without smartctl installed I have had no
> issues... but it has only been about 12 hours.
I also had no issues when not running smartd/smartctl. It seems to be
the combination of kernel, backplane SAS driver, and SMART that triggers
the trouble...
>
> Could you send me your smatd.conf?
It's pretty much default; there's just one uncommented line in it:
DEVICESCAN -d scsi -a -o on -S on -s (S/../.././02|L/../../6/03) -W 4,45,55 -R 5 -m my@mail.address -M exec /usr/share/smartmontools/smartd-runner
I plan to replace the DEVICESCAN with explicit /dev/sd.. entries, but as
I'm currently adding and removing (USB) drives regularly, I kept the
auto devicescan statement.
The rest means: enable SMART on all drives, schedule daily short and
weekly long self-tests, warn on too-high temperature or a temperature
change of more than 5 degrees, and mail warnings/errors to me.
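For reference, smartd matches that `-s` regular expression against a date
string of the form `T/MM/DD/d/HH` (test type, month, day of month, day of
week with 6 = Saturday, hour), so the two alternatives above decode as:

```
S/../.././02   ->  short self-test every day at 02:00
L/../../6/03   ->  long self-test every Saturday at 03:00
```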
VS.
>
> Vincent Schut wrote:
>> Andrew Dunn wrote:
>>> I am able to reproduce this smart error now. I have done it twice, so
>>> maybe other things are causing this also.
>>>
>>> When I scanned the devices this morning with smartctl via webmin I lost
>>> 8 of the 9 drives. They are howerver still in my /dev folder.
>>>
>>> Now I sent out my logs from the first failure last night, smartctl was
>>> on the system... I dont know if ubuntu server's default smartd
>>> configuration makes it do periodic scans because I didnt change
>>> anything.
>>>
>>> I would hate to move back to 9.10 and see this problem again.
>>>
>>> Should I just not install smartmontools? This seems like a bad solution
>>> because now I wont be able to check the drives in advance for failures.
>>>
>>> Have you installed LSI's linux drivers? Some people say this solves
>>> their issue.
>>>
>>> From the logs sent out last night do you think it could be something
>>> else?
>>>
>>> Thanks a ton,
>> FWIW, I encountered the same issue, and seem to have found a viable
>> workaround by accessing the SATA disks on that LSI backplane as scsi
>> devices, e.g. by adding '-d scsi' to my smartctl/smartd.conf lines. No
>> more errors in the logs, no more drives being kicked out.
>> Though not as much info is available that way as when using de sata
>> driver ('-d sat', or automatically), like temperature is unavailable,
>> it does allow me to initiate the selftests and get their result, and
>> to monitor generic smart status of the drives. Quite enough for me.
>>
>> YMMV, though.
>>
>> Vincent.
>>> Gabor Gombas wrote:
>>>> On Mon, Nov 09, 2009 at 05:08:23AM -0500, Andrew Dunn wrote:
>>>>
>>>>
>>>>> does it momentarily offline the disks? like they re-appear in /dev
>>>>> within moments? That would be similar behavior to what I am
>>>>> experiencing, the disks drop from the array, but they are in /dev
>>>>> by the
>>>>> time I get a chance to see them.
>>>>>
>>>> No, either the disks need to be physically removed and re-inserted, or
>>>> the machine needs to be rebooted.
>>>>
>>>> Gabor
>>>>
>>>>
>>
>
* Re: RAID 6 Failure follow up
2009-11-10 11:34 ` Vincent Schut
2009-11-11 12:34 ` Andrew Dunn
@ 2009-11-17 8:40 ` Vincent Schut
1 sibling, 0 replies; 23+ messages in thread
From: Vincent Schut @ 2009-11-17 8:40 UTC (permalink / raw)
Cc: Andrew Dunn, Gabor Gombas, Ryan Wagoner, Richard Scobie,
Linux RAID Mailing List
Vincent Schut wrote:
> Andrew Dunn wrote:
>> I am able to reproduce this smart error now. I have done it twice, so
>> maybe other things are causing this also.
>>
>> When I scanned the devices this morning with smartctl via webmin I lost
>> 8 of the 9 drives. They are howerver still in my /dev folder.
>>
>> Now I sent out my logs from the first failure last night, smartctl was
>> on the system... I dont know if ubuntu server's default smartd
>> configuration makes it do periodic scans because I didnt change anything.
>>
>> I would hate to move back to 9.10 and see this problem again.
>>
>> Should I just not install smartmontools? This seems like a bad solution
>> because now I wont be able to check the drives in advance for failures.
>>
>> Have you installed LSI's linux drivers? Some people say this solves
>> their issue.
>>
>> From the logs sent out last night do you think it could be something
>> else?
>>
>> Thanks a ton,
>
> FWIW, I encountered the same issue, and seem to have found a viable
> workaround by accessing the SATA disks on that LSI backplane as scsi
> devices, e.g. by adding '-d scsi' to my smartctl/smartd.conf lines. No
> more errors in the logs, no more drives being kicked out.
> Though not as much info is available that way as when using de sata
> driver ('-d sat', or automatically), like temperature is unavailable, it
> does allow me to initiate the selftests and get their result, and to
> monitor generic smart status of the drives. Quite enough for me.
>
> YMMV, though.
Folks, I need to retract this. Though I've had far fewer problems with
'-d scsi' instead of '-d sat' when running the LSI SAS / smartmontools /
mdadm combo, I got bitten again last night by a drive being kicked out
for no apparent reason. For now my only advice is: don't use
smartmontools on drives that sit behind this LSI SAS backplane.
I dearly hope this improves soon; I hate having my drives go
unmonitored for too long...
Vincent.
>
> Vincent.
>>
>> Gabor Gombas wrote:
>>> On Mon, Nov 09, 2009 at 05:08:23AM -0500, Andrew Dunn wrote:
>>>
>>>
>>>> does it momentarily offline the disks? like they re-appear in /dev
>>>> within moments? That would be similar behavior to what I am
>>>> experiencing, the disks drop from the array, but they are in /dev by
>>>> the
>>>> time I get a chance to see them.
>>>>
>>> No, either the disks need to be physically removed and re-inserted, or
>>> the machine needs to be rebooted.
>>>
>>> Gabor
>>>
>>>
>>
>
>
Thread overview: 23+ messages
2009-11-08 14:07 RAID 6 Failure follow up Andrew Dunn
2009-11-08 14:23 ` Roger Heflin
2009-11-08 14:30 ` Andrew Dunn
2009-11-08 18:01 ` Richard Scobie
2009-11-08 18:22 ` Andrew Dunn
2009-11-08 18:34 ` Joe Landman
2009-11-08 22:09 ` Andrew Dunn
2009-11-08 22:59 ` Richard Scobie
2009-11-09 2:45 ` Ryan Wagoner
2009-11-09 2:57 ` Richard Scobie
2009-11-09 8:09 ` Gabor Gombas
2009-11-09 10:08 ` Andrew Dunn
2009-11-09 11:34 ` Gabor Gombas
2009-11-09 22:04 ` Andrew Dunn
2009-11-10 10:55 ` Andrew Dunn
2009-11-10 11:34 ` Vincent Schut
2009-11-11 12:34 ` Andrew Dunn
2009-11-11 12:46 ` Vincent Schut
2009-11-17 8:40 ` Vincent Schut
2009-11-10 12:45 ` Ryan Wagoner
2009-11-08 14:36 ` Andrew Dunn
2009-11-08 14:56 ` Roger Heflin
2009-11-08 17:08 ` Andrew Dunn