* Monitoring for failed drives
@ 2012-04-25 12:39 Brian Candler
2012-04-25 16:18 ` Jan Ceuleers
2012-05-03 9:06 ` Brian Candler
0 siblings, 2 replies; 4+ messages in thread
From: Brian Candler @ 2012-04-25 12:39 UTC (permalink / raw)
To: linux-raid
One of the servers I've been setting up, which has an md RAID0 for temporary
storage, has just had a disk error.
root@storage2:~# ls -l /disk/scratch/scratch/path/to/file
ls: cannot access /disk/scratch/scratch/path/to/file/file.4000.new.1521.rsi: Remote I/O error
ls: cannot access /disk/scratch/scratch/path/to/file/file.4000.new.1522.rsi: Remote I/O error
ls: cannot access /disk/scratch/scratch/path/to/file/file.4000.new.1523.rsi: Remote I/O error
...
dmesg shows:
[ 1232.406491] mpt2sas1: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1232.406497] mpt2sas1: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1232.406512] sd 5:0:0:0: [sdr] Unhandled sense code
[ 1232.406514] sd 5:0:0:0: [sdr] Result: hostbyte=invalid driverbyte=DRIVER_SENSE
[ 1232.406518] sd 5:0:0:0: [sdr] Sense Key : Medium Error [current]
[ 1232.406522] Info fld=0x30000588
[ 1232.406524] sd 5:0:0:0: [sdr] Add. Sense: Unrecovered read error
[ 1232.406528] sd 5:0:0:0: [sdr] CDB: Read(10): 28 00 30 00 05 80 00 00 10 00
[ 1232.406537] end_request: critical target error, dev sdr, sector 805307776
OK, so that's fairly obviously a failed drive.
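(As a sanity check - this is my arithmetic, not something from the log - the LBA
bytes in that Read(10) CDB are 30 00 05 80, i.e. 0x30000580:

$ printf '%d\n' 0x30000580
805307776

which matches the sector in the end_request line, so the drive and the block
layer agree on where the bad area is.)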
The problem is, how to detect and report this? At the md RAID level,
`cat /proc/mdstat` and `mdadm --detail` show nothing amiss.
# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active raid0 sdk[8] sdf[4] sdb[0] sdj[9] sdc[1] sde[2] sdd[3] sdi[6] sdg[5] sdh[7] sdv[20] sdw[21] sdl[11] sdu[19] sdt[18] sdn[13] sds[17] sdq[14] sdm[10] sdx[22] sdr[16] sdo[12] sdp[15] sdy[23]
70326362112 blocks super 1.2 512k chunks
unused devices: <none>
root@storage2:~# mdadm --detail /dev/md/scratch
/dev/md/scratch:
Version : 1.2
Creation Time : Mon Apr 23 16:53:59 2012
Raid Level : raid0
Array Size : 70326362112 (67068.45 GiB 72014.19 GB)
Raid Devices : 24
Total Devices : 24
Persistence : Superblock is persistent
Update Time : Mon Apr 23 16:53:59 2012
State : clean
Active Devices : 24
Working Devices : 24
Failed Devices : 0
Spare Devices : 0
Chunk Size : 512K
Name : storage2:scratch (local to host storage2)
UUID : e5d2dce6:91d1d3b9:ae08f838:5e12132a
Events : 0
Number Major Minor RaidDevice State
0 8 16 0 active sync /dev/sdb
1 8 32 1 active sync /dev/sdc
2 8 64 2 active sync /dev/sde
3 8 48 3 active sync /dev/sdd
4 8 80 4 active sync /dev/sdf
5 8 96 5 active sync /dev/sdg
6 8 128 6 active sync /dev/sdi
7 8 112 7 active sync /dev/sdh
8 8 160 8 active sync /dev/sdk
9 8 144 9 active sync /dev/sdj
10 8 192 10 active sync /dev/sdm
11 8 176 11 active sync /dev/sdl
12 8 224 12 active sync /dev/sdo
13 8 208 13 active sync /dev/sdn
14 65 0 14 active sync /dev/sdq
15 8 240 15 active sync /dev/sdp
16 65 16 16 active sync /dev/sdr
17 65 32 17 active sync /dev/sds
18 65 48 18 active sync /dev/sdt
19 65 64 19 active sync /dev/sdu
20 65 80 20 active sync /dev/sdv
21 65 96 21 active sync /dev/sdw
22 65 112 22 active sync /dev/sdx
23 65 128 23 active sync /dev/sdy
So the first question is: what does it take for a drive to be marked as
"failed" by md RAID? Is there some threshold I can set?
Second question: what's a better way of monitoring this proactively, rather
than just waiting for applications to fail and then digging into dmesg?
Recently I installed an excellent set of snmp plugins and MIBs for exposing
both md-raid and smartctl information via SNMP, which I got from
http://www.mad-hacking.net/software/index.xml
http://downloads.mad-hacking.net/software/
Here's the md RAID output (which is really just a reformatting of the info
from mdadm --detail):
root@storage2:~# snmptable -c XXXXXXXX -v 2c storage2 MD-RAID-MIB::mdRaidTable
SNMP table: MD-RAID-MIB::mdRaidTable
mdRaidArrayIndex mdRaidArrayDev mdRaidArrayVersion mdRaidArrayUUID mdRaidArrayLevel mdRaidArrayLayout mdRaidArrayChunkSize mdRaidArraySize mdRaidArrayDeviceSize mdRaidArrayHealthOK mdRaidArrayHasFailedComponents mdRaidArrayHasAvailableSpares mdRaidArrayTotalComponents mdRaidArrayActiveComponents mdRaidArrayWorkingComponents mdRaidArrayFailedComponents mdRaidArraySpareComponents
1 /dev/md/scratch 1.2 e5d2dce6:91d1d3b9:ae08f838:5e12132a raid0 N/A 512K 70326362112 N/A true false false 24 24 24 0 0
And here's the output for SMART (which combines smartctl -i, -H and -A):
root@storage2:~# snmptable -c XXXXXXXX -v 2c storage2 SMARTCTL-MIB::smartCtlTable
SNMP table: SMARTCTL-MIB::smartCtlTable
smartCtlDeviceIndex smartCtlDeviceDev smartCtlDeviceModelFamily smartCtlDeviceDeviceModel smartCtlDeviceSerialNumber smartCtlDeviceUserCapacity smartCtlDeviceATAVersion smartCtlDeviceHealthOK smartCtlDeviceTemperatureCelsius smartCtlDeviceReallocatedSectorCt smartCtlDeviceCurrentPendingSector smartCtlDeviceOfflineUncorrectable smartCtlDeviceUDMACRCErrorCount smartCtlDeviceReadErrorRate smartCtlDeviceSeekErrorRate smartCtlDeviceHardwareECCRecovered
1 /dev/sda ST1000DM003-9YN162 Z1D0BQHF 1,000,204,886,016 bytes [1.00 TB] 8 true 28 0 0 0 0 105 30 ?
2 /dev/sdb ST3000DM001-9YN166 S1F01Z36 3,000,592,982,016 bytes [3.00 TB] 8 true 28 0 0 0 0 105 31 ?
3 /dev/sdc ST3000DM001-9YN166 S1F01932 3,000,592,982,016 bytes [3.00 TB] 8 true 24 0 0 0 0 103 31 ?
4 /dev/sdd ST3000DM001-9YN166 S1F04Y7G 3,000,592,982,016 bytes [3.00 TB] 8 true 26 0 0 0 0 104 31 ?
5 /dev/sde ST3000DM001-9YN166 S1F00KF2 3,000,592,982,016 bytes [3.00 TB] 8 true 25 0 0 0 0 104 31 ?
6 /dev/sdf ST3000DM001-9YN166 S1F01C0D 3,000,592,982,016 bytes [3.00 TB] 8 true 27 0 0 0 0 103 31 ?
7 /dev/sdg ST3000DM001-9YN166 S1F01DFM 3,000,592,982,016 bytes [3.00 TB] 8 true 25 0 0 0 0 104 31 ?
8 /dev/sdh ST3000DM001-9YN166 S1F054EP 3,000,592,982,016 bytes [3.00 TB] 8 true 27 0 0 0 0 105 31 ?
9 /dev/sdi ST3000DM001-9YN166 S1F05304 3,000,592,982,016 bytes [3.00 TB] 8 true 25 0 0 0 0 105 31 ?
10 /dev/sdj ST3000DM001-9YN166 S1F015X5 3,000,592,982,016 bytes [3.00 TB] 8 true 25 0 0 0 0 105 31 ?
11 /dev/sdk ST3000DM001-9YN166 S1F046FB 3,000,592,982,016 bytes [3.00 TB] 8 true 27 0 0 0 0 103 31 ?
12 /dev/sdl ST3000DM001-9YN166 S1F024DW 3,000,592,982,016 bytes [3.00 TB] 8 true 26 0 0 0 0 103 31 ?
13 /dev/sdm ST3000DM001-9YN166 S1F04DKQ 3,000,592,982,016 bytes [3.00 TB] 8 true 25 0 0 0 0 104 31 ?
14 /dev/sdn ST3000DM001-9YN166 S1F014NH 3,000,592,982,016 bytes [3.00 TB] 8 true 25 0 0 0 0 104 31 ?
15 /dev/sdo ST3000DM001-9YN166 S1F049KM 3,000,592,982,016 bytes [3.00 TB] 8 true 26 0 0 0 0 105 31 ?
16 /dev/sdp ST3000DM001-9YN166 S1F01D5A 3,000,592,982,016 bytes [3.00 TB] 8 true 26 0 0 0 0 103 31 ?
17 /dev/sdq ST3000DM001-9YN166 S1F00L20 3,000,592,982,016 bytes [3.00 TB] 8 true 24 0 0 0 0 103 31 ?
18 /dev/sdr ST3000DM001-9YN166 S1F07PN8 3,000,592,982,016 bytes [3.00 TB] 8 true 28 0 8 8 0 81 31 ?
19 /dev/sds ST3000DM001-9YN166 S1F03PS8 3,000,592,982,016 bytes [3.00 TB] 8 true 25 0 0 0 0 104 31 ?
20 /dev/sdt ST3000DM001-9YN166 S1F04SM4 3,000,592,982,016 bytes [3.00 TB] 8 true 25 0 0 0 0 103 31 ?
21 /dev/sdu ST3000DM001-9YN166 S1F00MCQ 3,000,592,982,016 bytes [3.00 TB] 8 true 27 0 0 0 0 105 31 ?
22 /dev/sdv ST3000DM001-9YN166 S1F020YG 3,000,592,982,016 bytes [3.00 TB] 8 true 28 0 0 0 0 104 31 ?
23 /dev/sdw ST3000DM001-9YN166 S1F03NXP 3,000,592,982,016 bytes [3.00 TB] 8 true 26 0 0 0 0 103 31 ?
24 /dev/sdx ST3000DM001-9YN166 S1F054Y7 3,000,592,982,016 bytes [3.00 TB] 8 true 26 0 0 0 0 104 31 ?
25 /dev/sdy ST3000DM001-9YN166 S1F04A0Y 3,000,592,982,016 bytes [3.00 TB] 8 true 27 0 40 40 0 105 31 ?
All drives report smartCtlDeviceHealthOK = true, which derives from the
"PASSED" overall-health result of smartctl -H:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.0.0-16-server] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
The only anomaly I can see here is that sdr is showing 8 pending/offline-
uncorrectable sectors - and sdy is showing 40!
So based on this information, I am going to return sdr and sdy to the
manufacturer for replacement.
But is there a better way for me to be notified quickly of I/O errors
and/or retries - for example, counters maintained in the kernel?
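(For what it's worth, one stopgap would be to have smartd mail when the
attributes that did move here go non-zero. A minimal sketch for
/etc/smartd.conf - untested on this box:

# -a is the default directive set, which includes -H (overall health),
# -C 197 (report non-zero Current_Pending_Sector) and -U 198 (report
# non-zero Offline_Uncorrectable); -m mails the report to root.
DEVICESCAN -a -m root

But that only catches what the drive chooses to report via SMART, which is
why a kernel-side error counter would still be nicer.)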
Thanks,
Brian.
* Re: Monitoring for failed drives
2012-04-25 12:39 Monitoring for failed drives Brian Candler
@ 2012-04-25 16:18 ` Jan Ceuleers
2012-04-25 16:54 ` Brian Candler
2012-05-03 9:06 ` Brian Candler
1 sibling, 1 reply; 4+ messages in thread
From: Jan Ceuleers @ 2012-04-25 16:18 UTC (permalink / raw)
To: Brian Candler; +Cc: linux-raid
Brian Candler wrote:
> The problem is, how to detect and report this? At the md RAID level,
> `cat /proc/mdstat` and `mdadm --detail` show nothing amiss.
>
> # cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
> md127 : active raid0 sdk[8] sdf[4] sdb[0] sdj[9] sdc[1] sde[2] sdd[3] sdi[6] sdg[5] sdh[7] sdv[20] sdw[21] sdl[11] sdu[19] sdt[18] sdn[13] sds[17] sdq[14] sdm[10] sdx[22] sdr[16] sdo[12] sdp[15] sdy[23]
> 70326362112 blocks super 1.2 512k chunks
Brian,
I know that you know this, but this is a RAID0 which does not have any
redundancy. What would you expect md to do? It cannot kick the drive
from the array since this would bring the entire array down.
With the other RAID levels, by contrast, it is practicable to kick failed
drives out of the array, because their contents can be reconstructed from
the mirror or parity information.
Jan
* Re: Monitoring for failed drives
2012-04-25 16:18 ` Jan Ceuleers
@ 2012-04-25 16:54 ` Brian Candler
0 siblings, 0 replies; 4+ messages in thread
From: Brian Candler @ 2012-04-25 16:54 UTC (permalink / raw)
To: Jan Ceuleers; +Cc: linux-raid
On Wed, Apr 25, 2012 at 06:18:05PM +0200, Jan Ceuleers wrote:
> Brian Candler wrote:
> >The problem is, how to detect and report this? At the md RAID level,
> >`cat /proc/mdstat` and `mdadm --detail` show nothing amiss.
> >
> > # cat /proc/mdstat
> > Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
> > md127 : active raid0 sdk[8] sdf[4] sdb[0] sdj[9] sdc[1] sde[2] sdd[3] sdi[6] sdg[5] sdh[7] sdv[20] sdw[21] sdl[11] sdu[19] sdt[18] sdn[13] sds[17] sdq[14] sdm[10] sdx[22] sdr[16] sdo[12] sdp[15] sdy[23]
> > 70326362112 blocks super 1.2 512k chunks
>
> Brian,
>
> I know that you know this, but this is a RAID0 which does not have
> any redundancy. What would you expect md to do?
In an ideal world, maybe it would mark the drive as "failing" or "erroring"
but still keep it in service.
There is some data on the failing disk which is still accessible, so as you
say there is value in keeping the array online, but if you try to access
data in the bad area of that disk you get I/O errors. This is a serious
problem and I want to know about it somehow.
I'm not saying that md is necessarily the right layer to give me this
information. I'm not sure that SMART is necessarily the right place either,
although it may suffice.
If there are some drive error counters I can read, preferably exposed via
SNMP, that would be great. Otherwise, I seek your advice on the best place
to look for this information...
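(One crude interim idea - just a sketch, I haven't wired it up - is net-snmp's
"extend" mechanism, which can publish arbitrary command output, e.g. in
snmpd.conf:

# expose the raw SMART attribute table for one drive
extend smart-sdr /usr/sbin/smartctl -A /dev/sdr

after which the output appears under
NET-SNMP-EXTEND-MIB::nsExtendOutputFull."smart-sdr". That is still only SMART
data, though, not a count of errors the kernel itself has seen.)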
Thanks,
Brian.
* Re: Monitoring for failed drives
2012-04-25 12:39 Monitoring for failed drives Brian Candler
2012-04-25 16:18 ` Jan Ceuleers
@ 2012-05-03 9:06 ` Brian Candler
1 sibling, 0 replies; 4+ messages in thread
From: Brian Candler @ 2012-05-03 9:06 UTC (permalink / raw)
To: linux-raid
On Wed, Apr 25, 2012 at 01:39:51PM +0100, Brian Candler wrote:
> OK, so that's fairly obviously a failed drive.
>
> The problem is, how to detect and report this?
More specifically, is there a kernel counter I can look at, perhaps
something under /sys, which counts the number of I/O errors when accessing a
block device? (Recovered and non-recovered?)
Or is parsing syslog messages the only way to find this sort of error?
I did find this:
$ cat /sys/block/sda/stat
134257 51671 11141686 1337160 60502372 124014063 1476148888 1325166568 0 137287732 1326384944
However, there don't seem to be any error counters in there. According to
http://www.kernel.org/doc/Documentation/block/stat.txt, the fields are:
Name units description
---- ----- -----------
read I/Os requests number of read I/Os processed
read merges requests number of read I/Os merged with in-queue I/O
read sectors sectors number of sectors read
read ticks milliseconds total wait time for read requests
write I/Os requests number of write I/Os processed
write merges requests number of write I/Os merged with in-queue I/O
write sectors sectors number of sectors written
write ticks milliseconds total wait time for write requests
in_flight requests number of I/Os currently in flight
io_ticks milliseconds total time this block device has been active
time_in_queue milliseconds total wait time for all requests
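(The nearest thing I have turned up under /sys so far is the per-SCSI-device
counters, e.g.

$ cat /sys/block/sdr/device/ioerr_cnt

which sits alongside iorequest_cnt and iodone_cnt and prints a hex value, but
I'm not sure exactly which failures increment it, so I'm noting it only as a
possible lead rather than an answer.)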
I also found UCD-DISKIO-MIB in net-snmp, but it doesn't have error counters
either:
diskIOEntry OBJECT-TYPE
SYNTAX DiskIOEntry
MAX-ACCESS not-accessible
STATUS current
DESCRIPTION
"An entry containing a device and its statistics."
INDEX { diskIOIndex }
::= { diskIOTable 1 }
DiskIOEntry ::= SEQUENCE {
diskIOIndex Integer32,
diskIODevice DisplayString,
diskIONRead Counter32,
diskIONWritten Counter32,
diskIOReads Counter32,
diskIOWrites Counter32,
diskIONReadX Counter64,
diskIONWrittenX Counter64
}
Is there anywhere else I should look for this?
Thanks,
Brian.