* read errors corrected
@ 2010-12-30  3:20 James
  2010-12-30  5:24 ` Mikael Abrahamsson
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: James @ 2010-12-30  3:20 UTC (permalink / raw)
  To: linux-raid

All,

I'm looking for a bit of guidance here. I have a RAID 6 set up on my
system and am seeing some errors in my logs as follows:

# cat messages | grep "read erro"
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262528 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262536 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262544 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262552 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262560 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262568 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262576 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262584 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262592 on sda4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923648 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923656 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923664 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923672 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923680 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923688 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923696 on sdb4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923520 on sdc4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923528 on sdc4)
Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
sectors at 600923536 on sdc4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940552 on sdd4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940672 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940680 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940688 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940696 on sdb4)

I've Google'd the heck out of this error message but am not seeing a
clear and concise message: is this benign? What would cause these
errors? Should I be concerned?

There is an error message (read error corrected) on each of the drives
in the array. They all seem to be functioning properly. The I/O on the
drives is pretty heavy for some parts of the day.

# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5]
[raid4] [multipath]
md1 : active raid6 sdb1[1] sda1[0] sdd1[3] sdc1[2]
      497792 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]

md2 : active raid6 sdb2[1] sda2[0] sdd2[3] sdc2[2]
      4000000 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]

md3 : active raid6 sdb3[1] sda3[0] sdd3[3] sdc3[2]
      25992960 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]

md4 : active raid6 sdb4[1] sda4[0] sdd4[3] sdc4[2]
      2899780480 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>

I have a really hard time believing there's something wrong with all
of the drives in the array, although admittedly they're the same model
from the same manufacturer.

Can someone point me in the right direction?
(a) what causes these errors precisely?
(b) is the error benign? How can I determine if it is *likely* a
hardware problem? (I imagine it's probably impossible to tell if it's
HW until it's too late)
(c) are these errors expected in a RAID array that is heavily used?
(d) what kind of errors should I see regarding "read errors" that
*would* indicate an imminent hardware failure?

Thoughts and ideas would be welcomed. I'm sure a thread where some
hefty discussion is thrown at this topic will help future Googlers
like me. :)

-james

* Re: read errors corrected
  2010-12-30  3:20 read errors corrected James
@ 2010-12-30  5:24 ` Mikael Abrahamsson
  2010-12-30 16:33   ` James
  2010-12-30  9:15 ` Neil Brown
  2010-12-30 10:13 ` Giovanni Tessore
  2 siblings, 1 reply; 18+ messages in thread
From: Mikael Abrahamsson @ 2010-12-30  5:24 UTC (permalink / raw)
  To: James; +Cc: linux-raid

On Thu, 30 Dec 2010, James wrote:

> Can someone point me in the right direction?
> (a) what causes these errors precisely?

dmesg should give you information on whether these are SATA errors.
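
For example, something along these lines (the exact grep pattern is only a
guess, adjust to taste) should pull the underlying ATA/SCSI errors out of
the kernel log:

# dmesg | grep -iE 'ata[0-9]|error|exception'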

> (c) are these errors expected in a RAID array that is heavily used?

No.

> (d) what kind of errors should I see regarding "read errors" that
> *would* indicate an imminent hardware failure?

You should look into the SMART information on the drives using smartctl.
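
For example (substitute your own device names):

# smartctl -a /dev/sda

and look in particular at attributes such as Reallocated_Sector_Ct and
Current_Pending_Sector.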

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

* Re: read errors corrected
  2010-12-30  3:20 read errors corrected James
  2010-12-30  5:24 ` Mikael Abrahamsson
@ 2010-12-30  9:15 ` Neil Brown
       [not found]   ` <AANLkTik2+Gk1XveqD=crGMH5JshzJqQb_i77ZpOFUncB@mail.gmail.com>
  2010-12-30 10:13 ` Giovanni Tessore
  2 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2010-12-30  9:15 UTC (permalink / raw)
  To: James; +Cc: linux-raid

On Thu, 30 Dec 2010 03:20:48 +0000 James <jtp@nc.rr.com> wrote:

> All,
> 
> I'm looking for a bit of guidance here. I have a RAID 6 set up on my
> system and am seeing some errors in my logs as follows:
> 
> # cat messages | grep "read erro"
> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 974262528 on sda4)
> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 974262536 on sda4)
.....

> 
> I've Google'd the heck out of this error message but am not seeing a
> clear and concise message: is this benign? What would cause these
> errors? Should I be concerned?
> 
> There is an error message (read error corrected) on each of the drives
> in the array. They all seem to be functioning properly. The I/O on the
> drives is pretty heavy for some parts of the day.
> 
> # cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5]
> [raid4] [multipath]
> md1 : active raid6 sdb1[1] sda1[0] sdd1[3] sdc1[2]
>       497792 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
> 
> md2 : active raid6 sdb2[1] sda2[0] sdd2[3] sdc2[2]
>       4000000 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
> 
> md3 : active raid6 sdb3[1] sda3[0] sdd3[3] sdc3[2]
>       25992960 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
> 
> md4 : active raid6 sdb4[1] sda4[0] sdd4[3] sdc4[2]
>       2899780480 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
> 
> unused devices: <none>
> 
> I have a really hard time believing there's something wrong with all
> of the drives in the array, although admittedly they're the same model
> from the same manufacturer.
> 
> Can someone point me in the right direction?
> (a) what causes these errors precisely?

When md/raid6 tries to read from a device and gets a read error, it tries to
read from the other devices.  When that succeeds it computes the data that
it had tried to read and then writes it back to the original drive.  If this
succeeds it assumes that the read error has been corrected by the write, and
prints the message that you see.


> (b) is the error benign? How can I determine if it is *likely* a
> hardware problem? (I imagine it's probably impossible to tell if it's
> HW until it's too late)

A few occasional messages like this are fairly benign.  They could be a sign
that the drive surface is degrading.  If you see lots of these messages, then
you should seriously consider replacing the drive.
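
As a rough way of seeing how many of these each drive has accumulated so
far, something along these lines against your log should do (adjust the
log path and device names to match your system):

# grep 'read error corrected' /var/log/messages | grep -o 'on sd[a-d]4' | sort | uniq -c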

As you are seeing these messages across all devices, it is possible that the
problem is with the sata controller rather than the disks.  To know which, you
should check the errors that are reported in dmesg.  If you don't understand
those messages, then post them to the list - feel free to post several hundred
lines of logs - too much is much better than not enough.

NeilBrown



> (c) are these errors expected in a RAID array that is heavily used?
> (d) what kind of errors should I see regarding "read errors" that
> *would* indicate an imminent hardware failure?
> 
> Thoughts and ideas would be welcomed. I'm sure a thread where some
> hefty discussion is thrown at this topic will help future Googlers
> like me. :)
> 
> -james


* Re: read errors corrected
  2010-12-30  3:20 read errors corrected James
  2010-12-30  5:24 ` Mikael Abrahamsson
  2010-12-30  9:15 ` Neil Brown
@ 2010-12-30 10:13 ` Giovanni Tessore
  2010-12-30 16:41   ` James
  2 siblings, 1 reply; 18+ messages in thread
From: Giovanni Tessore @ 2010-12-30 10:13 UTC (permalink / raw)
  To: linux-raid

On 12/30/2010 04:20 AM, James wrote:
> Can someone point me in the right direction?
> (a) what causes these errors precisely?
> (b) is the error benign? How can I determine if it is *likely* a
> hardware problem? (I imagine it's probably impossible to tell if it's
> HW until it's too late)
> (c) are these errors expected in a RAID array that is heavily used?
> (d) what kind of errors should I see regarding "read errors" that
> *would* indicate an imminent hardware failure?

(a) These errors usually come from defective disk sectors. RAID
reconstructs the missing sectors from parity on the other disks in the
array, then rewrites them on the defective disk; if the sectors are
rewritten without error (possibly because the drive remaps them into its
reserved area), then just the log message is displayed.

(b) With RAID-6 it's almost benign; to get into trouble you would need a
read error on the same sector on more than 2 disks; or have 2 disks
failed and out of the array and get a read error on one of the remaining
disks while reconstructing the array; or have 1 disk failed and get a
read error on the same sector on more than 1 disk while reconstructing.
(With RAID-5 it's rather dangerous instead, as you can get into big
trouble if a disk fails and you get a read error on another disk while
reconstructing; that happened to me!)

(c) No; it's also a good rule to perform a periodic scrub of the array
(a "check" of the array), to reveal and correct defective sectors.
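
For example, a manual check can usually be started by hand with something
like this (using md4 from the original post):

# echo check > /sys/block/md4/md/sync_action

and its progress watched in /proc/mdstat; writing "repair" instead of
"check" also rewrites any mismatches that are found.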

(d) Check the SMART status of the disks, in particular the reallocated
sector count; also, if the md superblock version is >= 1 there is a
persistent count of corrected read errors for each device in
/sys/block/mdXX/md/dev-XX/errors; when this counter reaches 256 the disk
is marked failed. IMHO, when a disk is giving even a few corrected read
errors in a short interval it's better to replace it.
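
For example, with the md4 array from the original post, something like
this should dump the per-device counters (a sketch; the dev-* entries are
named after the component partitions):

# for d in /sys/block/md4/md/dev-* ; do echo -n "$d: " ; cat $d/errors ; done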

-- 
Yours faithfully.

Giovanni Tessore



* Re: read errors corrected
  2010-12-30  5:24 ` Mikael Abrahamsson
@ 2010-12-30 16:33   ` James
  2010-12-30 16:44     ` Roman Mamedov
  0 siblings, 1 reply; 18+ messages in thread
From: James @ 2010-12-30 16:33 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: linux-raid

Lots of tremendous responses. I appreciate it. I'm going to reply to
the first person who responded here, but this email should cover some
of the questions posed in further responses.

On Thu, Dec 30, 2010 at 00:24, Mikael Abrahamsson <swmike@swm.pp.se> wrote:
> On Thu, 30 Dec 2010, James wrote:
>
>> Can someone point me in the right direction?
>> (a) what causes these errors precisely?
>
> dmesg should give you information if this is SATA errors.

Here are some other logs that may be relevant:

Dec 15 15:40:34 nuova kernel: sd 0:0:0:0: [sda] Unhandled error code
Dec 15 15:40:34 nuova kernel: sd 0:0:0:0: [sda] Result: hostbyte=0x00
driverbyte=0x06
Dec 15 15:40:34 nuova kernel: sd 0:0:0:0: [sda] CDB: cdb[0]=0x28: 28
00 3b e3 53 ea 00 00 48 00
Dec 15 15:40:34 nuova kernel: end_request: I/O error, dev sda, sector 1004753898
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262528 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262536 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262544 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262552 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262560 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262568 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262576 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262584 on sda4)
Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
sectors at 974262592 on sda4)

Unfortunately I had not caught those error messages at first
glance...I/O error? Hrmm...doesn't sound good. The issue is repeated
later on.

Dec 29 03:04:01 nuova kernel: sd 1:0:1:0: [sdd] Unhandled error code
Dec 29 03:04:01 nuova kernel: sd 0:0:1:0: [sdb] Unhandled error code
Dec 29 03:04:01 nuova kernel: sd 0:0:1:0: [sdb] Result: hostbyte=0x00
driverbyte=0x06
Dec 29 03:04:01 nuova kernel: sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28
00 1b 06 d2 ea 00 00 78 00
Dec 29 03:04:01 nuova kernel: end_request: I/O error, dev sdb, sector 453432042
Dec 29 03:04:01 nuova kernel: sd 1:0:1:0: [sdd] Result: hostbyte=0x00
driverbyte=0x06
Dec 29 03:04:01 nuova kernel: sd 1:0:1:0: [sdd] CDB: cdb[0]=0x28: 28
00 1b 06 d2 62 00 00 88 00
Dec 29 03:04:01 nuova kernel: end_request: I/O error, dev sdd, sector 453431906
Dec 29 03:04:01 nuova kernel: raid5_end_read_request: 13 callbacks suppressed
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940552 on sdd4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940672 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940680 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940688 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940696 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940704 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940712 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940720 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940728 on sdb4)
Dec 29 03:04:01 nuova kernel: md/raid:md4: read error corrected (8
sectors at 422940736 on sdb4)

Ouch.

>> (c) are these errors expected in a RAID array that is heavily used?
>
> No.
>
>> (d) what kind of errors should I see regarding "read errors" that
>> *would* indicate an imminent hardware failure?
>
> You should look into the SMART information on the drives using smartctl.

All of the drives indicate that the SMART status is
"passed"...unfortunately this isn't very verbose. :)

Is there something specific I should be looking at in my SMART status?

I also see hundreds and hundreds of lines in my /var/log/messages that
indicate the following:


Dec 20 06:12:40 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 47 to 46
Dec 20 07:12:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
Dec 20 07:12:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 33 to 32
Dec 20 07:12:40 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
Dec 20 07:12:40 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 33 to 32
Dec 20 07:42:40 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 46 to 45
Dec 20 08:12:39 nuova smartd[22451]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 42 to 41
Dec 20 08:42:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 46 to 45
Dec 20 08:42:39 nuova smartd[22451]: Device: /dev/sdc [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 66 to 67
Dec 20 08:42:39 nuova smartd[22451]: Device: /dev/sdc [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 34 to 33
Dec 20 09:42:39 nuova smartd[22451]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
Dec 20 09:42:39 nuova smartd[22451]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 33 to 32
Dec 20 10:12:39 nuova smartd[22451]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 67
Dec 20 10:12:39 nuova smartd[22451]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 32 to 33
Dec 20 10:12:39 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 45 to 44
Dec 20 11:12:40 nuova smartd[22451]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 41 to 40
Dec 20 13:42:39 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 44 to 43
Dec 20 14:42:40 nuova smartd[22451]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 40 to 39
Dec 20 15:42:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 67
Dec 20 15:42:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 32 to 33
Dec 20 15:42:40 nuova smartd[22451]: Device: /dev/sdc [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 66
Dec 20 15:42:40 nuova smartd[22451]: Device: /dev/sdc [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 33 to 34
Dec 20 15:42:40 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 67
Dec 20 15:42:40 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 32 to 33
Dec 20 16:12:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
Dec 20 16:12:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 33 to 32
Dec 20 16:12:40 nuova smartd[22451]: Device: /dev/sdc [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 66 to 67
Dec 20 16:12:40 nuova smartd[22451]: Device: /dev/sdc [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 34 to 33
Dec 20 16:12:40 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
Dec 20 16:12:40 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 33 to 32
Dec 20 16:42:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 67
Dec 20 16:42:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 32 to 33
Dec 20 16:42:39 nuova smartd[22451]: Device: /dev/sdc [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 66
Dec 20 16:42:39 nuova smartd[22451]: Device: /dev/sdc [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 33 to 34
Dec 20 16:42:40 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 67
Dec 20 16:42:40 nuova smartd[22451]: Device: /dev/sdd [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 32 to 33
Dec 20 17:12:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68

Is it normal for SMART to update the attributes as the drives are
being used? (I've never had SMART installed before, so this is all
very new to me).

-james

> --
> Mikael Abrahamsson    email: swmike@swm.pp.se
>

* Re: read errors corrected
       [not found]   ` <AANLkTik2+Gk1XveqD=crGMH5JshzJqQb_i77ZpOFUncB@mail.gmail.com>
@ 2010-12-30 16:35     ` James
  2010-12-30 23:12       ` Neil Brown
  0 siblings, 1 reply; 18+ messages in thread
From: James @ 2010-12-30 16:35 UTC (permalink / raw)
  To: Neil Brown, linux-raid

Sorry Neil, I meant to reply-all.

-james

On Thu, Dec 30, 2010 at 11:35, James <jtp@nc.rr.com> wrote:
> Inline.
>
> On Thu, Dec 30, 2010 at 04:15, Neil Brown <neilb@suse.de> wrote:
>> On Thu, 30 Dec 2010 03:20:48 +0000 James <jtp@nc.rr.com> wrote:
>>
>>> All,
>>>
>>> I'm looking for a bit of guidance here. I have a RAID 6 set up on my
>>> system and am seeing some errors in my logs as follows:
>>>
>>> # cat messages | grep "read erro"
>>> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
>>> sectors at 974262528 on sda4)
>>> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
>>> sectors at 974262536 on sda4)
>> .....
>>
>>>
>>> I've Google'd the heck out of this error message but am not seeing a
>>> clear and concise message: is this benign? What would cause these
>>> errors? Should I be concerned?
>>>
>>> There is an error message (read error corrected) on each of the drives
>>> in the array. They all seem to be functioning properly. The I/O on the
>>> drives is pretty heavy for some parts of the day.
>>>
>>> # cat /proc/mdstat
>>> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5]
>>> [raid4] [multipath]
>>> md1 : active raid6 sdb1[1] sda1[0] sdd1[3] sdc1[2]
>>>       497792 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
>>>
>>> md2 : active raid6 sdb2[1] sda2[0] sdd2[3] sdc2[2]
>>>       4000000 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
>>>
>>> md3 : active raid6 sdb3[1] sda3[0] sdd3[3] sdc3[2]
>>>       25992960 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
>>>
>>> md4 : active raid6 sdb4[1] sda4[0] sdd4[3] sdc4[2]
>>>       2899780480 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
>>>
>>> unused devices: <none>
>>>
>>> I have a really hard time believing there's something wrong with all
>>> of the drives in the array, although admittedly they're the same model
>>> from the same manufacturer.
>>>
>>> Can someone point me in the right direction?
>>> (a) what causes these errors precisely?
>>
>> When md/raid6 tries to read from a device and gets a read error, it tries to
>> read from the other devices.  When that succeeds it computes the data that
>> it had tried to read and then writes it back to the original drive.  If this
>> succeeds it assumes that the read error has been corrected by the write, and
>> prints the message that you see.
>>
>>
>>> (b) is the error benign? How can I determine if it is *likely* a
>>> hardware problem? (I imagine it's probably impossible to tell if it's
>>> HW until it's too late)
>>
>> A few occasional messages like this are fairly benign.  They could be a sign
>> that the drive surface is degrading.  If you see lots of these messages, then
>> you should seriously consider replacing the drive.
>
> Wow, this is hard for me to believe considering this is happening on
> all the drives. It's not impossible, however, since the drives are
> likely from the same batch.
>
>> As you are seeing these messages across all devices, it is possible that the
>> problem is with the sata controller rather than the disks.  To know which, you
>> should check the errors that are reported in dmesg.  If you don't understand
>> those messages, then post them to the list - feel free to post several hundred
>> lines of logs - too much is much better than not enough.
>
> I posted a few errors in my response to the thread a bit ago -- here's
> another snippet:
>
> Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] Unhandled error code
> Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] Result: hostbyte=0x00
> driverbyte=0x06
> Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] CDB: cdb[0]=0x28: 28
> 00 25 a2 a0 6a 00 00 80 00
> Dec 29 01:55:03 nuova kernel: end_request: I/O error, dev sdc, sector 631414890
> Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] Unhandled error code
> Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] Result: hostbyte=0x00
> driverbyte=0x06
> Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28
> 00 25 a2 a0 ea 00 00 38 00
> Dec 29 01:55:03 nuova kernel: end_request: I/O error, dev sdb, sector 631415018
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923648 on sdb4)
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923656 on sdb4)
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923664 on sdb4)
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923672 on sdb4)
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923680 on sdb4)
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923688 on sdb4)
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923696 on sdb4)
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923520 on sdc4)
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923528 on sdc4)
> Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> sectors at 600923536 on sdc4)
>
> Is there a good way to determine if the issue is with the motherboard
> (where the SATA controller is), or with the drives themselves?
>
>> NeilBrown
>>
>>
>>
>>> (c) are these errors expected in a RAID array that is heavily used?
>>> (d) what kind of errors should I see regarding "read errors" that
>>> *would* indicate an imminent hardware failure?
>>>
>>> Thoughts and ideas would be welcomed. I'm sure a thread where some
>>> hefty discussion is thrown at this topic will help future Googlers
>>> like me. :)
>>>
>>> -james
>>
>>
>

* Re: read errors corrected
  2010-12-30 10:13 ` Giovanni Tessore
@ 2010-12-30 16:41   ` James
  2011-01-15 12:00     ` Giovanni Tessore
  0 siblings, 1 reply; 18+ messages in thread
From: James @ 2010-12-30 16:41 UTC (permalink / raw)
  To: Giovanni Tessore; +Cc: linux-raid

Inline.

On Thu, Dec 30, 2010 at 05:13, Giovanni Tessore <giotex@texsoft.it> wrote:
> On 12/30/2010 04:20 AM, James wrote:
>>
>> Can someone point me in the right direction?
>> (a) what causes these errors precisely?
>> (b) is the error benign? How can I determine if it is *likely* a
>> hardware problem? (I imagine it's probably impossible to tell if it's
>> HW until it's too late)
>> (c) are these errors expected in a RAID array that is heavily used?
>> (d) what kind of errors should I see regarding "read errors" that
>> *would* indicate an imminent hardware failure?
>
> (a) These errors usually come from defective disk sectors. RAID reconstructs
> the missing sectors from parity on the other disks in the array, then rewrites
> them on the defective disk; if the sectors are rewritten without error
> (possibly because the drive remaps them into its reserved area), then just the
> log message is displayed.
>
> (b) With RAID-6 it's almost benign; to get into trouble you would need a read
> error on the same sector on more than 2 disks; or have 2 disks failed and out
> of the array and get a read error on one of the remaining disks while
> reconstructing the array; or have 1 disk failed and get a read error on the
> same sector on more than 1 disk while reconstructing. (With RAID-5 it's rather
> dangerous instead, as you can get into big trouble if a disk fails and you get
> a read error on another disk while reconstructing; that happened to me!)
>
> (c) No; it's also a good rule to perform a periodic scrub of the array
> (a "check" of the array), to reveal and correct defective sectors.
>
> (d) Check the SMART status of the disks, in particular the reallocated sector
> count; also, if the md superblock version is >= 1 there is a persistent count
> of corrected read errors for each device in /sys/block/mdXX/md/dev-XX/errors;
> when this counter reaches 256 the disk is marked failed. IMHO, when a disk is
> giving even a few corrected read errors in a short interval it's better to
> replace it.

Good call.

Here's the output of the reallocated sector count:

~ # for i in a b c d ; do smartctl -a /dev/sd$i | grep Realloc ; done
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail
Always       -       1
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail
Always       -       3
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail
Always       -       5
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail
Always       -       1

Are these values high? Low? Acceptable?

How about values like "Raw_Read_Error_Rate" and "Seek_Error_Rate" -- I
believe I've read those are values that are normally very high...is
this true?

~ # for i in a b c d ; do smartctl -a /dev/sd$i | grep
Raw_Read_Error_Rate ; done
  1 Raw_Read_Error_Rate     0x000f   116   099   006    Pre-fail
Always       -       106523474
  1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail
Always       -       77952706
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail
Always       -       137525325
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail
Always       -       179042738

...and...

 ~ # for i in a b c d ; do smartctl -a /dev/sd$i | grep
Seek_Error_Rate ; done
  7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail
Always       -       14923821
  7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail
Always       -       15648709
  7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail
Always       -       15733727
  7 Seek_Error_Rate         0x000f   071   060   030    Pre-fail
Always       -       14279452

Thoughts appreciated.

> --
> Yours faithfully.
>
> Giovanni Tessore
>
>

* Re: read errors corrected
  2010-12-30 16:33   ` James
@ 2010-12-30 16:44     ` Roman Mamedov
  2010-12-30 16:51       ` James
  0 siblings, 1 reply; 18+ messages in thread
From: Roman Mamedov @ 2010-12-30 16:44 UTC (permalink / raw)
  To: James; +Cc: Mikael Abrahamsson, linux-raid

On Thu, 30 Dec 2010 11:33:31 -0500
James <jtp@nc.rr.com> wrote:

> Dec 20 17:12:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
> Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
> 
> Is it normal for SMART to update the attributes as the drives are
> being used? (I've never had SMART installed before, so this is all
> very new to me).

If your drives run at 68 degrees Celsius, you should emergency-cut the power
ASAP and perhaps reach for the nearest fire extinguisher.

-- 
With respect,
Roman

* Re: read errors corrected
  2010-12-30 16:44     ` Roman Mamedov
@ 2010-12-30 16:51       ` James
  2010-12-30 17:59         ` Ryan Wagoner
  0 siblings, 1 reply; 18+ messages in thread
From: James @ 2010-12-30 16:51 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Mikael Abrahamsson, linux-raid

On Thu, Dec 30, 2010 at 11:44, Roman Mamedov <rm@romanrm.ru> wrote:
> On Thu, 30 Dec 2010 11:33:31 -0500
> James <jtp@nc.rr.com> wrote:
>
>> Dec 20 17:12:39 nuova smartd[22451]: Device: /dev/sda [SAT], SMART
>> Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
>>
>> Is it normal for SMART to update the attributes as the drives are
>> being used? (I've never had SMART installed before, so this is all
>> very new to me).
>
> If your drives run at 68 degrees Celsius, you should emergency-cut the power
> ASAP and perhaps reach for the nearest fire extinguisher.

Agreed. ;) That's why I posted those messages -- I'm unsure why it
would change those values.

Here's what smartctl shows for all of the drives:

~ # for i in a b c d ; do smartctl -a /dev/sd$i | grep Temperature ;
done
190 Airflow_Temperature_Cel 0x0022   069   059   045    Old_age
Always       -       31 (Lifetime Min/Max 23/37)
194 Temperature_Celsius     0x0022   031   041   000    Old_age
Always       -       31 (0 23 0 0)
190 Airflow_Temperature_Cel 0x0022   068   058   045    Old_age
Always       -       32 (Lifetime Min/Max 22/38)
194 Temperature_Celsius     0x0022   032   042   000    Old_age
Always       -       32 (0 22 0 0)
190 Airflow_Temperature_Cel 0x0022   068   057   045    Old_age
Always       -       32 (Lifetime Min/Max 22/38)
194 Temperature_Celsius     0x0022   032   043   000    Old_age
Always       -       32 (0 22 0 0)
190 Airflow_Temperature_Cel 0x0022   069   059   045    Old_age
Always       -       31 (Lifetime Min/Max 23/37)
194 Temperature_Celsius     0x0022   031   041   000    Old_age
Always       -       31 (0 23 0 0)

Those values seem appropriate, particularly since the "max" is 37 (as
defined by the drive manufacturer?).

> --
> With respect,
> Roman

* Re: read errors corrected
  2010-12-30 16:51       ` James
@ 2010-12-30 17:59         ` Ryan Wagoner
  2010-12-30 18:03           ` James
  0 siblings, 1 reply; 18+ messages in thread
From: Ryan Wagoner @ 2010-12-30 17:59 UTC (permalink / raw)
  To: linux-raid

2010/12/30 James <jtp@nc.rr.com>:
>
> Here's what smartctl shows for all of the drives:
>
> ~ # for i in a b c d ; do smartctl -a /dev/sd$i | grep Temperature ;
> done
> 190 Airflow_Temperature_Cel 0x0022   069   059   045    Old_age
> Always       -       31 (Lifetime Min/Max 23/37)
> 194 Temperature_Celsius     0x0022   031   041   000    Old_age
> Always       -       31 (0 23 0 0)
> 190 Airflow_Temperature_Cel 0x0022   068   058   045    Old_age
> Always       -       32 (Lifetime Min/Max 22/38)
> 194 Temperature_Celsius     0x0022   032   042   000    Old_age
> Always       -       32 (0 22 0 0)
> 190 Airflow_Temperature_Cel 0x0022   068   057   045    Old_age
> Always       -       32 (Lifetime Min/Max 22/38)
> 194 Temperature_Celsius     0x0022   032   043   000    Old_age
> Always       -       32 (0 22 0 0)
> 190 Airflow_Temperature_Cel 0x0022   069   059   045    Old_age
> Always       -       31 (Lifetime Min/Max 23/37)
> 194 Temperature_Celsius     0x0022   031   041   000    Old_age
> Always       -       31 (0 23 0 0)
>
> Those values seem appropriate, particularly since the "max" is 37 (as
> defined by the drive manufacturer?).
>

Not sure why the log is showing the weird C temp. The output from
smartctl looks correct. The max is not defined by the manufacturer,
but is the maximum temp the drive has reached.

Ryan

* Re: read errors corrected
  2010-12-30 17:59         ` Ryan Wagoner
@ 2010-12-30 18:03           ` James
  0 siblings, 0 replies; 18+ messages in thread
From: James @ 2010-12-30 18:03 UTC (permalink / raw)
  To: Ryan Wagoner; +Cc: linux-raid

Fair enough. :) Thanks for the response.

So the big question (to all) becomes this: is this a hard drive issue,
or a motherboard / SATA controller issue? Either one would suck, but
hard drives are obviously easier to swap than a motherboard.

Thoughts on how to go about diagnosing the issue further to determine
what is going on would be greatly appreciated. Aside from replacing
all the drives and hoping for the best, I don't see an easy way to
really figure out what is causing the I/O errors that are resulting in
bad sectors.

-james

On Thu, Dec 30, 2010 at 12:59, Ryan Wagoner <rswagoner@gmail.com> wrote:
> 2010/12/30 James <jtp@nc.rr.com>:
>>
>> Here's what smartctl shows for all of the drives:
>>
>> ~ # for i in a b c d ; do smartctl -a /dev/sd$i | grep Temperature ;
>> done
>> 190 Airflow_Temperature_Cel 0x0022   069   059   045    Old_age
>> Always       -       31 (Lifetime Min/Max 23/37)
>> 194 Temperature_Celsius     0x0022   031   041   000    Old_age
>> Always       -       31 (0 23 0 0)
>> 190 Airflow_Temperature_Cel 0x0022   068   058   045    Old_age
>> Always       -       32 (Lifetime Min/Max 22/38)
>> 194 Temperature_Celsius     0x0022   032   042   000    Old_age
>> Always       -       32 (0 22 0 0)
>> 190 Airflow_Temperature_Cel 0x0022   068   057   045    Old_age
>> Always       -       32 (Lifetime Min/Max 22/38)
>> 194 Temperature_Celsius     0x0022   032   043   000    Old_age
>> Always       -       32 (0 22 0 0)
>> 190 Airflow_Temperature_Cel 0x0022   069   059   045    Old_age
>> Always       -       31 (Lifetime Min/Max 23/37)
>> 194 Temperature_Celsius     0x0022   031   041   000    Old_age
>> Always       -       31 (0 23 0 0)
>>
>> Those values seem appropriate, particularly since the "max" is 37 (as
>> defined by the drive manufacturer?).
>>
>
> Not sure why the log is showing the weird C temp. The output from
> smartctl looks correct. The max is not defined by the manufacturer,
> but is the maximum temp the drive has reached.
>
> Ryan

* Re: read errors corrected
  2010-12-30 16:35     ` James
@ 2010-12-30 23:12       ` Neil Brown
  2010-12-31  1:48         ` James
  0 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2010-12-30 23:12 UTC (permalink / raw)
  To: James; +Cc: linux-raid

On Thu, 30 Dec 2010 11:35:59 -0500 James <jtp@nc.rr.com> wrote:

> Sorry Neil, I meant to reply-all.
> 
> -james
> 
> On Thu, Dec 30, 2010 at 11:35, James <jtp@nc.rr.com> wrote:
> > Inline.
> >
> > On Thu, Dec 30, 2010 at 04:15, Neil Brown <neilb@suse.de> wrote:
> >> On Thu, 30 Dec 2010 03:20:48 +0000 James <jtp@nc.rr.com> wrote:
> >>
> >>> All,
> >>>
> >>> I'm looking for a bit of guidance here. I have a RAID 6 set up on my
> >>> system and am seeing some errors in my logs as follows:
> >>>
> >>> # cat messages | grep "read erro"
> >>> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
> >>> sectors at 974262528 on sda4)
> >>> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
> >>> sectors at 974262536 on sda4)
> >> .....
> >>
> >>>
> >>> I've Google'd the heck out of this error message but am not seeing a
> >>> clear and concise message: is this benign? What would cause these
> >>> errors? Should I be concerned?
> >>>
> >>> There is an error message (read error corrected) on each of the drives
> >>> in the array. They all seem to be functioning properly. The I/O on the
> >>> drives is pretty heavy for some parts of the day.
> >>>
> >>> # cat /proc/mdstat
> >>> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5]
> >>> [raid4] [multipath]
> >>> md1 : active raid6 sdb1[1] sda1[0] sdd1[3] sdc1[2]
> >>>       497792 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
> >>>
> >>> md2 : active raid6 sdb2[1] sda2[0] sdd2[3] sdc2[2]
> >>>       4000000 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
> >>>
> >>> md3 : active raid6 sdb3[1] sda3[0] sdd3[3] sdc3[2]
> >>>       25992960 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
> >>>
> >>> md4 : active raid6 sdb4[1] sda4[0] sdd4[3] sdc4[2]
> >>>       2899780480 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
> >>>
> >>> unused devices: <none>
> >>>
> >>> I have a really hard time believing there's something wrong with all
> >>> of the drives in the array, although admittedly they're the same model
> >>> from the same manufacturer.
> >>>
> >>> Can someone point me in the right direction?
> >>> (a) what causes these errors precisely?
> >>
> >> When md/raid6 tries to read from a device and gets a read error, it tries to
> >> read from the other devices.  When that succeeds it computes the data that
> >> it had tried to read and then writes it back to the original drive.  If this
> >> succeeds it assumes that the read error has been corrected by the write, and
> >> prints the message that you see.
> >>
> >>
> >>> (b) is the error benign? How can I determine if it is *likely* a
> >>> hardware problem? (I imagine it's probably impossible to tell if it's
> >>> HW until it's too late)
> >>
> >> A few occasional messages like this are fairly benign.  They could be a sign
> >> that the drive surface is degrading.  If you see lots of these messages, then
> >> you should seriously consider replacing the drive.
> >
> > Wow, this is hard for me to believe considering this is happening on
> > all the drives. It's not impossible, however, since the drives are
> > likely from the same batch.
> >
> >> As you are seeing these messages across all devices, it is possible that the
> >> problem is with the sata controller rather than the disks.  To know which, you
> >> should check the errors that are reported in dmesg.  If you don't understand
> >> those messages, then post them to the list - feel free to post several hundred
> >> lines of logs - too much is much better than not enough.
> >
> > I posted a few errors in my response to the thread a bit ago -- here's
> > another snippet:
> >
> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] Unhandled error code
> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] Result: hostbyte=0x00
> > driverbyte=0x06
> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] CDB: cdb[0]=0x28: 28
> > 00 25 a2 a0 6a 00 00 80 00
> > Dec 29 01:55:03 nuova kernel: end_request: I/O error, dev sdc, sector 631414890
> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] Unhandled error code
> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] Result: hostbyte=0x00
> > driverbyte=0x06

"Unhandled error code" sounds like it could be a driver problem...

Try googling that error message...

http://us.generation-nt.com/answer/2-6-33-libata-issues-via-sata-pata-controller-help-197123882.html


"Also, please try the latest 2.6.34-rc kernel, as that has several fixes
for both pata_via and sata_via which did not make 2.6.33."

What kernel are you running???

NeilBrown




> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28
> > 00 25 a2 a0 ea 00 00 38 00
> > Dec 29 01:55:03 nuova kernel: end_request: I/O error, dev sdb, sector 631415018
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923648 on sdb4)
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923656 on sdb4)
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923664 on sdb4)
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923672 on sdb4)
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923680 on sdb4)
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923688 on sdb4)
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923696 on sdb4)
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923520 on sdc4)
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923528 on sdc4)
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923536 on sdc4)
> >
> > Is there a good way to determine if the issue is with the motherboard
> > (where the SATA controller is), or with the drives themselves?
> >
> >> NeilBrown
> >>
> >>
> >>
> >>> (c) are these errors expected in a RAID array that is heavily used?
> >>> (d) what kind of errors should I see regarding "read errors" that
> >>> *would* indicate an imminent hardware failure?
> >>>
> >>> Thoughts and ideas would be welcomed. I'm sure a thread where some
> >>> hefty discussion is thrown at this topic will help future Googlers
> >>> like me. :)
> >>>
> >>> -james
> >>
> >>
> >


* Re: read errors corrected
  2010-12-30 23:12       ` Neil Brown
@ 2010-12-31  1:48         ` James
  2010-12-31  1:56           ` Guy Watkins
  2010-12-31  2:08           ` Neil Brown
  0 siblings, 2 replies; 18+ messages in thread
From: James @ 2010-12-31  1:48 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Neil,

I'm running 2.6.35.

Although an expensive route, the only thing I can think to do to
determine 100% whether the issue is software or hardware (and, if
hardware, whether SATA controller or the drives) is to swap the drives
out.

Ouch!

Any other ideas, however, would be appreciated before I drop a few
hundred bucks. :)

-james

On Thu, Dec 30, 2010 at 23:12, Neil Brown <neilb@suse.de> wrote:
> On Thu, 30 Dec 2010 11:35:59 -0500 James <jtp@nc.rr.com> wrote:
>
>> Sorry Neil, I meant to reply-all.
>>
>> -james
>>
>> On Thu, Dec 30, 2010 at 11:35, James <jtp@nc.rr.com> wrote:
>> > Inline.
>> >
>> > On Thu, Dec 30, 2010 at 04:15, Neil Brown <neilb@suse.de> wrote:
>> >> On Thu, 30 Dec 2010 03:20:48 +0000 James <jtp@nc.rr.com> wrote:
>> >>
>> >>> All,
>> >>>
>> >>> I'm looking for a bit of guidance here. I have a RAID 6 set up on my
>> >>> system and am seeing some errors in my logs as follows:
>> >>>
>> >>> # cat messages | grep "read erro"
>> >>> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
>> >>> sectors at 974262528 on sda4)
>> >>> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
>> >>> sectors at 974262536 on sda4)
>> >> .....
>> >>
>> >>>
>> >>> I've Google'd the heck out of this error message but am not seeing a
>> >>> clear and concise message: is this benign? What would cause these
>> >>> errors? Should I be concerned?
>> >>>
>> >>> There is an error message (read error corrected) on each of the drives
>> >>> in the array. They all seem to be functioning properly. The I/O on the
>> >>> drives is pretty heavy for some parts of the day.
>> >>>
>> >>> # cat /proc/mdstat
>> >>> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5]
>> >>> [raid4] [multipath]
>> >>> md1 : active raid6 sdb1[1] sda1[0] sdd1[3] sdc1[2]
>> >>>       497792 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
>> >>>
>> >>> md2 : active raid6 sdb2[1] sda2[0] sdd2[3] sdc2[2]
>> >>>       4000000 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
>> >>>
>> >>> md3 : active raid6 sdb3[1] sda3[0] sdd3[3] sdc3[2]
>> >>>       25992960 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
>> >>>
>> >>> md4 : active raid6 sdb4[1] sda4[0] sdd4[3] sdc4[2]
>> >>>       2899780480 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
>> >>>
>> >>> unused devices: <none>
>> >>>
>> >>> I have a really hard time believing there's something wrong with all
>> >>> of the drives in the array, although admittedly they're the same model
>> >>> from the same manufacturer.
>> >>>
>> >>> Can someone point me in the right direction?
>> >>> (a) what causes these errors precisely?
>> >>
>> >> When md/raid6 tries to read from a device and gets a read error, it tries to
>> >> read from the other devices.  When that succeeds it computes the data that
>> >> it had tried to read and then writes it back to the original drive.  If this
>> >> succeeds it assumes that the read error has been corrected by the write, and
>> >> prints the message that you see.
>> >>
>> >>
>> >>> (b) is the error benign? How can I determine if it is *likely* a
>> >>> hardware problem? (I imagine it's probably impossible to tell if it's
>> >>> HW until it's too late)
>> >>
>> >> A few occasional messages like this are fairly benign.  They could be a sign
>> >> that the drive surface is degrading.  If you see lots of these messages, then
>> >> you should seriously consider replacing the drive.
>> >
>> > Wow, this is hard for me to believe considering this is happening on
>> > all the drives. It's not impossible, however, since the drives are
>> > likely from the same batch.
>> >
>> >> As you are seeing these messages across all devices, it is possible that the
>> >> problem is with the sata controller rather than the disks.  To know which, you
>> >> should check the errors that are reported in dmesg.  If you don't understand
>> >> those messages, then post them to the list - feel free to post several hundred
>> >> lines of logs - too much is much better than not enough.
>> >
>> > I posted a few errors in my response to the thread a bit ago -- here's
>> > another snippet:
>> >
>> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] Unhandled error code
>> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] Result: hostbyte=0x00
>> > driverbyte=0x06
>> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] CDB: cdb[0]=0x28: 28
>> > 00 25 a2 a0 6a 00 00 80 00
>> > Dec 29 01:55:03 nuova kernel: end_request: I/O error, dev sdc, sector 631414890
>> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] Unhandled error code
>> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] Result: hostbyte=0x00
>> > driverbyte=0x06
>
> "Unhandled error code" sounds like it could be a driver problem...
>
> Try googling that error message...
>
> http://us.generation-nt.com/answer/2-6-33-libata-issues-via-sata-pata-controller-help-197123882.html
>
>
> "Also, please try the latest 2.6.34-rc kernel, as that has several fixes
> for both pata_via and sata_via which did not make 2.6.33."
>
> What kernel are  you running???
>
> NeilBrown
>
>
>
>
>> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28
>> > 00 25 a2 a0 ea 00 00 38 00
>> > Dec 29 01:55:03 nuova kernel: end_request: I/O error, dev sdb, sector 631415018
>> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
>> > sectors at 600923648 on sdb4)
>> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
>> > sectors at 600923656 on sdb4)
>> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
>> > sectors at 600923664 on sdb4)
>> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
>> > sectors at 600923672 on sdb4)
>> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
>> > sectors at 600923680 on sdb4)
>> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
>> > sectors at 600923688 on sdb4)
>> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
>> > sectors at 600923696 on sdb4)
>> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
>> > sectors at 600923520 on sdc4)
>> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
>> > sectors at 600923528 on sdc4)
>> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
>> > sectors at 600923536 on sdc4)
>> >
>> > Is there a good way to determine if the issue is with the motherboard
>> > (where the SATA controller is), or with the drives themselves?
>> >
>> >> NeilBrown
>> >>
>> >>
>> >>
>> >>> (c) are these errors expected in a RAID array that is heavily used?
>> >>> (d) what kind of errors should I see regarding "read errors" that
>> >>> *would* indicate an imminent hardware failure?
>> >>>
>> >>> Thoughts and ideas would be welcomed. I'm sure a thread where some
>> >>> hefty discussion is thrown at this topic will help future Googlers
>> >>> like me. :)
>> >>>
>> >>> -james
>> >>
>> >>
>> >
>
>

* RE: read errors corrected
  2010-12-31  1:48         ` James
@ 2010-12-31  1:56           ` Guy Watkins
  2010-12-31  2:08           ` Neil Brown
  1 sibling, 0 replies; 18+ messages in thread
From: Guy Watkins @ 2010-12-31  1:56 UTC (permalink / raw)
  To: 'James', 'Neil Brown'; +Cc: linux-raid

} -----Original Message-----
} From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
} owner@vger.kernel.org] On Behalf Of James
} Sent: Thursday, December 30, 2010 8:48 PM
} To: Neil Brown
} Cc: linux-raid@vger.kernel.org
} Subject: Re: read errors corrected
} 
} Neil,
} 
} I'm runinng 2.6.35.
} 
} Although an expensive route, the only thing I can think to do to
} determine 100% whether the issue is software or hardware (and, if
} hardware, whether SATA controller or the drives) is to swap the drives
} out.
} 
} Ouch!
} 
} Any other ideas, however, would be appreciated before I drop a few
} hundred bucks. :)

Just swap out 1 for now?  :)

I believe your drives are fine because your smart stats don't reflect the
number of errors you see in the logs.
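
One more cheap check, if you haven't already: the drives' own SMART error
logs, e.g.

# smartctl -l error /dev/sda

If those stay empty while the kernel keeps logging I/O errors, that points
more towards the controller or driver than the disks.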

} 
} -james
} 
} On Thu, Dec 30, 2010 at 23:12, Neil Brown <neilb@suse.de> wrote:
} > On Thu, 30 Dec 2010 11:35:59 -0500 James <jtp@nc.rr.com> wrote:
} >
} >> Sorry Neil, I meant to reply-all.
} >>
} >> -james
} >>
} >> On Thu, Dec 30, 2010 at 11:35, James <jtp@nc.rr.com> wrote:
} >> > Inline.
} >> >
} >> > On Thu, Dec 30, 2010 at 04:15, Neil Brown <neilb@suse.de> wrote:
} >> >> On Thu, 30 Dec 2010 03:20:48 +0000 James <jtp@nc.rr.com> wrote:
} >> >>
} >> >>> All,
} >> >>>
} >> >>> I'm looking for a bit of guidance here. I have a RAID 6 set up on
} my
} >> >>> system and am seeing some errors in my logs as follows:
} >> >>>
} >> >>> # cat messages | grep "read erro"
} >> >>> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
} >> >>> sectors at 974262528 on sda4)
} >> >>> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
} >> >>> sectors at 974262536 on sda4)
} >> >> .....
} >> >>
} >> >>>
} >> >>> I've Google'd the heck out of this error message but am not seeing
} a
} >> >>> clear and concise message: is this benign? What would cause these
} >> >>> errors? Should I be concerned?
} >> >>>
} >> >>> There is an error message (read error corrected) on each of the
} drives
} >> >>> in the array. They all seem to be functioning properly. The I/O on
} the
} >> >>> drives is pretty heavy for some parts of the day.
} >> >>>
} >> >>> # cat /proc/mdstat
} >> >>> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5]
} >> >>> [raid4] [multipath]
} >> >>> md1 : active raid6 sdb1[1] sda1[0] sdd1[3] sdc1[2]
} >> >>>       497792 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
} >> >>>
} >> >>> md2 : active raid6 sdb2[1] sda2[0] sdd2[3] sdc2[2]
} >> >>>       4000000 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
} >> >>>
} >> >>> md3 : active raid6 sdb3[1] sda3[0] sdd3[3] sdc3[2]
} >> >>>       25992960 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
} >> >>>
} >> >>> md4 : active raid6 sdb4[1] sda4[0] sdd4[3] sdc4[2]
} >> >>>       2899780480 blocks level 6, 64k chunk, algorithm 2 [4/4]
} [UUUU]
} >> >>>
} >> >>> unused devices: <none>
} >> >>>
} >> >>> I have a really hard time believing there's something wrong with
} all
} >> >>> of the drives in the array, although admittedly they're the same
} model
} >> >>> from the same manufacturer.
} >> >>>
} >> >>> Can someone point me in the right direction?
} >> >>> (a) what causes these errors precisely?
} >> >>
} >> >> When md/raid6 tries to read from a device and gets a read error, it
} >> >> tries to read from the other devices.  When that succeeds, it computes
} >> >> the data that it had tried to read and writes it back to the original
} >> >> drive.  If this succeeds, it assumes that the read error has been
} >> >> corrected by the write, and prints the message that you see.
} >> >>
} >> >>
} >> >>> (b) is the error benign? How can I determine if it is *likely* a
} >> >>> hardware problem? (I imagine it's probably impossible to tell if
} it's
} >> >>> HW until it's too late)
} >> >>
} >> >> A few occasional messages like this are fairly benign.  They could be
} >> >> a sign that the drive surface is degrading.  If you see lots of these
} >> >> messages, then you should seriously consider replacing the drive.
} >> >
} >> > Wow, this is hard for me to believe considering this is happening on
} >> > all the drives. It's not impossible, however, since the drives are
} >> > likely from the same batch.
} >> >
} >> >> As you are seeing these messages across all devices, it is possible
} >> >> that the problem is with the SATA controller rather than the disks.
} >> >> To know which, you should check the errors that are reported in dmesg.
} >> >> If you don't understand these messages, then post them to the list -
} >> >> feel free to post several hundred lines of logs - too much is much
} >> >> much better than not enough.
} >> >
} >> > I posted a few errors in my response to the thread a bit ago --
} >> > here's another snippet:
} >> >
} >> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] Unhandled error code
} >> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] Result: hostbyte=0x00
} >> > driverbyte=0x06
} >> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] CDB: cdb[0]=0x28: 28
} >> > 00 25 a2 a0 6a 00 00 80 00
} >> > Dec 29 01:55:03 nuova kernel: end_request: I/O error, dev sdc, sector
} 631414890
} >> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] Unhandled error code
} >> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] Result: hostbyte=0x00
} >> > driverbyte=0x06
} >
} > "Unhandled error code" sounds like it could be a driver problem...
} >
} > Try googling that error message...
} >
} > http://us.generation-nt.com/answer/2-6-33-libata-issues-via-sata-pata-controller-help-197123882.html
} >
} >
} > "Also, please try the latest 2.6.34-rc kernel, as that has several fixes
} > for both pata_via and sata_via which did not make 2.6.33."
} >
} > What kernel are you running???
} >
} > NeilBrown
} >
} >
} >
} >
} >> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28
} >> > 00 25 a2 a0 ea 00 00 38 00
} >> > Dec 29 01:55:03 nuova kernel: end_request: I/O error, dev sdb, sector
} 631415018
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923648 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923656 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923664 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923672 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923680 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923688 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923696 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923520 on sdc4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923528 on sdc4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923536 on sdc4)
} >> >
} >> > Is there a good way to determine if the issue is with the motherboard
} >> > (where the SATA controller is), or with the drives themselves?
} >> >
} >> >> NeilBrown
} >> >>
} >> >>
} >> >>
} >> >>> (c) are these errors expected in a RAID array that is heavily used?
} >> >>> (d) what kind of errors should I see regarding "read errors" that
} >> >>> *would* indicate an imminent hardware failure?
} >> >>>
} >> >>> Thoughts and ideas would be welcomed. I'm sure a thread where some
} >> >>> hefty discussion is thrown at this topic will help future Googlers
} >> >>> like me. :)
} >> >>>
} >> >>> -james
} >> >>
} >> >>
} >> >
} >
} >

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: read errors corrected
  2010-12-31  1:48         ` James
  2010-12-31  1:56           ` Guy Watkins
@ 2010-12-31  2:08           ` Neil Brown
  1 sibling, 0 replies; 18+ messages in thread
From: Neil Brown @ 2010-12-31  2:08 UTC (permalink / raw)
  To: James; +Cc: linux-raid

On Fri, 31 Dec 2010 01:48:07 +0000 James <jtp@nc.rr.com> wrote:

> Neil,
> 
> I'm running 2.6.35.
> 
> Although an expensive route, the only thing I can think of doing to
> determine 100% whether the issue is software or hardware (and, if
> hardware, whether it's the SATA controller or the drives) is to swap
> the drives out.
> 
> Ouch!
> 
> Any other ideas, however, would be appreciated before I drop a few
> hundred bucks. :)

Buy a PCIe SATA controller, plug it in and move some/all drives over to that?
Should be a lot less than $100.  Make sure it is a different chipset to what
you have on your motherboard.
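
Before spending anything, it may also help to confirm which controller each
disk currently hangs off; a rough sketch (the PCI paths will differ per board):

# lspci | grep -i -E 'sata|ide|raid'
# for i in a b c d ; do echo -n "sd$i: " ; readlink -f /sys/block/sd$i ; done

The readlink output includes the PCI address of the host adapter, so after
moving drives to an add-on card it is easy to verify they really sit on the
new chipset.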

NeilBrown

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: read errors corrected
  2010-12-30 16:41   ` James
@ 2011-01-15 12:00     ` Giovanni Tessore
  2011-01-16  8:33       ` Jaap Crezee
  0 siblings, 1 reply; 18+ messages in thread
From: Giovanni Tessore @ 2011-01-15 12:00 UTC (permalink / raw)
  To: linux-raid

On 12/30/2010 05:41 PM, James wrote:
> Inline.
>
> On Thu, Dec 30, 2010 at 05:13, Giovanni Tessore <giotex@texsoft.it> wrote:
>> On 12/30/2010 04:20 AM, James wrote:
>>> Can someone point me in the right direction?
>>> (a) what causes these errors precisely?
>>> (b) is the error benign? How can I determine if it is *likely* a
>>> hardware problem? (I imagine it's probably impossible to tell if it's
>>> HW until it's too late)
>>> (c) are these errors expected in a RAID array that is heavily used?
>>> (d) what kind of errors should I see regarding "read errors" that
>>> *would* indicate an imminent hardware failure?
>> (a) These errors usually come from defective disk sectors. RAID reconstructs
>> the missing sector from parity on the other disks in the array, then rewrites
>> the sector on the defective disk; if the sector is rewritten without error
>> (maybe the drive remaps the sector into its reserved area), then just the log
>> message is displayed.
>>
>> (b) With RAID-6 it's almost benign; to get into trouble you would need a read
>> error on the same sector on more than 2 disks; or 2 disks failed and out of
>> the array plus a read error on one of the remaining disks while
>> reconstructing the array; or 1 disk failed plus a read error on the same
>> sector on more than 1 disk while reconstructing. (With RAID-5 it's much more
>> dangerous instead, as you can have big trouble if a disk fails and you get a
>> read error on another disk while reconstructing; that happened to me!)
>>
>> (c) No; it's also a good rule to perform a periodic scrub (check) of the
>> array, to reveal and correct defective sectors.
>>
>> (d) Check the SMART status of the disks, in particular the reallocated sector
>> count; also, if the md superblock is >= 1 there is a persistent count of
>> corrected read errors for each device in /sys/block/mdXX/md/dev-XX/errors;
>> when this counter reaches 256 the disk is marked failed. IMHO, when a disk
>> gives even a few corrected read errors in a short interval it's better to
>> replace it.
> Good call.
>
> Here's the output of the reallocated sector count:
>
> ~ # for i in a b c d ; do smartctl -a /dev/sd$i | grep Realloc ; done
>    5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail
> Always       -       1
>    5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail
> Always       -       3
>    5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail
> Always       -       5
>    5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail
> Always       -       1
>
> Are these values high? Low? Acceptable?
>
> How about values like "Raw_Read_Error_Rate" and "Seek_Error_Rate" -- I
> believe I've read those are values that are normally very high...is
> this true?
>
> ~ # for i in a b c d ; do smartctl -a /dev/sd$i | grep
> Raw_Read_Error_Rate ; done
>    1 Raw_Read_Error_Rate     0x000f   116   099   006    Pre-fail
> Always       -       106523474
>    1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail
> Always       -       77952706
>    1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail
> Always       -       137525325
>    1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail
> Always       -       179042738
>
> ...and...
>
>   ~ # for i in a b c d ; do smartctl -a /dev/sd$i | grep
> Seek_Error_Rate ; done
>    7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail
> Always       -       14923821
>    7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail
> Always       -       15648709
>    7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail
> Always       -       15733727
>    7 Seek_Error_Rate         0x000f   071   060   030    Pre-fail
> Always       -       14279452
>
> Thoughts appreciated.
>

As far as I know, Reallocated_Sector_Ct is the most meaningful SMART parameter
related to disk sector health.
Also check Current_Pending_Sector (sectors that gave a read error and have not
been reallocated yet).
The values for your disks seem quite safe at the moment.
Be proactive if the values grow in a short time.
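
As a rough check (a sketch, reusing md4 and the drive letters from earlier in
this thread), the pending-sector counts, md's per-device corrected-error
counters, and a manual scrub look like this:

# for i in a b c d ; do smartctl -A /dev/sd$i | grep Current_Pending_Sector ; done
# grep . /sys/block/md4/md/dev-*/errors
# echo check > /sys/block/md4/md/sync_action

The errors file is the per-device counter mentioned in point (d) above.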

I had the same problem this week: one of my disks gave >800 reallocated read
errors.
The disk was still marked good and alive in the array, but I replaced it
immediately.

Regards.

-- 
Cordiali saluti.
Yours faithfully.

Giovanni Tessore



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: read errors corrected
  2011-01-15 12:00     ` Giovanni Tessore
@ 2011-01-16  8:33       ` Jaap Crezee
  0 siblings, 0 replies; 18+ messages in thread
From: Jaap Crezee @ 2011-01-16  8:33 UTC (permalink / raw)
  To: Giovanni Tessore; +Cc: linux-raid

On 01/15/11 13:00, Giovanni Tessore wrote:
> On 12/30/2010 05:41 PM, James wrote:
> As far as I know, Reallocated_Sector_Ct is the most meaningful SMART parameter
> related to disk sector health.
> Also check Current_Pending_Sector (sectors that gave a read error and have not
> been reallocated yet).
> The values for your disks seem quite safe at the moment.
> Be proactive if the values grow in a short time.

I wouldn't be too happy with more than 0 reallocated and/or pending sectors:
I replace such disks at once. I've never had any problems getting warranty
replacements for drives with more than 0 reallocated sectors. It seems the
manufacturers use the same thresholds....
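
A crude way to act on that policy (a sketch; the awk assumes smartctl's
standard attribute table layout) is a cron job that only prints when either
counter is non-zero:

# for i in a b c d ; do smartctl -A /dev/sd$i | awk -v d=sd$i '$2 ~ /Reallocated_Sector_Ct|Current_Pending_Sector/ && $NF > 0 {print d": "$2" = "$NF}' ; done

Anything it prints is a drive to start planning to replace; smartd with a -m
mail directive in smartd.conf can do the same job automatically.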

regards,

Jaap Crezee

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: read errors corrected
@ 2010-12-30 20:19 Richard Scobie
  0 siblings, 0 replies; 18+ messages in thread
From: Richard Scobie @ 2010-12-30 20:19 UTC (permalink / raw)
  To: Linux RAID Mailing List; +Cc: rswagoner

Ryan wrote:

 > Not sure why the log is showing the weird C temp.

See the SMART attribute 190 definition here:

http://en.wikipedia.org/wiki/S.M.A.R.T.
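
For what it's worth, on many drives attribute 190 is Airflow_Temperature_Cel
and its normalized value is reported as 100 minus the temperature, while 194
(Temperature_Celsius) carries the plain reading, so comparing the two usually
explains the odd-looking number (attribute naming varies by vendor):

# smartctl -A /dev/sda | egrep 'Airflow_Temperature_Cel|Temperature_Celsius'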

Regards,

Richard

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2011-01-16  8:33 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-12-30  3:20 read errors corrected James
2010-12-30  5:24 ` Mikael Abrahamsson
2010-12-30 16:33   ` James
2010-12-30 16:44     ` Roman Mamedov
2010-12-30 16:51       ` James
2010-12-30 17:59         ` Ryan Wagoner
2010-12-30 18:03           ` James
2010-12-30  9:15 ` Neil Brown
     [not found]   ` <AANLkTik2+Gk1XveqD=crGMH5JshzJqQb_i77ZpOFUncB@mail.gmail.com>
2010-12-30 16:35     ` James
2010-12-30 23:12       ` Neil Brown
2010-12-31  1:48         ` James
2010-12-31  1:56           ` Guy Watkins
2010-12-31  2:08           ` Neil Brown
2010-12-30 10:13 ` Giovanni Tessore
2010-12-30 16:41   ` James
2011-01-15 12:00     ` Giovanni Tessore
2011-01-16  8:33       ` Jaap Crezee
2010-12-30 20:19 Richard Scobie
