* RAID6 check found different events, how should I proceed?
From: Mathias Burén @ 2011-08-06 13:23 UTC
To: Linux-RAID
First, thanks for this:
> The primary purpose of data scrubbing a RAID is to detect & correct
> read errors on any of the member devices; both check and repair
> perform this function. Finding (and w/ repair correcting) mismatches
> is only a secondary purpose - it is only if there are no read errors
> but the data copy or parity blocks are found to be inconsistent that a
> mismatch is reported. In order to repair a mismatch, MD needs to
> restore consistency, by over writing the inconsistent data copy or
> parity blocks w/ the correct data. But, because the underlying member
> devices did not return any errors, MD has no way of knowing which
> blocks are correct, and which are incorrect; when it is told to do a
> repair, it makes the assumption that the first copy in a RAID1 or
> RAID10, or the data (non-parity) blocks in RAID4/5/6 are correct, and
> corrects the mismatch based on that assumption.
>
> That assumption may or may not be correct, but MD has no way of
> determining that reliably - but the user might be able to, by using
> additional knowledge or tools, so MD gives the user the option to
> perform data scrubbing either with (repair) or without (check) MD
> correcting the mismatches using that assumption.
>
>
> I hope that answers your question,
> Beolach
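For reference, the two scrub modes described above are requested through the md sysfs interface. What follows is only a minimal Python sketch, not something posted in this thread: the function names and the `sysfs_root` parameter are made up for illustration, and it assumes the array exposes `md/sync_action` and `md/mismatch_cnt` under /sys/block/<array>/, as described in the kernel's md documentation. Writing to sync_action needs root.

```python
from pathlib import Path


def start_scrub(array: str, action: str = "check", sysfs_root: str = "/sys/block") -> None:
    """Ask md to scrub `array` by writing to its sync_action file.

    Hypothetical helper: 'check' only counts mismatches, 'repair' also
    rewrites them, so 'repair' must be requested explicitly.
    """
    if action not in ("check", "repair"):
        raise ValueError("action must be 'check' or 'repair'")
    (Path(sysfs_root) / array / "md" / "sync_action").write_text(action + "\n")


def mismatch_count(array: str, sysfs_root: str = "/sys/block") -> int:
    """Return the mismatch count recorded by the last check/repair pass."""
    text = (Path(sysfs_root) / array / "md" / "mismatch_cnt").read_text()
    return int(text.strip())
```

Defaulting to `check` means mismatches are only reported; `repair` has to be asked for by name, since it rewrites blocks based on the assumption quoted above.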
My RAID6 is currently degraded by one HDD (see my panic mail on the list),
and my weekly cron job kicked in and ran the RAID6 check action. This is
the result:
DEV EVENTS REALL PEND UNCORR CRC RAW ZONE END
sdb1 6239487 0 0 0 2 0 0
sdc1 6239487 0 0 0 0 0 0
sdd1 6239487 0 0 0 0 0 0
sde1 6239487 0 0 0 0 0 0
sdf1 6239490 0 0 0 0 49 6
sdg1 6239491 0 0 0 0 0 0
sdh1 (missing, on RMA trip)
(so the SMART is actually fine for all drives)
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdf1[5] sdg1[0] sdd1[4] sde1[7] sdc1[3] sdb1[1]
9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2
[7/6] [UUUUU_U]
unused devices: <none>
/dev/md0:
Version : 1.2
Creation Time : Tue Oct 19 08:58:41 2010
Raid Level : raid6
Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
Raid Devices : 7
Total Devices : 6
Persistence : Superblock is persistent
Update Time : Sat Aug 6 14:13:08 2011
State : clean, degraded
Active Devices : 6
Working Devices : 6
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
Name : ion:0 (local to host ion)
UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
Events : 6239491
Number   Major   Minor   RaidDevice   State
   0       8      97         0        active sync   /dev/sdg1
   1       8      17         1        active sync   /dev/sdb1
   4       8      49         2        active sync   /dev/sdd1
   3       8      33         3        active sync   /dev/sdc1
   5       8      81         4        active sync   /dev/sdf1
   5       0       0         5        removed
   7       8      65         6        active sync   /dev/sde1
So sdf1 and sdg1 have a different event count. Does this mean the HDDs
have silently corrupted the data? I have no way of checking if the
data itself is corrupt or not, except for perhaps a fsck of the
filesystem? Does that make sense?
* Should I run a repair?
* Should I run a check again, to see if the event count changes?
* Is it likely I have 2 more bad hard drives that will die soon?
* Is it wise to run another smartctl -t long on all devices?
Thanks,
Mathias
* Re: RAID6 check found different events, how should I proceed?
From: Mathias Burén @ 2011-08-06 16:02 UTC
To: Linux-RAID
On 6 August 2011 14:23, Mathias Burén <mathias.buren@gmail.com> wrote:
> My RAID6 is currently degraded with one HDD (panic mail on the list),
> and my weekly cron job kicked in doing the RAID6 check action. This is
> the result:
>
> DEV EVENTS REALL PEND UNCORR CRC RAW ZONE END
> sdb1 6239487 0 0 0 2 0 0
> sdc1 6239487 0 0 0 0 0 0
> sdd1 6239487 0 0 0 0 0 0
> sde1 6239487 0 0 0 0 0 0
> sdf1 6239490 0 0 0 0 49 6
> sdg1 6239491 0 0 0 0 0 0
> sdh1 (missing, on RMA trip)
>
(snip)
> * Should I run a repair?
> * Should I run a check again, to see if the event count changes?
> * Is it likely I have 2 more bad hard drives that will die soon?
> * Is it wise to run another smartctl -t long on all devices?
>
> Thanks,
> Mathias
>
A follow-up:
I ran smartctl -t long on all devices, and they all passed, SMART is
fine. The number of events is also the same for all HDDs now:
DEV EVENTS REALL PEND UNCORR CRC RAW ZONE END
sdb1 6244415 0 0 0 2 0 0
sdc1 6244415 0 0 0 0 0 0
sdd1 6244415 0 0 0 0 0 0
sde1 6244415 0 0 0 0 0 0
sdf1 6244415 0 0 0 0 49 6
sdg1 6244415 0 0 0 0 0 0
sdh1
This is without me running repair or anything like that.
Mathias
* Re: RAID6 check found different events, how should I proceed?
From: Cal Leeming [Simplicity Media Ltd] @ 2011-08-06 17:09 UTC
To: Mathias Burén; +Cc: Linux-RAID
Can't offer any advice on this issue, but would be very interested to
hear the debrief once the situation is resolved.
On Sat, Aug 6, 2011 at 6:08 PM, Cal Leeming [Simplicity Media Ltd]
<cal.leeming@simplicitymedialtd.co.uk> wrote:
>
> Can't offer any advice on this issue, but would be very interested to hear the debrief once the situation is resolved.
> On Sat, Aug 6, 2011 at 5:02 PM, Mathias Burén <mathias.buren@gmail.com> wrote:
>>
>> On 6 August 2011 14:23, Mathias Burén <mathias.buren@gmail.com> wrote:
>> > My RAID6 is currently degraded with one HDD (panic mail on the list),
>> > and my weekly cron job kicked in doing the RAID6 check action. This is
>> > the result:
>> >
>> > DEV EVENTS REALL PEND UNCORR CRC RAW ZONE END
>> > sdb1 6239487 0 0 0 2 0 0
>> > sdc1 6239487 0 0 0 0 0 0
>> > sdd1 6239487 0 0 0 0 0 0
>> > sde1 6239487 0 0 0 0 0 0
>> > sdf1 6239490 0 0 0 0 49 6
>> > sdg1 6239491 0 0 0 0 0 0
>> > sdh1 (missing, on RMA trip)
>> >
>> (snip)
>> > * Should I run a repair?
>> > * Should I run a check again, to see if the event count changes?
>> > * Is it likely I have 2 more bad hard drives that will die soon?
>> > * Is it wise to run another smartctl -t long on all devices?
>> >
>> > Thanks,
>> > Mathias
>> >
>>
>> A followup;
>>
>> I ran smartctl -t long on all devices, and they all passed, SMART is
>> fine. The number of events is also the same for all HDDs now:
>>
>> DEV EVENTS REALL PEND UNCORR CRC RAW ZONE END
>> sdb1 6244415 0 0 0 2 0 0
>> sdc1 6244415 0 0 0 0 0 0
>> sdd1 6244415 0 0 0 0 0 0
>> sde1 6244415 0 0 0 0 0 0
>> sdf1 6244415 0 0 0 0 49 6
>> sdg1 6244415 0 0 0 0 0 0
>> sdh1
>>
>> This is without me running repair or anything like that.
>>
>> Mathias
>
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: RAID6 check found different events, how should I proceed?
From: Alexander Kühn @ 2011-08-06 17:54 UTC
To: Mathias Burén; +Cc: Linux-RAID
I'd do _nothing_ until I got a replacement drive. Then plug that in
and let it regain full redundancy.
After that you can start stressing the disks with the actions you
suggested if you like.
Alex.
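Whether the array has regained full redundancy can be read off the `[7/6] [UUUUU_U]` status that /proc/mdstat prints for md0 (7 members expected, 6 active, `_` marking the missing slot). A rough Python sketch of that check follows; the helper names are hypothetical, and it assumes the status appears on the `md0 :` block's following lines, before the next blank line or the next array.

```python
import re

STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def member_status(mdstat_text: str, array: str):
    """Return (expected, active, flags) for `array`, or None if not found."""
    lines = mdstat_text.splitlines()
    for i, line in enumerate(lines):
        if not line.startswith(array + " :"):
            continue
        for follow in lines[i + 1:]:
            # Stop at the end of this array's block.
            if not follow.strip() or re.match(r"^\S+ : ", follow):
                break
            match = STATUS_RE.search(follow)
            if match:
                return int(match.group(1)), int(match.group(2)), match.group(3)
    return None


def is_degraded(mdstat_text: str, array: str) -> bool:
    """True if fewer members are active than expected, or a slot shows '_'."""
    status = member_status(mdstat_text, array)
    if status is None:
        raise LookupError("array %r not found in mdstat text" % array)
    expected, active, flags = status
    return active < expected or "_" in flags
```

Reading the text with `open("/proc/mdstat").read()` and calling `is_degraded(text, "md0")` after the replacement drive has been added and resynced would be one way to confirm the array is whole before starting any further checks.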
Quoting Mathias Burén <mathias.buren@gmail.com>:
> First, thanks for this:
>
>> The primary purpose of data scrubbing a RAID is to detect & correct
>> read errors on any of the member devices; both check and repair
>> perform this function. Finding (and w/ repair correcting) mismatches
>> is only a secondary purpose - it is only if there are no read errors
>> but the data copy or parity blocks are found to be inconsistent that a
>> mismatch is reported. In order to repair a mismatch, MD needs to
>> restore consistency, by over writing the inconsistent data copy or
>> parity blocks w/ the correct data. But, because the underlying member
>> devices did not return any errors, MD has no way of knowing which
>> blocks are correct, and which are incorrect; when it is told to do a
>> repair, it makes the assumption that the first copy in a RAID1 or
>> RAID10, or the data (non-parity) blocks in RAID4/5/6 are correct, and
>> corrects the mismatch based on that assumption.
>>
>> That assumption may or may not be correct, but MD has no way of
>> determining that reliably - but the user might be able to, by using
>> additional knowledge or tools, so MD gives the user the option to
>> perform data scrubbing either with (repair) or without (check) MD
>> correcting the mismatches using that assumption.
>>
>>
>> I hope that answers your question,
>> Beolach
>
> My RAID6 is currently degraded with one HDD (panic mail on the list),
> and my weekly cron job kicked in doing the RAID6 check action. This is
> the result:
>
> DEV EVENTS REALL PEND UNCORR CRC RAW ZONE END
> sdb1 6239487 0 0 0 2 0 0
> sdc1 6239487 0 0 0 0 0 0
> sdd1 6239487 0 0 0 0 0 0
> sde1 6239487 0 0 0 0 0 0
> sdf1 6239490 0 0 0 0 49 6
> sdg1 6239491 0 0 0 0 0 0
> sdh1 (missing, on RMA trip)
>
>
> (so the SMART is actually fine for all drives)
>
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdf1[5] sdg1[0] sdd1[4] sde1[7] sdc1[3] sdb1[1]
> 9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2
> [7/6] [UUUUU_U]
>
> unused devices: <none>
>
>
> /dev/md0:
> Version : 1.2
> Creation Time : Tue Oct 19 08:58:41 2010
> Raid Level : raid6
> Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
> Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
> Raid Devices : 7
> Total Devices : 6
> Persistence : Superblock is persistent
>
> Update Time : Sat Aug 6 14:13:08 2011
> State : clean, degraded
> Active Devices : 6
> Working Devices : 6
> Failed Devices : 0
> Spare Devices : 0
>
> Layout : left-symmetric
> Chunk Size : 64K
>
> Name : ion:0 (local to host ion)
> UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
> Events : 6239491
>
> Number Major Minor RaidDevice State
> 0 8 97 0 active sync /dev/sdg1
> 1 8 17 1 active sync /dev/sdb1
> 4 8 49 2 active sync /dev/sdd1
> 3 8 33 3 active sync /dev/sdc1
> 5 8 81 4 active sync /dev/sdf1
> 5 0 0 5 removed
> 7 8 65 6 active sync /dev/sde1
>
> So sdf1 and sdg1 have a different event count. Does this mean the HDDs
> have silently corrupted the data? I have no way of checking if the
> data itself is corrupt or not, except for perhaps a fsck of the
> filesystem? Does that make sense?
>
> * Should I run a repair?
> * Should I run a check again, to see if the event count changes?
> * Is it likely I have 2 more bad hard drives that will die soon?
> * Is it wise to run another smartctl -t long on all devices?
>
> Thanks,
> Mathias
* Re: RAID6 check found different events, how should I proceed?
From: NeilBrown @ 2011-08-08 22:57 UTC
To: Mathias Burén; +Cc: Linux-RAID
On Sat, 6 Aug 2011 17:02:48 +0100 Mathias Burén <mathias.buren@gmail.com>
wrote:
> On 6 August 2011 14:23, Mathias Burén <mathias.buren@gmail.com> wrote:
> > My RAID6 is currently degraded with one HDD (panic mail on the list),
> > and my weekly cron job kicked in doing the RAID6 check action. This is
> > the result:
> >
> > DEV EVENTS REALL PEND UNCORR CRC RAW ZONE END
> > sdb1 6239487 0 0 0 2 0 0
> > sdc1 6239487 0 0 0 0 0 0
> > sdd1 6239487 0 0 0 0 0 0
> > sde1 6239487 0 0 0 0 0 0
> > sdf1 6239490 0 0 0 0 49 6
> > sdg1 6239491 0 0 0 0 0 0
> > sdh1 (missing, on RMA trip)
> >
> (snip)
> > * Should I run a repair?
> > * Should I run a check again, to see if the event count changes?
> > * Is it likely I have 2 more bad hard drives that will die soon?
> > * Is it wise to run another smartctl -t long on all devices?
> >
> > Thanks,
> > Mathias
> >
>
> A followup;
>
> I ran smartctl -t long on all devices, and they all passed, SMART is
> fine. The number of events is also the same for all HDDs now:
>
> DEV EVENTS REALL PEND UNCORR CRC RAW ZONE END
> sdb1 6244415 0 0 0 2 0 0
> sdc1 6244415 0 0 0 0 0 0
> sdd1 6244415 0 0 0 0 0 0
> sde1 6244415 0 0 0 0 0 0
> sdf1 6244415 0 0 0 0 49 6
> sdg1 6244415 0 0 0 0 0 0
> sdh1
>
> This is without me running repair or anything like that.
The thing that you did which produced the change was that you let time pass.
Presumably there was a time delay (maybe small) between extracting the
'events' number from sde1 and sdf1, then sdf1 and sdg1. During these times
the 'events' number on all devices in the array was updated. This implies
some thread was writing, but possibly not writing very heavily.
When you sampled them all the second time and got the same number there were
presumably no writes happening, so the event numbers didn't change.
When there are occasional writes the array oscillates between 'clean' and
'active' and each change updates the 'events' number.
NeilBrown
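To illustrate the point about sampling, here is a small Python sketch (not part of the original reply) that reads the event count from each member one after another and reports whether they agree. The function names are hypothetical. It assumes `mdadm --examine <device>` prints an `Events : N` line, as it does for the 1.2 superblocks in this thread; the `read_examine` parameter exists only so the reader can be swapped out.

```python
import re
import subprocess


def parse_events(examine_output: str) -> int:
    """Extract the number from the 'Events : N' line of mdadm --examine output."""
    match = re.search(r"^\s*Events\s*:\s*(\d+)", examine_output, re.MULTILINE)
    if match is None:
        raise ValueError("no Events line found")
    return int(match.group(1))


def run_examine(device: str) -> str:
    """Default reader: run mdadm --examine on one device (needs root)."""
    result = subprocess.run(
        ["mdadm", "--examine", device],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def sample_events(devices, read_examine=run_examine):
    """Read each device's event count in turn; the reads are NOT simultaneous."""
    return {dev: parse_events(read_examine(dev)) for dev in devices}


def counts_agree(samples) -> bool:
    """True if every sampled device reported the same event count."""
    return len(set(samples.values())) == 1
```

Because the devices are read sequentially, a disagreement from a single pass on an array that is being written to does not by itself indicate a problem; sampling again while the array is idle, as happened in the follow-up above, is the more meaningful comparison.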