* Help understanding the root cause of a member dropping out of a RAID 1 set.
From: Simon Jackson @ 2009-08-13 8:44 UTC
To: linux-raid
I am running RAID1 partitions on some systems, and a few times I have seen a RAID set become degraded because a member has failed out of the md device. Looking at the /var/log/messages file, I have seen output similar to the below:
Can anyone help me decode what actually happened here?
Thanks, Simon.
2009-08-11T06:21:04-07:00 Metro-1 kernel: [556568.670377] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
2009-08-11T06:21:04-07:00 Metro-1 kernel: [556568.670477] ata1: hard resetting link
2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.122562] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.259057] ata1.00: configured for UDMA/133
2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] ata1.01: configured for UDMA/133
2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] md: super_written gets error=-5, uptodate=0
2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] raid1: Operation continuing on 1 devices.
2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] ata1: EH complete
* Re: Help understanding the root cause of a member dropping out of a RAID 1 set.
From: Billy Crook @ 2009-08-13 16:13 UTC
To: Simon Jackson; +Cc: linux-raid
On Thu, Aug 13, 2009 at 03:44, Simon Jackson<sjackson@bluearc.com> wrote:
>
> I am running RAID1 partitions on some systems and a few times I have seen a raid set become degraded as a member has failed out of the md device. Looking at the /var/log/message file I have seen output similar to below:
>
> Can anyone help me decode what actually happened here.
>
> Thanks Simon.
>
> 2009-08-11T06:21:04-07:00 Metro-1 kernel: [556568.670377] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
The hard drive didn't respond to an ATA command before the timeout expired.
> 2009-08-11T06:21:04-07:00 Metro-1 kernel: [556568.670477] ata1: hard resetting link
The kernel tells the disk controller to hard-reset the link. (Sometimes this gets
"frozen" hard drives to respond again.)
> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.122562] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.259057] ata1.00: configured for UDMA/133
> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] ata1.01: configured for UDMA/133
The link has been reset and is back up.
> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] md: super_written gets error=-5, uptodate=0
mdraid notices: its superblock write failed with error -5 (EIO, an I/O error).
> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] raid1: Operation continuing on 1 devices.
mdraid marks the component that encountered the error as failed, and
keeps running on the remaining device.
> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] ata1: EH complete
The ATA error handling (reset of the link, and subsequently the drive) is complete.
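The line-by-line reading above can be sketched as a small classifier, which may help when scanning a long log for the same sequence. This is only an illustration: the event labels are made-up names, and the patterns are derived solely from the excerpt quoted in this thread, so other kernels or drivers may word these messages differently.

```python
import re

# Hypothetical event labels; the patterns come from the log excerpt in this thread.
PATTERNS = [
    (re.compile(r"Emask 0x[0-9a-f]+ \(timeout\)"), "ata_timeout"),
    (re.compile(r"ata\d+: hard resetting link"), "link_reset"),
    (re.compile(r"ata\d+: SATA link up"), "link_up"),
    (re.compile(r"md: super_written gets error=-?\d+"), "md_superblock_write_error"),
    (re.compile(r"raid1: Operation continuing on \d+ devices"), "raid1_degraded"),
    (re.compile(r"ata\d+: EH complete"), "eh_complete"),
]


def classify(line):
    """Return the label of the first pattern that matches, or None."""
    for pattern, label in PATTERNS:
        if pattern.search(line):
            return label
    return None


def summarize(lines):
    """Return the recognized events in the order they appear."""
    return [label for label in map(classify, lines) if label is not None]
```

Feeding the quoted excerpt through `summarize` should yield the timeout, the link reset, the link coming back up, the md write error, the degraded notice, and the end of error handling, in that order; the "configured for UDMA/133" lines are deliberately ignored.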
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Help understanding the root cause of a member dropping out of a RAID 1 set.
From: John Robinson @ 2009-08-13 16:26 UTC
To: linux-raid
On Thu, 13 August, 2009 5:13 pm, Billy Crook wrote:
> On Thu, Aug 13, 2009 at 03:44, Simon Jackson<sjackson@bluearc.com> wrote:
[...]
>> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] md:
>> super_written gets error=-5, uptodate=0
>
> mdraid notices. says oh craps.
>
>> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] raid1:
>> Operation continuing on 1 devices.
>
> mdraid marks the component that encountered the error failed, and
> keeps on keeping on
>
>> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] ata1: EH
>> complete
>
> ata reset (of link, and subsequently drive) is complete.
Can or could md be made or configured to try re-adding a device if this
sort of thing happens? After all, a stray cosmic ray or whatever perhaps
shouldn't make one lose redundancy if the drive's actually OK?
Cheers,
John.
* Re: Help understanding the root cause of a member dropping out of a RAID 1 set.
From: Paweł Brodacki @ 2009-08-14 13:09 UTC
To: linux-raid
2009/8/13 John Robinson <john.robinson@anonymous.org.uk>:
> Can or could md be made or configured to try re-adding a device if this
> sort of thing happens? After all, a stray cosmic ray or whatever perhaps
> shouldn't make one lose redundancy if the drive's actually OK?
>
> Cheers,
>
> John.
>
I think that from the coding point of view md probably could. The more
important question is whether it should. The only hard fact is that there was
an error while accessing the device. md has no way of telling whether it
was just a freak accident or the drive is unreliable from now on.
Therefore it does the one safe thing and says "I won't trust you
anymore." If a human being knows better, said being is free to
re-add the drive.
Personally I'd hate having a suspicious drive being auto-added in the hope
that it will rebuild and function properly.
Because such an option could seem tempting but could, and would, cause a
loss of reliability, I'd expect bad publicity if it were actually added.
Just my 2c.
Regards,
Paweł
* Re: Help understanding the root cause of a member dropping out of a RAID 1 set.
From: Robin Hill @ 2009-08-14 13:21 UTC
To: linux-raid
On Thu Aug 13, 2009 at 05:26:39PM +0100, John Robinson wrote:
> On Thu, 13 August, 2009 5:13 pm, Billy Crook wrote:
> > On Thu, Aug 13, 2009 at 03:44, Simon Jackson<sjackson@bluearc.com> wrote:
> [...]
> >> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] md:
> >> super_written gets error=-5, uptodate=0
> >
> > mdraid notices. says oh craps.
> >
> >> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] raid1:
> >> Operation continuing on 1 devices.
> >
> > mdraid marks the component that encountered the error failed, and
> > keeps on keeping on
> >
> >> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] ata1: EH
> >> complete
> >
> > ata reset (of link, and subsequently drive) is complete.
>
> Can or could md be made or configured to try re-adding a device if this
> sort of thing happens? After all, a stray cosmic ray or whatever perhaps
> shouldn't make one lose redundancy if the drive's actually OK?
>
If you want to do this, it should be doable via the PROGRAM option in
mdadm.conf (using standard mdadm calls). As has been pointed out
elsewhere though, doing so automatically can be rather a risky option.
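As a rough sketch of what such a handler might look like: mdadm in monitor mode runs the configured program with the event name, the md device, and (for some events) the component device as arguments. The script below is an assumption-laden illustration, not a recommended setup: the function names are made up, it reacts only to the "Fail" event, and it defaults to a dry run that prints the `mdadm --re-add` command instead of executing it.

```python
import subprocess


def build_readd_command(event, md_device, component=None):
    """Return the mdadm command that would re-add a failed member, or None."""
    # Only act on a member failure that names the component device.
    if event != "Fail" or component is None:
        return None
    return ["mdadm", md_device, "--re-add", component]


def main(argv, dry_run=True):
    """Handle one monitor event; argv mirrors sys.argv of the handler script."""
    if len(argv) < 3:
        return 2  # not called the way mdadm --monitor calls its program
    event, md_device = argv[1], argv[2]
    component = argv[3] if len(argv) > 3 else None
    command = build_readd_command(event, md_device, component)
    if command is None:
        return 0
    print("would run:" if dry_run else "running:", " ".join(command))
    if not dry_run:
        return subprocess.call(command)
    return 0
```

A wrapper script would call `main(sys.argv)` and be named on a PROGRAM line in mdadm.conf (the path is whatever you choose). Leaving `dry_run` at its default keeps the risk mentioned above out of the picture until a human has decided the behaviour is wanted.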
Cheers,
Robin
--
___
( ' } | Robin Hill <robin@robinhill.me.uk> |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |
* Re: Help understanding the root cause of a member dropping out of a RAID 1 set.
From: John Robinson @ 2009-08-14 17:07 UTC
To: Paweł Brodacki; +Cc: linux-raid
On 14/08/2009 14:09, Paweł Brodacki wrote:
> 2009/8/13 John Robinson <john.robinson@anonymous.org.uk>:
>
>> Can or could md be made or configured to try re-adding a device if this
>> sort of thing happens? After all, a stray cosmic ray or whatever perhaps
>> shouldn't make one lose redundancy if the drive's actually OK?
>
> I think that from the coding point of view md probably could. The more
> important thing is if it should. The only hard fact is that there was
> an error while accessing the device. md has no way of telling if it
> was just a freak accident, or the drive is unreliable from now on.
Ah well, perhaps we need to give md a way of knowing the difference
between a transient error (that has been recovered from) and a more
serious error.
> Therefore it does the one safe thing and says "I won't trust you
> anymore.". If a human being knows better, the said being is free to
> re-add the drive.
>
> Personally I'd hate having a suspicious drive being auto-added in hope
> it will rebuild and function properly.
I wouldn't want it to be the default behaviour, but I'd like the option
to configure things that way. I'd want the number of auto-re-adds
configurable too.
> Because such an option could seem tempting but could and would cause
> loss of reliability I'd expect bad publicity if it was actually added.
But it could cause improvements in reliability too. If the cable on
drive A is hit by cosmic rays, the drive is taken out of the array, but
the drive's actually still fine, then drive B fails before the operator
has re-added drive A, the array goes down when it didn't need to.
What is the operator's most likely response to seeing the SATA bus
reset? She's going to re-add the drive assuming it was a transient
error. If we could make this happen automatically, we could close a
window when the array's more vulnerable. I wouldn't suggest we do it
silently; it gets logged, notified etc. just like the drive being taken
out of the array would be.
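To make the idea concrete, here is a minimal sketch of the kind of policy I mean: a configurable cap on automatic re-adds per device, with every decision logged. Nothing like this exists in md as far as this thread establishes; the class and its names are purely hypothetical.

```python
from collections import defaultdict


class ReaddPolicy:
    """Hypothetical policy: allow at most max_readds automatic re-adds per device."""

    def __init__(self, max_readds=1):
        self.max_readds = max_readds
        self.counts = defaultdict(int)  # re-adds already granted, per device
        self.log = []                   # every decision is recorded, never silent

    def should_readd(self, device):
        """Return True if the device may be re-added automatically once more."""
        if self.counts[device] >= self.max_readds:
            self.log.append(f"{device}: re-add limit reached, leaving it failed")
            return False
        self.counts[device] += 1
        self.log.append(
            f"{device}: automatic re-add {self.counts[device]}/{self.max_readds}"
        )
        return True
```

With `max_readds=0` the policy reduces to today's behaviour (never re-add automatically), which is why it could be shipped as an opt-in rather than a default.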
Cheers,
John.
* Re: Help understanding the root cause of a member dropping out of a RAID 1 set.
From: Richard Scobie @ 2009-08-14 20:56 UTC
To: John Robinson; +Cc: Paweł Brodacki, linux-raid
John Robinson wrote:
> What is the operator's most likely response to seeing the SATA bus
> reset? She's going to re-add the drive assuming it was a transient
> error. If we could make this happen automatically, we could close a
I'd like to think a better response would be to use smartctl on the
drive to examine it for signs of internal errors first...
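One way to script that check is to look at smartctl's exit status, which its man page documents as a bitmask. The sketch below decodes that bitmask; the choice of which bits count as "do not re-add" is my own assumption, and the function names are made up for illustration.

```python
# Bit meanings as documented in the smartctl man page (exit status section).
SMARTCTL_STATUS_BITS = {
    0: "command line did not parse",
    1: "device open failed or device is in a low-power mode",
    2: "a SMART or other ATA command failed, or a checksum error occurred",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes found at or below threshold",
    5: "attributes were at or below threshold at some time in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}

# Assumption: treat bits 3, 4, 6 and 7 as reasons not to re-add the drive.
UNHEALTHY_MASK = (1 << 3) | (1 << 4) | (1 << 6) | (1 << 7)


def decode_smartctl_status(status):
    """Return the list of conditions encoded in a smartctl exit status."""
    return [
        message
        for bit, message in sorted(SMARTCTL_STATUS_BITS.items())
        if status & (1 << bit)
    ]


def looks_unhealthy(status):
    """Return True if the exit status suggests the drive should not be trusted."""
    return bool(status & UNHEALTHY_MASK)
```

The status itself would come from something like `subprocess.run(["smartctl", "-H", "-l", "error", device]).returncode`; a clean result says nothing about cabling or controller problems, so it narrows the question rather than settling it.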
Regards,
Richard