Help understanding the root cause of a member dropping out of a RAID 1 set.

* Help understanding the root cause of a member dropping out of a RAID 1 set.
@ 2009-08-13  8:44 Simon Jackson
  2009-08-13 16:13 ` Billy Crook
  0 siblings, 1 reply; 7+ messages in thread
From: Simon Jackson @ 2009-08-13  8:44 UTC (permalink / raw)
  To: linux-raid


I am running RAID1 partitions on some systems, and a few times I have seen a RAID set become degraded because a member has failed out of the md device.  Looking at the /var/log/messages file, I have seen output similar to that below.

Can anyone help me decode what actually happened here?

Thanks Simon.

2009-08-11T06:21:04-07:00 Metro-1 kernel: [556568.670377]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
2009-08-11T06:21:04-07:00 Metro-1 kernel: [556568.670477] ata1: hard resetting link
2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.122562] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.259057] ata1.00: configured for UDMA/133
2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] ata1.01: configured for UDMA/133
2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] md: super_written gets error=-5, uptodate=0
2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] raid1: Operation continuing on 1 devices.
2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] ata1: EH complete

* Re: Help understanding the root cause of a member dropping out of a RAID 1 set.
  2009-08-13  8:44 Help understanding the root cause of a member dropping out of a RAID 1 set Simon Jackson
@ 2009-08-13 16:13 ` Billy Crook
  2009-08-13 16:26   ` John Robinson
  0 siblings, 1 reply; 7+ messages in thread
From: Billy Crook @ 2009-08-13 16:13 UTC (permalink / raw)
  To: Simon Jackson; +Cc: linux-raid

On Thu, Aug 13, 2009 at 03:44, Simon Jackson<sjackson@bluearc.com> wrote:
>
> I am running RAID1 partitions on some systems and a few times I have seen a raid set become degraded as a member has failed out of the md device.  Looking at the /var/log/message file I have seen output similar to below:
>
> Can anyone help me decode what actually happened here.
>
> Thanks Simon.
>
> 2009-08-11T06:21:04-07:00 Metro-1 kernel: [556568.670377]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

The hard drive didn't respond to an ATA command before the timeout expired.

> 2009-08-11T06:21:04-07:00 Metro-1 kernel: [556568.670477] ata1: hard resetting link

The kernel tells the drive controller to reset the link. (Sometimes this
gets "frozen" hard drives to respond again.)

> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.122562] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.259057] ata1.00: configured for UDMA/133
> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] ata1.01: configured for UDMA/133

The link has been reset and is back up, and both devices on the port
(ata1.00 and ata1.01) have been reconfigured.

> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] md: super_written gets error=-5, uptodate=0

mdraid notices: its superblock write to that member came back with
error -5 (EIO, an I/O error).

> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] raid1: Operation continuing on 1 devices.

mdraid marks the component that encountered the error as failed, and
keeps running on the one remaining device.

> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] ata1: EH complete

The ATA reset (of the link, and subsequently the drive) is complete.
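
The sequence decoded above can also be picked out of a log mechanically. What follows is only a minimal Python sketch, not a tool anyone in this thread uses: the `classify` function and its stage labels are made up for illustration, and the patterns are written against the message text quoted above, so other kernels or drivers may word things differently.

```python
import re

# Patterns come from the messages quoted in this thread. The labels on the
# right are illustrative names chosen for this sketch, not kernel terms.
STAGES = [
    (re.compile(r"Emask 0x\w+ \(timeout\)"), "command timeout"),
    (re.compile(r"ata\d+: hard resetting link"), "link reset started"),
    (re.compile(r"ata\d+: SATA link up"), "link back up"),
    (re.compile(r"ata\d+\.\d+: configured for"), "device reconfigured"),
    (re.compile(r"md: super_written gets error=-5"),
     "md superblock write failed (EIO)"),
    (re.compile(r"raid1: Operation continuing on \d+ devices"),
     "array degraded"),
    (re.compile(r"ata\d+: EH complete"), "error handling finished"),
]


def classify(line):
    """Return the stage label for a kernel log line, or None if unknown."""
    for pattern, label in STAGES:
        if pattern.search(line):
            return label
    return None
```

Feeding each line of the log through `classify` and printing the non-None results gives the same timeline as the commentary: timeout, reset, link up, md write error, degraded array, error handling complete.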
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Help understanding the root cause of a member dropping out of a  RAID 1 set.
  2009-08-13 16:13 ` Billy Crook
@ 2009-08-13 16:26   ` John Robinson
  2009-08-14 13:09     ` Paweł Brodacki
  2009-08-14 13:21     ` Robin Hill
  0 siblings, 2 replies; 7+ messages in thread
From: John Robinson @ 2009-08-13 16:26 UTC (permalink / raw)
  To: linux-raid

On Thu, 13 August, 2009 5:13 pm, Billy Crook wrote:
> On Thu, Aug 13, 2009 at 03:44, Simon Jackson<sjackson@bluearc.com> wrote:
[...]
>> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] md:
>> super_written gets error=-5, uptodate=0
>
> mdraid notices.  says oh craps.
>
>> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] raid1:
>> Operation continuing on 1 devices.
>
> mdraid marks the component that encountered the error failed, and
> keeps on keeping on
>
>> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] ata1: EH
>> complete
>
> ata reset (of link, and subsequently drive) is complete.

Can or could md be made or configured to try re-adding a device if this
sort of thing happens? After all, a stray cosmic ray or whatever perhaps
shouldn't make one lose redundancy if the drive's actually OK?

Cheers,

John.


* Re: Help understanding the root cause of a member dropping out of a RAID 1 set.
  2009-08-13 16:26   ` John Robinson
@ 2009-08-14 13:09     ` Paweł Brodacki
  2009-08-14 17:07       ` John Robinson
  2009-08-14 13:21     ` Robin Hill
  1 sibling, 1 reply; 7+ messages in thread
From: Paweł Brodacki @ 2009-08-14 13:09 UTC (permalink / raw)
  To: linux-raid

2009/8/13 John Robinson <john.robinson@anonymous.org.uk>:

> Can or could md be made or configured to try re-adding a device if this
> sort of thing happens? After all, a stray cosmic ray or whatever perhaps
> shouldn't make one lose redundancy if the drive's actually OK?
>
> Cheers,
>
> John.
>

I think that from the coding point of view md probably could. The more
important question is whether it should. The only hard fact is that
there was an error while accessing the device. md has no way of telling
whether it was just a freak accident or whether the drive is unreliable
from now on. Therefore it does the one safe thing and says "I won't
trust you anymore." If a human being knows better, said being is free
to re-add the drive.

Personally I'd hate having a suspicious drive being auto-added in hope
it will rebuild and function properly.

Such an option could seem tempting, but it could and would cause a loss
of reliability, so I'd expect bad publicity if it were actually added.

Just my 2c.

Regards,
Paweł

* Re: Help understanding the root cause of a member dropping out of a  RAID 1 set.
  2009-08-13 16:26   ` John Robinson
  2009-08-14 13:09     ` Paweł Brodacki
@ 2009-08-14 13:21     ` Robin Hill
  1 sibling, 0 replies; 7+ messages in thread
From: Robin Hill @ 2009-08-14 13:21 UTC (permalink / raw)
  To: linux-raid

On Thu Aug 13, 2009 at 05:26:39PM +0100, John Robinson wrote:

> On Thu, 13 August, 2009 5:13 pm, Billy Crook wrote:
> > On Thu, Aug 13, 2009 at 03:44, Simon Jackson<sjackson@bluearc.com> wrote:
> [...]
> >> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] md:
> >> super_written gets error=-5, uptodate=0
> >
> > mdraid notices.  says oh craps.
> >
> >> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] raid1:
> >> Operation continuing on 1 devices.
> >
> > mdraid marks the component that encountered the error failed, and
> > keeps on keeping on
> >
> >> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] ata1: EH
> >> complete
> >
> > ata reset (of link, and subsequently drive) is complete.
> 
> Can or could md be made or configured to try re-adding a device if this
> sort of thing happens? After all, a stray cosmic ray or whatever perhaps
> shouldn't make one lose redundancy if the drive's actually OK?
> 
If you want to do this, it should be doable via the PROGRAM option in
mdadm.conf (using standard mdadm calls).  As has been pointed out
elsewhere though, doing so automatically can be rather a risky option.
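
As a rough illustration of that approach, here is a sketch of a handler that `mdadm --monitor` could run via a `PROGRAM` line in mdadm.conf. mdadm passes the handler the event name, the md device and, for some events, the component device. Everything else here is an assumption of the sketch: the function names, reacting only to `Fail` events, removing the member before re-adding it, and printing the command instead of running it.

```python
def readd_command(event, md_device, component=None):
    """Return an mdadm command that re-adds a failed member, or None."""
    # Only react to Fail events that name a component device.
    if event != "Fail" or component is None:
        return None
    # Assumption: the failed member is removed first, then re-added.
    return ["mdadm", md_device, "--remove", component, "--re-add", component]


def main(argv):
    """Entry point: argv is [script, event, md_device, optional component]."""
    if len(argv) < 3:
        return 0
    command = readd_command(*argv[1:4])
    if command is not None:
        # Dry run: print the command rather than executing it.
        print(" ".join(command))
    return 0
```

A real handler would finish by calling `main(sys.argv)` and would be referenced from mdadm.conf with a line such as `PROGRAM /usr/local/sbin/md-readd` (a hypothetical path). Keeping it as a dry run until the output looks right seems prudent given the risk mentioned above.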

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

* Re: Help understanding the root cause of a member dropping out of a  RAID 1 set.
  2009-08-14 13:09     ` Paweł Brodacki
@ 2009-08-14 17:07       ` John Robinson
  2009-08-14 20:56         ` Richard Scobie
  0 siblings, 1 reply; 7+ messages in thread
From: John Robinson @ 2009-08-14 17:07 UTC (permalink / raw)
  To: Paweł Brodacki; +Cc: linux-raid

On 14/08/2009 14:09, Paweł Brodacki wrote:
> 2009/8/13 John Robinson <john.robinson@anonymous.org.uk>:
> 
>> Can or could md be made or configured to try re-adding a device if this
>> sort of thing happens? After all, a stray cosmic ray or whatever perhaps
>> shouldn't make one lose redundancy if the drive's actually OK?
> 
> I think that from the coding point of view md probably could. The more
> important thing is if it should. The only hard fact is that there was
> an error while accessing the device. md has no way of telling if it
> was just a freak accident, or the drive is unreliable from now on.

Ah well, perhaps we need to give md a way of knowing the difference 
between a transient error (that has been recovered from) and a more 
serious error.

> Therefore it does the one safe thing and says "I won't trust you
> anymore.". If a human being knows better, the said being is free to
> re-add the drive.
> 
> Personally I'd hate having a suspicious drive being auto-added in hope
> it will rebuild and function properly.

I wouldn't want it to be the default behaviour, but I'd like the option 
to configure things that way. I'd want the number of auto-re-adds 
configurable too.
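
To make the "configurable number of auto-re-adds" idea concrete, here is a small Python sketch of such a policy. md has no option like this as far as this thread establishes; the `ReaddBudget` class and its `max_readds` parameter are hypothetical names used purely for illustration.

```python
class ReaddBudget:
    """Allow at most max_readds automatic re-adds per device."""

    def __init__(self, max_readds):
        self.max_readds = max_readds
        self.counts = {}  # device name -> re-adds already granted

    def allow(self, device):
        """Return True and record the attempt if the device has budget left."""
        used = self.counts.get(device, 0)
        if used >= self.max_readds:
            # Budget exhausted: leave the device out for a human to inspect.
            return False
        self.counts[device] = used + 1
        return True
```

With `max_readds` set to 0 this degenerates to today's behaviour, where a failed member is never re-added automatically.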

> Because such an option could seem tempting but could and would cause
> loss of reliability I'd expect bad publicity if it was actually added.

But it could cause improvements in reliability too. Suppose the cable on
drive A is hit by cosmic rays and the drive is taken out of the array
even though it is actually still fine. If drive B then fails before the
operator has re-added drive A, the array goes down when it didn't need to.

What is the operator's most likely response to seeing the SATA bus 
reset? She's going to re-add the drive assuming it was a transient 
error. If we could make this happen automatically, we could close a 
window when the array's more vulnerable. I wouldn't suggest we do it 
silently; it gets logged, notified etc. just like the drive being taken 
out of the array would be.

Cheers,

John.

* Re: Help understanding the root cause of a member dropping out of a  RAID 1 set.
  2009-08-14 17:07       ` John Robinson
@ 2009-08-14 20:56         ` Richard Scobie
  0 siblings, 0 replies; 7+ messages in thread
From: Richard Scobie @ 2009-08-14 20:56 UTC (permalink / raw)
  To: John Robinson; +Cc: Paweł Brodacki, linux-raid

John Robinson wrote:

> What is the operator's most likely response to seeing the SATA bus 
> reset? She's going to re-add the drive assuming it was a transient 
> error. If we could make this happen automatically, we could close a 

I'd like to think a better response would be to use smartctl on the
drive to examine it for signs of internal errors first...
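
A check like that could gate any re-add. The Python sketch below reads the overall-health line that `smartctl -H` prints for ATA drives from text that has already been captured. The function names are invented for this example, and a "PASSED" result is only a first filter, not proof that the drive is healthy; the attribute table and error log deserve a look as well.

```python
import re

# Matches the overall-health line in smartctl -H output for ATA drives.
HEALTH_RE = re.compile(
    r"SMART overall-health self-assessment test result:\s*(\S+)"
)


def smart_health(smartctl_output):
    """Return the reported health word (e.g. 'PASSED'), or None if absent."""
    match = HEALTH_RE.search(smartctl_output)
    return match.group(1) if match else None


def safe_to_readd(smartctl_output):
    """Treat anything other than an explicit PASSED as not safe."""
    return smart_health(smartctl_output) == "PASSED"
```

Missing or unrecognised output counts as a failure here, so the sketch errs on the side of leaving the drive out of the array.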

Regards,

Richard
