* Recovered disk error caused disk to go offline.
@ 2004-01-29  5:21 Guy
  2004-01-30 19:02 ` Guy
  0 siblings, 1 reply; 9+ messages in thread
From: Guy @ 2004-01-29  5:21 UTC
  To: linux-scsi

Neil Brown said to send this message to linux-scsi, so here it is.

Please help.
Thanks,
Guy

On Thursday January 29, bugzilla@watkins-home.com wrote:
> As you can see in the log, the write error recovered with auto reallocation!
> As I understand it, this is a normal event with today's disks.
> I don't think the disk should have been considered failed.
> 
> Comments please?

You need to talk to linux-scsi about this.
The scsi subsystem told the raid subsystem that there was an error, so
the raid subsystem stopped using the device.

If the write error was recovered, scsi shouldn't have reported an
error to raid.

NeilBrown

> 
> Thanks,
> Guy
> 
> The spare disk resynced just fine.  I never knew for over 24 hours!
> This is cool stuff!
> 
> Jan 27 12:44:06 watkins kernel: SCSI disk error : host 2 channel 0 id 4 lun 0 return code = 8000002
> Jan 27 12:44:06 watkins kernel: Info fld=0x7e5c81, Deferred sd08:71: sense key Recovered Error
> Jan 27 12:44:06 watkins kernel: Additional sense indicates Write error - recovered with auto reallocation
> Jan 27 12:44:06 watkins kernel:  I/O error: dev 08:71, sector 8280704
> Jan 27 12:44:06 watkins kernel: raid5: Disk failure on sdh1, disabling device. Operation continuing on 13 devices
> Jan 27 12:44:06 watkins kernel: md: updating md2 RAID superblock on device
> Jan 27 12:44:06 watkins kernel: md: sdc1 [events: 00000009]<6>(write) sdc1's sb offset: 17767744
> Jan 27 12:44:06 watkins kernel: md: recovery thread got woken up ...
> Jan 27 12:44:06 watkins kernel: md2: resyncing spare disk sdc1 to replace failed disk
> Jan 27 12:44:06 watkins kernel: RAID5 conf printout:
> Jan 27 12:44:06 watkins kernel:  --- rd:14 wd:13 fd:1
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

* RE: Recovered disk error caused disk to go offline.
  2004-01-29  5:21 Recovered disk error caused disk to go offline Guy
@ 2004-01-30 19:02 ` Guy
  2004-01-30 20:33   ` Clay Haapala
  2004-02-01 15:25   ` James Bottomley
  0 siblings, 2 replies; 9+ messages in thread
From: Guy @ 2004-01-30 19:02 UTC
  To: linux-scsi

Sorry about the re-post, but no comments after almost 2 days.

-----Original Message-----
From: linux-scsi-owner@vger.kernel.org
[mailto:linux-scsi-owner@vger.kernel.org] On Behalf Of Guy
Sent: Thursday, January 29, 2004 12:21 AM
To: linux-scsi@vger.kernel.org
Subject: Recovered disk error caused disk to go offline.

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



* Re: Recovered disk error caused disk to go offline.
  2004-01-30 19:02 ` Guy
@ 2004-01-30 20:33   ` Clay Haapala
  2004-02-02  5:00     ` Neil Brown
  2004-02-01 15:25   ` James Bottomley
  1 sibling, 1 reply; 9+ messages in thread
From: Clay Haapala @ 2004-01-30 20:33 UTC
  To: Guy; +Cc: linux-scsi, linux-raid

iSCSI acts as another HBA, and conveys status up from the [Fibre
Channel] devices to the scsi layer.  SCSI reported that event, and the
raid system rolled over the disk to another, more reliable, one.
Wouldn't that be correct behavior for Raid?  Cc-ing linux-raid...

On Fri, 30 Jan 2004, Guy verbalised:
> Sorry about the re-post, but no comments after almost 2 days.

-- 
Clay Haapala (chaapala@cisco.com) Cisco Systems SRBU +1 763-398-1056
   6450 Wedgwood Rd, Suite 130 Maple Grove MN 55311 PGP: C89240AD
             Minnesota, a quite agreeable state.  Lately,
             Celsius and Fahrenheit have tended to agree.


* RE: Recovered disk error caused disk to go offline.
  2004-01-30 19:02 ` Guy
  2004-01-30 20:33   ` Clay Haapala
@ 2004-02-01 15:25   ` James Bottomley
  2004-02-01 16:10     ` Guy
  1 sibling, 1 reply; 9+ messages in thread
From: James Bottomley @ 2004-02-01 15:25 UTC
  To: Guy; +Cc: SCSI Mailing List

On Fri, 2004-01-30 at 14:02, Guy wrote:
> Sorry about the re-post, but no comments after almost 2 days.

Recovered disc error processing has been in the sd driver for nearly two
years now.  What kernel version was this?

James




* RE: Recovered disk error caused disk to go offline.
  2004-02-01 15:25   ` James Bottomley
@ 2004-02-01 16:10     ` Guy
  2004-02-01 16:23       ` James Bottomley
  0 siblings, 1 reply; 9+ messages in thread
From: Guy @ 2004-02-01 16:10 UTC
  To: 'James Bottomley'; +Cc: 'SCSI Mailing List'

RedHat 9.0
From uname -a:
Linux watkins 2.4.20-8smp #1 SMP Thu Mar 13 17:45:54 EST 2003 i686 i686 i386 GNU/Linux

Guy

-----Original Message-----
From: linux-scsi-owner@vger.kernel.org
[mailto:linux-scsi-owner@vger.kernel.org] On Behalf Of James Bottomley
Sent: Sunday, February 01, 2004 10:25 AM
To: Guy
Cc: SCSI Mailing List
Subject: RE: Recovered disk error caused disk to go offline.


* RE: Recovered disk error caused disk to go offline.
  2004-02-01 16:10     ` Guy
@ 2004-02-01 16:23       ` James Bottomley
  0 siblings, 0 replies; 9+ messages in thread
From: James Bottomley @ 2004-02-01 16:23 UTC
  To: Guy; +Cc: 'SCSI Mailing List'

On Sun, 2004-02-01 at 11:10, Guy wrote:
> RedHat 9.0
> From uname -a:
> Linux watkins 2.4.20-8smp #1 SMP Thu Mar 13 17:45:54 EST 2003 i686 i686 i386 GNU/Linux

Aha, for a vendor kernel you need to file a bugzilla with redhat.

Alternatively, if you can reproduce it with the latest kernel.org 2.4 or
2.6 kernel, we can diagnose it.

James




* Re: Recovered disk error caused disk to go offline.
  2004-01-30 20:33   ` Clay Haapala
@ 2004-02-02  5:00     ` Neil Brown
  0 siblings, 0 replies; 9+ messages in thread
From: Neil Brown @ 2004-02-02  5:00 UTC
  To: Clay Haapala; +Cc: Guy, linux-scsi, linux-raid

On Friday January 30, chaapala@cisco.com wrote:
> iSCSI acts as another HBA, and conveys status up from the [Fibre
> Channel] devices to the scsi layer.  SCSI reported that event, and the
> raid system rolled over the disk to another, more reliable, one.
> Wouldn't that be correct behavior for Raid?  Cc-ing linux-raid...
> 

The only events the raid can see coming from scsi are:
  successful read/write
  unsuccessful read/write

There is no way in the Linux block layer to report "write was
successful, but I had to retry".

It appears that an 'unsuccessful write' was reported when the write
was actually successful.  This seems wrong.

NeilBrown



* Re: Recovered disk error caused disk to go offline.
  2004-01-29  5:00 Guy
@ 2004-01-29  5:05 ` Neil Brown
  0 siblings, 0 replies; 9+ messages in thread
From: Neil Brown @ 2004-01-29  5:05 UTC
  To: Guy; +Cc: linux-raid

On Thursday January 29, bugzilla@watkins-home.com wrote:
> As you can see in the log, the write error recovered with auto reallocation!
> As I understand it, this is a normal event with today


* Recovered disk error caused disk to go offline.
@ 2004-01-29  5:00 Guy
  2004-01-29  5:05 ` Neil Brown
  0 siblings, 1 reply; 9+ messages in thread
From: Guy @ 2004-01-29  5:00 UTC
  To: linux-raid

As you can see in the log, the write error recovered with auto reallocation!
As I understand it, this is a normal event with today’s disks.
I don’t think the disk should have been considered failed.

Comments please?

Thanks,
Guy

The spare disk resynced just fine.  I never knew for over 24 hours!
This is cool stuff!

Jan 27 12:44:06 watkins kernel: SCSI disk error : host 2 channel 0 id 4 lun 0 return code = 8000002
Jan 27 12:44:06 watkins kernel: Info fld=0x7e5c81, Deferred sd08:71: sense key Recovered Error
Jan 27 12:44:06 watkins kernel: Additional sense indicates Write error - recovered with auto reallocation
Jan 27 12:44:06 watkins kernel:  I/O error: dev 08:71, sector 8280704
Jan 27 12:44:06 watkins kernel: raid5: Disk failure on sdh1, disabling device. Operation continuing on 13 devices
Jan 27 12:44:06 watkins kernel: md: updating md2 RAID superblock on device
Jan 27 12:44:06 watkins kernel: md: sdc1 [events: 00000009]<6>(write) sdc1's sb offset: 17767744
Jan 27 12:44:06 watkins kernel: md: recovery thread got woken up ...
Jan 27 12:44:06 watkins kernel: md2: resyncing spare disk sdc1 to replace failed disk
Jan 27 12:44:06 watkins kernel: RAID5 conf printout:
Jan 27 12:44:06 watkins kernel:  --- rd:14 wd:13 fd:1


Thread overview: 9+ messages
2004-01-29  5:21 Recovered disk error caused disk to go offline Guy
2004-01-30 19:02 ` Guy
2004-01-30 20:33   ` Clay Haapala
2004-02-02  5:00     ` Neil Brown
2004-02-01 15:25   ` James Bottomley
2004-02-01 16:10     ` Guy
2004-02-01 16:23       ` James Bottomley
  -- strict thread matches above, loose matches on Subject: below --
2004-01-29  5:00 Guy
2004-01-29  5:05 ` Neil Brown
