linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* RE: Reproducible SCSI Error with Adaptec 7902 & ST336607LW
@ 2003-03-18 15:58 Cress, Andrew R
  2003-03-18 16:26 ` Terry Barnaby
  0 siblings, 1 reply; 4+ messages in thread
From: Cress, Andrew R @ 2003-03-18 15:58 UTC (permalink / raw)
  To: 'Terry Barnaby'; +Cc: linux-kernel

Terry,

>From your description in (1) below it sounds like the disk firmware did go
out to lunch.
Disk firmware is complex, and there are always possibilities that it can
hang/crash so that it won't respond, especially since a new disk model comes
out every 9-12 months.  

To get a disk firmware problem escalated you will need to gather some
evidence:
  1) The current firmware level (you have this already)
  2) The mode pages of the drive (you can use sgmode or other tools to
get/set these)
     See http://scsirastools.sourceforge.net for sgmode
     One thing to check, is whether SMART on or off has an effect (page 1c
0a 88 means completely off).  
     SMART processing on the drive runs in background on the disk and can
cause strange errors.
  3) A SCSI trace of the problem (requires a SCSI analyzer)
     If you don't have a SCSI analyzer, the bug conditions would have to be
very well-defined so 
     that it could be reproduced readily by Seagate.

Hopefully you can devise a workaround by tweaking the disk mode pages, since
reporting, analyzing, and producing a new disk firmware version would take
longer.

Andy

-----Original Message-----
From: Terry Barnaby [mailto:terry@beam.ltd.uk] 
Sent: Tuesday, March 18, 2003 4:38 AM
To: Cress, Andrew R
Cc: 'Ingo Oeser'; Michael Madore; Justin T. Gibbs;
linux-kernel@vger.kernel.org
Subject: Re: Reproducible SCSI Error with Adaptec 7902


Hi Andy,

We have just updated to the latest driver 1.3.4. This has stopped the
drive locking up, but we are now getting nasty SCSI error reports
in /var/log/messages. Will continue to delve into this.

However, what ever the fault that triggers our drive to lock-up, the
drive certainly locks up. It locks up with LED on and will not respond
to a SCSI bus reset. We need to power cycle the system to get the drive
working again. We have tried two Seagate ST336607LW drives both exibit
the same behaviour. It appears to only happen when Linux is running in
SMP mode and when the drive is running in packetized mode.

So there is certainly the possibility of the Seagate ST336607LW not 
responding to resets. This may be a firmware fault so we have talked
to Seagate about the issue. The statement is the result of our direct 
question:

> I realise that the problem could be due to the Linux SCSI driver, the
Motherboard SCSI controller, the SCSI lead or the drive. We are used to
> tracking down such nasty problems. However, I have one firm pointer:
> 
> 1. Once the drive is locked up, with its LED on, a SCSI bus reset will
>     not clear the drive. A full poweroff/poweron cycle is needed.
> 
> So I ask again, is there a case where the drive will not respond to a
> SCSI bus reset ? 

Is there any way of getting this information to higher level Seagate 
support ?

Terry


Cress, Andrew R wrote:
> Ingo,
> 
> Our testing with that drive (same firmware, using same aic7902 chipset)
has
> not shown any problems like this.  However, we were using a later aic79xx
> driver versions (1.3.x).  That upgrade should be the first step.
> 
> I wouldn't get too excited about the statement by a level-1 Seagate
support
> guy, probably just a blanket statement when they want to disclaim
> responsibility.  
> 
> Andy
> 
> -----Original Message-----
> From: Ingo Oeser [mailto:ingo.oeser@informatik.tu-chemnitz.de] 
> Sent: Saturday, March 15, 2003 8:12 AM
> To: Terry Barnaby
> Cc: Michael Madore; Justin T. Gibbs; linux-kernel@vger.kernel.org
> Subject: Re: Reproducible SCSI Error with Adaptec 7902
> 
> 
> On Fri, Mar 14, 2003 at 04:17:59PM +0000, Terry Barnaby wrote:
> 
>>The Seagate ST336607LW has firmware: 0004.
>>Seagate have stated to me that this is the latest.
>>They have also stated to me:
>>
>>  Issuing an unrecognized or illegal command to the drive can cause the
>>  drive to go into a hardware fault mode where it will no longer respond,
>>  and may or may not respond to a SCSI BUS reset. It seems, in this case,
>>  the drive will no longer respond to any commands issued by the
>>  controller.
>>
>>Is this "feature" now common on SCSI drives ????
> 
> 
> Could we add a KERN_WARNING printk in sd.c quoting/referencing
> this message on inquiry detecting this device? 
> 
> So sysadmins who are used to SCSI being robust could return the
> drive to their vendors in exchange to a drive working along the
> SCSI specs after reading this message.
> 
> Thanks in the name of the sysadmins.
> 
> Regards
> 
> Ingo Oeser
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

-- 
Dr Terry Barnaby                     BEAM Ltd
Phone: +44 1454 324512               Northavon Business Center, Dean Rd
Fax:   +44 1454 313172               Yate, Bristol, BS37 5NH, UK
Email: terry@beam.ltd.uk             Web: www.beam.ltd.uk
BEAM for: Visually Impaired X-Terminals, Parallel Processing, Software
                       "Tandems are twice the fun !"

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Reproducible SCSI Error with Adaptec 7902 & ST336607LW
  2003-03-18 15:58 Reproducible SCSI Error with Adaptec 7902 & ST336607LW Cress, Andrew R
@ 2003-03-18 16:26 ` Terry Barnaby
  0 siblings, 0 replies; 4+ messages in thread
From: Terry Barnaby @ 2003-03-18 16:26 UTC (permalink / raw)
  To: Cress, Andrew R; +Cc: linux-kernel, support

Hi Andrew,

We don't have a SCSI analyzer, so cannot do this bit. However we
know that the Adaptec AIC79XX driver curred this from 1.1.0 to 1.3.4
and that 1.3.4 is still indicating an error. I presume that some
error condition was not handled by the driver correctly and this
tiggered the disk lock up. Maybe a look at the changes could determine
the condition that resulted in the error.

However, the disks hardware should respond to a hardware RESET no
matter what the firmware is doing. This could be due to use of
FPGA's, with faulty bit files, which implement the hardware RESET ...

My easiest, and cheapest, work around is to buy another manufacturers
drive (In fact in our immediate use a modified Linux driver will 
probably suit). But, I would like to feed back the problem to save 
others wasting their time on this or having nasty system errors in the 
future.

My issues are:

1. Do Seagate consider that drives not responding to RESET in some
	cases is OK or is a bug. My responce from them indicates
	the former. And I would like, as a customer, to feed back that
	this is not acceptable.

2. I would like to feed back the existance of the fault to a higher
	level Seagate support enginneer.

Does anyone know how best to communicate with Seagate to achieve this ?

Cheers

Terry

Cress, Andrew R wrote:
> Terry,
> 
>>From your description in (1) below it sounds like the disk firmware did go
> out to lunch.
> Disk firmware is complex, and there are always possibilities that it can
> hang/crash so that it won't respond, especially since a new disk model comes
> out every 9-12 months.  
> 
> To get a disk firmware problem escalated you will need to gather some
> evidence:
>   1) The current firmware level (you have this already)
>   2) The mode pages of the drive (you can use sgmode or other tools to
> get/set these)
>      See http://scsirastools.sourceforge.net for sgmode
>      One thing to check, is whether SMART on or off has an effect (page 1c
> 0a 88 means completely off).  
>      SMART processing on the drive runs in background on the disk and can
> cause strange errors.
>   3) A SCSI trace of the problem (requires a SCSI analyzer)
>      If you don't have a SCSI analyzer, the bug conditions would have to be
> very well-defined so 
>      that it could be reproduced readily by Seagate.
> 
> Hopefully you can devise a workaround by tweaking the disk mode pages, since
> reporting, analyzing, and producing a new disk firmware version would take
> longer.
> 
> Andy
> 
> -----Original Message-----
> From: Terry Barnaby [mailto:terry@beam.ltd.uk] 
> Sent: Tuesday, March 18, 2003 4:38 AM
> To: Cress, Andrew R
> Cc: 'Ingo Oeser'; Michael Madore; Justin T. Gibbs;
> linux-kernel@vger.kernel.org
> Subject: Re: Reproducible SCSI Error with Adaptec 7902
> 
> 
> Hi Andy,
> 
> We have just updated to the latest driver 1.3.4. This has stopped the
> drive locking up, but we are now getting nasty SCSI error reports
> in /var/log/messages. Will continue to delve into this.
> 
> However, what ever the fault that triggers our drive to lock-up, the
> drive certainly locks up. It locks up with LED on and will not respond
> to a SCSI bus reset. We need to power cycle the system to get the drive
> working again. We have tried two Seagate ST336607LW drives both exibit
> the same behaviour. It appears to only happen when Linux is running in
> SMP mode and when the drive is running in packetized mode.
> 
> So there is certainly the possibility of the Seagate ST336607LW not 
> responding to resets. This may be a firmware fault so we have talked
> to Seagate about the issue. The statement is the result of our direct 
> question:
> 
> 
>>I realise that the problem could be due to the Linux SCSI driver, the
> 
> Motherboard SCSI controller, the SCSI lead or the drive. We are used to
> 
>>tracking down such nasty problems. However, I have one firm pointer:
>>
>>1. Once the drive is locked up, with its LED on, a SCSI bus reset will
>>    not clear the drive. A full poweroff/poweron cycle is needed.
>>
>>So I ask again, is there a case where the drive will not respond to a
>>SCSI bus reset ? 
> 
> 
> Is there any way of getting this information to higher level Seagate 
> support ?
> 
> Terry
> 
> 
> Cress, Andrew R wrote:
> 
>>Ingo,
>>
>>Our testing with that drive (same firmware, using same aic7902 chipset)
> 
> has
> 
>>not shown any problems like this.  However, we were using a later aic79xx
>>driver versions (1.3.x).  That upgrade should be the first step.
>>
>>I wouldn't get too excited about the statement by a level-1 Seagate
> 
> support
> 
>>guy, probably just a blanket statement when they want to disclaim
>>responsibility.  
>>
>>Andy
>>
>>-----Original Message-----
>>From: Ingo Oeser [mailto:ingo.oeser@informatik.tu-chemnitz.de] 
>>Sent: Saturday, March 15, 2003 8:12 AM
>>To: Terry Barnaby
>>Cc: Michael Madore; Justin T. Gibbs; linux-kernel@vger.kernel.org
>>Subject: Re: Reproducible SCSI Error with Adaptec 7902
>>
>>
>>On Fri, Mar 14, 2003 at 04:17:59PM +0000, Terry Barnaby wrote:
>>
>>
>>>The Seagate ST336607LW has firmware: 0004.
>>>Seagate have stated to me that this is the latest.
>>>They have also stated to me:
>>>
>>> Issuing an unrecognized or illegal command to the drive can cause the
>>> drive to go into a hardware fault mode where it will no longer respond,
>>> and may or may not respond to a SCSI BUS reset. It seems, in this case,
>>> the drive will no longer respond to any commands issued by the
>>> controller.
>>>
>>>Is this "feature" now common on SCSI drives ????
>>
>>
>>Could we add a KERN_WARNING printk in sd.c quoting/referencing
>>this message on inquiry detecting this device? 
>>
>>So sysadmins who are used to SCSI being robust could return the
>>drive to their vendors in exchange to a drive working along the
>>SCSI specs after reading this message.
>>
>>Thanks in the name of the sysadmins.
>>
>>Regards
>>
>>Ingo Oeser
>>-
>>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>the body of a message to majordomo@vger.kernel.org
>>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>Please read the FAQ at  http://www.tux.org/lkml/
>>
> 
> 

-- 
Dr Terry Barnaby                     BEAM Ltd
Phone: +44 1454 324512               Northavon Business Center, Dean Rd
Fax:   +44 1454 313172               Yate, Bristol, BS37 5NH, UK
Email: terry@beam.ltd.uk             Web: www.beam.ltd.uk
BEAM for: Visually Impaired X-Terminals, Parallel Processing, Software
                       "Tandems are twice the fun !"


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Reproducible SCSI Error with Adaptec 7902 & ST336607LW
  2003-03-20 16:46 Cress, Andrew R
@ 2003-03-20 16:55 ` Terry Barnaby
  0 siblings, 0 replies; 4+ messages in thread
From: Terry Barnaby @ 2003-03-20 16:55 UTC (permalink / raw)
  To: Cress, Andrew R; +Cc: 'linux-kernel@vger.kernel.org'

Hi Andrew,

No I haven't, so I will try this.
I have downloaded the scsirastools and get the following mode page dump:

Device [3] is ready for modepage
Starting modepage process for 1 disk.
Getting mode sense pages from disk 3

Block Descriptor & Mode Pages:
04 45 dc cc 00 00 02 00    :Block Descriptor
01 0a c0 0b ff 00 00 00 05 00 ff ff
02 0e 80 80 00 0a 00 00 00 00 00 00 00 00 00 00
03 16 11 aa 00 00 00 08 00 00 02 d0 02 00 00 01 00 66 00 66 40 00 00 00
04 16 00 c2 bf 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 27 31 00 00
07 0a 00 0b ff 00 00 00 00 00 ff ff
08 12 14 00 ff ff 00 00 ff ff ff ff 00 20 00 00 00 00 00 00
0a 0a 02 00 00 00 00 00 00 00 03 00
19 06 01 00 ff ff 00 00
1a 0a 00 03 00 00 00 01 00 00 00 04
1c 0a 10 00 00 00 00 00 00 00 00 01
00 06 00 00 0f 00 00 00

I'm not up on SCSI mode pages, do I just have to change the "line"
1c 0a 10 00 00 00 00 00 00 00 00 01

to

1c 0a 88 00 00 00 00 00 00 00 00 01

??

Cheers

Terry

Cress, Andrew R wrote:
> Terry,
> 
> Did you try changing the mode pages on the Seagate ST336607LW to turn off
> SMART?
> The symptoms indicate to me that SMART might be an issue.
> 
> Andy
> 
> -----Original Message-----
> From: Terry Barnaby [mailto:terry@beam.ltd.uk] 
> Sent: Thursday, March 20, 2003 5:07 AM
> To: linux-kernel@vger.kernel.org
> Cc: Justin T. Gibbs; mmadore@aslab.com
> Subject: Re: Reproducible SCSI Error with Adaptec 7902
> 
> 
> Hi,
> 
> We have continued to try and get to the bottom of the problem we have
> with the Seagate ST336607LW drive with an Adaptec 7902 SCSI controller
> under Linux on an SMP machine. We have recently tried the latest
> Adaptec Linux driver (1.3.4) from Justin Gibbs who is one of the Adaptec
> SCSI driver developers. This has stopped the drive locking up but now
> lists SCSI errors in the log files. I enclose a portion of this log
> file. I have run the error logs past Justin and he has stated:
> 
> "The drive has unexpectedly dropped off the bus during a connection.
> Without a SCSI bus trace it is impossible to know why the drive might
> have done this or if perhaps a glitch on the BSY line is causing the
> controller to detect a spurious busfree."
> 
> My current conclusions are:
> 
> 1. The Seagate ST336607LW drive has a bug where in certain circumstances
> 	the drive can lock up, with LED on. In this state it will not
> 	respond to a hardware reset and a power off/on cycle is needed to
> 	reset the drive. There is a difference between the way the Linux
> 	Adaptec AIC79XX 1.1.0 driver and the 1.3.4 driver handles a SCSI
> 	error condition that triggers this behaviour.
> 
> 2. There is a problem with one of the following: The Seagate ST336607LW
> drive,
> 	the Adaptec 7902 SCSI controller on the SuperMicro X5DA8 Motherboard
> or
> 	the Linux AIC79XX driver that causes a SCSI bus fault.
> 
> I am now giving up with Seagate ST336607LW drive and intend to try a
> Maxtor Atlas 10K IV drive instead.
> I include this information to hopefully assist others who may encounter this
> problem and to list the bugs so that those who are in a position to fix them
> know about it.
> 
> Terry
> 
> [...snip...]
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

-- 
Dr Terry Barnaby                     BEAM Ltd
Phone: +44 1454 324512               Northavon Business Center, Dean Rd
Fax:   +44 1454 313172               Yate, Bristol, BS37 5NH, UK
Email: terry@beam.ltd.uk             Web: www.beam.ltd.uk
BEAM for: Visually Impaired X-Terminals, Parallel Processing, Software
                       "Tandems are twice the fun !"


^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: Reproducible SCSI Error with Adaptec 7902 & ST336607LW
@ 2003-03-20 16:46 Cress, Andrew R
  2003-03-20 16:55 ` Terry Barnaby
  0 siblings, 1 reply; 4+ messages in thread
From: Cress, Andrew R @ 2003-03-20 16:46 UTC (permalink / raw)
  To: 'Terry Barnaby'; +Cc: 'linux-kernel@vger.kernel.org'

Terry,

Did you try changing the mode pages on the Seagate ST336607LW to turn off
SMART?
The symptoms indicate to me that SMART might be an issue.

Andy

-----Original Message-----
From: Terry Barnaby [mailto:terry@beam.ltd.uk] 
Sent: Thursday, March 20, 2003 5:07 AM
To: linux-kernel@vger.kernel.org
Cc: Justin T. Gibbs; mmadore@aslab.com
Subject: Re: Reproducible SCSI Error with Adaptec 7902


Hi,

We have continued to try and get to the bottom of the problem we have
with the Seagate ST336607LW drive with an Adaptec 7902 SCSI controller
under Linux on an SMP machine. We have recently tried the latest
Adaptec Linux driver (1.3.4) from Justin Gibbs who is one of the Adaptec
SCSI driver developers. This has stopped the drive locking up but now
lists SCSI errors in the log files. I enclose a portion of this log
file. I have run the error logs past Justin and he has stated:

"The drive has unexpectedly dropped off the bus during a connection.
Without a SCSI bus trace it is impossible to know why the drive might
have done this or if perhaps a glitch on the BSY line is causing the
controller to detect a spurious busfree."

My current conclusions are:

1. The Seagate ST336607LW drive has a bug where in certain circumstances
	the drive can lock up, with LED on. In this state it will not
	respond to a hardware reset and a power off/on cycle is needed to
	reset the drive. There is a difference between the way the Linux
	Adaptec AIC79XX 1.1.0 driver and the 1.3.4 driver handles a SCSI
	error condition that triggers this behaviour.

2. There is a problem with one of the following: The Seagate ST336607LW
drive,
	the Adaptec 7902 SCSI controller on the SuperMicro X5DA8 Motherboard
or
	the Linux AIC79XX driver that causes a SCSI bus fault.

I am now giving up with Seagate ST336607LW drive and intend to try a
Maxtor Atlas 10K IV drive instead.
I include this information to hopefully assist others who may encounter this
problem and to list the bugs so that those who are in a position to fix them
know about it.

Terry

[...snip...]

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2003-03-20 16:44 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-03-18 15:58 Reproducible SCSI Error with Adaptec 7902 & ST336607LW Cress, Andrew R
2003-03-18 16:26 ` Terry Barnaby
2003-03-20 16:46 Cress, Andrew R
2003-03-20 16:55 ` Terry Barnaby

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).