2.6.0-test1 Adaptec aic7899 Ultra160 SCSI grief

linux-kernel.vger.kernel.org archive mirror

* 2.6.0-test1 Adaptec aic7899 Ultra160 SCSI grief
@ 2003-07-24 11:17 Tugrul Galatali
  2003-07-24 17:17 ` Justin T. Gibbs
  0 siblings, 1 reply; 4+ messages in thread
From: Tugrul Galatali @ 2003-07-24 11:17 UTC
  To: linux-kernel

	After months of using 2.5.x with stability on my box, and using 
2.6.0-test1 since the day after its release (with the 20030714 ACPI 
patch), I had two seemingly random SCSI hangs today. One shortly after 
I booted the box in the afternoon, and one after about 15 hours of 
uptime. I was busy the first time around, but the second time I managed 
to scp out a copy of the current dmesg to another box before a hard 
power down.

	Can somebody translate the error in the dmesg into English and advise
me on whether I should change something in the software or the
hardware?

http://acm.cs.nyu.edu/~tugrul/scsi/

	Thanks in advance,
		Tugrul Galatali



* Re: 2.6.0-test1 Adaptec aic7899 Ultra160 SCSI grief
  2003-07-24 11:17 2.6.0-test1 Adaptec aic7899 Ultra160 SCSI grief Tugrul Galatali
@ 2003-07-24 17:17 ` Justin T. Gibbs
  2003-07-25  1:02   ` Tugrul Galatali
  0 siblings, 1 reply; 4+ messages in thread
From: Justin T. Gibbs @ 2003-07-24 17:17 UTC
  To: Tugrul Galatali, linux-kernel

> 	After months of using 2.5.x with stability on my box, and using
> 2.6.0-test1 since the day after its release (with the 20030714 ACPI patch),
> I had two seemingly random SCSI hangs today. One shortly after I booted the
> box in the afternoon, and one after about 15 hours of uptime. I was busy the
> first time around, but the second time I managed to scp out a copy of the
> current dmesg to another box before a hard power down.
> 
> 	Can somebody translate the error in the dmesg into English and advise
> me on whether I should change something in the software or the hardware?

What the controller is saying is that the drive attempted to complete
a command it knew nothing about.  At the time of the failure, the only
command outstanding on the device had tag identifier 0x3c.  The drive
came back with a tag identifier of 0x20.  This looks like a drive
firmware bug, but a bug in the aic7xxx driver cannot be completely
ruled out without a SCSI bus trace of the failure.  All of the state in the
aic7xxx driver is consistent (disconnected cache matches the pending list)
which leads me to conclude that a drive firmware bug is more likely.  Why
would this happen now?  Most drive firmware bugs are load dependent.  They
often will only occur when two commands with just the right characteristics
overlap.  It may well be that a recent change in the 2.5/2.6 kernel has
caused a subtle change in I/O behavior that exposes this issue.
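
The mismatch described above can be illustrated with a minimal sketch. This
is not the aic7xxx driver's code; the class, its methods, and the READ(10)
command are hypothetical names chosen for illustration. Only the tag values
0x3c and 0x20 come from the report. The idea is that the initiator tracks
each outstanding command by tag, so a completion that names a tag it never
issued cannot be matched to anything.

```python
from dataclasses import dataclass, field


@dataclass
class PendingCommands:
    """Hypothetical model of one target's outstanding tagged commands."""

    by_tag: dict = field(default_factory=dict)

    def issue(self, tag: int, description: str) -> None:
        # The initiator records every command it sends under its tag.
        self.by_tag[tag] = description

    def complete(self, tag: int) -> str:
        # A reconnecting target names the command by tag. A tag that was
        # never issued cannot be matched to any pending command.
        if tag not in self.by_tag:
            outstanding = ", ".join(hex(t) for t in sorted(self.by_tag))
            raise LookupError(
                f"completion for unknown tag {tag:#x}; outstanding: {outstanding}"
            )
        return self.by_tag.pop(tag)


pending = PendingCommands()
pending.issue(0x3C, "READ(10)")  # assumed command type, for illustration only
try:
    pending.complete(0x20)  # the tag the drive reportedly came back with
except LookupError as err:
    message = str(err)
    print(message)
```

In this model the pending list is left untouched by the bad completion,
which loosely mirrors the observation that the driver's own state was still
consistent at the time of the failure.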

--
Justin


* Re: 2.6.0-test1 Adaptec aic7899 Ultra160 SCSI grief
  2003-07-24 17:17 ` Justin T. Gibbs
@ 2003-07-25  1:02   ` Tugrul Galatali
  0 siblings, 0 replies; 4+ messages in thread
From: Tugrul Galatali @ 2003-07-25  1:02 UTC
  To: Justin T. Gibbs; +Cc: linux-kernel

On Thu, 2003-07-24 at 13:17, Justin T. Gibbs wrote:
[snip snip] 
> came back with a tag identifier of 0x20.  This looks like a drive
> firmware bug, but a bug in the aic7xxx driver cannot be completely
> ruled out without a SCSI bus trace of the failure.  
[snip snip]

	SCSI bus trace = logging? I started poking around online for how that
works, and I found a repeatable case of what I hope is the same error (one
tar from the bad SCSI disk piping into another tar onto a good SCSI
disk). One problem I ran into is that scsi_logging=X as a kernel
parameter doesn't seem to work in 2.6.0-test1, so I put in an S00 init
script to run:

echo "scsi log all" > /proc/scsi/scsi

	The resulting /var/log/messages is ~18M, compressed down to 300k.

http://acm.cs.nyu.edu/~tugrul/scsi/messages.bz2

	Is this what you need?

	I did a quick test of the above case on a 2.4.21 kernel and it didn't
seem to trigger anything evil.

	If it turns out to be a firmware problem, is the firmware upgradeable
or do I have to buy a new drive, in which case is there a blacklist?

	Tugrul Galatali






* RE: 2.6.0-test1 Adaptec aic7899 Ultra160 SCSI grief
@ 2003-07-25 13:43 Cress, Andrew R
  0 siblings, 0 replies; 4+ messages in thread
From: Cress, Andrew R @ 2003-07-25 13:43 UTC
  To: 'Tugrul Galatali', Justin T. Gibbs; +Cc: linux-kernel

Tugrul,

If it is a firmware problem, the firmware is upgradable, but you have to get
the firmware from IBM rather than Seagate.  IBM has special firmware for
their ST (Seagate) OEM'd disks.

You can use the IBM utility (runs from a CD in DOS), or the sgdskfl utility
under Linux from scsirastools.sf.net.

But do verify the SCSI cabling/termination first.

Andy

-----Original Message-----
From: Tugrul Galatali [mailto:tugrul@galatali.com] 
Sent: Thursday, July 24, 2003 9:02 PM
To: Justin T. Gibbs
Cc: linux-kernel@vger.kernel.org
Subject: Re: 2.6.0-test1 Adaptec aic7899 Ultra160 SCSI grief

[... snip ...]

	If it turns out to be a firmware problem, is the firmware upgradeable
or do I have to buy a new drive, in which case is there a blacklist?

	Tugrul Galatali

