linux-kernel.vger.kernel.org archive mirror
* Patches for SCSI timeout bug
@ 2003-06-04 21:34 linas
  2003-06-04 21:44 ` J.A. Magallon
  2003-06-06 18:41 ` Anton Blanchard
  0 siblings, 2 replies; 6+ messages in thread
From: linas @ 2003-06-04 21:34 UTC (permalink / raw)
  To: lnz, mike, eric, linux-kernel, olh, groudier, axboe, acme; +Cc: linas



Hi,

I've got a SCSI timeout bug in kernels 2.4 and 2.5, and several 
different patches (appended) that fix it.  I'm not sure which way 
of fixing it is best.

Hardware:
IDE DVD/CDROM connected to ACHIP ARC765 based SCSI-to-IDE converter,
attached to symbios controller using sym53c8xx driver.

SYMPTOMS:
When booting, system hangs because the initial SCSI bus scan times
out when it gets to this device, causing a command abort, which
times out, and thence (in kernel 2.4) into an infinite loop of 
resets and timeouts.  In kernel 2.5, it's not an infinite loop;
only two resets, but the device is never found.

ROOT CAUSE:
During boot, the sym53c8xx driver performs a SCSI bus reset.
The Achip takes about 15 seconds after a bus reset before it
is willing to reply to SCSI commands.  However, in the current
code for the initial bus scan, a device is given 6 seconds
before scsi target aborts, resets, etc. come raining down.

I've got some lengthy SCSI bus traces if anyone cares to look.

MORE DETAILS:
During boot, the sym53c8xx driver for the SCSI controller
performs a SCSI bus reset. (Other drivers may or may not 
perform this reset; some are configurable).  After the
reset, it waits 2 seconds before starting a bus scan.
(Some other drivers wait 5, others 10; others maybe 
more or less). During the bus scan, generic (common
among all drivers) SCSI code gives each device 6 seconds 
to respond.  If the device doesn't respond,  the code
launches into a sequence of target aborts, bus resets, 
etc. in an attempt to recover.

If the DVD/CDROM is scanned early in the bus scan, then 
it will not have had time to finish resetting itself before
it's scanned, and it won't respond fast enough, leading to the 
bad behaviour.  If the machine has lots of disks, then the
CDROM is scanned later, giving it enough time & then everything 
is fine.
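
As a rough illustration of the timing described above, here is a minimal
sketch (a toy model, not kernel code). The struct, the function name and the
one-second-per-device scan cost are assumptions made for the example; only
the 2, 6 and 15 second figures come from the text.

```c
#include <stdbool.h>

/* Hypothetical model: all times are in seconds, measured from the bus reset. */
struct scan_model {
    int settle_s;      /* driver's post-reset delay before the scan starts */
    int per_device_s;  /* assumed cost of scanning one promptly answering device */
    int timeout_s;     /* how long the generic scan code waits for a reply */
    int recovery_s;    /* how long the slow device needs after a reset */
};

/*
 * Returns true if the slow device becomes ready before the scan gives up
 * on it, given how many other devices are scanned ahead of it.
 */
static bool slow_device_found(const struct scan_model *m, int devices_ahead)
{
    int probe_start = m->settle_s + devices_ahead * m->per_device_s;
    int give_up_at = probe_start + m->timeout_s;

    return m->recovery_s <= give_up_at;
}
```

With settle_s = 2, timeout_s = 6 and recovery_s = 15, a device scanned first
is abandoned at 8 seconds, well before it is ready; under the assumed
one-second scan cost it is only found once seven devices are ahead of it.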

FIXES:
There are several ways to fix this: 
1) By increasing the generic SCSI bus scan timeout to be 
   longer than 15 seconds (as well as the timeout for a 
   bus reset to be longer than this).

2) By increasing the sym53c8xx post-reset delay to at least
   12 seconds.

Fix 2) may not be bad: I have at least one scsi hard drive which 
takes 5 seconds to recover from a bus reset.   On the other hand,
fix 2) makes the boot process longer: it introduces a delay of 
N x 12 seconds, where N is the number of scsi channels.
(Most cards have two channels; some server-class machines with 
many cards may have a significantly longer boot).

Fix 1) does not introduce any delay at all, if the SCSI
devices respond quickly.  Fix 1) also will stop the problem
from recurring if/when this CDROM is attached to something 
other than a sym53c8xx.
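
The trade-off between the two fixes can be sketched with a little arithmetic.
This is only an illustration under simplifying assumptions; the function
names are hypothetical and neither function exists in the kernel.

```c
/*
 * Fix 2: every channel sits out the full settle time after its reset,
 * whether or not any device on it is slow.
 */
static int fix2_settle_total_s(int channels, int settle_s)
{
    return channels * settle_s;
}

/*
 * Fix 1: the scan only waits for a device that is still recovering, and
 * only until it answers or the (longer) timeout expires.
 * probe_start_s is when the device is first probed, in seconds after reset.
 */
static int fix1_wait_s(int probe_start_s, int recovery_s, int timeout_s)
{
    int wait = recovery_s - probe_start_s;

    if (wait < 0)
        wait = 0;           /* device was already ready: no added delay */
    if (wait > timeout_s)
        wait = timeout_s;   /* device never answered: give up at the timeout */
    return wait;
}
```

For two channels and a 12 second settle time, fix 2 always costs 24 seconds.
Under fix 1, a device that is ready when probed costs nothing, and a device
needing 15 seconds that is probed 2 seconds after the reset costs 13.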

I like fix 1) better, but I'm not a Linux SCSI guy, so I don't
really know & can't make this choice ....  Below are some
patches for kernel 2.4; they are almost identical for kernel 2.5.

--linas


PATCHES for 'fix 1':  (note these also fix a compile-time warning
in this code):


Index: scsi_scan.c
===================================================================
RCS file: /cvs/linuxppc64/linuxppc64_2_4/drivers/scsi/scsi_scan.c,v
retrieving revision 1.19
diff -u -r1.19 scsi_scan.c
--- scsi_scan.c 8 Jan 2003 18:47:06 -0000       1.19
+++ scsi_scan.c 29 May 2003 23:02:29 -0000
@@ -576,9 +576,10 @@
        SRpnt->sr_cmd_len = 0;
        SRpnt->sr_data_direction = SCSI_DATA_READ;
  
+       /* Some AChip ARC765 devices take 15 seconds to recover from bus reset */
        scsi_wait_req (SRpnt, (void *) scsi_cmd,
                  (void *) scsi_result,
-                 256, SCSI_TIMEOUT+4*HZ, 3);
+                 256, SCSI_TIMEOUT+15*HZ, 3);
  
        SCSI_LOG_SCAN_BUS(3, printk("scsi: INQUIRY %s with code 0x%x\n",
                SRpnt->sr_result ? "failed" : "successful", SRpnt->sr_result));



Index: scsi_obsolete.c
===================================================================
RCS file: /cvs/linuxppc64/linuxppc64_2_4/drivers/scsi/scsi_obsolete.c,v
retrieving revision 1.4
diff -u -r1.4 scsi_obsolete.c
--- scsi_obsolete.c     22 Apr 2002 15:33:14 -0000      1.4
+++ scsi_obsolete.c     29 May 2003 23:02:29 -0000
@@ -106,21 +106,15 @@
 static void scsi_dump_status(void);
 #endif
  
-
-#ifdef DEBUG
-#define SCSI_TIMEOUT (5*HZ)
-#else
-#define SCSI_TIMEOUT (2*HZ)
-#endif
-
+/* same timeouts as scsi_error.c */
 #ifdef DEBUG
 #define SENSE_TIMEOUT SCSI_TIMEOUT
 #define ABORT_TIMEOUT SCSI_TIMEOUT
 #define RESET_TIMEOUT SCSI_TIMEOUT
 #else
-#define SENSE_TIMEOUT (5*HZ/10)
-#define RESET_TIMEOUT (5*HZ/10)
-#define ABORT_TIMEOUT (5*HZ/10)
+#define SENSE_TIMEOUT (10*HZ)
+#define RESET_TIMEOUT (2*HZ)
+#define ABORT_TIMEOUT (15*HZ)
 #endif
  
  

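The timeouts in these patches are expressed in ticks, where HZ is the number
of ticks per second. The sketch below shows how the old and new INQUIRY
timeouts work out in seconds. The HZ value of 100 and the OLD_/NEW_ macro
names are assumptions for illustration; SCSI_TIMEOUT of 2*HZ is inferred from
the 6 second figure quoted above and the removed non-DEBUG define.

```c
/* Assumed tick rate, for illustration only; the kernel's HZ depends on
 * architecture and configuration. */
#define HZ 100

/* Inferred: 6 seconds in total minus the 4*HZ term leaves 2 seconds. */
#define SCSI_TIMEOUT (2 * HZ)

/* Hypothetical names for the two values passed to scsi_wait_req(). */
#define OLD_INQUIRY_TIMEOUT (SCSI_TIMEOUT + 4 * HZ)
#define NEW_INQUIRY_TIMEOUT (SCSI_TIMEOUT + 15 * HZ)

/* Convert a timeout expressed in ticks back to whole seconds. */
static int ticks_to_seconds(int ticks)
{
    return ticks / HZ;
}
```

Under these assumptions the old timeout is 6 seconds and the patched one is
17 seconds, which leaves a couple of seconds of margin over the 15 seconds
the device needs.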

PATCH for 'fix 2'

Index: sym53c8xx_defs.h
===================================================================
RCS file: /cvs/linuxppc64/linuxppc64_2_4/drivers/scsi/sym53c8xx_defs.h,v
retrieving revision 1.8
diff -u -r1.8 sym53c8xx_defs.h
--- sym53c8xx_defs.h    22 Apr 2002 15:33:14 -0000      1.8
+++ sym53c8xx_defs.h    4 Jun 2003 21:24:49 -0000
@@ -269,7 +269,7 @@
 /*
  * Settle time after reset at boot-up
  */
-#define SCSI_NCR_SETUP_SETTLE_TIME     (2)
+#define SCSI_NCR_SETUP_SETTLE_TIME     (15)
  
 /*
 **     Bridge quirks work-around option defaulted to 1.



* Re: Patches for SCSI timeout bug
  2003-06-04 21:34 Patches for SCSI timeout bug linas
@ 2003-06-04 21:44 ` J.A. Magallon
  2003-06-04 22:20   ` linas
  2003-06-05 16:58   ` linas
  2003-06-06 18:41 ` Anton Blanchard
  1 sibling, 2 replies; 6+ messages in thread
From: J.A. Magallon @ 2003-06-04 21:44 UTC (permalink / raw)
  To: linas; +Cc: lnz, mike, eric, linux-kernel, olh, groudier, axboe, acme, linas


On 06.04, linas@austin.ibm.com wrote:
> 
> 
> Hi,
> 
> I've got a SCSI timeout bug in kernels 2.4 and 2.5, and several 
> different patches (appended) that fix it.  I'm not sure which way 
> of fixing it is best.
> 
[...]

Can you try with this:

--- linux-2.4.18-18mdk/drivers/scsi/scsi_error.c.scsi-eh-timeout	Thu May 30 16:22:37 2002
+++ linux-2.4.18-18mdk/drivers/scsi/scsi_error.c	Sun Jun  9 19:18:11 2002
@@ -1103,6 +1103,8 @@
  */
 STATIC int scsi_eh_completed_normally(Scsi_Cmnd * SCpnt)
 {
+	int rtn;
+
 	/*
 	 * First check the host byte, to see if there is anything in there
 	 * that would indicate what we need to do.
@@ -1116,14 +1118,18 @@
 			 * otherwise we just flag it as success.
 			 */
 			SCpnt->flags &= ~IS_RESETTING;
-			return NEEDS_RETRY;
+			goto maybe_retry;
 		}
 		/*
 		 * Rats.  We are already in the error handler, so we now get to try
 		 * and figure out what to do next.  If the sense is valid, we have
 		 * a pretty good idea of what to do.  If not, we mark it as failed.
 		 */
-		return scsi_check_sense(SCpnt);
+		rtn = scsi_check_sense(SCpnt);
+		if (rtn == NEEDS_RETRY) {
+			goto maybe_retry;
+		}
+		return rtn;
 	}
 	if (host_byte(SCpnt->result) != DID_OK) {
 		return FAILED;
@@ -1142,7 +1148,11 @@
 	case COMMAND_TERMINATED:
 		return SUCCESS;
 	case CHECK_CONDITION:
-		return scsi_check_sense(SCpnt);
+		rtn = scsi_check_sense(SCpnt);
+		if (rtn == NEEDS_RETRY) {
+			goto maybe_retry;
+		}
+		return rtn;
 	case CONDITION_GOOD:
 	case INTERMEDIATE_GOOD:
 	case INTERMEDIATE_C_GOOD:
@@ -1157,6 +1167,17 @@
 		return FAILED;
 	}
 	return FAILED;
+
+      maybe_retry:
+
+	if ((++SCpnt->retries) < SCpnt->allowed) {
+		return NEEDS_RETRY;
+	} else {
+                /*
+                 * No more retries - report this one back to upper level.
+                 */
+		return SUCCESS;
+	}
 }
 
 /*

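The maybe_retry label in the patch above bounds the number of retries. A
stand-alone sketch of that pattern follows; the enum, struct and function
names here are hypothetical stand-ins, not the kernel's Scsi_Cmnd or its
return codes.

```c
/* Hypothetical stand-ins for the error handler's return codes. */
enum eh_result { EH_SUCCESS, EH_NEEDS_RETRY };

/* Hypothetical stand-in for the retry bookkeeping in a command. */
struct cmd_retry {
    int retries;  /* retries attempted so far */
    int allowed;  /* maximum number of attempts permitted */
};

/*
 * Ask for another retry while the budget lasts; once it is used up,
 * report completion so the result goes back to the upper level.
 */
static enum eh_result maybe_retry(struct cmd_retry *c)
{
    if (++c->retries < c->allowed)
        return EH_NEEDS_RETRY;
    return EH_SUCCESS;
}
```

With allowed set to 3 and retries starting at 0, the first two calls ask for
a retry and the third reports completion.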

-- 
J.A. Magallon <jamagallon@able.es>      \                 Software is like sex:
werewolf.able.es                         \           It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.21-rc7-jam1 (gcc 3.3 (Mandrake Linux 9.2 3.3-1mdk))


* Re: Patches for SCSI timeout bug
  2003-06-04 21:44 ` J.A. Magallon
@ 2003-06-04 22:20   ` linas
  2003-06-05 16:58   ` linas
  1 sibling, 0 replies; 6+ messages in thread
From: linas @ 2003-06-04 22:20 UTC (permalink / raw)
  To: J.A. Magallon
  Cc: lnz, mike, eric, linux-kernel, olh, groudier, axboe, acme, linas

Hi,


On Wed, Jun 04, 2003 at 11:44:43PM +0200, J.A. Magallon wrote:
> 
> On 06.04, linas@austin.ibm.com wrote:
> > 
> > Hi,
> > 
> > I've got a SCSI timeout bug in kernels 2.4 and 2.5, and several 
> > different patches (appended) that fix it.  I'm not sure which way 
> > of fixing it is best.
> 
> Can you try with this:
> 
> --- linux-2.4.18-18mdk/drivers/scsi/scsi_error.c.scsi-eh-timeout	Thu May 30 16:22:37 2002
> +++ linux-2.4.18-18mdk/drivers/scsi/scsi_error.c	Sun Jun  9 19:18:11 2002
> @@ -1103,6 +1103,8 @@


Sorry, no, it's not enough.
Here's the boot log; I'm going to turn on more debug messages in 
next email.

--linas

SCSI subsystem driver Revision: 1.00
PCI: Enabling device 00:0c.0 (0140 -> 0143)
sym53c8xx: at PCI bus 0, device 12, function 0
sym53c8xx: setting PCI_COMMAND_MASTER...(fix-up)
sym53c8xx: 53c875 detected
PCI: Enabling device 00:11.0 (0140 -> 0143)
sym53c8xx: at PCI bus 0, device 17, function 0
sym53c8xx: setting PCI_COMMAND_MASTER...(fix-up)
sym53c8xx: 53c895 detected
sym53c875-0: rev 0x4 on pci bus 0 device 12 function 0 irq 17
sym53c875-0: ID 7, Fast-20, Parity Checking
sym53c875-0: resetting, command processing suspended for 2 seconds
sym53c895-1: rev 0x1 on pci bus 0 device 17 function 0 irq 20
sym53c895-1: ID 7, Fast-40, Parity Checking
sym53c895-1: resetting, command processing suspended for 2 seconds
scsi0 : sym53c8xx-1.7.3c-20010512
scsi1 : sym53c8xx-1.7.3c-20010512
scsi : aborting command due to timeout : pid 8, scsi0, channel 0, id 4, lun 0 I
sym53c8xx_abort: pid=8 serial_number=9 serial_number_at_timeout=9
SCSI host 0 abort (pid 8) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=8 reset_flags=2 serial_number=9 serial_number_at_timeout=9
sym53c875-0: resetting, command processing suspended for 2 seconds
SCSI host 0 abort (pid 9) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=9 reset_flags=2 serial_number=10 serial_number_at_timeout=0
sym53c875-0: resetting, command processing suspended for 2 seconds
SCSI host 0 abort (pid 10) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=10 reset_flags=2 serial_number=11 serial_number_at_timeout1
sym53c875-0: resetting, command processing suspended for 2 seconds
SCSI host 0 abort (pid 11) timed out - resetting
SCSI bus is being reset for host 0 channel 0.






* Re: Patches for SCSI timeout bug
  2003-06-04 21:44 ` J.A. Magallon
  2003-06-04 22:20   ` linas
@ 2003-06-05 16:58   ` linas
  1 sibling, 0 replies; 6+ messages in thread
From: linas @ 2003-06-05 16:58 UTC (permalink / raw)
  To: J.A. Magallon; +Cc: linux-kernel, olh, groudier, axboe, acme, linas, linux-scsi

On Wed, Jun 04, 2003 at 11:44:43PM +0200, J.A. Magallon wrote:
> 
> On 06.04, linas@austin.ibm.com wrote:
> > 
> > I've got a SCSI timeout bug in kernels 2.4 and 2.5, and several 
> > different patches (appended) that fix it.  I'm not sure which way 
> > of fixing it is best.
> > 
> 
> Can you try with this:
> 
> --- linux-2.4.18-18mdk/drivers/scsi/scsi_error.c.scsi-eh-timeout	Thu May 30 16:22:37 2002

OK, some more details:
-- your patch doesn't affect operation of the 'old' symbios driver, since
   it doesn't use the 'new' eh code. 

-- I tried the new (version 2) symbios driver w/ your patch.  It does
   allow the machine to boot, but it disables the cdrom. This is bad,
   because it prevents booting from CDROM (e.g. for install).

   BTW, the v2 code doesn't hit your patches either; I put a printk
   in there and it never showed up ... 

To reiterate; my cdrom needs 15 seconds after a bus reset before it
will respond to a queued command.  The v2 symbios driver doesn't wait
that long-- it just keeps resetting again, too quickly, which does no 
good.
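
To illustrate why resetting again too quickly does no good, here is a toy
model (not driver code). It assumes each reset restarts the device's
recovery clock; the function name and the gaps array are hypothetical.

```c
/*
 * gaps_s[i] is how long the host waits after reset number i before giving
 * up and resetting again. Returns the time in seconds (measured from the
 * first reset) at which the device first answers, or -1 if every wait was
 * too short.
 *
 * Assumption: each reset restarts the device's recovery period.
 */
static int first_response_s(int recovery_s, const int *gaps_s, int n_gaps)
{
    int last_reset = 0;  /* time of the most recent reset */

    for (int i = 0; i < n_gaps; i++) {
        if (gaps_s[i] >= recovery_s)
            return last_reset + recovery_s;  /* device had time to recover */
        last_reset += gaps_s[i];             /* gave up early: reset again */
    }
    return -1;
}
```

With a 15 second recovery time, any number of 6 second waits never succeeds,
while a single 17 second wait is enough.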

Below are the boot msgs from the symbios v2 driver:
(my cdrom is at scsi id 4)

--linas


SCSI subsystem driver Revision: 1.00
PCI: Enabling device 00:0c.0 (0140 -> 0143)
sym.0.12.0: setting PCI_COMMAND_MASTER...
sym.0.12.0: setting PCI_COMMAND_INVALIDATE.
PCI: Enabling device 00:11.0 (0140 -> 0143)
sym.0.17.0: setting PCI_COMMAND_MASTER...
sym.0.17.0: setting PCI_COMMAND_INVALIDATE.
sym0: <875> rev 0x4 on pci bus 0 device 12 function 0 irq 17
sym0: No NVRAM, ID 7, Fast-20, SE, parity checking
sym0: SCSI BUS has been reset.
sym1: <895> rev 0x1 on pci bus 0 device 17 function 0 irq 20
sym1: No NVRAM, ID 7, Fast-40, LVD, parity checking
sym1: SCSI BUS has been reset.
scsi0 : sym-2.1.17a
scsi1 : sym-2.1.17a
sym0:0: FAST-20 WIDE SCSI 40.0 MB/s ST (50.0 ns, offset 15)
  Vendor: IBM       Model: DCHS04U           Rev: 2727
  Type:   Direct-Access                      ANSI SCSI revision: 02
sym0:4:0: ABORT operation started.
sym0:4:0: ABORT operation timed-out.
sym0:4:0: DEVICE RESET operation started.
sym0:4:0: DEVICE RESET operation complete.
sym0:4:control msgout: c.
sym0: TARGET 4 has been reset.
sym0:4:0: ABORT operation started.
sym0:4:0: ABORT operation complete.
sym0:4:0: BUS RESET operation started.
sym0:4:0: BUS RESET operation failed.
sym0:4:0: HOST RESET operation started.
sym0:4:0: HOST RESET operation failed.
scsi: device set offline - command error recover failed: host 0 channel 0 id 4 0
sym0:9: FAST-20 WIDE SCSI 40.0 MB/s ST (50.0 ns, offset 16)
  Vendor: IBM       Model: DNES-309170W      Rev: SAGU
  Type:   Direct-Access                      ANSI SCSI revision: 03
sym0:0:0: tagged command queuing enabled, command queue depth 16.
sym0:9:0: tagged command queuing enabled, command queue depth 16.
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
Attached scsi disk sdb at scsi0, channel 0, id 9, lun 0
SCSI device sda: 8813870 512-byte hdwr sectors (4513 MB)
Partition check:
 sda: sda2 sda3 sda4
SCSI device sdb: 17774160 512-byte hdwr sectors (9100 MB)
 sdb: sdb1 sdb2 sdb3



* Re: Patches for SCSI timeout bug
  2003-06-04 21:34 Patches for SCSI timeout bug linas
  2003-06-04 21:44 ` J.A. Magallon
@ 2003-06-06 18:41 ` Anton Blanchard
  1 sibling, 0 replies; 6+ messages in thread
From: Anton Blanchard @ 2003-06-06 18:41 UTC (permalink / raw)
  To: linas; +Cc: lnz, mike, eric, linux-kernel, olh, groudier, axboe, acme, linas

 
> 2) By increasing the sym53c8xx post-reset delay to at least
>    12 seconds.
> 
> Fix 2) may not be bad: I have at least one scsi hard drive which 
> takes 5 seconds to recover from a bus reset.   On the other hand,
> fix 2) makes the boot process longer: it introduces a delay of 
> N x 12 seconds, where N is the number of scsi channels.
> (Most cards have two channels; some server-class machines with 
> many cards may have a significantly longer boot).

Yep, I've got a box with 42 scsi controllers and the time to probe SCSI
is already unbearable :) 

So I like fix 1 as well.

Anton


* RE: Patches for SCSI timeout bug
@ 2003-06-09 19:31 Perez-Gonzalez, Inaky
  0 siblings, 0 replies; 6+ messages in thread
From: Perez-Gonzalez, Inaky @ 2003-06-09 19:31 UTC (permalink / raw)
  To: Anton Blanchard, linux-kernel


> From: Anton Blanchard [mailto:anton@samba.org]
> 
> > 2) By increasing the sym53c8xx post-reset delay to at least
> >    12 seconds.
> >
> > Fix 2) may not be bad: I have at least one scsi hard drive which
> > takes 5 seconds to recover from a bus reset.   On the other hand,
> > fix 2) makes the boot process longer: it introduces a delay of
> > N x 12 seconds, where N is the number of scsi channels.
> > (Most cards have two channels; some server-class machines with
> > many cards may have a significantly longer boot).
> 
> Yep, I've got a box with 42 scsi controllers and the time to probe SCSI
> is already unbearable :)

Woah, can I see a copy of your electricity bill? That must suck
amps...

Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own
(and my fault)

