Sym53c8xx tape corruption squashed! (was: Re: SCSI Tape corruption - update)

Message ID Pine.GSO.4.21.0112282115310.277-100000@vervain.sonytel.be
State New, archived

Commit Message

Geert Uytterhoeven Dec. 28, 2001, 8:36 p.m. UTC
On Wed, 5 Dec 2001, Geert Uytterhoeven wrote:
> On Fri, 2 Nov 2001, [ISO-8859-1] Gérard Roudier wrote:
> > On Thu, 1 Nov 2001, Geert Uytterhoeven wrote:
> > As driver sym-2 is planned to replace sym53c8xx in the future, it would be
> > interesting to give it a try on your hardware. There are some source
> > available from ftp.tux.org, but I can provide you with a flat patch
> > against the stock kernel version you want. You may let me know.
> 
> I tried sym-2 (2.4.17-pre2) and it didn't show up the problem, which is good!
> 
> More news from the old driver:
> 
>     1.5c            OK
>     1.5d            OK
>     1.5e            page fault in interrupt handler 0xa53c0c68
>     1.5f            lock up
>     1.5pre-g1       lock up
>     1.5pre-g2       lock up
>     1.5pre-g3       corruption
>     1.5g            corruption
> 
> So it happened somewhere in between 1.5d and 1.5pre-g3. I'll see whether I can
> get any of the intermediates to run...

I got all intermediate versions to work.

The problem is introduced in 1.5pre-g2 by the following change:


This change causes the PCI latency timer to be changed from 0 to 80.
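
As a rough check of that number, the sketch below evaluates the old and new
conditions from the patch and the minimum latency timer the driver computes.
It is only an illustration: the helper names are made up for this example,
and burst_max = 6 for the 53c875 is an assumption inferred from the observed
value, since (1 << 6) + 6 + 10 = 80.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimum latency timer the driver asks for: burst length + 6, plus 10 as margin. */
static unsigned int min_latency_timer(unsigned int burst_max)
{
	assert(burst_max < 8);	/* keeps the result within an 8-bit register */
	return (1u << burst_max) + 6 + 10;
}

/* Condition before the change: fix up only when bit 2 of pci_fix_up is set. */
static bool fixup_wanted_old(unsigned int pci_fix_up, unsigned int burst_max)
{
	return (pci_fix_up & 4) && burst_max;
}

/* Condition after the change: a latency timer of zero is fixed up as well. */
static bool fixup_wanted_new(unsigned int pci_fix_up, unsigned int burst_max,
			     unsigned int latency_timer)
{
	return burst_max && (latency_timer == 0 || (pci_fix_up & 4));
}

/* Latency timer value that results from the new condition. */
static unsigned int latency_after_fixup(unsigned int pci_fix_up,
					unsigned int burst_max,
					unsigned int latency_timer)
{
	unsigned int lt = min_latency_timer(burst_max);

	if (fixup_wanted_new(pci_fix_up, burst_max, latency_timer) &&
	    latency_timer < lt)
		return lt;
	return latency_timer;
}
```

With any pci_fix_up value that has bit 2 clear, the old condition leaves a
zero latency timer alone, while the new one raises it to the computed minimum.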

The sym-2 driver has a define for modifying the PCI latency timer
(SYM_SETUP_PCI_FIX_UP), but it is never used, so I see no corruption.

Is this a hardware bug in my SCSI host adapter (53c875 rev 04) or my host
bridge (VLSI VAS96011/12 Golden Gate II for PPC), or a software bug in the
driver (wrong burst_max)?

To recapitulate, the bug causes error bursts that are (almost always) 32 bytes
long. The incorrect bytes are always a copy of previous data, at a fixed offset
(10 KiB on my (now dead) DDS-1 tape drive, 32 KiB on my Plexwriter).
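
A minimal sketch of how such a pattern could be recognised when comparing the
data that was written with the data read back. It assumes both buffers are in
memory and that the offset (10 KiB or 32 KiB in the cases above) is already
known; count_stale_bytes is a hypothetical helper, not part of any tool used
in this thread.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the bytes in got[] that differ from want[] but match the byte
 * `offset` positions earlier in want[], i.e. that look like a stale copy
 * of previous data.
 */
static size_t count_stale_bytes(const unsigned char *want,
				const unsigned char *got,
				size_t len, size_t offset)
{
	size_t i, n = 0;

	assert(offset > 0);
	for (i = offset; i < len; i++)
		if (got[i] != want[i] && got[i] == want[i - offset])
			n++;
	return n;
}
```

A count close to a multiple of 32 at the expected offset would be consistent
with the bursts described above. Bytes that happen to equal the earlier data
by coincidence are not counted, so the result is a lower bound.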

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Comments

Gérard Roudier Dec. 29, 2001, 12:57 a.m. UTC | #1
On Fri, 28 Dec 2001, Geert Uytterhoeven wrote:

> On Wed, 5 Dec 2001, Geert Uytterhoeven wrote:
> > On Fri, 2 Nov 2001, [ISO-8859-1] Gérard Roudier wrote:
> > > On Thu, 1 Nov 2001, Geert Uytterhoeven wrote:
> > > As driver sym-2 is planned to replace sym53c8xx in the future, it would be
> > > interesting to give it a try on your hardware. There are some source
> > > available from ftp.tux.org, but I can provide you with a flat patch
> > > against the stock kernel version you want. You may let me know.
> >
> > I tried sym-2 (2.4.17-pre2) and it didn't show up the problem, which is good!
> >
> > More news from the old driver:
> >
> >     1.5c            OK
> >     1.5d            OK
> >     1.5e            page fault in interrupt handler 0xa53c0c68
> >     1.5f            lock up
> >     1.5pre-g1       lock up
> >     1.5pre-g2       lock up
> >     1.5pre-g3       corruption
> >     1.5g            corruption
> >
> > So it happened somewhere in between 1.5d and 1.5pre-g3. I'll see whether I can
> > get any of the intermediates to run...
>
> I made all intermediate versions to work.
>
> The problem is introduced in 1.5pre-g2 by the following change:
>
> diff -urN callisto-1.5g-pre2a/sym53c8xx.c callisto-1.5g-pre2+/sym53c8xx.c
> --- callisto-1.5g-pre2a/sym53c8xx.c	Fri Dec 28 21:12:30 2001
> +++ callisto-1.5g-pre2+/sym53c8xx.c	Fri Dec 28 20:11:10 2001
> @@ -11981,7 +11981,7 @@
>  	**    (latency timer >= burst length + 6, we add 10 to be quite sure)
>  	*/
>
> -	if ((pci_fix_up & 4) && chip->burst_max) {
> +	if (chip->burst_max && (latency_timer == 0 || (pci_fix_up & 4))) {
>  		uchar lt = (1 << chip->burst_max) + 6 + 10;
>  		if (latency_timer < lt) {
>  			printk(NAME53C8XX
>
> This change causes the PCI latency timer to be changed from 0 to 80.
>
> The sym-2 driver has a define for modifying the PCI latency timer
> (SYM_SETUP_PCI_FIX_UP), but it is never used, so I see no corruption.

By default sym-2 uses value 3 for pci_fix_up (cache line size + memory
write and invalidate). The latency timer fix-up has been removed, since it
is rather up to the generic PCI driver to tune latency timers.

> Is this a hardware bug in my SCSI host adapter (53c875 rev 04) or my host
> bridge (VLSI VAS96011/12 Golden Gate II for PPC), or a software bug in the
> driver (wrong burst_max)?

Great bug hunting!

It is almost certainly not a software bug in the driver. No latency timer
value should give any trouble if the hardware were flawless. Only the PCI
performance could be affected.

Anyway, value 0 looks way stupid for devices capable of bursting more than
1 data phase, thus the improvement above. :)

> To recapitulate, the bug causes error bursts of (almost always) 32 bytes long.
> The incorrect bytes are always a copy of previous data, at a fixed offset (10
> kiB on my (now dead) DDS-1 tape drive, 32 kiB on my Plexwriter).

Unfortunately, I don't have the errata listing for the 53c875 rev 4. I have
the DEL for 875 rev. 3 and for 876 rev. 5.

If we assume that rev 4 has no more bugs than rev 3, then you may try to
disable MEMORY WRITE and INVALIDATE (and not tell the driver to fix this
up) but allow the driver to fix the bogus zero latency timer. The 875 rev
3 may, under certain conditions, execute unaligned PCI MEMORY WRITE and
INVALIDATE transactions. Note that this could explain data corruption
occurring for SCSI READ commands, but not for WRITE commands. No other item
can explain, on paper, data corruption of the form you describe due to 875
chip misbehaviour.

Btw, a latency timer of zero should not change the likelihood of this item.
This makes me think that the host bridge is likely to be the culprit.

> Gr{oetje,eeting}s,

Gr{oudier,eat bug hunting, indeed}. :)

Geert Uytterhoeven Dec. 29, 2001, 10:49 a.m. UTC | #2
On Sat, 29 Dec 2001, [ISO-8859-1] Gérard Roudier wrote:
> On Fri, 28 Dec 2001, Geert Uytterhoeven wrote:
> > The problem is introduced in 1.5pre-g2 by the following change:

  [...]

> > This change causes the PCI latency timer to be changed from 0 to 80.
> >
> > The sym-2 driver has a define for modifying the PCI latency timer
> > (SYM_SETUP_PCI_FIX_UP), but it is never used, so I see no corruption.
> 
> By default sym-2 use value 3 for the pci_fix_up (cache line size + memory
> write and invalidate). The latency timer fix-up has been removed, since it
> is rather up to the generic PCI driver to tune latency timers.
> 
> > Is this a hardware bug in my SCSI host adapter (53c875 rev 04) or my host
> > bridge (VLSI VAS96011/12 Golden Gate II for PPC), or a software bug in the
> > driver (wrong burst_max)?
> 
> Great bug hunting!
> 
> It is about certainly not a software bug in the driver. Any latency timer
> value should not give any trouble if hardware was flawless. Just the PCI
> performances could be affected.
> 
> Anyway, value 0 looks way stupid for devices capable of bursting more than
> 1 data phase, thus the improvement above. :)

OK.

> > To recapitulate, the bug causes error bursts of (almost always) 32 bytes long.
> > The incorrect bytes are always a copy of previous data, at a fixed offset (10
> > kiB on my (now dead) DDS-1 tape drive, 32 kiB on my Plexwriter).
> 
> Unfortunately, I haven't the errata listing for the 53c875 rev 4. I have
> the DEL for 875 rev. 3 and for 876 rev. 5.

And I'm afraid I won't be able to get errata for the VLSI VAS96011/12 :-( Of
course I can always give it a try...

> If we assume that rev 4 hasn't more bugs than rev 3, then you may try to
> disable MEMORY WRITE and INVALIDATE (and not tell the driver to fix this
> up) but allow the driver to fix the bogus zero latency timer. The 875 rev
> 3 may, under certain conditions, execute unaligned PCI MEMORY WRITE and
> INVALIDATE transactions. Note that this may explain data corruptions
> occurring for SCSI READ commands not WRITE commands. No other items can
> explain, on paper, data corruptions of the form you describe due to 875
> chip misbehaviour.

I'll give that a try...

> Btw, latency timer zero should not change the likelihood of this item.
> This let me think that the host bridge is likely to be the culprit.

Hmmm... I'm still wondering why I see the problem when writing to tape or
CD-R(W), while I can't provoke it when writing to disk (Quantum Viking II U2W).

What's so special about tape and CD-R?

Gr{oetje,eeting}s,

						Geert


Geert Uytterhoeven Dec. 29, 2001, 1:23 p.m. UTC | #3
On Sat, 29 Dec 2001, [ISO-8859-1] Gérard Roudier wrote:
> On Fri, 28 Dec 2001, Geert Uytterhoeven wrote:
> > The sym-2 driver has a define for modifying the PCI latency timer
> > (SYM_SETUP_PCI_FIX_UP), but it is never used, so I see no corruption.
> 
> By default sym-2 use value 3 for the pci_fix_up (cache line size + memory
> write and invalidate). The latency timer fix-up has been removed, since it
> is rather up to the generic PCI driver to tune latency timers.
> 
> > Is this a hardware bug in my SCSI host adapter (53c875 rev 04) or my host
> > bridge (VLSI VAS96011/12 Golden Gate II for PPC), or a software bug in the
> > driver (wrong burst_max)?
> 
> Great bug hunting!
> 
> It is about certainly not a software bug in the driver. Any latency timer
> value should not give any trouble if hardware was flawless. Just the PCI
> performances could be affected.

I played a bit with sym-2 and setpci. Everything goes fine as long as the PCI
latency timer value is smaller than 0x16 (yes, at first I thought it was
decimal, but setpci parameters are in hex).
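
To make the hex/decimal pitfall concrete, here is a small sketch that parses a
value the way it is entered for setpci (hexadecimal) and compares it with the
threshold reported above. The names are illustrative only, and the threshold
is just what was observed on this one machine, not a general limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* First latency timer value at which the corruption showed up in this test. */
#define FIRST_BAD_LATENCY 0x16

/* Interpret a register value as hexadecimal, the way it is given to setpci. */
static unsigned long parse_setpci_value(const char *s)
{
	assert(s != NULL);
	return strtoul(s, NULL, 16);
}

/* True if the value is in the range that worked without corruption here. */
static bool latency_seen_ok(unsigned long latency_timer)
{
	return latency_timer < FIRST_BAD_LATENCY;
}
```

Entering "16" therefore means 22 PCI clocks, not 16, which is why the limit
is easy to misread as decimal.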

Gr{oetje,eeting}s,

						Geert


Gérard Roudier Dec. 29, 2001, 6:39 p.m. UTC | #4
On Sat, 29 Dec 2001, Geert Uytterhoeven wrote:

> On Sat, 29 Dec 2001, [ISO-8859-1] Gérard Roudier wrote:
> > On Fri, 28 Dec 2001, Geert Uytterhoeven wrote:
> > > The sym-2 driver has a define for modifying the PCI latency timer
> > > (SYM_SETUP_PCI_FIX_UP), but it is never used, so I see no corruption.
> >
> > By default sym-2 use value 3 for the pci_fix_up (cache line size + memory
> > write and invalidate). The latency timer fix-up has been removed, since it
> > is rather up to the generic PCI driver to tune latency timers.
> >
> > > Is this a hardware bug in my SCSI host adapter (53c875 rev 04) or my host
> > > bridge (VLSI VAS96011/12 Golden Gate II for PPC), or a software bug in the
> > > driver (wrong burst_max)?
> >
> > Great bug hunting!
> >
> > It is about certainly not a software bug in the driver. Any latency timer
> > value should not give any trouble if hardware was flawless. Just the PCI
> > performances could be affected.
>
> I played a bit with sym-2 and setpci. Everything goes fine as long as the PCI
> latency timer value is smaller than 0x16 (yes, at first I thought it was
> decimal, but setpci parameters are in hex).

Interesting result, even if it doesn't trigger any of my guessing
capabilities, for now. :-)

It just means that the 875 must release the PCI bus if its GNT# signal is
deasserted by the PCI arbiter and the current transaction has lasted 22 PCI
cycles or more since the assertion of FRAME#.
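
The rule just described can be written down as a very simplified model. This
is only a sketch of the latency timer behaviour as stated above, with
made-up names; it ignores the fact that a master may always complete at
least one data phase.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model of the PCI master latency timer rule: once the timer
 * (counted in PCI clocks from the assertion of FRAME#) has expired and
 * GNT# is no longer asserted, the master has to give up the bus.
 */
static bool must_release_bus(unsigned int cycles_since_frame,
			     unsigned int latency_timer,
			     bool gnt_asserted)
{
	return !gnt_asserted && cycles_since_frame >= latency_timer;
}
```

In this model a latency timer of 0x16 lets the 875 keep the bus for up to 21
clocks after losing GNT#, and a latency timer of zero makes it release the bus
as soon as GNT# goes away.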

If I remember correctly, the problem occurred when data was being written to
the device. Is that right?

If so, the MWI problem I pointed out in my previous posting is unlikely to
apply. But, for user data DMA write, the 875 may execute Memory Read Line
or Memory Read Multiple Lines transactions. It would be interesting to
know whether disabling those capabilities makes a difference.

Setting the PCI cache line size register in the PCI configuration space to
zero forces the chip not to use any of the cache line based PCI
transactions. It is brute force but should work.

To disable those features separately, some IO register bits must be set to
zero. The fastest way is to hack the driver (sym_hipd.c) at some place, for
example (entered by hand just for you):

	/*
	 *  Select all supported special features.
	 *  If we are using on-board RAM for scripts, prefetch (PFEN)
	 *  does not help, but burst op fetch (BOF) does.
	 *  Disabling PFEN makes sure BOF will be used.
	 */
	if (np->features & FE_ERL)
		np->rv_dmode	|= ERL;		/* Enable Read Line */
	if (np->features & FE_BOF)
		np->rv_dmode	|= BOF;		/* Burst Opcode Fetch */
	if (np->features & FE_ERMP)
		np->rv_dmode	|= ERMP;	/* Enable Read Multiple */
#if 1
	if ((np->features & FE_PFEN) && !np->ram_ba)
#else
	if (np->features & FE_PFEN)
#endif
		np->rv_dcntl	|= PFEN;	/* Prefetch Enable */
	if (np->features & FE_CLSE)
		np->rv_dcntl	|= CLSE;	/* Cache Line Size Enable */
	if (np->features & FE_WRIE)
		np->rv_ctest3	|= WRIE;	/* Write and Invalidate */
	if (np->features & FE_DFS)
		np->rv_ctest5	|= DFS;		/* Dma Fifo Size */

+ #if 0 /* Disable all cache line based features */
+ 	np->rv_dcntl	&= ~CLSE;
+ #endif
+ #if 1 /* Disable Read Line */
+ 	np->rv_dmode	&= ~ERL;
+ #endif
+ #if 1 /* Disable Read Multiple */
+ 	np->rv_dmode	&= ~ERMP;
+ #endif
+ #if 0 /* Disable Write and Invalidate */
+ 	np->rv_ctest3	&= ~WRIE;
+ #endif

This example disables Read Line and Memory Read Multiple. I just added
provisions (#if'ed zero) for other bits that also apply to cache line
based transactions.

Gérard.

Geert Uytterhoeven Dec. 29, 2001, 9:28 p.m. UTC | #5
On Sat, 29 Dec 2001, [ISO-8859-1] Gérard Roudier wrote:
> On Sat, 29 Dec 2001, Geert Uytterhoeven wrote:
> > On Sat, 29 Dec 2001, [ISO-8859-1] Gérard Roudier wrote:
> > > On Fri, 28 Dec 2001, Geert Uytterhoeven wrote:
> > > > The sym-2 driver has a define for modifying the PCI latency timer
> > > > (SYM_SETUP_PCI_FIX_UP), but it is never used, so I see no corruption.
> > >
> > > By default sym-2 use value 3 for the pci_fix_up (cache line size + memory
> > > write and invalidate). The latency timer fix-up has been removed, since it
> > > is rather up to the generic PCI driver to tune latency timers.
> > >
> > > > Is this a hardware bug in my SCSI host adapter (53c875 rev 04) or my host
> > > > bridge (VLSI VAS96011/12 Golden Gate II for PPC), or a software bug in the
> > > > driver (wrong burst_max)?
> > >
> > > Great bug hunting!
> > >
> > > It is about certainly not a software bug in the driver. Any latency timer
> > > value should not give any trouble if hardware was flawless. Just the PCI
> > > performances could be affected.
> >
> > I played a bit with sym-2 and setpci. Everything goes fine as long as the PCI
> > latency timer value is smaller than 0x16 (yes, at first I thought it was
> > decimal, but setpci parameters are in hex).
> 
> Interesting result, even if it doesn't trigger any of my guessing
> capabilities, for now. :-)
> 
> Just it means that the 875 must release the PCI BUS if its GNT# signal is
> deasserted by PCI arbiter and current transaction lasted 22 PCI cycles or
> more since the assertion of FRAME#.

Exactly my thoughts.

> If I remember correctly, the problem occurred when data is written to the
> device. Is it ok?

Yes.

> If so, the MWI problem I pointed out in my previous posting is unlikely to
> apply. But, for user data DMA write, the 875 may execute Memory Read Line
> or Memory Read Multiple Lines transactions. It would be interesting to
> know if it makes difference disabling those capabilities.
> 
> Setting to zero the PCI cache line register in the PCI configuration space
> does force the chip not to use any of the cache line based PCI
> transactions. It is brute force but should work.

Note that on my system the PCI cache line register in the PCI configuration
space of the '875 is already set to zero.

> In order to disable separately those features, some IO register bits must
> be set to zero. The faster way is to hack the driver (sym_hipd.c) at some
> place, for example (entered by hand just for you):

So I don't think it would help to test this, since PCI_CACHE_LINE_SIZE is set
to 0?

Anyway, thanks for your time and suggestions!

Gr{oetje,eeting}s,

						Geert


Gérard Roudier Dec. 30, 2001, midnight UTC | #6
On Sat, 29 Dec 2001, Geert Uytterhoeven wrote:

> On Sat, 29 Dec 2001, [ISO-8859-1] Gérard Roudier wrote:
> > On Sat, 29 Dec 2001, Geert Uytterhoeven wrote:
[...]
> > > I played a bit with sym-2 and setpci. Everything goes fine as long as the PCI
> > > latency timer value is smaller than 0x16 (yes, at first I thought it was
> > > decimal, but setpci parameters are in hex).
> >
> > Interesting result, even if it doesn't trigger any of my guessing
> > capabilities, for now. :-)
> >
> > Just it means that the 875 must release the PCI BUS if its GNT# signal is
> > deasserted by PCI arbiter and current transaction lasted 22 PCI cycles or
> > more since the assertion of FRAME#.
>
> Exactly my thoughts.

Note that this looks like a bit less than 8 DWORDs. If your beast uses such a
cache line size, this may be related.

> > If I remember correctly, the problem occurred when data is written to the
> > device. Is it ok?
>
> Yes.
>
> > If so, the MWI problem I pointed out in my previous posting is unlikely to
> > apply. But, for user data DMA write, the 875 may execute Memory Read Line
> > or Memory Read Multiple Lines transactions. It would be interesting to
> > know if it makes difference disabling those capabilities.
> >
> > Setting to zero the PCI cache line register in the PCI configuration space
> > does force the chip not to use any of the cache line based PCI
> > transactions. It is brute force but should work.
>
> Note that on my system the PCI cache line register in the PCI configuration
> space of the '875 is already set to zero.

Then, the 875 never used cache line based PCI transactions.

> > In order to disable separately those features, some IO register bits must
> > be set to zero. The faster way is to hack the driver (sym_hipd.c) at some
> > place, for example (entered by hand just for you):
>
> So I don't think it would help to test this, since PCI_CACHE_LINE_SIZE is set
> to 0?

Indeed. At least your system hasn't been bitten by PCI cache line related
bugs.

I do not know how the 875 behaves when supplied with a zero latency timer.
Normally it should consider the timeout to happen immediately, but it must
and is allowed to perform at least one data phase. Under this hypothesis, and
given that a latency timer of 22 PCI clocks or more causes problems, I may
risk the following:

Your hardware (probably the host bridge) is only able as a PCI target to
provide a limited amount of data for the current PCI read transaction in
some circumstances. It deasserts GNT#. If the master wants more data it
has to force transaction termination, otherwise it is the master that will
terminate the transaction.

Then the cause of the problem could be something like:

- The host bridge is unable to terminate a read transaction as a target
  and just feeds the master with stale data if it cannot get good ones.
  (Unlikely, but why not?)
- The host bridge just does not terminate the transaction in time in
  some circumstances and provides stale data to the master until the
  transaction terminates.

This is an example, probably just wrong.

Anyway, given that short latency timers hide (fix?) the problem, your
system seems much happier when PCI transactions are preferably (always?)
terminated by the master.

  Gérard.


Patch

diff -urN callisto-1.5g-pre2a/sym53c8xx.c callisto-1.5g-pre2+/sym53c8xx.c
--- callisto-1.5g-pre2a/sym53c8xx.c	Fri Dec 28 21:12:30 2001
+++ callisto-1.5g-pre2+/sym53c8xx.c	Fri Dec 28 20:11:10 2001
@@ -11981,7 +11981,7 @@ 
 	**    (latency timer >= burst length + 6, we add 10 to be quite sure)
 	*/
 
-	if ((pci_fix_up & 4) && chip->burst_max) {
+	if (chip->burst_max && (latency_timer == 0 || (pci_fix_up & 4))) {
 		uchar lt = (1 << chip->burst_max) + 6 + 10;
 		if (latency_timer < lt) {
 			printk(NAME53C8XX