* qla1280.c broken on SGI visws, PCI coherency problem
@ 2005-12-09 19:11 Michael Joosten
  2005-12-09 23:48 ` Michael Reed
  0 siblings, 1 reply; 22+ messages in thread
From: Michael Joosten @ 2005-12-09 19:11 UTC (permalink / raw)
  To: linux-scsi


Hello,
since last week I've been trying to get a current 2.6.12+ kernel into
working order on that almost abandoned SGI 320 Visual Workstation. This
beastlet also has a QLA1080 as SCSI controller, which is actually
the only supported one to boot from.
I'm not sure if this is just an SGI320 problem (there seem to be two bus
bridges in use: PCI: Lithium bridge A bus: 1, bridge B (PIIX4) bus: 0)
or now a general problem for all platforms not implementing an mmiowb()
write barrier operation, but since 2.6.11 the qla1280.c driver gets
severely stuck after a few minutes of heavy use:

zapp kernel: qla1280: ISP invalid handle

and then usually the kernel hangs hard or the SCSI subsystem is
inoperable.
Last year Jesse Barnes published a patch introducing I/O space write
barrier instructions, especially for IA64 and MIPS multiprocessors. In
that patch some PCI posted write flushes were replaced by mmiowb()
(a platform-specific write barrier instruction), and at least for
the SGI VisWS this was one replacement too many.... I'm aware that the
VisWS PCI controller (at least the Lithium chip responsible for the PCI 64
bus) reused parts from the O2 and suffers from the same lack of cache
coherency, but I wonder now whether qla1280.c is actually still stable in
kernels after 2.6.10 (the last version with the PCI write flushes in
qla1280_64/32bit_start_scsi()) and on non-x86 platforms.

I've just tried qla1280.[ch] from a more recent version than 2.6.12.4,
namely 2.6.14.3, and have the same problem again (only worse: there
have been some patches in qla1280.c regarding error recovery recently,
and now the kernel just hangs...), unless I add the one/two RD_REG_WORD()
lines again.
To repeat: have there been any reports of problems with qla1280.c recently?
The last known good version, in 2.6.10, is from Xmas last year.

I can also run tests on an Intel dual PII server board with that QLA1080
HBA, but not now.

Regards, Michael


[-- Attachment #2: qla1280.c.diff --]
[-- Type: text/plain, Size: 584 bytes --]

--- ../linux-2.6.14.3/drivers/scsi/qla1280.c-	2005-11-24 23:10:21.000000000 +0100
+++ ../linux-2.6.14.3/drivers/scsi/qla1280.c	2005-12-07 21:27:42.000000000 +0100
@@ -3236,6 +3236,7 @@
 	WRT_REG_WORD(&reg->mailbox4, ha->req_ring_index);
 	/* Enforce mmio write ordering; see comment in qla1280_isp_cmd(). */
 	mmiowb();
+	RD_REG_WORD(&reg->mailbox4);
 
  out:
 	if (status)
@@ -3504,6 +3505,7 @@
 	WRT_REG_WORD(&reg->mailbox4, ha->req_ring_index);
 	/* Enforce mmio write ordering; see comment in qla1280_isp_cmd(). */
 	mmiowb();
+	RD_REG_WORD(&reg->mailbox4);
 
 out:
 	if (status)


* Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-09 19:11 qla1280.c broken on SGI visws, PCI coherency problem Michael Joosten
@ 2005-12-09 23:48 ` Michael Reed
  2005-12-12 21:00   ` [PATCH]: " Michael Reed
  0 siblings, 1 reply; 22+ messages in thread
From: Michael Reed @ 2005-12-09 23:48 UTC (permalink / raw)
  To: Michael Joosten; +Cc: linux-scsi

FWIW, SGI uses the 1280 driver with the 12160 it ships standard
on its IA64 Altix platforms.  While there are deficiencies in
the driver (most drivers, for that matter), it's good enough.

I'm asking around to see if anyone remembers the issues with the
VISW.

(Realized I didn't reply to list, only to Mr. Joosten.)

Mike



* [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-09 23:48 ` Michael Reed
@ 2005-12-12 21:00   ` Michael Reed
  2005-12-12 21:24     ` Christoph Hellwig
  2005-12-12 21:47     ` James Bottomley
  0 siblings, 2 replies; 22+ messages in thread
From: Michael Reed @ 2005-12-12 21:00 UTC (permalink / raw)
  To: Michael Joosten; +Cc: Michael Reed, linux-scsi

(The subject of this email isn't quite accurate.  It's not
a pci coherency problem, it's a pio write ordering problem.)

I've been asked to pass along the suggestion that "mmiowb"
should be implemented for the platform.

Given that I've been unable to unearth the chipset documentation
for the Vis WS, I can only hope that you've got some good ideas
on how this might be accomplished.

I agree that replacing the pio read which flushed the preceding
pio write with mmiowb() is what has likely broken the driver.  If you
restore them, please make it either mmiowb or pio read, but not both.

Perhaps something like this?  It's not the most elegant solution....

--- old/drivers/scsi/qla1280.c        2005-12-05 12:39:36.000000000 -0600
+++ new/drivers/scsi/qla1280.c      2005-12-12 14:42:11.146215122 -0600
@@ -401,6 +401,10 @@
 #include "ql1280_fw.h"
 #include "ql1040_fw.h"

+#ifdef CONFIG_X86_VISWS
+  #undef mmiowb
+  #define mmiowb() RD_REG_WORD(&ha->iobase->id_l)
+#endif

 /*
  * Missing PCI ID's



It compiles but I have no platform against which to test it.

If I come up with any doc, I'll pass it along.  Don't hold your breath.


Mike



* Re: [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-12 21:00   ` [PATCH]: " Michael Reed
@ 2005-12-12 21:24     ` Christoph Hellwig
  2005-12-12 21:31       ` Jesse Barnes
  2005-12-12 21:47     ` James Bottomley
  1 sibling, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2005-12-12 21:24 UTC (permalink / raw)
  To: Michael Reed; +Cc: Michael Joosten, linux-scsi, linux-kernel

On Mon, Dec 12, 2005 at 03:00:59PM -0600, Michael Reed wrote:
> (The subject of this email isn't quite accurate.  It's not
> a pci coherency problem, it's a pio write ordering problem.)
> 
> I've been asked to pass along the suggestion that "mmiowb"
> should be implemented for the platform.
> 
> Given that I've been unable to unearth the chipset documentation
> for the Vis WS, I can only hope that you've got some good ideas
> on how this might be accomplished.
> 
> I agree that replacing the pio read which flushed the preceding
> pio write with mmiowb() is what has likely broken the driver.  If you
> restore them,  please make it either mmiowb or pio read, but not both.
> 
> Perhaps something like this?  It's not the most elegant solution....

Yeah, it's not that nice.  After all we don't use mmio at all on visw but
pio.  So why do we need the flushing at all?

> --- old/drivers/scsi/qla1280.c        2005-12-05 12:39:36.000000000 -0600
> +++ new/drivers/scsi/qla1280.c      2005-12-12 14:42:11.146215122 -0600
> @@ -401,6 +401,10 @@
>  #include "ql1280_fw.h"
>  #include "ql1040_fw.h"
> 
> +#ifdef CONFIG_X86_VISWS
> +  #undef mmiowb
> +  #define mmiowb() RD_REG_WORD(&ha->iobase->id_l)
> +#endif

Macros with implicit arguments are pretty horrible.  If we want to go down
that road we should add a macro that expands to either version depending
on the config flags.
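
One shape such a macro could take, as a minimal sketch rather than a tested
patch: the register to read back is passed explicitly, and CONFIG_X86_VISWS
selects between a read-back and mmiowb(). The name qla1280_order_writes() is
hypothetical, and RD_REG_WORD() and mmiowb() are replaced here by counting
stand-ins (an assumption made only so the fragment builds outside the kernel).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins so the fragment builds in userspace; in the driver these
 * would be the real RD_REG_WORD() and mmiowb(). The counters exist only
 * so the effect of the macro can be observed. */
static unsigned int reads_issued;
static unsigned int barriers_issued;

static inline uint16_t RD_REG_WORD(volatile uint16_t *addr)
{
	assert(addr != NULL);
	reads_issued++;
	return *addr;
}

static inline void mmiowb(void)
{
	barriers_issued++;
}

/* Hypothetical helper: exactly one of the two mechanisms is used, and
 * the register is an explicit argument rather than an implicit "ha". */
#ifdef CONFIG_X86_VISWS
#define qla1280_order_writes(reg_addr)	((void)RD_REG_WORD(reg_addr))
#else
#define qla1280_order_writes(reg_addr)	((void)(reg_addr), mmiowb())
#endif
```

A call site would then name its register, for example
qla1280_order_writes(&reg->mailbox4) after the WRT_REG_WORD() shown in the
diff earlier in the thread.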

While we're at it, does anyone know why the ioread* interface doesn't
provide the _relaxed version?  I'd really love to switch qla1280 over to it
given that it needs to support both mmio and pio.



* Re: [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-12 21:24     ` Christoph Hellwig
@ 2005-12-12 21:31       ` Jesse Barnes
  0 siblings, 0 replies; 22+ messages in thread
From: Jesse Barnes @ 2005-12-12 21:31 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Michael Reed, Michael Joosten, linux-scsi, linux-kernel

On Monday, December 12, 2005 1:24 pm, Christoph Hellwig wrote:
> While we're at it, does anyone know why the ioread* interface doesn't
> provide the _relaxed version?  I'd really love to switch qla1280 over
> to it given that it needs to support both mmio and pio.

Back when ioread was being discussed on linux-arch, I remember Linus 
saying that perhaps *all* ioreads should be relaxed wrt. DMA (unlike the 
current mmio accessor functions), but I'm not sure if those are the 
semantics we finally settled on.  If not, an ioread*_relaxed would be 
nice to add for the platforms that can support it.

Jesse


* Re: [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-12 21:00   ` [PATCH]: " Michael Reed
  2005-12-12 21:24     ` Christoph Hellwig
@ 2005-12-12 21:47     ` James Bottomley
  2005-12-12 23:00       ` Michael Reed
  2005-12-14  1:28       ` Jeremy Higdon
  1 sibling, 2 replies; 22+ messages in thread
From: James Bottomley @ 2005-12-12 21:47 UTC (permalink / raw)
  To: Michael Reed; +Cc: pazke, Michael Joosten, linux-scsi

On Mon, 2005-12-12 at 15:00 -0600, Michael Reed wrote:
> (The subject of this email isn't quite accurate.  It's not
> a pci coherency problem, it's a pio write ordering problem.)
> 
> I've been asked to pass along the suggestion that "mmiowb"
> should be implemented for the platform.

> Given that I've been unable to unearth the chipset documentation
> for the Vis WS, I can only hope that you've got some good ideas
> on how this might be accomplished.

Well, the idea was that mmiowb and posting flushes were orthogonal.
mmiowb would be used in places where a posted write flush was done but
was strictly unnecessary.  This bug report is implying that the posted
write flush was necessary, so it was incorrectly replaced with mmiowb
(which is a nop on most platforms).

> I agree that replacing the pio read which flushed the preceding
> pio write with mmiowb() is what has likely broken the driver.  If you
> restore them,  please make it either mmiowb or pio read, but not both.
> 
> Perhaps something like this?  It's not the most elegant solution....

I'm tempted to say I think we need to put the write posting flush back
in and dump the mmiowb(), but since the driver is supposedly doing PIO
for VISWS, there's something else going on here (PIO writes aren't
supposed to post).  I've cc'd the VISWS maintainer in case he can think
of anything.

James




* Re: [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-12 21:47     ` James Bottomley
@ 2005-12-12 23:00       ` Michael Reed
  2005-12-13 13:22         ` Michael Reed
  2005-12-14  1:28       ` Jeremy Higdon
  1 sibling, 1 reply; 22+ messages in thread
From: Michael Reed @ 2005-12-12 23:00 UTC (permalink / raw)
  To: James Bottomley; +Cc: pazke, Michael Joosten, linux-scsi



James Bottomley wrote:
> On Mon, 2005-12-12 at 15:00 -0600, Michael Reed wrote:
>>(The subject of this email isn't quite accurate.  It's not
>>a pci coherency problem, it's a pio write ordering problem.)
>>
>>I've been asked to pass along the suggestion that "mmiowb"
>>should be implemented for the platform.
> 
>>Given that I've been unable to unearth the chipset documentation
>>for the Vis WS, I can only hope that you've got some good ideas
>>on how this might be accomplished.
> 
> Well, the idea was that mmiowb and posting flushes were orthogonal.
> mmiowb would be used in places where a posted write flush was done but
> was strictly unnecessary.  This bug report is implying that the posted
> write flush was necessary, so it was incorrectly replaced with mmiowb
> (which is a nop on most platforms).

The mmiowb() is sufficient to assure
ordering of the write to the board register.
(Have I incorrectly used the term pio?
I'm not meaning to imply a particular address space used to
access a board's registers.)

It's not a timing issue.  The WRT_REG write doesn't have to reach
the board before the driver can perform another function.  It
just has to reach the board in the order issued.

> 
>>I agree that replacing the pio read which flushed the preceding
>>pio write with mmiowb() is what has likely broken the driver.  If you
>>restore them,  please make it either mmiowb or pio read, but not both.
>>
>>Perhaps something like this?  It's not the most elegant solution....
> 
> I'm tempted to say I think we need to put the write posting flush back
> in and dump the mmiowb(), but since the driver is supposedly doing PIO
> for VISWS, there's something else going on here (PIO writes aren't
> supposed to post).  I've cc'd the VISWS maintainer in case he can think
> of anything.

Again, apologies if I've misused the term PIO.

Mike

> 
> James
> 
> 

* Re: [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-12 23:00       ` Michael Reed
@ 2005-12-13 13:22         ` Michael Reed
  2005-12-13 14:50           ` James Bottomley
  0 siblings, 1 reply; 22+ messages in thread
From: Michael Reed @ 2005-12-13 13:22 UTC (permalink / raw)
  To: Michael Reed; +Cc: James Bottomley, pazke, Michael Joosten, linux-scsi



Michael Reed wrote:
> 
> James Bottomley wrote:
>> On Mon, 2005-12-12 at 15:00 -0600, Michael Reed wrote:
>>> (The subject of this email isn't quite accurate.  It's not
>>> a pci coherency problem, it's a pio write ordering problem.)
>>>
>>> I've been asked to pass along the suggestion that "mmiowb"
>>> should be implemented for the platform.
>>> Given that I've been unable to unearth the chipset documentation
>>> for the Vis WS, I can only hope that you've got some good ideas
>>> on how this might be accomplished.
>> Well, the idea was that mmiowb and posting flushes were orthogonal.
>> mmiowb would be used in places where a posted write flush was done but
>> was strictly unnecessary.  This bug report is implying that the posted
>> write flush was necessary, so it was incorrectly replaced with mmiowb
>> (which is a nop on most platforms).
> 
> The mmiowb() is sufficient to assure
> ordering of the write to the board register.
> (Have I incorrectly used the term pio?
> I'm not meaning to imply a particular address space used to
> access a board's registers.)
> 
> It's not a timing issue.  The WRT_REG write doesn't have to reach
> the board before the driver can perform another function.  It
> just has to reach the board in the order issued.

I believe the biggest issue with VISWS is that it appears to need
mmiowb() and we likely don't know how to implement it.  Hence, for
that platform, it would make sense to replace the mmiowb() with a
posting read.

We should let the maintainer chime in.  Perhaps he has the
info that would allow an implementation of mmiowb() for the
platform.  That would be best as there are other drivers
(tg3, for example) that also use mmiowb().  Retaining mmiowb()
also allows those platforms that can take advantage of it
a small (perhaps not so small?) performance gain.

Mike


> 
>>> I agree that replacing the pio read which flushed the preceding
>>> pio write with mmiowb() is what has likely broken the driver.  If you
>>> restore them,  please make it either mmiowb or pio read, but not both.
>>>
>>> Perhaps something like this?  It's not the most elegant solution....
>> I'm tempted to say I think we need to put the write posting flush back
>> in and dump the mmiowb(), but since the driver is supposedly doing PIO
>> for VISWS, there's something else going on here (PIO writes aren't
>> supposed to post).  I've cc'd the VISWS maintainer in case he can think
>> of anything.
> 
> Again, apologies if I've misused the term PIO.
> 
> Mike
> 
>> James
>>
>>

* Re: [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-13 13:22         ` Michael Reed
@ 2005-12-13 14:50           ` James Bottomley
  2005-12-13 18:15             ` Michael Reed
  2005-12-14  1:39             ` Jeremy Higdon
  0 siblings, 2 replies; 22+ messages in thread
From: James Bottomley @ 2005-12-13 14:50 UTC (permalink / raw)
  To: Michael Reed; +Cc: pazke, Michael Joosten, linux-scsi

On Tue, 2005-12-13 at 07:22 -0600, Michael Reed wrote:
> I believe the biggest issue with VISWS is that it appears to need
> mmiowb() and we likely don't know how to implement it.  Hence, for
> that platform, it would make sense to replace the mmiowb() with a
> posting read.

Well, there's an easy way to tell ... the reason for the mmiowb in the
qla1280 driver is supposed to be an SMP race, according to the
description, so if it fails on UP as well there's something else going
on here ...

I'm still suspicious because the mmiowb() in this driver replaced a
posted write flush instruction, which altered the behaviour of the
driver.  The qla1280 is just rare enough that it might have taken this
long to notice ...

James




* Re: [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-13 14:50           ` James Bottomley
@ 2005-12-13 18:15             ` Michael Reed
  2005-12-14  5:00               ` Michael Joosten
  2005-12-14  1:39             ` Jeremy Higdon
  1 sibling, 1 reply; 22+ messages in thread
From: Michael Reed @ 2005-12-13 18:15 UTC (permalink / raw)
  To: James Bottomley; +Cc: pazke, Michael Joosten, linux-scsi



James Bottomley wrote:
> On Tue, 2005-12-13 at 07:22 -0600, Michael Reed wrote:
>> I believe the biggest issue with VISWS is that it appears to need
>> mmiowb() and we likely don't know how to implement it.  Hence, for
>> that platform, it would make sense to replace the mmiowb() with a
>> posting read.
> 
> Well, there's an easy way to tell ... the reason for the mmiowb in the
> qla1280 driver is supposed to be an SMP race, according to the
> description, so if it fails on UP as well there's something else going
> on here ...
> 
> I'm still suspicious because the mmiowb() in this driver replaced a
> posted write flush instruction, which altered the behaviour of the
> driver.  The qla1280 is just rare enough that it might have taken this
> long to notice ...

Yup.  But.... keep in mind that the failing platform is the SGI
VISWS, the child of a PC and an O2.  I'd be much more suspicious
if it failed on a generic PC.  (It also works fine on SGI Altix,
a platform which has implemented mmiowb().)

Perhaps Mr. Joosten can confirm his failing case with the UP kernel?

Mike

> 
> James
> 
> 


* Re: [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-12 21:47     ` James Bottomley
  2005-12-12 23:00       ` Michael Reed
@ 2005-12-14  1:28       ` Jeremy Higdon
  2005-12-14  4:59         ` James Bottomley
  1 sibling, 1 reply; 22+ messages in thread
From: Jeremy Higdon @ 2005-12-14  1:28 UTC (permalink / raw)
  To: James Bottomley; +Cc: Michael Reed, pazke, Michael Joosten, linux-scsi

On Mon, Dec 12, 2005 at 03:47:37PM -0600, James Bottomley wrote:
> > Given that I've been unable to unearth the chipset documentation
> > for the Vis WS, I can only hope that you've got some good ideas
> > on how this might be accomplished.
> 
> Well, the idea was that mmiowb and posting flushes were orthogonal.
> mmiowb would be used in places where a posted write flush was done but
> > was strictly unnecessary.  This bug report is implying that the posted
> write flush was necessary, so it was incorrectly replaced with mmiowb
> (which is a nop on most platforms).

No, I don't think it was necessary here.  Though if a platform does
write posting yet has a null mmiowb() implementation, it will have
trouble.

> > I agree that replacing the pio read which flushed the preceding
> > pio write with mmiowb() is what has likely broken the driver.  If you
> > restore them,  please make it either mmiowb or pio read, but not both.
> > 
> > Perhaps something like this?  It's not the most elegant solution....
> 
> I'm tempted to say I think we need to put the write posting flush back
> in and dump the mmiowb(), but since the driver is supposedly doing PIO
> for VISWS, there's something else going on here (PIO writes aren't
> supposed to post).  I've cc'd the VISWS maintainer in case he can think
> of anything.

Yes, the posting of PIO writes is the real problem with the VISWS.
Early ports of Linux for Altix had the same problem.
The current Altix outw looks like this:

static inline void
___sn_outw (unsigned short val, unsigned long port)
{
        volatile unsigned short *addr;

        if ((addr = sn_io_addr(port))) {
                *addr = val;
                __sn_mmiowb();
        }
}


There ought to be a similar facility in the VISWS, though finding
anyone who knows about it might be difficult (the last one was built
in 1999).

It sounds as though the VISWS should also implement the mmiowb(),
since it apparently needs it for write ordering.  Of course, you
can always do a readX for that, but that's quite a bit heavier weight
than necessary.

jeremy


* Re: [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-13 14:50           ` James Bottomley
  2005-12-13 18:15             ` Michael Reed
@ 2005-12-14  1:39             ` Jeremy Higdon
  2005-12-14  3:16               ` Michael Reed
  1 sibling, 1 reply; 22+ messages in thread
From: Jeremy Higdon @ 2005-12-14  1:39 UTC (permalink / raw)
  To: James Bottomley; +Cc: Michael Reed, pazke, Michael Joosten, linux-scsi

On Tue, Dec 13, 2005 at 08:50:13AM -0600, James Bottomley wrote:
> On Tue, 2005-12-13 at 07:22 -0600, Michael Reed wrote:
> > I believe the biggest issue with VISWS is that it appears to need
> > mmiowb() and we likely don't know how to implement it.  Hence, for
> > that platform, it would make sense to replace the mmiowb() with a
> > posting read.
> 
> Well, there's an easy way to tell ... the reason for the mmiowb in the
> qla1280 driver is supposed to be an SMP race, according to the
> description, so if it fails on UP as well there's something else going
> on here ...

The 320 was available with two CPUs, and though the post doesn't say
what this particular one had, it likely had two.  The original post
also indicated a problem with a two-cpu motherboard, though I don't
think that was the 320 (VisWS).

> I'm still suspicious because the mmiowb() in this driver replaced a
> posted write flush instruction, which altered the behaviour of the
> driver.  The qla1280 is just rare enough that it might have taken this
> long to notice ...

The 12160 is in most Altix machines and behaves just like a 1280.
If there were a problem with it in this context, we'd know about it.

I'm still betting on the old problem (time moves downward):

	cpu A	lock
		posted pio write X
		unlock
	cpu B	lock
		posted pio write Y
		unlock
	PCI bus retire pio write Y
		retire pio write X

The Qlogic architecture doesn't like this.
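
The interleaving above can be replayed in a few lines of plain C. This is a
toy model, not kernel code: struct machine, posted_write(), drain_cpu() and
retire_worst_case() are all made up for illustration, and the per-CPU drain
merely stands in for whatever a real mmiowb() (or a read-back) does on a
given platform.

```c
#include <assert.h>
#include <stddef.h>

#define NCPUS	2
#define MAXQ	8

/* Writes a CPU has issued that are still sitting in its posting buffer. */
struct posting_buf {
	int val[MAXQ];
	size_t n;
};

/* Toy machine: one posting buffer per CPU, plus the order in which
 * writes finally arrive at the device. */
struct machine {
	struct posting_buf cpu[NCPUS];
	int device[NCPUS * MAXQ];
	size_t ndev;
};

/* Posted write: returns at once, the value only reaches the buffer. */
static void posted_write(struct machine *m, int cpu, int val)
{
	assert(cpu >= 0 && cpu < NCPUS);
	struct posting_buf *b = &m->cpu[cpu];
	assert(b->n < MAXQ);
	b->val[b->n++] = val;
}

/* Push everything one CPU has posted out to the device, in issue order.
 * This is what the toy barrier does before the lock is dropped. */
static void drain_cpu(struct machine *m, int cpu)
{
	struct posting_buf *b = &m->cpu[cpu];
	for (size_t i = 0; i < b->n; i++)
		m->device[m->ndev++] = b->val[i];
	b->n = 0;
}

/* Worst case when nothing orders the buffers against each other:
 * the highest-numbered CPU retires first. */
static void retire_worst_case(struct machine *m)
{
	for (int cpu = NCPUS - 1; cpu >= 0; cpu--)
		drain_cpu(m, cpu);
}

/* Replay the diagram: CPU 0 writes X=1 under the lock, then CPU 1
 * writes Y=2 under the lock. Returns 1 if the device saw X before Y. */
static int device_saw_issue_order(int use_barrier)
{
	struct machine m = {0};

	posted_write(&m, 0, 1);		/* cpu A: lock, write X */
	if (use_barrier)
		drain_cpu(&m, 0);	/* barrier before unlock */
	posted_write(&m, 1, 2);		/* cpu B: lock, write Y */
	if (use_barrier)
		drain_cpu(&m, 1);	/* barrier before unlock */
	retire_worst_case(&m);		/* whatever is left retires late */

	return m.ndev == 2 && m.device[0] == 1 && m.device[1] == 2;
}
```

With use_barrier set to 0 the device sees Y before X, which is the ordering
in the diagram; with it set to 1 each CPU's write has reached the device
before the lock is handed over, so issue order is preserved.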

jeremy


* Re: [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-14  1:39             ` Jeremy Higdon
@ 2005-12-14  3:16               ` Michael Reed
  0 siblings, 0 replies; 22+ messages in thread
From: Michael Reed @ 2005-12-14  3:16 UTC (permalink / raw)
  To: Jeremy Higdon; +Cc: James Bottomley, pazke, Michael Joosten, linux-scsi

Jeremy and I spoke offline.  We have a REALLY STRONG belief that
the platform is broken with regard to PIO vs. mmio.  We believe
that PIOs on VISWS post, just like mmio.  No SGI designed box
that we're aware of ever implemented PIO semantics.  It had to
be implemented in software.  IT'S NOT A PC!

The best suggestion we can make is find a chip on the bridge
and read a register from that to implement mmiowb() for VISWS.
Just please don't pull the mmiowb() or augment it with a posting read.

Thanks,
 Mike


Jeremy Higdon wrote:
> On Tue, Dec 13, 2005 at 08:50:13AM -0600, James Bottomley wrote:
>> On Tue, 2005-12-13 at 07:22 -0600, Michael Reed wrote:
>>> I believe the biggest issue with VISWS is that it appears to need
>>> mmiowb() and we likely don't know how to implement it.  Hence, for
>>> that platform, it would make sense to replace the mmiowb() with a
>>> posting read.
>> Well, there's an easy way to tell ... the reason for the mmiowb in the
>> qla1280 driver is supposed to be an SMP race, according to the
>> description, so if it fails on UP as well there's something else going
>> on here ...
> 
> The 320 was available with two CPUs, and though the post doesn't say
> what this particular one had, it likely had two.  The original post
> also indicated a problem with a two-cpu motherboard, though I don't
> think that was the 320 (VisWS).
> 
>> I'm still suspicious because the mmiowb() in this driver replaced a
>> posted write flush instruction, which altered the behaviour of the
>> driver.  The qla1280 is just rare enough that it might have taken this
>> long to notice ...
> 
> The 12160 is in most Altix machines and behaves just like a 1280.
> If there were a problem with it in this context, we'd know about it.
> 
> I'm still betting on the old problem (time moves downward):
> 
> 	cpu A	lock
> 		posted pio write X
> 		unlock
> 	cpu B	lock
> 		posted pio write Y
> 		unlock
> 	PCI bus retire pio write Y
> 		retire pio write X
> 
> The Qlogic architecture doesn't like this.
> 
> jeremy


* Re: [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-14  1:28       ` Jeremy Higdon
@ 2005-12-14  4:59         ` James Bottomley
  2005-12-14 23:56           ` Jeremy Higdon
  0 siblings, 1 reply; 22+ messages in thread
From: James Bottomley @ 2005-12-14  4:59 UTC (permalink / raw)
  To: Jeremy Higdon; +Cc: Michael Reed, pazke, Michael Joosten, linux-scsi

On Tue, 2005-12-13 at 17:28 -0800, Jeremy Higdon wrote:
> On Mon, Dec 12, 2005 at 03:47:37PM -0600, James Bottomley wrote:
> > Well, the idea was that mmiowb and posting flushes were orthogonal.
> > mmiowb would be used in places where a posted write flush was done but
> > was strictly unnecessary.  This bug report is implying that the posted
> > write flush was necessary, so it was incorrectly replaced with mmiowb
> > (which is a nop on most platforms).
> 
> No, I don't think it was necessary here.  Though if a platform does
> write posting yet has a null mmiowb() implementation, it will have
> trouble.

Now you're worrying me: Every platform other than Altix does have a null
mmiowb() implementation and, obviously, PCI posting flush requirements
are within the province of the bridge rather than the platform (high end
servers being the ones that have posting bridges).  If mmiowb() is
meant to take the place of write posting flushes then we're in deep
do-do.

The only thing I thought mmiowb() was supposed to be used for is the
case where the platform implements relaxed ordering and we want to
enforce strong ordering on the PCI bus write transactions, but don't
actually care when the write actually completes.

> > > I agree that replacing the pio read which flushed the preceding
> > > pio write with mmiowb() is what has likely broken the driver.  If you
> > > restore them,  please make it either mmiowb or pio read, but not both.
> > > 
> > > Perhaps something like this?  It's not the most elegant solution....
> > 
> > I'm tempted to say I think we need to put the write posting flush back
> > in and dump the mmiowb(), but since the driver is supposedly doing PIO
> > for VISWS, there's something else going on here (PIO writes aren't
> > supposed to post).  I've cc'd the VISWS maintainer in case he can think
> > of anything.
> 
> Yes, the posting of PIO writes is the real problem with the VISWS.
> Early ports of Linux for Altix had the same problem.
> The current Altix outw looks like this:
> 
> static inline void
> ___sn_outw (unsigned short val, unsigned long port)                                         
> {
>         volatile unsigned short *addr;
> 
>         if ((addr = sn_io_addr(port))) {
>                 *addr = val;
>                 __sn_mmiowb();
>         }
> }
> 
> 
> There ought to be a similar facility in the VISWS, though finding
> anyone who knows about it might be difficult (the last one was built
> in 1999).
> 
> It sounds as though the VISWS should also implement the mmiowb(),
> since it apparently needs it for write ordering.  Of course, you
> can always do a readX for that, but that's quite a bit heavier weight
> than necessary.

My primary concern in all of this is that the write posting flush was
incorrectly removed.  In this case, a UP VISWS should still show the
error (now we've established that it's posting even in the PIO case).
If it doesn't, I'll be happy with a VISWS specific fix.

James




* Re: [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-13 18:15             ` Michael Reed
@ 2005-12-14  5:00               ` Michael Joosten
  2005-12-14 17:29                 ` James Bottomley
  0 siblings, 1 reply; 22+ messages in thread
From: Michael Joosten @ 2005-12-14  5:00 UTC (permalink / raw)
  To: Michael Reed; +Cc: James Bottomley, pazke, linux-scsi

Michael Reed wrote:

>James Bottomley wrote:
>  
>
>>On Tue, 2005-12-13 at 07:22 -0600, Michael Reed wrote:
>>    
>>
>>>I believe the biggest issue with VISWS is that it appears to need
>>>mmiowb() and we likely don't know how to implement it.  Hence, for
>>>that platform, it would make sense to replace the mmiowb() with a
>>>posting read.
>>>      
>>>
>>Well, there's an easy way to tell ... the reason for the mmiowb in the
>>qla1280 driver is supposed to be an SMP race, according to the
>>description, so if it fails on UP as well there's something else going
>>on here ...
>>
>>I'm still suspicious because the mmiowb() in this driver replaced a
>>posted write flush instruction, which altered the behaviour of the
>>driver.  The qla1280 is just rare enough that it might have taken this
>>long to notice ...
>>    
>>
>
>Yup.  But.... keep in mind that the failing platform is the SGI
>VISWS, the child of a PC and an O2.  I'd be much more suspicious
>if it failed on a generic PC.  (It also works fine on SGI Altix,
>a platform which has implemented mmiowb().)
>
>Perhaps Mr. Joosten can confirm his failing case with the UP kernel?
>
>  
>
OK, I'm currently doing this, though with a Fedora Core3 kernel 
(2.6.12-1.1381-FC3) UP and SMP,  running it with some ooold filesystem 
benchmark on a similarly old PIII 500MHz board. What else is possible is 
an Intel dual PII (450MHz) server board (N440BX) , a dual PIII(730MHz) 
workstation and a very recent one with hyperthreading PIV. I'm currently 
using a distributed kernel with modules, because this version still has 
the mmiowb() in place (I hope!) .
There might be a timing issue (the faults happened somewhat earlier once 
the board and the VisWS got warmer), but I hope that the other platforms 
will show a little difference...
Well, the PIII board with both a 550 and a 800 MHz proc showed no 
difference, the driver just *works*, no failure in 20 runs.  It looks 
like the problem only shows up in the VISWS. Perhaps I'll try it again, 
putting the QLA1080 in the 32bit slot, which is apparently not 
controlled by the Lithium, but rather a plain PIIX chip. And perhaps 
some other platform and chipset.

Regards, Michael





* Re: [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-14  5:00               ` Michael Joosten
@ 2005-12-14 17:29                 ` James Bottomley
  2005-12-15  1:17                   ` Michael Joosten
  0 siblings, 1 reply; 22+ messages in thread
From: James Bottomley @ 2005-12-14 17:29 UTC (permalink / raw)
  To: Michael Joosten; +Cc: Michael Reed, pazke, linux-scsi

On Wed, 2005-12-14 at 06:00 +0100, Michael Joosten wrote:
> >Perhaps Mr. Joosten can confirm his failing case with the UP kernel?
> >
> >  
> >
> OK, I'm currently doing this, though with a Fedora Core3 kernel 
> (2.6.12-1.1381-FC3) UP and SMP,  running it with some ooold filesystem 
> benchmark on a similarly old PIII 500MHz board. What else is possible is 
> an Intel dual PII (450MHz) server board (N440BX) , a dual PIII(730MHz) 
> workstation and a very recent one with hyperthreading PIV. I'm currently 
> using a distributed kernel with modules, because this version still has 
> the mmiowb() in place (I hope!) .
> There might be a timing issue (the faults happened somewhat earlier once 
> the board and the VisWS got warmer), but I hope that the other platforms 
> will show a little difference...
> Well, the PIII board with both a 550 and a 800 MHz proc showed no 
> difference, the driver just *works*, no failure in 20 runs.  It looks 
> like the problem only shows up in the VISWS. Perhaps I try it again 
> putting the QLA1080 in the 32bit slot, which is apparently not 
> controlled by the Lithium, but rather a plain PIIX chip. And perhaps 
> some other platform and chipset.

Yes, the PIO posting issue is VISWS only, I think.  Could you confirm
that your original bug report was on a SMP VISWS, and could you try the
tests over using a UP kernel on the VISWS?

Thanks,

James




* Re: [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-14  4:59         ` James Bottomley
@ 2005-12-14 23:56           ` Jeremy Higdon
  2005-12-15  0:14             ` Michael Reed
  0 siblings, 1 reply; 22+ messages in thread
From: Jeremy Higdon @ 2005-12-14 23:56 UTC (permalink / raw)
  To: James Bottomley; +Cc: Michael Reed, pazke, Michael Joosten, linux-scsi

On Tue, Dec 13, 2005 at 08:59:37PM -0800, James Bottomley wrote:
> On Tue, 2005-12-13 at 17:28 -0800, Jeremy Higdon wrote:
> > On Mon, Dec 12, 2005 at 03:47:37PM -0600, James Bottomley wrote:
> > > Well, the idea was that mmiowb and posting flushes were orthogonal.
> > > mmiowb would be used in places where a posted write flush was done but
> > > was strictly unnecessary.  This bug report is implying that the posted
> > > write flush was necessary, so it was incorrectly replaced with mmiowb
> > > (which is a nop on most platforms).
> > 
> > No, I don't think it was necessary here.  Though if a platform does
> > write posting yet has a null mmiowb() implementation, it will have
> > trouble.
> 
> Now you're worrying me: Every platform other than Altix does have a null
> mmiowb() implementation and, obviously, PCI posting flush requirements
> are within the province of the bridge rather than the platform (high end
> servers being the ones that have posting bridges).  If mmiowb() is
> meant to take the place of write posting flushes then we're in deep
> do-do.
>
> The only thing I thought mmiowb() was supposed to be used for is the
> case where the platform implements relaxed ordering and we want to
> enforce strong ordering on the PCI bus write transactions, but don't
> actually care when the write actually completes.

Let me try that again, as I was being imprecise.  The mmiowb is needed
there to ensure proper ordering of writes to that register from different
CPUs.  Obviously, if there is no write posting, write ordering is not
a problem (my imprecision above), as long as access to that register is
protected by a spinlock, which it is.

The driver/OS/chip don't really care when that write completes as long
as later writes from the same or other CPUs complete afterward.
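To make the ordering argument concrete, here is a toy user-space model of
the race.  It is only a sketch under stated assumptions: the per-CPU
posting buffers, the function names and the drain order are all invented
for illustration, and model_mmiowb() is a stand-in that simply drains the
buffer, not the real Altix implementation.

```c
#include <assert.h>

#define NCPU   2
#define QDEPTH 8
#define LOGLEN 16

/* Hypothetical model: each CPU has its own buffer of posted writes. */
struct post_buf {
    int vals[QDEPTH];
    int n;
};

static struct post_buf bufs[NCPU];
static int bus_log[LOGLEN];   /* order in which writes retire on the bus */
static int bus_n;

static void model_reset(void)
{
    for (int c = 0; c < NCPU; c++)
        bufs[c].n = 0;
    bus_n = 0;
}

/* A posted write returns at once; the value waits in the CPU's buffer. */
static void posted_write(int cpu, int val)
{
    assert(bufs[cpu].n < QDEPTH);
    bufs[cpu].vals[bufs[cpu].n++] = val;
}

/* Retire one CPU's buffered writes on the bus, in FIFO order. */
static void drain(int cpu)
{
    for (int i = 0; i < bufs[cpu].n; i++) {
        assert(bus_n < LOGLEN);
        bus_log[bus_n++] = bufs[cpu].vals[i];
    }
    bufs[cpu].n = 0;
}

/* Stand-in for mmiowb(): order this CPU's writes before the unlock. */
static void model_mmiowb(int cpu)
{
    drain(cpu);
}

/* Returns 1 if the bus saw X before Y, 0 otherwise. */
static int x_retires_before_y(int use_barrier)
{
    model_reset();

    /* cpu 0: lock; posted write X; [barrier]; unlock */
    posted_write(0, 'X');
    if (use_barrier)
        model_mmiowb(0);

    /* cpu 1: lock; posted write Y; [barrier]; unlock */
    posted_write(1, 'Y');
    if (use_barrier)
        model_mmiowb(1);

    /* Worst case assumed here: CPU 1's buffer happens to drain first. */
    drain(1);
    drain(0);

    return bus_log[0] == 'X' && bus_log[1] == 'Y';
}
```

In this model the spinlock alone orders the two CPUs' stores but not their
retirement on the bus: x_retires_before_y(0) reports the Y-before-X case,
while x_retires_before_y(1) keeps X first.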

> > Yes, the posting of PIO writes is the real problem with the VISWS.
> > Early ports of Linux for Altix had the same problem.
> > The current Altix outw looks like this:
> > 
> > static inline void
> > ___sn_outw (unsigned short val, unsigned long port)                                         
> > {
> >         volatile unsigned short *addr;
> > 
> >         if ((addr = sn_io_addr(port))) {
> >                 *addr = val;
> >                 __sn_mmiowb();
> >         }
> > }
> > 
> > 
> > There ought to be a similar facility in the VISWS, though finding
> > anyone who knows about it might be difficult (the last one was built
> > in 1999).
> > 
> > It sounds as though the VISWS should also implement the mmiowb(),
> > since it apparently needs it for write ordering.  Of course, you
> > can always do a readX for that, but that's quite a bit heavier weight
> > than necessary.
> 
> My primary concern in all of this is that the write posting flush was
> incorrectly removed.  In this case, a UP VISWS should still show the
> error (now we've established that it's posting even in the PIO case).
> If it doesn't, I'll be happy with a VISWS specific fix.

I'm a little surprised that the UP VisWS is showing the problem.  It
looks like it might be reordering writes from the same CPU, though that
seems really unlikely.  I think more debugging of the VisWS is in order
before we can draw any conclusions.

In any case, the flush is unnecessary; only ordering is needed at that
point in the driver.

Btw, the VisWS Linux port apparently uses non-coherent DMA.  Perhaps it
should be switched to coherent DMA, as there could be bugs in that area
with qla1280.

> James

jeremy


* Re: [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-14 23:56           ` Jeremy Higdon
@ 2005-12-15  0:14             ` Michael Reed
  2005-12-15  1:13               ` Jeremy Higdon
  0 siblings, 1 reply; 22+ messages in thread
From: Michael Reed @ 2005-12-15  0:14 UTC (permalink / raw)
  To: Jeremy Higdon; +Cc: James Bottomley, pazke, Michael Joosten, linux-scsi



Jeremy Higdon wrote:
...snip...
>> My primary concern in all of this is that the write posting flush was
>> incorrectly removed.  In this case, a UP VISWS should still show the
>> error (now we've established that it's posting even in the PIO case).
>> If it doesn't, I'll be happy with a VISWS specific fix.
> 
> I'm a little surprised that the UP VisWS is showing the problem.  It
> looks like it might be reordering writes from the same CPU, though that
> seems really unlikely.  I think more debugging of the VisWS is in order
> before we can draw any conclusions.
> 
...snip...

I don't think it has yet been determined that UP is exhibiting the
problem.  I haven't seen a reply after James requested that the
test be rerun using UP kernel.

(I think it was inferred that the VISWS in question was MP but
I didn't see it explicitly stated.)

Mike


* Re: [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-15  0:14             ` Michael Reed
@ 2005-12-15  1:13               ` Jeremy Higdon
  0 siblings, 0 replies; 22+ messages in thread
From: Jeremy Higdon @ 2005-12-15  1:13 UTC (permalink / raw)
  To: Michael Reed; +Cc: James Bottomley, pazke, Michael Joosten, linux-scsi

On Wed, Dec 14, 2005 at 06:14:33PM -0600, Michael Reed wrote:
> 
> 
> Jeremy Higdon wrote:
> ...snip...
> >> My primary concern in all of this is that the write posting flush was
> >> incorrectly removed.  In this case, a UP VISWS should still show the
> >> error (now we've established that it's posting even in the PIO case).
> >> If it doesn't, I'll be happy with a VISWS specific fix.
> > 
> > I'm a little surprised that the UP VisWS is showing the problem.  It
> > looks like it might be reordering writes from the same CPU, though that
> > seems really unlikely.  I think more debugging of the VisWS is in order
> > before we can draw any conclusions.
> > 
> ...snip...
> 
> I don't think it has yet been determined that UP is exhibiting the
> problem.  I haven't seen a reply after James requested that the
> test be rerun using UP kernel.
> 
> (I think it was inferred that the VISWS in question was MP but
> I didn't see it explicitly stated.)
> 
> Mike


In a private email, Michael Joosten told me that he was running an SMP
kernel on a single processor 320.

jeremy


* Re: [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-14 17:29                 ` James Bottomley
@ 2005-12-15  1:17                   ` Michael Joosten
  2005-12-15  2:20                     ` Jeremy Higdon
  0 siblings, 1 reply; 22+ messages in thread
From: Michael Joosten @ 2005-12-15  1:17 UTC (permalink / raw)
  To: James Bottomley; +Cc: Michael Reed, pazke, linux-scsi

James Bottomley wrote:

>>Well, the PIII board with both a 550 and a 800 MHz proc showed no 
>>difference, the driver just *works*, no failure in 20 runs.  It looks 
>>like the problem only shows up in the VISWS. Perhaps I try it again 
>>putting the QLA1080 in the 32bit slot, which is apparently not 
>>controlled by the Lithium, but rather a plain PIIX chip. And perhaps 
>>some other platform and chipset.
>>    
>>
>
>Yes, the PIO posting issue is VISWS only, I think.  Could you confirm
>that your original bug report was on a SMP VISWS, and could you try the
>tests over using a UP kernel on the VISWS?
>
>  
>
The original tests did run on a UP VISWS (single 800MHz PIII), but with 
SMP kernel. I can try and recompile the test kernel for UP, and/or stuff 
the old dual PIII 450MHz into the 320.

After some experiments I can say that:
1) no problems on PIV 3GHz, Intel 915G, ICH6, 1GB

2) UP kernel on VISWS single CPU, some "qla1280: ISP invalid handle" 
messages even during startup, after some time the fs-bench also gets 
stuck with some of these, resetting the adapter doesn't work, and after 
a few tries it just gets stuck and oopses.

What confuses me a little now is that I've some corrupted files, at 
least one not even on the same hard disk as the place for the fs-bench. 
Hmmmm.

3) MP kernel on VISWS, single CPU: either because it has warmed up or 
b/c of SMP, now it gets even stuck during the rc scripts...

4) MP kernel on VISWS, dual CPU: once it warmed up, the driver also failed.

Sooo, I'd think this is a VISWS issue only, independent of MP or UP.

>Btw, the VisWS Linux port apparently uses non-coherent DMA.  Perhaps it
>should be switched to coherent DMA, as there could be bugs in that area
>with qla1280.

Aha, so we are back at coherence, though DMA coherence rather than MP coherence...
(I'm not sure what is meant by DMA coherence - making sure that, before a DMA
transfer is started, at least the related CPU cache lines (in the DMA address
range) are written back?)
Have I actually mentioned that I had already started a discussion with Jesse
Barnes (SGI) and Jes Sorensen before I posted to linux-scsi? Jesse said on the
topic:
>>>>
Jesse> Actually implementing them might be as easy as putting PIO reads from a 
Jesse> bridge register into the DMA unmap routines--that should guarantee 
Jesse> coherence.

Me>You probably mean pci_unmap_addr/_len() macros in asm-i386/pci.h, 
Me>according to i386/kernel/pci-dma.c ??
Me>But there isn't much to be found in asm-i386/mach-visws/lithium.h .

<<<<

Given that the Lithium chip's documentation is lost, it's probably back
to introducing a
#ifdef CONFIG_X86_VISWS
#define QLA_POSTED_PCI_FLUSH(mbx)  RD_REG_WORD(mbx)
#else
#define QLA_POSTED_PCI_FLUSH(mbx)
#endif

and either using it in place of the previous
RD_REG_WORD(&reg->mailbox4)
or adding it where the mmiowb() is now, as
QLA_POSTED_PCI_FLUSH(&reg->mailbox4)

in qla1280_32/64bit_start_scsi().
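
As a rough illustration of how that macro would behave, here is a
user-space sketch.  Everything except the macro name, RD_REG_WORD and
mailbox4 is made up for the example: struct fake_regs, fake_rd_reg_word(),
fake_start_scsi() and the read counter are stand-ins, not driver code, and
the single store only stands in for the register write the real function
does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the device register block. */
struct fake_regs {
    uint16_t mailbox4;
};

static int reads_issued;   /* counts flush reads, for illustration only */

/* Stand-in for the driver's register read accessor. */
static inline uint16_t fake_rd_reg_word(const volatile uint16_t *addr)
{
    reads_issued++;
    return *addr;
}
#define RD_REG_WORD(addr) fake_rd_reg_word(addr)

/* The proposed macro: a read-back on VISWS, nothing elsewhere. */
#ifdef CONFIG_X86_VISWS
#define QLA_POSTED_PCI_FLUSH(mbx)  ((void)RD_REG_WORD(mbx))
#else
#define QLA_POSTED_PCI_FLUSH(mbx)  do { } while (0)
#endif

/* Stand-in for the tail of the start_scsi path: write, then flush. */
static void fake_start_scsi(struct fake_regs *reg, uint16_t req_index)
{
    reg->mailbox4 = req_index;             /* the posted register write */
    QLA_POSTED_PCI_FLUSH(&reg->mailbox4);  /* read-back only on VISWS */
}
```

Built without CONFIG_X86_VISWS the flush compiles away and no read is
issued; built with -DCONFIG_X86_VISWS every call reads mailbox4 back once.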

Could this chipset deficiency also explain why I had NO luck when trying to use a PCI graphics card for X11 instead of the frame buffer? The server either gets stuck in "write recombining range" (ELSA Gloria Synergy, Permedia) (i.e. the gfx driver module in Xorg/XFree86) or does not recognize the VBIOS and leaves a Matrox G450 largely uninitialized (X server runs, but monitor stays black and unsync'ed).

So long, Michael




* Re: [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-15  1:17                   ` Michael Joosten
@ 2005-12-15  2:20                     ` Jeremy Higdon
  2005-12-15 16:21                       ` Michael Joosten
  0 siblings, 1 reply; 22+ messages in thread
From: Jeremy Higdon @ 2005-12-15  2:20 UTC (permalink / raw)
  To: Michael Joosten; +Cc: James Bottomley, Michael Reed, pazke, linux-scsi

On Thu, Dec 15, 2005 at 02:17:38AM +0100, Michael Joosten wrote:
> >Btw, the VisWS Linux port apparently uses non-coherent DMA.  Perhaps it
> >should be switched to coherent DMA, as there could be bugs in that area
> >with qla1280.
> 
> Aha, so we are back at coherence, but it's not MP, but DMA one...
> (I'm not sure what is meant with DMA coherence - make sure that before a 
> DMA transfer is started at least the related CPU cache lines (in the DMA 
> address range) are written back?)

For a start.  You need to invalidate before a read into memory and write
back before a write to disk.  You also need to make sure that the memory
stays out of the CPU cache during the DMA (so when you DMA into memory,
nothing may be loading from or storing to a cacheline that you're DMA'ing
into).  On a speculative execution CPU, this is really hard.
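
A toy model may make those rules easier to follow.  This is a sketch
only, assuming a single write-back cache line in front of a single memory
word; all names are invented for the illustration and none of it is VisWS
or kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical one-line write-back cache in front of one memory word. */
struct line {
    bool valid;
    bool dirty;
    int  data;
};

static int memory;
static struct line cache;

static void cpu_store(int v)
{
    cache.valid = true;
    cache.dirty = true;
    cache.data = v;
}

static int cpu_load(void)
{
    if (!cache.valid) {            /* miss: fetch the line from memory */
        cache.data = memory;
        cache.valid = true;
        cache.dirty = false;
    }
    return cache.data;
}

static void cache_writeback(void)
{
    if (cache.valid && cache.dirty) {
        memory = cache.data;
        cache.dirty = false;
    }
}

static void cache_invalidate(void)
{
    cache.valid = false;
    cache.dirty = false;
}

/* Non-coherent DMA goes straight to memory and never sees the cache. */
static int  dma_to_device(void)     { return memory; }
static void dma_from_device(int v)  { memory = v; }

/* "Write to disk": returns what the device actually reads. */
static int outbound(bool writeback_first)
{
    memory = 1;
    cache_invalidate();
    cpu_store(5);                  /* new data sits dirty in the cache */
    if (writeback_first)
        cache_writeback();
    return dma_to_device();
}

/* "Read into memory": returns what the CPU sees after the DMA. */
static int inbound(bool invalidate_first, bool touch_during_dma)
{
    memory = 1;
    cache_invalidate();
    (void)cpu_load();              /* old buffer contents get cached */
    if (invalidate_first)
        cache_invalidate();
    if (touch_during_dma)
        (void)cpu_load();          /* e.g. a speculative load refetches */
    dma_from_device(2);
    return cpu_load();
}
```

In this model outbound(false) hands the device the stale 1 while
outbound(true) hands it 5.  Likewise inbound(true, false) sees the new 2,
but inbound(false, false) and inbound(true, true) both see the stale 1,
the last being the "touched during the DMA" case.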

> Have I actually mentioned that I already had started a discussion with 
> Jesse Barnes (SGI) and Jes Sorensen before I posted in linux-scsi? He said 
> on the topic:
> >>>>
> Jesse> Actually implementing them might be as easy as putting PIO reads from a 
> Jesse> bridge register into the DMA unmap routines--that should guarantee 
> Jesse> coherence.

That's a different problem.  The interrupt doesn't guarantee a PCI
DMA completion.  The PIO read will, however (according to Lithium
spec).

> Me>You probably mean pci_unmap_addr/_len() macros in asm-i386/pci.h, 
> Me>according to i386/kernel/pci-dma.c ??
> Me>But there isn't much to be found in asm-i386/mach-visws/lithium.h .

No.  There is a Device Control register for each PCI device.  Bit
10 controls whether DMA for that device is coherent.  If 0 it's
not coherent.  If 1, coherent.
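
In code, handling that bit could look roughly like the sketch below.  The
macro and function names are hypothetical, and the 32-bit register width
is an assumption; only the bit position and its meaning come from the
description above.  The register's offset is not given here, so the sketch
works on a value that has already been read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 10 of the per-device Device Control register (name is made up). */
#define DEVCTRL_DMA_COHERENT  (UINT32_C(1) << 10)

/* True if the register value selects coherent DMA for the device. */
static inline bool devctrl_is_coherent(uint32_t v)
{
    return (v & DEVCTRL_DMA_COHERENT) != 0;
}

/* Return the register value with the coherent-DMA bit set or cleared. */
static inline uint32_t devctrl_set_coherent(uint32_t v, bool on)
{
    return on ? (v | DEVCTRL_DMA_COHERENT) : (v & ~DEVCTRL_DMA_COHERENT);
}
```

A caller would read the register, pass the value through
devctrl_set_coherent(v, true) and write the result back.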

> <<<<
> 
> Given the situation that the Lithium chip's documentation is lost, it's 
> probably back at introducing a 
> #ifdef CONFIG_X86_VISWS
> #define QLA_POSTED_PCI_FLUSH(mbx)  RD_REG_WORD(mbx)
> #else
> #define QLA_POSTED_PCI_FLUSH(mbx)
> #endif
> 
> and replace the previous 
> RD_REG_WORD(&reg->mailbox4)
> or add where's now the mmiowb() with 
> QLA_POSTED_PCI_FLUSH(&reg->mailbox4)
> 
> in qla1280_32/64bit_start_scsi().

We're going to work on an API change to mmiowb() instead.

> Could this chipset deficiency also explain why I had NO luck when trying to 
> use a PCI graphics card for X11 instead of the frame buffer? The server 
> either gets stuck in "write recombining range" (ELSA Gloria Synergy, 
> Permedia) (i.e. the gfx driver module in Xorg/XFree86) or does not 
> recognize the VBIOS and leaves a Matrox G450 largely uninitialized (X 
> server runs, but monitor stays black and unsync'ed).

It's possible.

> So long, Michael

jeremy


* Re: [PATCH]: Re: qla1280.c broken on SGI visws, PCI coherency problem
  2005-12-15  2:20                     ` Jeremy Higdon
@ 2005-12-15 16:21                       ` Michael Joosten
  0 siblings, 0 replies; 22+ messages in thread
From: Michael Joosten @ 2005-12-15 16:21 UTC (permalink / raw)
  To: Jeremy Higdon; +Cc: James Bottomley, Michael Reed, pazke, linux-scsi

Jeremy Higdon wrote:

>On Thu, Dec 15, 2005 at 02:17:38AM +0100, Michael Joosten wrote:
>  
>
>>>Btw, the VisWS Linux port apparently uses non-coherent DMA.  Perhaps it
>>>should be switched to coherent DMA, as there could be bugs in that area
>>>with qla1280.
>>>      
>>>
>>Aha, so we are back at coherence, but it's not MP, but DMA one...
>>(I'm not sure what is meant with DMA coherence - make sure that before a 
>>DMA transfer is started at least the related CPU cache lines (in the DMA 
>>address range) are written back?)
>>    
>>
>
>For a start.  You need to invalidate before read into memory and write
>back before write to disk.  You also need to make sure that the memory
>stays out of the CPU cache during the DMA (so you need to make sure that
>when you DMA into memory that nothing is loading or storing to a cacheline
>that you're DMA'ing into).  On a speculative execution CPU, this is really
>hard.
>  
>
Aha. Thanks for the info, now I can figure what's going on.

>>in qla1280_32/64bit_start_scsi().
>>    
>>
>
>We're going to work on an API change to mmiowb() instead.
>
>  
>
OK. From my point of view, this is settled, and I'll stay tuned to test a 
patch. It's not urgent.

Thanks for the lively discussion and some insights, Michael



Thread overview: 22+ messages
2005-12-09 19:11 qla1280.c broken on SGI visws, PCI coherency problem Michael Joosten
2005-12-09 23:48 ` Michael Reed
2005-12-12 21:00   ` [PATCH]: " Michael Reed
2005-12-12 21:24     ` Christoph Hellwig
2005-12-12 21:31       ` Jesse Barnes
2005-12-12 21:47     ` James Bottomley
2005-12-12 23:00       ` Michael Reed
2005-12-13 13:22         ` Michael Reed
2005-12-13 14:50           ` James Bottomley
2005-12-13 18:15             ` Michael Reed
2005-12-14  5:00               ` Michael Joosten
2005-12-14 17:29                 ` James Bottomley
2005-12-15  1:17                   ` Michael Joosten
2005-12-15  2:20                     ` Jeremy Higdon
2005-12-15 16:21                       ` Michael Joosten
2005-12-14  1:39             ` Jeremy Higdon
2005-12-14  3:16               ` Michael Reed
2005-12-14  1:28       ` Jeremy Higdon
2005-12-14  4:59         ` James Bottomley
2005-12-14 23:56           ` Jeremy Higdon
2005-12-15  0:14             ` Michael Reed
2005-12-15  1:13               ` Jeremy Higdon
