linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: Quick aic7xxx bug hunt...
  2002-09-23 23:28                 ` Jeff Garzik
@ 2002-09-23  7:44                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 11+ messages in thread
From: Benjamin Herrenschmidt @ 2002-09-23  7:44 UTC (permalink / raw)
  To: Jeff Garzik, Justin T. Gibbs; +Cc: Konstantin Kletschke, linux-kernel

>Jeff Garzik wrote:
>> ahc_reset
>>     aic7770_config -> can sleep
>>     ahc_pci_config -> can sleep
>>     ahc_shutdown -> can't sleep, whoops
>
>
>Though, to answer your question from a previous email, you can call the 
>function in_interrupt() to see if you're in interrupt context.

Well... no, you can't rely on it for knowing if you can sleep or
not. in_interrupt() won't tell you interesting things like you
are holding a spinlock...

Ben.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* 2.4.20-pre7-ac3 aic7xxx broken?
@ 2002-09-23 18:00 Konstantin Kletschke
  2002-09-23 19:15 ` Justin T. Gibbs
  0 siblings, 1 reply; 11+ messages in thread
From: Konstantin Kletschke @ 2002-09-23 18:00 UTC (permalink / raw)
  To: linux-kernel


Hi!

I wonder if my 

Bus  0, device   9, function  0:
      SCSI storage controller: Adaptec AHA-7850 (rev 3).
        IRQ 11.
        Master Capable.  Latency=32.  Min Gnt=4.Max Lat=4.
        I/O at 0xa000 [0xa0ff].
        Non-prefetchable 32 bit memory at 0xe7001000 [0xe7001fff].

is broken or the aic7xxx driver in linux-2.4.20-pre7-ac3?

The modul loads fine, the cdrom seems to work properly, though,
but this messages appears in my
message.log:

==> /var/log/syslog <==
Sep 23 19:32:23 sexmachine kernel: scsi1: PCI error Interrupt at seqaddr
= 0x9
Sep 23 19:32:23 sexmachine kernel: scsi1: Data Parity Error Detected
during address or write data phase

several times a minute. The aic7xxx_old produces no errorr messages...

Konsti


-- 
Kletschke & Uhlig GbR: 
http://www.ku-gbr.de

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.4.20-pre7-ac3 aic7xxx broken?
  2002-09-23 18:00 2.4.20-pre7-ac3 aic7xxx broken? Konstantin Kletschke
@ 2002-09-23 19:15 ` Justin T. Gibbs
  2002-09-23 21:27   ` Quick aic7xxx bug hunt Jeff Garzik
  0 siblings, 1 reply; 11+ messages in thread
From: Justin T. Gibbs @ 2002-09-23 19:15 UTC (permalink / raw)
  To: Konstantin Kletschke, linux-kernel

> Hi!
> 
> I wonder if my 
> 
> Bus  0, device   9, function  0:
>       SCSI storage controller: Adaptec AHA-7850 (rev 3).

...

> is broken or the aic7xxx driver in linux-2.4.20-pre7-ac3?
> 
> The modul loads fine, the cdrom seems to work properly, though,
> but this messages appears in my
> message.log:
> 
> ==> /var/log/syslog <==
> Sep 23 19:32:23 sexmachine kernel: scsi1: PCI error Interrupt at seqaddr
> = 0x9
> Sep 23 19:32:23 sexmachine kernel: scsi1: Data Parity Error Detected
> during address or write data phase
> 
> several times a minute. The aic7xxx_old produces no errorr messages...

On some motherboards with some chipsets, you can get these messages if
another busmaster (say an IDE drive or a sound card) is hogging the bus.
Usually this is with a VIA chipset.  Its not clear why the aic7xxx_old
driver would behave differently other than it disables memory write
and invalidate PCI transactions on this chip.  The new driver doesn't
need that work around.

As to wether it is safe or not, well our investigation into these issues
has shown that the parity errors are real, but it is the parity that
is at fault.  It is hard to say if that is the case for your MB or not.
Can you provide, in private email, a full lspci dump?

Thanks,
Justin

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Quick aic7xxx bug hunt...
  2002-09-23 19:15 ` Justin T. Gibbs
@ 2002-09-23 21:27   ` Jeff Garzik
  2002-09-23 21:54     ` Justin T. Gibbs
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff Garzik @ 2002-09-23 21:27 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: Konstantin Kletschke, linux-kernel

Justin T. Gibbs wrote:
> On some motherboards with some chipsets, you can get these messages if
> another busmaster (say an IDE drive or a sound card) is hogging the bus.
> Usually this is with a VIA chipset.  Its not clear why the aic7xxx_old
> driver would behave differently other than it disables memory write
> and invalidate PCI transactions on this chip.  The new driver doesn't
> need that work around.


Justin,

One thing I notice is at least one PCI posting bug.  When using MMIO 
(write[bwlq] under Linux), you _must_ use a read[bwlq] to flush the 
write to PCI, if you wish to ensure the write posts at a certain point 
in the code.

Here is the example PCI posting bug, in ahc_clear_critical_section:
>                 ahc_outb(ahc, HCNTRL, ahc->unpause);
>                 do {
>                         ahc_delay(200);
>                 } while (!ahc_is_paused(ahc));

As you can see, there is no read before the udelay(), which is very 
wrong on modern CPUs with write posting...  that's definitely a driver 
bug that will bite you on modern x86 motherboards [and is totally broken 
on ia64 and other platforms].

Please let me know if you have further questions on PCI write posting...

	Jeff




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Quick aic7xxx bug hunt...
  2002-09-23 21:27   ` Quick aic7xxx bug hunt Jeff Garzik
@ 2002-09-23 21:54     ` Justin T. Gibbs
  2002-09-23 22:18       ` Jeff Garzik
  0 siblings, 1 reply; 11+ messages in thread
From: Justin T. Gibbs @ 2002-09-23 21:54 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Konstantin Kletschke, linux-kernel

> Justin T. Gibbs wrote:
>> On some motherboards with some chipsets, you can get these messages if
>> another busmaster (say an IDE drive or a sound card) is hogging the bus.
>> Usually this is with a VIA chipset.  Its not clear why the aic7xxx_old
>> driver would behave differently other than it disables memory write
>> and invalidate PCI transactions on this chip.  The new driver doesn't
>> need that work around.
> 
> 
> Justin,
> 
> One thing I notice is at least one PCI posting bug.  When using MMIO
> (write[bwlq] under Linux), you _must_ use a read[bwlq] to flush the write
> to PCI, if you wish to ensure the write posts at a certain point in the
> code.

> 
> Here is the example PCI posting bug, in ahc_clear_critical_section:
>>                 ahc_outb(ahc, HCNTRL, ahc->unpause);
>>                 do {
>>                         ahc_delay(200);
>>                 } while (!ahc_is_paused(ahc));
> 
> As you can see, there is no read before the udelay(), which is very wrong
> on modern CPUs with write posting...  that's definitely a driver bug that
> will bite you on modern x86 motherboards [and is totally broken on ia64
> and other platforms].
> 
> Please let me know if you have further questions on PCI write posting...

I somewhat doubt that any CPU would hold onto a posted write for 200us
since you are not guaranteed that a read will occur quickly and you want
those write buffers to be availble for other clients, but regardless, the
code has not been as you describe since November of last year.  The
CHANGELOG reads:

                Always perform a bus read prior to waiting in
                a delay loop waiting for a bus write to take
                effect.  This ensures that the first time
                through the loop the delay occurs after the
                write has taken effect.

Of course, the "bug" was benign since the loop does perform a read
so the write is guaranteed to post during the first itteration through
the loop.

--
Justin

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Quick aic7xxx bug hunt...
  2002-09-23 21:54     ` Justin T. Gibbs
@ 2002-09-23 22:18       ` Jeff Garzik
  2002-09-23 22:44         ` Justin T. Gibbs
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff Garzik @ 2002-09-23 22:18 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: Konstantin Kletschke, linux-kernel

Justin T. Gibbs wrote:
> I somewhat doubt that any CPU would hold onto a posted write for 200us
> since you are not guaranteed that a read will occur quickly and you want
> those write buffers to be availble for other clients, but regardless, the
> code has not been as you describe since November of last year.


Great, I stand corrected.  Looks like 2.5 code is ancient then?

comments on the 2.4 code:
* the 1000us delay in ahc_reset needs to be turned into a sleep, instead 
all paths to that function [AFAICS] can sleep.  likewise for the huge 
delay in ahc_acquire_seeprom.

* 400ms worst case udelay() is in ahc_clear_critical_section is kinda 
annoying [but I suppose it can be lived with, if the average is a lot 
less than that :)]

* the delay in ahc_init should be replaced with a sleep

* PCI posting?  (aic7xxx_core.c, line 1322, the last statement in the 
function...)

                 ahc_outb(ahc, CLRINT, CLRSCSIINT);

I'll look it over some more later.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Quick aic7xxx bug hunt...
  2002-09-23 22:18       ` Jeff Garzik
@ 2002-09-23 22:44         ` Justin T. Gibbs
  2002-09-23 22:50           ` Jeff Garzik
  0 siblings, 1 reply; 11+ messages in thread
From: Justin T. Gibbs @ 2002-09-23 22:44 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Konstantin Kletschke, linux-kernel

> Great, I stand corrected.  Looks like 2.5 code is ancient then?

Yes.  I didn't do the original port and am now just finishing up my
port to 2.5.X.

> comments on the 2.4 code:
> * the 1000us delay in ahc_reset needs to be turned into a sleep, instead
> all paths to that function [AFAICS] can sleep.  likewise for the huge
> delay in ahc_acquire_seeprom.

For all of these delays, I'd be more than happy to make them all into
sleeps if I can tell, from inside ahc_delay() if I'm in a context where
it is safe to sleep.  On the other platforms that this core code runs on
I'm usually not in a context where it is safe to sleep, so I don't want
to switch to using a different driver primitive.

> * PCI posting?  (aic7xxx_core.c, line 1322, the last statement in the
> function...)
> 
>                  ahc_outb(ahc, CLRINT, CLRSCSIINT);

I don't care when the write occurs only that it will occur eventually.
The buffer will get flushed eventually so there is no need to call
ahc_flush_device_writes().

--
Justin

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Quick aic7xxx bug hunt...
  2002-09-23 22:44         ` Justin T. Gibbs
@ 2002-09-23 22:50           ` Jeff Garzik
  2002-09-23 22:55             ` Justin T. Gibbs
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff Garzik @ 2002-09-23 22:50 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: Konstantin Kletschke, linux-kernel

Justin T. Gibbs wrote:
>>Great, I stand corrected.  Looks like 2.5 code is ancient then?
> 
> 
> Yes.  I didn't do the original port and am now just finishing up my
> port to 2.5.X.
> 
> 
>>comments on the 2.4 code:
>>* the 1000us delay in ahc_reset needs to be turned into a sleep, instead
>>all paths to that function [AFAICS] can sleep.  likewise for the huge
>>delay in ahc_acquire_seeprom.
> 
> 
> For all of these delays, I'd be more than happy to make them all into
> sleeps if I can tell, from inside ahc_delay() if I'm in a context where
> it is safe to sleep.  On the other platforms that this core code runs on
> I'm usually not in a context where it is safe to sleep, so I don't want
> to switch to using a different driver primitive.

For Linux it's unconditionally safe, and other platforms is sounds like 
it's unconditionally not.  So, s/ahc_delay/ahc_sleep/ for the places I 
pointed out, and just make ahc_delay==ahc_sleep on non-Linux platforms 
(or any similarly-functioning solution)

It's pretty much impossible to detect if you are inside certain 
spinlocks, in a generic fashion.


>>* PCI posting?  (aic7xxx_core.c, line 1322, the last statement in the
>>function...)
>>
>>                 ahc_outb(ahc, CLRINT, CLRSCSIINT);
> 
> 
> I don't care when the write occurs only that it will occur eventually.
> The buffer will get flushed eventually so there is no need to call
> ahc_flush_device_writes().

ok, thanks for clarifying.

	Jeff




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Quick aic7xxx bug hunt...
  2002-09-23 22:50           ` Jeff Garzik
@ 2002-09-23 22:55             ` Justin T. Gibbs
  2002-09-23 23:22               ` Jeff Garzik
  0 siblings, 1 reply; 11+ messages in thread
From: Justin T. Gibbs @ 2002-09-23 22:55 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Konstantin Kletschke, linux-kernel

>> For all of these delays, I'd be more than happy to make them all into
>> sleeps if I can tell, from inside ahc_delay() if I'm in a context where
>> it is safe to sleep.  On the other platforms that this core code runs on
>> I'm usually not in a context where it is safe to sleep, so I don't want
>> to switch to using a different driver primitive.
> 
> For Linux it's unconditionally safe, and other platforms is sounds like
> it's unconditionally not.  So, s/ahc_delay/ahc_sleep/ for the places I
> pointed out, and just make ahc_delay==ahc_sleep on non-Linux platforms
> (or any similarly-functioning solution)

So you can sleep while in an interrupt context?  I didn't know that
2.5 had switched to using interrupt threads or some similar construct.

> It's pretty much impossible to detect if you are inside certain
> spinlocks, in a generic fashion.

In 2.5, I should only need to release my own host lock assuming it is
held.

--
Justin


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Quick aic7xxx bug hunt...
  2002-09-23 22:55             ` Justin T. Gibbs
@ 2002-09-23 23:22               ` Jeff Garzik
  2002-09-23 23:28                 ` Jeff Garzik
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff Garzik @ 2002-09-23 23:22 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: Konstantin Kletschke, linux-kernel

Justin T. Gibbs wrote:
>>>For all of these delays, I'd be more than happy to make them all into
>>>sleeps if I can tell, from inside ahc_delay() if I'm in a context where
>>>it is safe to sleep.  On the other platforms that this core code runs on
>>>I'm usually not in a context where it is safe to sleep, so I don't want
>>>to switch to using a different driver primitive.
>>
>>For Linux it's unconditionally safe, and other platforms is sounds like
>>it's unconditionally not.  So, s/ahc_delay/ahc_sleep/ for the places I
>>pointed out, and just make ahc_delay==ahc_sleep on non-Linux platforms
>>(or any similarly-functioning solution)
> 
> 
> So you can sleep while in an interrupt context?  I didn't know that
> 2.5 had switched to using interrupt threads or some similar construct.


Of course not :)

ahc_reset
	aic7770_config -> can sleep
	ahc_pci_config -> can sleep
	ahc_shutdown -> can't sleep, whoops
	ahc_resume -> dead code
ahc_init
	aic7770_config -> can sleep
	ahc_pci_config -> can sleep
ahc_acquire_seeprom
	check_extport -> can sleep
	ahc_proc_write_seeprom -> can sleep

so, ahc_init and ahc_acquire_seeprom can s/ahc_delay/ahc_sleep/ safely.

Oh, and I found another bug:  never use check_region, it's inherently 
racy.  Use request_region and check its return value.	

Note I agree with a comment in the code, that wrapping SHT->detect() in 
io_request_lock is silly...  the comment describing the rationale in 
drivers/scsi/scsi.c is not really accurate...

	Jeff




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Quick aic7xxx bug hunt...
  2002-09-23 23:22               ` Jeff Garzik
@ 2002-09-23 23:28                 ` Jeff Garzik
  2002-09-23  7:44                   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff Garzik @ 2002-09-23 23:28 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: Konstantin Kletschke, linux-kernel

Jeff Garzik wrote:
> ahc_reset
>     aic7770_config -> can sleep
>     ahc_pci_config -> can sleep
>     ahc_shutdown -> can't sleep, whoops


Though, to answer your question from a previous email, you can call the 
function in_interrupt() to see if you're in interrupt context.

	Jeff




^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2002-09-24  9:31 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-09-23 18:00 2.4.20-pre7-ac3 aic7xxx broken? Konstantin Kletschke
2002-09-23 19:15 ` Justin T. Gibbs
2002-09-23 21:27   ` Quick aic7xxx bug hunt Jeff Garzik
2002-09-23 21:54     ` Justin T. Gibbs
2002-09-23 22:18       ` Jeff Garzik
2002-09-23 22:44         ` Justin T. Gibbs
2002-09-23 22:50           ` Jeff Garzik
2002-09-23 22:55             ` Justin T. Gibbs
2002-09-23 23:22               ` Jeff Garzik
2002-09-23 23:28                 ` Jeff Garzik
2002-09-23  7:44                   ` Benjamin Herrenschmidt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).