linux-kernel.vger.kernel.org archive mirror
* IRQ disabled (SATA) on NForce2 and my theory
@ 2003-12-14 13:14 Julien Oster
  2003-12-15  6:06 ` Bob
  0 siblings, 1 reply; 7+ messages in thread
From: Julien Oster @ 2003-12-14 13:14 UTC
  To: Linux Kernel Mailing List


Hello!

I have an ASUS A7N8X Deluxe v2.0 running with APIC and I/O APIC enabled,
thanks to athcool. (I didn't apply any patches; I just disable CPU
Disconnect with 'athcool off' as the first thing on boot.)

Now, however, since I am running with APIC, the following error occurs
quite often:

[...]
Dec  8 19:16:20 frodo kernel: hde: DMA disabled
Dec  8 19:16:20 frodo kernel: ide2: reset phy, status=0x00000113, siimage_reset
[...]

Shortly after that, the kernel would report:

Dec  8 19:16:21 frodo kernel: Disabling IRQ #18
Dec  8 19:16:22 frodo kernel: irq 18: nobody cared!

This sometimes happens under very high load on my onboard SATA
controller, where both hard drives (fast 10000rpm Raptors) are part of
a Linux software RAID0. IRQ 18 is assigned to this controller.

The drive/controller won't recover afterwards, only a reboot helps.
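
For readers unfamiliar with the two messages above: they come from the
kernel's unhandled-interrupt detection. A minimal user-space sketch of
that kind of bookkeeping follows. It is a simplified model loosely based
on 2.6-era kernels, not the kernel source; the struct, the function name
note_irq and both thresholds are assumptions chosen for illustration.

```c
#include <stdbool.h>

/* Hypothetical per-IRQ bookkeeping; names and thresholds are assumptions. */
struct irq_stats {
	unsigned int count;      /* interrupts seen in the current window */
	unsigned int unhandled;  /* of those, how many no handler claimed */
	bool disabled;           /* set once the line has been shut off */
};

#define IRQ_WINDOW        100000u  /* assumed window size */
#define IRQ_UNHANDLED_MAX  99900u  /* assumed limit of unclaimed interrupts */

/*
 * Record one interrupt. 'handled' is true if some driver's handler
 * claimed it. Returns true if this call disabled the line.
 */
bool note_irq(struct irq_stats *s, bool handled)
{
	if (s->disabled)
		return false;
	if (!handled)
		s->unhandled++;
	if (++s->count < IRQ_WINDOW)
		return false;

	/* Window complete: decide, then start a new window. */
	bool shut_off = s->unhandled > IRQ_UNHANDLED_MAX;
	if (shut_off)
		s->disabled = true;  /* corresponds to "Disabling IRQ #n" */
	s->count = 0;
	s->unhandled = 0;
	return shut_off;
}
```

In this model, a line whose handlers keep answering "not mine" while the
device keeps asserting the interrupt gets shut off at the end of the
first full window, and nothing re-enables it afterwards.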

Now, my theory about this: One patch to fix the NForce2 lockups was to
insert a small delay in the acknowledgement of the timer
interrupt. Apparently, the machine would lock up if the timer
interrupt gets acknowledged too fast, meaning too soon.

I now suspect that my IRQ18 problems are a result of exactly the same
cause: IRQ18 gets acknowledged too soon under very high load, so no
further interrupts arrive and disk operations come to a
halt.

It was most noticeable for the timer interrupt, because the timer
interrupt is basically always at "high load" and a lack of it would
result in a hard lockup of the board. However, it now seems like the
timer interrupt isn't the only interrupt suffering from this issue.

So, I think inserting the small delay in the appropriate IRQ
handler might fix this, too.

But there's still the question of why the delay is actually needed for
NForce2 boards. If my theory holds, you would basically have to
introduce the delay for *every* IRQ, to avoid a lockup of any device
that generates high load at some point. I bet that, if I put my Firewire
Card back in (or just use the onboard Firewire ports) and stream a
video from my DV cam onto the harddisk, it would lock up as well after
a very short time, since those who know DV also know that DV has a
very high bandwidth, half an hour of film is like 40GB or the
like. (However, I can't test this right now, because my DV cam is
currently not accessible)
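
For scale, the DV figure can be estimated. The sketch below assumes a
total DV stream rate of about 3.6 MB/s (the commonly quoted figure for
25 Mbit/s video plus audio and subcode); that rate and the function name
are assumptions, not something measured in this thread.

```c
/* Assumed total DV stream rate in kB/s (about 3.6 MB/s). */
#define DV_RATE_KB_PER_S 3600ul

/* Approximate size in MB of a DV capture lasting 'minutes' minutes. */
unsigned long dv_capture_mb(unsigned long minutes)
{
	return minutes * 60ul * DV_RATE_KB_PER_S / 1000ul;
}
```

Under that assumption dv_capture_mb(30) gives 6480, i.e. roughly 6.5 GB
for half an hour. That is well below the 40GB guessed above, but it is
still a sustained stream of several MB/s onto the disk.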

So, we're still not "rock solid" with NForce2, I guess...
Any idea?

Regards,
Julien


* Re: IRQ disabled (SATA) on NForce2 and my theory
  2003-12-14 13:14 IRQ disabled (SATA) on NForce2 and my theory Julien Oster
@ 2003-12-15  6:06 ` Bob
  2003-12-15 14:55   ` Bartlomiej Zolnierkiewicz
  0 siblings, 1 reply; 7+ messages in thread
From: Bob @ 2003-12-15  6:06 UTC
  To: linux-kernel

sii chips have a long history of needing the
interrupt unmask feature turned off with hdparm.

I don't know about that chip, but for the
sii680 there is a special option "-p9"
for hdparm; that is, PIO mode 9
is a special instruction, in addition to the
standard hdparm option "-u0" that turns off
IRQ unmasking.

/sbin/hdparm -d1 -c1 -p9 -X70 -u0 -k0 -i $a

Also, the sii SATA chips can be driven by the low-level
SCSI driver (kernel config CONFIG_SCSI_SATA=y),
which you should read about in this list archive.
I don't personally know about that.

-Bob




* Re: IRQ disabled (SATA) on NForce2 and my theory
  2003-12-15  6:06 ` Bob
@ 2003-12-15 14:55   ` Bartlomiej Zolnierkiewicz
  2003-12-16  3:56     ` Bob
  0 siblings, 1 reply; 7+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2003-12-15 14:55 UTC
  To: Bob; +Cc: linux-kernel

On Monday 15 of December 2003 07:06, Bob wrote:
> sii chips have a long history of needing to
> hdparm off the unmask interrupt feature.
>
> I don't know about that chip but for
> sii680 there is a special option "-p9"
> for hdparm which is to say pio mode 9
> is a special instruction in addition to
> standard hdparm opt "-u0" turning off
> irq unmask.

There is no such thing as 'special option "-p9"' for sii680.

> /sbin/hdparm -d1 -c1 -p9 -X70 -u0 -k0 -i $a

-X70 is only valid if your device is UDMA133.
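
The -X argument is an ATA transfer-mode number: per the hdparm man page,
PIO mode n is 8+n, multiword DMA mode n is 32+n and Ultra DMA mode n is
64+n, so 70 = 64 + 6 selects UDMA mode 6 (UDMA133). A small decoder, as
a sketch only; the enum, struct and function names are hypothetical, and
only the standard mode ranges are handled.

```c
/* Transfer-mode classes encoded in an hdparm -X value. */
enum xfer_class { XFER_OTHER, XFER_PIO, XFER_MWDMA, XFER_UDMA };

struct xfer_mode {
	enum xfer_class cls;
	int mode;  /* mode number within the class, -1 if not decoded */
};

/* Decode an hdparm -X value into class and mode number. */
struct xfer_mode decode_xfer(int x)
{
	struct xfer_mode m = { XFER_OTHER, -1 };

	if (x >= 64 && x <= 70) {         /* 64+n: Ultra DMA mode n */
		m.cls = XFER_UDMA;
		m.mode = x - 64;
	} else if (x >= 32 && x <= 34) {  /* 32+n: multiword DMA mode n */
		m.cls = XFER_MWDMA;
		m.mode = x - 32;
	} else if (x >= 8 && x <= 12) {   /* 8+n: PIO mode n */
		m.cls = XFER_PIO;
		m.mode = x - 8;
	}
	return m;
}

/* Nominal burst rate in MB/s for UDMA modes 0..6, 0 if out of range. */
int udma_rate_mb(int mode)
{
	static const int rate[] = { 16, 25, 33, 44, 66, 100, 133 };

	return (mode >= 0 && mode <= 6) ? rate[mode] : 0;
}
```

Here decode_xfer(70) yields UDMA mode 6 and udma_rate_mb(6) yields 133,
which is why the value only makes sense for a UDMA133-capable device.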

--bart



* Re: IRQ disabled (SATA) on NForce2 and my theory
  2003-12-15 14:55   ` Bartlomiej Zolnierkiewicz
@ 2003-12-16  3:56     ` Bob
  2003-12-16 20:00       ` Bartlomiej Zolnierkiewicz
  0 siblings, 1 reply; 7+ messages in thread
From: Bob @ 2003-12-16  3:56 UTC
  To: linux-kernel

Bartlomiej Zolnierkiewicz wrote:

>On Monday 15 of December 2003 07:06, Bob wrote:
>
>>sii chips have a long history of needing to
>>hdparm off the unmask interrupt feature.
>>
>>I don't know about that chip but for
>>sii680 there is a special option "-p9"
>>for hdparm which is to say pio mode 9
>>is a special instruction in addition to
>>standard hdparm opt "-u0" turning off
>>irq unmask.
>
>There is no such thing as 'special option "-p9"' for sii680.

Passing PIO mode 9 to sii680 will make it do udma133 with
unmask off, same as "-X70 -u0". What sii did was to make a
bug a feature by embedding their own special pio mode for the
well-known cmdxxx unmask off requirement.

Making A Bug A Feature is begging for "deprecation".

Since -p9 was only documented to set UDMA133 and turn unmask off
(making a bug a feature), users would not expect it to set anything
else; other features still need the other (normal) hdparm options.
So one might as well read "man hdparm" and bypass the silly
kludge, which was probably an internal office joke anyway.

-Bob

>>/sbin/hdparm -d1 -c1 -p9 -X70 -u0 -k0 -i $a
>
>-X70 is only valid if your device is UDMA133.
>
>--bart



* Re: IRQ disabled (SATA) on NForce2 and my theory
  2003-12-16  3:56     ` Bob
@ 2003-12-16 20:00       ` Bartlomiej Zolnierkiewicz
  0 siblings, 0 replies; 7+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2003-12-16 20:00 UTC
  To: Bob; +Cc: linux-kernel

On Tuesday 16 of December 2003 04:56, Bob wrote:
> Bartlomiej Zolnierkiewicz wrote:
> >On Monday 15 of December 2003 07:06, Bob wrote:
> >>sii chips have a long history of needing to
> >>hdparm off the unmask interrupt feature.
> >>
> >>I don't know about that chip but for
> >>sii680 there is a special option "-p9"
> >>for hdparm which is to say pio mode 9
> >>is a special instruction in addition to
> >>standard hdparm opt "-u0" turning off
> >>irq unmask.
> >
> >There is no such thing as 'special option "-p9"' for sii680.
>
> Passing PIO mode 9 to sii680 will make it do udma133 with
> unmask off, same as "-X70 -u0". What sii did was to make a
> bug a feature by embedding their own special pio mode for the
> well-known cmdxxx unmask off requirement.

Please point me to the code or documentation...

> Making A Bug A Feature is begging for "deprecation".
>
> Since -p9 was only documented to set u133 and unmask off,

Where is it documented?

> making a bug a feature, non-bug features are not user-expected
> to be set without using other(normal) hdparm options, so
> somebody might as well "man hdparm" and bypass the silly
> kludge which probably was an internal office joke anyway.

Are you talking about drivers/ide/pci/siimage.c driver or something else?

> -Bob
>
> >>/sbin/hdparm -d1 -c1 -p9 -X70 -u0 -k0 -i $a
> >
> >-X70 is only valid if your device is UDMA133.
> >
> >--bart



* Re: IRQ disabled (SATA) on NForce2 and my theory
  2003-12-15  1:13 Ross Dickson
@ 2003-12-15  6:22 ` Bob
  0 siblings, 0 replies; 7+ messages in thread
From: Bob @ 2003-12-15  6:22 UTC
  To: linux-kernel

Ross Dickson wrote:

><snip>
>
>>It was most noticeable for the timer interrupt, because the timer
>>interrupt is basically always at "high load" and a lack of it would
>>result in a hard lockup of the board. However, it now seems like the
>>timer interrupt isn't the only interrupt suffering from this issue.
>>
>>So, I think inserting the small delay in the appropriate IRQ
>>handler might fix this, too.
>>
>>But there's still the question, why the delay is actually needed for
>>NForce2 boards. That would basically mean that you'll have to
>>introduce the delay for *every* IRQ, to avoid a lockup of any device
>>that will do high load at some time. I bet that, if I put my Firewire
>>Card back in (or just use the onboard Firewire ports) and stream a
>>video from my DV cam onto the harddisk, it would lock up as well after
>>a very short time, since those who know DV also know that DV has a
>>very high bandwidth, half an hour of film is like 40GB or the
>>like. (However, I can't test this right now, because my DV cam is
>>currently not accessible)

I can't get USB or AGP 8x to work without crashing,
though IDE, both onboard and on cards, is stable. It might
be that a delay on all interrupts is needed.

>>So, we're still not "rock solid" with NForce2, I guess...
>>Any idea?
>>
>>Regards,
>>Julien
>
>If it is a C1 disconnect reconnection as we suspect then it should affect all
>local apic interrupts so the following is the conservative approach which may help.
>
>here is one of many educated? guesswork posts on topic
>
>http://linux.derkeiler.com/Mailing-Lists/Kernel/2003-12/2307.html
>
>If you want to put the delay in all local apic irq acks then remove the delay code
>from apic.c and put it in 
>
>/usr/src/linux-2.4.23-rd2/include/asm-i386/apic.h
>
>also needs to bring in delay.h
>
>I am trying it now for my patched 2.4.23 (it boots and runs OK so far) kern as 
>follows but if it is really needed there then it would be better merged within
>the macro style of the bad ioapic selection code and given a kernel config
>selection mechanism.
>
>
>#ifndef __ASM_APIC_H
>#define __ASM_APIC_H
>
>#include <linux/config.h>
>#include <linux/pm.h>
>#include <asm/apicdef.h>
>#include <asm/system.h>
>#include <linux/delay.h>
>
><snip>
>
>static inline void ack_APIC_irq(void)
>{
>
>#if defined(CONFIG_MK7) && defined(CONFIG_BLK_DEV_AMD74XX)
>	/*
>	 * on 2200XP & nforce2 chipset we need at least 500ns delay here
>	 * to stop lockups with udma100 drive. try to scale delay time
>	 * with cpu speed. Ross Dickson.
>	 */
>	ndelay((cpu_khz >> 12)+200 ); /* don't ack too soon or hard lockup */
>#endif
>
>	/*
>	 * ack_APIC_irq() actually gets compiled as a single instruction:
>	 * - a single rmw on Pentium/82489DX
>	 * - a single write on P6+ cores (CONFIG_X86_GOOD_APIC)
>	 * ... yummie.
>	 */
>
>	/* Docs say use 0 for future compatibility */
>	apic_write_around(APIC_EOI, 0);
>}
><snip>
>
>Also note I stuffed up the syntax of the original #ifdef, code still works but
>only tests the first param not both. The ifdef code should also be adjusted for
>the ioapic patch if it is to be used widely on other chipsets and processor types.
>Also others more familiar with the kernel build system could choose better
>config params to test against.
>
>Anyhow we are still flying blind so far as manufacturers comments on this
>topic.

...instead of sending nforce2 boards back saying they don't
work with Linux, try using competitive jealousy between
Phoenix and Award, since Award has a BIOS update which
fixes the lockups (except USB and AGP 8x and maybe Firewire).

>So maybe occasional lockups could be caused by this on other AMD cpu systems?
>I don't know.
>
>Don't forget to recompile & install modules given the inline code above.
>
>Please cc me as to how it goes if you try it.
>
>Regards
>Ross
>
Another clue to all of this is there is a delay for onboard
amd74xx and that never crashed for me, but offboard
promise and sii did, on fsck or grep etc. I'm hoping these
latest patches with debug=1 will give a clue what the
Award bios update does!

-Bob


* IRQ disabled (SATA) on NForce2 and my theory
@ 2003-12-15  1:13 Ross Dickson
  2003-12-15  6:22 ` Bob
  0 siblings, 1 reply; 7+ messages in thread
From: Ross Dickson @ 2003-12-15  1:13 UTC
  To: lkml-2315; +Cc: Jamie Lokier, forming, Ian Kumlien, linux-kernel

<snip>
> It was most noticeable for the timer interrupt, because the timer
> interrupt is basically always at "high load" and a lack of it would
> result in a hard lockup of the board. However, it now seems like the
> timer interrupt isn't the only interrupt suffering from this issue.
>
> So, I think inserting the small delay in the appropriate IRQ
> handler might fix this, too.
>
> But there's still the question, why the delay is actually needed for
> NForce2 boards. That would basically mean that you'll have to
> introduce the delay for *every* IRQ, to avoid a lockup of any device
> that will do high load at some time. I bet that, if I put my Firewire
> Card back in (or just use the onboard Firewire ports) and stream a
> video from my DV cam onto the harddisk, it would lock up as well after
> a very short time, since those who know DV also know that DV has a
> very high bandwidth, half an hour of film is like 40GB or the
> like. (However, I can't test this right now, because my DV cam is
> currently not accessible)
>
> So, we're still not "rock solid" with NForce2, I guess...
> Any idea?
>
> Regards,
> Julien

If it is a C1 disconnect/reconnect problem, as we suspect, then it should
affect all local APIC interrupts, so the following is the conservative
approach, which may help.

Here is one of many educated(?) guesswork posts on the topic:

http://linux.derkeiler.com/Mailing-Lists/Kernel/2003-12/2307.html

If you want to put the delay in all local APIC IRQ acks, then remove the
delay code from apic.c and put it in

/usr/src/linux-2.4.23-rd2/include/asm-i386/apic.h

which then also needs to include delay.h.

I am trying it now on my patched 2.4.23 kernel (it boots and runs OK so
far) as follows, but if it is really needed there, it would be better
merged within the macro style of the bad ioapic selection code and given
a kernel config selection mechanism.


#ifndef __ASM_APIC_H
#define __ASM_APIC_H

#include <linux/config.h>
#include <linux/pm.h>
#include <asm/apicdef.h>
#include <asm/system.h>
#include <linux/delay.h>

<snip>

static inline void ack_APIC_irq(void)
{

#if defined(CONFIG_MK7) && defined(CONFIG_BLK_DEV_AMD74XX)
	/*
	 * on 2200XP & nforce2 chipset we need at least 500ns delay here
	 * to stop lockups with udma100 drive. try to scale delay time
	 * with cpu speed. Ross Dickson.
	 */
	ndelay((cpu_khz >> 12)+200 ); /* don't ack too soon or hard lockup */
#endif

	/*
	 * ack_APIC_irq() actually gets compiled as a single instruction:
	 * - a single rmw on Pentium/82489DX
	 * - a single write on P6+ cores (CONFIG_X86_GOOD_APIC)
	 * ... yummie.
	 */

	/* Docs say use 0 for future compatibility */
	apic_write_around(APIC_EOI, 0);
}
<snip>

Also note that I stuffed up the syntax of the original #ifdef: the code
still works, but it only tests the first param, not both (#ifdef takes a
single macro name, so anything after it is not evaluated). The ifdef code
should also be adjusted for the ioapic patch if it is to be used widely
on other chipsets and processor types. Also, others more familiar with
the kernel build system could choose better config params to test against.
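
To see what the ndelay() argument above works out to, here is the same
expression evaluated in user space. This is only a sketch: ack_delay_ns
is a hypothetical helper, and the clock values are example inputs (an
Athlon XP 2200+ runs at 1.8 GHz, so cpu_khz would be about 1800000).

```c
/* Same arithmetic as the ndelay() argument: (cpu_khz >> 12) + 200. */
unsigned long ack_delay_ns(unsigned long cpu_khz)
{
	return (cpu_khz >> 12) + 200;
}
```

ack_delay_ns(1800000) gives 639, comfortably above the 500ns mentioned
in the comment. The result only reaches 500 at cpu_khz = 1228800 (about
1.23 GHz), so on slower CPUs this formula would yield less than 500ns.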

Anyhow, we are still flying blind as far as manufacturers' comments on
this topic go.

So maybe occasional lockups could be caused by this on other AMD cpu systems?
I don't know.

Don't forget to recompile & install modules given the inline code above.

Please cc me as to how it goes if you try it.

Regards
Ross





