linux-kernel.vger.kernel.org archive mirror
* IRQ disabled (SATA) on NForce2 and my theory
@ 2003-12-14 13:14 Julien Oster
From: Julien Oster @ 2003-12-14 13:14 UTC (permalink / raw)
  To: Linux Kernel Mailing List


Hello!

I have an ASUS A7N8X Deluxe v2.0 with APIC and I/O APIC enabled, thanks
to athcool. (I didn't apply any patches; I just disable CPU Disconnect
with 'athcool off' as the first thing on boot.)

Now, however, since I am running with APIC, the following error occurs
quite often:

[...]
Dec  8 19:16:20 frodo kernel: hde: DMA disabled
Dec  8 19:16:20 frodo kernel: ide2: reset phy, status=0x00000113, siimage_reset
[...]

Shortly after that, the kernel reports:

Dec  8 19:16:21 frodo kernel: Disabling IRQ #18
Dec  8 19:16:22 frodo kernel: irq 18: nobody cared!

This sometimes happens under very high load on my onboard SATA controller,
where both hard drives (fast 10000rpm Raptors) are attached and form a
Linux software RAID0. IRQ 18 is assigned to that controller.

The drive/controller won't recover afterwards; only a reboot helps.
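
For readers wondering how a message like "Disabling IRQ #18" can come about, here is a minimal sketch of unhandled-interrupt accounting. It is a simplified model and not the kernel's actual code: the struct, the function name and both thresholds are assumptions chosen for illustration. The idea is that the kernel counts interrupts on a line, and if nearly all of them in a window were claimed by no handler, it treats the line as stuck and masks it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified per-IRQ bookkeeping (not the kernel's real struct). */
struct irq_stats {
    unsigned int count;      /* interrupts seen in the current window */
    unsigned int unhandled;  /* how many of those no handler claimed */
    bool disabled;           /* set once the line has been masked */
};

/* Illustrative thresholds; assumptions made for this sketch. */
enum { WINDOW = 100000, UNHANDLED_LIMIT = 99900 };

/* Record one interrupt. Returns true if this call disabled the line. */
bool note_interrupt_sketch(struct irq_stats *s, bool handled)
{
    bool stuck;

    assert(s != NULL);
    if (s->disabled)
        return false;            /* line already masked, nothing to count */

    if (!handled)
        s->unhandled++;
    if (++s->count < WINDOW)
        return false;            /* window not complete yet */

    /* Window complete: decide, then start a fresh window. */
    stuck = s->unhandled > UNHANDLED_LIMIT;
    s->count = 0;
    s->unhandled = 0;
    if (stuck)
        s->disabled = true;      /* corresponds to "Disabling IRQ #n" */
    return stuck;
}
```

In this model a line is only masked when almost every interrupt in a full window went unclaimed, so a device that keeps asserting its interrupt while the driver no longer recognises it would end up disabled.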

Now, my theory about this: One patch to fix the NForce2 lockups was to
insert a small delay in the acknowledgement of the timer
interrupt. Apparently, the machine would lock up if the timer
interrupt gets acknowledged too fast, meaning too soon.

I now suspect that my IRQ18 problems are a result of exactly the same
cause: IRQ18 gets acknowledged too soon under very high load, so no
further interrupts arrive and disk operations come to a
halt.

It was most noticeable for the timer interrupt, because the timer
interrupt is basically always at "high load" and a lack of it would
result in a hard lockup of the board. However, it now seems like the
timer interrupt isn't the only interrupt suffering from this issue.

So, I think inserting the small delay in the appropriate IRQ
handler might fix this, too.

But there's still the question of why the delay is actually needed for
NForce2 boards. It would basically mean that you'd have to
introduce the delay for *every* IRQ, to avoid a lockup of any device
that generates high load at some point. I bet that if I put my Firewire
card back in (or just used the onboard Firewire ports) and streamed
video from my DV cam onto the hard disk, it would lock up as well after
a very short time, since those who know DV also know that it has a
very high bandwidth; half an hour of film is something like 40GB.
(However, I can't test this right now, because my DV cam is
currently not accessible.)

So, we're still not "rock solid" with NForce2, I guess...
Any idea?

Regards,
Julien

* IRQ disabled (SATA) on NForce2 and my theory
@ 2003-12-15  1:13 Ross Dickson
From: Ross Dickson @ 2003-12-15  1:13 UTC (permalink / raw)
  To: lkml-2315; +Cc: Jamie Lokier, forming, Ian Kumlien, linux-kernel

<snip>
> It was most noticeable for the timer interrupt, because the timer
> interrupt is basically always at "high load" and a lack of it would
> result in a hard lockup of the board. However, it now seems like the
> timer interrupt isn't the only interrupt suffering from this issue.
>
> So, I think inserting the small delay in the appropriate IRQ
> handler might fix this, too.
>
> But there's still the question, why the delay is actually needed for
> NForce2 boards. That would basically mean that you'll have to
> introduce the delay for *every* IRQ, to avoid a lockup of any device
> that will do high load at some time. I bet that, if I put my Firewire
> Card back in (or just use the onboard Firewire ports) and stream a
> video from my DV cam onto the harddisk, it would lock up as well after
> a very short time, since those who know DV also know that DV has a
> very high bandwidth, half an hour of film is like 40GB or the
> like. (However, I can't test this right now, because my DV cam is
> currently not accessible)
>
> So, we're still not "rock solid" with NForce2, I guess...
> Any idea?
>
> Regards,
> Julien


If the cause is C1 disconnect/reconnect timing, as we suspect, then it should
affect all local APIC interrupts, so the following conservative approach may help.

Here is one of many educated(?) guesswork posts on the topic:

http://linux.derkeiler.com/Mailing-Lists/Kernel/2003-12/2307.html

If you want to put the delay in all local APIC IRQ acks, then remove the delay code
from apic.c and put it in

/usr/src/linux-2.4.23-rd2/include/asm-i386/apic.h

which then also needs to include delay.h.

I am trying it now on my patched 2.4.23 kernel (it boots and runs OK so far) as
follows. If the delay really is needed there, though, it would be better merged into
the macro style of the bad ioapic selection code and given a kernel config
selection mechanism.


#ifndef __ASM_APIC_H
#define __ASM_APIC_H

#include <linux/config.h>
#include <linux/pm.h>
#include <asm/apicdef.h>
#include <asm/system.h>
#include <linux/delay.h>

<snip>

static inline void ack_APIC_irq(void)
{

#if defined(CONFIG_MK7) && defined(CONFIG_BLK_DEV_AMD74XX)
	/*
	 * on 2200XP & nforce2 chipset we need at least 500ns delay here
	 * to stop lockups with udma100 drive. try to scale delay time
	 * with cpu speed. Ross Dickson.
	 */
	ndelay((cpu_khz >> 12) + 200); /* don't ack too soon or hard lockup */
#endif

	/*
	 * ack_APIC_irq() actually gets compiled as a single instruction:
	 * - a single rmw on Pentium/82489DX
	 * - a single write on P6+ cores (CONFIG_X86_GOOD_APIC)
	 * ... yummie.
	 */

	/* Docs say use 0 for future compatibility */
	apic_write_around(APIC_EOI, 0);
}
<snip>
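
To get a feel for the numbers, the expression passed to ndelay() can be evaluated outside the kernel. This is a minimal sketch, assuming cpu_khz holds the CPU clock in kHz and that a 2200XP runs at roughly 1.8 GHz. The helper name ack_delay_ns and the ACK_DELAY_BASE_NS macro are made up for this illustration and are not part of the patch.

```c
#include <assert.h>

/* Fixed part of the delay in nanoseconds (the "+200" in the patch). */
#define ACK_DELAY_BASE_NS 200UL

/*
 * Hypothetical helper mirroring the argument of ndelay() above:
 * cpu_khz / 4096 (via the shift) plus a fixed 200 ns.
 */
unsigned long ack_delay_ns(unsigned long cpu_khz)
{
    assert(cpu_khz > 0);   /* a zero clock would mean cpu_khz was never set */
    return (cpu_khz >> 12) + ACK_DELAY_BASE_NS;
}
```

With cpu_khz = 1800000 the shift yields 439, so the delay comes out at 639 ns. Under this formula the result only reaches the 500 ns mentioned in the comment once cpu_khz >> 12 is at least 300, that is from about 1.23 GHz upward.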

Also note that I stuffed up the syntax of the original #ifdef: the code still works,
but it only tests the first parameter, not both. The #if condition should also be
adjusted for the ioapic patch if it is to be used widely on other chipsets and
processor types. Others more familiar with the kernel build system could also choose
better config params to test against.
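
The difference between the two directive forms can be sketched as follows. This is an illustration only: the exact text of the original faulty directive is not shown in this post, so the "#ifdef A && B" form in the comment is an assumption about the kind of mistake, and the CONFIG_DEMO_* names are made up.

```c
#include <assert.h>

#define CONFIG_DEMO_A 1
/* CONFIG_DEMO_B is deliberately left undefined. */

/*
 * Faulty form, kept in a comment because compilers complain about it:
 *
 *     #ifdef CONFIG_DEMO_A && CONFIG_DEMO_B
 *
 * #ifdef takes a single identifier. GCC, for instance, warns about the
 * extra tokens and then tests CONFIG_DEMO_A alone, which is what the
 * FIRST_ONLY_SET test below reproduces.
 */
#ifdef CONFIG_DEMO_A
#define FIRST_ONLY_SET 1
#else
#define FIRST_ONLY_SET 0
#endif

/* Correct form: #if with defined() really requires both symbols. */
#if defined(CONFIG_DEMO_A) && defined(CONFIG_DEMO_B)
#define BOTH_SET 1
#else
#define BOTH_SET 0
#endif

/* Compile-time checks: only the first symbol is defined in this file. */
static_assert(FIRST_ONLY_SET == 1, "single-symbol test passes");
static_assert(BOTH_SET == 0, "two-symbol test correctly fails");
```

With only the first symbol defined, the single-identifier test enables the guarded code while the two-symbol test does not, which matches the "works, but only tests the first param" behaviour described above.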

Anyhow, we are still flying blind as far as manufacturers' comments on this
topic go.

So maybe occasional lockups could be caused by this on other AMD cpu systems?
I don't know.

Don't forget to recompile & install modules given the inline code above.

Please cc me as to how it goes if you try it.

Regards
Ross





