linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: 2.6.20->2.6.21 - networking dies after random time
@ 2007-06-29  8:50 Jean-Baptiste Vignaud
  2007-06-29 15:07 ` Jarek Poplawski
  0 siblings, 1 reply; 69+ messages in thread
From: Jean-Baptiste Vignaud @ 2007-06-29  8:50 UTC (permalink / raw)
  To: jarkao2; +Cc: linux-kernel, marcin.slusarz, shemminger, linux-net, netdev

Update...
I did 2 tests :

1)  booted with option acpi=off
It booted correctly, i managed to get some load on one of the card and after a while (10 minutes i guess) the Timeout occurs. Side effect, at the same moment the sata contolers lost control of the disks somehow and the raid 5 array on the system crashed hard. I have no traces as i was unable to rebuild it (and i tried a lot of extreme  voodoo methods).

2) changed the 3com cards
i replaced by two cards,
01:06.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 42)
01:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)

reinstalled and stressed the network (small download from a laptop) and :

Jun 29 09:34:10 loki kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jun 29 09:34:51 loki last message repeated 14 times
Jun 29 09:35:18 loki last message repeated 8 times

so it seems to be a more generic problem.

(i'v updated the fedora bugzilla aswell)

did not test the  "[PATCH] 8139cp dev->tx_timeout" yet.

JB


> On Tue, Jun 26, 2007 at 04:24:07PM +0200, Jean-Baptiste Vignaud wrote:
> > Hello, i have a very similar problem with 2.6.21 also;
> > 
> > 2 3com NICs and they are failling randomly.
> > 
> > The kernel is a basic fedora 7 kernel (2.6.21-1.3228.fc7)
> > I found a bug report and added details here : https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=243960
> > 
> > I'm not subcribed on this list, so please cc me if there is any questions.
> > 
> > JB
> > 
> > > On Tue, Jun 26, 2007 at 08:10:17AM +0200, Marcin Ślusarz wrote:
> > > ...
> > > > I reproduced it on minimal config:
> ...
> > > We know your hardware should be OK - since it was fine with 2.6.20.
> ...
> 
> It looks like there is something common in the air...
> 
> Marcin: ne2k_pci with 8390, Jean: 3com, and now I see
> similar problem with 8139cp too (plus some ideas):
> 
> http://marc.info/?l=linux-netdev&m=118293314109648&w=2
> 
> So, you probably should wait a little & look for new patches here.
> 
> Cheers,
> Jarek P.
> 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-06-29  8:50 2.6.20->2.6.21 - networking dies after random time Jean-Baptiste Vignaud
@ 2007-06-29 15:07 ` Jarek Poplawski
  2007-07-23  5:44   ` Marcin Ślusarz
  0 siblings, 1 reply; 69+ messages in thread
From: Jarek Poplawski @ 2007-06-29 15:07 UTC (permalink / raw)
  To: Jean-Baptiste Vignaud
  Cc: linux-kernel, marcin.slusarz, shemminger, linux-net, netdev

On Fri, Jun 29, 2007 at 10:50:20AM +0200, Jean-Baptiste Vignaud wrote:
> Update...
> I did 2 tests :
> 
> 1)  booted with option acpi=off
> It booted correctly, i managed to get some load on one of the card
> and after a while (10 minutes i guess) the Timeout occurs. Side effect,
> at the same moment the sata contolers lost control of the disks somehow
> and the raid 5 array on the system crashed hard. I have no traces as i
> was unable to rebuild it (and i tried a lot of extreme  voodoo methods).

I think the main option: acpi=on is usually needed.

If you, guys, are not exhausted yet, I think you could try to
turn off (or change for somethig else) most of the options from
"Processors type and features", and maybe something below PCI
support. But there are many new options which couldn't be turned
off so easy, so there is no much hope...

> 
> 2) changed the 3com cards
> i replaced by two cards,
> 01:06.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 42)
> 01:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)
> 
> reinstalled and stressed the network (small download from a laptop) and :
> 
> Jun 29 09:34:10 loki kernel: NETDEV WATCHDOG: eth0: transmit timed out
> Jun 29 09:34:51 loki last message repeated 14 times
> Jun 29 09:35:18 loki last message repeated 8 times
> 
> so it seems to be a more generic problem.

I wonder if you tried to change the place - I've read this
advice many times. And maybe it would be better to try with
one card at first?

It seems there are some patches with dev->tx_timeout but it
looks like fixing results only. Let's wait...

Cheers,
Jarek P.

PS: Marcin - your last message wasn't plain text - so probably
dumped by kernel lists. 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-06-29 15:07 ` Jarek Poplawski
@ 2007-07-23  5:44   ` Marcin Ślusarz
  2007-07-23  8:53     ` Jarek Poplawski
                       ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Marcin Ślusarz @ 2007-07-23  5:44 UTC (permalink / raw)
  To: Jarek Poplawski, Jean-Baptiste Vignaud, linux-kernel, shemminger,
	linux-net, netdev, Ingo Molnar, Thomas Gleixner, Andrew Morton,
	Linus Torvalds

Ok, I've bisected this problem and found that this patch broke my NIC:

76d2160147f43f982dfe881404cfde9fd0a9da21 is first bad commit
commit 76d2160147f43f982dfe881404cfde9fd0a9da21
Author: Ingo Molnar <mingo@elte.hu>
Date:   Fri Feb 16 01:28:24 2007 -0800

    [PATCH] genirq: do not mask interrupts by default

    Never mask interrupts immediately upon request.  Disabling interrupts in
    high-performance codepaths is rare, and on the other hand this change could
    recover lost edges (or even other types of lost interrupts) by
conservatively
    only masking interrupts after they happen.  (NOTE: with this change the
    highlevel irq-disable code still soft-disables this IRQ line - and
if such an
    interrupt happens then the IRQ flow handler keeps the IRQ masked.)

    Mark i8529A controllers as 'never loses an edge'.

    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=76d2160147f43f982dfe881404cfde9fd0a9da21

After reverting it on top of 2.6.21.3 (with
d7e25f3394ba05a6d64cb2be42c2765fe72ea6b2 - [PATCH] genirq: remove
IRQ_DISABLED (which ment "remove IRQ_DELAYED_DISABLE")), the problem
didn't show up :)
(http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d7e25f3394ba05a6d64cb2be42c2765fe72ea6b2)

So I cooked patch like below and everything is working fine (so far)

Fix default_disable interrupt function (broken by [PATCH] genirq: do
not mask interrupts by default) - revert removal of codepath which was
invoked when removed flag (IRQ_DELAYED_DISABLE) wag NOT set

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
---
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 76a9106..0bb23cd 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -230,6 +230,8 @@ static void default_enable(unsigned int irq)
  */
 static void default_disable(unsigned int irq)
 {
+	struct irq_desc *desc = irq_desc + irq;
+	desc->chip->mask(irq);
 }

 /*

(Sorry for whitespace damage, but I have to send it from webmail :|)
(I'm a kernel noob, so don't kill me if my patch is wrong ;)
ps: Here is the beginning of this thread: http://lkml.org/lkml/2007/6/16/182


Regards,
Marcin Slusarz

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-23  5:44   ` Marcin Ślusarz
@ 2007-07-23  8:53     ` Jarek Poplawski
  2007-07-24  7:18     ` Jarek Poplawski
  2007-07-24  8:05     ` Ingo Molnar
  2 siblings, 0 replies; 69+ messages in thread
From: Jarek Poplawski @ 2007-07-23  8:53 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Ingo Molnar, Thomas Gleixner, Andrew Morton,
	Linus Torvalds

On Mon, Jul 23, 2007 at 07:44:58AM +0200, Marcin Ślusarz wrote:
> Ok, I've bisected this problem and found that this patch broke my NIC:

Congratulations!

> 
> 76d2160147f43f982dfe881404cfde9fd0a9da21 is first bad commit
> commit 76d2160147f43f982dfe881404cfde9fd0a9da21
> Author: Ingo Molnar <mingo@elte.hu>
> Date:   Fri Feb 16 01:28:24 2007 -0800
> 
>    [PATCH] genirq: do not mask interrupts by default
...
> So I cooked patch like below and everything is working fine (so far)
> 
> Fix default_disable interrupt function (broken by [PATCH] genirq: do
> not mask interrupts by default) - revert removal of codepath which was
> invoked when removed flag (IRQ_DELAYED_DISABLE) wag NOT set
> 
> Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
> ---
> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
> index 76a9106..0bb23cd 100644
> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -230,6 +230,8 @@ static void default_enable(unsigned int irq)
>  */
> static void default_disable(unsigned int irq)
> {
> +	struct irq_desc *desc = irq_desc + irq;
> +	desc->chip->mask(irq);
> }
> 
> /*

I think your patch should very good point the source of the problem
and would help to many people, but it looks like too arbitrary for
those who didn't have such problems. It seems it was mainly with
x86_64, so maybe something like this below would be enough?

Cheers,
Jarek P.

PS: not tested!

---

diff -Nurp 2.6.22-/arch/x86_64/kernel/io_apic.c 2.6.22/arch/x86_64/kernel/io_apic.c
--- 2.6.22-/arch/x86_64/kernel/io_apic.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.22/arch/x86_64/kernel/io_apic.c	2007-07-23 10:33:05.000000000 +0200
@@ -1427,6 +1427,7 @@ static struct irq_chip ioapic_chip __rea
 	.name 		= "IO-APIC",
 	.startup 	= startup_ioapic_irq,
 	.mask	 	= mask_IO_APIC_irq,
+	.disable	= mask_IO_APIC_irq,
 	.unmask	 	= unmask_IO_APIC_irq,
 	.ack 		= ack_apic_edge,
 	.eoi 		= ack_apic_level,

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-23  5:44   ` Marcin Ślusarz
  2007-07-23  8:53     ` Jarek Poplawski
@ 2007-07-24  7:18     ` Jarek Poplawski
  2007-07-24  8:05     ` Ingo Molnar
  2 siblings, 0 replies; 69+ messages in thread
From: Jarek Poplawski @ 2007-07-24  7:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Marcin Ślusarz, Jean-Baptiste Vignaud, linux-kernel,
	shemminger, linux-net, netdev, Ingo Molnar, Thomas Gleixner,
	Andrew Morton, Linus Torvalds

On Mon, Jul 23, 2007 at 07:44:58AM +0200, Marcin Ślusarz wrote:
> Ok, I've bisected this problem and found that this patch broke my NIC:
> 
> 76d2160147f43f982dfe881404cfde9fd0a9da21 is first bad commit
> commit 76d2160147f43f982dfe881404cfde9fd0a9da21
> Author: Ingo Molnar <mingo@elte.hu>
> Date:   Fri Feb 16 01:28:24 2007 -0800
> 
>    [PATCH] genirq: do not mask interrupts by default
> 
>    Never mask interrupts immediately upon request.  Disabling interrupts in
>    high-performance codepaths is rare, and on the other hand this change 
>    could
>    recover lost edges (or even other types of lost interrupts) by
> conservatively
>    only masking interrupts after they happen.  (NOTE: with this change the
>    highlevel irq-disable code still soft-disables this IRQ line - and
> if such an
>    interrupt happens then the IRQ flow handler keeps the IRQ masked.)
> 
>    Mark i8529A controllers as 'never loses an edge'.
> 
>    Signed-off-by: Ingo Molnar <mingo@elte.hu>
>    Cc: Thomas Gleixner <tglx@linutronix.de>
>    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

So, it seems nobody (except the users) cares...

BTW, maybe there should be created something like "Network Cards
Producers Made Rich on Unnecessary Changed Cards Linux Foundation"?:

On Fri, Jun 29, 2007 at 10:50:20AM +0200, Jean-Baptiste Vignaud wrote:
...
> 2) changed the 3com cards
> i replaced by two cards,
> 01:06.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 42)
> 01:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)
> 
> reinstalled and stressed the network (small download from a laptop) and :
> 
> Jun 29 09:34:10 loki kernel: NETDEV WATCHDOG: eth0: transmit timed out
> Jun 29 09:34:51 loki last message repeated 14 times
> Jun 29 09:35:18 loki last message repeated 8 times

...Of course, no response of any "serious" developer for this as well.

BTW #2: I wonder how true is this (after above-mentioned patch):

>From include/linux/irq.h:
> /**
>  * struct irq_chip - hardware interrupt chip descriptor
...
>  * @disable:		disable the interrupt (defaults to chip->mask if NULL)

Regards,
Jarek P.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-23  5:44   ` Marcin Ślusarz
  2007-07-23  8:53     ` Jarek Poplawski
  2007-07-24  7:18     ` Jarek Poplawski
@ 2007-07-24  8:05     ` Ingo Molnar
  2007-07-24  9:42       ` Ingo Molnar
  2 siblings, 1 reply; 69+ messages in thread
From: Ingo Molnar @ 2007-07-24  8:05 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Jarek Poplawski, Jean-Baptiste Vignaud, linux-kernel, shemminger,
	linux-net, netdev, Thomas Gleixner, Andrew Morton,
	Linus Torvalds


* Marcin Ślusarz <marcin.slusarz@gmail.com> wrote:

> Ok, I've bisected this problem and found that this patch broke my NIC:
> 
> 76d2160147f43f982dfe881404cfde9fd0a9da21 is first bad commit
> commit 76d2160147f43f982dfe881404cfde9fd0a9da21
> Author: Ingo Molnar <mingo@elte.hu>
> Date:   Fri Feb 16 01:28:24 2007 -0800
> 
>    [PATCH] genirq: do not mask interrupts by default

thanks for tracking it down! Could you try the patch below (ontop an 
otherwise unmodified kernel)? This tests the theory whether the problem 
is related to the disable_irq_nosync() call in the ne2k driver's xmit 
path. Does this solve the hangs too?

	Ingo

Index: linux/kernel/irq/manage.c
===================================================================
--- linux.orig/kernel/irq/manage.c
+++ linux/kernel/irq/manage.c
@@ -102,7 +102,19 @@ void disable_irq_nosync(unsigned int irq
 	spin_lock_irqsave(&desc->lock, flags);
 	if (!desc->depth++) {
 		desc->status |= IRQ_DISABLED;
-		desc->chip->disable(irq);
+		/*
+		 * the _nosync variant of irq-disable suggests that the
+		 * caller is not worried about concurrency but about the
+		 * ordering of the irq flow itself. (such as hardware
+		 * getting confused about certain, normally valid irq
+		 * handling sequences.) So if the default disable handler
+		 * is in place then try the more conservative masking
+		 * instead:
+		 */
+		if (desc->chip->disable == default_disable && desc->chip->mask)
+			desc->chip->mask(irq);
+		else
+			desc->chip->disable(irq);
 	}
 	spin_unlock_irqrestore(&desc->lock, flags);
 }

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-24  8:05     ` Ingo Molnar
@ 2007-07-24  9:42       ` Ingo Molnar
  2007-07-24 19:30         ` Linus Torvalds
  0 siblings, 1 reply; 69+ messages in thread
From: Ingo Molnar @ 2007-07-24  9:42 UTC (permalink / raw)
  To: Marcin ??lusarz
  Cc: Jarek Poplawski, Jean-Baptiste Vignaud, linux-kernel, shemminger,
	linux-net, netdev, Thomas Gleixner, Andrew Morton,
	Linus Torvalds


* Ingo Molnar <mingo@elte.hu> wrote:

> thanks for tracking it down! Could you try the patch below (ontop an 
> otherwise unmodified kernel)? This tests the theory whether the 
> problem is related to the disable_irq_nosync() call in the ne2k 
> driver's xmit path. Does this solve the hangs too?

please try the patch below instead.

	Ingo

Index: linux/kernel/irq/chip.c
===================================================================
--- linux.orig/kernel/irq/chip.c
+++ linux/kernel/irq/chip.c
@@ -231,7 +231,7 @@ static void default_enable(unsigned int 
 /*
  * default disable function
  */
-static void default_disable(unsigned int irq)
+void default_disable(unsigned int irq)
 {
 }
 
Index: linux/kernel/irq/internals.h
===================================================================
--- linux.orig/kernel/irq/internals.h
+++ linux/kernel/irq/internals.h
@@ -10,6 +10,8 @@ extern void irq_chip_set_defaults(struct
 /* Set default handler: */
 extern void compat_irq_chip_set_default_handler(struct irq_desc *desc);
 
+extern void default_disable(unsigned int irq);
+
 #ifdef CONFIG_PROC_FS
 extern void register_irq_proc(unsigned int irq);
 extern void register_handler_proc(unsigned int irq, struct irqaction *action);
Index: linux/kernel/irq/manage.c
===================================================================
--- linux.orig/kernel/irq/manage.c
+++ linux/kernel/irq/manage.c
@@ -102,7 +102,19 @@ void disable_irq_nosync(unsigned int irq
 	spin_lock_irqsave(&desc->lock, flags);
 	if (!desc->depth++) {
 		desc->status |= IRQ_DISABLED;
-		desc->chip->disable(irq);
+		/*
+		 * the _nosync variant of irq-disable suggests that the
+		 * caller is not worried about concurrency but about the
+		 * ordering of the irq flow itself. (such as hardware
+		 * getting confused about certain, normally valid irq
+		 * handling sequences.) So if the default disable handler
+		 * is in place then try the more conservative masking
+		 * instead:
+		 */
+		if (desc->chip->disable == default_disable && desc->chip->mask)
+			desc->chip->mask(irq);
+		else
+			desc->chip->disable(irq);
 	}
 	spin_unlock_irqrestore(&desc->lock, flags);
 }

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-24  9:42       ` Ingo Molnar
@ 2007-07-24 19:30         ` Linus Torvalds
  2007-07-24 20:04           ` Ingo Molnar
  0 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2007-07-24 19:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Marcin ??lusarz, Jarek Poplawski, Jean-Baptiste Vignaud,
	linux-kernel, shemminger, linux-net, netdev, Thomas Gleixner,
	Andrew Morton



On Tue, 24 Jul 2007, Ingo Molnar wrote:
> 
> please try the patch below instead.

I'm hoping this is just a "let's see if the behavior changes" patch, not 
something that you think should be applied if it fixes something?

This patch looks like it is trying to paper over (rather than fix) some 
possible bug in the "->disable" logic. Makes sense as a "let's see if it's 
that" kind of thing, but not as a "let's fix it".

Or am I missing something?

		Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-24 19:30         ` Linus Torvalds
@ 2007-07-24 20:04           ` Ingo Molnar
  2007-07-25  0:19             ` Thomas Gleixner
  0 siblings, 1 reply; 69+ messages in thread
From: Ingo Molnar @ 2007-07-24 20:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marcin ??lusarz, Jarek Poplawski, Jean-Baptiste Vignaud,
	linux-kernel, shemminger, linux-net, netdev, Thomas Gleixner,
	Andrew Morton


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 24 Jul 2007, Ingo Molnar wrote:
> > 
> > please try the patch below instead.
> 
> I'm hoping this is just a "let's see if the behavior changes" patch, 
> not something that you think should be applied if it fixes something?
> 
> This patch looks like it is trying to paper over (rather than fix) 
> some possible bug in the "->disable" logic. Makes sense as a "let's 
> see if it's that" kind of thing, but not as a "let's fix it".
> 
> Or am I missing something?

yeah - it's a totaly bad and unacceptable hack (i realized how bad it 
was when i wrote up that comment section ...), i just wanted to see 
which portion of ne2k/lib8390.c is sensitive to the fact whether an irq 
line is masked or not. The patch has no SOB line either.

the current best fix forward is to undo my original change, unless we 
find a better fix for this problem. (Note that the other patches posted 
in this thread are broken too: they only mask the irq but dont reliably 
unmask it.)

here's the current method of handling irqs for Marcin's card:

17:         12   IO-APIC-fasteoi   eth1, eth0

and fasteoi is a really simple sequence: no masking/unmasking by the 
flow handler itself but a NOP at entry and an APIC-EOI at the end. The 
disable/enable irq thing should thus have minimal effect if done within 
an irq handler.

now ne2k does something uncommon: for xmit (which is normally done 
outside of irq handlers) it will disable_irq_nosync()/enable_irq() the 
interrupt. It does it to exclude the handler from _that_ CPU, but due to 
the _nosync it does not exclude it from any other CPUs. So that's a bit 
weird already.

just in case, i've just re-checked all the genirq bits that change 
IRQ_DISABLED (that bit accidentally clear would be the only way to truly 
allow an IRQ handler to interrupt the disable_irq_nosync() critical 
section on that CPU) - but i can see no way for that to happen: we 
unconditionally detect and report unbalanced and underflowing 
irq_desc->depth, and the only other place (besides enable_irq()) that 
clears IRQ_DISABLED is __set_irq_handler(), and on x86 that is only used 
during bootup.

Marcin, could you try the patch below too? [without having any other 
patch applied.] It basically turns the critical section into an irqs-off 
critical section and thus checks whether your problem is related to that 
particular area of code.

	Ingo

Index: linux/drivers/net/lib8390.c
===================================================================
--- linux.orig/drivers/net/lib8390.c
+++ linux/drivers/net/lib8390.c
@@ -297,9 +297,7 @@ static int ei_start_xmit(struct sk_buff 
 	 *	Slow phase with lock held.
 	 */
 
-	disable_irq_nosync_lockdep_irqsave(dev->irq, &flags);
-
-	spin_lock(&ei_local->page_lock);
+	spin_lock_irqsave(&ei_local->page_lock, flags);
 
 	ei_local->irqlock = 1;
 
@@ -376,8 +374,7 @@ static int ei_start_xmit(struct sk_buff 
 	ei_local->irqlock = 0;
 	ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
 
-	spin_unlock(&ei_local->page_lock);
-	enable_irq_lockdep_irqrestore(dev->irq, &flags);
+	spin_unlock_irqrestore(&ei_local->page_lock, flags);
 
 	dev_kfree_skb (skb);
 	ei_local->stat.tx_bytes += send_length;

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-24 20:04           ` Ingo Molnar
@ 2007-07-25  0:19             ` Thomas Gleixner
  2007-07-25  7:23               ` Jarek Poplawski
                                 ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Thomas Gleixner @ 2007-07-25  0:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Marcin ??lusarz, Jarek Poplawski,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton

On Tue, 2007-07-24 at 22:04 +0200, Ingo Molnar wrote:
> Marcin, could you try the patch below too? [without having any other 
> patch applied.] It basically turns the critical section into an irqs-off 
> critical section and thus checks whether your problem is related to that 
> particular area of code.
> 

I read back on this thread and I think the problem is somewhere else:

delayed disable relies on the ability to re-trigger the interrupt in the
case that a real interrupt happens after the software disable was set.
In this case we actually disable the interrupt on the hardware level
_after_ it occurred.

On enable_irq, we need to re-trigger the interrupt. On i386 this relies
on a hardware resend mechanism (send_IPI_self()). 

Actually we only need the resend for edge type interrupts. Level type
interrupts come back once enable_irq() re-enables the interrupt line.

I assume that the interrupt in question is level triggered because it is
shared and above the legacy irqs 0-15:

	17:         12   IO-APIC-fasteoi   eth1, eth0

Looking into the IO_APIC code, the resend via send_IPI_self() happens
unconditionally. So the resend is done for level and edge interrupts.
This makes the problem more mysterious.

The code in question lib8390.c does

	disable_irq();
	fiddle_with_the_network_card_hardware()
	enable_irq();

The fiddle_with_the_network_card_hardware() might cause interrupts,
which are cleared in the same code path again,

Marcin found that when he disables the irq line on the hardware level
(removing the delayed disable) the card is kept alive.

So the difference is that we can get a resend on enable_irq, when an
interrupt happens during the time, where we are in the disabled region.

No idea how this affects the network card, as the code there must be
able to handle interrupts, which are not originated from the card due to
interrupt sharing.

Marcin, can you please try the patch below ? It's just a debugging aid
to gather some more data about that problem.

If the patch fixes the problem, then we should try to disable the resend
mechanism for not edge type irq lines on the irq_chip level (i.e. the
IOAPIC code)

Thanks,

	tglx

--- linux-2.6.orig/kernel/irq/resend.c
+++ linux-2.6/kernel/irq/resend.c
@@ -62,6 +62,15 @@ void check_irq_resend(struct irq_desc *desc, unsigned int irq)
 	 */
 	desc->chip->enable(irq);
 
+	/*
+	 * Temporary hack to figure out more about the problem, which
+	 * is causing the ancient network cards to die.
+	 */
+	if (desc->handle_irq != handle_edge_irq) {
+		printk(KERN_DEBUG "Skip resend for irq %u\n", irq);
+		return;
+	}
+
 	if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
 		desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
 



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-25  0:19             ` Thomas Gleixner
@ 2007-07-25  7:23               ` Jarek Poplawski
  2007-07-25 13:57               ` Jarek Poplawski
  2007-07-26  7:16               ` Marcin Ślusarz
  2 siblings, 0 replies; 69+ messages in thread
From: Jarek Poplawski @ 2007-07-25  7:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Linus Torvalds, Marcin ??lusarz,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton

On Wed, Jul 25, 2007 at 02:19:31AM +0200, Thomas Gleixner wrote:
> On Tue, 2007-07-24 at 22:04 +0200, Ingo Molnar wrote:
> > Marcin, could you try the patch below too? [without having any other 
> > patch applied.] It basically turns the critical section into an irqs-off 
> > critical section and thus checks whether your problem is related to that 
> > particular area of code.
> > 
> 
> I read back on this thread and I think the problem is somewhere else:

So do I. Of course, I certainly miss most of the details, but I can't
imagine how this yesterday Ingo's patch couldn't work - unless
Marcin's test wasn't long enough...

IMHO, the main problem is that such delicate things shouldn't be
changed this way. If current ideas work for Marcin they will probably
break other boxes. Very similar symptoms were reported before Ingo's
patch too, so it looks like this place is very fragile. If such
things could happen:

(from: arch/i386/kernel/io_apic.c)
> static void ack_ioapic_quirk_irq(unsigned int irq)
> ...
> /*
>  * It appears there is an erratum which affects at least version 0x11
>  * of I/O APIC (that's the 82093AA and cores integrated into various
>  * chipsets).  Under certain conditions a level-triggered interrupt is
>  * erroneously delivered as edge-triggered one but the respective IRR
>  * bit gets set nevertheless.  As a result the I/O unit expects an EOI
>  * message but it will never arrive and further interrupts are blocked
>  * from the source.  The exact reason is so far unknown, but the
>  * phenomenon was observed when two consecutive interrupt requests
>  * from a given source get delivered to the same CPU and the source is
>  * temporarily disabled in between.
...

there is no reason to think this is all.

I can also see this comment in arch/x86_64/kernel/io_apic.c:

> static void setup_IO_APIC_irq(int apic, int pin, unsigned int irq,
>                               int trigger, int polarity)
...
>        /* Mask level triggered irqs.
>         * Use IRQ_DELAYED_DISABLE for edge triggered irqs.
>         */

It seems somebody have seen a difference, probably after testing,
but it wasn't respected.

I also presume ne2k/lib8390.c solution could be a result of "real
life", and I don't think Marcin's tests can be enough here. 

So, my point is that such places first of all need some documented
knobs in config or elsewere, which make it possible for users to
easily go back to previous method (i.e. from 2.6.21 to 2.6.20 here).

Regards,
Jarek P.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-25  0:19             ` Thomas Gleixner
  2007-07-25  7:23               ` Jarek Poplawski
@ 2007-07-25 13:57               ` Jarek Poplawski
  2007-07-25 14:46                 ` Alan Cox
  2007-07-26  7:16               ` Marcin Ślusarz
  2 siblings, 1 reply; 69+ messages in thread
From: Jarek Poplawski @ 2007-07-25 13:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Linus Torvalds, Marcin ??lusarz,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton

On Wed, Jul 25, 2007 at 02:19:31AM +0200, Thomas Gleixner wrote:
...
> Looking into the IO_APIC code, the resend via send_IPI_self() happens
> unconditionally. So the resend is done for level and edge interrupts.
> This makes the problem more mysterious.
> 
> The code in question lib8390.c does
> 
> 	disable_irq();
> 	fiddle_with_the_network_card_hardware()
> 	enable_irq();
...
> 
> No idea how this affects the network card, as the code there must be
> able to handle interrupts, which are not originated from the card due to
> interrupt sharing.

I think, in this last yesterday's patch Ingo could be right, yet!
The comment at the beginnig points this is done like that because
of chip's slowness. And problems with timing are mysterious.

On the other hand author of this code didn't use spin_lock_irqsave
for some reason, probably after testing this option too. So, I hope
this is the right path, but alas, I'm not sure this patch has to
prove this 100%.

Anyway, in my opinion this situation where interrupts could/have_to
be used for such strange things should confirm the need of more
options for handling irqs individually.

Thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-25 13:57               ` Jarek Poplawski
@ 2007-07-25 14:46                 ` Alan Cox
  2007-07-26 12:44                   ` [PATCH][netdrvr] lib8390: comment on locking by Alan Cox " Jarek Poplawski
  2007-07-30  8:46                   ` Ingo Molnar
  0 siblings, 2 replies; 69+ messages in thread
From: Alan Cox @ 2007-07-25 14:46 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Thomas Gleixner, Ingo Molnar, Linus Torvalds, Marcin ??lusarz,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton

> > The code in question lib8390.c does
> > 
> > 	disable_irq();
> > 	fiddle_with_the_network_card_hardware()
> > 	enable_irq();
> ...
> > 
> > No idea how this affects the network card, as the code there must be
> > able to handle interrupts, which are not originated from the card due to
> > interrupt sharing.
> 
> I think, in this last yesterday's patch Ingo could be right, yet!
> The comment at the beginnig points this is done like that because
> of chip's slowness. And problems with timing are mysterious.
> 
> On the other hand author of this code didn't use spin_lock_irqsave
> for some reason, probably after testing this option too. So, I hope
> this is the right path, but alas, I'm not sure this patch has to
> prove this 100%.

The author (me) didn't use spin_lock_irqsave because the slowness of the
card means that approach caused horrible problems like losing serial data
at 38400 baud on some chips. Rememeber many 8390 nics on PCI were ISA
chips with FPGA front ends.

> Anyway, in my opinion this situation where interrupts could/have_to
> be used for such strange things should confirm the need of more
> options for handling irqs individually.

Ok the logic behind the 8390 is very simple:

Things to know
	- IRQ delivery is asynchronous to the PCI bus
	- Blocking the local CPU IRQ via spin locks was too slow
	- The chip has register windows needing locking work

So the path was once (I say once as people appear to have changed it
in the mean time and it now looks rather bogus if the changes to use
disable_irq_nosync_irqsave are disabling the local IRQ)


	Take the page lock
	Mask the IRQ on chip
	Disable the IRQ (but not mask locally- someone seems to have
		broken this with the lock validator stuff)
		[This must be _nosync as the page lock may otherwise
			deadlock us]
	Drop the page lock and turn IRQs back on
	
	At this point an existing IRQ may still be running but we can't
	get a new one

	Take the lock (so we know the IRQ has terminated) but don't mask
the IRQs on the processor
	Set irqlock [for debug]

	Transmit (slow as ****)

	re-enable the IRQ


We have to use disable_irq because otherwise you will get delayed
interrupts on the APIC bus deadlocking the transmit path.

Quite hairy but the chip simply wasn't designed for SMP and you can't
even ACK an interrupt without risking corrupting other parallel
activities on the chip.

Alan

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-25  0:19             ` Thomas Gleixner
  2007-07-25  7:23               ` Jarek Poplawski
  2007-07-25 13:57               ` Jarek Poplawski
@ 2007-07-26  7:16               ` Marcin Ślusarz
  2007-07-26  8:13                 ` Jarek Poplawski
  2007-07-26  8:16                 ` Ingo Molnar
  2 siblings, 2 replies; 69+ messages in thread
From: Marcin Ślusarz @ 2007-07-26  7:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Linus Torvalds, Jarek Poplawski,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton

2007/7/25, Thomas Gleixner <tglx@linutronix.de>:
> (...)

I've tested Jarek's patch, 2 Ingo's patches (2nd and 3rd) and Thomas'
patch (one patch at time of course) - all of them fixed the problem,
but the last one flooded my logs with "Skip resend for irq 17". All
tests were done on 2.6.21.3.

I wanted to test them all on 2.6.22.1, but I didn't have enough time.
I've verified only that 2.6.22.1 has the same problem. I can test it
later, but I can report results back at beginning of next week.

Regards,
Marcin Slusarz

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26  8:13                 ` Jarek Poplawski
@ 2007-07-26  8:10                   ` Thomas Gleixner
  2007-07-26  8:31                     ` Ingo Molnar
  2007-07-26  9:11                     ` 2.6.20->2.6.21 - networking dies after random time Jarek Poplawski
  2007-07-26  8:19                   ` Jarek Poplawski
  1 sibling, 2 replies; 69+ messages in thread
From: Thomas Gleixner @ 2007-07-26  8:10 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Marcin Ślusarz, Ingo Molnar, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Thu, 2007-07-26 at 10:13 +0200, Jarek Poplawski wrote:
> > I wanted to test them all on 2.6.22.1, but I didn't have enough time.
> > I've verified only that 2.6.22.1 has the same problem. I can test it
> > later, but I can report results back at beginning of next week.
> 
> 
> So, everything is clear - any changes are good!
> Except the signed-off ones... 
> 
> Thanks Marcin,
> Jarek P.
> 
> PS: Now, it seems to me Thomas could be the nearest. BTW, could somebody
> give me some tip, how these re-triggered interrupts are skipped on dev's
> reset before enable_irq?

I think the correct solution is really not to resend level type
interrupts. If the interrupt line is still active, then the interrupt
comes up by itself. I'm cooking a patch for that.

The other question is: 

Is the driver confused by the resent irq or is the chip-set unhappy
about the resend ?

We could figure the latter out by activating the software based resend
method.

	tglx



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26  7:16               ` Marcin Ślusarz
@ 2007-07-26  8:13                 ` Jarek Poplawski
  2007-07-26  8:10                   ` Thomas Gleixner
  2007-07-26  8:19                   ` Jarek Poplawski
  2007-07-26  8:16                 ` Ingo Molnar
  1 sibling, 2 replies; 69+ messages in thread
From: Jarek Poplawski @ 2007-07-26  8:13 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Thomas Gleixner, Ingo Molnar, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Thu, Jul 26, 2007 at 09:16:10AM +0200, Marcin Ślusarz wrote:
> 2007/7/25, Thomas Gleixner <tglx@linutronix.de>:
> >(...)
> 
> I've tested Jarek's patch, 2 Ingo's patches (2nd and 3rd) and Thomas'
> patch (one patch at time of course) - all of them fixed the problem,
> but the last one flooded my logs with "Skip resend for irq 17". All
> tests were done on 2.6.21.3.
> 
> I wanted to test them all on 2.6.22.1, but I didn't have enough time.
> I've verified only that 2.6.22.1 has the same problem. I can test it
> later, but I can report results back at beginning of next week.


So, everything is clear - any changes are good!
Except the signed-off ones... 

Thanks Marcin,
Jarek P.

PS: Now, it seems to me Thomas could be the nearest. BTW, could somebody
give me some tip, how these re-triggered interrupts are skipped on dev's
reset before enable_irq?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26  7:16               ` Marcin Ślusarz
  2007-07-26  8:13                 ` Jarek Poplawski
@ 2007-07-26  8:16                 ` Ingo Molnar
  1 sibling, 0 replies; 69+ messages in thread
From: Ingo Molnar @ 2007-07-26  8:16 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Thomas Gleixner, Linus Torvalds, Jarek Poplawski,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton


* Marcin Ślusarz <marcin.slusarz@gmail.com> wrote:

> 2007/7/25, Thomas Gleixner <tglx@linutronix.de>:
> >(...)
> 
> I've tested Jarek's patch, 2 Ingo's patches (2nd and 3rd) and Thomas' 
> patch (one patch at time of course) - all of them fixed the problem, 
> but the last one flooded my logs with "Skip resend for irq 17". All 
> tests were done on 2.6.21.3.

that's great! I think we have two good theories about what might be 
going on:

 - the driver might be buggy in that it gets confused by the 'resent' 
   irq.

 - or the chipset/cpu has a bug where it might get confused about the
   resent APIC vector getting mixed up with the same vector coming
   externally too. (Now, it makes little sense to 'resend' a
   level-triggered interrupt on x86 platforms that have flat PIC 
   hierarchies (other architectures might need more than that to
   retrigger an interrupt) - but there's nothing wrong about it in 
   theory and it needs fixing for edge irqs anyway.)

in any case, the problem was triggered by our change generating much 
more resent irqs than before. Nevertheless we'd like to fix that resend 
bug (and if the driver is buggy, the driver bug too). It's really good 
progress so far - we are working on doing the real fix now.

	Ingo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26  8:13                 ` Jarek Poplawski
  2007-07-26  8:10                   ` Thomas Gleixner
@ 2007-07-26  8:19                   ` Jarek Poplawski
  1 sibling, 0 replies; 69+ messages in thread
From: Jarek Poplawski @ 2007-07-26  8:19 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Thomas Gleixner, Ingo Molnar, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Thu, Jul 26, 2007 at 10:13:26AM +0200, Jarek Poplawski wrote:
...
> So, everything is clear - any changes are good!
> Except the signed-off ones... 

Oops! Marcin's patch was both signed-off and good.
So, there is probably something more...

Sorry Marcin,
Jarek P.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26  8:10                   ` Thomas Gleixner
@ 2007-07-26  8:31                     ` Ingo Molnar
  2007-07-26  8:55                       ` Jarek Poplawski
  2007-07-26  9:11                     ` 2.6.20->2.6.21 - networking dies after random time Jarek Poplawski
  1 sibling, 1 reply; 69+ messages in thread
From: Ingo Molnar @ 2007-07-26  8:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jarek Poplawski, Marcin ??lusarz, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox


* Thomas Gleixner <tglx@linutronix.de> wrote:

> The other question is:
> 
> Is the driver confused by the resent irq or is the chip-set unhappy 
> about the resend ?
> 
> We could figure the latter out by activating the software based resend 
> method.

yeah. The patch below enables sw-resend on x86, to test the theory 
whether the APIC-driven hardware-vector-resend code has some problem.

Marcin, could you please give this one a try too? Good behavior would be 
a fully working kernel (no hung device) with no extra kernel messages. 
Bad behavior would be any extra kernel message or any non-working 
device.

	Ingo

----------------------------->
Subject: x86: activate HARDIRQS_SW_RESEND
From: Ingo Molnar <mingo@elte.hu>

activate the software-triggered IRQ-resend logic.

it appears some chipsets/cpus do not handle local-APIC driven IRQ
resends all that well, so always use the soft mechanism to trigger
the execution of pending interrupts.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/i386/Kconfig   |    4 ++++
 kernel/irq/manage.c |    8 ++++++++
 2 files changed, 12 insertions(+)

Index: linux/arch/i386/Kconfig
===================================================================
--- linux.orig/arch/i386/Kconfig
+++ linux/arch/i386/Kconfig
@@ -1270,6 +1270,10 @@ config GENERIC_PENDING_IRQ
 	depends on GENERIC_HARDIRQS && SMP
 	default y
 
+config HARDIRQS_SW_RESEND
+	bool
+	default y
+
 config X86_SMP
 	bool
 	depends on SMP && !X86_VOYAGER
Index: linux/kernel/irq/manage.c
===================================================================
--- linux.orig/kernel/irq/manage.c
+++ linux/kernel/irq/manage.c
@@ -181,6 +181,14 @@ void enable_irq(unsigned int irq)
 		desc->depth--;
 	}
 	spin_unlock_irqrestore(&desc->lock, flags);
+#ifdef CONFIG_HARDIRQS_SW_RESEND
+	/*
+	 * Do a bh disable/enable pair to trigger any pending
+	 * irq resend logic:
+	 */
+	local_bh_disable();
+	local_bh_enable();
+#endif
 }
 EXPORT_SYMBOL(enable_irq);
 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26  8:31                     ` Ingo Molnar
@ 2007-07-26  8:55                       ` Jarek Poplawski
  2007-07-26  9:12                         ` Ingo Molnar
  0 siblings, 1 reply; 69+ messages in thread
From: Jarek Poplawski @ 2007-07-26  8:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Marcin ??lusarz, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Thu, Jul 26, 2007 at 10:31:20AM +0200, Ingo Molnar wrote:
...
> yeah. The patch below enables sw-resend on x86, to test the theory 
> whether the APIC-driven hardware-vector-resend code has some problem.

I think Marcin is using x86_64 (Athlon 64) yet.

Jarek P.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26  8:10                   ` Thomas Gleixner
  2007-07-26  8:31                     ` Ingo Molnar
@ 2007-07-26  9:11                     ` Jarek Poplawski
  1 sibling, 0 replies; 69+ messages in thread
From: Jarek Poplawski @ 2007-07-26  9:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marcin Ślusarz, Ingo Molnar, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Thu, Jul 26, 2007 at 10:10:31AM +0200, Thomas Gleixner wrote:
> On Thu, 2007-07-26 at 10:13 +0200, Jarek Poplawski wrote:
...
> > PS: Now, it seems to me Thomas could be the nearest. BTW, could somebody
> > give me some tip, how these re-triggered interrupts are skipped on dev's
> > reset before enable_irq?
> 
> I think the correct solution is really not to resend level type
> interrupts. If the interrupt line is still active, then the interrupt
> comes up by itself. I'm cooking a patch for that.
> 
> The other question is: 
> 
> Is the driver confused by the resent irq or is the chip-set unhappy
> about the resend ?
> 
> We could figure the latter out by activating the software based resend
> method.

Probably I miss something, but isn't there any problem with level type,
when APIC re-triggers an interrupt, which is not acked by driver nor
card (after some hw reset/clear)?

Jarek P.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26  8:55                       ` Jarek Poplawski
@ 2007-07-26  9:12                         ` Ingo Molnar
  2007-07-30  7:29                           ` Marcin Ślusarz
  0 siblings, 1 reply; 69+ messages in thread
From: Ingo Molnar @ 2007-07-26  9:12 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Thomas Gleixner, Marcin ??lusarz, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox


* Jarek Poplawski <jarkao2@o2.pl> wrote:

> On Thu, Jul 26, 2007 at 10:31:20AM +0200, Ingo Molnar wrote:
> ...
> > yeah. The patch below enables sw-resend on x86, to test the theory 
> > whether the APIC-driven hardware-vector-resend code has some problem.
> 
> I think Marcin is using x86_64 (Athlon 64) yet.

yeah - i meant to cover both arches but forgot about x86_64 - updated 
patch attached below.

	Ingo

----------------->
Subject: x86: activate HARDIRQS_SW_RESEND
From: Ingo Molnar <mingo@elte.hu>

activate the software-triggered IRQ-resend logic.

it appears some chipsets/cpus do not handle local-APIC driven IRQ
resends all that well, so always use the soft mechanism to trigger
the execution of pending interrupts.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/i386/Kconfig   |    4 ++++
 arch/x86_64/Kconfig |    4 ++++
 kernel/irq/manage.c |    8 ++++++++
 3 files changed, 16 insertions(+)

Index: linux-rt-rebase.q/arch/i386/Kconfig
===================================================================
--- linux-rt-rebase.q.orig/arch/i386/Kconfig
+++ linux-rt-rebase.q/arch/i386/Kconfig
@@ -1284,6 +1284,10 @@ config GENERIC_PENDING_IRQ
 	depends on GENERIC_HARDIRQS && SMP
 	default y
 
+config HARDIRQS_SW_RESEND
+	bool
+	default y
+
 config X86_SMP
 	bool
 	depends on SMP && !X86_VOYAGER
Index: linux-rt-rebase.q/arch/x86_64/Kconfig
===================================================================
--- linux-rt-rebase.q.orig/arch/x86_64/Kconfig
+++ linux-rt-rebase.q/arch/x86_64/Kconfig
@@ -721,6 +721,10 @@ config GENERIC_PENDING_IRQ
 	depends on GENERIC_HARDIRQS && SMP
 	default y
 
+config HARDIRQS_SW_RESEND
+	bool
+	default y
+
 menu "Power management options"
 
 source kernel/power/Kconfig
Index: linux-rt-rebase.q/kernel/irq/manage.c
===================================================================
--- linux-rt-rebase.q.orig/kernel/irq/manage.c
+++ linux-rt-rebase.q/kernel/irq/manage.c
@@ -175,6 +175,14 @@ void enable_irq(unsigned int irq)
 		desc->depth--;
 	}
 	spin_unlock_irqrestore(&desc->lock, flags);
+#ifdef CONFIG_HARDIRQS_SW_RESEND
+	/*
+	 * Do a bh disable/enable pair to trigger any pending
+	 * irq resend logic:
+	 */
+	local_bh_disable();
+	local_bh_enable();
+#endif
 }
 EXPORT_SYMBOL(enable_irq);
 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH][netdrvr] lib8390: comment on locking by Alan Cox Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-25 14:46                 ` Alan Cox
@ 2007-07-26 12:44                   ` Jarek Poplawski
  2007-07-26 12:47                     ` Alan Cox
  2007-07-30 19:47                     ` Jeff Garzik
  2007-07-30  8:46                   ` Ingo Molnar
  1 sibling, 2 replies; 69+ messages in thread
From: Jarek Poplawski @ 2007-07-26 12:44 UTC (permalink / raw)
  To: Alan Cox
  Cc: Thomas Gleixner, Ingo Molnar, Linus Torvalds, Marcin ??lusarz,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Paul Gortmaker, Jeff Garzik

Hi,

Very below is my patch proposal with a comment, which in my opinion
is precious enough to save it for future help in reading and
understanding the code.

I hope Alan will not blame me I've not asked for his permission before
sending, and he would ack this patch as it is or at least most of this.

Thanks & regards,
Jarek P.

On Wed, Jul 25, 2007 at 03:46:56PM +0100, Alan Cox wrote:
> > > The code in question lib8390.c does
> > > 
> > > 	disable_irq();
> > > 	fiddle_with_the_network_card_hardware()
> > > 	enable_irq();
> > ...
> > > 
> > > No idea how this affects the network card, as the code there must be
> > > able to handle interrupts, which are not originated from the card due to
> > > interrupt sharing.
> > 
> > I think, in this last yesterday's patch Ingo could be right, yet!
> > The comment at the beginnig points this is done like that because
> > of chip's slowness. And problems with timing are mysterious.
> > 
> > On the other hand author of this code didn't use spin_lock_irqsave
> > for some reason, probably after testing this option too. So, I hope
> > this is the right path, but alas, I'm not sure this patch has to
> > prove this 100%.
> 
> The author (me) didn't use spin_lock_irqsave because the slowness of the
> card means that approach caused horrible problems like losing serial data
> at 38400 baud on some chips. Rememeber many 8390 nics on PCI were ISA
> chips with FPGA front ends.
> 
> > Anyway, in my opinion this situation where interrupts could/have_to
> > be used for such strange things should confirm the need of more
> > options for handling irqs individually.
> 
> Ok the logic behind the 8390 is very simple:
> 
> Things to know
> 	- IRQ delivery is asynchronous to the PCI bus
> 	- Blocking the local CPU IRQ via spin locks was too slow
> 	- The chip has register windows needing locking work
> 
> So the path was once (I say once as people appear to have changed it
> in the mean time and it now looks rather bogus if the changes to use
> disable_irq_nosync_irqsave are disabling the local IRQ)
> 
> 
> 	Take the page lock
> 	Mask the IRQ on chip
> 	Disable the IRQ (but not mask locally- someone seems to have
> 		broken this with the lock validator stuff)
> 		[This must be _nosync as the page lock may otherwise
> 			deadlock us]
> 	Drop the page lock and turn IRQs back on
> 	
> 	At this point an existing IRQ may still be running but we can't
> 	get a new one
> 
> 	Take the lock (so we know the IRQ has terminated) but don't mask
> the IRQs on the processor
> 	Set irqlock [for debug]
> 
> 	Transmit (slow as ****)
> 
> 	re-enable the IRQ
> 
> 
> We have to use disable_irq because otherwise you will get delayed
> interrupts on the APIC bus deadlocking the transmit path.
> 
> Quite hairy but the chip simply wasn't designed for SMP and you can't
> even ACK an interrupt without risking corrupting other parallel
> activities on the chip.
> 
> Alan
> 
------>

From: Jarek Poplawski <jarkao2@o2.pl>

Subject: lib8390: comment on locking by Alan Cox

Additional explanation of problems with locking by Alan Cox.

Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Paul Gortmaker <p_gortmaker@yahoo.com>
Cc: Jeff Garzik <jeff@garzik.org>

---

diff -Nurp 2.6.23-rc1-/drivers/net/lib8390.c 2.6.23-rc1/drivers/net/lib8390.c
--- 2.6.23-rc1-/drivers/net/lib8390.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.23-rc1/drivers/net/lib8390.c	2007-07-26 13:55:17.000000000 +0200
@@ -143,6 +143,52 @@ static void __NS8390_init(struct net_dev
  *	annoying the transmit function is called bh atomic. That places
  *	restrictions on the user context callers as disable_irq won't save
  *	them.
+ *
+ *	Additional explanation of problems with locking by Alan Cox:
+ *
+ *	"The author (me) didn't use spin_lock_irqsave because the slowness of the
+ *	card means that approach caused horrible problems like losing serial data
+ *	at 38400 baud on some chips. Rememeber many 8390 nics on PCI were ISA
+ *	chips with FPGA front ends.
+ *	
+ *	Ok the logic behind the 8390 is very simple:
+ *	
+ *	Things to know
+ *		- IRQ delivery is asynchronous to the PCI bus
+ *		- Blocking the local CPU IRQ via spin locks was too slow
+ *		- The chip has register windows needing locking work
+ *	
+ *	So the path was once (I say once as people appear to have changed it
+ *	in the mean time and it now looks rather bogus if the changes to use
+ *	disable_irq_nosync_irqsave are disabling the local IRQ)
+ *	
+ *	
+ *		Take the page lock
+ *		Mask the IRQ on chip
+ *		Disable the IRQ (but not mask locally- someone seems to have
+ *			broken this with the lock validator stuff)
+ *			[This must be _nosync as the page lock may otherwise
+ *				deadlock us]
+ *		Drop the page lock and turn IRQs back on
+ *		
+ *		At this point an existing IRQ may still be running but we can't
+ *		get a new one
+ *	
+ *		Take the lock (so we know the IRQ has terminated) but don't mask
+ *	the IRQs on the processor
+ *		Set irqlock [for debug]
+ *	
+ *		Transmit (slow as ****)
+ *	
+ *		re-enable the IRQ
+ *	
+ *	
+ *	We have to use disable_irq because otherwise you will get delayed
+ *	interrupts on the APIC bus deadlocking the transmit path.
+ *	
+ *	Quite hairy but the chip simply wasn't designed for SMP and you can't
+ *	even ACK an interrupt without risking corrupting other parallel
+ *	activities on the chip." [lkml, 25 Jul 2007]
  */
 
 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH][netdrvr] lib8390: comment on locking by Alan Cox Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26 12:44                   ` [PATCH][netdrvr] lib8390: comment on locking by Alan Cox " Jarek Poplawski
@ 2007-07-26 12:47                     ` Alan Cox
  2007-07-30 19:47                     ` Jeff Garzik
  1 sibling, 0 replies; 69+ messages in thread
From: Alan Cox @ 2007-07-26 12:47 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Thomas Gleixner, Ingo Molnar, Linus Torvalds, Marcin ??lusarz,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Paul Gortmaker, Jeff Garzik

On Thu, 26 Jul 2007 14:44:01 +0200
Jarek Poplawski <jarkao2@o2.pl> wrote:

> Hi,
> 
> Very below is my patch proposal with a comment, which in my opinion
> is precious enough to save it for future help in reading and
> understanding the code.
> 
> I hope Alan will not blame me I've not asked for his permission before
> sending, and he would ack this patch as it is or at least most of this.

Fine by me


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26  9:12                         ` Ingo Molnar
@ 2007-07-30  7:29                           ` Marcin Ślusarz
  2007-07-30  8:49                             ` Ingo Molnar
                                               ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Marcin Ślusarz @ 2007-07-30  7:29 UTC (permalink / raw)
  To: Ingo Molnar, Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/7/26, Ingo Molnar <mingo@elte.hu>:
> (..)
> yeah - i meant to cover both arches but forgot about x86_64 - updated
> patch attached below.
>
>         Ingo
>
> ----------------->
> Subject: x86: activate HARDIRQS_SW_RESEND
> From: Ingo Molnar <mingo@elte.hu>
>
> activate the software-triggered IRQ-resend logic.
>
> it appears some chipsets/cpus do not handle local-APIC driven IRQ
> resends all that well, so always use the soft mechanism to trigger
> the execution of pending interrupts.
>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  arch/i386/Kconfig   |    4 ++++
>  arch/x86_64/Kconfig |    4 ++++
>  kernel/irq/manage.c |    8 ++++++++
>  3 files changed, 16 insertions(+)
>
> Index: linux-rt-rebase.q/arch/i386/Kconfig
> ===================================================================
> --- linux-rt-rebase.q.orig/arch/i386/Kconfig
> +++ linux-rt-rebase.q/arch/i386/Kconfig
> @@ -1284,6 +1284,10 @@ config GENERIC_PENDING_IRQ
>         depends on GENERIC_HARDIRQS && SMP
>         default y
>
> +config HARDIRQS_SW_RESEND
> +       bool
> +       default y
> +
>  config X86_SMP
>         bool
>         depends on SMP && !X86_VOYAGER
> Index: linux-rt-rebase.q/arch/x86_64/Kconfig
> ===================================================================
> --- linux-rt-rebase.q.orig/arch/x86_64/Kconfig
> +++ linux-rt-rebase.q/arch/x86_64/Kconfig
> @@ -721,6 +721,10 @@ config GENERIC_PENDING_IRQ
>         depends on GENERIC_HARDIRQS && SMP
>         default y
>
> +config HARDIRQS_SW_RESEND
> +       bool
> +       default y
> +
>  menu "Power management options"
>
>  source kernel/power/Kconfig
> Index: linux-rt-rebase.q/kernel/irq/manage.c
> ===================================================================
> --- linux-rt-rebase.q.orig/kernel/irq/manage.c
> +++ linux-rt-rebase.q/kernel/irq/manage.c
> @@ -175,6 +175,14 @@ void enable_irq(unsigned int irq)
>                 desc->depth--;
>         }
>         spin_unlock_irqrestore(&desc->lock, flags);
> +#ifdef CONFIG_HARDIRQS_SW_RESEND
> +       /*
> +        * Do a bh disable/enable pair to trigger any pending
> +        * irq resend logic:
> +        */
> +       local_bh_disable();
> +       local_bh_enable();
> +#endif
>  }
>  EXPORT_SYMBOL(enable_irq);

This patch didn't help (tested on 2.6.22.1) - ne2k_pci timed out.

ps: I retested all patches posted in this thread on top of 2.6.22.1
and behavior from 2.6.21.3 didn't changed. My next tests will be on
2.6.22.x only.

Regards,
Marcin Slusarz

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-25 14:46                 ` Alan Cox
  2007-07-26 12:44                   ` [PATCH][netdrvr] lib8390: comment on locking by Alan Cox " Jarek Poplawski
@ 2007-07-30  8:46                   ` Ingo Molnar
  2007-07-30 13:05                     ` Alan Cox
  1 sibling, 1 reply; 69+ messages in thread
From: Ingo Molnar @ 2007-07-30  8:46 UTC (permalink / raw)
  To: Alan Cox
  Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
	Marcin ??lusarz, Jean-Baptiste Vignaud, linux-kernel, shemminger,
	linux-net, netdev, Andrew Morton


* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> Ok the logic behind the 8390 is very simple:

thanks for the explanation Alan! A few comments and a question:

> Things to know
> 	- IRQ delivery is asynchronous to the PCI bus
> 	- Blocking the local CPU IRQ via spin locks was too slow
> 	- The chip has register windows needing locking work
> 
> So the path was once (I say once as people appear to have changed it 
> in the mean time and it now looks rather bogus if the changes to use 
> disable_irq_nosync_irqsave are disabling the local IRQ)
> 
> 
> 	Take the page lock
> 	Mask the IRQ on chip
> 	Disable the IRQ (but not mask locally- someone seems to have
> 		broken this with the lock validator stuff)
> 		[This must be _nosync as the page lock may otherwise
> 			deadlock us]

( side-note: you can ignore the lock validator stuff here, the validator
  changes are supposed to a NOP on the !lockdep case. Local irqs will
  only be disabled if the validator is running. This could cause dropped
  serial irqs on very old boxes but i doubt anyone will want to run the
  validator on those. )

> 	Drop the page lock and turn IRQs back on
> 	
> 	At this point an existing IRQ may still be running but we can't
> 	get a new one
> 
> 	Take the lock (so we know the IRQ has terminated) but don't mask
> the IRQs on the processor
> 	Set irqlock [for debug]
> 
> 	Transmit (slow as ****)
> 
> 	re-enable the IRQ
> 
> 
> We have to use disable_irq because otherwise you will get delayed 
> interrupts on the APIC bus deadlocking the transmit path.
> 
> Quite hairy but the chip simply wasn't designed for SMP and you can't 
> even ACK an interrupt without risking corrupting other parallel 
> activities on the chip.

So the whole locking is to be able to keep irqs enabled for a long time, 
without risking entry of the same IRQ handler on this same CPU, correct?

Marcin's test results suggest that if an IRQ is resent right at the 
enable_irq() point [be that via the hw irq-resend mechanism or the sw 
irq-resend mechanism], the hang happens.

In the previous 2.6.20 logic we'd not normally generate an IRQ at that 
point (because we masked the irq and the card itself deasserts the line 
so any level-triggered irq is now moot).

Once Thomas hacked off this resend mechanism for level-triggered irqs, 
Marcin saw the hangs go away.

So it seems to me that maybe the driver could be surprised via these 
spurious interrupts that happen right after the irq_enable(). Does the 
patch below make any sense in your opinion?

	Ingo

Index: linux/drivers/net/lib8390.c
===================================================================
--- linux.orig/drivers/net/lib8390.c
+++ linux/drivers/net/lib8390.c
@@ -375,6 +375,8 @@ static int ei_start_xmit(struct sk_buff 
 	/* Turn 8390 interrupts back on. */
 	ei_local->irqlock = 0;
 	ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
+	/* force POST: */
+	ei_inb_p(e8390_base + EN0_IMR);
 
 	spin_unlock(&ei_local->page_lock);
 	enable_irq_lockdep_irqrestore(dev->irq, &flags);

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-30  7:29                           ` Marcin Ślusarz
@ 2007-07-30  8:49                             ` Ingo Molnar
  2007-08-01  7:24                               ` Marcin Ślusarz
  2007-07-31 13:20                             ` Jarek Poplawski
  2007-07-31 15:58                             ` [patch] genirq: temporary fix for level-triggered IRQ resend Ingo Molnar
  2 siblings, 1 reply; 69+ messages in thread
From: Ingo Molnar @ 2007-07-30  8:49 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox


* Marcin Ślusarz <marcin.slusarz@gmail.com> wrote:

> > Subject: x86: activate HARDIRQS_SW_RESEND
> > From: Ingo Molnar <mingo@elte.hu>
> >
> > activate the software-triggered IRQ-resend logic.

> This patch didn't help (tested on 2.6.22.1) - ne2k_pci timed out.

ok. This makes it more likely that the driver itself (or the card) gets 
confused by the resend.

does the patch below fix those timeouts? It tests the theory whether any 
POST latency could expose this problem.

	Ingo

Index: linux/drivers/net/lib8390.c
===================================================================
--- linux.orig/drivers/net/lib8390.c
+++ linux/drivers/net/lib8390.c
@@ -375,6 +375,8 @@ static int ei_start_xmit(struct sk_buff 
 	/* Turn 8390 interrupts back on. */
 	ei_local->irqlock = 0;
 	ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
+	/* force POST: */
+	ei_inb_p(e8390_base + EN0_IMR);
 
 	spin_unlock(&ei_local->page_lock);
 	enable_irq_lockdep_irqrestore(dev->irq, &flags);

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-30  8:46                   ` Ingo Molnar
@ 2007-07-30 13:05                     ` Alan Cox
  0 siblings, 0 replies; 69+ messages in thread
From: Alan Cox @ 2007-07-30 13:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
	Marcin ??lusarz, Jean-Baptiste Vignaud, linux-kernel, shemminger,
	linux-net, netdev, Andrew Morton

> So the whole locking is to be able to keep irqs enabled for a long time, 
> without risking entry of the same IRQ handler on this same CPU, correct?

As implemented - on any CPU.

We also need to know that the IRQ handler is not doing useful work on
another processor which is why we take the lock after disabling the
interrupt line everywhere. Without that we might be completing an IRQ on
another CPU and that would race the transmit and make a nasty mess.

> So it seems to me that maybe the driver could be surprised via these 
> spurious interrupts that happen right after the irq_enable(). Does the 
> patch below make any sense in your opinion?

For MMIO it does look like that may be needed. Looks sensible.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH][netdrvr] lib8390: comment on locking by Alan Cox Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-26 12:44                   ` [PATCH][netdrvr] lib8390: comment on locking by Alan Cox " Jarek Poplawski
  2007-07-26 12:47                     ` Alan Cox
@ 2007-07-30 19:47                     ` Jeff Garzik
  1 sibling, 0 replies; 69+ messages in thread
From: Jeff Garzik @ 2007-07-30 19:47 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Alan Cox, Thomas Gleixner, Ingo Molnar, Linus Torvalds,
	Marcin ??lusarz, Jean-Baptiste Vignaud, linux-kernel, shemminger,
	linux-net, netdev, Andrew Morton, Paul Gortmaker

Jarek Poplawski wrote:
> Hi,
> 
> Very below is my patch proposal with a comment, which in my opinion
> is precious enough to save it for future help in reading and
> understanding the code.
> 
> I hope Alan will not blame me I've not asked for his permission before
> sending, and he would ack this patch as it is or at least most of this.
> 
> Thanks & regards,
> Jarek P.
> 
> On Wed, Jul 25, 2007 at 03:46:56PM +0100, Alan Cox wrote:
>>>> The code in question lib8390.c does
>>>>
>>>> 	disable_irq();
>>>> 	fiddle_with_the_network_card_hardware()
>>>> 	enable_irq();
>>> ...
>>>> No idea how this affects the network card, as the code there must be
>>>> able to handle interrupts, which are not originated from the card due to
>>>> interrupt sharing.
>>> I think, in this last yesterday's patch Ingo could be right, yet!
>>> The comment at the beginnig points this is done like that because
>>> of chip's slowness. And problems with timing are mysterious.
>>>
>>> On the other hand author of this code didn't use spin_lock_irqsave
>>> for some reason, probably after testing this option too. So, I hope
>>> this is the right path, but alas, I'm not sure this patch has to
>>> prove this 100%.
>> The author (me) didn't use spin_lock_irqsave because the slowness of the
>> card means that approach caused horrible problems like losing serial data
>> at 38400 baud on some chips. Rememeber many 8390 nics on PCI were ISA
>> chips with FPGA front ends.
>>
>>> Anyway, in my opinion this situation where interrupts could/have_to
>>> be used for such strange things should confirm the need of more
>>> options for handling irqs individually.
>> Ok the logic behind the 8390 is very simple:
>>
>> Things to know
>> 	- IRQ delivery is asynchronous to the PCI bus
>> 	- Blocking the local CPU IRQ via spin locks was too slow
>> 	- The chip has register windows needing locking work
>>
>> So the path was once (I say once as people appear to have changed it
>> in the mean time and it now looks rather bogus if the changes to use
>> disable_irq_nosync_irqsave are disabling the local IRQ)
>>
>>
>> 	Take the page lock
>> 	Mask the IRQ on chip
>> 	Disable the IRQ (but not mask locally- someone seems to have
>> 		broken this with the lock validator stuff)
>> 		[This must be _nosync as the page lock may otherwise
>> 			deadlock us]
>> 	Drop the page lock and turn IRQs back on
>> 	
>> 	At this point an existing IRQ may still be running but we can't
>> 	get a new one
>>
>> 	Take the lock (so we know the IRQ has terminated) but don't mask
>> the IRQs on the processor
>> 	Set irqlock [for debug]
>>
>> 	Transmit (slow as ****)
>>
>> 	re-enable the IRQ
>>
>>
>> We have to use disable_irq because otherwise you will get delayed
>> interrupts on the APIC bus deadlocking the transmit path.
>>
>> Quite hairy but the chip simply wasn't designed for SMP and you can't
>> even ACK an interrupt without risking corrupting other parallel
>> activities on the chip.
>>
>> Alan
>>
> ------>
> 
> From: Jarek Poplawski <jarkao2@o2.pl>
> 
> Subject: lib8390: comment on locking by Alan Cox
> 
> Additional explanation of problems with locking by Alan Cox.
> 
> Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
> Cc: Paul Gortmaker <p_gortmaker@yahoo.com>
> Cc: Jeff Garzik <jeff@garzik.org>

applied



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-30  7:29                           ` Marcin Ślusarz
  2007-07-30  8:49                             ` Ingo Molnar
@ 2007-07-31 13:20                             ` Jarek Poplawski
  2007-08-06  7:00                               ` Marcin Ślusarz
  2007-07-31 15:58                             ` [patch] genirq: temporary fix for level-triggered IRQ resend Ingo Molnar
  2 siblings, 1 reply; 69+ messages in thread
From: Jarek Poplawski @ 2007-07-31 13:20 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Mon, Jul 30, 2007 at 09:29:38AM +0200, Marcin Ślusarz wrote:
...
> ps: I retested all patches posted in this thread on top of 2.6.22.1
> and behavior from 2.6.21.3 didn't changed. My next tests will be on
> 2.6.22.x only.

Marcin,

I see you're quite busy, but if after testing this next Ingo's patch
you are alive yet, maybe you could try one more "idea"? No patch this
time, but if you could try this after adding boot option "noirqdebug"
(I'd like to be sure it's not about timinig after all).

Cheers & thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [patch] genirq: temporary fix for level-triggered IRQ resend
  2007-07-30  7:29                           ` Marcin Ślusarz
  2007-07-30  8:49                             ` Ingo Molnar
  2007-07-31 13:20                             ` Jarek Poplawski
@ 2007-07-31 15:58                             ` Ingo Molnar
  2007-07-31 16:00                               ` Ingo Molnar
  2007-08-02 17:03                               ` Gabriel C
  2 siblings, 2 replies; 69+ messages in thread
From: Ingo Molnar @ 2007-07-31 15:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox, marcin.slusarz


Linus,

with -rc2 approaching i think we should apply the minimal fix below to 
get Marcin's ne2k-pci networking back in working order. The 
WARN_ON_ONCE() will not prevent the system from working and it will be a 
reminder.

a better workaround would be to inhibit the resent vector via the 
IO-APIC irqchip - but i'd still like to have the patch below because the 
ne2k driver _should_ be able to survive the spurious irq that happens. 
(even on Marcin's system that ne2k-pci irq line is shared with another 
networking card, so an irq could happen at any moment - it's just that 
with the delayed-disable logic it happens _all the time_.)

	Ingo

----------------------->
From: Thomas Gleixner <tglx@linutronix.de>
Subject: genirq: temporary fix for level-triggered IRQ resend

delayed disable relies on the ability to re-trigger the interrupt in the
case that a real interrupt happens after the software disable was set.
In this case we actually disable the interrupt on the hardware level
_after_ it occurred.

On enable_irq, we need to re-trigger the interrupt. On i386 this relies
on a hardware resend mechanism (send_IPI_self()). 

Actually we only need the resend for edge type interrupts. Level type
interrupts come back once enable_irq() re-enables the interrupt line.

I assume that the interrupt in question is level triggered because it is
shared and above the legacy irqs 0-15:

	17:         12   IO-APIC-fasteoi   eth1, eth0

Looking into the IO_APIC code, the resend via send_IPI_self() happens
unconditionally. So the resend is done for level and edge interrupts.
This makes the problem more mysterious.

The code in question lib8390.c does

	disable_irq();
	fiddle_with_the_network_card_hardware()
	enable_irq();

The fiddle_with_the_network_card_hardware() might cause interrupts,
which are cleared in the same code path again,

Marcin found that when he disables the irq line on the hardware level
(removing the delayed disable) the card is kept alive.

So the difference is that we can get a resend on enable_irq, when an
interrupt happens during the time, where we are in the disabled region.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/irq/resend.c |    9 +++++++++
 1 file changed, 9 insertions(+)

Index: linux/kernel/irq/resend.c
===================================================================
--- linux.orig/kernel/irq/resend.c
+++ linux/kernel/irq/resend.c
@@ -62,6 +62,15 @@ void check_irq_resend(struct irq_desc *d
 	 */
 	desc->chip->enable(irq);
 
+	/*
+	 * Temporary hack to figure out more about the problem, which
+	 * is causing the ancient network cards to die.
+	 */
+	if (desc->handle_irq != handle_edge_irq) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+
 	if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
 		desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] genirq: temporary fix for level-triggered IRQ resend
  2007-07-31 15:58                             ` [patch] genirq: temporary fix for level-triggered IRQ resend Ingo Molnar
@ 2007-07-31 16:00                               ` Ingo Molnar
  2007-08-08 11:00                                 ` Jarek Poplawski
  2007-08-02 17:03                               ` Gabriel C
  1 sibling, 1 reply; 69+ messages in thread
From: Ingo Molnar @ 2007-07-31 16:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jarek Poplawski, Thomas Gleixner, Jean-Baptiste Vignaud,
	linux-kernel, shemminger, linux-net, netdev, Andrew Morton,
	Alan Cox, marcin.slusarz


* Ingo Molnar <mingo@elte.hu> wrote:

> Linus,
> 
> with -rc2 approaching i think we should apply the minimal fix below to 
> get Marcin's ne2k-pci networking back in working order. The 
> WARN_ON_ONCE() will not prevent the system from working and it will be 
> a reminder.

there's one more test-patch that Marcin has not tested yet (see below) - 
perhaps a POST artifact in ne2k could explain this bug.

	Ingo

------------------------->

* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> Ok the logic behind the 8390 is very simple:

thanks for the explanation Alan! A few comments and a question:

> Things to know
> 	- IRQ delivery is asynchronous to the PCI bus
> 	- Blocking the local CPU IRQ via spin locks was too slow
> 	- The chip has register windows needing locking work
> 
> So the path was once (I say once as people appear to have changed it 
> in the mean time and it now looks rather bogus if the changes to use 
> disable_irq_nosync_irqsave are disabling the local IRQ)
> 
> 
> 	Take the page lock
> 	Mask the IRQ on chip
> 	Disable the IRQ (but not mask locally- someone seems to have
> 		broken this with the lock validator stuff)
> 		[This must be _nosync as the page lock may otherwise
> 			deadlock us]

( side-note: you can ignore the lock validator stuff here, the validator
  changes are supposed to a NOP on the !lockdep case. Local irqs will
  only be disabled if the validator is running. This could cause dropped
  serial irqs on very old boxes but i doubt anyone will want to run the
  validator on those. )

> 	Drop the page lock and turn IRQs back on
> 	
> 	At this point an existing IRQ may still be running but we can't
> 	get a new one
> 
> 	Take the lock (so we know the IRQ has terminated) but don't mask
> the IRQs on the processor
> 	Set irqlock [for debug]
> 
> 	Transmit (slow as ****)
> 
> 	re-enable the IRQ
> 
> 
> We have to use disable_irq because otherwise you will get delayed 
> interrupts on the APIC bus deadlocking the transmit path.
> 
> Quite hairy but the chip simply wasn't designed for SMP and you can't 
> even ACK an interrupt without risking corrupting other parallel 
> activities on the chip.

So the whole locking is to be able to keep irqs enabled for a long time, 
without risking entry of the same IRQ handler on this same CPU, correct?

Marcin's test results suggest that if an IRQ is resent right at the 
enable_irq() point [be that via the hw irq-resend mechanism or the sw 
irq-resend mechanism], the hang happens.

In the previous 2.6.20 logic we'd not normally generate an IRQ at that 
point (because we masked the irq and the card itself deasserts the line 
so any level-triggered irq is now moot).

Once Thomas hacked off this resend mechanism for level-triggered irqs, 
Marcin saw the hangs go away.

So it seems to me that maybe the driver could be surprised via these 
spurious interrupts that happen right after the irq_enable(). Does the 
patch below make any sense in your opinion?

	Ingo

Index: linux/drivers/net/lib8390.c
===================================================================
--- linux.orig/drivers/net/lib8390.c
+++ linux/drivers/net/lib8390.c
@@ -375,6 +375,8 @@ static int ei_start_xmit(struct sk_buff 
 	/* Turn 8390 interrupts back on. */
 	ei_local->irqlock = 0;
 	ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
+	/* force POST: */
+	ei_inb_p(e8390_base + EN0_IMR);
 
 	spin_unlock(&ei_local->page_lock);
 	enable_irq_lockdep_irqrestore(dev->irq, &flags);

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-30  8:49                             ` Ingo Molnar
@ 2007-08-01  7:24                               ` Marcin Ślusarz
  2007-08-01  7:27                                 ` Ingo Molnar
  0 siblings, 1 reply; 69+ messages in thread
From: Marcin Ślusarz @ 2007-08-01  7:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/7/30, Ingo Molnar <mingo@elte.hu>:
> (..)
> does the patch below fix those timeouts? It tests the theory whether any
> POST latency could expose this problem.
>
>         Ingo
>
> Index: linux/drivers/net/lib8390.c
> ===================================================================
> --- linux.orig/drivers/net/lib8390.c
> +++ linux/drivers/net/lib8390.c
> @@ -375,6 +375,8 @@ static int ei_start_xmit(struct sk_buff
>         /* Turn 8390 interrupts back on. */
>         ei_local->irqlock = 0;
>         ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
> +       /* force POST: */
> +       ei_inb_p(e8390_base + EN0_IMR);
>
>         spin_unlock(&ei_local->page_lock);
>         enable_irq_lockdep_irqrestore(dev->irq, &flags);
>

Bad news. It doesn't fix the problem.

Marcin

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-01  7:24                               ` Marcin Ślusarz
@ 2007-08-01  7:27                                 ` Ingo Molnar
  2007-08-06  6:58                                   ` Marcin Ślusarz
  0 siblings, 1 reply; 69+ messages in thread
From: Ingo Molnar @ 2007-08-01  7:27 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox


* Marcin Ślusarz <marcin.slusarz@gmail.com> wrote:

> >         ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
> > +       /* force POST: */
> > +       ei_inb_p(e8390_base + EN0_IMR);
> >
> >         spin_unlock(&ei_local->page_lock);
> >         enable_irq_lockdep_irqrestore(dev->irq, &flags);
> >
> 
> Bad news. It doesn't fix the problem.

ok, it wasnt supposed to be _that_ easy i guess :-) Can you please 
(re-)confirm that the workaround below indeed fixes the hung card 
problem? (after producing a single WARN_ON message into the syslog)

also, does removing the ne2k-pci module and reinserting it again solve 
the problem too, or is your network card stuck forever once it got into 
that state?

	Ingo

----------------------->
From: Thomas Gleixner <tglx@linutronix.de>
Subject: genirq: temporary fix for level-triggered IRQ resend

delayed disable relies on the ability to re-trigger the interrupt in the
case that a real interrupt happens after the software disable was set.
In this case we actually disable the interrupt on the hardware level
_after_ it occurred.

On enable_irq, we need to re-trigger the interrupt. On i386 this relies
on a hardware resend mechanism (send_IPI_self()). 

Actually we only need the resend for edge type interrupts. Level type
interrupts come back once enable_irq() re-enables the interrupt line.

I assume that the interrupt in question is level triggered because it is
shared and above the legacy irqs 0-15:

	17:         12   IO-APIC-fasteoi   eth1, eth0

Looking into the IO_APIC code, the resend via send_IPI_self() happens
unconditionally. So the resend is done for level and edge interrupts.
This makes the problem more mysterious.

The code in question lib8390.c does

	disable_irq();
	fiddle_with_the_network_card_hardware()
	enable_irq();

The fiddle_with_the_network_card_hardware() might cause interrupts,
which are cleared in the same code path again,

Marcin found that when he disables the irq line on the hardware level
(removing the delayed disable) the card is kept alive.

So the difference is that we can get a resend on enable_irq, when an
interrupt happens during the time, where we are in the disabled region.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/irq/resend.c |    9 +++++++++
 1 file changed, 9 insertions(+)

Index: linux/kernel/irq/resend.c
===================================================================
--- linux.orig/kernel/irq/resend.c
+++ linux/kernel/irq/resend.c
@@ -62,6 +62,15 @@ void check_irq_resend(struct irq_desc *d
 	 */
 	desc->chip->enable(irq);
 
+	/*
+	 * Temporary hack to figure out more about the problem, which
+	 * is causing the ancient network cards to die.
+	 */
+	if (desc->handle_irq != handle_edge_irq) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+
 	if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
 		desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] genirq: temporary fix for level-triggered IRQ resend
  2007-07-31 15:58                             ` [patch] genirq: temporary fix for level-triggered IRQ resend Ingo Molnar
  2007-07-31 16:00                               ` Ingo Molnar
@ 2007-08-02 17:03                               ` Gabriel C
  2007-08-02 20:11                                 ` Ingo Molnar
  1 sibling, 1 reply; 69+ messages in thread
From: Gabriel C @ 2007-08-02 17:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Jarek Poplawski, Thomas Gleixner,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox, marcin.slusarz

Ingo Molnar wrote:
> Linus,
> 
> with -rc2 approaching i think we should apply the minimal fix below to 
> get Marcin's ne2k-pci networking back in working order. The 
> WARN_ON_ONCE() will not prevent the system from working and it will be a 
> reminder.
> 
> a better workaround would be to inhibit the resent vector via the 
> IO-APIC irqchip - but i'd still like to have the patch below because the 
> ne2k driver _should_ be able to survive the spurious irq that happens. 
> (even on Marcin's system that ne2k-pci irq line is shared with another 
> networking card, so an irq could happen at any moment - it's just that 
> with the delayed-disable logic it happens _all the time_.)
> 

I get a warning on each boot now with this patch .. 

[   63.686613] WARNING: at kernel/irq/resend.c:70 check_irq_resend()
[   63.686636]  [<c013c55c>] check_irq_resend+0x8c/0xa0
[   63.686653]  [<c013c15f>] enable_irq+0xad/0xb3
[   63.686662]  [<e886481e>] vortex_timer+0x20c/0x3d5 [3c59x]
[   63.686675]  [<c01164b9>] scheduler_tick+0x154/0x273
[   63.686685]  [<c012fed1>] getnstimeofday+0x34/0xe3
[   63.686697]  [<c0121f4a>] run_timer_softirq+0x137/0x197
[   63.686709]  [<e8864612>] vortex_timer+0x0/0x3d5 [3c59x]
[   63.686720]  [<c011ed09>] __do_softirq+0x75/0xe1
[   63.686729]  [<c011edac>] do_softirq+0x37/0x3d
[   63.686735]  [<c011ef85>] irq_exit+0x7c/0x7e
[   63.686740]  [<c010e013>] smp_apic_timer_interrupt+0x59/0x84
[   63.686751]  [<c0103428>] apic_timer_interrupt+0x28/0x30
[   63.686759]  [<c0101355>] default_idle+0x0/0x3f
[   63.686767]  [<c0101385>] default_idle+0x30/0x3f
[   63.686773]  [<c0100c19>] cpu_idle+0x5e/0x8e
[   63.686779]  [<c03fdc5f>] start_kernel+0x2d7/0x368


That means ?:)


> 	Ingo
> 


Gabriel

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] genirq: temporary fix for level-triggered IRQ resend
  2007-08-02 17:03                               ` Gabriel C
@ 2007-08-02 20:11                                 ` Ingo Molnar
  2007-08-03  6:07                                   ` [patch] genirq: fix simple and fasteoi irq handlers Jarek Poplawski
  2007-08-06  6:07                                   ` [patch (take 2)] " Jarek Poplawski
  0 siblings, 2 replies; 69+ messages in thread
From: Ingo Molnar @ 2007-08-02 20:11 UTC (permalink / raw)
  To: Gabriel C
  Cc: Linus Torvalds, Jarek Poplawski, Thomas Gleixner,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox, marcin.slusarz


* Gabriel C <nix.or.die@googlemail.com> wrote:

> I get a warning on each boot now with this patch ..
> 
> [   63.686613] WARNING: at kernel/irq/resend.c:70 check_irq_resend()
> [   63.686636]  [<c013c55c>] check_irq_resend+0x8c/0xa0
> [   63.686653]  [<c013c15f>] enable_irq+0xad/0xb3
> [   63.686662]  [<e886481e>] vortex_timer+0x20c/0x3d5 [3c59x]
> [   63.686675]  [<c01164b9>] scheduler_tick+0x154/0x273
> [   63.686685]  [<c012fed1>] getnstimeofday+0x34/0xe3
> [   63.686697]  [<c0121f4a>] run_timer_softirq+0x137/0x197
> [   63.686709]  [<e8864612>] vortex_timer+0x0/0x3d5 [3c59x]
> [   63.686720]  [<c011ed09>] __do_softirq+0x75/0xe1
> [   63.686729]  [<c011edac>] do_softirq+0x37/0x3d
> [   63.686735]  [<c011ef85>] irq_exit+0x7c/0x7e
> [   63.686740]  [<c010e013>] smp_apic_timer_interrupt+0x59/0x84
> [   63.686751]  [<c0103428>] apic_timer_interrupt+0x28/0x30
> [   63.686759]  [<c0101355>] default_idle+0x0/0x3f
> [   63.686767]  [<c0101385>] default_idle+0x30/0x3f
> [   63.686773]  [<c0100c19>] cpu_idle+0x5e/0x8e
> [   63.686779]  [<c03fdc5f>] start_kernel+0x2d7/0x368
> 
> 
> That means ?:)

if your network still works fine then you can ignore it :-)

we are still trying to figure out what happens with ne2k-pci. The 
message will vanish soon.

	Ingo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [patch] genirq: fix simple and fasteoi irq handlers
  2007-08-02 20:11                                 ` Ingo Molnar
@ 2007-08-03  6:07                                   ` Jarek Poplawski
  2007-08-03  8:04                                     ` Ingo Molnar
  2007-08-06  7:05                                     ` Marcin Ślusarz
  2007-08-06  6:07                                   ` [patch (take 2)] " Jarek Poplawski
  1 sibling, 2 replies; 69+ messages in thread
From: Jarek Poplawski @ 2007-08-03  6:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gabriel C, Linus Torvalds, Thomas Gleixner,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox, marcin.slusarz

On Thu, Aug 02, 2007 at 10:11:26PM +0200, Ingo Molnar wrote:
> 
> * Gabriel C <nix.or.die@googlemail.com> wrote:
> 
> > I get a warning on each boot now with this patch ..
> > 
> > [   63.686613] WARNING: at kernel/irq/resend.c:70 check_irq_resend()
...
> we are still trying to figure out what happens with ne2k-pci. The 
> message will vanish soon.

Hi,

I can't guarantee this is all needed to fix this bug, but I think
this patch is necessary here.

Regards,
Jarek P.
------------>

Subject: genirq: fix simple and fasteoi irq handlers

After the "genirq: do not mask interrupts by default" patch interrupts
should be disabled not immediately upon request, but after they happen.
But, handle_simple_irq() and handle_fasteoi_irq() can skip this once or
more if an irq is just serviced (IRQ_INPROGRESS), possibly disrupting a
driver's work.

The main reason of problems here, pointing the broken patch and making
the first patch which can fix this was done by Marcin Slusarz.
Additional test patches of Thomas Gleixner and Ingo Molnar tested by
Marcin Slusarz helped to narrow possible reasons even more. Thanks.

PS: this patch fixes only one evident error here, but there could be
more places affected by above-mentioned change in irq handling.


Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>

---

diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
--- 2.6.23-rc1-/kernel/irq/chip.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.23-rc1/kernel/irq/chip.c	2007-08-02 20:42:38.000000000 +0200
@@ -295,12 +295,11 @@ handle_simple_irq(unsigned int irq, stru
 
 	spin_lock(&desc->lock);
 
-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out_unlock;
 	kstat_cpu(cpu).irqs[irq]++;
 
 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		if (desc->chip->mask)
 			desc->chip->mask(irq);
 		desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
@@ -392,18 +391,16 @@ handle_fasteoi_irq(unsigned int irq, str
 
 	spin_lock(&desc->lock);
 
-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out;
-
 	desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
 	kstat_cpu(cpu).irqs[irq]++;
 
 	/*
-	 * If its disabled or no action available
+	 * If it's running, disabled or no action available
 	 * then mask it and get out of here:
 	 */
 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		desc->status |= IRQ_PENDING;
 		if (desc->chip->mask)
 			desc->chip->mask(irq);

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] genirq: fix simple and fasteoi irq handlers
  2007-08-03  6:07                                   ` [patch] genirq: fix simple and fasteoi irq handlers Jarek Poplawski
@ 2007-08-03  8:04                                     ` Ingo Molnar
  2007-08-03  8:46                                       ` Ingo Molnar
  2007-08-03  9:10                                       ` Jarek Poplawski
  2007-08-06  7:05                                     ` Marcin Ślusarz
  1 sibling, 2 replies; 69+ messages in thread
From: Ingo Molnar @ 2007-08-03  8:04 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Gabriel C, Linus Torvalds, Thomas Gleixner,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox, marcin.slusarz


* Jarek Poplawski <jarkao2@o2.pl> wrote:

> I can't guarantee this is all needed to fix this bug, but I think this 
> patch is necessary here.

hmmm ... very interesting! Now _this_ is something we'd like to see 
tested. Could you send a patch to Marcin that also undoes the workaround 
we have in place now, so that he could check whether ne2k-pci works fine 
with your fix alone?

	Ingo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] genirq: fix simple and fasteoi irq handlers
  2007-08-03  8:04                                     ` Ingo Molnar
@ 2007-08-03  8:46                                       ` Ingo Molnar
  2007-08-03  9:10                                       ` Jarek Poplawski
  1 sibling, 0 replies; 69+ messages in thread
From: Ingo Molnar @ 2007-08-03  8:46 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Gabriel C, Linus Torvalds, Thomas Gleixner,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox, marcin.slusarz


* Ingo Molnar <mingo@elte.hu> wrote:

> * Jarek Poplawski <jarkao2@o2.pl> wrote:
> 
> > I can't guarantee this is all needed to fix this bug, but I think 
> > this patch is necessary here.
> 
> hmmm ... very interesting! Now _this_ is something we'd like to see 
> tested. Could you send a patch to Marcin that also undoes the 
> workaround we have in place now, so that he could check whether 
> ne2k-pci works fine with your fix alone?

or it would be nice if Marcin could test pure 2.6.22 plus your fix 
(without any other patches applied).

	Ingo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] genirq: fix simple and fasteoi irq handlers
  2007-08-03  8:04                                     ` Ingo Molnar
  2007-08-03  8:46                                       ` Ingo Molnar
@ 2007-08-03  9:10                                       ` Jarek Poplawski
  2007-08-03 11:57                                         ` Marcin Ślusarz
  1 sibling, 1 reply; 69+ messages in thread
From: Jarek Poplawski @ 2007-08-03  9:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gabriel C, Linus Torvalds, Thomas Gleixner,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox, marcin.slusarz

On Fri, Aug 03, 2007 at 10:04:08AM +0200, Ingo Molnar wrote:
> 
> * Jarek Poplawski <jarkao2@o2.pl> wrote:
> 
> > I can't guarantee this is all needed to fix this bug, but I think this 
> > patch is necessary here.
> 
> hmmm ... very interesting! Now _this_ is something we'd like to see 
> tested. Could you send a patch to Marcin that also undoes the workaround 
> we have in place now, so that he could check whether ne2k-pci works fine 
> with your fix alone?

I'm not sure this is needed... Marcin got this patch, I hope, and I
don't have another possibility to contact with him. Since he managed
with this bisection and all the previous patches I don't think there
could be any problems, so:

Marcin! I'd be very glad if you could test this patch alone; this
should apply without any problems to 2.6.21 (with some offset) and
later "vanilla" versions (or try to revert Ingo's last patch with
patch -p1 -R). Please, contact me on any problems (alas not during
the weekend...).

Thanks,
Jarek P.

PS: of course, I'm very curious of this testing too, but, on the other
hand, as I've written earlier, I think this patch is needed for logical
reasons only, and it really doesn't look like it could make any damage
here.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] genirq: fix simple and fasteoi irq handlers
  2007-08-03  9:10                                       ` Jarek Poplawski
@ 2007-08-03 11:57                                         ` Marcin Ślusarz
  2007-08-03 12:26                                           ` Jarek Poplawski
  0 siblings, 1 reply; 69+ messages in thread
From: Marcin Ślusarz @ 2007-08-03 11:57 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Ingo Molnar, Gabriel C, Linus Torvalds, Thomas Gleixner,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/8/3, Jarek Poplawski <jarkao2@o2.pl>:
> On Fri, Aug 03, 2007 at 10:04:08AM +0200, Ingo Molnar wrote:
> >
> > * Jarek Poplawski <jarkao2@o2.pl> wrote:
> >
> > > I can't guarantee this is all needed to fix this bug, but I think this
> > > patch is necessary here.
> >
> > hmmm ... very interesting! Now _this_ is something we'd like to see
> > tested. Could you send a patch to Marcin that also undoes the workaround
> > we have in place now, so that he could check whether ne2k-pci works fine
> > with your fix alone?
>
> I'm not sure this is needed... Marcin got this patch, I hope, and I
> don't have another possibility to contact with him. Since he managed
> with this bisection and all the previous patches I don't think there
> could be any problems, so:
>
> Marcin! I'd be very glad if you could test this patch alone; this
> should apply without any problems to 2.6.21 (with some offset) and
> later "vanilla" versions (or try to revert Ingo's last patch with
> patch -p1 -R). Please, contact me on any problems (alas not during
> the weekend...).

I'll test this patch tomorrow (and confirm that the last one from Ingo
works fine) and report results on monday (sorry, no internet at home
since I moved out of city :|).

Marcin

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] genirq: fix simple and fasteoi irq handlers
  2007-08-03 11:57                                         ` Marcin Ślusarz
@ 2007-08-03 12:26                                           ` Jarek Poplawski
  0 siblings, 0 replies; 69+ messages in thread
From: Jarek Poplawski @ 2007-08-03 12:26 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Gabriel C, Linus Torvalds, Thomas Gleixner,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Fri, Aug 03, 2007 at 01:57:00PM +0200, Marcin Ślusarz wrote:
...
> I'll test this patch tomorrow (and confirm that the last one from Ingo
> works fine) and report results on monday (sorry, no internet at home
> since I moved out of city :|).

So, you are a lucky guy! I have only no internet at home.
...and time for dreaming about moving out of a city...

Cheers,
Jarek P.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [patch (take 2)] genirq: fix simple and fasteoi irq handlers
  2007-08-02 20:11                                 ` Ingo Molnar
  2007-08-03  6:07                                   ` [patch] genirq: fix simple and fasteoi irq handlers Jarek Poplawski
@ 2007-08-06  6:07                                   ` Jarek Poplawski
  2007-08-06  6:14                                     ` Ingo Molnar
  1 sibling, 1 reply; 69+ messages in thread
From: Jarek Poplawski @ 2007-08-06  6:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gabriel C, Linus Torvalds, Thomas Gleixner,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox, marcin.slusarz

Hi,

Sorry, guys! I tried to be so logical that I forgot to unmask my eyes.

Regards,
Jarek P.
------------>
(take 2)

Subject: genirq: fix simple and fasteoi irq handlers

After the "genirq: do not mask interrupts by default" patch interrupts
should be disabled not immediately upon request, but after they happen.
But, handle_simple_irq() and handle_fasteoi_irq() can skip this once or
more if an irq is just serviced (IRQ_INPROGRESS), possibly disrupting a
driver's work.

The main reason of problems here, pointing the broken patch and making
the first patch which can fix this was done by Marcin Slusarz.
Additional test patches of Thomas Gleixner and Ingo Molnar tested by
Marcin Slusarz helped to narrow possible reasons even more. Thanks.

PS: this patch fixes only one evident error here, but there could be
more places affected by above-mentioned change in irq handling.

PS 2:
After rethinking, IMHO, there are two most probable scenarios here:

1. After hw resend there could be a conflict between retriggered
edge type irq and the next level type one: e.g. if this level type
irq (io_apic is enabled then) is triggered while retriggered irq is
serviced (IRQ_INPROGRESS) there is goto out with eoi, and probably
the next such levels are triggered and looping, so probably kind of
flood in io_apic until this retriggered edge service has ended. 
2. There is something wrong with ioapic_retrigger_irq (less probable
because this should be probably seen with 'normal' edge retriggers,
but on the other hand, they could be less common).

So, if there is #1, this fixed patch should work.

But, since level types don't need this retriggers too much I think
this "don't mask interrupts by default" idea should be rethinked:
is there enough gain to risk such hard to diagnose errors?
  
So, IMHO, there should be at least possibility to turn this off for
level types in config (it should be a visible option, so people could
find & try this before writing for help or changing a network card).


Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>

---

diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
--- 2.6.23-rc1-/kernel/irq/chip.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.23-rc1/kernel/irq/chip.c	2007-08-05 21:49:46.000000000 +0200
@@ -295,12 +295,11 @@ handle_simple_irq(unsigned int irq, stru
 
 	spin_lock(&desc->lock);
 
-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out_unlock;
 	kstat_cpu(cpu).irqs[irq]++;
 
 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		if (desc->chip->mask)
 			desc->chip->mask(irq);
 		desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
@@ -318,6 +317,8 @@ handle_simple_irq(unsigned int irq, stru
 
 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
+	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+		desc->chip->unmask(irq);
 out_unlock:
 	spin_unlock(&desc->lock);
 }
@@ -392,18 +393,16 @@ handle_fasteoi_irq(unsigned int irq, str
 
 	spin_lock(&desc->lock);
 
-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out;
-
 	desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
 	kstat_cpu(cpu).irqs[irq]++;
 
 	/*
-	 * If its disabled or no action available
+	 * If it's running, disabled or no action available
 	 * then mask it and get out of here:
 	 */
 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		desc->status |= IRQ_PENDING;
 		if (desc->chip->mask)
 			desc->chip->mask(irq);
@@ -420,6 +419,8 @@ handle_fasteoi_irq(unsigned int irq, str
 
 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
+	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+		desc->chip->unmask(irq);
 out:
 	desc->chip->eoi(irq);
 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch (take 2)] genirq: fix simple and fasteoi irq handlers
  2007-08-06  6:07                                   ` [patch (take 2)] " Jarek Poplawski
@ 2007-08-06  6:14                                     ` Ingo Molnar
  2007-08-06  7:07                                       ` Marcin Ślusarz
  2007-08-06  7:19                                       ` Jarek Poplawski
  0 siblings, 2 replies; 69+ messages in thread
From: Ingo Molnar @ 2007-08-06  6:14 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Gabriel C, Linus Torvalds, Thomas Gleixner,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox, marcin.slusarz


* Jarek Poplawski <jarkao2@o2.pl> wrote:

> Subject: genirq: fix simple and fasteoi irq handlers
> 
> After the "genirq: do not mask interrupts by default" patch interrupts 
> should be disabled not immediately upon request, but after they 
> happen. But, handle_simple_irq() and handle_fasteoi_irq() can skip 
> this once or more if an irq is just serviced (IRQ_INPROGRESS), 
> possibly disrupting a driver's work.

nice fix. I think this is exactly the type of bug we were hoping to be 
able to identify and fix, and it could explain the regression in its 
entirety. The big question - does it fix Marcin's regression?

	Ingo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-01  7:27                                 ` Ingo Molnar
@ 2007-08-06  6:58                                   ` Marcin Ślusarz
  0 siblings, 0 replies; 69+ messages in thread
From: Marcin Ślusarz @ 2007-08-06  6:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/8/1, Ingo Molnar <mingo@elte.hu>:
> ok, it wasnt supposed to be _that_ easy i guess :-) Can you please
> (re-)confirm that the workaround below indeed fixes the hung card
> problem? (after producing a single WARN_ON message into the syslog)
yes, with this patch everything works fine

end of dmesg:

EXT3 FS on sda7, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Adding 1020112k swap on /dev/sda2.  Priority:-1 extents:1 across:1020112k
skge eth1: enabling interface
NET: Registered protocol family 17
WARNING: at kernel/irq/resend.c:70 check_irq_resend()

Call Trace:
 [<ffffffff8025e5c8>] check_irq_resend+0xa8/0xc0
 [<ffffffff8025e1ca>] enable_irq+0xea/0xf0
 [<ffffffff8800f21d>] :8390:ei_start_xmit+0x14d/0x30c
 [<ffffffff8052b5ce>] dev_hard_start_xmit+0x26e/0x2d0
 [<ffffffff80539b10>] __qdisc_run+0xc0/0x1f0
 [<ffffffff8052db9f>] dev_queue_xmit+0x24f/0x310
 [<ffffffff880d7ac9>] :af_packet:packet_sendmsg+0x259/0x2c0
 [<ffffffff8051f0bf>] sock_sendmsg+0xdf/0x110
 [<ffffffff8024b8c9>] trace_hardirqs_on+0xd9/0x180
 [<ffffffff8024c1dd>] __lock_acquire+0x31d/0xff0
 [<ffffffff80243290>] autoremove_wake_function+0x0/0x40
 [<ffffffff803e3103>] __up_read+0x23/0xb0
 [<ffffffff803e3125>] __up_read+0x45/0xb0
 [<ffffffff805bd8f5>] _spin_unlock_irqrestore+0x65/0x80
 [<ffffffff8024b8c9>] trace_hardirqs_on+0xd9/0x180
 [<ffffffff803e3125>] __up_read+0x45/0xb0
 [<ffffffff802464b6>] up_read+0x26/0x30
 [<ffffffff8051f4f1>] sys_sendto+0x111/0x150
 [<ffffffff8024b8c9>] trace_hardirqs_on+0xd9/0x180
 [<ffffffff805bd93b>] _spin_unlock_irq+0x2b/0x60
 [<ffffffff8023861a>] do_sigaction+0x11a/0x1d0
 [<ffffffff802097fe>] system_call+0x7e/0x83

Marking TSC unstable due to cpufreq changes
Time: acpi_pm clocksource has been installed.

> also, does removing the ne2k-pci module and reinserting it again solve
> the problem too, or is your network card stuck forever once it got into
> that state?
it doesn't change anything - i tried reloading both modules (ne2k_pci and skge)

Marcin

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-07-31 13:20                             ` Jarek Poplawski
@ 2007-08-06  7:00                               ` Marcin Ślusarz
  2007-08-06  7:03                                 ` Ingo Molnar
  0 siblings, 1 reply; 69+ messages in thread
From: Marcin Ślusarz @ 2007-08-06  7:00 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/7/31, Jarek Poplawski <jarkao2@o2.pl>:
> Marcin,
>
> I see you're quite busy, but if after testing this next Ingo's patch
> you are alive yet, maybe you could try one more "idea"? No patch this
> time, but if you could try this after adding boot option "noirqdebug"
> (I'd like to be sure it's not about timinig after all).
It didn't change anything. Network card still timed out.

Marcin

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-06  7:00                               ` Marcin Ślusarz
@ 2007-08-06  7:03                                 ` Ingo Molnar
  2007-08-06 17:43                                   ` Chuck Ebbert
  2007-08-07  7:46                                   ` Marcin Ślusarz
  0 siblings, 2 replies; 69+ messages in thread
From: Ingo Molnar @ 2007-08-06  7:03 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox


* Marcin Ślusarz <marcin.slusarz@gmail.com> wrote:

> 2007/7/31, Jarek Poplawski <jarkao2@o2.pl>:
> > Marcin,
> >
> > I see you're quite busy, but if after testing this next Ingo's patch
> > you are alive yet, maybe you could try one more "idea"? No patch this
> > time, but if you could try this after adding boot option "noirqdebug"
> > (I'd like to be sure it's not about timinig after all).
> It didn't change anything. Network card still timed out.

please try Jarek's second patch too - there was a missing unmask.

	Ingo

-------------->
Subject: genirq: fix simple and fasteoi irq handlers
From: Jarek Poplawski <jarkao2@o2.pl>

After the "genirq: do not mask interrupts by default" patch interrupts
should be disabled not immediately upon request, but after they happen.
But, handle_simple_irq() and handle_fasteoi_irq() can skip this once or
more if an irq is just serviced (IRQ_INPROGRESS), possibly disrupting a
driver's work.

The main reason of problems here, pointing the broken patch and making
the first patch which can fix this was done by Marcin Slusarz.
Additional test patches of Thomas Gleixner and Ingo Molnar tested by
Marcin Slusarz helped to narrow possible reasons even more. Thanks.

PS: this patch fixes only one evident error here, but there could be
more places affected by above-mentioned change in irq handling.

PS 2:
After rethinking, IMHO, there are two most probable scenarios here:

1. After hw resend there could be a conflict between retriggered
edge type irq and the next level type one: e.g. if this level type
irq (io_apic is enabled then) is triggered while retriggered irq is
serviced (IRQ_INPROGRESS) there is goto out with eoi, and probably
the next such levels are triggered and looping, so probably kind of
flood in io_apic until this retriggered edge service has ended. 
2. There is something wrong with ioapic_retrigger_irq (less probable
because this should be probably seen with 'normal' edge retriggers,
but on the other hand, they could be less common).

So, if there is #1, this fixed patch should work.

But, since level types don't need this retriggers too much I think
this "don't mask interrupts by default" idea should be rethinked:
is there enough gain to risk such hard to diagnose errors?
  
So, IMHO, there should be at least possibility to turn this off for
level types in config (it should be a visible option, so people could
find & try this before writing for help or changing a network card).


Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>

---

diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
--- 2.6.23-rc1-/kernel/irq/chip.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.23-rc1/kernel/irq/chip.c	2007-08-05 21:49:46.000000000 +0200
@@ -295,12 +295,11 @@ handle_simple_irq(unsigned int irq, stru
 
 	spin_lock(&desc->lock);
 
-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out_unlock;
 	kstat_cpu(cpu).irqs[irq]++;
 
 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		if (desc->chip->mask)
 			desc->chip->mask(irq);
 		desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
@@ -318,6 +317,8 @@ handle_simple_irq(unsigned int irq, stru
 
 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
+	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+		desc->chip->unmask(irq);
 out_unlock:
 	spin_unlock(&desc->lock);
 }
@@ -392,18 +393,16 @@ handle_fasteoi_irq(unsigned int irq, str
 
 	spin_lock(&desc->lock);
 
-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out;
-
 	desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
 	kstat_cpu(cpu).irqs[irq]++;
 
 	/*
-	 * If its disabled or no action available
+	 * If it's running, disabled or no action available
 	 * then mask it and get out of here:
 	 */
 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		desc->status |= IRQ_PENDING;
 		if (desc->chip->mask)
 			desc->chip->mask(irq);
@@ -420,6 +419,8 @@ handle_fasteoi_irq(unsigned int irq, str
 
 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
+	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+		desc->chip->unmask(irq);
 out:
 	desc->chip->eoi(irq);
 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] genirq: fix simple and fasteoi irq handlers
  2007-08-03  6:07                                   ` [patch] genirq: fix simple and fasteoi irq handlers Jarek Poplawski
  2007-08-03  8:04                                     ` Ingo Molnar
@ 2007-08-06  7:05                                     ` Marcin Ślusarz
  1 sibling, 0 replies; 69+ messages in thread
From: Marcin Ślusarz @ 2007-08-06  7:05 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Ingo Molnar, Gabriel C, Linus Torvalds, Thomas Gleixner,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/8/3, Jarek Poplawski <jarkao2@o2.pl>:
> Hi,
>
> I can't guarantee this is all needed to fix this bug, but I think
> this patch is necessary here.
>
> Regards,
> Jarek P.
> ------------>
>
> Subject: genirq: fix simple and fasteoi irq handlers
>
> After the "genirq: do not mask interrupts by default" patch interrupts
> should be disabled not immediately upon request, but after they happen.
> But, handle_simple_irq() and handle_fasteoi_irq() can skip this once or
> more if an irq is just serviced (IRQ_INPROGRESS), possibly disrupting a
> driver's work.
>
> The main reason of problems here, pointing the broken patch and making
> the first patch which can fix this was done by Marcin Slusarz.
> Additional test patches of Thomas Gleixner and Ingo Molnar tested by
> Marcin Slusarz helped to narrow possible reasons even more. Thanks.
>
> PS: this patch fixes only one evident error here, but there could be
> more places affected by above-mentioned change in irq handling.
>
>
> Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
>
> ---
>
> diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
> --- 2.6.23-rc1-/kernel/irq/chip.c       2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.23-rc1/kernel/irq/chip.c        2007-08-02 20:42:38.000000000 +0200
> @@ -295,12 +295,11 @@ handle_simple_irq(unsigned int irq, stru
>
>         spin_lock(&desc->lock);
>
> -       if (unlikely(desc->status & IRQ_INPROGRESS))
> -               goto out_unlock;
>         kstat_cpu(cpu).irqs[irq]++;
>
>         action = desc->action;
> -       if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
> +       if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
> +                                                IRQ_DISABLED)))) {
>                 if (desc->chip->mask)
>                         desc->chip->mask(irq);
>                 desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
> @@ -392,18 +391,16 @@ handle_fasteoi_irq(unsigned int irq, str
>
>         spin_lock(&desc->lock);
>
> -       if (unlikely(desc->status & IRQ_INPROGRESS))
> -               goto out;
> -
>         desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
>         kstat_cpu(cpu).irqs[irq]++;
>
>         /*
> -        * If its disabled or no action available
> +        * If it's running, disabled or no action available
>          * then mask it and get out of here:
>          */
>         action = desc->action;
> -       if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
> +       if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
> +                                                IRQ_DISABLED)))) {
>                 desc->status |= IRQ_PENDING;
>                 if (desc->chip->mask)
>                         desc->chip->mask(irq);
>
This patch didn't fix my NIC (tried on 2.6.22.1)

Marcin

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch (take 2)] genirq: fix simple and fasteoi irq handlers
  2007-08-06  6:14                                     ` Ingo Molnar
@ 2007-08-06  7:07                                       ` Marcin Ślusarz
  2007-08-06  7:19                                       ` Jarek Poplawski
  1 sibling, 0 replies; 69+ messages in thread
From: Marcin Ślusarz @ 2007-08-06  7:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jarek Poplawski, Gabriel C, Linus Torvalds, Thomas Gleixner,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/8/6, Ingo Molnar <mingo@elte.hu>:
>
> * Jarek Poplawski <jarkao2@o2.pl> wrote:
>
> > Subject: genirq: fix simple and fasteoi irq handlers
> >
> > After the "genirq: do not mask interrupts by default" patch interrupts
> > should be disabled not immediately upon request, but after they
> > happen. But, handle_simple_irq() and handle_fasteoi_irq() can skip
> > this once or more if an irq is just serviced (IRQ_INPROGRESS),
> > possibly disrupting a driver's work.
>
> nice fix. I think this is exactly the type of bug we were hoping to be
> able to identify and fix, and it could explain the regression in its
> entirety. The big question - does it fix Marcin's regression?
I'll try this patch tomorrow.

Marcin

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch (take 2)] genirq: fix simple and fasteoi irq handlers
  2007-08-06  6:14                                     ` Ingo Molnar
  2007-08-06  7:07                                       ` Marcin Ślusarz
@ 2007-08-06  7:19                                       ` Jarek Poplawski
  1 sibling, 0 replies; 69+ messages in thread
From: Jarek Poplawski @ 2007-08-06  7:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gabriel C, Linus Torvalds, Thomas Gleixner,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox, marcin.slusarz

On Mon, Aug 06, 2007 at 08:14:59AM +0200, Ingo Molnar wrote:
> 
> * Jarek Poplawski <jarkao2@o2.pl> wrote:
> 
> > Subject: genirq: fix simple and fasteoi irq handlers
> > 
> > After the "genirq: do not mask interrupts by default" patch interrupts 
> > should be disabled not immediately upon request, but after they 
> > happen. But, handle_simple_irq() and handle_fasteoi_irq() can skip 
> > this once or more if an irq is just serviced (IRQ_INPROGRESS), 
> > possibly disrupting a driver's work.
> 
> nice fix. I think this is exactly the type of bug we were hoping to be 
> able to identify and fix, and it could explain the regression in its 
> entirety. The big question - does it fix Marcin's regression?

Alas, there still could be something more... To be more sure, even
with this, there should be some debug printk (which could mess too),
but I don't know how much patience (and similar boxes...) Marcin has.
Of course, this "temporary fix" in -rc2 should give us more time.
But, I think you should confirm this gain with levels (I mean there
could be some saving on flag setting/ checking too). E.g. I've thought
about adding another ioapic_chip struct for fasteoi without .retrigger
(and maybe with .disable = .mask) maybe with some #ifdef CONFIG_...,
but maybe there could be reconsidered IRQ_DELAYED_DISABLE too (but
with this, there probably was a possibility to run this hw ->retrigger
'by chance' too, so with some strange IO-APICS there would be still
an unnecessary risk here).

The big question for me is still why this isn't more common: it seems
some (most of?) IO-APICS have some safety against this?

BTW: Marcin, if you're still willing to test anything (and your box is
alive after my previous 'could not make any damage' patch - sorry!),
this should be done with something before -rc2, so 2.6.22 or .23-rc1.

Jarek P.

PS: I've just read Marcin's messages - so, happily, it seems
everybody's alive! Thanks.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-06  7:03                                 ` Ingo Molnar
@ 2007-08-06 17:43                                   ` Chuck Ebbert
  2007-08-06 19:08                                     ` Ingo Molnar
  2007-08-07 10:09                                     ` Jarek Poplawski
  2007-08-07  7:46                                   ` Marcin Ślusarz
  1 sibling, 2 replies; 69+ messages in thread
From: Chuck Ebbert @ 2007-08-06 17:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Marcin Ślusarz, Jarek Poplawski, Thomas Gleixner,
	Linus Torvalds, Jean-Baptiste Vignaud, linux-kernel, shemminger,
	linux-net, netdev, Andrew Morton, Alan Cox

On 08/06/2007 03:03 AM, Ingo Molnar wrote:
> 
> But, since level types don't need this retriggers too much I think
> this "don't mask interrupts by default" idea should be rethinked:
> is there enough gain to risk such hard to diagnose errors?
>   
> 

I reverted those masking changes in Fedora and the baffling problem
with 3Com 3C905 network adapters went away.

Before, they would print:

eth0: transmit timed out, tx_status 00 status e601.
  diagnostics: net 0ccc media 8880 dma 0000003a fifo 0000
eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
  Flags; bus-master 1, dirty 295757(13) current 295757(13)
  Transmit list 00000000 vs. f7150a20.
  0: @f7150200  length 80000070 status 0c010070
  1: @f71502a0  length 80000070 status 0c010070
  2: @f7150340  length 8000005c status 0c01005c

Now they just work, apparently...

So why not just revert the change?


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-06 17:43                                   ` Chuck Ebbert
@ 2007-08-06 19:08                                     ` Ingo Molnar
  2007-08-09 14:50                                       ` [RFC] " Jarek Poplawski
  2007-08-07 10:09                                     ` Jarek Poplawski
  1 sibling, 1 reply; 69+ messages in thread
From: Ingo Molnar @ 2007-08-06 19:08 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Marcin Ślusarz, Jarek Poplawski, Thomas Gleixner,
	Linus Torvalds, Jean-Baptiste Vignaud, linux-kernel, shemminger,
	linux-net, netdev, Andrew Morton, Alan Cox


* Chuck Ebbert <cebbert@redhat.com> wrote:

> Before, they would print:
> 
> eth0: transmit timed out, tx_status 00 status e601.
>   diagnostics: net 0ccc media 8880 dma 0000003a fifo 0000
> eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
>   Flags; bus-master 1, dirty 295757(13) current 295757(13)
>   Transmit list 00000000 vs. f7150a20.
>   0: @f7150200  length 80000070 status 0c010070
>   1: @f71502a0  length 80000070 status 0c010070
>   2: @f7150340  length 8000005c status 0c01005c
> 
> Now they just work, apparently...

could you please try the patch below? If this doesnt do the trick then i 
guess we need to revert that change.

	Ingo

------------>
(take 2)

Subject: genirq: fix simple and fasteoi irq handlers

After the "genirq: do not mask interrupts by default" patch interrupts
should be disabled not immediately upon request, but after they happen.
But, handle_simple_irq() and handle_fasteoi_irq() can skip this once or
more if an irq is just serviced (IRQ_INPROGRESS), possibly disrupting a
driver's work.

The main reason of problems here, pointing the broken patch and making
the first patch which can fix this was done by Marcin Slusarz.
Additional test patches of Thomas Gleixner and Ingo Molnar tested by
Marcin Slusarz helped to narrow possible reasons even more. Thanks.

PS: this patch fixes only one evident error here, but there could be
more places affected by above-mentioned change in irq handling.

PS 2:
After rethinking, IMHO, there are two most probable scenarios here:

1. After hw resend there could be a conflict between retriggered
edge type irq and the next level type one: e.g. if this level type
irq (io_apic is enabled then) is triggered while retriggered irq is
serviced (IRQ_INPROGRESS) there is goto out with eoi, and probably
the next such levels are triggered and looping, so probably kind of
flood in io_apic until this retriggered edge service has ended. 
2. There is something wrong with ioapic_retrigger_irq (less probable
because this should be probably seen with 'normal' edge retriggers,
but on the other hand, they could be less common).

So, if there is #1, this fixed patch should work.

But, since level types don't need this retriggers too much I think
this "don't mask interrupts by default" idea should be rethinked:
is there enough gain to risk such hard to diagnose errors?
  
So, IMHO, there should be at least possibility to turn this off for
level types in config (it should be a visible option, so people could
find & try this before writing for help or changing a network card).


Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>

---

diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
--- 2.6.23-rc1-/kernel/irq/chip.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.23-rc1/kernel/irq/chip.c	2007-08-05 21:49:46.000000000 +0200
@@ -295,12 +295,11 @@ handle_simple_irq(unsigned int irq, stru
 
 	spin_lock(&desc->lock);
 
-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out_unlock;
 	kstat_cpu(cpu).irqs[irq]++;
 
 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		if (desc->chip->mask)
 			desc->chip->mask(irq);
 		desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
@@ -318,6 +317,8 @@ handle_simple_irq(unsigned int irq, stru
 
 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
+	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+		desc->chip->unmask(irq);
 out_unlock:
 	spin_unlock(&desc->lock);
 }
@@ -392,18 +393,16 @@ handle_fasteoi_irq(unsigned int irq, str
 
 	spin_lock(&desc->lock);
 
-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out;
-
 	desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
 	kstat_cpu(cpu).irqs[irq]++;
 
 	/*
-	 * If its disabled or no action available
+	 * If it's running, disabled or no action available
 	 * then mask it and get out of here:
 	 */
 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		desc->status |= IRQ_PENDING;
 		if (desc->chip->mask)
 			desc->chip->mask(irq);
@@ -420,6 +419,8 @@ handle_fasteoi_irq(unsigned int irq, str
 
 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
+	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+		desc->chip->unmask(irq);
 out:
 	desc->chip->eoi(irq);
 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-06  7:03                                 ` Ingo Molnar
  2007-08-06 17:43                                   ` Chuck Ebbert
@ 2007-08-07  7:46                                   ` Marcin Ślusarz
  2007-08-07  8:23                                     ` Jarek Poplawski
  1 sibling, 1 reply; 69+ messages in thread
From: Marcin Ślusarz @ 2007-08-07  7:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/8/6, Ingo Molnar <mingo@elte.hu>:
> (..)
> please try Jarek's second patch too - there was a missing unmask.
>
>         Ingo
>
> -------------->
> Subject: genirq: fix simple and fasteoi irq handlers
> From: Jarek Poplawski <jarkao2@o2.pl>
>
> After the "genirq: do not mask interrupts by default" patch interrupts
> should be disabled not immediately upon request, but after they happen.
> But, handle_simple_irq() and handle_fasteoi_irq() can skip this once or
> more if an irq is just serviced (IRQ_INPROGRESS), possibly disrupting a
> driver's work.
>
> The main reason of problems here, pointing the broken patch and making
> the first patch which can fix this was done by Marcin Slusarz.
> Additional test patches of Thomas Gleixner and Ingo Molnar tested by
> Marcin Slusarz helped to narrow possible reasons even more. Thanks.
>
> PS: this patch fixes only one evident error here, but there could be
> more places affected by above-mentioned change in irq handling.
>
> PS 2:
> After rethinking, IMHO, there are two most probable scenarios here:
>
> 1. After hw resend there could be a conflict between retriggered
> edge type irq and the next level type one: e.g. if this level type
> irq (io_apic is enabled then) is triggered while retriggered irq is
> serviced (IRQ_INPROGRESS) there is goto out with eoi, and probably
> the next such levels are triggered and looping, so probably kind of
> flood in io_apic until this retriggered edge service has ended.
> 2. There is something wrong with ioapic_retrigger_irq (less probable
> because this should be probably seen with 'normal' edge retriggers,
> but on the other hand, they could be less common).
>
> So, if there is #1, this fixed patch should work.
>
> But, since level types don't need this retriggers too much I think
> this "don't mask interrupts by default" idea should be rethinked:
> is there enough gain to risk such hard to diagnose errors?
>
> So, IMHO, there should be at least possibility to turn this off for
> level types in config (it should be a visible option, so people could
> find & try this before writing for help or changing a network card).
>
>
> Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
>
> ---
>
> diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
> --- 2.6.23-rc1-/kernel/irq/chip.c       2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.23-rc1/kernel/irq/chip.c        2007-08-05 21:49:46.000000000 +0200
> @@ -295,12 +295,11 @@ handle_simple_irq(unsigned int irq, stru
>
>         spin_lock(&desc->lock);
>
> -       if (unlikely(desc->status & IRQ_INPROGRESS))
> -               goto out_unlock;
>         kstat_cpu(cpu).irqs[irq]++;
>
>         action = desc->action;
> -       if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
> +       if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
> +                                                IRQ_DISABLED)))) {
>                 if (desc->chip->mask)
>                         desc->chip->mask(irq);
>                 desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
> @@ -318,6 +317,8 @@ handle_simple_irq(unsigned int irq, stru
>
>         spin_lock(&desc->lock);
>         desc->status &= ~IRQ_INPROGRESS;
> +       if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
> +               desc->chip->unmask(irq);
>  out_unlock:
>         spin_unlock(&desc->lock);
>  }
> @@ -392,18 +393,16 @@ handle_fasteoi_irq(unsigned int irq, str
>
>         spin_lock(&desc->lock);
>
> -       if (unlikely(desc->status & IRQ_INPROGRESS))
> -               goto out;
> -
>         desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
>         kstat_cpu(cpu).irqs[irq]++;
>
>         /*
> -        * If its disabled or no action available
> +        * If it's running, disabled or no action available
>          * then mask it and get out of here:
>          */
>         action = desc->action;
> -       if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
> +       if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
> +                                                IRQ_DISABLED)))) {
>                 desc->status |= IRQ_PENDING;
>                 if (desc->chip->mask)
>                         desc->chip->mask(irq);
> @@ -420,6 +419,8 @@ handle_fasteoi_irq(unsigned int irq, str
>
>         spin_lock(&desc->lock);
>         desc->status &= ~IRQ_INPROGRESS;
> +       if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
> +               desc->chip->unmask(irq);
>  out:
>         desc->chip->eoi(irq);
>
>
Network card still locks up (tested on 2.6.22.1). I had to upload more
data than usual (~350 MB vs ~1-100 MB) to trigger that bug but it
might be a coincidence...

Marcin

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-07  7:46                                   ` Marcin Ślusarz
@ 2007-08-07  8:23                                     ` Jarek Poplawski
       [not found]                                       ` <4bacf17f0708070237w19d184b3p7f74b53612edb9a6@mail.gmail.com>
  0 siblings, 1 reply; 69+ messages in thread
From: Jarek Poplawski @ 2007-08-07  8:23 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Tue, Aug 07, 2007 at 09:46:36AM +0200, Marcin Ślusarz wrote:
> 2007/8/6, Ingo Molnar <mingo@elte.hu>:
> > (..)
> > please try Jarek's second patch too - there was a missing unmask.
> >
> >         Ingo
> >
> > -------------->
> > Subject: genirq: fix simple and fasteoi irq handlers
> > From: Jarek Poplawski <jarkao2@o2.pl>
...
> Network card still locks up (tested on 2.6.22.1). I had to upload more
> data than usual (~350 MB vs ~1-100 MB) to trigger that bug but it
> might be a coincidence...

Thanks! It's a good news after all - it would be really strange why
this place doesn't hit more people (it seems there is some safety
elsewhere for this).

BTW: I hope, this previous Thomas' patch with Ingo's warning to resend.c
(with a warning), had no problems with a similar load?

So, once more, I would suspect hw retrigger code. Ingo, IMHO, this
patch for testing HARDIRQS_SW_RESEND could be reworked, so that
desc->chip->retrigger() is done only for eadges and the tasklet only
for levels. BTW, I think this current warning in the "temporary" is
is too early - we don't know if after this the ->retrigger() will
take place.

Regards,
Jarek P.

PS: Marcin, if you need a break in this testing let us know!
I think the main idea of this bug is known enough.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
       [not found]                                       ` <4bacf17f0708070237w19d184b3p7f74b53612edb9a6@mail.gmail.com>
@ 2007-08-07  9:52                                         ` Jarek Poplawski
  2007-08-07 12:13                                           ` Jarek Poplawski
  0 siblings, 1 reply; 69+ messages in thread
From: Jarek Poplawski @ 2007-08-07  9:52 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Tue, Aug 07, 2007 at 11:37:01AM +0200, Marcin Ślusarz wrote:
> 2007/8/7, Jarek Poplawski <jarkao2@o2.pl>:
> > On Tue, Aug 07, 2007 at 09:46:36AM +0200, Marcin Ślusarz wrote:
> > > Network card still locks up (tested on 2.6.22.1). I had to upload more
> > > data than usual (~350 MB vs ~1-100 MB) to trigger that bug but it
> > > might be a coincidence...
> >
> > Thanks! It's a good news after all - it would be really strange why
> > this place doesn't hit more people (it seems there is some safety
> > elsewhere for this).
> >
> > BTW: I hope, this previous Thomas' patch with Ingo's warning to resend.c
> > (with a warning), had no problems with a similar load?
> I always tested on 500-600 MB "dataset"
> 
> > PS: Marcin, if you need a break in this testing let us know!
> No, i don't need a break. I'll have more time in next weeks.

Great! So, I'll try to send a patch with _SW_RESEND in a few hours,
if Ingo doesn't prepare something for you.

Thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-06 17:43                                   ` Chuck Ebbert
  2007-08-06 19:08                                     ` Ingo Molnar
@ 2007-08-07 10:09                                     ` Jarek Poplawski
  1 sibling, 0 replies; 69+ messages in thread
From: Jarek Poplawski @ 2007-08-07 10:09 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Ingo Molnar, Marcin Ślusarz, Thomas Gleixner,
	Linus Torvalds, Jean-Baptiste Vignaud, linux-kernel, shemminger,
	linux-net, netdev, Andrew Morton, Alan Cox

On Mon, Aug 06, 2007 at 01:43:48PM -0400, Chuck Ebbert wrote:
> On 08/06/2007 03:03 AM, Ingo Molnar wrote:
> > 
> > But, since level types don't need this retriggers too much I think
> > this "don't mask interrupts by default" idea should be rethinked:
> > is there enough gain to risk such hard to diagnose errors?
> >   
> > 
> 
> I reverted those masking changes in Fedora and the baffling problem
> with 3Com 3C905 network adapters went away.
> 
> Before, they would print:
> 
> eth0: transmit timed out, tx_status 00 status e601.
>   diagnostics: net 0ccc media 8880 dma 0000003a fifo 0000
> eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
>   Flags; bus-master 1, dirty 295757(13) current 295757(13)
>   Transmit list 00000000 vs. f7150a20.
>   0: @f7150200  length 80000070 status 0c010070
>   1: @f71502a0  length 80000070 status 0c010070
>   2: @f7150340  length 8000005c status 0c01005c
> 
> Now they just work, apparently...
> 
> So why not just revert the change?
> 

Ingo has written about such possibility. But, it would be good
to know which precisely place is to blame, as well. Since this
diagnosing takes time, I think Chuck is right, and maybe at least
this temporary patch for resend.c without this warning, should
be recomended for stables (2.6.21 and 2.6.22)?

Jarek P.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-07  9:52                                         ` Jarek Poplawski
@ 2007-08-07 12:13                                           ` Jarek Poplawski
  2007-08-07 12:55                                             ` Jarek Poplawski
  2007-08-08 11:09                                             ` Marcin Ślusarz
  0 siblings, 2 replies; 69+ messages in thread
From: Jarek Poplawski @ 2007-08-07 12:13 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Tue, Aug 07, 2007 at 11:52:46AM +0200, Jarek Poplawski wrote:
> On Tue, Aug 07, 2007 at 11:37:01AM +0200, Marcin Ślusarz wrote:
> > 2007/8/7, Jarek Poplawski <jarkao2@o2.pl>:
> > > On Tue, Aug 07, 2007 at 09:46:36AM +0200, Marcin Ślusarz wrote:
> > > > Network card still locks up (tested on 2.6.22.1). I had to upload more
> > > > data than usual (~350 MB vs ~1-100 MB) to trigger that bug but it
> > > > might be a coincidence...
> > >
> > > Thanks! It's a good news after all - it would be really strange why
> > > this place doesn't hit more people (it seems there is some safety
> > > elsewhere for this).
> > >
> > > BTW: I hope, this previous Thomas' patch with Ingo's warning to resend.c
> > > (with a warning), had no problems with a similar load?
> > I always tested on 500-600 MB "dataset"
> > 
> > > PS: Marcin, if you need a break in this testing let us know!
> > No, i don't need a break. I'll have more time in next weeks.
> 
> Great! So, I'll try to send a patch with _SW_RESEND in a few hours,
> if Ingo doesn't prepare something for you.

So, the let's try this idea yet: modified Ingo's "x86: activate
HARDIRQS_SW_RESEND" patch.
(Don't forget about make oldconfig before make.)
For testing only.

Cheers,
Jarek P.

PS: alas there was not even time for "compile checking"...

---

diff -Nurp 2.6.22.1-/arch/i386/Kconfig 2.6.22.1/arch/i386/Kconfig
--- 2.6.22.1-/arch/i386/Kconfig	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.22.1/arch/i386/Kconfig	2007-08-07 13:13:03.000000000 +0200
@@ -1252,6 +1252,10 @@ config GENERIC_PENDING_IRQ
 	depends on GENERIC_HARDIRQS && SMP
 	default y
 
+config HARDIRQS_SW_RESEND
+	bool
+	default y
+
 config X86_SMP
 	bool
 	depends on SMP && !X86_VOYAGER
diff -Nurp 2.6.22.1-/arch/x86_64/Kconfig 2.6.22.1/arch/x86_64/Kconfig
--- 2.6.22.1-/arch/x86_64/Kconfig	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.22.1/arch/x86_64/Kconfig	2007-08-07 13:13:03.000000000 +0200
@@ -690,6 +690,10 @@ config GENERIC_PENDING_IRQ
 	depends on GENERIC_HARDIRQS && SMP
 	default y
 
+config HARDIRQS_SW_RESEND
+	bool
+	default y
+
 menu "Power management options"
 
 source kernel/power/Kconfig
diff -Nurp 2.6.22.1-/kernel/irq/manage.c 2.6.22.1/kernel/irq/manage.c
--- 2.6.22.1-/kernel/irq/manage.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.22.1/kernel/irq/manage.c	2007-08-07 13:13:03.000000000 +0200
@@ -169,6 +169,14 @@ void enable_irq(unsigned int irq)
 		desc->depth--;
 	}
 	spin_unlock_irqrestore(&desc->lock, flags);
+#ifdef CONFIG_HARDIRQS_SW_RESEND
+	/*
+	 * Do a bh disable/enable pair to trigger any pending
+	 * irq resend logic:
+	 */
+	local_bh_disable();
+	local_bh_enable();
+#endif
 }
 EXPORT_SYMBOL(enable_irq);
 
diff -Nurp 2.6.22.1-/kernel/irq/resend.c 2.6.22.1/kernel/irq/resend.c
--- 2.6.22.1-/kernel/irq/resend.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.22.1/kernel/irq/resend.c	2007-08-07 13:57:54.000000000 +0200
@@ -62,16 +62,24 @@ void check_irq_resend(struct irq_desc *d
 	 */
 	desc->chip->enable(irq);
 
+	/*
+	 * Temporary hack to figure out more about the problem, which
+	 * is causing the ancient network cards to die.
+	 */
+
 	if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
 		desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
 
-		if (!desc->chip || !desc->chip->retrigger ||
-					!desc->chip->retrigger(irq)) {
+		if (desc->handle_irq == handle_edge_irq) {
+			if (desc->chip->retrigger)
+				desc->chip->retrigger(irq);
+			return;
+		}
 #ifdef CONFIG_HARDIRQS_SW_RESEND
-			/* Set it pending and activate the softirq: */
-			set_bit(irq, irqs_resend);
-			tasklet_schedule(&resend_tasklet);
+		WARN_ON_ONCE(1);
+		/* Set it pending and activate the softirq: */
+		set_bit(irq, irqs_resend);
+		tasklet_schedule(&resend_tasklet);
 #endif
-		}
 	}
 }

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-07 12:13                                           ` Jarek Poplawski
@ 2007-08-07 12:55                                             ` Jarek Poplawski
  2007-08-08 11:11                                               ` Marcin Ślusarz
  2007-08-08 11:09                                             ` Marcin Ślusarz
  1 sibling, 1 reply; 69+ messages in thread
From: Jarek Poplawski @ 2007-08-07 12:55 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Tue, Aug 07, 2007 at 02:13:39PM +0200, Jarek Poplawski wrote:
> On Tue, Aug 07, 2007 at 11:52:46AM +0200, Jarek Poplawski wrote:
> > On Tue, Aug 07, 2007 at 11:37:01AM +0200, Marcin Ślusarz wrote:
...
> > > No, i don't need a break. I'll have more time in next weeks.
> > 
> > Great! So, I'll try to send a patch with _SW_RESEND in a few hours,
> > if Ingo doesn't prepare something for you.
> 
> So, the let's try this idea yet: modified Ingo's "x86: activate
> HARDIRQS_SW_RESEND" patch.
> (Don't forget about make oldconfig before make.)
> For testing only.
> 
> Cheers,
> Jarek P.
> 
> PS: alas there was not even time for "compile checking"...

And here is one more patch to test the same idea (chip->retrigger()).
Let's try i386 way! (I hope I will not be arrested for this...)
(Should be tested without any previous patches.)

Jarek P.

PS: as above

---

diff -Nurp 2.6.22.1-/arch/x86_64/kernel/io_apic.c 2.6.22.1/arch/x86_64/kernel/io_apic.c
--- 2.6.22.1-/arch/x86_64/kernel/io_apic.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.22.1/arch/x86_64/kernel/io_apic.c	2007-08-07 14:37:45.000000000 +0200
@@ -1311,15 +1311,8 @@ static unsigned int startup_ioapic_irq(u
 static int ioapic_retrigger_irq(unsigned int irq)
 {
 	struct irq_cfg *cfg = &irq_cfg[irq];
-	cpumask_t mask;
-	unsigned long flags;
-
-	spin_lock_irqsave(&vector_lock, flags);
-	cpus_clear(mask);
-	cpu_set(first_cpu(cfg->domain), mask);
 
-	send_IPI_mask(mask, cfg->vector);
-	spin_unlock_irqrestore(&vector_lock, flags);
+	send_IPI_self(cfg->vector);
 
 	return 1;
 }

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [patch] genirq: temporary fix for level-triggered IRQ resend
  2007-07-31 16:00                               ` Ingo Molnar
@ 2007-08-08 11:00                                 ` Jarek Poplawski
  0 siblings, 0 replies; 69+ messages in thread
From: Jarek Poplawski @ 2007-08-08 11:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Thomas Gleixner, Jean-Baptiste Vignaud,
	linux-kernel, shemminger, linux-net, netdev, Andrew Morton,
	Alan Cox, marcin.slusarz

Hi,

I see there is a bit of complaining on this original resend temporary
patch. But, since it seems to do a good job for some people, here is
my proposal to limit the 'range of fire' a little bit.

Marcin and Jean-Baptiste: try to test this with 2.6.23-rc2, please.
(Unless Ingo or Thomas have other plans with this problem?)

Thanks,
Jarek P.

-------->

Subject: [patch] genirq: temporary fix for level-triggered IRQ resend


Marcin Slusarz reported a ne2k-pci "hung network interface" regression.

delayed disable relies on the ability to re-trigger the interrupt in the
case that a real interrupt happens after the software disable was set.
In this case we actually disable the interrupt on the hardware level
_after_ it occurred.

On enable_irq, we need to re-trigger the interrupt. On i386 this relies
on a hardware resend mechanism (send_IPI_self()).

Actually we only need the resend for edge type interrupts. Level type
interrupts come back once enable_irq() re-enables the interrupt line.

Marcin found that when he disables the irq line on the hardware level
(removing the delayed disable) the card is kept alive.

Thomas Gleixner prepared a testing patch which proved to fix hung ups
after limiting irqs resending to edge type only. Then Ingo Molnar's
version of this patch was applied to 2.6.23-rc2 kernel as a temporary
solution. Since, the problem is still diagnosed, but seems to affect
mainly X86_64 arch, here is a modified, but still temporary, version of
this solution, which should limit the range of warnings and changes in
interrupt handling to minimum.


Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Cc: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Jean-Baptiste Vignaud <vignaud@xandmail.fr>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>

---

diff -Nurp 2.6.23-rc2-/kernel/irq/resend.c 2.6.23-rc2/kernel/irq/resend.c
--- 2.6.23-rc2-/kernel/irq/resend.c	2007-08-08 11:48:15.000000000 +0200
+++ 2.6.23-rc2/kernel/irq/resend.c	2007-08-08 11:58:42.000000000 +0200
@@ -62,18 +62,19 @@ void check_irq_resend(struct irq_desc *d
 	 */
 	desc->chip->enable(irq);
 
+	if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
+		desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
+
+#ifdef CONFIG_X86_64
 	/*
 	 * Temporary hack to figure out more about the problem, which
 	 * is causing the ancient network cards to die.
 	 */
-	if (desc->handle_irq != handle_edge_irq) {
-		WARN_ON_ONCE(1);
-		return;
-	}
-
-	if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
-		desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
-
+		if (desc->handle_irq == handle_fasteoi_irq) {
+			WARN_ON_ONCE(1);
+			return;
+		}
+#endif
 		if (!desc->chip || !desc->chip->retrigger ||
 					!desc->chip->retrigger(irq)) {
 #ifdef CONFIG_HARDIRQS_SW_RESEND

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-07 12:13                                           ` Jarek Poplawski
  2007-08-07 12:55                                             ` Jarek Poplawski
@ 2007-08-08 11:09                                             ` Marcin Ślusarz
  2007-08-08 11:42                                               ` Jarek Poplawski
  1 sibling, 1 reply; 69+ messages in thread
From: Marcin Ślusarz @ 2007-08-08 11:09 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/8/7, Jarek Poplawski <jarkao2@o2.pl>:
> So, the let's try this idea yet: modified Ingo's "x86: activate
> HARDIRQS_SW_RESEND" patch.
> (Don't forget about make oldconfig before make.)
> For testing only.
>
> Cheers,
> Jarek P.
>
> PS: alas there was not even time for "compile checking"...
>
> ---
>
> diff -Nurp 2.6.22.1-/arch/i386/Kconfig 2.6.22.1/arch/i386/Kconfig
> --- 2.6.22.1-/arch/i386/Kconfig 2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.22.1/arch/i386/Kconfig  2007-08-07 13:13:03.000000000 +0200
> @@ -1252,6 +1252,10 @@ config GENERIC_PENDING_IRQ
>         depends on GENERIC_HARDIRQS && SMP
>         default y
>
> +config HARDIRQS_SW_RESEND
> +       bool
> +       default y
> +
>  config X86_SMP
>         bool
>         depends on SMP && !X86_VOYAGER
> diff -Nurp 2.6.22.1-/arch/x86_64/Kconfig 2.6.22.1/arch/x86_64/Kconfig
> --- 2.6.22.1-/arch/x86_64/Kconfig       2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.22.1/arch/x86_64/Kconfig        2007-08-07 13:13:03.000000000 +0200
> @@ -690,6 +690,10 @@ config GENERIC_PENDING_IRQ
>         depends on GENERIC_HARDIRQS && SMP
>         default y
>
> +config HARDIRQS_SW_RESEND
> +       bool
> +       default y
> +
>  menu "Power management options"
>
>  source kernel/power/Kconfig
> diff -Nurp 2.6.22.1-/kernel/irq/manage.c 2.6.22.1/kernel/irq/manage.c
> --- 2.6.22.1-/kernel/irq/manage.c       2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.22.1/kernel/irq/manage.c        2007-08-07 13:13:03.000000000 +0200
> @@ -169,6 +169,14 @@ void enable_irq(unsigned int irq)
>                 desc->depth--;
>         }
>         spin_unlock_irqrestore(&desc->lock, flags);
> +#ifdef CONFIG_HARDIRQS_SW_RESEND
> +       /*
> +        * Do a bh disable/enable pair to trigger any pending
> +        * irq resend logic:
> +        */
> +       local_bh_disable();
> +       local_bh_enable();
> +#endif
>  }
>  EXPORT_SYMBOL(enable_irq);
>
> diff -Nurp 2.6.22.1-/kernel/irq/resend.c 2.6.22.1/kernel/irq/resend.c
> --- 2.6.22.1-/kernel/irq/resend.c       2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.22.1/kernel/irq/resend.c        2007-08-07 13:57:54.000000000 +0200
> @@ -62,16 +62,24 @@ void check_irq_resend(struct irq_desc *d
>          */
>         desc->chip->enable(irq);
>
> +       /*
> +        * Temporary hack to figure out more about the problem, which
> +        * is causing the ancient network cards to die.
> +        */
> +
>         if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
>                 desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
>
> -               if (!desc->chip || !desc->chip->retrigger ||
> -                                       !desc->chip->retrigger(irq)) {
> +               if (desc->handle_irq == handle_edge_irq) {
> +                       if (desc->chip->retrigger)
> +                               desc->chip->retrigger(irq);
> +                       return;
> +               }
>  #ifdef CONFIG_HARDIRQS_SW_RESEND
> -                       /* Set it pending and activate the softirq: */
> -                       set_bit(irq, irqs_resend);
> -                       tasklet_schedule(&resend_tasklet);
> +               WARN_ON_ONCE(1);
> +               /* Set it pending and activate the softirq: */
> +               set_bit(irq, irqs_resend);
> +               tasklet_schedule(&resend_tasklet);
>  #endif
> -               }
>         }
>  }
>
Works fine with:
WARNING: at kernel/irq/resend.c:79 check_irq_resend()

Call Trace:
 [<ffffffff8025e660>] check_irq_resend+0xc0/0xd0
 [<ffffffff8025e1cd>] enable_irq+0xed/0xf0
 [<ffffffff8807f21d>] :8390:ei_start_xmit+0x14d/0x30c
 [<ffffffff8024d055>] lock_release_non_nested+0xe5/0x190
 [<ffffffff80539b78>] __qdisc_run+0x98/0x1f0
 [<ffffffff80539b8e>] __qdisc_run+0xae/0x1f0
 [<ffffffff8052b65e>] dev_hard_start_xmit+0x26e/0x2d0
 [<ffffffff80539ba0>] __qdisc_run+0xc0/0x1f0
 [<ffffffff8052dc2f>] dev_queue_xmit+0x24f/0x310
 [<ffffffff805337a7>] neigh_resolve_output+0xe7/0x290
 [<ffffffff8054f5c0>] dst_output+0x0/0x10
 [<ffffffff80552aff>] ip_output+0x19f/0x340
 [<ffffffff80551f77>] ip_queue_xmit+0x217/0x430
 [<ffffffff80563b2a>] tcp_transmit_skb+0x40a/0x7c0
 [<ffffffff805657bb>] __tcp_push_pending_frames+0x11b/0x940
 [<ffffffff8055972a>] tcp_sendmsg+0x87a/0xc80
 [<ffffffff80577735>] inet_sendmsg+0x45/0x80
 [<ffffffff8051e2d4>] sock_aio_write+0x104/0x120
 [<ffffffff80285fc1>] do_sync_write+0xf1/0x130
 [<ffffffff80243290>] autoremove_wake_function+0x0/0x40
 [<ffffffff802868e9>] vfs_write+0x159/0x170
 [<ffffffff80286ef0>] sys_write+0x50/0x90
 [<ffffffff802097fe>] system_call+0x7e/0x83

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-07 12:55                                             ` Jarek Poplawski
@ 2007-08-08 11:11                                               ` Marcin Ślusarz
  0 siblings, 0 replies; 69+ messages in thread
From: Marcin Ślusarz @ 2007-08-08 11:11 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/8/7, Jarek Poplawski <jarkao2@o2.pl>:
> And here is one more patch to test the same idea (chip->retrigger()).
> Let's try i386 way! (I hope I will not be arrested for this...)
> (Should be tested without any previous patches.)
>
> Jarek P.
>
> PS: as above
>
> ---
>
> diff -Nurp 2.6.22.1-/arch/x86_64/kernel/io_apic.c 2.6.22.1/arch/x86_64/kernel/io_apic.c
> --- 2.6.22.1-/arch/x86_64/kernel/io_apic.c      2007-07-09 01:32:17.000000000 +0200
> +++ 2.6.22.1/arch/x86_64/kernel/io_apic.c       2007-08-07 14:37:45.000000000 +0200
> @@ -1311,15 +1311,8 @@ static unsigned int startup_ioapic_irq(u
>  static int ioapic_retrigger_irq(unsigned int irq)
>  {
>         struct irq_cfg *cfg = &irq_cfg[irq];
> -       cpumask_t mask;
> -       unsigned long flags;
> -
> -       spin_lock_irqsave(&vector_lock, flags);
> -       cpus_clear(mask);
> -       cpu_set(first_cpu(cfg->domain), mask);
>
> -       send_IPI_mask(mask, cfg->vector);
> -       spin_unlock_irqrestore(&vector_lock, flags);
> +       send_IPI_self(cfg->vector);
>
>         return 1;
>  }
>
Network card timed out with this patch.

Marcin

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-08 11:09                                             ` Marcin Ślusarz
@ 2007-08-08 11:42                                               ` Jarek Poplawski
  2007-08-08 11:53                                                 ` Jarek Poplawski
  2007-08-09  9:19                                                 ` [patch (testing)] " Jarek Poplawski
  0 siblings, 2 replies; 69+ messages in thread
From: Jarek Poplawski @ 2007-08-08 11:42 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

Read below please:

On Wed, Aug 08, 2007 at 01:09:36PM +0200, Marcin Ślusarz wrote:
> 2007/8/7, Jarek Poplawski <jarkao2@o2.pl>:
> > So, the let's try this idea yet: modified Ingo's "x86: activate
> > HARDIRQS_SW_RESEND" patch.
> > (Don't forget about make oldconfig before make.)
> > For testing only.
> >
> > Cheers,
> > Jarek P.
> >
> > PS: alas there was not even time for "compile checking"...
> >
> > ---
> >
> > diff -Nurp 2.6.22.1-/arch/i386/Kconfig 2.6.22.1/arch/i386/Kconfig
> > --- 2.6.22.1-/arch/i386/Kconfig 2007-07-09 01:32:17.000000000 +0200
> > +++ 2.6.22.1/arch/i386/Kconfig  2007-08-07 13:13:03.000000000 +0200
> > @@ -1252,6 +1252,10 @@ config GENERIC_PENDING_IRQ
> >         depends on GENERIC_HARDIRQS && SMP
> >         default y
> >
> > +config HARDIRQS_SW_RESEND
> > +       bool
> > +       default y
> > +
> >  config X86_SMP
> >         bool
> >         depends on SMP && !X86_VOYAGER
> > diff -Nurp 2.6.22.1-/arch/x86_64/Kconfig 2.6.22.1/arch/x86_64/Kconfig
> > --- 2.6.22.1-/arch/x86_64/Kconfig       2007-07-09 01:32:17.000000000 +0200
> > +++ 2.6.22.1/arch/x86_64/Kconfig        2007-08-07 13:13:03.000000000 +0200
> > @@ -690,6 +690,10 @@ config GENERIC_PENDING_IRQ
> >         depends on GENERIC_HARDIRQS && SMP
> >         default y
> >
> > +config HARDIRQS_SW_RESEND
> > +       bool
> > +       default y
> > +
> >  menu "Power management options"
> >
> >  source kernel/power/Kconfig
> > diff -Nurp 2.6.22.1-/kernel/irq/manage.c 2.6.22.1/kernel/irq/manage.c
> > --- 2.6.22.1-/kernel/irq/manage.c       2007-07-09 01:32:17.000000000 +0200
> > +++ 2.6.22.1/kernel/irq/manage.c        2007-08-07 13:13:03.000000000 +0200
> > @@ -169,6 +169,14 @@ void enable_irq(unsigned int irq)
> >                 desc->depth--;
> >         }
> >         spin_unlock_irqrestore(&desc->lock, flags);
> > +#ifdef CONFIG_HARDIRQS_SW_RESEND
> > +       /*
> > +        * Do a bh disable/enable pair to trigger any pending
> > +        * irq resend logic:
> > +        */
> > +       local_bh_disable();
> > +       local_bh_enable();
> > +#endif
> >  }
> >  EXPORT_SYMBOL(enable_irq);
> >
> > diff -Nurp 2.6.22.1-/kernel/irq/resend.c 2.6.22.1/kernel/irq/resend.c
> > --- 2.6.22.1-/kernel/irq/resend.c       2007-07-09 01:32:17.000000000 +0200
> > +++ 2.6.22.1/kernel/irq/resend.c        2007-08-07 13:57:54.000000000 +0200
> > @@ -62,16 +62,24 @@ void check_irq_resend(struct irq_desc *d
> >          */
> >         desc->chip->enable(irq);
> >
> > +       /*
> > +        * Temporary hack to figure out more about the problem, which
> > +        * is causing the ancient network cards to die.
> > +        */
> > +
> >         if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
> >                 desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
> >
> > -               if (!desc->chip || !desc->chip->retrigger ||
> > -                                       !desc->chip->retrigger(irq)) {
> > +               if (desc->handle_irq == handle_edge_irq) {
> > +                       if (desc->chip->retrigger)
> > +                               desc->chip->retrigger(irq);
> > +                       return;
> > +               }
> >  #ifdef CONFIG_HARDIRQS_SW_RESEND
> > -                       /* Set it pending and activate the softirq: */
> > -                       set_bit(irq, irqs_resend);
> > -                       tasklet_schedule(&resend_tasklet);
> > +               WARN_ON_ONCE(1);
> > +               /* Set it pending and activate the softirq: */
> > +               set_bit(irq, irqs_resend);
> > +               tasklet_schedule(&resend_tasklet);
> >  #endif
> > -               }
> >         }
> >  }
> >
> Works fine with:

Very nice! It would be about time this kernel should start behave...

> WARNING: at kernel/irq/resend.c:79 check_irq_resend()
> 
> Call Trace:
>  [<ffffffff8025e660>] check_irq_resend+0xc0/0xd0
>  [<ffffffff8025e1cd>] enable_irq+0xed/0xf0
>  [<ffffffff8807f21d>] :8390:ei_start_xmit+0x14d/0x30c
>  [<ffffffff8024d055>] lock_release_non_nested+0xe5/0x190
>  [<ffffffff80539b78>] __qdisc_run+0x98/0x1f0
>  [<ffffffff80539b8e>] __qdisc_run+0xae/0x1f0
>  [<ffffffff8052b65e>] dev_hard_start_xmit+0x26e/0x2d0
>  [<ffffffff80539ba0>] __qdisc_run+0xc0/0x1f0
>  [<ffffffff8052dc2f>] dev_queue_xmit+0x24f/0x310
>  [<ffffffff805337a7>] neigh_resolve_output+0xe7/0x290
>  [<ffffffff8054f5c0>] dst_output+0x0/0x10
>  [<ffffffff80552aff>] ip_output+0x19f/0x340
>  [<ffffffff80551f77>] ip_queue_xmit+0x217/0x430
>  [<ffffffff80563b2a>] tcp_transmit_skb+0x40a/0x7c0
>  [<ffffffff805657bb>] __tcp_push_pending_frames+0x11b/0x940
>  [<ffffffff8055972a>] tcp_sendmsg+0x87a/0xc80
>  [<ffffffff80577735>] inet_sendmsg+0x45/0x80
>  [<ffffffff8051e2d4>] sock_aio_write+0x104/0x120
>  [<ffffffff80285fc1>] do_sync_write+0xf1/0x130
>  [<ffffffff80243290>] autoremove_wake_function+0x0/0x40
>  [<ffffffff802868e9>] vfs_write+0x159/0x170
>  [<ffffffff80286ef0>] sys_write+0x50/0x90
>  [<ffffffff802097fe>] system_call+0x7e/0x83
> 

So, it looks like x86_64 io_apic's IPI code was unused too long...
I hope it's a piece of cake for Ingo now...

Thanks very much Marcin!

If it's possible for you and Jean-Baptiste, try this today patch
with -rc2, and maybe once more this one patch (-rc1 or older).

Regards,
Jarek P. 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-08 11:42                                               ` Jarek Poplawski
@ 2007-08-08 11:53                                                 ` Jarek Poplawski
  2007-08-09  9:19                                                 ` [patch (testing)] " Jarek Poplawski
  1 sibling, 0 replies; 69+ messages in thread
From: Jarek Poplawski @ 2007-08-08 11:53 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Wed, Aug 08, 2007 at 01:42:43PM +0200, Jarek Poplawski wrote:
...
> So, it looks like x86_64 io_apic's IPI code was unused too long...

To be fair it's x86_64 lapic's IPI code.

Jarek P.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-08 11:42                                               ` Jarek Poplawski
  2007-08-08 11:53                                                 ` Jarek Poplawski
@ 2007-08-09  9:19                                                 ` Jarek Poplawski
       [not found]                                                   ` <4bacf17f0708092333n17e0ba19jf2c769531610868d@mail.gmail.com>
  1 sibling, 1 reply; 69+ messages in thread
From: Jarek Poplawski @ 2007-08-09  9:19 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Wed, Aug 08, 2007 at 01:42:43PM +0200, Jarek Poplawski wrote:
> Read below please:
> 
> On Wed, Aug 08, 2007 at 01:09:36PM +0200, Marcin Ślusarz wrote:
> > 2007/8/7, Jarek Poplawski <jarkao2@o2.pl>:
> > > So, the let's try this idea yet: modified Ingo's "x86: activate
> > > HARDIRQS_SW_RESEND" patch.
> > > (Don't forget about make oldconfig before make.)
> > > For testing only.
...
> > > diff -Nurp 2.6.22.1-/arch/i386/Kconfig 2.6.22.1/arch/i386/Kconfig
> > > --- 2.6.22.1-/arch/i386/Kconfig 2007-07-09 01:32:17.000000000 +0200
> > > +++ 2.6.22.1/arch/i386/Kconfig  2007-08-07 13:13:03.000000000 +0200
> > > @@ -1252,6 +1252,10 @@ config GENERIC_PENDING_IRQ
> > >         depends on GENERIC_HARDIRQS && SMP
> > >         default y
> > >
> > > +config HARDIRQS_SW_RESEND
...
> > Works fine with:
> 
> Very nice! It would be about time this kernel should start behave...
> 
> > WARNING: at kernel/irq/resend.c:79 check_irq_resend()
> > 
> > Call Trace:
...
> So, it looks like x86_64 io_apic's IPI code was unused too long...
> I hope it's a piece of cake for Ingo now...

So, we know now it's almost definitely something about lapic and IPIs
but, maybe it's not this code to blame...

Here is one more patch to check the possibility it's about the way
the resend edge type irqs are handled by level type handlers:
so, let's check if acking isn't too late...

Marcin and Jean-Baptiste: I would be very glad, as usual! And no need
to hurry; I think we know enough to fix this for you, but maybe this
test could explain if there are errors in lapics or only bad handling.

Many thanks,
Jarek P. 

PS: this patch is very experimental, and only intended for testing.
It should be applied to clean 2.6.23-rc1 or a bit older (eg. 2.6.22)
(so 2.6.23-rc2 or any patches from this thread shouldn't be around) 

---

diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
--- 2.6.23-rc1-/kernel/irq/chip.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.23-rc1/kernel/irq/chip.c	2007-08-08 20:49:07.000000000 +0200
@@ -389,12 +389,19 @@ handle_fasteoi_irq(unsigned int irq, str
 	unsigned int cpu = smp_processor_id();
 	struct irqaction *action;
 	irqreturn_t action_ret;
+	int edge = 0;
 
 	spin_lock(&desc->lock);
 
 	if (unlikely(desc->status & IRQ_INPROGRESS))
 		goto out;
 
+	if ((desc->status & (IRQ_PENDING | IRQ_REPLAY)) ==
+							IRQ_REPLAY) {
+		desc->chip->ack(irq);
+		edge = 1;
+	}
+
 	desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
 	kstat_cpu(cpu).irqs[irq]++;
 
@@ -421,7 +428,8 @@ handle_fasteoi_irq(unsigned int irq, str
 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
 out:
-	desc->chip->eoi(irq);
+	if (!edge)
+		desc->chip->eoi(irq);
 
 	spin_unlock(&desc->lock);
 }
diff -Nurp 2.6.23-rc1-/kernel/irq/resend.c 2.6.23-rc1/kernel/irq/resend.c
--- 2.6.23-rc1-/kernel/irq/resend.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.23-rc1/kernel/irq/resend.c	2007-08-08 20:44:14.000000000 +0200
@@ -57,14 +57,10 @@ void check_irq_resend(struct irq_desc *d
 {
 	unsigned int status = desc->status;
 
-	/*
-	 * Make sure the interrupt is enabled, before resending it:
-	 */
-	desc->chip->enable(irq);
-
 	if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
 		desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
 
+		WARN_ON_ONCE(1);
 		if (!desc->chip || !desc->chip->retrigger ||
 					!desc->chip->retrigger(irq)) {
 #ifdef CONFIG_HARDIRQS_SW_RESEND
@@ -74,4 +70,5 @@ void check_irq_resend(struct irq_desc *d
 #endif
 		}
 	}
+	desc->chip->enable(irq);
 }

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [RFC] Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-06 19:08                                     ` Ingo Molnar
@ 2007-08-09 14:50                                       ` Jarek Poplawski
       [not found]                                         ` <p738x8kg0dp.fsf@bingen.suse.de>
  0 siblings, 1 reply; 69+ messages in thread
From: Jarek Poplawski @ 2007-08-09 14:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chuck Ebbert, Marcin Ślusarz, Thomas Gleixner,
	Linus Torvalds, Jean-Baptiste Vignaud, linux-kernel, shemminger,
	linux-net, netdev, Andrew Morton, Alan Cox


It seems, we can start to think about some preferred solutions,
already. Here are some of my preliminary conclusions and suggestions.

The problem of timeouts with some 'older' network cards seems to hit
mainly x86_64 arch, and after diagnosing and testing (still beeing
done) it's caused by resending level type irqs.

Possible solutions:

1. Reverting the whole "do not mask interrupts by default" patch:

Seems to be the most natural, but there are some minuses: this patch
has been here for a long time and some archs/drivers could have been
changed in the meantime and depend on this; this could also remove
possible good sides of this patch for which it was aimed (mainly for
edge type irqs), so some old problems could be back.


2. Excluding all level type irqs from resending. There are 2 ways:

a) Just like this is done now in -rc2 with Thomas' patch and AFAIK
seems to work very well; the way to find the type of interrupt by
the name of a handler looks a bit 'temporary' but it's simple and
working; it could be a bit moved and under #ifdef CONFIG_, so
this could affect only choosen ones.

b) Alternatively this could be done by e.g. assigning separate
irq_chip structures for edge and level handlers, so for levels there
would be chip->retrigger == NULL; so this could be done by arches
without any need for 'global' changes. The only minus here is a few
bytes of memory more; on the other hand it looks more readable and
elastic.

3. Using HARDIRQS_SW_RESEND for level type irqs:
seems to work, but needs more testing; but if #2 is theoretically
and practically OK, why bother. This also doesn't need any global
changes.

I prefer 2b + 3: since this is very delicate measure, I think there
should be visible CONFIG_ option: at least for disabling
chip->retrigger handler if used by arch and maybe for using
_SW_RESEND instead, as well.

It looks like these changes are needed for this x86_64 only, but
in my opinion some (good?!) apics could sometimes work OK with this
level resending only "by chance": theoretically it's questionable:
since edge type irqs used for this should be resent if not acked, and
level handlers ack time depends on how long drivers do their job,
there are possible strange and hard to repeat effects, for no reason
(if these levels can really always be skipped without any problem,
like seen in testing).

Then of course it would be easier to do this 'globally' with 2a.:
skipping by default (but #ifdef) chip->retrigger namely for:
handler_fasteoi_irq and handler_level_irq and handler_simple_irq;
handle_edge_irq plus others are always tried. 

I hope, Ingo or Thomas will use something of these, or let us know
about maybe something else which should by tested for final inclusion.

Regards,
Jarek P.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC] Re: 2.6.20->2.6.21 - networking dies after random time
       [not found]                                         ` <p738x8kg0dp.fsf@bingen.suse.de>
@ 2007-08-09 15:30                                           ` Jarek Poplawski
  0 siblings, 0 replies; 69+ messages in thread
From: Jarek Poplawski @ 2007-08-09 15:30 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Chuck Ebbert, =?iso-8859-2?q? Marcin Ślusarz?=,
	Thomas Gleixner, Linus Torvalds, Jean-Baptiste Vignaud,
	linux-kernel, shemminger, linux-net, netdev, Andrew Morton,
	Alan Cox

On Thu, Aug 09, 2007 at 06:04:34PM +0200, Andi Kleen wrote:
> Jarek Poplawski <jarkao2@o2.pl> writes:
> 
> > It seems, we can start to think about some preferred solutions,
> > already. Here are some of my preliminary conclusions and suggestions.
> > 
> > The problem of timeouts with some 'older' network cards seems to hit
> > mainly x86_64 arch, and after diagnosing and testing (still beeing
> > done) it's caused by resending level type irqs.
> 
> i386 interrupt code should be similar, except for the lack of 
> per CPU irqs, but that shouldnt' affect resending.
> 
> > 
> > Possible solutions:
> 
> We should probably at least add some statistic counters to the
> standard kernel to try to detect these cases.
> 
> > It looks like these changes are needed for this x86_64 only,
> 
> Why?

Maybe I missed something, but considering the popularity of i386
there was not so much of consistent reporting?! I was very surprised,
when I read a few days ago that Linus seems to think that this one
here is only an individual problem...

Jarek P.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time
       [not found]                                                   ` <4bacf17f0708092333n17e0ba19jf2c769531610868d@mail.gmail.com>
@ 2007-08-10  7:10                                                     ` Jarek Poplawski
  2007-08-10 10:43                                                       ` Marcin Ślusarz
  0 siblings, 1 reply; 69+ messages in thread
From: Jarek Poplawski @ 2007-08-10  7:10 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Fri, Aug 10, 2007 at 08:33:27AM +0200, Marcin Ślusarz wrote:
> 2007/8/9, Jarek Poplawski <jarkao2@o2.pl>:
...
> > diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
> > --- 2.6.23-rc1-/kernel/irq/chip.c       2007-07-09 01:32:17.000000000 +0200
> > +++ 2.6.23-rc1/kernel/irq/chip.c        2007-08-08 20:49:07.000000000 +0200
> > @@ -389,12 +389,19 @@ handle_fasteoi_irq(unsigned int irq, str
> >         unsigned int cpu = smp_processor_id();
> >         struct irqaction *action;
> >         irqreturn_t action_ret;
> > +       int edge = 0;
> >
...
> NETDEV WATCHDOG: eth0: transmit timed out
> eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=351.
> eth0: Resetting the 8390 t=4295229000...<6>NETDEV WATCHDOG: eth0:
> transmit timed out
> eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=718.
> eth0: Resetting the 8390 t=4295230000...<6>NETDEV WATCHDOG: eth0:
> transmit timed out
> eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=874.
> etc...

So, we still have to wait for the exact explanation...

Thanks very much Marcin!

I think, there is this one possible for your testing yet?:
Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
Date: Wed, 8 Aug 2007 13:00:37 +0200

If it's not a great problem it would be interesting to try this with
different CONFIG_HZ too e.g. you could start with 100 (I guess, you
tested very similar thing in 2.6.23-rc2 with 1000(?) already).

Jean-Baptiste: you can skip/break testing of this 'experimental'
patch, too.

Regards,
Jarek P.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-10  7:10                                                     ` Jarek Poplawski
@ 2007-08-10 10:43                                                       ` Marcin Ślusarz
  2007-08-10 11:37                                                         ` Jarek Poplawski
  0 siblings, 1 reply; 69+ messages in thread
From: Marcin Ślusarz @ 2007-08-10 10:43 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

2007/8/10, Jarek Poplawski <jarkao2@o2.pl>:
> (..)
> I think, there is this one possible for your testing yet?:
> Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
> Date: Wed, 8 Aug 2007 13:00:37 +0200
I think I already tested this patch, but this thread is sooo big and I
can't find my response...

> If it's not a great problem it would be interesting to try this with
> different CONFIG_HZ too e.g. you could start with 100 (I guess, you
> tested very similar thing in 2.6.23-rc2 with 1000(?) already).
My all tests were done on 2.6.22.1

Marcin

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time
  2007-08-10 10:43                                                       ` Marcin Ślusarz
@ 2007-08-10 11:37                                                         ` Jarek Poplawski
  0 siblings, 0 replies; 69+ messages in thread
From: Jarek Poplawski @ 2007-08-10 11:37 UTC (permalink / raw)
  To: Marcin Ślusarz
  Cc: Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
	netdev, Andrew Morton, Alan Cox

On Fri, Aug 10, 2007 at 12:43:43PM +0200, Marcin Ślusarz wrote:
> 2007/8/10, Jarek Poplawski <jarkao2@o2.pl>:
> > (..)
> > I think, there is this one possible for your testing yet?:
> > Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
> > Date: Wed, 8 Aug 2007 13:00:37 +0200
> I think I already tested this patch, but this thread is sooo big and I
> can't find my response...

I think it was very similar Ingo's patch, which after your testing
is in 2.6.23-rc2 now. I've moved return for level type irqs a little
later, and it works a new way only for "x86_64".

But, I think, now this patch is less important: if you find some time,
try mostly new Ingo's or Thomas' patches (latest possible versions).

> 
> > If it's not a great problem it would be interesting to try this with
> > different CONFIG_HZ too e.g. you could start with 100 (I guess, you
> > tested very similar thing in 2.6.23-rc2 with 1000(?) already).
> My all tests were done on 2.6.22.1

Fine! If it's not said otherwise this version should be appropriate for
most of these patches. But, please, send your current configs on next
posibility (dmesg, /proc/interrupts and .config).

Bye (till monday),
Jarek P.

^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2007-08-10 11:37 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-06-29  8:50 2.6.20->2.6.21 - networking dies after random time Jean-Baptiste Vignaud
2007-06-29 15:07 ` Jarek Poplawski
2007-07-23  5:44   ` Marcin Ślusarz
2007-07-23  8:53     ` Jarek Poplawski
2007-07-24  7:18     ` Jarek Poplawski
2007-07-24  8:05     ` Ingo Molnar
2007-07-24  9:42       ` Ingo Molnar
2007-07-24 19:30         ` Linus Torvalds
2007-07-24 20:04           ` Ingo Molnar
2007-07-25  0:19             ` Thomas Gleixner
2007-07-25  7:23               ` Jarek Poplawski
2007-07-25 13:57               ` Jarek Poplawski
2007-07-25 14:46                 ` Alan Cox
2007-07-26 12:44                   ` [PATCH][netdrvr] lib8390: comment on locking by Alan Cox " Jarek Poplawski
2007-07-26 12:47                     ` Alan Cox
2007-07-30 19:47                     ` Jeff Garzik
2007-07-30  8:46                   ` Ingo Molnar
2007-07-30 13:05                     ` Alan Cox
2007-07-26  7:16               ` Marcin Ślusarz
2007-07-26  8:13                 ` Jarek Poplawski
2007-07-26  8:10                   ` Thomas Gleixner
2007-07-26  8:31                     ` Ingo Molnar
2007-07-26  8:55                       ` Jarek Poplawski
2007-07-26  9:12                         ` Ingo Molnar
2007-07-30  7:29                           ` Marcin Ślusarz
2007-07-30  8:49                             ` Ingo Molnar
2007-08-01  7:24                               ` Marcin Ślusarz
2007-08-01  7:27                                 ` Ingo Molnar
2007-08-06  6:58                                   ` Marcin Ślusarz
2007-07-31 13:20                             ` Jarek Poplawski
2007-08-06  7:00                               ` Marcin Ślusarz
2007-08-06  7:03                                 ` Ingo Molnar
2007-08-06 17:43                                   ` Chuck Ebbert
2007-08-06 19:08                                     ` Ingo Molnar
2007-08-09 14:50                                       ` [RFC] " Jarek Poplawski
     [not found]                                         ` <p738x8kg0dp.fsf@bingen.suse.de>
2007-08-09 15:30                                           ` Jarek Poplawski
2007-08-07 10:09                                     ` Jarek Poplawski
2007-08-07  7:46                                   ` Marcin Ślusarz
2007-08-07  8:23                                     ` Jarek Poplawski
     [not found]                                       ` <4bacf17f0708070237w19d184b3p7f74b53612edb9a6@mail.gmail.com>
2007-08-07  9:52                                         ` Jarek Poplawski
2007-08-07 12:13                                           ` Jarek Poplawski
2007-08-07 12:55                                             ` Jarek Poplawski
2007-08-08 11:11                                               ` Marcin Ślusarz
2007-08-08 11:09                                             ` Marcin Ślusarz
2007-08-08 11:42                                               ` Jarek Poplawski
2007-08-08 11:53                                                 ` Jarek Poplawski
2007-08-09  9:19                                                 ` [patch (testing)] " Jarek Poplawski
     [not found]                                                   ` <4bacf17f0708092333n17e0ba19jf2c769531610868d@mail.gmail.com>
2007-08-10  7:10                                                     ` Jarek Poplawski
2007-08-10 10:43                                                       ` Marcin Ślusarz
2007-08-10 11:37                                                         ` Jarek Poplawski
2007-07-31 15:58                             ` [patch] genirq: temporary fix for level-triggered IRQ resend Ingo Molnar
2007-07-31 16:00                               ` Ingo Molnar
2007-08-08 11:00                                 ` Jarek Poplawski
2007-08-02 17:03                               ` Gabriel C
2007-08-02 20:11                                 ` Ingo Molnar
2007-08-03  6:07                                   ` [patch] genirq: fix simple and fasteoi irq handlers Jarek Poplawski
2007-08-03  8:04                                     ` Ingo Molnar
2007-08-03  8:46                                       ` Ingo Molnar
2007-08-03  9:10                                       ` Jarek Poplawski
2007-08-03 11:57                                         ` Marcin Ślusarz
2007-08-03 12:26                                           ` Jarek Poplawski
2007-08-06  7:05                                     ` Marcin Ślusarz
2007-08-06  6:07                                   ` [patch (take 2)] " Jarek Poplawski
2007-08-06  6:14                                     ` Ingo Molnar
2007-08-06  7:07                                       ` Marcin Ślusarz
2007-08-06  7:19                                       ` Jarek Poplawski
2007-07-26  9:11                     ` 2.6.20->2.6.21 - networking dies after random time Jarek Poplawski
2007-07-26  8:19                   ` Jarek Poplawski
2007-07-26  8:16                 ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).