linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] xirc2ps_cs irq return fix
       [not found] <200305252318.h4PNIPX4026812@hera.kernel.org>
@ 2003-05-26  0:44 ` Jeff Garzik
  2003-05-26  1:00   ` Zwane Mwaikambo
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff Garzik @ 2003-05-26  0:44 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Andrew Morton

Linux Kernel Mailing List wrote:
> diff -Nru a/drivers/net/pcmcia/xirc2ps_cs.c b/drivers/net/pcmcia/xirc2ps_cs.c
> --- a/drivers/net/pcmcia/xirc2ps_cs.c	Sun May 25 16:18:34 2003
> +++ b/drivers/net/pcmcia/xirc2ps_cs.c	Sun May 25 16:18:34 2003
>      if (!netif_device_present(dev))
> -	return IRQ_NONE;
> +	return IRQ_HANDLED;

As I mentioned in the thread, this piece of code is obviously wrong.

Think about how scalable this fix is??  Do you really want to crap up 
all pcmcia drivers with this silly -- and wrong -- check?

IIRC the pcmcia layer or new irqreturn_t was blamed for the problem. 
Come on.  Linux mantra is -against- papering over bugs.

	Jeff




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xirc2ps_cs irq return fix
  2003-05-26  0:44 ` [PATCH] xirc2ps_cs irq return fix Jeff Garzik
@ 2003-05-26  1:00   ` Zwane Mwaikambo
  2003-05-26 23:35     ` Alan Cox
  0 siblings, 1 reply; 9+ messages in thread
From: Zwane Mwaikambo @ 2003-05-26  1:00 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Linux Kernel Mailing List, Andrew Morton

On Sun, 25 May 2003, Jeff Garzik wrote:

> As I mentioned in the thread, this piece of code is obviously wrong.
> 
> Think about how scalable this fix is??  Do you really want to crap up 
> all pcmcia drivers with this silly -- and wrong -- check?

My interpretation of it is the PCMCIA controller was triggering interrupts 
on exit and the link handler for the card was still installed even after 
the netdevice was down.

> IIRC the pcmcia layer or new irqreturn_t was blamed for the problem. 
> Come on.  Linux mantra is -against- papering over bugs.

I have to take responsibility for that little mess :(

	Zwane
-- 
function.linuxpower.ca

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xirc2ps_cs irq return fix
  2003-05-26  1:00   ` Zwane Mwaikambo
@ 2003-05-26 23:35     ` Alan Cox
  2003-05-27  3:49       ` Jeff Garzik
  0 siblings, 1 reply; 9+ messages in thread
From: Alan Cox @ 2003-05-26 23:35 UTC (permalink / raw)
  To: Zwane Mwaikambo; +Cc: Jeff Garzik, Linux Kernel Mailing List, Andrew Morton

On Llu, 2003-05-26 at 02:00, Zwane Mwaikambo wrote:
> My interpretation of it is the PCMCIA controller was triggering interrupts 
> on exit and the link handler for the card was still installed even after 
> the netdevice was down.

This is exactly what will happen all the time on PCMCIA devices. The
edge triggered interrupt will cause an IRQ to float around every remove
of the device on most hardware

The fix is basically correct, although the odd floating IRQ ought to be
cleaned up by the heuristics being fixed in the core IRQ disaster
detector


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xirc2ps_cs irq return fix
  2003-05-26 23:35     ` Alan Cox
@ 2003-05-27  3:49       ` Jeff Garzik
  2003-05-27  3:55         ` Andrew Morton
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff Garzik @ 2003-05-27  3:49 UTC (permalink / raw)
  To: Alan Cox; +Cc: Zwane Mwaikambo, Linux Kernel Mailing List, Andrew Morton

Alan Cox wrote:
> On Llu, 2003-05-26 at 02:00, Zwane Mwaikambo wrote:
> 
>>My interpretation of it is the PCMCIA controller was triggering interrupts 
>>on exit and the link handler for the card was still installed even after 
>>the netdevice was down.
> 
> 
> This is exactly what will happen all the time on PCMCIA devices. The
> edge triggered interrupt will cause an IRQ to float around every remove
> of the device on most hardware
> 
> The fix is basically correct, although the odd floating IRQ ought to be
> cleaned up by the heuristics being fixed in the core IRQ disaster
> detector


huh?

If the fix is correct, we need to do a bombing run that adds the 
following code to each driver,

	if (!netif_device_present(dev))
		return IRQ_HANDLED;

but of course the irq _wasn't_ handled by the driver.  Silly, isn't it?

I still maintain it needs to be fixed elsewhere.  Touching every driver 
for this is just not scalable, in addition to pointing the finger at the 
pcmcia core instead of individual drivers.

	Jeff



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xirc2ps_cs irq return fix
  2003-05-27  3:49       ` Jeff Garzik
@ 2003-05-27  3:55         ` Andrew Morton
  2003-05-27  9:40           ` Alan Cox
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2003-05-27  3:55 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: alan, zwane, linux-kernel

Jeff Garzik <jgarzik@pobox.com> wrote:
>
> If the fix is correct, we need to do a bombing run that adds the 
>  following code to each driver,
> 
>  	if (!netif_device_present(dev))
>  		return IRQ_HANDLED;

no...  What we need to do is to kill the printk in handle_IRQ_event().

It should only be turned on for special situations, where someone is trying
to hunt down a reproducible lockup.  These are situations in which the odd
false positive just doesn't matter.  And we know that there will always
be false positives due to apic delivery latency (at least).

I think the time is right to do this.  Add CONFIG_DEBUG_IRQ and get on with
fixing real stuff.




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xirc2ps_cs irq return fix
  2003-05-27  3:55         ` Andrew Morton
@ 2003-05-27  9:40           ` Alan Cox
  2003-05-27 10:45             ` Andrew Morton
  0 siblings, 1 reply; 9+ messages in thread
From: Alan Cox @ 2003-05-27  9:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jeff Garzik, zwane, Linux Kernel Mailing List

On Maw, 2003-05-27 at 04:55, Andrew Morton wrote:
> It should only be turned on for special situations, where someone is trying
> to hunt down a reproducible lockup.  These are situations in which the odd
> false positive just doesn't matter.  And we know that there will always
> be false positives due to apic delivery latency (at least).
> 
> I think the time is right to do this.  Add CONFIG_DEBUG_IRQ and get on with
> fixing real stuff.

I not 100% convinced. I think it should be left by default but only if you get
something like a million unhandled interrupts in a row. Thats the "your box has died"
case where the info is useful anyway.

Maybe the number in a row is what you want to tweak in config or /proc/sys ?


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xirc2ps_cs irq return fix
  2003-05-27 10:45             ` Andrew Morton
@ 2003-05-27 10:40               ` Alan Cox
  2003-05-27 19:37                 ` Andrew Morton
  0 siblings, 1 reply; 9+ messages in thread
From: Alan Cox @ 2003-05-27 10:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: jgarzik, zwane, Linux Kernel Mailing List

On Maw, 2003-05-27 at 11:45, Andrew Morton wrote:
> The below patch has been in -mm for some time.  It was supposed to kill the
> IRQ if 750 of the previous 1000 IRQs weren't handled.
> 
> I disabled the killing code because it was triggering on someone's
> works-just-fine setup.
> 
> There will be pain involved in getting all this to work right.  Do you
> really think there's much value in it?

Being able to at least turn it on at run time is valuable when you are debugging
a box operated by someone who doesnt habitually rebuild kernels. The 750 of 1000
thing doesnt work because it can happen to be timing triggered by blocks of IRQ's
from a chip being folded together. The "million in a row" should be a stuck IRQ,
maybe 50,000 in a row even but just "zillions in a row"


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xirc2ps_cs irq return fix
  2003-05-27  9:40           ` Alan Cox
@ 2003-05-27 10:45             ` Andrew Morton
  2003-05-27 10:40               ` Alan Cox
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2003-05-27 10:45 UTC (permalink / raw)
  To: Alan Cox; +Cc: jgarzik, zwane, linux-kernel

Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>
> On Maw, 2003-05-27 at 04:55, Andrew Morton wrote:
> > It should only be turned on for special situations, where someone is trying
> > to hunt down a reproducible lockup.  These are situations in which the odd
> > false positive just doesn't matter.  And we know that there will always
> > be false positives due to apic delivery latency (at least).
> > 
> > I think the time is right to do this.  Add CONFIG_DEBUG_IRQ and get on with
> > fixing real stuff.
> 
> I not 100% convinced. I think it should be left by default but only if you get
> something like a million unhandled interrupts in a row. Thats the "your box has died"
> case where the info is useful anyway.
> 
> Maybe the number in a row is what you want to tweak in config or /proc/sys ?
> 

Maybe.

The below patch has been in -mm for some time.  It was supposed to kill the
IRQ if 750 of the previous 1000 IRQs weren't handled.

I disabled the killing code because it was triggering on someone's
works-just-fine setup.

There will be pain involved in getting all this to work right.  Do you
really think there's much value in it?





Attempt to do something intelligent with IRQ handlers which don't return
IRQ_HANDLED.

- If they return neither IRQ_HANDLED nor IRQ_NONE, complain.

- If they return IRQ_NONE more than 750 times in 1000 interrupts, complain
  and disable the IRQ.



 arch/i386/kernel/irq.c |   96 +++++++++++++++++++++++++++++++++++--------------
 include/linux/irq.h    |    2 +
 2 files changed, 72 insertions(+), 26 deletions(-)

diff -puN arch/i386/kernel/irq.c~irq-check-rate-limit arch/i386/kernel/irq.c
--- 25/arch/i386/kernel/irq.c~irq-check-rate-limit	2003-05-08 00:27:15.000000000 -0700
+++ 25-akpm/arch/i386/kernel/irq.c	2003-05-08 00:27:32.000000000 -0700
@@ -66,8 +66,12 @@
 /*
  * Controller mappings for all interrupt sources:
  */
-irq_desc_t irq_desc[NR_IRQS] __cacheline_aligned =
-	{ [0 ... NR_IRQS-1] = { 0, &no_irq_type, NULL, 0, SPIN_LOCK_UNLOCKED}};
+irq_desc_t irq_desc[NR_IRQS] __cacheline_aligned = {
+	[0 ... NR_IRQS-1] = {
+		.handler = &no_irq_type,
+		.lock = SPIN_LOCK_UNLOCKED
+	}
+};
 
 static void register_irq_proc (unsigned int irq);
 
@@ -209,7 +213,6 @@ int handle_IRQ_event(unsigned int irq,
 {
 	int status = 1;	/* Force the "do bottom halves" bit */
 	int retval = 0;
-	struct irqaction *first_action = action;
 
 	if (!(action->flags & SA_INTERRUPT))
 		local_irq_enable();
@@ -222,30 +225,69 @@ int handle_IRQ_event(unsigned int irq,
 	if (status & SA_SAMPLE_RANDOM)
 		add_interrupt_randomness(irq);
 	local_irq_disable();
-	if (retval != 1) {
-		static int count = 100;
-		if (count) {
-			count--;
-			if (retval) {
-				printk("irq event %d: bogus retval mask %x\n",
-					irq, retval);
-			} else {
-				printk("irq %d: nobody cared!\n", irq);
-			}
-			dump_stack();
-			printk("handlers:\n");
-			action = first_action;
-			do {
-				printk("[<%p>]", action->handler);
-				print_symbol(" (%s)",
-					(unsigned long)action->handler);
-				printk("\n");
-				action = action->next;
-			} while (action);
+	return retval;
+}
+
+static void report_bad_irq(int irq, irq_desc_t *desc, irqreturn_t action_ret)
+{
+	static int count = 100;
+	struct irqaction *action;
+
+	if (count) {
+		count--;
+		if (action_ret != IRQ_HANDLED && action_ret != IRQ_NONE) {
+			printk("irq event %d: bogus return value %x\n",
+					irq, action_ret);
+		} else {
+			printk("irq %d: nobody cared!\n", irq);
 		}
+		dump_stack();
+		printk("handlers:\n");
+		action = desc->action;
+		do {
+			printk("[<%p>]", action->handler);
+			print_symbol(" (%s)",
+				(unsigned long)action->handler);
+			printk("\n");
+			action = action->next;
+		} while (action);
 	}
+}
 
-	return status;
+/*
+ * If 750 of the previous 1000 interrupts have not been handled then assume
+ * that the IRQ is stuck in some manner.  Drop a diagnostic and try to turn the
+ * IRQ off.
+ *
+ * Called under desc->lock
+ */
+static void note_interrupt(int irq, irq_desc_t *desc, irqreturn_t action_ret)
+{
+	if (action_ret != IRQ_HANDLED) {
+		desc->irqs_unhandled++;
+		if (action_ret != IRQ_NONE)
+			report_bad_irq(irq, desc, action_ret);
+	}
+
+	desc->irq_count++;
+	if (desc->irq_count < 1000)
+		return;
+
+	desc->irq_count = 0;
+	if (desc->irqs_unhandled > 750) {
+		/*
+		 * The interrupt is stuck
+		 */
+		report_bad_irq(irq, desc, action_ret);
+		/*
+		 * Now kill the IRQ
+		 */
+#if 0
+		desc->status |= IRQ_DISABLED;
+		desc->handler->disable(irq);
+#endif
+	}
+	desc->irqs_unhandled = 0;
 }
 
 /*
@@ -418,10 +460,12 @@ asmlinkage unsigned int do_IRQ(struct pt
 	 * SMP environment.
 	 */
 	for (;;) {
+		irqreturn_t action_ret;
+
 		spin_unlock(&desc->lock);
-		handle_IRQ_event(irq, &regs, action);
+		action_ret = handle_IRQ_event(irq, &regs, action);
 		spin_lock(&desc->lock);
-		
+		note_interrupt(irq, desc, action_ret);
 		if (likely(!(desc->status & IRQ_PENDING)))
 			break;
 		desc->status &= ~IRQ_PENDING;
diff -puN include/linux/irq.h~irq-check-rate-limit include/linux/irq.h
--- 25/include/linux/irq.h~irq-check-rate-limit	2003-05-08 00:27:15.000000000 -0700
+++ 25-akpm/include/linux/irq.h	2003-05-08 00:27:15.000000000 -0700
@@ -61,6 +61,8 @@ typedef struct {
 	hw_irq_controller *handler;
 	struct irqaction *action;	/* IRQ action list */
 	unsigned int depth;		/* nested irq disables */
+	unsigned int irq_count;		/* For detecting broken interrupts */
+	unsigned int irqs_unhandled;
 	spinlock_t lock;
 } ____cacheline_aligned irq_desc_t;
 

_



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xirc2ps_cs irq return fix
  2003-05-27 10:40               ` Alan Cox
@ 2003-05-27 19:37                 ` Andrew Morton
  0 siblings, 0 replies; 9+ messages in thread
From: Andrew Morton @ 2003-05-27 19:37 UTC (permalink / raw)
  To: Alan Cox; +Cc: jgarzik, zwane, linux-kernel

Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>
> On Maw, 2003-05-27 at 11:45, Andrew Morton wrote:
> > The below patch has been in -mm for some time.  It was supposed to kill the
> > IRQ if 750 of the previous 1000 IRQs weren't handled.
> > 
> > I disabled the killing code because it was triggering on someone's
> > works-just-fine setup.
> > 
> > There will be pain involved in getting all this to work right.  Do you
> > really think there's much value in it?
> 
> Being able to at least turn it on at run time is valuable when you are debugging
> a box operated by someone who doesnt habitually rebuild kernels. The 750 of 1000
> thing doesnt work because it can happen to be timing triggered by blocks of IRQ's
> from a chip being folded together. The "million in a row" should be a stuck IRQ,
> maybe 50,000 in a row even but just "zillions in a row"
> 

The problem with zillions-in-a-row is that the babbling device could be
sharing the IRQ with a non-babbling device.  So the legitimate interrupts
from the non-babbler will cause the detection state machine to reset
itself.  It will never go off, and the box remains locked up.

Maybe the 750/1000 should be 100/100,000.  Who knows...


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2003-05-27 19:24 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <200305252318.h4PNIPX4026812@hera.kernel.org>
2003-05-26  0:44 ` [PATCH] xirc2ps_cs irq return fix Jeff Garzik
2003-05-26  1:00   ` Zwane Mwaikambo
2003-05-26 23:35     ` Alan Cox
2003-05-27  3:49       ` Jeff Garzik
2003-05-27  3:55         ` Andrew Morton
2003-05-27  9:40           ` Alan Cox
2003-05-27 10:45             ` Andrew Morton
2003-05-27 10:40               ` Alan Cox
2003-05-27 19:37                 ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).