* Possible race in request_irq() (__setup_irq())
@ 2012-05-16 12:34 Alexander Sverdlin
  2012-05-16 13:44 ` Thomas Gleixner
  0 siblings, 1 reply; 6+ messages in thread
From: Alexander Sverdlin @ 2012-05-16 12:34 UTC (permalink / raw)
  To: linux-kernel, Thomas Gleixner; +Cc: alexander.sverdlin.ext

Hi!

I want to discuss a possible race condition in __setup_irq(), which is called from request_irq().
Unfortunately, we hit a real crash with kernel 2.6.32.58, which is far from current, but the potential problem still persists in current linux-next.

This is what happens in 2.6.32.58 on our custom MIPS board:
-----------------------------------------------------------------
CPU 1 Unable to handle kernel paging request at virtual address 0000000000000008, epc == ffffffff801cb9f8, ra == ffffffff801cdef4
Oops[#1]:
Cpu 1
$ 0   : 0000000000000000 ffffffff80660008 0000000000000000 ffffffff805e5570
$ 4   : 000000000000002b 0000000000000000 0000000000080000 ffffffffffff00fe
$ 8   : 0000200300080000 0000000000000000 0000000000000000 a800000001000400
$12   : 000000001000cce0 000000001000001f a800000009400000 0000000000000000
$16   : ffffffff805db500 000000000000002b 8001070000000010 8001070000000238
$20   : 8001070000000220 0000000000000001 0000000000080000 000000000000002b
$24   : 0000000000000353 ffffffff80349ed8
$28   : a8000000007ac000 a8000000007afc10 ffffffff80650000 ffffffff801cdef4
Hi    : 0000000000000000
Lo    : 0000000000000000
epc   : ffffffff801cb9f8 handle_IRQ_event+0x30/0x190
    Not tainted
ra    : ffffffff801cdef4 handle_percpu_irq+0x54/0xc0
Status: 1000cce2    KX SX UX KERNEL EXL
Cause : 00800008
BadVA : 0000000000000008
PrId  : 000d900a (Cavium Octeon II)
Modules linked in: fsmddg_sfn srio_i2c eeprom bmu_ctrl sfp gpio_bmu leds_bmu led fsmddg_system_driver_mfd_dt fsmddg_paniclog ipv6 cramfs phram [last unloaded: fsmddg_sfn]
Process swapper (pid: 0, threadinfo=a8000000007ac000, task=a8000000007aae38, tls=0000000000000000)
Stack : ffffffff805db500 000000000000002b 8001070000000010 8001070000000238
        8001070000000220 0000000000000001 0000000000080000 8001070000000108
        ffffffff80650000 ffffffff801cdef4 0000000000000001 000000000000002b
        0000000000000001 ffffffff8015f094 000000000000002b ffffffff8011a594
        0000000000000000 ffffffff8066c290 ffffffff8066c290 0000000000000002
        ffffffff8065a758 ffffffff80650000 ffffffff80650000 800000000fd00000
        f5b3fdf8ceff7fff ffffffff80100880 0000000000000000 ffffffff80660008
        ffffffff80100a80 0000000000000000 0000000000100000 a80000004e5b7038
        000000001000cce0 ffffffffffff00fe 0000000000000001 0000000000000000
        0000000000000000 a800000001000400 0000000000000000 000000000000cc00
        ...
Call Trace:
[<ffffffff801cb9f8>] handle_IRQ_event+0x30/0x190
[<ffffffff801cdef4>] handle_percpu_irq+0x54/0xc0
[<ffffffff8015f094>] do_IRQ+0x2c/0x40
[<ffffffff8011a594>] plat_irq_dispatch+0x10c/0x1e8
[<ffffffff80100880>] ret_from_irq+0x0/0x4
[<ffffffff80100aa0>] r4k_wait+0x20/0x40
[<ffffffff8015fcc4>] cpu_idle+0x9c/0x108


Code: ffb30018  ffb20010  ffb10008 <dca20008> e8450002  00a0802d  41606020  3c028057  02e0982d
Disabling lock debugging due to kernel taint
Kernel panic - not syncing: Fatal exception in interrupt
Rebooting in 3 seconds..octeon_restart SMP
-----------------------------------------------------------------

The reason for this is that the IRQ is enabled before the IRQ handler is actually configured. Please take a look at __setup_irq() in kernel/irq/manage.c of linux-next:

1053         if (!shared) {
1054                 init_waitqueue_head(&desc->wait_for_threads);
1055 
1056                 /* Setup the type (level, edge polarity) if configured: */
1057                 if (new->flags & IRQF_TRIGGER_MASK) {
1058                         ret = __irq_set_trigger(desc, irq,
1059                                         new->flags & IRQF_TRIGGER_MASK);
1060 
1061                         if (ret)
1062                                 goto out_mask;
1063                 }
1064 
1065                 desc->istate &= ~(IRQS_AUTODETECT | IRQS_SPURIOUS_DISABLED | \
1066                                   IRQS_ONESHOT | IRQS_WAITING);
1067                 irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
1068 
1069                 if (new->flags & IRQF_PERCPU) {
1070                         irqd_set(&desc->irq_data, IRQD_PER_CPU);
1071                         irq_settings_set_per_cpu(desc);
1072                 }
1073 
1074                 if (new->flags & IRQF_ONESHOT)
1075                         desc->istate |= IRQS_ONESHOT;
1076 
1077                 if (irq_settings_can_autoenable(desc))
1078                         irq_startup(desc, true);

So unless IRQ_NOAUTOEN has been set on the descriptor, the IRQ will be enabled here.

1079                 else
1080                         /* Undo nested disables: */
1081                         desc->depth = 1;
1082 
1083                 /* Exclude IRQ from balancing if requested */
1084                 if (new->flags & IRQF_NOBALANCING) {
1085                         irq_settings_set_no_balancing(desc);
1086                         irqd_set(&desc->irq_data, IRQD_NO_BALANCING);
1087                 }
1088 
1089                 /* Set default affinity mask once everything is setup */
1090                 setup_affinity(irq, desc, mask);
1091 
1092         } else if (new->flags & IRQF_TRIGGER_MASK) {
1093                 unsigned int nmsk = new->flags & IRQF_TRIGGER_MASK;
1094                 unsigned int omsk = irq_settings_get_trigger_mask(desc);
1095 
1096                 if (nmsk != omsk)
1097                         /* hope the handler works with current  trigger mode */
1098                         pr_warning("irq %d uses trigger mode %u; requested %u\n",
1099                                    irq, nmsk, omsk);
1100         }
1101 
1102         new->irq = irq;
1103         *old_ptr = new;

Only here is the descriptor for this IRQ actually filled in with the handler!

This code runs inside a raw_spin_lock_irqsave()-protected region, but the IRQ could still be triggered on another core where IRQs are not disabled!
So if the IRQ affinity is set up such that the IRQ itself and request_irq() happen on different cores, an IRQ that is already pending in hardware will fire
before its handler is actually set up.
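
For reference, the IRQ_NOAUTOEN path mentioned above lets a driver sidestep the automatic enable entirely; a minimal sketch (MY_IRQ and my_handler are
made up for illustration):

	/* Keep the line masked across request_irq()... */
	irq_set_status_flags(MY_IRQ, IRQ_NOAUTOEN);

	err = request_irq(MY_IRQ, my_handler, 0, "my-dev", NULL);
	if (err)
		return err;

	/* ...and unmask only after the action is installed. */
	enable_irq(MY_IRQ);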

And this actually happens on our boards. The only reason the subject contains "Possible" is that this race has been present in the kernel for quite a long time and I have not found
any occurrences on SMP systems other than our Octeon. Another possible cause could be wrong usage of request_irq(), but the whole configuration seems to be legal (sketched below):

IRQ affinity is set to 1 (core 0 processes IRQ).
request_irq() happens during kernel init on core 5. 
IRQ is already pending (but not enabled) before request_irq() happens.
IRQ is not shared and should be enabled by request_irq() automatically.
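
In code, the requesting side looks roughly like this (a sketch with made-up names, not our actual driver):

	static irqreturn_t my_handler(int irq, void *dev_id)
	{
		return IRQ_HANDLED;
	}

	/* runs during init on core 5 while MY_IRQ is routed to core 0 */
	static int my_init(void)
	{
		/* the IRQ is delivered to core 0 only... */
		irq_set_affinity(MY_IRQ, cpumask_of(0));

		/*
		 * ...but request_irq() executes here on core 5. __setup_irq()
		 * unmasks the line via irq_startup() before the action is
		 * installed, so the pending IRQ can be dispatched on core 0
		 * while desc->action is still NULL.
		 */
		return request_irq(MY_IRQ, my_handler, 0, "my-dev", NULL);
	}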

The fix could look like the following. It also takes into account that the "desc" structure must be propagated to the other cores on SMP before the
first IRQ occurs. A similar fix works fine with 2.6.32.58; I'm currently unable to test linux-next on Octeon MIPS.

This modification ensures that IRQs are only enabled when their
handler has actually been set up and propagated to other cores on SMP.

Signed-off-by: Alexander Sverdlin <alexander.sverdlin@sysgo.com>

diff -uprN linux.old/kernel/irq/manage.c linux/kernel/irq/manage.c
--- linux.old/kernel/irq/manage.c	2012-05-16 14:08:38.000000000 +0200
+++ linux/kernel/irq/manage.c	2012-05-16 14:19:16.000000000 +0200
@@ -1074,12 +1074,6 @@ __setup_irq(unsigned int irq, struct irq
 		if (new->flags & IRQF_ONESHOT)
 			desc->istate |= IRQS_ONESHOT;
 
-		if (irq_settings_can_autoenable(desc))
-			irq_startup(desc, true);
-		else
-			/* Undo nested disables: */
-			desc->depth = 1;
-
 		/* Exclude IRQ from balancing if requested */
 		if (new->flags & IRQF_NOBALANCING) {
 			irq_settings_set_no_balancing(desc);
@@ -1107,11 +1101,35 @@ __setup_irq(unsigned int irq, struct irq
 	desc->irqs_unhandled = 0;
 
 	/*
+	 * Enable IRQs ONLY after handler has been already written to the
+	 * descriptor, as IRQ can happen immediately on another core
+	 */
+	if (!shared) {
+		if (irq_settings_can_autoenable(desc)) {
+			/*
+			 * IRQ could immediately fire on another core, so make
+			 * sure, that changes to the descriptor are propagated
+			 * to other cores
+			 */
+			smp_mb();
+			irq_startup(desc, true);
+		} else {
+			/* Undo nested disables: */
+			desc->depth = 1;
+		}
+	}
+
+	/*
 	 * Check whether we disabled the irq via the spurious handler
 	 * before. Reenable it and give it another chance.
 	 */
 	if (shared && (desc->istate & IRQS_SPURIOUS_DISABLED)) {
 		desc->istate &= ~IRQS_SPURIOUS_DISABLED;
+		/*
+		 * IRQ could immediately fire on another core, so make sure,
+		 * that changes to the descriptor are propagated to other cores
+		 */
+		smp_mb();
 		__enable_irq(desc, irq, false);
 	}
 


-- 
Best regards,
Alexander Sverdlin.

* Re: Possible race in request_irq() (__setup_irq())
  2012-05-16 12:34 Possible race in request_irq() (__setup_irq()) Alexander Sverdlin
@ 2012-05-16 13:44 ` Thomas Gleixner
  2012-05-21  7:47   ` Alexander Sverdlin
  2012-05-25 14:01   ` Alexander Sverdlin
  0 siblings, 2 replies; 6+ messages in thread
From: Thomas Gleixner @ 2012-05-16 13:44 UTC (permalink / raw)
  To: Alexander Sverdlin; +Cc: linux-kernel, alexander.sverdlin.ext

On Wed, 16 May 2012, Alexander Sverdlin wrote:

> [<ffffffff801cb9f8>] handle_IRQ_event+0x30/0x190
> [<ffffffff801cdef4>] handle_percpu_irq+0x54/0xc0
> [<ffffffff8015f094>] do_IRQ+0x2c/0x40
> [<ffffffff8011a594>] plat_irq_dispatch+0x10c/0x1e8
> [<ffffffff80100880>] ret_from_irq+0x0/0x4
> [<ffffffff80100aa0>] r4k_wait+0x20/0x40
> [<ffffffff8015fcc4>] cpu_idle+0x9c/0x108
> 

> This code runs inside a raw_spin_lock_irqsave()-protected region, but
> the IRQ could still be triggered on another core where IRQs are not
> disabled!

And that interrupt will spin on desc->lock until this code has set up
the action. So nothing happens at all.

Except for per_cpu interrupts. Now, that's a different issue and you
are doing something completely wrong here.
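
Simplified, the difference between the flow handlers looks like this (a
sketch, not the exact kernel code):

	/* An ordinary flow handler serializes against __setup_irq(): */
	static void sketch_handle_level_irq(struct irq_desc *desc)
	{
		raw_spin_lock(&desc->lock); /* spins until __setup_irq() drops it */
		/* ack/mask, then run desc->action -- guaranteed installed here */
		raw_spin_unlock(&desc->lock);
	}

	/* handle_percpu_irq() deliberately skips the lock: */
	static void sketch_handle_percpu_irq(struct irq_desc *desc)
	{
		/* no desc->lock -- nothing orders this against __setup_irq() */
		handle_irq_event_percpu(desc, desc->action);
	}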

> So if the IRQ affinity is set up such that the IRQ itself and
> request_irq() happen on different cores, an IRQ that is already
> pending in hardware will fire before its handler is actually set up.

per_cpu interrupts are special.
 
> And this actually happens on our boards. The only reason the subject
> contains "Possible" is that this race has been present in the kernel
> for quite a long time and I have not found any occurrences on SMP
> systems other than our Octeon. Another possible cause could be wrong
> usage of request_irq(), but the whole configuration seems to be
> legal:

Well, there is no law which forbids doing that.

> IRQ affinity is set to 1 (core 0 processes IRQ).
> request_irq() happens during kernel init on core 5. 
> IRQ is already pending (but not enabled) before request_irq() happens.
> IRQ is not shared and should be enabled by request_irq() automatically.

But it's wrong nevertheless.

Your irq is using handle_percpu_irq() as the flow handler.

handle_percpu_irq() is a special flow handler which does not take the
irq descriptor lock for performance reasons. It's a single interrupt
number which has a percpu dev_id and can be handled on all cores in
parallel.

The interrupts need to be marked as such and requested with
request_percpu_irq(). Those interrupts are either marked as
NOAUTOENABLE or set up by the low level setup code, which runs on the
boot cpu with interrupt enabled.

Those interrupts are marked as percpu and can only be requested with
request_percpu_irq().
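
For reference, the intended usage is roughly the following (a sketch;
MY_PERCPU_IRQ and the my_* names are illustrative):

	struct my_state { int count; };
	static DEFINE_PER_CPU(struct my_state, my_state);

	static irqreturn_t my_percpu_handler(int irq, void *dev_id)
	{
		/* dev_id is the invoking CPU's instance of my_state */
		struct my_state *s = dev_id;

		s->count++;
		return IRQ_HANDLED;
	}

	static int my_setup(void)
	{
		int err;

		/* done once by the low level irq setup code */
		irq_set_percpu_devid(MY_PERCPU_IRQ);

		err = request_percpu_irq(MY_PERCPU_IRQ, my_percpu_handler,
					 "my-percpu", &my_state);
		if (err)
			return err;

		/* each CPU unmasks its own copy of the interrupt */
		enable_percpu_irq(MY_PERCPU_IRQ, 0); /* run on every CPU */
		return 0;
	}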

From your description it looks like you are using a regular interrupt,
because interrupt affinities of per cpu interrupts cannot be set. They
are hardwired.

I don't know what your archaeological kernel version is doing there, but
the current cavium code only uses handle_percpu_irq flow handler for a
handful special interrupts which are handled and setup by the cavium
core code correctly.

So nothing to fix here.

Thanks,

	tglx

* Re: Possible race in request_irq() (__setup_irq())
  2012-05-16 13:44 ` Thomas Gleixner
@ 2012-05-21  7:47   ` Alexander Sverdlin
  2012-05-25 14:01   ` Alexander Sverdlin
  1 sibling, 0 replies; 6+ messages in thread
From: Alexander Sverdlin @ 2012-05-21  7:47 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel, alexander.sverdlin.ext

Hello Thomas,

On 05/16/2012 03:44 PM, Thomas Gleixner wrote:
> I don't know what your archaeological kernel version is doing there, but
> the current cavium code only uses handle_percpu_irq flow handler for a
> handful special interrupts which are handled and setup by the cavium
> core code correctly.
> 
> So nothing to fix here.

Thanks for clarification!
I'll clean up all our IRQ code, remove the unnecessary PER_CPU flags, and send a note on how it helped.

-- 
Best regards,
Alexander Sverdlin.

* Re: Possible race in request_irq() (__setup_irq())
  2012-05-16 13:44 ` Thomas Gleixner
  2012-05-21  7:47   ` Alexander Sverdlin
@ 2012-05-25 14:01   ` Alexander Sverdlin
  2012-05-25 14:22     ` Thomas Gleixner
  2012-05-25 16:24     ` David Daney
  1 sibling, 2 replies; 6+ messages in thread
From: Alexander Sverdlin @ 2012-05-25 14:01 UTC (permalink / raw)
  To: Thomas Gleixner, David Daney, Venkat Subbiah; +Cc: linux-kernel

Hello Thomas, David, Venkat,

On 05/16/2012 03:44 PM, Thomas Gleixner wrote:
> Your irq is using handle_percpu_irq() as the flow handler.
> 
> handle_percpu_irq() is a special flow handler which does not take the
> irq descriptor lock for performance reasons. It's a single interrupt
> number which has a percpu dev_id and can be handled on all cores in
> parallel.
> 
> The interrupts need to be marked as such and requested with
> request_percpu_irq(). Those interrupts are either marked as
> NOAUTOENABLE or set up by the low level setup code, which runs on the
> boot cpu with interrupt enabled.
> 
> Those interrupts are marked as percpu and can only be requested with
> request_percpu_irq().
> 

Could someone please comment on why exactly this happens in current linux-next for Octeon:

In arch/mips/cavium-octeon/octeon-irq.c MBOX IRQs are set up to be handled by handle_percpu_irq():

static void __init octeon_irq_init_ciu(void)
{
...
	octeon_irq_set_ciu_mapping(OCTEON_IRQ_MBOX0, 0, 32, chip_mbox, handle_percpu_irq);
	octeon_irq_set_ciu_mapping(OCTEON_IRQ_MBOX1, 0, 33, chip_mbox, handle_percpu_irq);

But in arch/mips/cavium-octeon/smp.c it is requested as a normal IRQ:

void octeon_prepare_cpus(unsigned int max_cpus)
{
...
	if (request_irq(OCTEON_IRQ_MBOX0, mailbox_interrupt,
			IRQF_PERCPU | IRQF_NO_THREAD, "SMP-IPI",
			mailbox_interrupt)) {
		panic("Cannot request_irq(OCTEON_IRQ_MBOX0)");
	}

Is it a bug, or some kind of special case?

-- 
Best regards,
Alexander Sverdlin.

* Re: Possible race in request_irq() (__setup_irq())
  2012-05-25 14:01   ` Alexander Sverdlin
@ 2012-05-25 14:22     ` Thomas Gleixner
  2012-05-25 16:24     ` David Daney
  1 sibling, 0 replies; 6+ messages in thread
From: Thomas Gleixner @ 2012-05-25 14:22 UTC (permalink / raw)
  To: Alexander Sverdlin; +Cc: David Daney, Venkat Subbiah, linux-kernel

On Fri, 25 May 2012, Alexander Sverdlin wrote:
> Hello Thomas, David, Venkat,
> 
> On 05/16/2012 03:44 PM, Thomas Gleixner wrote:
> > Your irq is using handle_percpu_irq() as the flow handler.
> > 
> > handle_percpu_irq() is a special flow handler which does not take the
> > irq descriptor lock for performance reasons. It's a single interrupt
> > number which has a percpu dev_id and can be handled on all cores in
> > parallel.
> > 
> > The interrupts need to be marked as such and requested with
> > request_percpu_irq(). Those interrupts are either marked as
> > NOAUTOENABLE or set up by the low level setup code, which runs on the
> > boot cpu with interrupt enabled.
> > 
> > Those interrupts are marked as percpu and can only be requested with
> > request_percpu_irq().
> > 
> 
> Could someone please comment on why exactly this happens in current linux-next for Octeon:
> 
> In arch/mips/cavium-octeon/octeon-irq.c MBOX IRQs are set up to be handled by handle_percpu_irq():
> 
> static void __init octeon_irq_init_ciu(void)
> {
> ...
> 	octeon_irq_set_ciu_mapping(OCTEON_IRQ_MBOX0, 0, 32, chip_mbox, handle_percpu_irq);
> 	octeon_irq_set_ciu_mapping(OCTEON_IRQ_MBOX1, 0, 33, chip_mbox, handle_percpu_irq);
> 
> But in arch/mips/cavium-octeon/smp.c it is requested as a normal IRQ:
> 
> void octeon_prepare_cpus(unsigned int max_cpus)
> {
> ...
> 	if (request_irq(OCTEON_IRQ_MBOX0, mailbox_interrupt,
> 			IRQF_PERCPU | IRQF_NO_THREAD, "SMP-IPI",
> 			mailbox_interrupt)) {
> 		panic("Cannot request_irq(OCTEON_IRQ_MBOX0)");
> 	}
> 
> Is it a bug, or some kind of special case?

I forgot the other case where a simple dev_id is used. That one is a
simple per cpu interrupt, which has a regular dev_id. Note the flag:
IRQF_PERCPU. So yes we have two variants, but both are dealing with
per cpu interrupts. The subtle difference is how the dev_id is handled
and how the enable/disable mechanism works.
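
In code, the two request forms look like this (a sketch; names made up):

	/* Variant 1 (the Octeon mbox case above): one shared dev_id,
	 * enabled/disabled globally with enable_irq()/disable_irq(). */
	err = request_irq(irq, handler, IRQF_PERCPU | IRQF_NO_THREAD,
			  "name", &shared_cookie);

	/* Variant 2: a __percpu dev_id, one instance per CPU, with each
	 * CPU unmasking itself via enable_percpu_irq(). */
	err = request_percpu_irq(irq, handler, "name", percpu_cookie);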

What are you trying to do/solve ?

Thanks,

	tglx

* Re: Possible race in request_irq() (__setup_irq())
  2012-05-25 14:01   ` Alexander Sverdlin
  2012-05-25 14:22     ` Thomas Gleixner
@ 2012-05-25 16:24     ` David Daney
  1 sibling, 0 replies; 6+ messages in thread
From: David Daney @ 2012-05-25 16:24 UTC (permalink / raw)
  To: Alexander Sverdlin
  Cc: Thomas Gleixner, David Daney, Venkat Subbiah, linux-kernel

On 05/25/2012 07:01 AM, Alexander Sverdlin wrote:
> Hello Thomas, David, Venkat,
>
> On 05/16/2012 03:44 PM, Thomas Gleixner wrote:
>> Your irq is using handle_percpu_irq() as the flow handler.
>>
>> handle_percpu_irq() is a special flow handler which does not take the
>> irq descriptor lock for performance reasons. It's a single interrupt
>> number which has a percpu dev_id and can be handled on all cores in
>> parallel.
>>
>> The interrupts need to be marked as such and requested with
>> request_percpu_irq(). Those interrupts are either marked as
>> NOAUTOENABLE or set up by the low level setup code, which runs on the
>> boot cpu with interrupt enabled.
>>
>> Those interrupts are marked as percpu and can only be requested with
>> request_percpu_irq().
>>
>
> Could someone please comment on why exactly this happens in current linux-next for Octeon:
>
> In arch/mips/cavium-octeon/octeon-irq.c MBOX IRQs are set up to be handled by handle_percpu_irq():
>
> static void __init octeon_irq_init_ciu(void)
> {
> ...
> 	octeon_irq_set_ciu_mapping(OCTEON_IRQ_MBOX0, 0, 32, chip_mbox, handle_percpu_irq);
> 	octeon_irq_set_ciu_mapping(OCTEON_IRQ_MBOX1, 0, 33, chip_mbox, handle_percpu_irq);
>
> But in arch/mips/cavium-octeon/smp.c it is requested as a normal IRQ:
>
> void octeon_prepare_cpus(unsigned int max_cpus)
> {
> ...
> 	if (request_irq(OCTEON_IRQ_MBOX0, mailbox_interrupt,
> 			IRQF_PERCPU | IRQF_NO_THREAD, "SMP-IPI",
> 			mailbox_interrupt)) {
> 		panic("Cannot request_irq(OCTEON_IRQ_MBOX0)");
> 	}
>
> Is it a bug, or some kind of special case?
>

Probably a bug.  But it works, so it has never been an issue.

Patches to bring this code more in line with current Linux irq orthodoxy
would receive prompt review, especially if they came with an explanation of
the problem they were fixing.

David Daney
