linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry)
@ 2019-10-08 13:04 Yi Zheng
  2019-10-09 17:45 ` Tony Lindgren
  2019-10-14 12:34 ` Thomas Gleixner
  0 siblings, 2 replies; 6+ messages in thread
From: Yi Zheng @ 2019-10-08 13:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Jason Cooper, Tony Lindgren, Sekhar Nori, Zheng Yi

[-- Attachment #1: Type: text/plain, Size: 6130 bytes --]

Hi,

     I found something wrong on my AM3352 SoC machine, the GPIO triggered IRQ is
     masked unexpectedly.  That bug cause the devices using that GPIO-IRQ can
     not work. Even the latest kernel version (v5.4-rc2-20-geda57a0e4299)!

     After a long time hacking, I guess the bug is in kernel/irq/chip.c, the
     important base code for _ALL_ the Processor Platform! That is why this mail
     is sent to you.

     Shortly speaking, the bug is about wrong interrupt iteration. That cause
     the software flag "IRQD_IRQ_MASKED" not refect the real masking status
     register in the interrupt controller (INTC,
     drivers/irqchip/irq-omap-intc.c).

     Here is my hw/sw settings:
     (1) FPGA implements some UART, using a GPIO pin as INT signal,

     (2) In the SoC, 481ac000.gpio is under the INTC controller, irq=28, and
         hwirq=32 It is an level triggered IRQ.

     (3) The ISR calling stack shows that handle_level_irq() is the most
         important function for debugging.

     There is some defects on IRQ processing:

     (1) At the beginning of handle_level_irq(), the IRQ-28 is masked, and ACK
         action is executed: On my machine, it runs the 'else' branch:

            static inline void mask_ack_irq(struct irq_desc *desc)
            {
                if (desc->irq_data.chip->irq_mask_ack) {
                        desc->irq_data.chip->irq_mask_ack(&desc->irq_data);
                        irq_state_set_masked(desc);
                } else {
                        mask_irq(desc);
                        if (desc->irq_data.chip->irq_ack)
                                desc->irq_data.chip->irq_ack(&desc->irq_data);
                }
            }

         It is an 2-steps procedure:
         1. mask_irq()
         2. desc->irq_data.chip->irq_ack()

         the 2nd step, the function ptr is omap_mask_ack_irq(), which
         _MASK_ the hardware INTC-IRQ-32 and then do the real ACK action.

     (2) mask_irq()/unmask_irq() are not atomic actions: They check the
         IRQD_IRQ_MASKED flag firstly, and then mask/unmask the irq by calling
         the function ptrs which installed by irq controller drv.  Then, those 2
         functions set/clear the IRQD_IRQ_MASKED flag.

         I think the sequence of the hw/sw action should be mirrored reversed:
         mask_irq():
            check IRQD_IRQ_MASKED;
            set hardware IRQ mask register;
            set software IRQD_IRQ_MASKED flag;

         unmask_irq():
            check IRQD_IRQ_MASKED;
            /* NOTE: should before the hw unmask action!! */
            clear software IRQD_IRQ_MASKED flag;
            clear hardware IRQ mask register;

         The current unmask_irq(), hw-mask action runs before sw-mask action,
         which gives an very small time window. That cause an unexpected
         iterated IRQ.

     Here is my the detail of my analyzing of handle_level_irq():

     (1) Let record the HW-IRQ-Controller Status and the SW-Flag IRQD_IRQ_MASKED
         pair as following: (hw-mask, sw-mask).

     (2) In the 1st level of IRQ-28 ISR calling, in unmask_irq(), after the HW
         unmask action, and before the sw-flag IRQD_IRQ_MASKED is cleared, there
         is a VERY SMALL TIME WINDOW, in which, another IRQ-28 may triggered.

         In that time window, the mask status is (0, 1), which is no an valid
         value.

     (3) In the 2nd level of the ISR(IRQ-28), The mask status is IRQ-28(0, 1),
         so mask_irq() do nothing, because sw-flag is "1".  That is an wrong
         status, the programmer thinks that IRQ-28 has been masked, but
         physically not!

     (4) Before the ACK func-ptr calling, there comes the 3rd level IRQ (28/32)!
         Although mask_irq() do not physically mask the IRQ, ACK acion
         (omap_mask_ack_irq) of the omap-intc drv mask the IRQ physically. The
         3rd ISR runs OK.

         When 3rd level ISR exist, the mask status is (0, 0), that is OK!

     (5) The 3rd level ISR finished, the 2nd level ISR continue.  It run ACK
         function ptr -- omap_mask_ack_irq(). The HW-IRQ-mask is set again!
         Now, the mask status is (1, 0), it is unreasonable value!

     (6) The 2nd level ISR run to cond_unmask_irq(). Due to the ill-formed
         mask-status value(1, 0), the unmask_irq() will not be called.  Even in
         unmask_irq(), another checking SW-Flag exists. The real unmask action
         will not run!

     (7) Now, the 2nd level ISR return, with the mask status (1, 0).  The 1st
         level ISR continues, in unmask_irq(), it run irq_state_clr_masked();
         And it repeatedly clear the IRQD_IRQ_MASKED. The final mask status is
         (1, 0).

         What (1, 0) value means? The CPU call will not receive IRQ any more!
         That is my bug phenomenon. If I clean the hardware interrupt
         controller's mask bit, my devices work again.


      NOTE: (1) My SoC is a single core ARM chip: TI-AM3352, so the raw
         spin-lock irq_desc->lock will be optimized to
         nothing. handle_level_irq() has no spin-lock protection, right?

         (2) In AM3352, INTC driver ACK the IRQ by write 0x01 into INTC Control
             Register(offset 0x48).  The chip doc seems that bit[0] of
             INTC-Control Reg is only an enable/disable flag.  The IRQ may
             generated even if no ACK action done. Any one can give me an
             clarification?

         (3) My analysis is not verified on the real machine. After some code
             change for debug(add counter to indicates the iteration level, save
             the IRQ mask status etc.), the device IRQ wrongly masked problem
             vanished. In fact, the original code can not re-produce the
             phenomena easily. In tens of machine, only one can get the bug. I
             have try my best to hacking the code, but the only verified result
             is here: when bug occur, the HW IRQ is masked, but the
             IRQD_IRQ_MASKED flag is cleared.

      My fixup is in the attachment, which remove the unexpected time window of
      IRQ iteration.

[-- Attachment #2: irq-chip-fixup.patch --]
[-- Type: text/x-patch, Size: 369 bytes --]

--- kernel/irq/chip.c	2019-07-13 09:28:23.683787367 +0800
+++ /tmp/chip.c	2019-10-08 11:32:35.082258572 +0800
@@ -432,8 +432,8 @@ void unmask_irq(struct irq_desc *desc)
 		return;
 
 	if (desc->irq_data.chip->irq_unmask) {
-		desc->irq_data.chip->irq_unmask(&desc->irq_data);
 		irq_state_clr_masked(desc);
+		desc->irq_data.chip->irq_unmask(&desc->irq_data);
 	}
 }
 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry)
  2019-10-08 13:04 Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry) Yi Zheng
@ 2019-10-09 17:45 ` Tony Lindgren
  2019-10-10  6:20   ` Yi Zheng
  2019-10-14 12:34 ` Thomas Gleixner
  1 sibling, 1 reply; 6+ messages in thread
From: Tony Lindgren @ 2019-10-09 17:45 UTC (permalink / raw)
  To: Yi Zheng
  Cc: Thomas Gleixner, linux-kernel, Jason Cooper, Sekhar Nori, Zheng Yi

Hi,

* Yi Zheng <goodmenzy@gmail.com> [191008 13:06]:
>       NOTE: (1) My SoC is a single core ARM chip: TI-AM3352, so the raw
>          spin-lock irq_desc->lock will be optimized to
>          nothing. handle_level_irq() has no spin-lock protection, right?

Well not always, With CONFIG_SMP we modify only some of the SMP code on boot,
see arch/arm/kernel/head.S for smp_on_up and then the related macro usage.

>          (2) In AM3352, INTC driver ACK the IRQ by write 0x01 into INTC Control
>              Register(offset 0x48).  The chip doc seems that bit[0] of
>              INTC-Control Reg is only an enable/disable flag.  The IRQ may
>              generated even if no ACK action done. Any one can give me an
>              clarification?

The TI INTC is probably better documented in dm3630 trm, it's the same
controller but with a different revision.

>          (3) My analysis is not verified on the real machine. After some code
>              change for debug(add counter to indicates the iteration level, save
>              the IRQ mask status etc.), the device IRQ wrongly masked problem
>              vanished. In fact, the original code can not re-produce the
>              phenomena easily. In tens of machine, only one can get the bug. I
>              have try my best to hacking the code, but the only verified result
>              is here: when bug occur, the HW IRQ is masked, but the
>              IRQD_IRQ_MASKED flag is cleared.
> 
>       My fixup is in the attachment, which remove the unexpected time window of
>       IRQ iteration.

Let's see what Thomas has to say for that. Meanwhile, please take a
look at Documentation/process/submitting-patches.rst for getting things
right for sending out patches that can be applied without manual
editing :)

Cheers,

Tony

> --- kernel/irq/chip.c	2019-07-13 09:28:23.683787367 +0800
> +++ /tmp/chip.c	2019-10-08 11:32:35.082258572 +0800
> @@ -432,8 +432,8 @@ void unmask_irq(struct irq_desc *desc)
>  		return;
>  
>  	if (desc->irq_data.chip->irq_unmask) {
> -		desc->irq_data.chip->irq_unmask(&desc->irq_data);
>  		irq_state_clr_masked(desc);
> +		desc->irq_data.chip->irq_unmask(&desc->irq_data);
>  	}
>  }
>  


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry)
  2019-10-09 17:45 ` Tony Lindgren
@ 2019-10-10  6:20   ` Yi Zheng
  2019-10-10 13:51     ` Tony Lindgren
  0 siblings, 1 reply; 6+ messages in thread
From: Yi Zheng @ 2019-10-10  6:20 UTC (permalink / raw)
  To: Tony Lindgren
  Cc: Thomas Gleixner, linux-kernel, Jason Cooper, Sekhar Nori, Zheng Yi

Hi

   My patch is canceled. I have tested that simple adjustment, the
device IRQ masked unexpectedly. It seems that it is more easily to
occur with that patch.

So, the root cause is not found yet.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry)
  2019-10-10  6:20   ` Yi Zheng
@ 2019-10-10 13:51     ` Tony Lindgren
  0 siblings, 0 replies; 6+ messages in thread
From: Tony Lindgren @ 2019-10-10 13:51 UTC (permalink / raw)
  To: Yi Zheng
  Cc: Thomas Gleixner, linux-kernel, Jason Cooper, Sekhar Nori, Zheng Yi

* Yi Zheng <goodmenzy@gmail.com> [191010 06:22]:
> Hi
> 
>    My patch is canceled. I have tested that simple adjustment, the
> device IRQ masked unexpectedly. It seems that it is more easily to
> occur with that patch.
> 
> So, the root cause is not found yet.

OK. Based on your description, it could be a missing flush of a
posted write somewhere in the gpio-omap or intc path.

Regards,

Tony

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry)
  2019-10-08 13:04 Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry) Yi Zheng
  2019-10-09 17:45 ` Tony Lindgren
@ 2019-10-14 12:34 ` Thomas Gleixner
  2019-10-14 14:51   ` Yi Zheng
  1 sibling, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2019-10-14 12:34 UTC (permalink / raw)
  To: Yi Zheng; +Cc: linux-kernel, Jason Cooper, Tony Lindgren, Sekhar Nori, Zheng Yi

On Tue, 8 Oct 2019, Yi Zheng wrote:
>      There is some defects on IRQ processing:
> 
>      (1) At the beginning of handle_level_irq(), the IRQ-28 is masked, and ACK
>          action is executed: On my machine, it runs the 'else' branch:
> 
>             static inline void mask_ack_irq(struct irq_desc *desc)
>             {
>                 if (desc->irq_data.chip->irq_mask_ack) {
>                         desc->irq_data.chip->irq_mask_ack(&desc->irq_data);
>                         irq_state_set_masked(desc);
>                 } else {
>                         mask_irq(desc);
>                         if (desc->irq_data.chip->irq_ack)
>                                 desc->irq_data.chip->irq_ack(&desc->irq_data);
>                 }
>             }
> 
>          It is an 2-steps procedure:
>          1. mask_irq()
>          2. desc->irq_data.chip->irq_ack()
> 
>          the 2nd step, the function ptr is omap_mask_ack_irq(), which
>          _MASK_ the hardware INTC-IRQ-32 and then do the real ACK action.

Sure. Where is the problem?

>      (2) mask_irq()/unmask_irq() are not atomic actions: They check the
>          IRQD_IRQ_MASKED flag firstly, and then mask/unmask the irq by calling
>          the function ptrs which installed by irq controller drv.  Then, those 2
>          functions set/clear the IRQD_IRQ_MASKED flag.
> 
>          I think the sequence of the hw/sw action should be mirrored reversed:
>          mask_irq():
>             check IRQD_IRQ_MASKED;
>             set hardware IRQ mask register;
>             set software IRQD_IRQ_MASKED flag;
> 
>          unmask_irq():
>             check IRQD_IRQ_MASKED;
>             /* NOTE: should before the hw unmask action!! */
>             clear software IRQD_IRQ_MASKED flag;
>             clear hardware IRQ mask register;
> 
>          The current unmask_irq(), hw-mask action runs before sw-mask action,
>          which gives an very small time window. That cause an unexpected
>          iterated IRQ.

It's completely irrelevant because _ALL_ those operations run with
irq_desc->lock held. So nothing can actually observe that state.

>      Here is my the detail of my analyzing of handle_level_irq():
> 
>      (1) Let record the HW-IRQ-Controller Status and the SW-Flag IRQD_IRQ_MASKED
>          pair as following: (hw-mask, sw-mask).
> 
>      (2) In the 1st level of IRQ-28 ISR calling, in unmask_irq(), after the HW
>          unmask action, and before the sw-flag IRQD_IRQ_MASKED is cleared, there
>          is a VERY SMALL TIME WINDOW, in which, another IRQ-28 may triggered.
> 
>          In that time window, the mask status is (0, 1), which is no an valid
>          value.

Again. Irrelevant because not observable.

>       My fixup is in the attachment, which remove the unexpected time window of
>       IRQ iteration.

Please don't send attachments. See Documentation/process/submitting-patches.rst

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry)
  2019-10-14 12:34 ` Thomas Gleixner
@ 2019-10-14 14:51   ` Yi Zheng
  0 siblings, 0 replies; 6+ messages in thread
From: Yi Zheng @ 2019-10-14 14:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Jason Cooper, Tony Lindgren, Sekhar Nori, Zheng Yi

Hi, Thomas

I canceled that patch. In my testing, that will not fix the problem.
Why the IRQ is unexpectedly masked, that is not an easy bug for me.
More time need to hacking the driver/kernel code.

Thanks for your reply.

Thomas Gleixner <tglx@linutronix.de> 于2019年10月14日周一 下午8:34写道:
>
> On Tue, 8 Oct 2019, Yi Zheng wrote:
> >      There is some defects on IRQ processing:
> >
> >      (1) At the beginning of handle_level_irq(), the IRQ-28 is masked, and ACK
> >          action is executed: On my machine, it runs the 'else' branch:
> >
> >             static inline void mask_ack_irq(struct irq_desc *desc)
> >             {
> >                 if (desc->irq_data.chip->irq_mask_ack) {
> >                         desc->irq_data.chip->irq_mask_ack(&desc->irq_data);
> >                         irq_state_set_masked(desc);
> >                 } else {
> >                         mask_irq(desc);
> >                         if (desc->irq_data.chip->irq_ack)
> >                                 desc->irq_data.chip->irq_ack(&desc->irq_data);
> >                 }
> >             }
> >
> >          It is an 2-steps procedure:
> >          1. mask_irq()
> >          2. desc->irq_data.chip->irq_ack()
> >
> >          the 2nd step, the function ptr is omap_mask_ack_irq(), which
> >          _MASK_ the hardware INTC-IRQ-32 and then do the real ACK action.
>
> Sure. Where is the problem?
>
> >      (2) mask_irq()/unmask_irq() are not atomic actions: They check the
> >          IRQD_IRQ_MASKED flag firstly, and then mask/unmask the irq by calling
> >          the function ptrs which installed by irq controller drv.  Then, those 2
> >          functions set/clear the IRQD_IRQ_MASKED flag.
> >
> >          I think the sequence of the hw/sw action should be mirrored reversed:
> >          mask_irq():
> >             check IRQD_IRQ_MASKED;
> >             set hardware IRQ mask register;
> >             set software IRQD_IRQ_MASKED flag;
> >
> >          unmask_irq():
> >             check IRQD_IRQ_MASKED;
> >             /* NOTE: should before the hw unmask action!! */
> >             clear software IRQD_IRQ_MASKED flag;
> >             clear hardware IRQ mask register;
> >
> >          The current unmask_irq(), hw-mask action runs before sw-mask action,
> >          which gives an very small time window. That cause an unexpected
> >          iterated IRQ.
>
> It's completely irrelevant because _ALL_ those operations run with
> irq_desc->lock held. So nothing can actually observe that state.
>
> >      Here is my the detail of my analyzing of handle_level_irq():
> >
> >      (1) Let record the HW-IRQ-Controller Status and the SW-Flag IRQD_IRQ_MASKED
> >          pair as following: (hw-mask, sw-mask).
> >
> >      (2) In the 1st level of IRQ-28 ISR calling, in unmask_irq(), after the HW
> >          unmask action, and before the sw-flag IRQD_IRQ_MASKED is cleared, there
> >          is a VERY SMALL TIME WINDOW, in which, another IRQ-28 may triggered.
> >
> >          In that time window, the mask status is (0, 1), which is no an valid
> >          value.
>
> Again. Irrelevant because not observable.
>
> >       My fixup is in the attachment, which remove the unexpected time window of
> >       IRQ iteration.
>
> Please don't send attachments. See Documentation/process/submitting-patches.rst
>
> Thanks,
>
>         tglx
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2019-10-14 14:53 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-08 13:04 Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry) Yi Zheng
2019-10-09 17:45 ` Tony Lindgren
2019-10-10  6:20   ` Yi Zheng
2019-10-10 13:51     ` Tony Lindgren
2019-10-14 12:34 ` Thomas Gleixner
2019-10-14 14:51   ` Yi Zheng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).