linux-kernel.vger.kernel.org archive mirror
* MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs
@ 2019-12-09 10:27 Ramon Fried
  2020-01-02  8:19 ` Ramon Fried
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Ramon Fried @ 2019-12-09 10:27 UTC (permalink / raw)
  To: hkallweit1, bhelgaas, marc.zyngier, tglx, lorenzo.pieralisi
  Cc: linux-pci, linux-kernel

Hi,
While debugging the root cause of spurious IRQs on my PCIe MSI line, it appears
that because of the line:
    info->chip->flags |= IRQCHIP_ONESHOT_SAFE;
in pci_msi_create_irq_domain(),
IRQF_ONESHOT is ignored, especially when requesting an IRQ through
pci_request_threaded_irq() where the handler is NULL.

The problem is that the MSI masking now only surrounds the HW handler,
and all additional MSIs that occur before the threaded handler is
complete are considered spurious by note_interrupt().

Besides that side effect, I don't really understand the logic
of not masking the MSI until the threaded handler is complete,
especially when there's no HW handler and only a threaded handler.
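
For reference, the request pattern in question looks roughly like this (a
minimal sketch using the generic request_threaded_irq(); the device names
and the thread function are placeholders):

static irqreturn_t my_thread_fn(int irq, void *dev_id)
{
        /* runs in thread context and may sleep */
        return IRQ_HANDLED;
}

/*
 * handler == NULL normally forces IRQF_ONESHOT in the core, but
 * IRQCHIP_ONESHOT_SAFE makes the core drop the flag again.
 */
ret = request_threaded_irq(pci_irq_vector(pdev, 0), NULL, my_thread_fn,
                           IRQF_ONESHOT, "my-dev", dev);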

Your thoughts?

Thanks,
Ramon.


* Re: MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs
  2019-12-09 10:27 MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs Ramon Fried
@ 2020-01-02  8:19 ` Ramon Fried
  2020-01-13 14:59 ` Ramon Fried
  2020-01-14 12:15 ` Thomas Gleixner
  2 siblings, 0 replies; 18+ messages in thread
From: Ramon Fried @ 2020-01-02  8:19 UTC (permalink / raw)
  To: hkallweit1, Bjorn Helgaas, marc.zyngier, tglx, lorenzo.pieralisi
  Cc: linux-pci, linux-kernel

Ping

On Mon, Dec 9, 2019 at 12:27 PM Ramon Fried <rfried.dev@gmail.com> wrote:
>
> Hi,
> While debugging the root cause of spurious IRQs on my PCIe MSI line, it appears
> that because of the line:
>     info->chip->flags |= IRQCHIP_ONESHOT_SAFE;
> in pci_msi_create_irq_domain(),
> IRQF_ONESHOT is ignored, especially when requesting an IRQ through
> pci_request_threaded_irq() where the handler is NULL.
>
> The problem is that the MSI masking now only surrounds the HW handler,
> and all additional MSIs that occur before the threaded handler is
> complete are considered spurious by note_interrupt().
>
> Besides that side effect, I don't really understand the logic
> of not masking the MSI until the threaded handler is complete,
> especially when there's no HW handler and only a threaded handler.
>
> Your thoughts?
>
> Thanks,
> Ramon.


* Re: MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs
  2019-12-09 10:27 MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs Ramon Fried
  2020-01-02  8:19 ` Ramon Fried
@ 2020-01-13 14:59 ` Ramon Fried
  2020-01-14 12:15 ` Thomas Gleixner
  2 siblings, 0 replies; 18+ messages in thread
From: Ramon Fried @ 2020-01-13 14:59 UTC (permalink / raw)
  To: hkallweit1, Bjorn Helgaas, tglx, lorenzo.pieralisi, maz
  Cc: linux-pci, linux-kernel

Resending with correct e-mail address of maz.

On Mon, Dec 9, 2019 at 12:27 PM Ramon Fried <rfried.dev@gmail.com> wrote:
>
> Hi,
> While debugging the root cause of spurious IRQs on my PCIe MSI line, it appears
> that because of the line:
>     info->chip->flags |= IRQCHIP_ONESHOT_SAFE;
> in pci_msi_create_irq_domain(),
> IRQF_ONESHOT is ignored, especially when requesting an IRQ through
> pci_request_threaded_irq() where the handler is NULL.
>
> The problem is that the MSI masking now only surrounds the HW handler,
> and all additional MSIs that occur before the threaded handler is
> complete are considered spurious by note_interrupt().
>
> Besides that side effect, I don't really understand the logic
> of not masking the MSI until the threaded handler is complete,
> especially when there's no HW handler and only a threaded handler.
>
> Your thoughts?
>
> Thanks,
> Ramon.


* Re: MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs
  2019-12-09 10:27 MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs Ramon Fried
  2020-01-02  8:19 ` Ramon Fried
  2020-01-13 14:59 ` Ramon Fried
@ 2020-01-14 12:15 ` Thomas Gleixner
  2020-01-14 21:38   ` Ramon Fried
  2 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2020-01-14 12:15 UTC (permalink / raw)
  To: Ramon Fried, hkallweit1, bhelgaas, maz, lorenzo.pieralisi
  Cc: linux-pci, linux-kernel

Ramon Fried <rfried.dev@gmail.com> writes:
> While debugging the root cause of spurious IRQs on my PCIe MSI line, it appears
> that because of the line:
>     info->chip->flags |= IRQCHIP_ONESHOT_SAFE;
> in pci_msi_create_irq_domain(),
>
> IRQF_ONESHOT is ignored, especially when requesting an IRQ through
> pci_request_threaded_irq() where the handler is NULL.

Which is perfectly fine.

> The problem is that the MSI masking now only surrounds the HW handler,
> and all additional MSIs that occur before the threaded handler is
> complete are considered spurious by note_interrupt().

Which is not a problem as long as the thread finishes before 100k MSIs
have arrived on that line. If that happens then there is something really
wrong. Either the device fires MSIs like crazy or the threaded handler
is stuck somewhere.
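
(For reference, the accounting in note_interrupt() is roughly this;
heavily simplified from kernel/irq/spurious.c, details vary by kernel
version:)

/* note_interrupt(), heavily simplified */
if (action_ret == IRQ_NONE)
        desc->irqs_unhandled++;
if (likely(++desc->irq_count < 100000))
        return;
desc->irq_count = 0;
if (unlikely(desc->irqs_unhandled > 99900)) {
        /* "nobody cared" - the line gets disabled */
        desc->istate |= IRQS_SPURIOUS_DISABLED;
        irq_disable(desc);
}
desc->irqs_unhandled = 0;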

> Besides that side effect, I don't really understand the logic
> of not masking the MSI until the threaded handler is complete,
> especially when there's no HW handler and only a threaded handler.

What's wrong with having another interrupt firing while the threaded
handler is running? Nothing, really. It actually can be desired because
the threaded handler is allowed to sleep.

Thanks,

        tglx


* Re: MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs
  2020-01-14 12:15 ` Thomas Gleixner
@ 2020-01-14 21:38   ` Ramon Fried
  2020-01-14 21:40     ` Ramon Fried
  0 siblings, 1 reply; 18+ messages in thread
From: Ramon Fried @ 2020-01-14 21:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: hkallweit1, Bjorn Helgaas, maz, lorenzo.pieralisi, linux-pci,
	linux-kernel

On Tue, Jan 14, 2020 at 2:15 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Ramon Fried <rfried.dev@gmail.com> writes:
> > While debugging the root cause of spurious IRQs on my PCIe MSI line, it appears
> > that because of the line:
> >     info->chip->flags |= IRQCHIP_ONESHOT_SAFE;
> > in pci_msi_create_irq_domain(),
> >
> > IRQF_ONESHOT is ignored, especially when requesting an IRQ through
> > pci_request_threaded_irq() where the handler is NULL.
>
> Which is perfectly fine.
>
> > The problem is that the MSI masking now only surrounds the HW handler,
> > and all additional MSIs that occur before the threaded handler is
> > complete are considered spurious by note_interrupt().
>
> Which is not a problem as long as the thread finishes before 100k MSIs
> have arrived on that line. If that happens then there is something really
> wrong. Either the device fires MSIs like crazy or the threaded handler
> is stuck somewhere.
>
> > Besides that side effect, I don't really understand the logic
> > of not masking the MSI until the threaded handler is complete,
> > especially when there's no HW handler and only a threaded handler.
>
> What's wrong with having another interrupt firing while the threaded
> handler is running? Nothing, really. It actually can be desired because
> the threaded handler is allowed to sleep.
What do you mean, isn't that the purpose of IRQ masking?
Interrupt coalescing is done to mitigate these IRQs; these HW
interrupts just consume CPU cycles and don't do anything useful
(scheduling an already scheduled thread).
Thanks,
Ramon.
>
> Thanks,
>
>         tglx


* Re: MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs
  2020-01-14 21:38   ` Ramon Fried
@ 2020-01-14 21:40     ` Ramon Fried
  2020-01-14 22:54       ` Thomas Gleixner
  0 siblings, 1 reply; 18+ messages in thread
From: Ramon Fried @ 2020-01-14 21:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: hkallweit1, Bjorn Helgaas, maz, lorenzo.pieralisi, linux-pci,
	linux-kernel

On Tue, Jan 14, 2020 at 11:38 PM Ramon Fried <rfried.dev@gmail.com> wrote:
>
> On Tue, Jan 14, 2020 at 2:15 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Ramon Fried <rfried.dev@gmail.com> writes:
> > > While debugging the root cause of spurious IRQs on my PCIe MSI line, it appears
> > > that because of the line:
> > >     info->chip->flags |= IRQCHIP_ONESHOT_SAFE;
> > > in pci_msi_create_irq_domain(),
> > >
> > > IRQF_ONESHOT is ignored, especially when requesting an IRQ through
> > > pci_request_threaded_irq() where the handler is NULL.
> >
> > Which is perfectly fine.
> >
> > > The problem is that the MSI masking now only surrounds the HW handler,
> > > and all additional MSIs that occur before the threaded handler is
> > > complete are considered spurious by note_interrupt().
> >
> > Which is not a problem as long as the thread finishes before 100k MSIs
> > have arrived on that line. If that happens then there is something really
> > wrong. Either the device fires MSIs like crazy or the threaded handler
> > is stuck somewhere.
> >
> > > Besides that side effect, I don't really understand the logic
> > > of not masking the MSI until the threaded handler is complete,
> > > especially when there's no HW handler and only a threaded handler.
> >
> > What's wrong with having another interrupt firing while the threaded
> > handler is running? Nothing, really. It actually can be desired because
> > the threaded handler is allowed to sleep.
> What do you mean, isn't that the purpose of IRQ masking?
> Interrupt coalescing is done to mitigate these IRQs; these HW
> interrupts just consume CPU cycles and don't do anything useful
> (scheduling an already scheduled thread).
Additionally, in this case there isn't even an HW IRQ handler; it's
passed as NULL to the request IRQ function in this scenario.
> Thanks,
> Ramon.
> >
> > Thanks,
> >
> >         tglx


* Re: MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs
  2020-01-14 21:40     ` Ramon Fried
@ 2020-01-14 22:54       ` Thomas Gleixner
  2020-01-16  0:19         ` Ramon Fried
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2020-01-14 22:54 UTC (permalink / raw)
  To: Ramon Fried
  Cc: hkallweit1, Bjorn Helgaas, maz, lorenzo.pieralisi, linux-pci,
	linux-kernel

Ramon Fried <rfried.dev@gmail.com> writes:
> On Tue, Jan 14, 2020 at 11:38 PM Ramon Fried <rfried.dev@gmail.com> wrote:
>> On Tue, Jan 14, 2020 at 2:15 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > Ramon Fried <rfried.dev@gmail.com> writes:
>> > > Besides that side effect, I don't really understand the logic
>> > > of not masking the MSI until the threaded handler is complete,
>> > > especially when there's no HW handler and only a threaded handler.
>> >
>> > What's wrong with having another interrupt firing while the threaded
>> > handler is running? Nothing, really. It actually can be desired because
>> > the threaded handler is allowed to sleep.
>> >
>> What do you mean, isn't that the purpose of IRQ masking?  Interrupt
>> coalescing is done to mitigate these IRQs; these HW interrupts just
>> consume CPU cycles and don't do anything useful (scheduling an
>> already scheduled thread).

Again, that depends on your POV. It's a perfectly valid scenario to have
another HW irq coming in, preventing the thread from going to sleep and
just running for another cycle. So no, masking is not necessarily
required, and the semantics of MSI are edge type, so the hardware should
not fire another interrupt _before_ the threaded handler has actually
taken care of the initial one.

> Additionally, in this case there isn't even an HW IRQ handler; it's
> passed as NULL to the request IRQ function in this scenario.

This is completely irrelevant. The primary hardware IRQ handler is
provided by the core code in this case.

Due to the semantics of MSI this is perfectly fine; aside from your
problem this has worked perfectly fine so far, and it's an actual
performance win because it avoids fiddling with the MSI mask, which is
slow.
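
(For reference, that core-provided default primary handler is essentially
just this; see kernel/irq/manage.c:)

static irqreturn_t irq_default_primary_handler(int irq, void *dev_id)
{
        return IRQ_WAKE_THREAD;
}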

You still have not told which driver/hardware is affected by this. Can
you please provide that information so we can finally look at the actual
hardware/driver combo?

Either the driver is broken or the hardware does not comply with the MSI
spec.

Thanks,

        tglx


* Re: MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs
  2020-01-14 22:54       ` Thomas Gleixner
@ 2020-01-16  0:19         ` Ramon Fried
  2020-01-16  1:39           ` Thomas Gleixner
  0 siblings, 1 reply; 18+ messages in thread
From: Ramon Fried @ 2020-01-16  0:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: hkallweit1, Bjorn Helgaas, maz, lorenzo.pieralisi, linux-pci,
	linux-kernel

On Wed, Jan 15, 2020 at 12:54 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Ramon Fried <rfried.dev@gmail.com> writes:
> > On Tue, Jan 14, 2020 at 11:38 PM Ramon Fried <rfried.dev@gmail.com> wrote:
> >> On Tue, Jan 14, 2020 at 2:15 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > Ramon Fried <rfried.dev@gmail.com> writes:
> >> > > Besides that side effect, I don't really understand the logic
> >> > > of not masking the MSI until the threaded handler is complete,
> >> > > especially when there's no HW handler and only a threaded handler.
> >> >
> >> > What's wrong with having another interrupt firing while the threaded
> >> > handler is running? Nothing, really. It actually can be desired because
> >> > the threaded handler is allowed to sleep.
> >> >
> >> What do you mean, isn't that the purpose of IRQ masking?  Interrupt
> >> coalescing is done to mitigate these IRQs; these HW interrupts just
> >> consume CPU cycles and don't do anything useful (scheduling an
> >> already scheduled thread).
>
> Again, that depends on your POV. It's a perfectly valid scenario to have
> another HW irq coming in, preventing the thread from going to sleep and
> just running for another cycle. So no, masking is not necessarily
> required, and the semantics of MSI are edge type, so the hardware should
> not fire another interrupt _before_ the threaded handler has actually
> taken care of the initial one.
>
> > Additionally, in this case there isn't even an HW IRQ handler; it's
> > passed as NULL to the request IRQ function in this scenario.
>
> This is completely irrelevant. The primary hardware IRQ handler is
> provided by the core code in this case.
>
You're right.

> Due to the semantics of MSI this is perfectly fine; aside from your
> problem this has worked perfectly fine so far, and it's an actual
> performance win because it avoids fiddling with the MSI mask, which is
> slow.
>
Fiddling with MSI masks is a configuration space write, which is
non-posted, so it does come with a price.
The question is whether a test was ever conducted to see if it's better
than spurious IRQs.

> You still have not told which driver/hardware is affected by this. Can
> you please provide that information so we can finally look at the actual
> hardware/driver combo?
>
Sure,
I'm writing an MSI IRQ controller; it's basically a MIPS GIC interrupt
line on which several MSIs are multiplexed.
It's configured with handle_level_irq() as the GIC is a level IRQ.

The ack callback acks the GIC irq.
The mask/unmask callbacks call pci_msi_mask_irq() / pci_msi_unmask_irq().
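
Roughly, the MSI-side chip is wired up like this (a sketch of the above;
the .name value and the ack callback name are illustrative):

static struct irq_chip eq_msi_bottom_irq_chip = {
        .name       = "EQ-MSI",
        .irq_ack    = eq_msi_ack,           /* acks the GIC irq */
        .irq_mask   = pci_msi_mask_irq,
        .irq_unmask = pci_msi_unmask_irq,
};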

Thanks,
Ramon.
> Either the driver is broken or the hardware does not comply with the MSI
> spec.
>
> Thanks,
>
>         tglx


* Re: MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs
  2020-01-16  0:19         ` Ramon Fried
@ 2020-01-16  1:39           ` Thomas Gleixner
  2020-01-16  7:58             ` Ramon Fried
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2020-01-16  1:39 UTC (permalink / raw)
  To: Ramon Fried
  Cc: hkallweit1, Bjorn Helgaas, maz, lorenzo.pieralisi, linux-pci,
	linux-kernel, maz

Ramon,

Ramon Fried <rfried.dev@gmail.com> writes:
> On Wed, Jan 15, 2020 at 12:54 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> Ramon Fried <rfried.dev@gmail.com> writes:
>> Due to the semantics of MSI this is perfectly fine; aside from your
>> problem this has worked perfectly fine so far, and it's an actual
>> performance win because it avoids fiddling with the MSI mask, which is
>> slow.
>>
> Fiddling with MSI masks is a configuration space write, which is
> non-posted, so it does come with a price.
> The question is whether a test was ever conducted to see if it's better
> than spurious IRQs.

The point is that there are no spurious interrupts in the sane cases, and
the tests we did showed real performance improvements in high-frequency
interrupt situations due to avoiding the config space access.

Please stop claiming that this spurious interrupt problem is there by
design. It's not. Read the MSI spec.

Also boot your laptop/workstation with 'threadirqs' on the kernel
command line and check how many spurious interrupts come in. On a test
machine which has that command line parameter set I see exactly ONE with
an uptime of several days and heavy MSI interrupt activity. The ONE is
even there without 'threadirqs' on the command line, so I really can't
be bothered to analyze that.

>> You still have not told which driver/hardware is affected by this. Can
>> you please provide that information so we can finally look at the actual
>> hardware/driver combo?
>>
> Sure,
> I'm writing an MSI IRQ controller; it's basically a MIPS GIC interrupt
> line on which several MSIs are multiplexed.

I assume you write the driver, not the VHDL for the actual hardware,
right? If so, you still did not tell which hardware that is and where we
can find information about it.

I further assume that 'multiplexed' means that the hardware is something
like an MSI receiver on the CPU/chipset which handles multiple MSI
messages and forwards them to a single shared interrupt line on the MIPS
GIC. Right?

Can you please provide a pointer to the hardware documentation?

> It's configured with handle_level_irq() as the GIC is a level IRQ.

Which is completely bonkers. MSI has edge semantics and sharing an
interrupt line for edge type interrupts is broken by design, unless the
hardware which handles the incoming MSIs and forwards them to the level
type interrupt line is designed properly and the driver does the right
thing.

> The ack callback acks the GIC irq.  The mask/unmask callbacks call
> pci_msi_mask_irq() / pci_msi_unmask_irq().

What? How is that supposed to work with multiple MSIs?

Either the hardware is a trainwreck or the driver or both.

I can't tell as I can't find my crystal ball. Maybe I should replace it
with a Mobileye :)

Thanks,

        tglx


* Re: MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs
  2020-01-16  1:39           ` Thomas Gleixner
@ 2020-01-16  7:58             ` Ramon Fried
  2020-01-16  9:32               ` Thomas Gleixner
  0 siblings, 1 reply; 18+ messages in thread
From: Ramon Fried @ 2020-01-16  7:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: hkallweit1, Bjorn Helgaas, maz, lorenzo.pieralisi, linux-pci,
	linux-kernel

On Thu, Jan 16, 2020 at 3:39 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Ramon,
>
> Ramon Fried <rfried.dev@gmail.com> writes:
> > On Wed, Jan 15, 2020 at 12:54 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> Ramon Fried <rfried.dev@gmail.com> writes:
> >> Due to the semantics of MSI this is perfectly fine; aside from your
> >> problem this has worked perfectly fine so far, and it's an actual
> >> performance win because it avoids fiddling with the MSI mask, which is
> >> slow.
> >>
> > Fiddling with MSI masks is a configuration space write, which is
> > non-posted, so it does come with a price.
> > The question is whether a test was ever conducted to see if it's better
> > than spurious IRQs.
>
> The point is that there are no spurious interrupts in the sane cases, and
> the tests we did showed real performance improvements in high-frequency
> interrupt situations due to avoiding the config space access.
>
> Please stop claiming that this spurious interrupt problem is there by
> design. It's not. Read the MSI spec.
>
> Also boot your laptop/workstation with 'threadirqs' on the kernel
> command line and check how many spurious interrupts come in. On a test
> machine which has that command line parameter set I see exactly ONE with
> an uptime of several days and heavy MSI interrupt activity. The ONE is
> even there without 'threadirqs' on the command line, so I really can't
> be bothered to analyze that.
>
> >> You still have not told which driver/hardware is affected by this. Can
> >> you please provide that information so we can finally look at the actual
> >> hardware/driver combo?
> >>
> > Sure,
> > I'm writing an MSI IRQ controller; it's basically a MIPS GIC interrupt
> > line on which several MSIs are multiplexed.
>
> I assume you write the driver, not the VHDL for the actual hardware,
> right? If so, you still did not tell which hardware that is and where we
> can find information about it.
There's no official information I can share but I can explain how it works:
Basically, 32 MSI vectors are represented by a single GIC irq.
There's a status register in which every bit corresponds to an MSI vector,
and each individual MSI needs to be acked in that register. Whenever any
bit is asserted, the GIC IRQ level is high.

>
> I further assume that 'multiplexed' means that the hardware is something
> like an MSI receiver on the CPU/chipset which handles multiple MSI
> messages and forwards them to a single shared interrupt line on the MIPS
> GIC. Right?
Yes.
>
> Can you please provide a pointer to the hardware documentation?
There's no official documentation for that.
>
> > It's configured with handle_level_irq() as the GIC is a level IRQ.
>
> Which is completely bonkers. MSI has edge semantics and sharing an
> interrupt line for edge type interrupts is broken by design, unless the
> hardware which handles the incoming MSIs and forwards them to the level
> type interrupt line is designed properly and the driver does the right
> thing.
Yes, the design of the HW is sort of broken. I concur.
>
> > The ack callback acks the GIC irq.  The mask/unmask callbacks call
> > pci_msi_mask_irq() / pci_msi_unmask_irq().
>
> What? How is that supposed to work with multiple MSIs?
Acking is per MSI vector as I described above, so it should work.
>
> Either the hardware is a trainwreck or the driver or both.
>
> I can't tell as I can't find my crystal ball. Maybe I should replace it
> with a Mobileye :)
:)
>
> Thanks,
>
>         tglx


* Re: MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs
  2020-01-16  7:58             ` Ramon Fried
@ 2020-01-16  9:32               ` Thomas Gleixner
  2020-01-17 13:32                 ` Ramon Fried
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2020-01-16  9:32 UTC (permalink / raw)
  To: Ramon Fried
  Cc: hkallweit1, Bjorn Helgaas, maz, lorenzo.pieralisi, linux-pci,
	linux-kernel

Ramon,

Ramon Fried <rfried.dev@gmail.com> writes:
> On Thu, Jan 16, 2020 at 3:39 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> Ramon Fried <rfried.dev@gmail.com> writes:
>
> Basically, 32 MSI vectors are represented by a single GIC irq.
> There's a status register in which every bit corresponds to an MSI vector,
> and each individual MSI needs to be acked in that register. Whenever any
> bit is asserted, the GIC IRQ level is high.

Which is not that bad. 

>> > It's configured with handle_level_irq() as the GIC is a level IRQ.
>>
>> Which is completely bonkers. MSI has edge semantics and sharing an
>> interrupt line for edge type interrupts is broken by design, unless the
>> hardware which handles the incoming MSIs and forwards them to the level
>> type interrupt line is designed properly and the driver does the right
>> thing.
>
> Yes, the design of the HW is sort of broken. I concur.

As you describe it, it's not that bad.

>> > The ack callback acks the GIC irq.  The mask/unmask callbacks call
>> > pci_msi_mask_irq() / pci_msi_unmask_irq().
>>
>> What? How is that supposed to work with multiple MSIs?
> Acking is per MSI vector as I described above, so it should work.

No. This is the wrong approach. Let's look at the hardware:

| GIC   line X |------| MSI block | <--- Messages from devices

The MSI block latches the incoming message up to the point where it is
acknowledged in the MSI block. This makes sure that the level semantics
of the GIC are met.

So from a software perspective you want to do something like this:

gic_irq_handler()
{
   mask_ack(gic_irqX);

   pending = read(msi_status);
   for_each_bit(bit, pending) {
       ack(msi_status, bit);  // This clears the latch in the MSI block
       handle_irq(irqof(bit));
   }
   unmask(gic_irqX);
}

And that works correctly without masking the MSI interrupt at
the PCI level for a threaded handler, simply because the PCI device will
not send another interrupt until the previous one has been handled by
the driver unless the PCI device is broken.
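
(In Linux terms, the mask_ack/unmask bracket above is what
chained_irq_enter()/chained_irq_exit() give you in a chained flow
handler; a sketch, with read_msi_status(), msi_ack_bit() and domain
standing in for your hardware/driver specifics:)

static void msi_demux_handler(struct irq_desc *desc)
{
        struct irq_chip *chip = irq_desc_get_chip(desc);
        unsigned long pending;
        u32 bit;

        chained_irq_enter(chip, desc);          /* mask_ack(gic_irqX) */
        pending = read_msi_status();
        for_each_set_bit(bit, &pending, 32) {
                msi_ack_bit(bit);               /* clear the latch first */
                generic_handle_irq(irq_find_mapping(domain, bit));
        }
        chained_irq_exit(chip, desc);           /* unmask(gic_irqX) */
}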

If that's the case, then you have to work around that at the device
driver level, not at the irq chip level, by installing a primary handler
which quiesces the device (not the MSI part).

Thanks,

        tglx


* Re: MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs
  2020-01-16  9:32               ` Thomas Gleixner
@ 2020-01-17 13:32                 ` Ramon Fried
  2020-01-17 14:38                   ` Thomas Gleixner
  0 siblings, 1 reply; 18+ messages in thread
From: Ramon Fried @ 2020-01-17 13:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: hkallweit1, Bjorn Helgaas, maz, lorenzo.pieralisi, linux-pci,
	linux-kernel

Thomas,
Thanks a lot for the detailed answer.

> > Basically, 32 MSI vectors are represented by a single GIC irq.
> > There's a status register in which every bit corresponds to an MSI vector,
> > and each individual MSI needs to be acked in that register. Whenever any
> > bit is asserted, the GIC IRQ level is high.
>
> Which is not that bad.
>
> >> > It's configured with handle_level_irq() as the GIC is a level IRQ.
> >>
> >> Which is completely bonkers. MSI has edge semantics and sharing an
> >> interrupt line for edge type interrupts is broken by design, unless the
> >> hardware which handles the incoming MSIs and forwards them to the level
> >> type interrupt line is designed properly and the driver does the right
> >> thing.
> >
> > Yes, the design of the HW is sort of broken. I concur.
>
> As you describe it, it's not that bad.
>
> >> > The ack callback acks the GIC irq.  The mask/unmask callbacks call
> >> > pci_msi_mask_irq() / pci_msi_unmask_irq().
> >>
> >> What? How is that supposed to work with multiple MSIs?
> > Acking is per MSI vector as I described above, so it should work.
>
> No. This is the wrong approach. Let's look at the hardware:
>
> | GIC   line X |------| MSI block | <--- Messages from devices
>
> The MSI block latches the incoming message up to the point where it is
> acknowledged in the MSI block. This makes sure that the level semantics
> of the GIC are met.
>
> So from a software perspective you want to do something like this:
>
> gic_irq_handler()
> {
>    mask_ack(gic_irqX);
>
>    pending = read(msi_status);
>    for_each_bit(bit, pending) {
>        ack(msi_status, bit);  // This clears the latch in the MSI block
>        handle_irq(irqof(bit));
>    }
>    unmask(gic_irqX);
> }
>
> And that works correctly without masking the MSI interrupt at
> the PCI level for a threaded handler, simply because the PCI device will
> not send another interrupt until the previous one has been handled by
> the driver unless the PCI device is broken.
I'm missing something here: doesn't this implementation block IRQs only during
the HW handler and not during the threaded handler? (Assuming that I selected
handle_level_irq() as the default handler.)

Actually, my current implementation is very similar to what
you just described:

static void eq_msi_isr(struct irq_desc *desc)
{
        struct irq_chip *chip = irq_desc_get_chip(desc);
        struct eq_msi *msi;
        u16 status;
        unsigned long bitmap;
        u32 bit;
        u32 virq;

        chained_irq_enter(chip, desc);
        msi = irq_desc_get_handler_data(desc);

        while ((status = readw(msi->gcsr_regs_base + LINK_GCSR5_OFFSET)
                & MSI_IRQ_REQ) != 0) {
                pr_debug("MSI: %x\n", status >> 12);

                bitmap = status >> 12;
                for_each_set_bit(bit, &bitmap, msi->num_of_vectors) {
                        virq = irq_find_mapping(msi->inner_domain, bit);
                        if (virq) {
                                generic_handle_irq(virq);
                        } else {
                                pr_err("unexpected MSI\n");
                                handle_bad_irq(desc);
                        }
                }
        }
        chained_irq_exit(chip, desc);
}

Additionally the domain allocation is defined like:
static int eq_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
                                 unsigned int nr_irqs, void *args)
{
        struct eq_msi *msi = domain->host_data;
        unsigned long bit;
        u32 mask;

        /* We only allow 32 MSI per device */
        WARN_ON(nr_irqs > 32);
        if (nr_irqs > 32)
                return -ENOSPC;

        bit = find_first_zero_bit(msi->used, msi->num_of_vectors);
        if (bit >= msi->num_of_vectors)
                return -ENOSPC;

        set_bit(bit, msi->used);

        mask = readw(msi->gcsr_regs_base + LINK_GCSR6_OFFSET);
        mask |= BIT(bit) << 12;
        writew(mask, msi->gcsr_regs_base + LINK_GCSR6_OFFSET);

        irq_domain_set_info(domain, virq, bit, &eq_msi_bottom_irq_chip,
                            domain->host_data, handle_level_irq,
                            NULL, NULL);

        pr_debug("Enabling MSI irq: %lu\n", bit);

        return 0;
}

>
> If that's the case, then you have to work around that at the device
> driver level, not at the irq chip level, by installing a primary handler
> which quiesces the device (not the MSI part).
>
> Thanks,
>
>         tglx


* Re: MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs
  2020-01-17 13:32                 ` Ramon Fried
@ 2020-01-17 14:38                   ` Thomas Gleixner
  2020-01-17 15:43                     ` Ramon Fried
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2020-01-17 14:38 UTC (permalink / raw)
  To: Ramon Fried
  Cc: hkallweit1, Bjorn Helgaas, maz, lorenzo.pieralisi, linux-pci,
	linux-kernel

Ramon,

Ramon Fried <rfried.dev@gmail.com> writes:
>> So from a software perspective you want to do something like this:
>>
>> gic_irq_handler()
>> {
>>    mask_ack(gic_irqX);
>>
>>    pending = read(msi_status);
>>    for_each_bit(bit, pending) {
>>        ack(msi_status, bit);  // This clears the latch in the MSI block
>>        handle_irq(irqof(bit));
>>    }
>>    unmask(gic_irqX);
>> }
>>
>> And that works correctly without masking the MSI interrupt at
>> the PCI level for a threaded handler, simply because the PCI device
>> will not send another interrupt until the previous one has been
>> handled by the driver unless the PCI device is broken.
>
> I'm missing something here: doesn't this implementation block IRQs only during
> the HW handler and not during the threaded handler? (Assuming that I selected
> handle_level_irq() as the default handler.)

handle_level_irq() is the proper handler for the actual GIC interrupt
which does the demultiplexing. The MSI interrupts want to have
handle_edge_irq().
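
I.e. in your domain alloc below, the per-MSI interrupts would be set up
with the edge flow handler; same irq_domain_set_info() call as in your
code, just with a different flow handler:

irq_domain_set_info(domain, virq, bit, &eq_msi_bottom_irq_chip,
                    domain->host_data, handle_edge_irq,
                    NULL, NULL);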

> Actually, my current implementation is very similar to what
> you just described:
>
> static void eq_msi_isr(struct irq_desc *desc)
> {
>         struct irq_chip *chip = irq_desc_get_chip(desc);
>         struct eq_msi *msi;
>         u16 status;
>         unsigned long bitmap;
>         u32 bit;
>         u32 virq;
>
>         chained_irq_enter(chip, desc);
>         msi = irq_desc_get_handler_data(desc);
>
>         while ((status = readw(msi->gcsr_regs_base + LINK_GCSR5_OFFSET)
>                 & MSI_IRQ_REQ) != 0) {
>                 pr_debug("MSI: %x\n", status >> 12);
>
>                 bitmap = status >> 12;
>                 for_each_set_bit(bit, &bitmap, msi->num_of_vectors) {
>                         virq = irq_find_mapping(msi->inner_domain, bit);
>                         if (virq) {
>                                 generic_handle_irq(virq);
>                         } else {
>                                 pr_err("unexpected MSI\n");
>                                 handle_bad_irq(desc);

Now if you look at the example I gave you there is a subtle difference:

>>    pending = read(msi_status);
>>    for_each_bit(bit, pending) {
>>        ack(msi_status, bit);  // This clears the latch in the MSI block
>>        handle_irq(irqof(bit));
>>    }

And this clearing is important when one of the MSI interrupts actually
has a threaded handler.

 MSI interrupt fires
  -> sets bit in msi_status
    -> MSI block raises GIC interrupt because msi_status != 0

 CPU handles GIC interrupt
   pending = read(msi_status);
   for_each_bit(bit, pending)
      handle_irq()
        primary_handler()
           -> WAKEUP_THREAD

   RETI, but msi_status is still != 0
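
So the demux loop wants to ack each vector _before_ handling it; a
sketch, assuming the status bits are write-one-to-clear (you have to
verify that against your hardware):

bitmap = status >> 12;
for_each_set_bit(bit, &bitmap, msi->num_of_vectors) {
        /* clear this vector's latch in the MSI block first */
        writew(BIT(bit) << 12, msi->gcsr_regs_base + LINK_GCSR5_OFFSET);
        virq = irq_find_mapping(msi->inner_domain, bit);
        if (virq)
                generic_handle_irq(virq);
}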

> Additionally the domain allocation is defined like:
> static int eq_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
>                                  unsigned int nr_irqs, void *args)
> {
>         struct eq_msi *msi = domain->host_data;
>         unsigned long bit;
>         u32 mask;
>
>         /* We only allow 32 MSI per device */
>         WARN_ON(nr_irqs > 32);
>         if (nr_irqs > 32)
>                 return -ENOSPC;
>
>         bit = find_first_zero_bit(msi->used, msi->num_of_vectors);
>         if (bit >= msi->num_of_vectors)
>                 return -ENOSPC;
>
>         set_bit(bit, msi->used);
>
>         mask = readw(msi->gcsr_regs_base + LINK_GCSR6_OFFSET);
>         mask |= BIT(bit) << 12;
>         writew(mask, msi->gcsr_regs_base + LINK_GCSR6_OFFSET);
>
>         irq_domain_set_info(domain, virq, bit, &eq_msi_bottom_irq_chip,
>                             domain->host_data, handle_level_irq,

This is wrong. MSI is edge type, not level and you are really mixing up
the concepts here.

The fact that the MSI block raises a level interrupt on the output side
has absolutely nothing to do with the type of the MSI interrupt itself.

MSI is edge type by definition and this does not change just because
there is a translation unit between the MSI interrupt and the CPU
controller.

The actual MSI interrupts do not even know about the existence of that
MSI block at all. They do not care, as all they need to know is a
message and an address. When an interrupt is raised in the device the
MSI chip associated to the device (PCI or something else) writes this
message to the address exactly ONCE. And this exactly ONCE defines the
edge nature of MSI.

A properly designed MSI device should not send another message before the
interrupt handler which is associated to the device has handled the
interrupt at the device level.

So you really have to understand that the GIC interrupt and the MSI
interrupts are two different entities. They just have a 'connection'
because the message/address which is handed to the MSI device triggers
that GIC interrupt via the MSI translation unit. But they are still
different and independent entities.

See?

Thanks,

        tglx

* Re: MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs
  2020-01-17 14:38                   ` Thomas Gleixner
@ 2020-01-17 15:43                     ` Ramon Fried
  2020-01-17 17:11                       ` Thomas Gleixner
  0 siblings, 1 reply; 18+ messages in thread
From: Ramon Fried @ 2020-01-17 15:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: hkallweit1, Bjorn Helgaas, maz, lorenzo.pieralisi, linux-pci,
	linux-kernel

On Fri, Jan 17, 2020 at 4:38 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Ramon,
>
> Ramon Fried <rfried.dev@gmail.com> writes:
> >> So from a software perspective you want to do something like this:
> >>
> >> gic_irq_handler()
> >> {
> >>    mask_ack(gic_irqX);
> >>
> >>    pending = read(msi_status);
> >>    for_each_bit(bit, pending) {
> >>        ack(msi_status, bit);  // This clears the latch in the MSI block
> >>        handle_irq(irqof(bit));
> >>    }
> >>    unmask(gic_irqX);
> >> }
> >>
> >> And that works correctly without masking the MSI interrupt at
> >> the PCI level for a threaded handler, simply because the PCI device
> >> will not send another interrupt until the previous one has been
> >> handled by the driver unless the PCI device is broken.
> >
> > I'm missing something here: doesn't this implementation block IRQs only during
> > the HW handler and not during the threaded handler? (Assuming that I selected
> > handle_level_irq() as the default handler.)
>
> handle_level_irq() is the proper handler for the actual GIC interrupt
> which does the demultiplexing. The MSI interrupts want to have
> handle_edge_irq().
>
> > Actually, my current implementation is very similar to what
> > you just described:
> >
> > static void eq_msi_isr(struct irq_desc *desc)
> > {
> >         struct irq_chip *chip = irq_desc_get_chip(desc);
> >         struct eq_msi *msi;
> >         u16 status;
> >         unsigned long bitmap;
> >         u32 bit;
> >         u32 virq;
> >
> >         chained_irq_enter(chip, desc);
> >         msi = irq_desc_get_handler_data(desc);
> >
> >         while ((status = readw(msi->gcsr_regs_base + LINK_GCSR5_OFFSET)
> >                 & MSI_IRQ_REQ) != 0) {
> >                 pr_debug("MSI: %x\n", status >> 12);
> >
> >                 bitmap = status >> 12;
> >                 for_each_set_bit(bit, &bitmap, msi->num_of_vectors) {
> >                         virq = irq_find_mapping(msi->inner_domain, bit);
> >                         if (virq) {
> >                                 generic_handle_irq(virq);
> >                         } else {
> >                                 pr_err("unexpected MSI\n");
> >                                 handle_bad_irq(desc);
>
> Now if you look at the example I gave you there is a subtle difference:
>
> >>    pending = read(msi_status);
> >>    for_each_bit(bit, pending) {
> >>        ack(msi_status, bit);  // This clears the latch in the MSI block
> >>        handle_irq(irqof(bit));
> >>    }
>
> And this clearing is important when one of the MSI interrupts actually
> has a threaded handler.
>
>  MSI interrupt fires
>   -> sets bit in msi_status
>     -> MSI block raises GIC interrupt because msi_status != 0
>
>  CPU handles GIC interrupt
>    pending = read(msi_status);
>    for_each_bit(bit, pending)
>       handle_irq()
>         primary_handler()
>            -> WAKEUP_THREAD
>
>    RETI, but msi_status is still != 0
>
> > Additionally the domain allocation is defined like:
> > static int eq_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
> >                                  unsigned int nr_irqs, void *args)
> > {
> >         struct eq_msi *msi = domain->host_data;
> >         unsigned long bit;
> >         u32 mask;
> >
> >         /* We only allow 32 MSI per device */
> >         WARN_ON(nr_irqs > 32);
> >         if (nr_irqs > 32)
> >                 return -ENOSPC;
> >
> >         bit = find_first_zero_bit(msi->used, msi->num_of_vectors);
> >         if (bit >= msi->num_of_vectors)
> >                 return -ENOSPC;
> >
> >         set_bit(bit, msi->used);
> >
> >         mask = readw(msi->gcsr_regs_base + LINK_GCSR6_OFFSET);
> >         mask |= BIT(bit) << 12;
> >         writew(mask, msi->gcsr_regs_base + LINK_GCSR6_OFFSET);
> >
> >         irq_domain_set_info(domain, virq, bit, &eq_msi_bottom_irq_chip,
> >                             domain->host_data, handle_level_irq,
>
> This is wrong. MSI is edge type, not level and you are really mixing up
> the concepts here.
>
> The fact that the MSI block raises a level interrupt on the output side
> has absolutely nothing to do with the type of the MSI interrupt itself.
>
> MSI is edge type by definition and this does not change just because
> there is a translation unit between the MSI interrupt and the CPU
> controller.
>
>> The actual MSI interrupts do not even know about the existence of that
> MSI block at all. They do not care, as all they need to know is a
> message and an address. When an interrupt is raised in the device the
> MSI chip associated to the device (PCI or something else) writes this
> message to the address exactly ONCE. And this exactly ONCE defines the
> edge nature of MSI.
OK, now I understand my mistake. Thanks.
>
> A properly designed MSI device should not send another message before the
> interrupt handler which is associated to the device has handled the
> interrupt at the device level.
By "MSI device" you mean the MSI controller in the SOC or the endpoint
that sends the MSI?
>
> So you really have to understand that the GIC interrupt and the MSI
> interrupts are two different entities. They just have a 'connection'
> because the message/address which is handed to the MSI device triggers
> that GIC interrupt via the MSI translation unit. But they are still
> different and independent entities.
>
> See?
>
> Thanks,
>
>         tglx
>
>
>
>
>
>


* Re: MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs
  2020-01-17 15:43                     ` Ramon Fried
@ 2020-01-17 17:11                       ` Thomas Gleixner
  2020-01-17 20:01                         ` Ramon Fried
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2020-01-17 17:11 UTC (permalink / raw)
  To: Ramon Fried
  Cc: hkallweit1, Bjorn Helgaas, maz, lorenzo.pieralisi, linux-pci,
	linux-kernel

Ramon,

Ramon Fried <rfried.dev@gmail.com> writes:
> On Fri, Jan 17, 2020 at 4:38 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> This is wrong. MSI is edge type, not level and you are really mixing up
>> the concepts here.
>>
>> The fact that the MSI block raises a level interrupt on the output side
>> has absolutely nothing to do with the type of the MSI interrupt itself.
>>
>> MSI is edge type by definition and this does not change just because
>> there is a translation unit between the MSI interrupt and the CPU
>> controller.
>>
> >> The actual MSI interrupts do not even know about the existence of that
>> MSI block at all. They do not care, as all they need to know is a
>> message and an address. When an interrupt is raised in the device the
>> MSI chip associated to the device (PCI or something else) writes this
>> message to the address exactly ONCE. And this exactly ONCE defines the
>> edge nature of MSI.
>
> OK, now I understand my mistake. Thanks.

:)

>> A properly designed MSI device should not send another message before the
>> interrupt handler which is associated to the device has handled the
>> interrupt at the device level.
>
> By "MSI device" you mean the MSI controller in the SOC or the endpoint
> that sends the MSI?

The device which incorporates the MSI endpoint. 

Thanks,

        tglx


* Re: MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs
  2020-01-17 17:11                       ` Thomas Gleixner
@ 2020-01-17 20:01                         ` Ramon Fried
  2020-01-17 22:47                           ` Thomas Gleixner
  0 siblings, 1 reply; 18+ messages in thread
From: Ramon Fried @ 2020-01-17 20:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: hkallweit1, Bjorn Helgaas, maz, lorenzo.pieralisi, linux-pci,
	linux-kernel

On Fri, Jan 17, 2020 at 7:11 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Ramon,
>
> Ramon Fried <rfried.dev@gmail.com> writes:
> > On Fri, Jan 17, 2020 at 4:38 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> This is wrong. MSI is edge type, not level and you are really mixing up
> >> the concepts here.
> >>
> >> The fact that the MSI block raises a level interrupt on the output side
> >> has absolutely nothing to do with the type of the MSI interrupt itself.
> >>
> >> MSI is edge type by definition and this does not change just because
> >> there is a translation unit between the MSI interrupt and the CPU
> >> controller.
> >>
> >> The actual MSI interrupts do not even know about the existence of that
> >> MSI block at all. They do not care, as all they need to know is a
> >> message and an address. When an interrupt is raised in the device the
> >> MSI chip associated to the device (PCI or something else) writes this
> >> message to the address exactly ONCE. And this exactly ONCE defines the
> >> edge nature of MSI.
> >
> > OK, now I understand my mistake. Thanks.
>
> :)
>
> >> A properly designed MSI device should not send another message before the
> >> interrupt handler which is associated to the device has handled the
> >> interrupt at the device level.
> >
> > By "MSI device" you mean the MSI controller in the SOC or the endpoint
> > that sends the MSI?
>
> The device which incorporates the MSI endpoint.
This is not how the MSI specs describe it, so I'm confused.
According to the spec, MSI is just an ordinary posted PCIe TLP to a certain
memory address on the root complex.
The only information it has about whether to send an MSI or not is the
masked/pending register in the config space.
So, basically, back to my original question, without tinkering with
these bits, the device will always send the MSIs,
it's just that they will be masked on the MSI controller on the host, right?

Thanks,
Ramon.
>
> Thanks,
>
>         tglx


* Re: MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs
  2020-01-17 20:01                         ` Ramon Fried
@ 2020-01-17 22:47                           ` Thomas Gleixner
  2020-01-20  8:02                             ` Ramon Fried
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2020-01-17 22:47 UTC (permalink / raw)
  To: Ramon Fried
  Cc: hkallweit1, Bjorn Helgaas, maz, lorenzo.pieralisi, linux-pci,
	linux-kernel

Ramon,

Ramon Fried <rfried.dev@gmail.com> writes:
> On Fri, Jan 17, 2020 at 7:11 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> The device which incorporates the MSI endpoint.
>
> This is not how the MSI specs describe it, so I'm confused.
> According to the spec, MSI is just an ordinary posted PCIe TLP to a certain
> memory address on the root complex.

That's the message transport itself.

> The only information it has about whether to send an MSI or not is the
> masked/pending register in the config space.

Correct.

> So, basically, back to my original question, without tinkering with
> these bits, the device will always send the MSIs,

What do you mean by 'always send'?

It will send ONE message every time the device IP block raises an
interrupt as long as it's not masked in the config space.

> it's just that they will be masked on the MSI controller on the
> host, right?

No. If you mask them on the host then you can lose interrupts.

Let's take an example. Network card.

   Incoming packet
   network controller raises interrupt
   MSI endpoint sends message
   Message raises interrupt in CPU

   interrupt is serviced
      handle_edge_irq()
        acknowledge interrupt at the CPU level
        call_driver_interrupt_handler()
           fiddle_with_device()
   return from interrupt

So now if you use a threaded handler in that driver (or use force
threading) then it looks like this:

   Incoming packet
   network controller raises interrupt
   MSI endpoint sends message
   Message raises interrupt in CPU

   interrupt is serviced
      handle_edge_irq()
      acknowledge interrupt at the CPU level
      call_primary_interrupt_handler()
          wake_irq_thread()
  return from interrupt

  run_irq_thread()
      call_driver_interrupt_handler()
        fiddle_with_device()
      wait_for_next_irq();

In both cases the network controller can raise another interrupt
_before_ the initial one has been fully handled and of course the MSI
endpoint will send a new message which triggers the pending logic in the
edge handler or in case of a threaded handler kicks the thread to run
another round.

Now you might think that if there are tons of incoming packets then the
network controller will raise tons of interrupts before the interrupt
handler completes. That would be outright stupid. So what the network
controller (assumed it is sanely designed) does is:

  packet arrives
  if (!raised_marker) {
     raise_interrupt;
     set raised_marker;
  }   

So now the interrupt handler comes around to talk to the device and the
processing clears the raised_marker at some point. Either by software or
automatically when the queue is empty.

If you translate that into a electrical diagram:

Packet    1    2      3        4

            ________     _____   _ 
NetC-Int  _|        |___|     |_| |_____

MSI-EP     M            M       M          M = Message

CPU INT     |            |       |

Driver        _______      _________
handler  ____|       |____|         |______

If you look at packet #4 then you notice that the interrupt for this
packet is raised and the message is sent _before_ the handler finishes.

And that's where we need to look at interrupt masking.

1) Masking at the MSI endpoint (PCI configspace)

   This is slow and depending on the PCI host this might require
   to take global locks, which is even worse if you have multi queue
   devices firing all at the same time.

   So, no this is horrible and it's also not required.

2) Masking at the host interrupt controller

   Depending on the implementation of the controller masking can cause
   interrupt loss. In the above case the message for packet #4 could
   be dropped by the controller. And yes, there are interrupt
   controllers out there which have exactly this problem.

   That's why the edge handler does not mask the interrupt in the first
   place.

So now you can claim that your MSI host controller does not have that
problem. Fine, then you could do masking at the host controller level,
but what does that buy you? Let's look at the picture again:

Packet    1    2      3        4

            ________     _____   ____
NetC-Int  _|        |___|     |_|    |__

MSI-EP     M            M       M             M = Message

CPU INT     |            |         |
Driver       _________    ________   __
handler  ____M       U____M       U_M  U____  M = Mask, U = Unmask

You unmask just to get the next interrupt so you mask/handle/unmask
again. That's actually slower because you get the overhead of unmask,
which raises the next interrupt in the CPU (it's already latched in the
MSI translator) and then yet another mask/unmask pair.  No matter what,
you'll lose.

And if you take a look at network drivers, then you find quite a few of
them which do only one thing in their interrupt service routine:

     napi_schedule();

That's raising the NAPI softirq and nothing else. They don't even touch
the device at all and delegate all the processing to softirq
context. They rely on the sanity of the network controller not to send
gazillions of interrupts before the pending stuff has been handled.
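
E.g. such an interrupt service routine is nothing more than this (a
sketch; the struct names are placeholders):

static irqreturn_t my_net_isr(int irq, void *dev_id)
{
        struct my_priv *priv = dev_id;

        /* kick the softirq; all real work happens in NAPI context */
        napi_schedule(&priv->napi);
        return IRQ_HANDLED;
}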

That's not any different than interrupt threading. It's exactly the same
except that the handling runs in softirq context and not in a dedicated
interrupt thread.

So if you observe that your PCI device sends gazillions of interrupts
before the pending ones are handled, then you might talk to the people
who created that beast, or you need to do what some of the network
controllers do:

  hard_interrupt_handler()
    tell_device_to_shutup();
    napi_schedule();

and then something in the NAPI handling tells the device that it can
send interrupts again.

You can do exactly the same thing with interrupt threading. Register a
primary handler and a threaded handler and let the primary handler do:

  hard_interrupt_handler()
    tell_device_to_shutup();
    return IRQ_WAKE_THREAD;
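
Wired up with request_threaded_irq(); a sketch, where
tell_device_to_shutup() stands in for whatever device-specific quiescing
your hardware needs:

static irqreturn_t my_primary_handler(int irq, void *dev_id)
{
        struct my_dev *md = dev_id;

        tell_device_to_shutup(md);      /* stop further device irqs */
        return IRQ_WAKE_THREAD;
}

static irqreturn_t my_thread_handler(int irq, void *dev_id)
{
        /* do the real work, then let the device interrupt again */
        return IRQ_HANDLED;
}

ret = request_threaded_irq(irq, my_primary_handler, my_thread_handler,
                           0, "my-dev", md);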

Coming back to your mask/unmask thing. That has another downside which
is layering violation and software complexity.

MSI interrupts are edge type by specification:

  "MSI and MSI-X are edge-triggered interrupt mechanisms; neither the
   PCI Local Bus Specification nor this specification support
   level-triggered MSI/MSI-X interrupts."

The whole point of edge-triggered interrupts is that they are just a
momentary notification which means that they can avoid the whole
mask/unmask dance and other issues. There are some limitations to edge
type interrupts:

  - Cannot be shared, which is a good thing. Shared interrupts are
    a pain in all aspects

  - Can be lost if the momentary notification does not reach the
    receiver. For actual electrical edge type interrupts this happens
    when the active state is too short so that the edge detection
    on the receiver side fails to detect it.

    For MSI this is usually not a problem. If the message gets lost on
    the bus then you have other worries than the lost interrupt.

    But for both electrical and message based the interrupt receiver on
    the host/CPU side can be a problem when masking is in play. There
    are quite some broken controllers out there which have that issue
    and it's not trivial to get it right especially with message based
    interrupts due to the async nature of the involved parts.

That's one thing, but now let's look at the layering.

Your MSI host side IP is not an interrupt controller. It is a bridge
which translates incoming MSI messages and multiplexes them to a level
interrupt on the GIC. It provides a status register which allows you to
demultiplex the pending interrupts so you don't have to poll all
registered handlers to figure out which device actually fired an
interrupt. Additionally it allows masking, but that's an implementation
detail and you really should just ignore it except for startup/shutdown.

From the kernel's interrupt system POV the MSI host side controller is
just a bridge between MSI and GIC.

That's clearly reflected in the irq hierarchy:

|-------------|
|             |
| GIC         |
|             |
|-------------|

|-------------|         |----------|
|             |         |          |
| MSI bridge  |---------| PCI/MSI  |
|             |         |          |
|-------------|         |----------|

The GIC and the MSI bridge are independent components. The fact that the
MSI bridge has an interrupt output which is connected to the GIC does
not create an hierarchy. From the GIC point of view the MSI bridge is
just like any other peripheral which is connected to one of its input
lines.

But the PCI/MSI domain has a hierarchical parent, the MSI Bridge. The
reason why this relationship exists is that the PCI/MSI domain needs a
way to allocate a message/address for interrupt delivery. And that
information is provided by the MSI bridge domain.
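
(Concretely, providing that message/address is the .irq_compose_msi_msg()
callback of the bridge domain's irq chip; a sketch with a hypothetical
doorbell_phys address:)

static void eq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
{
        /* hand the device the doorbell address plus its vector number */
        msg->address_lo = lower_32_bits(doorbell_phys);
        msg->address_hi = upper_32_bits(doorbell_phys);
        msg->data       = data->hwirq;
}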

In an interrupt hierarchy the type of the interrupt (edge/level) and the
required handler is determined by the outmost domain, in this case the
PCI/MSI domain. This domain mandates edge type and the edge handler.

And that outermost domain is the primary interrupt chip which is
involved when the core code manages and handles interrupts. So
mask/unmask happens at the pci_msi interrupt chip which fiddles with the
MSI config space. The outermost device can call down into the hierarchy
to let the underlying domain take further action or delegate certain
actions completely to the underlying domain, but that delegation is
pretty much restricted. One example of delegation is the irq_ack()
action. The ack usually has to hit the underlying domain, as on the MSI
endpoint there is no such thing. If the underlying domain does not need
that then the irq_ack() routine in the underlying domain is just empty
or not implemented. But you cannot delegate mask/unmask and other
fundamental actions because they must happen on the MSI endpoint no
matter what.
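
As a sketch, such an outermost irq_chip could look like this.
pci_msi_mask_irq(), pci_msi_unmask_irq() and irq_chip_ack_parent()
are real generic kernel helpers; the struct itself is illustrative:

    static struct irq_chip my_pci_msi_chip = {
            .name       = "PCI-MSI",
            /* mask/unmask must hit the MSI endpoint (config space) */
            .irq_mask   = pci_msi_mask_irq,
            .irq_unmask = pci_msi_unmask_irq,
            /* ack is delegated down the hierarchy to the bridge domain */
            .irq_ack    = irq_chip_ack_parent,
    };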

You cannot create artificial level semantics on the PCI/MSI side, and
you cannot artificially connect your demultiplexing handler to the
threaded handler of the PCI interrupt, without violating all basic
rules of engineering and common sense at once.

Let me show you the picture from above expanded with your situation:

Packet    1    2      3        4

            ________     _____   _ 
NetC-Int  _|        |___|     |_| |_____

MSI-EP     M            M       M               M = Message

             _            _      _
Bridge    __| |__________| |____| |_______

             _            _      _
GIC input __| |__________| |____| |_______

CPU INT     |            |       |

Demux         _            _      _
handler    __A |__________A |____A |_______     A == Acknowledge in the bridge

Thread         _______      _________
handler   ____|       |____|         |______

Hope that helps and clarifies it.

Thanks,

        tglx

* Re: MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs
  2020-01-17 22:47                           ` Thomas Gleixner
@ 2020-01-20  8:02                             ` Ramon Fried
  0 siblings, 0 replies; 18+ messages in thread
From: Ramon Fried @ 2020-01-20  8:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: hkallweit1, Bjorn Helgaas, maz, lorenzo.pieralisi, linux-pci,
	linux-kernel

On Sat, Jan 18, 2020 at 12:47 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Ramon,
>
> Ramon Fried <rfried.dev@gmail.com> writes:
> > On Fri, Jan 17, 2020 at 7:11 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> The device which incorporates the MSI endpoint.
> >
> > This is not how the MSI specs describe it, so I'm confused.
> > According to the spec, MSI is just an ordinary posted PCIe TLP to a
> > certain memory address on the root complex.
>
> That's the message transport itself.
>
> > The only information it has on whether to send an MSI or not is the
> > masked/pending register in the config space.
>
> Correct.
>
> > So, basically, back to my original question, without tinkering with
> > these bits, the device will always send the MSIs,
>
> What do you mean by 'always send'?
>
> It will send ONE message every time the device IP block raises an
> interrupt as long as it's not masked in the config space.
>
> > it's just that they will be masked on the MSI controller on the
> > host. right ?
>
> No. If you mask them on the host then you can lose interrupts.
>
> Let's take an example: a network card.
>
>    Incoming packet
>    network controller raises interrupt
>    MSI endpoint sends message
>    Message raises interrupt in CPU
>
>    interrupt is serviced
>       handle_edge_irq()
>         acknowledge interrupt at the CPU level
>         call_driver_interrupt_handler()
>            fiddle_with_device()
>    return from interrupt
>
> So now if you use a threaded handler in that driver (or force
> threading is in effect), then it looks like this:
>
>    Incoming packet
>    network controller raises interrupt
>    MSI endpoint sends message
>    Message raises interrupt in CPU
>
>    interrupt is serviced
>       handle_edge_irq()
>       acknowledge interrupt at the CPU level
>       call_primary_interrupt_handler()
>           wake_irq_thread()
>   return from interrupt
>
>   run_irq_thread()
>       call_driver_interrupt_handler()
>         fiddle_with_device()
>       wait_for_next_irq();
>
> In both cases the network controller can raise another interrupt
> _before_ the initial one has been fully handled, and of course the MSI
> endpoint will send a new message which triggers the pending logic in the
> edge handler or in case of a threaded handler kicks the thread to run
> another round.
>
> Now you might think that if there are tons of incoming packets then the
> network controller will raise tons of interrupts before the interrupt
> handler completes. That would be outright stupid. So what the network
> controller (assumed it is sanely designed) does is:
>
>   packet arrives
>   if (!raised_marker) {
>      raise_interrupt;
>      set raised_marker;
>   }
>
> So now the interrupt handler comes around to talk to the device, and
> the processing clears the raised_marker at some point, either by
> software or automatically when the queue is empty.
>
> If you translate that into an electrical diagram:
>
> Packet    1    2      3        4
>
>             ________     _____   _
> NetC-Int  _|        |___|     |_| |_____
>
> MSI-EP     M            M       M          M = Message
>
> CPU INT     |            |       |
>
> Driver        _______      _________
> handler  ____|       |____|         |______
>
> If you look at packet #4 then you notice that the interrupt for this
> packet is raised and the message is sent _before_ the handler finishes.
>
> And that's where we need to look at interrupt masking.
>
> 1) Masking at the MSI endpoint (PCI configspace)
>
>    This is slow, and depending on the PCI host this might require
>    taking global locks, which is even worse if you have multi-queue
>    devices firing all at the same time.
>
>    So, no this is horrible and it's also not required.
>
> 2) Masking at the host interrupt controller
>
>    Depending on the implementation of the controller, masking can cause
>    interrupt loss. In the above case the message for packet #4 could
>    be dropped by the controller. And yes, there are interrupt
>    controllers out there which have exactly this problem.
>
>    That's why the edge handler does not mask the interrupt in the first
>    place.
>
> So now you can claim that your MSI host controller does not have that
> problem. Fine, then you could do masking at the host controller level,
> but what does that buy you? Let's look at the picture again:
>
> Packet    1    2      3        4
>
>             ________     _____   ____
> NetC-Int  _|        |___|     |_|    |__
>
> MSI-EP     M            M       M             M = Message
>
> CPU INT     |            |         |
> Driver       _________    ________   __
> handler  ____M       U____M       U_M  U____  M = Mask, U = Unmask
>
> You unmask just to get the next interrupt, so you mask/handle/unmask
> again. That's actually slower, because you get the overhead of unmask,
> which raises the next interrupt in the CPU (it's already latched in the
> MSI translator) and then yet another mask/unmask pair.  No matter what,
> you'll lose.
>
> And if you take a look at network drivers, then you find quite a few
> of them which do only one thing in their interrupt service routine:
>
>      napi_schedule();
>
> That's raising the NAPI softirq and nothing else. They don't even
> touch the device at all and delegate all the processing to softirq
> context. They rely on the sanity of the network controller not to send
> gazillions of interrupts before the pending stuff has been handled.
>
> That's no different than interrupt threading. It's exactly the same,
> except that the handling runs in softirq context and not in a
> dedicated interrupt thread.
>
> So if you observe that your PCI device sends gazillions of interrupts
> before the pending ones are handled, then you might talk to the people
> who created that beast, or you need to do what some of the network
> controllers do:
>
>   hard_interrupt_handler()
>     tell_device_to_shutup();
>     napi_schedule();
>
> and then something in the NAPI handling tells the device that it can
> send interrupts again.
>
> You can do exactly the same thing with interrupt threading. Register a
> primary handler and a threaded handler and let the primary handler do:
>
>   hard_interrupt_handler()
>     tell_device_to_shutup();
>     return IRQ_WAKE_THREAD;
>
> Coming back to your mask/unmask thing: that has another downside,
> which is layering violation and software complexity.
>
> MSI interrupts are edge type by specification:
>
>   "MSI and MSI-X are edge-triggered interrupt mechanisms; neither the
>    PCI Local Bus Specification nor this specification support
>    level-triggered MSI/MSI-X interrupts."
>
> The whole point of edge-triggered interrupts is that they are just
> momentary notifications, which means they can avoid the whole
> mask/unmask dance and the issues that come with it. There are some
> limitations to edge type interrupts:
>
>   - Cannot be shared, which is a good thing. Shared interrupts are
>     a pain in every respect
>
>   - Can be lost if the momentary notification does not reach the
>     receiver. For actual electrical edge type interrupts this happens
>     when the active state is too short for the edge detection on the
>     receiver side to catch it.
>
>     For MSI this is usually not a problem. If the message gets lost on
>     the bus then you have other worries than the lost interrupt.
>
>     But for both electrical and message based interrupts, the receiver
>     on the host/CPU side can be a problem when masking is in play.
>     There are quite a few broken controllers out there which have that
>     issue, and it's not trivial to get right, especially with message
>     based interrupts, due to the async nature of the parts involved.
>
> That's one thing, but now let's look at the layering.
>
> Your MSI host side IP is not an interrupt controller. It is a bridge
> which translates incoming MSI messages and multiplexes them to a level
> interrupt on the GIC. It provides a status register which allows you to
> demultiplex the pending interrupts so you don't have to poll all
> registered handlers to figure out which device actually fired an
> interrupt. Additionally it allows masking, but that's an implementation
> detail and you really should just ignore it except for startup/shutdown.
>
> From the kernel's interrupt system POV, the MSI host side controller is
> just a bridge between MSI and the GIC.
>
> That's clearly reflected in the irq hierarchy:
>
> |-------------|
> |             |
> | GIC         |
> |             |
> |-------------|
>
> |-------------|         |----------|
> |             |         |          |
> | MSI bridge  |---------| PCI/MSI  |
> |             |         |          |
> |-------------|         |----------|
>
> The GIC and the MSI bridge are independent components. The fact that the
> MSI bridge has an interrupt output which is connected to the GIC does
> not create a hierarchy. From the GIC's point of view the MSI bridge is
> just like any other peripheral which is connected to one of its input
> lines.
>
> But the PCI/MSI domain has a hierarchical parent, the MSI bridge. The
> reason this relationship exists is that the PCI/MSI domain needs a way
> to allocate a message (address/data) for interrupt delivery, and that
> information is provided by the MSI bridge domain.
>
> In an interrupt hierarchy, the type of the interrupt (edge/level) and
> the required flow handler are determined by the outermost domain, in
> this case the PCI/MSI domain. This domain mandates the edge type and
> the edge handler.
>
> And that outermost domain is the primary interrupt chip which is
> involved when the core code manages and handles interrupts. So
> mask/unmask happens at the pci_msi interrupt chip, which fiddles with
> the MSI config space. The outermost domain can call down into the
> hierarchy to let the underlying domain take further action, or
> delegate certain actions completely to the underlying domain, but
> that delegation is pretty much restricted. One example of delegation
> is the irq_ack() action. The ack usually has to hit the underlying
> domain, as there is no such thing on the MSI endpoint. If the
> underlying domain does not need it, then its irq_ack() routine is
> simply empty or not implemented. But you cannot delegate mask/unmask
> and other fundamental actions, because they must happen on the MSI
> endpoint no matter what.
>
> You cannot create artificial level semantics on the PCI/MSI side, and
> you cannot artificially connect your demultiplexing handler to the
> threaded handler of the PCI interrupt, without violating all basic
> rules of engineering and common sense at once.
>
> Let me show you the picture from above expanded with your situation:
>
> Packet    1    2      3        4
>
>             ________     _____   _
> NetC-Int  _|        |___|     |_| |_____
>
> MSI-EP     M            M       M               M = Message
>
>              _            _      _
> Bridge    __| |__________| |____| |_______
>
>              _            _      _
> GIC input __| |__________| |____| |_______
>
> CPU INT     |            |       |
>
> Demux         _            _      _
> handler    __A |__________A |____A |_______     A == Acknowledge in the bridge
>
> Thread         _______      _________
> handler   ____|       |____|         |______
>
> Hope that helps and clarifies it.
>
> Thanks,
>
>         tglx
Wow Thomas, this is an amazing answer. I need to go over it a few
times to make sure I understand everything.
I wish there was a way to pin this somewhere so it doesn't get lost in
the mailing list archive; I think it's a very nice explanation that
deserves its own wiki page or something.

Thanks,
Ramon.

Thread overview: 18+ messages
2019-12-09 10:27 MSI irqchip configured as IRQCHIP_ONESHOT_SAFE causes spurious IRQs Ramon Fried
2020-01-02  8:19 ` Ramon Fried
2020-01-13 14:59 ` Ramon Fried
2020-01-14 12:15 ` Thomas Gleixner
2020-01-14 21:38   ` Ramon Fried
2020-01-14 21:40     ` Ramon Fried
2020-01-14 22:54       ` Thomas Gleixner
2020-01-16  0:19         ` Ramon Fried
2020-01-16  1:39           ` Thomas Gleixner
2020-01-16  7:58             ` Ramon Fried
2020-01-16  9:32               ` Thomas Gleixner
2020-01-17 13:32                 ` Ramon Fried
2020-01-17 14:38                   ` Thomas Gleixner
2020-01-17 15:43                     ` Ramon Fried
2020-01-17 17:11                       ` Thomas Gleixner
2020-01-17 20:01                         ` Ramon Fried
2020-01-17 22:47                           ` Thomas Gleixner
2020-01-20  8:02                             ` Ramon Fried
