From: "Gowans, James" <jgowans@amazon.com>
To: "maz@kernel.org" <maz@kernel.org>
Cc: "tglx@linutronix.de" <tglx@linutronix.de>,
	"Raslan, KarimAllah" <karahmed@amazon.com>,
	"Woodhouse, David" <dwmw@amazon.co.uk>,
	"zouyipeng@huawei.com" <zouyipeng@huawei.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Sironi, Filippo" <sironi@amazon.de>,
	"chris.zjh@huawei.com" <chris.zjh@huawei.com>
Subject: Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
Date: Tue, 18 Apr 2023 10:56:07 +0000
Message-ID: <7fdfb01590d8e502f384aa0bb0dc9c614caa5dfc.camel@amazon.com>
In-Reply-To: <86pm89kyyt.wl-maz@kernel.org>

On Wed, 2023-04-12 at 14:32 +0100, Marc Zyngier wrote:
> 
> > > > 1. Do we need to mask the IRQ and then unmask it later? I don't think so
> > > > but it's not entirely clear why handle_edge_irq does this anyway; it's
> > > > an edge IRQ so not sure why it needs to be masked.
> > > 
> > > Please measure that cost and weep, specially in the context of
> > > multiple concurrent interrupts serviced by a single ITS (cost of
> > > locking the command queue, of waiting for a full round trip to the ITS
> > > for a couple of commands...).
> > 
Fortunately this mask/unmasking would only happen in the rare(ish) case of the
> > race condition described here being hit. Exactly the same as
> > with handle_edge_irq(), the masking and later unmasking would only be done
> > when irq_may_run() == false due to the race being hit. Considering that this is
> > a rare occurrence, I think we could stomach the occasional overhead? I was more
> > asking if it's actually *necessary* to do this masking/unmasking. I'm not sure
> > it's necessary anyway, hence it wasn't implemented in my patch.
> 
> But does it solve anything? At the point where you mask the interrupt,
> you already have consumed it. You'd still need to make it pending
> somehow, which is what your patch somehow does.

I don't really know - the reason I asked the question is that the related
handle_edge_irq() does this mask/unmasking, and I wasn't quite sure why it
did that and hence if we needed to do something similar.
Anyway, let's focus on your patch instead - I think it's more compelling.
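For reference, the handle_edge_irq() path I had in mind is the irq_may_run()
failure case, which (paraphrasing kernel/irq/chip.c from memory, most of the
function elided) looks roughly like:

void handle_edge_irq(struct irq_desc *desc)
{
	raw_spin_lock(&desc->lock);
	...
	if (!irq_may_run(desc)) {
		/* Record the edge, then mask+ack; the in-progress
		 * invocation unmasks and loops while IRQS_PENDING. */
		desc->istate |= IRQS_PENDING;
		mask_ack_irq(desc);
		goto out_unlock;
	}
	...
}

My question was really whether the mask_ack_irq() there is load-bearing for
correctness, or whether setting IRQS_PENDING alone would have sufficed.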


> > Yes. This bothered me too initially, but on reflection I'm not sure it's
> > actually a problem. One possible issue that came to mind was around CPU
> > offlining, but in the event that a CPU being offlined was running interrupt
> > handlers it wouldn't be able to complete the offline anyway until the handlers
> > were finished, so I don't think this is an issue. Do you see any practical issue
> > with running the handler once more on the original CPU immediately after the
> > affinity has been changed?
> 
> My take on this is that we put the pressure on the CPU we want to move
> away from. I'd rather we put it on the GIC itself, and use its
> Turing-complete powers to force it to redeliver the interrupt at a
> more convenient time.

This idea and implementation looks and works great! It may need a few
tweaks; discussing below.

> 
> From c96d2ab37fe273724f1264fba5f4913259875d56 Mon Sep 17 00:00:00 2001
> From: Marc Zyngier <maz@kernel.org>
> Date: Mon, 10 Apr 2023 10:56:32 +0100
> Subject: [PATCH] irqchip/gicv3-its: Force resend of LPIs taken while already in-progress

Perhaps you can pillage some of my commit message to explain the race here
when you send this patch?
> 
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> 
> diff --git a/include/linux/irq.h b/include/linux/irq.h
> index b1b28affb32a..4b2a7cc96eb2 100644
> --- a/include/linux/irq.h
> +++ b/include/linux/irq.h
> @@ -223,6 +223,8 @@ struct irq_data {
>   *                               irq_chip::irq_set_affinity() when deactivated.
>   * IRQD_IRQ_ENABLED_ON_SUSPEND - Interrupt is enabled on suspend by irq pm if
>   *                               irqchip have flag IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND set.
> + * IRQD_RESEND_WHEN_IN_PROGRESS - Interrupt may fire when already in progress,
> + *                               needs resending.
>   */
>  enum {
>         IRQD_TRIGGER_MASK               = 0xf,
> @@ -249,6 +251,7 @@ enum {
>         IRQD_HANDLE_ENFORCE_IRQCTX      = (1 << 28),
>         IRQD_AFFINITY_ON_ACTIVATE       = (1 << 29),
>         IRQD_IRQ_ENABLED_ON_SUSPEND     = (1 << 30),
> +       IRQD_RESEND_WHEN_IN_PROGRESS    = (1 << 31),
>  };

Do we really want a new flag here? I'd be keen to fix this race for all
drivers, not just those that know to set this flag. I think the patch
you're suggesting is pretty close to being safe to enable generally? If so,
my preference is for one less config option - just run it always.
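
(As an aside: irqd_needs_resend_when_in_progress(), used in the chip.c hunk
below, isn't defined in the quoted hunks. I assume the full patch adds the
usual accessor pair in include/linux/irq.h, along these lines:

static inline bool irqd_needs_resend_when_in_progress(struct irq_data *d)
{
	return __irqd_to_state(d) & IRQD_RESEND_WHEN_IN_PROGRESS;
}

static inline void irqd_set_resend_when_in_progress(struct irq_data *d)
{
	__irqd_to_state(d) |= IRQD_RESEND_WHEN_IN_PROGRESS;
}

plus a call to irqd_set_resend_when_in_progress() when the ITS maps the
LPI.)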

>  static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
> index 49e7bc871fec..73546ba8bc43 100644
> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -692,8 +692,11 @@ void handle_fasteoi_irq(struct irq_desc *desc)
> 
>         raw_spin_lock(&desc->lock);
> 
> -       if (!irq_may_run(desc))
> +       if (!irq_may_run(desc)) {
> +               if (irqd_needs_resend_when_in_progress(&desc->irq_data))
> +                       check_irq_resend(desc, true);
>                 goto out;
> +       }

This will run check_irq_resend() on the *newly affined* CPU, while the old
one is still running the original handler. AFAICT what will happen is:
check_irq_resend
  try_retrigger
    irq_chip_retrigger_hierarchy
      its_irq_retrigger
... which will cause the ITS to *immediately* re-trigger the IRQ. The
original CPU can still be running the handler in that case.

If that happens, consider what will happen in check_irq_resend() (sketched
below):
- first IRQ comes in, successfully runs try_retrigger and sets IRQS_REPLAY.
- it is *immediately* retriggered by the ITS, and because the original
handler on the other CPU is still running, comes into check_irq_resend again.
- check_irq_resend now observes that IRQS_REPLAY is set and early-outs.
- No more resends; the IRQ is still lost. :-(
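
To illustrate, here's the shape of check_irq_resend() (kernel/irq/resend.c,
paraphrased from memory; level-type and error handling elided):

int check_irq_resend(struct irq_desc *desc, bool inject)
{
	int err = 0;
	...
	/* The early-out that drops the second retrigger: */
	if (desc->istate & IRQS_REPLAY)
		return -EBUSY;

	if (!(desc->istate & IRQS_PENDING) && !inject)
		return 0;

	desc->istate &= ~IRQS_PENDING;

	if (!try_retrigger(desc))
		err = irq_sw_resend(desc);

	/* Mark as replayed on success; IRQS_REPLAY is only cleared once
	 * the interrupt is actually serviced again. */
	if (!err)
		desc->istate |= IRQS_REPLAY;
	return err;
}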

Now I admit the failure mode is getting a bit pathological: two re-triggers
while the original handler is still running. But I was able to hit this on
my test machine by intentionally slowing the handler down by a few dozen
microseconds. Should we cater for this?

I can see two possibilities:
- tweak check_irq_resend() to not early-out in this case but to keep
re-triggering until it eventually runs.
- move the check_irq_resend() call to only happen later, *after* the
original handler has finished running. This would be very similar to what I
suggested in my original patch, except instead of running a do/while loop,
the code would observe that the pending flag was set again and run
check_irq_resend() (rough sketch below).
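
For the second option, a rough sketch of what I have in mind (untested and
hypothetical; names as in your patch):

void handle_fasteoi_irq(struct irq_desc *desc)
{
	struct irq_chip *chip = desc->irq_data.chip;

	raw_spin_lock(&desc->lock);

	if (!irq_may_run(desc)) {
		/* Just record the edge; no retrigger from here. */
		desc->istate |= IRQS_PENDING;
		goto out;
	}
	...
	handle_irq_event(desc);

	/* A concurrent invoke was turned away while the handler ran;
	 * ask the chip to redeliver now that we're done. */
	if (unlikely((desc->istate & IRQS_PENDING) &&
		     irqd_needs_resend_when_in_progress(&desc->irq_data)))
		check_irq_resend(desc, false);

	cond_unmask_eoi_irq(desc, chip);
	raw_spin_unlock(&desc->lock);
	return;
out:
	...
}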

I'm also wondering what will happen for users who don't have the
chip->irq_retrigger callback set and fall back to the tasklet via
irq_sw_resend()... It looks like that will work fine. However, if we do my
suggestion and move check_irq_resend() to the end of handle_fasteoi_irq()
then the tasklet will be scheduled on the old CPU again, which may be
sub-optimal.
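
(For context on the "old CPU" point: irq_sw_resend(), paraphrased from
kernel/irq/resend.c from memory, ends with roughly:

	/* Set it pending and activate the softirq: */
	set_bit(irq, irqs_resend);
	tasklet_schedule(&resend_tasklet);

and tasklet_schedule() queues the tasklet on the calling CPU. So if we move
check_irq_resend() to the tail of handle_fasteoi_irq(), the sw resend would
run on the CPU we have just moved the interrupt away from.)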

JG
